Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Efficient short-read sequencing on long-read sequencers
(USC Thesis Other)
Efficient short-read sequencing on long-read sequencers
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
EFFICIENT SHORT-READ SEQUENCING ON LONG-READ SEQUENCERS by Rishvanth Kaliappan Prabakar A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS) August 2020 Copyright 2020 Rishvanth Kaliappan Prabakar Individuals are not stable things, they are fleeting. Chromosomes too are shuffled into oblivion, like hands of cards soon after they are dealt. But the cards themselves survive the shuffling. The cards are the genes. The genes are not destroyed by crossing-over, they merely change partners and march on. Of course they march on. That is their business. They are the replicators and we are their survival machines. When we have served our purpose we are cast aside. But genes are denizens of geological time: genes are forever. —Richard Dawkins: The selfish gene ii Acknowledgments I would like to thank Dr. Andrew Smith and Dr. James Hicks for advising, mentoring, and tutoring me; each unique in their approach, but common in their passion to push science forward. SMURF- seq would not have been possible without the help and hundreds of hours of discussion with Dr. Liya Xu. Thank you. I would like to thank my committee members Dr. Mark Chaisson, Dr. Michael Waterman, and Dr. Jorge Nieva for their guidance and advice on this research. I am extremely grateful to all the members of SmithLab, past and present, for their scientific and non-scientific support. I would also like to thank the members of Kuhn-Hicks lab for the numerous useful discussions. iii Contents Acknowledgments iii List of Figures vi Abstract vii 1 Introduction 1 2 Background 5 2.1 Nanopore sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Copy number variation and profiling . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Prior protocols based on concatenating DNA molecules . . . . . . . . . . . . . . . 15 3 Sampling molecules using re-ligated fragments (SMURF)-seq 17 3.1 Naive approaches to read-counting on nanopore machines . . . . . . . . . . . . . . 17 3.2 SMURF-seq approach to read counting . . . . . . . . . . . . . . . . . . . . . . . 19 3.3 Mapping SMURF-seq reads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.4 Generating higher fragment counts with SMURF-seq . . . . . . . . . . . . . . . . 25 3.5 Efficient CNA profiling using SMURF-seq . . . . . . . . . . . . . . . . . . . . . . 27 3.5.1 Accurate CNA profiles using SMURF-seq . . . . . . . . . . . . . . . . . . 27 3.5.2 Concordant profiles from fewer countable fragments . . . . . . . . . . . . 30 3.6 Future of SMURF-seq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4 Identifying fragment boundaries on a SMURF-seq read 37 4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.3 Fragment Identification problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3.1 Approach to the fragment identification problem . . . . . . . . . . . . . . 43 4.4 Identifying fragment boundaries on a SMURF-seq read . . . . . . . . . . . . . . . 45 4.4.1 Fragment boundary identification under exact matching . . . . . . . . . . . 45 4.4.2 Fragment boundary identification allowing mismatches and indels . . . . . 46 4.4.3 Identifying fragment boundaries in practice . . . . . . . . . . . . . . . . . 48 4.5 Alignment score of a SMURF-seq read . . . . . . . . . . . . . . . . . . . . . . . . 49 iv 4.6 Score distribution under a random model . . . . . . . . . . . . . . . . . . . . . . . 52 4.6.1 Score distribution of one fragment . . . . . . . . . . . . . . . . . . . . . . 53 4.6.2 Score distribution for a given fragment set . . . . . . . . . . . . . . . . . . 55 4.7 Estimating the optimal fragment set . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.7.1 Fast computation of p-values . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.8 Limitations and future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5 Conclusions 70 References 72 Appendix A Supplemental methods 89 Appendix B Mapping SMURF-seq reads with long-read aligners 94 Appendix C Data availability and summary of sequencing runs 100 v List of Figures 2.1 Nanopore sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.1 Naive approaches to read-counting on nanopore machines . . . . . . . . . . . . . . 18 3.2 SMURF-seq approach to sequencing short fragments . . . . . . . . . . . . . . . . 19 3.3 Schematic of SMURF-seq protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4 Restriction enzyme digestion and ligation of DNA molecules. . . . . . . . . . . . . 22 3.5 SMURF-seq generates fragments at a faster rate than sequencing short molecules directly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6 Read and fragment lengths from a SMURF-seq sequencing run. . . . . . . . . . . 28 3.7 Accurate copy number profiles with SMURF-seq. . . . . . . . . . . . . . . . . . . 29 3.8 High-resolution CNA profile with SMURF-seq . . . . . . . . . . . . . . . . . . . 31 3.9 Multiple SMURF-seq CNA profiles by multiplexing in a single run . . . . . . . . . 32 3.10 CNA profile with reads obtained in first few minutes of sequencing . . . . . . . . . 33 3.11 SMURF-seq protocol to sequence pre-fragmented DNA molecules . . . . . . . . . 36 4.1 Uniquely mappable fraction of the genome decreases fragment length . . . . . . . 39 4.2 Alignment graph for fragment boundary identification algorithm with a general score function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.3 Alignment score of SMURF-seq read as a function of number of fragments . . . . 51 4.4 Extreme value distribution approximation forscore T (S; 1) . . . . . . . . . . . . . 54 4.5 Empirical score distribution forscore T (S; 1) . . . . . . . . . . . . . . . . . . . . . 55 4.6 Normal approximation forscore T (S;k) with equal fragment lengths . . . . . . . . 56 4.7 Normal approximation forscore T (S;k) with random fragment lengths . . . . . . . 57 4.8 Determining the optimal fragmentation of a SMURF-seq read . . . . . . . . . . . 59 4.9 Fast computation of p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.10 Extreme value approximation forscore T (S; 1) with a general score function . . . . 64 4.11 Normal approximation forscore T (S;k) with a general score function . . . . . . . 65 vi Abstract We present SMURF-seq, a protocol to efficiently sequence short DNA molecules on a long-read sequencer by randomly ligating them to form long molecules. Applying SMURF-seq using the highly portable and inexpensive Oxford Nanopore MinION yields up to 30 fragments per read, providing an average of 6.2 and up to 7.5 million mappable fragments per run, increasing informa- tion throughput for read-counting applications. Somatic copy number alterations play a significant role in cancer, and can be leveraged for diagnostic and personalized approaches to treatment. High-throughput short-read sequencing has been extremely efficient in copy number profiling; however, its applicability depends on the avail- ability of instrument, and time to obtain profiles can vary from a few days to weeks. We apply SMURF-seq on the MinION to generate copy number profiles and demonstrate that multiple sam- ples can be multiplexed in a single sequencing run. A comparison with profiles from Illumina sequencing reveals that SMURF-seq attains similar accuracy. A SMURF-seq read is aligned to the reference genome by splitting it into its constituent frag- ments, each aligning to a distinct location in the genome. We define a score function for aligning a SMURF-seq read and describe an approach to determine the optimal fragmentation of a read. More broadly, SMURF-seq expands the utility of long-read sequencers for efficient short-read sequencing, enabling applications on long-read sequencers that are currently only efficient on high- throughput short-read sequencers. vii Chapter 1 Introduction In the last decade, massively parallel high-throughput short-read sequencing has revolutionized the efficiency and breadth of applications for DNA sequencing (Kircher and Kelso, 2010). These high-throughput sequencing methods produce millions to billions of short reads in a single run, and have led to the development of many applications that depend on “read-counting” to measure the abundance of specific sequences in a sample. Examples include RNA-seq, ChIP-seq, and whole-genome copy number profiling. Recently, long-read technologies have been developed that are filling the gap left by short-read sequencers in applications such as genome assembly (Jain et al., 2018a; Loman et al., 2015), which benefit from connecting more distant sequences within a contiguous molecule. Among these, the MinION instrument, from Oxford Nanopore Technologies, is highly portable and inexpensive and has shown its unique value for analysis outside of central sequencing facilities (Quick et al., 2016). Long-read sequencers such as the MinION typically produce vastly fewer reads from a sequencing run, and are therefore less efficient in applications that use sequenced reads purely as a means to count molecules. However, these technologies have the enormous advantage of operating in near real-time, with a turnaround time that can be measured in hours for some applications, rather than days or weeks. 1 Copy number alteration (CNA) has been used successfully to understand a variety of diseases (Fanciulli et al., 2010) – notably cancers, which exhibit both extreme variation and recurrent trends that can be used for diagnostics and personalized approaches to treatment. For example, the ampli- fication and loss of certain genes, such as RB1 deletion and MYCN amplification in retinoblastoma, can be prognostic or even predictive for treatment (Berry et al., 2017). High-throughput short-read sequencing has been extremely effective in copy number profiling of cancers (Chiang et al., 2009), including profiling single tumor cells (Navin et al., 2011). However, for many potential users, the efficiency of high-throughput short-read sequencing in CNA analysis is determined by the avail- ability of instruments and the need for heavy multiplexing to hit a reasonable cost per profile. A sequencing core is typically involved and an individual profile must wait for a “full” run before it can be processed. The MinION sequencer has an accessible buy-in and is easy to use. Unfor- tunately, the MinION has optimal nucleotide throughput when producing reads that are orders of magnitude longer than needed for CNA profiling. To make full use of the advantages offered by the MinION sequencer, we introduce sampling molecules using re-ligated fragments (SMURF)-seq, a protocol to efficiently sequence short DNA molecules on a long-read sequencer (Prabakar et al., 2019). The strategy of SMURF-seq is to con- catenate short fragments into very long molecules (8 kb) prior to sequencing. After (or possibly concurrent with) sequencing, the SMURF-seq reads are mapped to the reference genome by split- ting them into their constituent fragments, each aligning to a distinct location in the genome. We demonstrate the utility of SMURF-seq with the low-cost MinION sequencer to obtain data simi- lar to that expected from typical short-read sequencing, and generated high-quality copy number profiles from this output. More broadly, SMURF-seq is an approach for efficient short-read sequencing, as required for read-counting, on long-read sequencing machines. Here, we describe the details of the SMURF- seq approach; both the SMURF-seq protocol prior to sequencing and mapping the sequenced SMURF-seq reads. This study is organized as follows: 2 In the second chapter, we review the relevant background. First, we discuss the concept of nanopore sequencing and summarize its history, the MinION sequencing instrument and its util- ity, library construction methods, and sequencing on these machines. Then, we discuss the copy number profiling and its implications in diversity and disease; especially its involvement in can- cer, and the utility of copy number analysis in understanding the biology of cancer and diagnostic evaluation of tumors. We summarize methods and computational approaches for generating copy number profiles. Finally, we discuss prior protocols that are similar in spirit to SMURF-seq, in- cluding SAGE and its variants, SMASH, and ConcatSeq. In the third chapter, we describe the details of the SMURF-seq approach and demonstrate the accuracy of this approach for CNA profiling. We start with a discussion of sequencing long-reads or short-reads directly for read-counting on nanopore machines, and the merits and limitations of these methods. Then, we introduce the SMURF-seq protocol for efficient short-read sequencing on long-read machines; SMURF-seq combines the merits of sequencing long or short reads directly while alleviating the limitations. We demonstrate that SMURF-seq generates higher read-counts from a sequencing run in comparison to these other methods, the copy number profiles generated with SMURF-seq are as accurate as profiles generated using an Illumina platform, multiple sam- ples can be multiplexed and sequencing in the same sequencing run, and the reads generated in the first few minutes of sequencing are sufficient to generate accurate profiles. Finally, we provide future directions for further improving the efficiency and expanding the utility of SMURF-seq. The fourth chapter is dedicated to algorithmic and statistical aspects of mapping SMURF-seq reads. We discuss the challenges associated with mapping SMURF-seq reads as the fragments get shorter. We introduce the fragment identification problem as a way of identifying fragment bound- aries and estimating the optimal number of fragments on a SMURF-seq read. Next, we define a score function for aligning SMURF-seq reads and describe algorithms to find fragment boundaries on a read such that the score is maximized. Then, we determine the null distribution of aligning a SMURF-seq read generated at random to calculate a p-value for a particular fragmentation of a 3 read. We use these p-values to determine the optimal number of fragments on a read. Finally, we suggest future directions for aligning SMURF-seq reads with short fragments to large reference genomes. We conclude this study by highlighting our vision of using the SMURF-seq approach for short- read sequencing on long-read sequencers; we envision that with further optimizations to SMURF- seq, to both the protocol and mapping algorithms, would drive down the cost of sequencing and broaden the applications of long-read sequencers. 4 Chapter 2 Background 2.1 Nanopore sequencing A brief history of nanopore sequencing The concept of nanopore sequencing is based on the idea that as a single-stranded DNA (or RNA) translocates through a nanometer sized pore, a nanopore, in the presence of an electric field, the change in current level measured across the nanopore would be dependent on the nucleotide passing through the nanopore; thus, measuring the current over time could be leveraged to determine the sequence of nucleotides (Fig. 2.1a). This idea of using transmembrane proteins as nanopores for sensing and sequencing nucleic acids was independently thought of by several researches including David Deamer, Hagan Bayley, and George Church (Bayley, 2015; Branton et al., 2009; Deamer et al., 2016). Initial experiments showed that as a single-stranded DNA or RNA molecules could be driven through a Staphylococcus aureus -hemolysin in the presence of an electric field (Kasianowicz et al., 1996). The current through the pore remained constant in the absence of oligomers; and the presence of oligomers caused transient decreases in current, with the duration of the decrease proportional to the length of the oligomer. Further research demonstrated that the decrease in amplitude of current could be used to differentiate between poly-purine and poly-pyrimidine se- 5 quences of RNA (Akeson et al., 1999) and DNA (Meller et al., 2000). It was also observed that the DNA molecules translocate through the nanopore at few microseconds per base (Meller et al., 2000). Although these experiments demonstrated the potential for nanopores to distinguish nucleic acid polymers, several challenges remained to be addressed to use this approach for reading indi- vidual bases on a DNA or RNA molecule. The most important of these were detecting the bases at a single nucleotide resolution and slowing the rate of translocation through the nanopore so that a readout can be obtained (Bayley, 2015; Branton et al., 2009). These challenges were resolved in the forthcoming years. A few notable milestones are described below. In regard to identifying individual nucleobases, all four bases were identified in a single- stranded DNA that had a terminal hairpin structure (Ashkenasy et al., 2005) and single-stranded DNA attached to streptavidin with a biotin linker (Purnell and Schmidt, 2009; Stoddart et al., 2009). Mean Signal (pA) 1 0 2 3 4 Time (seconds) ONT output (squiggles) + – Motor protein Nanopore Current passed through the pore and is modulated as DNA passed through a b Figure 2.1: Nanopore sequencing. (a) Change in current across a nanopore is measured as a DNA molecule translocates through a nanopore. This figure is adapted from Figure 5Ab in Goodwin et al. (2016). (b) Oxford Nanopore Technologies MinION instrument. 6 These structures immobilized the DNA in the nanopore (either a wildtype-hemolysin or an engi- neered form of it) allowing sufficient time for detection. Thus, with the appropriate sequence, each base could be resolved, and further, the location within a nanopore that is most sensitive to detect bases was also determined. The frequency of translocation of DNA molecules through a-hemolysin pore was increased and the voltage threshold for translocation was decreased by engineering the pore to have positively charged groups in the lining of the lumen (Maglia et al., 2008). However, a disadvantage of the -hemolysin pores is that the region that is sensitive to the nucleotides in the pore is too wide, and so the current differences are small between nucleotides, making single nucleotide detection difficult. Pores derived from Mycobacterium smegmatis porin A have a narrower sensitive region, and could detect nucleotides at a higher resolution (Butler et al., 2008; Manrao et al., 2011). In regard to controlling the rate of translocation through a nanopore, there were two approaches, exo-sequencing and strand-sequencing. In the exo-sequencing approach, individual bases from a DNA are cleaved into a nanopore with an exonuclease and identified one at a time (Astier et al., 2006; Clarke et al., 2009). Alternatively, in the strand-sequencing approach, a DNA molecule is threaded through a nanopore at a controlled rate and the bases are identified from the continuous change in current levels. Initial approaches in strand-sequencing used a DNA polymerase (derived from Escherichia coli Kleenow fragment or bacteriophage T7) and recorded the current levels as the DNA translocates through a pore with the incorporation of each nucleotide (Benner et al., 2007; Chu et al., 2010; Cockroft et al., 2008; Gyarfas et al., 2009). Subsequently, it was shown that29 polymerase could be used to control ratcheting in both the forward and the reverse direction (Cherf et al., 2012; Lieberman et al., 2010; Manrao et al., 2012). 29 polymerase was used to sequence reads up to 4.5 kb from theX174 genome using nanopores (Laszlo et al., 2014). Oxford Nanopore Technologies Oxford Nanopore Technologies was founded in 2005 by Ha- gan Bayley and colleagues (Deamer et al., 2016). Oxford Nanopore Technologies announced the 7 MinION instrument at the Advances in Genome Biology and Technology meeting in 2012 and made it available to early access researches in 2014 (Bayley, 2015; Deamer et al., 2016). The MinION sequencer (Fig. 2.1b) is a portable instrument requiring just a modest computer for control and data acquisition. A MinION flowcell consists of 2,048 nanopores (at present, a derivative of Escherichia coli CsgG protein (Brown and Clarke, 2016)) embedded on a membrane, of which 512 pores can sequence molecules in parallel. Whole genomes of several organisms including humans have been sequenced using the Min- ION instrument (Bowden et al., 2019; Jain et al., 2018a; Loman et al., 2015; Moss et al., 2020; Stancu et al., 2017). It has also been used in several other applications such as disease surveil- lance (Faria et al., 2016; Quick et al., 2016), metagenomics (Charalampous et al., 2019; Goordial et al., 2017; Leggett et al., 2020), direct RNA sequencing (Depledge et al., 2019; Garalde et al., 2018; Workman et al., 2019), and detecting methylated bases (Liu et al., 2019; Rand et al., 2017; Simpson et al., 2017), among others (Jain et al., 2016). In addition to the MinION instrument, Oxford Nanopore Technologies currently offers vari- ous nanopore devices including the bench-top GridION and PromethION which allow for parallel sequencing with up to 5 and 48 flowcells, respectively. Nanopore library preparation and sequencing Sequencing nucleic acids requires preprocess- ing the sample for compatibility with the underlying sequencing technology, a process traditionally referred to as library preparation. Sequencing on a nanopore machine usually requires fragment- ing these molecules to the appropriate length and attaching sequencing adapters. Oxford Nanopore Technologies offers several commercially available library preparation kits for both DNA and RNA samples. Of the available kits, most commonly used ones for DNA samples are the ligation se- quencing kit family and the rapid sequencing kit family. In theory, there is no limit on the length of a molecule that can be sequenced with a nanopore, and thus, the length is determined by the downstream application or the limitations of handling 8 high molecular weight DNA. For the ligation sequencing kit (SQK-LSK108 1D DNA by ligation), the recommended length is8 kb when starting with 1 μg of sample to ensure appropriate molar concentration in the subsequent steps. DNA molecules can be fragmented to the appropriate length using a variety of methods including the Covaris g-TUBE. These molecules are then optionally repaired to remove any nicks, and then the DNA ends are prepared to have a dA tail. Finally, sequencing adapters (that have a dT tail) are ligated to the end-prepared DNA. These adapters contain specific DNA sequenced with attached enzymes that regulate the translocation of a DNA molecule into a nanopore. Library preparation with the ligation kit takes approximately 60 minutes. The rapid library preparation kit (SQK-RAD003 Rapid sequencing) offers a faster method, by simultaneously fragmenting and tagging the ends of high molecular weight DNA (recommended > 30 kb). Adapters are then attached to these tags. Library preparation with the rapid kit takes approximately 10 minutes. Both of these kits offer barcoding capabilities for multiplexing several samples in a single sequencing run. For example, the native barcode kit (EXP-NBD103) is used together with the ligation sequencing kit, and adds a barcode sequence to the end-prepared DNA molecules prior to ligating sequencing adapters. After library preparation with a unique barcode for each sample, they are pooled in appropriate molar concentrations before sequencing. Oxford Nanopore Technologies offers several other preparation kits, such as the 1D 2 kit for higher accuracy reads and PCR based kits when starting with nanogram or picogram amounts of DNA. After library construction the sample is ready to be sequenced. The flowcell is loaded on the sequencing machine, primed with the appropriate buffers, and the sample is loaded. After which sequencing can be started and reads are available as they are sequenced in real-time. The sequenc- ing process is controlled by the MinKNOW tool, and can continue for up to 48 hours (at present, on the MinION instrument). The sequenced reads are converted from current level “squiggles” into base-space using base-calling tools such as Guppy (Oxford Nanopore Technologies). 9 Reads generated from a sequencing run are typically several kilobases long. At present, the reads sequenced with the 1D protocol align with approximately 85% identity to the reference genome (Bowden et al., 2019; Carter and Hussain, 2017; Jain et al., 2018a). The accuracy of reads are continually improving with improvements to both the nanopore and the base-calling tools. Nanopore sequencing does not introduce a significant GC bias when the samples are prepared with a protocol that does not use PCR (Carter and Hussain, 2017). 2.2 Copy number variation and profiling Copy number variation Sources of genomic variation in humans include single-nucleotide polymorphisms (SNPs), short insertions and deletions, and variable number tandem repeats. An- other source of variation, copy number variation (CNV), is the change in number of copies of a region of the genome larger than 1 kb with respect to a reference genome (Feuk et al., 2006; Re- don et al., 2006). The number of copies of a region of the genome could increase resulting in a copy number “gain” (also referred to as an amplification or a duplication), or decrease resulting in a copy number “loss” (also referred to as a deletion). Changes in copy number can occur due to mechanisms such as homologous recombination and non-homologous DNA repair mechanisms (Hastings et al., 2009; Stankiewicz and Lupski, 2010; Van Binsbergen, 2011). Copy number variations contribute both to diversity in the human population and to disease. Variations that contribute to diversity are present in the germline, whereas those that contribute to diseases could in the germline or somatic. (Somatic copy number variations, especially when related to cancer, are referred to as somatic copy number alterations or CNA, in short.) Copy number variation in diversity CNVs as a significant source of diversity in humans was established by Sebat et al. (2004) and Iafrate et al. (2004). Sebat et al. (2004) identified 221 copy number differences in 20 individuals with an average of 11 differences between individuals; 10 these variations had an average length of 465 kb (median length of 222 kb). Iafrate et al. (2004) identified 225 regions in 55 individuals with an average of 12.4 differences between individuals; these variations ranged from 150 kb to 425 kb. These studies were extended to larger populations and to populations of different ancestry (Li et al., 2009; Redon et al., 2006). CNVs in the human genome and those that contribute to diversity are reviewed in Freeman et al. (2006), Feuk et al. (2006), and Zarrei et al. (2015). Copy number variation in diseases Changes in copy number of genes or whole chromosomes are implicated in several diseases. The most notable example is the trisomy of chromosome 21 in Down syndrome (Antonarakis et al., 2004). Mechanisms by which copy number variations alter phenotype include over-expression of amplified genes and under-expression of deleted genes (gene dosage effects), unmasking of a mutant recessive allele after deletion of the dominant allele, altered expression of a gene that overlaps a structural variation (such as an inversion, translocation, or dele- tion), or a structural variation could disrupt the regulatory elements of a gene (Feuk et al., 2006). Changes in copy number are associated with autism (Sebat et al., 2007), sporadic schizophrenia (Xu et al., 2008), psoriasis (Hollox et al., 2008), susceptibility to HIV-1/AIDS (Gonzalez et al., 2005), and several other diseases (Fanciulli et al., 2010; Stankiewicz and Lupski, 2010). Copy number alterations in cancer Evolution of a “normal” single cell into a “tumor” cell, a tumor cell into a mass of cells, into sub-clones of cells, and into metastatic tumor cells all involve genomic changes to the cells (Stratton et al., 2009). Of the possible genomic changes, somatic copy number alterations play a significant role (Beroukhim et al., 2010; Zack et al., 2013). Applications of CNA profiling of cancer can be broadly classified into two categories, un- derstanding the biology of cancer, and diagnostic and prognostic evaluation of tumors. These categories are not mutually exclusive and the advance in one drives advances in the other. CNA profiling for understanding the biology of cancer typically involves profiling and ana- 11 lyzing a collection of tumor samples within or across cancer types. CNA profiles exhibit several recurrent alterations that occur in a significant fraction of samples within a population. These recurrent alterations are considered to be “driver” alterations and play an active role in tumor pro- gression; whereas alterations that are unique to few samples are considered to be “passenger” al- terations which are neutral to the evolution of tumor (Beroukhim et al., 2010; Bignell et al., 2010). These frequently altered driver alterations are identified from regions on the genome that are fre- quently gained or lost above a significance threshold in a population of cancer samples (Mermel et al., 2011). Driver alterations play significant roles in tumor progression (Beroukhim et al., 2010; Bignell et al., 2010). Several studies have identified recurrent alterations in different tumor types (Beroukhim et al., 2007; Etemadmoghadam et al., 2009; Lin et al., 2008; Weir et al., 2007). CNA profiling for diagnostic and prognostic applications would involve profiling a tumor from a patient, and making treatment decisions based on prior knowledge of copy number alterations. These could include deciding the effectiveness of treatment and predicting the survival of breast cancer patients (Hicks et al., 2006; Stuart and Sellers, 2009). Amplification of chromosome 19q12 and 20q11.22-q13.12 is associated with poor response to platinum-based chemotherapy and sur- vival in ovarian carcinomas (Etemadmoghadam et al., 2009). Recurrent alterations are associated with gains of oncogenes and loss of tumor suppressor genes. For example, gain of MYC, and loss of PTEN and RB1 play important roles in prostate cancer (Alexander et al., 2018). The expression of several genes in these altered regions are correlated with the gain or loss of the genomic region harboring the gene (Chitale et al., 2009; Lu et al., 2011; Pollack et al., 2002). Few studies have di- rectly linked the alteration in CNA profiles with diagnostic outcomes (Bardelli et al., 2013; Berry et al., 2018; Etemadmoghadam et al., 2009), and several others have shown that the expression level of these genes are associated with tumor progression and response to treatment (Gorre et al., 2001; Shattuck et al., 2008; Villanueva et al., 2013). In recent years, “liquid biopsies” are increasingly used for cancer screening. As opposed to a tumor biopsy (that is derived from the tumor mass), a liquid biopsy is derived from analytes in 12 body fluids. Sources include extra-cellular DNA (called cell-free DNA or cfDNA) extracted from blood serum or plasma (Chan et al., 2013; Leary et al., 2012; Li et al., 2017), or other fluids such as aqueous humor (Berry et al., 2017) and cerebrospinal fluid (Mouliere et al., 2018b), circulating tu- mor cells (Dago et al., 2014), and extra-cellular RNA (Zaporozhchenko et al., 2018). (Use of liquid biopsies in cancer is reviewed in Heitzer et al. (2019), Crowley et al. (2013), and Schwarzenbach et al. (2011).) Among others, an enormous advantage of a liquid biopsy is the minimally invasive procedure used to obtain the analyte, which could be as simple as a blood draw (Heitzer et al., 2019). CNA profiles obtained from a liquid biopsy has been shown to be concordant with profiles obtained from a solid biopsy of the same patient (Berry et al., 2017; Chan et al., 2013). Gain of chromosome 6p in the CNA profile generated from cfDNA samples is predictive of eye enucleation in retinoblastoma (Berry et al., 2018). Moreover, other properties of cfDNA such the fraction of cfDNA molecules from the tumor (tumor fraction) and length distribution of cfDNA molecules are useful for tumor evaluation (Choudhury et al., 2018; Cristiano et al., 2019; Mouliere et al., 2018a; Underhill et al., 2016). For example, cfDNA tumor fraction is associated with the number of bone metastasis in prostate cancer (Choudhury et al., 2018). CNA profiling methods Initial copy number analysis, prior to high-throughput sequencing, pri- marily used microarray based techniques (Carter, 2007), these include the use of comparative genome hybridization (Pinkel et al., 1998), SNP arrays (Nannya et al., 2005), and oligonucleotide arrays (Lucito et al., 2003). With the advent of high-throughput sequencing, array-based approaches are increasingly re- placed with sequencing based approaches. Techniques for detecting copy number alterations with whole-genome sequence data include paired-end read approach (Campbell et al., 2008; Korbel et al., 2007) and read-counting approach (Yoon et al., 2009). The most commonly used, espe- cially for tumor samples, is the read-counting approach, and this approach can also be used with whole-exome sequencing data (D’Aurizio et al., 2016; Krumm et al., 2012). 13 Read-counting approach is based on the principle that the number of reads originating from a region of the genome is directly proportional to the copy number of that region (Baslan et al., 2015). Tools for detecting copy number changes based on this approach typically use variations of the following steps: (1) Align the reads to the reference genome using any standard mapping tool. (2) Partition the genome into non-overlapping “bins” and determine the number of reads mapped to each bin. The size of the bins determine the resolution of a copy number profile; small bins generate high-resolution profiles, but also require more reads. For cancer, the size of the copy number alterations are typically in the megabase range for focal alterations or chromosomal arm lengths for broad alterations (Beroukhim et al., 2010), and thus, it is common to use bins that are larger than 100 kb. (3) Normalize bin counts to remove the effect of any bias that may have been introduced. Bin counts are usually corrected for GC bias and mappability bias. GC bias could be introduced during the PCR step in the library construction step or due to the sequencing process (Aird et al., 2011; Benjamini and Speed, 2012). Mappability bias is introduced as bins have an unequal number of uniquely mappable positions. (4) Finally, bin counts are segmented to reduce noise and determine regions of uniform copy number. Frequently used approaches include circular binary segmentation (Olshen et al., 2004; Venkatraman and Olshen, 2007) and hidden markov models (Ha et al., 2014). Tumors consist of a mix of several cell types including various clones of tumor cells, non-tumor cells in the tissue microenvironment, and immune cells (Witz and Levy-Nissenbaum, 2006). Due to which the tumor fraction of samples vary significantly (Carter et al., 2012; Oesper et al., 2013; Van Loo et al., 2010). Tumor fraction varies across cancer types, with a significant number of lung, esophageal, and breast cancer samples having under 50% tumor fraction (Carter et al., 2012). In addition to detecting copy number alterations with low-coverage sequencing, tumor fraction, subclonality, and ploidy of a tumor can also be determined (Adalsteinsson et al., 2017; Gusnanto et al., 2012). 14 2.3 Prior protocols based on concatenating DNA molecules The concept of ligating short DNA molecules prior to sequencing was introduced in serial anal- ysis of gene expression (SAGE), a technique to detect thousands of expressed sequence tags for transcript analysis (Velculescu et al., 1995). SAGE technique is based on the principle that a short nucleotide sequence tag ( 9bp) from a defined region of a transcript has sufficient information to uniquely identify the transcript, and the concatenation of the short tags allows efficient analysis by sequencing multiple tags in a single clone. The expression levels of a transcripts are then quantified by “counting” the number of tags for each transcript (Velculescu et al., 1995). Briefly, to quantity mRNA abundance using SAGE, double-stranded cDNA molecules are syn- thesized from mRNA using biotinylated oligo(dT) primers. These cDNA molecules are fragmented with a restriction enzyme (anchoring enzyme). After digestion, the 3’ end of cDNA molecules (from the dA tail to the anchoring enzyme recognition site) are isolated with streptavidin beads that bind to the biotinylated oligo(dT) primers. The isolated molecules are divided into two groups, and each group is ligated to a different linker (linker A and B, respectively) using the overhangs after the anchoring enzyme digestion. These linkers are designed to contain a type IIS restriction site (type IIS restriction enzymes cut 20 bp away from a recognition site). These molecules are then digested with a type IIS enzyme (tagging enzyme) releasing the linker and a 9 bp tag. The molecules from both the pools are ligated with one another to form ditags. Ditags have linker A one end, linker B on the other, and two 9 bp tags in the middle. Ditags are then PCR amplified with the primer binding sites located in the linkers. These molecules are then cleaved again with the anchoring enzyme to remove the linkers, and leaving behind ditags with the anchoring en- zyme overhangs. These ditags are then ligated to form longer molecules that contain several ditags punctuated by the anchoring enzyme site. The concatenated molecules are cloned and sequenced. Subsequently, several variants of SAGE were developed for sequencing longer tags (LongSAGE (Hu and Polyak, 2006; Saha et al., 2002), SuperSAGE (Matsumura et al., 2003)), tags obtained 15 from the 5’ end of the transcript (Wei et al., 2004), sequencing on high-throughput instruments (Matsumura et al., 2010), and several others (Peters et al., 1999; Zawada et al., 2014). A variant of SAGE, digital karyotyping, was developed for copy number analysis (Leary et al., 2007; Wang et al., 2002). Digital karyotyping uses a 21 bp genomic DNA tags from specific loca- tions on the genome (as opposed to mRNA tags in SAGE). These tags are long enough to uniquely determine their location on the reference genome, and the counts of tags along the chromosome are used to quantify copy number changes. Since genomic DNA does not have a dA tail (that is required for SAGE), they are first fragmented with a restriction enzyme (mapping enzyme), lig- ated to biotinylated adapters with ends that match the mapping enzyme digested ends, digested with another restriction enzyme (fragmenting enzyme), isolated with streptavidin beads, and the subsequent steps are similar to the SAGE (or LongSAGE) protocol (Leary et al., 2007; Wang et al., 2002). The concept of ligating short molecules together prior to sequencing was also used in short multiply aggregated sequence homologies (SMASH) for CNA profiling using Illumina short-read technology (Wang et al., 2016) and ConcatSeq for target enrichment workflows on PacBio ma- chines (Schlecht et al., 2017). 16 Chapter 3 Sampling molecules using re-ligated fragments (SMURF)-seq 3.1 Naive approaches to read-counting on nanopore machines Copy number profiling, and read-counting in general, can be done on nanopore sequencers with long-reads (8kb) following the standard sequencing procedure. Since nanopore machines are optimized for long-read sequencing, this method has the advantage of using any standard library preparation protocol that are commercially available. Sequencing long molecules using a nanopore keeps a pore occupied for a longer duration once a pore is loaded, followed by an open pore waiting for a molecule to be reload. Further, technical nucleotides, such as sequencing adapters and barcodes, are sequenced one (or twice) every8k bases, and thus, the fraction of time a nanopore spends sequencing technical nucleotides is low. However, read-counting applications do not benefit from longer reads beyond what is necessary for unique mapping to the reference genome. In these applications, for any fixed number of nucleotides sequenced, more information would be obtained if those nucleotides are organized as more DNA molecules, rather than longer contiguous fragments (Fig. 3.1). 17 An alternate approach for read-counting is to sequence short reads (150 bp) directly on a nanopore sequencer. In general, for a given sample of DNA, a nanopore instrument will generate more reads if the corresponding molecules are shorter. Once a molecule is loaded into a pore, the time spent sequencing is less for shorter reads. In addition, for a fixed amount of DNA, shorter molecules result in higher molar concentration when loaded onto the machine, increasing the rate at which each pore captures molecules (Muthukumar, 2010; Wanunu et al., 2008). Therefore, sequencing short reads on a nanopore machine would generate more reads from a sequencing run than sequencing long reads. However, sequencing short reads requires ad-hoc modifications to the library preparation protocol as these are optimized for longer molecules. Sequencing these shorter molecules keeps a pore occupied for a shorter duration once a pore is loaded, followed by waiting for a pore to be reloaded (but the reload time is usually shorter due to the higher molar concentration). Moreover, technical nucleotides are sequenced every150bp, increasing the fraction of time a nanopore sequences technical bases (Fig. 3.1). SMURF-seq approach combines the advantages of both of these methods and alleviates the Method Advantages Disadvantages Pore reload Long-read sequencing (~8k bp) Short-read sequencing (~150 bp) Pore reload Chr 2 Chr 4 Chr 8 Chr 2 Chr 4 Chr 8 Standard lib. prep. Pore reload every ~8k bp Adapter and barcode every ~8k bp Low read count per run Longer than required for read-counting Optimal sequencing speed (nucs./sec) Modified lib. prep. Pore reload every ~150 bp Adapter and barcode every ~150 bp Reduced sequencing speed (nucs./sec) Higher read count per run Optimal read length for read-counting Figure 3.1: Naive approaches to read-counting on nanopore machines. Sequencing long-reads directly is optimized for nanopore machines but not for read-counting applications. Sequencing short-read is optimized for read-counting applications but not for nanopore sequencing. 18 drawbacks by using a nanopore instrument as intended for long-read sequencing, while generating the desired short fragments. Using the SMURF-seq approach, we generate higher read counts per run than sequencing long or short molecules directly. 3.2 SMURF-seq approach to read counting The SMURF-seq protocol involves cleaving genomic DNA into short fragments, with length just sufficient for an acceptable rate of uniquely mapping fragments in the reference genome. These fragmented molecules are then randomly ligated back together to form artificial long DNA molecules, as required for long-read sequencing. The long re-ligated molecules are sequenced following the standard MinION library preparation protocol. After (or possibly concurrent with) Genomic DNA Fragment DNA molecules Ligation of fragmented DNA Optional barcode SMURF-seq long read (~8k bp) Nanopore sequencing Chr 2 Chr 4 Chr 8 Standard library preparation Mapping identifies fragment boundaries Figure 3.2: SMURF-seq approach to sequencing short fragments. SMURF-seq efficiently se- quences short fragments of DNA, as required for read-counting applications, with a reference genome on long-read sequencers. SMURF-seq sequences short DNA molecules by generating long concatenated molecules from these. SMURF-seq reads are aligned by splitting them into multiple fragments, each aligning to a distinct region in the genome. 19 sequencing, the SMURF-seq reads are mapped to the reference genome in a way that simultane- ously splits them into their constituent fragments, each aligning to a distinct location in the genome (Fig. 3.2). More specifically, genomic DNA was fragmented using restriction enzymes and ligated with T4 DNA ligase, with clean-up steps in between. SMURF-seq protocol is completely enzymatic and takes less than 90 minutes to complete (Fig. 3.3). The details of these steps are given below: 1. Restriction enzyme digestion: Restriction enzymes recognize and cleave specific DNA se- Genomic DNA Restriction Enz. U Ligase 15-30m 37c Transfer to spin-column 30m RT XP beads Elute 5m RT Lib. Prep. Figure 3.3: Schematic of SMURF-seq protocol. SMURF-seq consists of four steps: restriction enzyme digestion, spin-column clean-up, re-ligation of fragmented DNA, and Ampure XP beads clean-up. 20 quences, typically producing sticky-ended DNA molecules. The choice of restriction enzyme used is primarily dependent on the size of the fragmented molecules produced. Based on the downstream application, this choice could also be influenced by other factors such as any sequence-specific biases restriction enzymes could introduce. An advantage of using restric- tion enzymes to fragment DNA molecules, over other fragmentation techniques, is that the fragmented molecules have uniform ends (either sticky-ends with the same overhangs or blunt- ends) and are thus compatible for ligation without an end-repair step in between. 2. Clean-up: The reaction containing the restriction enzymes and the fragmented DNA molecules is cleaned to wash out the enzymes and retain the fragmented DNA molecules. The choice of clean-up kit used determines the length of the retained DNA molecules. We used a spin-column based clean-up that typically retains molecules that are over70 bp. However, other clean-up kits, such as bead-based kits, could also be used at this step. 3. Re-ligation: Fragmented DNA molecules with uniform ends are ligated at random with T4 DNA ligase enzymes. The most important factor in a ligation reaction is the concentration of compatible DNA ends (Dugaiczyk et al., 1975). At high concentrations, the chances are higher for ligation between two molecules than a molecule self-ligating. At low concentrations, the chances are higher for self-ligation. Thus, the main consideration during the ligation step is the duration of the ligation reaction, as the molar concentration of DNA molecules decrease with time. Too little time would lead to insufficient ligation, resulting in molecules of length that do not achieve optimal SMURF-seq efficiency. On the other extreme, too much time would result in circular molecules that are incompatible with the most downstream library preparation process. A typical ligation reaction would contain both short and circularized molecules, and achieving a balance between these determines the efficiency of SMURF-seq. Other factors such as the temperature and buffer contents also affect the ligation process. In our experiments, the ligation reaction was performed at a DNA concentration of 25 ng=μl (500 ng of DNA in 10 μl nuclease-free water and 10 μl DNA ligase) for 30 min. 21 4. Bead-based clean-up: The reaction containing the ligase enzymes and ligated DNA molecules is cleaned to retain only the ligated molecules. We used a bead-based clean-up to avoid damage to long DNA molecules that are typical of spin-column based methods. DNA molecules that are resultant of the SMURF-seq protocol are long molecules that are typically several kilobases, and therefore, any standard library preparation kits that are available for nanopore machines can be used with SMURF-seq molecules. These molecules can also be barcoded with one (or two) barcode sequence per molecule. Thus, the SMURF-seq approach overcomes the disadvantages of sequencing long or short DNA molecules directly on a nanopore machine for read-counting applications, and improves its efficiency of read-counting applications. We also tested dsDNA fragmentase enzymes (New England Biolabs) and acoustic shearing (Covaris) to fragment DNA. However, these methods require an additional end-repair step after Length (bp) RE site frequency 0 200 400 600 800 1000 0 500000 1500000 Mean = 150.23 bp 5` NNT T A ANN 3` 3` NNA A T TNN 5` SaqAI SaqAI RE End-repair T4 ligase + + + + + + 10k bp 5k bp 1k bp 500 bp a b Figure 3.4: Restriction enzyme digestion and ligation of DNA molecules. (a) Length distribution between restriction sites computed by measuring the distance between the recognition sites on the human reference genome. SaqAI recognizes the sequence TTAA and leaves a 2 bp overhang. (b) Negative gel image of fragmented and ligated DNA using SaqAI restriction enzyme and T4 DNA ligase. Sticky-end and blunt-end ligation (by end-repair) of fragmented DNA are shown, and both yield ligated molecules of approximately the same length. 22 fragmentation and the ligated molecules failed to reach the lengths we obtained by using restriction fragmentation. In our applications, we used SaqAI restriction enzyme, which recognizes the sequence TTAA and produces molecules with mean lengths of 150.2 bp (Fig. 3.4a). The fragmented DNA molecules are then ligated randomly to form longer molecules using T4 DNA ligase enzyme (Fig. 3.4b). In our experiments, the resulting long DNA molecules were prepared for sequencing using the Ox- ford Nanopore Technologies 1D DNA by ligation kit (SQK-LSK108) or the rapid sequencing kit (SQK-RAD003) following the standard manufacturers protocol. We also multiplexed samples us- ing the 1D native barcoding genomic DNA kit (EXP-NBD103) followed by library preparation using the 1D DNA by ligation kit. The sequencing was performed on a MinION instrument. The detailed SMURF-seq protocol is given in Appendix A. 3.3 Mapping SMURF-seq reads The reads sequenced using SMURF-seq can be mapped to a reference genome by first identifying short matches within the reads, corresponding to parts of the individual fragments, and then ex- tending those to locate fragment boundaries. As currently implemented, SMURF-seq fragments are longer than150bp, and mapping these reads is handled nicely using the seed-and-extend paradigm implemented in many existing long-read mapping tools. Although none of these tools were designed to align SMURF-seq reads, several long-read aligners such as BWA-MEM (Li, 2013), Minimap2 (Li, 2018), and LAST (Kiełbasa et al., 2011) include steps designed for split- read alignment, which can be leveraged for aligning SMURF-seq reads. Aligning SMURF-seq reads with long-read mapping tools typically involve variations of the following steps: Identifying seeds: Mapping tools have a step of identifying seeds, which are short exactly match- ing parts of the read with parts of the reference genome. Choices in how seeds are defined and 23 used are often made for mapping speed. The total size of SMURF-seq data sets is currently (relatively) small, so speed is not our primary concern. We favor the most sensitive seed strat- egy, but depending on implementation too many seed hits could lead to ambiguity later in the mapping process. Chaining seeds: The identified seeds that are close to one another on the read and the refer- ence, and have the same orientations could be merged. These are further extended into proto alignments, and filtered to avoid aligning potentially false-positive seed hits. Aligning within the chains: In this stage a Smith-Waterman alignment is performed, typically allowing users to specify a mismatch penalty along with penalties for both gap-open and gap- extend. Selecting best alignments: When high-scoring alignments overlap within a read, one of them (or both) could be trimmed or one is selected and the other discarded. The choices made here could lead to discarding entire fragments. Mapping tools have several parameter options, in general, these are related to: (1) the seeding and chaining algorithm used by the individual tool. (2) The Smith-Waterman alignment scores, i.e. the match score, and the mismatch and indel penalty. The seeding and chaining parameters control the number of proto alignments that are further refined by aligning parts of the read to the reference genome using the specified alignment scores. The Smith-Waterman alignment score used to align fragments to the reference genome is cru- cial for determining the optimal fragment length. On one extreme, a match score of 1 with a mismatch and indel penalty of 0 will result in one identified fragment covering the entire read and mapping perfectly, but will always map ambiguously. On the other extreme, a match score of 1 with a mismatch and indel penalty of1 will result in any mismatch or indel on the read to be considered as a fragment boundary. Therefore, it is crucial to determine the optimal alignment score for mapping a SMURF-seq read. We evaluated mapping tools on simulated SMURF-seq data generated by concatenating ran- 24 dom fragments from real Oxford Nanopore reads. This emulates idealized SMURF-seq reads. Within the simulated reads, the boundaries of each fragment are known a priori, as are their map- ping locations when in the context of their original long reads. We used this information to evaluate mapping tools in terms of (1) how well they identify fragments purely for the purpose of counting molecules, which is the primary information used in CNA analysis, and (2) how well they identify individual mapping bases within reads. After mapping these reads, we calculated precision and recall for identifying both the correct fragment locations, and the individual mapping bases within the fragments (i.e. the correct fragment boundaries). Using this simulation setup, we determined the optimal Smith-Waterman alignment score for mapping SMURF-seq reads (detailed procedure is given in Appendix B). Based on these results BWA-MEM outperformed other tools, and thus, we used BWA-MEM to align SMURF-seq reads. Briefly, BWA-MEM uses short seed hits originating from different parts of the long reads (and therefore, in our application, different fragments within those long reads), to form clusters of seed hits in the reference genome. Nearby clusters are joined, and then extended, eventually resulting in (for most fragments) one alignment per fragment. In our analysis, we em- ployed BWA-MEM without any modifications to optimize identification of fragment boundaries. According to our simulations, this mode of operation may not perfectly identify fragment bound- aries, but performs well in identifying mapping locations of the individual fragments, which is the information passed to subsequent steps in our analysis. 3.4 Generating higher fragment counts with SMURF-seq A typical Oxford MinION sequencing run generates approximately 500k reads (length8 kb) (Jain et al., 2018a; Tyson et al., 2018) using the standard library preparation and sequencing protocols. Several studies have used these long reads for copy number profiling (Euskirchen et al., 2017; Magi et al., 2019). 25 0 10 20 30 40 0 2M 4M 6M 8M Time (h) Number of fragments Figure 3.5: SMURF-seq generates fragments at a faster rate than sequencing short molecules directly. Number of fragments obtained from reads plotted as a function of sequencing time. For SMURF-seq, the average number of fragments from runs using the 1D sequencing by ligation kits are plotted (error bars indicate one standard deviation). For short-read sequencing run, each read is considered as one fragment. We tested the ability of the Oxford MinION instrument to sequence short DNA molecules by sequencing restriction enzyme (SaqAI) digested normal diploid genome. The sequencing run produced 2.58 million reads with a mean read length of 630.93 bp. Using the same instrument, with SMURF-seq, we report here an average of 6.2 million mapped fragments per run, which is substantially more fragments than directly sequencing long or short reads directly. Further, the SMURF-seq approach generated fragments at a substantially faster rate than sequencing short molecules directly (Fig. 3.5). The most important factor in the performance of SMURF-seq over sequencing short molecules directly is that sequencing concatenated fragments effectively eliminates the pore reload time for all but the first fragment in each read. However, there are a variety of additional factors that favor further optimization of the approach employed by SMURF-seq. First, reduction of resources spent on technical nucleotides: SMURF-seq uses a single barcode and sequencing adapter per read consisting of multiple fragments; sequencing short reads uses one barcode and adapter per 26 fragment, adding approximately 50 bases to each fragment. This increases the time to sequence each short read. In sequencing short reads, as the reads get shorter the time consumed by these technical bases increases. In SMURF-seq, sequencing either shorter fragments in fixed length reads, or longer reads containing fragments of fixed average length, both reduce the time consumed sequencing these technical bases. In the limit, assuming 100bp DNA fragments, sequencing those fragments as short-reads corresponds to 33% technical nucleotides; for SMURF-seq, the portion of technical nucleotides remains low. Second, more nucleotides sequenced at full speed: We observed that the speed of sequencing was lower when sequencing short molecules. For example, the average sequencing speed was 315.54 bases per second for sequencing the diploid genome without SMURF-seq, and 400.29 bases per second when sequencing using SMURF-seq on the MinION sequencer. Third, leveraging optimizations to long-read protocols: The rapidly evolving nanopore library construction kits are continually optimized for long-read sequencing, and would likely require significant ad-hoc modifications to optimize sequencing of short molecules of length optimal for read-counting applications. 3.5 Efficient CNA profiling using SMURF-seq To demonstrate the utility of SMURF-seq, we generated CNA profiles of normal diploid and highly rearranged cancer genomes. The mapped fragments were grouped into variable length “bins” across the genome and bin counts were used to generate CNA profiles as described in Baslan et al. (2012) and Kendall and Krasnitz (2014). 3.5.1 Accurate CNA profiles using SMURF-seq We sequenced a normal diploid female genome with SMURF-seq, resulting in 270.8k reads (mean read length of 6.75 kb) in a single run. These reads were split into 7.28 million fragments (26.87 mean fragments per read; Fig. 3.6). A CNA profile for this normal diploid genome, with the 27 Read length Frequency 0 10k 20k 30k 0 20k 40k Fragment length Frequency 0 200 600 1000 0 2M 4M 0 10k 20k 30k 0 100 200 300 Read length Number of frags a b c Figure 3.6: Read and fragment lengths from a SMURF-seq sequencing run. (a) Sequenced read length distribution. (b) Mapped fragment length distribution. (c) Scatter plot of read length and the number of fragments contained in the read. expected (approximately flat) appearance can be seen in Fig. 3.7a. A replicate of this experiment resulted in 497.9k reads (mean read length of 3.7 kb), which were split into 7.55 million fragments (15.16 mean fragments per read). The rapid sequencing kit form Oxford Nanopore Technologies offers an extremely fast (10 minute) and simple (2 step) protocol for library preparation. We verified that the SMURF-seq procedure behaves similarly using the rapid sequencing kit. The 213.38k sequenced reads had a mean read length of 3.9 kb, and were split into 2.81 million fragments. The copy number profile 28 of the diploid genome was flat, as expected. Next, we applied SMURF-seq to the breast cancer line SK-BR-3, generating 147.0k reads with mean length of 7.62 kb, which were split into 4.52 million fragments (30.78 mean fragments per read). We then obtained a CNA profile using 5,000 bins, corresponding to an average bin size of approximately 600 kb (Fig. 3.7b). To provide a quantification of accuracy in terms of individual CNA events we conducted whole- Ratio to mean 0.02 1 10 Ratio to mean 0.5 1 2 SMURF−seq Illumina WGS 0 5 10 15 0 5 10 15 0.5 2 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19 22 14 16 18 20 21 X Chromosomal position SMURF-seq segment SMURF-seq bin ratio Illumina WGS segment Illumina WGS bin ratio b d c 10 All reads, 5k bins a 0.5 2 1 Ratio to mean All reads, 5k bins TP (1449) FN (17) FP (26) e r = 0.99 Figure 3.7: Accurate copy number profiles with SMURF-seq. (a) CNA profile of a normal diploid genome. Each blue point is a bin ratio to mean and the red line is the segmented bin ratio. (b) Superimposed CNA profiles of SK-BR-3 genome generated using SMURF-seq and Illumina WGS reads. (c) Venn diagram illustrating the accuracy of event calls using SMURF-seq compared with Illumina WGS. (d) Zoom-in of copy number changes on chromosome 8. (e) Scatter plot of bin ratio of SK-BR-3 genome using SMURF-seq and Illumina WGS reads. Pearson correlation of the data is shown. 29 genome sequencing (WGS) on the same SK-BR-3 using Illumina (5.56 million reads; 130 bp, single-end). We used this to define a ground truth by calling CNA events for each of the pre- defined bins (both amplifications and deletions) based on the segmented signal with a cutoff of 1.25/0.8 (Fig. 3.7b) (Berry et al., 2017; Dago et al., 2014). This resulted in 1,466 events (886 amplifications, 580 deletions) from 4,953 bins. We then called events using the identical procedure with SMURF-seq data from the same SK-BR-3 sample. The precision and recall for SMURF-seq relative to the Illumina calls were 0.982 and 0.988, respectively (Fig. 3.7c). Fig. 3.7d shows a zoom-in of a region with extreme copy number alterations. The bin ratios for the Illumina WGS and the SMURF-seq profiles are highly correlated (Pearson r = 0.99; Fig. 3.7e). A replicate of this experiment resulted in 132.64k reads (mean read length of 7.3 kb), which were split into 4.02 million fragments (30.31 mean fragments per read). We also generated higher-resolution CNA profiles with 20,000 and 50,000 bins, corresponding to an average of approximately 150 kb and 60 kb in length, respectively (Fig. 3.8a, b). The profiles obtained at these resolutions have a high correlation with the profiles obtained using Illumina WGS (Pearsonr> 0.97; Fig. 3.8c, d). 3.5.2 Concordant profiles from fewer countable fragments Several cancer-related studies have employed CNA profiling based on low-coverage WGS (Kader et al., 2016; Macintyre et al., 2018). It has previously been demonstrated that 250k reads are suf- ficient for accurate genome-wide CNA profiling of single cells (Baslan et al., 2015). At the same time, the CNA profiles from a population of cells has been shown to have a high correlation with single-cell profiles (Baslan et al., 2015; Navin et al., 2011). We reasoned that using 250k frag- ments for CNA profiling using a population of cells would give useful profiles if they remained sufficiently accurate. By down-sampling our SMURF-seq data, we verified that 10k reads, ap- proximately 250k fragments, result in highly-correlated CNA profiles (Pearsonr = 0.98; Fig. 3.9a, b). 30 0.5 10 0.02 0.5 10 1 2 0.02 1 2 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19 22 14 16 18 20 21 X Ratio to mean Ratio to mean SMURF-seq segment SMURF-seq bin ratio Illumina WGS segment Illumina WGS bin ratio 50k bins 20k bins Chromosomal position a b 0 5 10 15 0 5 10 15 SMURF−seq Illumina WGS 0 5 10 0 5 10 Illumina WGS 15 15 c d 20k bins 50k bins SMURF−seq r = 0.99 r = 0.97 Figure 3.8: High-resolution CNA profile generated using SMURF-seq is highly concordant with the profile generated with Illumina WGS. (a, b) Superimposed CNA profiles of SK-BR-3 genome generated using SMURF-seq and Illumina WGS at 20,000 and 50,000 bin resolutions. (c, d) Scatter plot of bin ratios of SK-BR-3 genome using SMURF-seq and Illumina WGS reads at 20,000 and 50,000 bin resolutions. Pearson correlation of the data is shown. Given the total capacity of the MinION instrument, this indicates that multiple samples can effectively be barcoded and multiplexed in a single sequencing run. To verify this, we sequenced two DNA samples (normal diploid female and SK-BR-3) in a single run. These samples were pro- 31 0 5 10 15 0 5 10 15 SMURF-seq (All reads) SMURF-seq (10k reads) Down-sampled 10k reads, 5k bins a b Chromosomal position 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19 22 14 16 18 20 21 X Segment Bin ratio 0.1 1 10 0.5 2 0.5 2 0.1 1 10 0 5 10 15 0 5 10 15 All barcode1 reads, 5k bins All barcode2 reads, 5k bins 1 0.5 2 SMURF-seq (All barcode2 reads) Illumina WGS d e c Ratio to mean Ratio to mean Ratio to mean r = 0.98 r = 0.99 Figure 3.9: Multiple SMURF-seq CNA profiles by multiplexing in a single run. (a) CNA profile of SK-BR-3 genome with down-sampled 10k SMURF-seq reads. (b) Scatter plot of normalized bin counts of the original SMURF-seq data and data down-sampled to 10k SMURF-seq reads. Pearson correlation of the data is shown. (c) CNA profile of barcode01 (Normal diploid genome) reads. (d) CNA profile of barcode02 (SK-BR-3 cancer genome) reads. (e) Scatter plot of bin ratios of SK-BR-3 genome using multiplexed SMURF-seq and Illumina WGS reads. cessed with SMURF-seq protocol and then barcoded following the standard library construction. After demultiplexing and mapping the reads, the diploid genome had a CNA profile as expected (Fig. 3.9c) and the SK-BR-3 CNA profile was nearly identical to the profile obtained using Illumina WGS (Pearson r = 0.99; Fig. 3.9d, e). All the sequencing runs and the availability of sequence data are summarized in Appendix C. Further, we verified that the CNA profile with reads generated in the first 45, 90, and 180 minutes of starting a sequencing run had a high correlation to the profile with reads from the 32 0.1 0.5 10 0.1 0.5 10 0.1 0.5 10 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 SMURF−seq (All reads) SMURF-seq (180 min) 1 2 1 2 1 2 SMURF-seq (90 min) SMURF-seq (45 min) Segment Bin ratio 45 min (13.6k reads), 5k bins 90 min (27.4k reads), 5k bins 180 min (48.3k reads), 5k bins Ratio to mean Ratio to mean Ratio to mean a Chromosomal position 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19 22 14 16 18 20 21 X b c d e f r = 0.98 r = 0.99 r = 0.99 Figure 3.10: CNA profile with reads obtained in first few minutes of sequencing. (a, c, e) CNA profile with reads obtained in the first 45, 90, and 180 minutes of sequencing. (b, d, f) Scatter plot of bin ratios of the original SMURF-seq data and data obtained in first 45, 90, and 180 minutes of sequencing. Pearson correlation of the data is shown. complete run (Pearsonr> 0.98; Fig. 3.10). In summary, our results demonstrate that SMURF-seq can generate more information for CNA analysis in a single run of the Oxford MinION sequencer, compared with either sequencing long reads in the usual way or direct short-read sequencing on the same instrument. This increased information is in the form of increased numbers of distinct DNA fragments sequenced, and can be leveraged in multiple ways. Applying SMURF-seq on a single sample for a full run corresponds to 33 higher counts for downstream analysis. In CNA analysis, increased counts either add confidence for a fixed resolution, or can allow higher resolution analysis (i.e. smaller bins) at the same level of confidence. Alternatively, the increased information throughput can effectively reduce the time required to produce the same number of counts for CNA analysis by terminating the sequencing earlier. Finally, the increased information yield can be directed towards reducing the cost of gener- ating CNA profiles by allowing a greater degree of multiplexing. For CNA analysis at resolutions permitted by 250k mapped fragments, our results show SMURF-seq allows roughly 20 and up to 30 samples in a single run, compared with 10 per run directly using short-read sequencing. 3.6 Future of SMURF-seq SMURF-seq with shorter fragments For CNA profiling with SMURF-seq, we used restriction enzyme SaqAI which fragments the human genome generating molecules with a mean length of 150 bp. For CNA profiling and other read-counting applications, the length of the fragmented molecules needs to be just long enough to ensure unique mappability to a sufficient fraction of the genome. Thus, reducing the fragment lengths could increase the number of fragments in a SMURF-seq read, and would increase the number of read-counts obtained per sequencing run. The length of the fragmented DNA molecules can be reduced by digestion with a combination of restriction enzymes, as these would recognize more sites on the genome. Depending on the choice of enzymes used, the fragmented molecules could have blunts-ends, sticky-ends that are not compatible with one another for ligation, or a combination of blunt and sticky-ends. These fragmented DNA could be end-repaired to have blunt ends on all molecules prior to ligation. How- ever, this step adds an end-repair and a clean-up step (after end-repair) to the SMURF-seq protocol. Another option is to directly ligate the fragmented molecules even though they do not have com- 34 patible ends. As an example, after digestion with two restriction enzymes that generate different ends, the fragmented molecules can be of three types: The fragmented molecule could have both ends from the first enzyme, both ends from the second enzyme, or one end from each of the en- zymes. Ligating these molecules directly (i.e. without end-repair) would concatenate the two types of molecules that have the same ends within each type; the type of molecules that have two dif- ferent ends would act as bridges between the molecules that have the same ends. Although, this protocol would have fewer steps, the efficiency of ligation could be reduced and would need to be evaluated. The use of restriction enzymes or a combination of enzymes for fragmentation could lead to a bias in read counts for the downstream application. For example, in terms of copy number profiling, the number of restriction sites in a bin could vary, causing variation in bin counts that does not reflect the sequenced sample. The effect of the biases needs to assessed based on the enzymes used and the downstream application. Sequencing fragmented molecules with SMURF-seq Some read-counting applications, such as Ribo-seq (Ingolia et al., 2009), involve extensive steps to produce the molecules of interest. The resultant DNA molecules from these protocols could already be fragmented to length optimal for SMURF-seq. Other applications such are targeted- capture methods and whole-exome sequencing protocols typically produce molecules that are 150-200 bp. Further, cell-free DNA molecules obtained from blood samples are of length150 bp, and are increasingly used in low-coverage copy number profiling (Adalsteinsson et al., 2017; Underhill et al., 2016). Sequencing these molecules using SMURF-seq would require that these molecules have com- patible ends for ligation, and the quantity of DNA is sufficient for nanopore sequencing. To meet these requirements, DNA molecules from these protocols could be PCR amplified by first dA- tailing and ligating adapters to bind the PCR primers. If these adapters are designed to have an 35 Fragmented DNA molecules dA tailing A A A A A A A A A A Ligation of fragmented DNA Ligate adapters with restriction sites PCR amplification RE digestion Standard library preparation Figure 3.11: SMURF-seq protocol to sequence DNA molecules that are already fragmented. The light green bands on the adapters represent restriction sites. These molecules are amplified with an adapter containing a restriction site, followed by restriction digestion, and re-ligation for sequencing on long-read machines. appropriate restriction site in them, after amplification they could be removed with restriction di- gestion, leaving behind DNA molecules with sticky-ends as required for SMURF-seq (Fig 3.11). Designing PCR adapters with restriction sites also provides an opportunity to insert an addi- tional sequence in an adapter prior to restriction site. This sequence could be modified depending on the downstream application. For example, it could be a barcode sequence or a unique molecular identifier sequence. However, SMURF-seq is still limited by the number of fragments produced from a run on the sequencing instrument. We believe that the evolution of SMURF-seq and the improvements to the underlying sequencing technology would expand the utility of SMURF-seq, enabling appli- cations on nanopore instruments that are currently only possible with short-read high-throughput machines. 36 Chapter 4 Identifying fragment boundaries on a SMURF-seq read 4.1 Motivation New sequencing methods motivate development of new algorithms for mapping and analysis of se- quences generated using these methods. A few significant developments include BLAST (Altschul et al., 1990) and FASTA (Pearson and Lipman, 1988) motivated by database searches with the advent of Sanger sequencing, BWA (Li and Durbin, 2009) and Bowtie (Langmead et al., 2009) inspired by high-throughput short-read sequencing, and BLASR (Chaisson and Tesler, 2012) by single-molecule long-read sequencing. SMURF-seq has enabled efficient short-read sequencing for read-counting applications on portable long-read machines. However, efficient methods tai- lored for mapping SMURF-seq reads are still lacking; especially as SMURF-seq protocol evolves and the fragments become shorter, and thus, making the mapping process challenging in terms of identifying accurate fragment locations and boundaries. As currently implemented, SMURF-seq protocol uses a single restriction enzyme (SaqAI) to fragment DNA molecules to150 bp. However, depending on the downstream application, the 37 fragment lengths need to be just long enough to ensure unique mappability to a sufficient fraction of the genome. Fragments could be made shorter using methods discussed in section 3.6. As an example, for copy-number profiling (at low resolutions, as used for tumor samples) the fragment lengths could be as short as 40 bp. We used BWA-MEM (Li, 2013) to align SMURF-seq reads generated with the current protocol, which consists of fragments that are typically over 100 bp. Though not designed to align SMURF- seq reads, BWA-MEM is designed for split-read alignment, and it works sufficiently well at these fragment lengths. SMURF-seq reads can also be aligned with other mapping tools capable of split- read alignment such as Minimap2 (Li, 2018) and LAST (Kiełbasa et al., 2011). However, all of these tools are either designed for aligning short reads with low sequencing error or long reads with high sequencing error. Aligning SMURF-seq reads, especially as the fragments get shorter, would differ significantly from these tools in the following aspects: (1) It would require a seeding approach designed for short fragments sequenced with a high error-rate, and (2) an approach to estimate the number of fragments on a SMURF-seq read and determine the optimal fragment boundaries. The initial step of a typical alignment tool, called seeding, is to find candidate locations of a read on the reference genome, and limit the downstream steps to these locations. This is usually accomplished using a hash table based data structure to find exact matches (Altschul et al., 1990, 1997; Kent, 2002) or non-contiguous exact matches (Chen et al., 2009; Ma et al., 2002), or by using a suffix tree based data structure (Kurtz et al., 2004; Langmead et al., 2009; Li, 2013; Li and Durbin, 2009, 2010). The choice of data structure and parameters such as the length of the match, number of matches, etc. are determined by factors such as the read length and the error profile of the underlying sequencing technology. For example, the optimal seeds parameters for expressed sequence tags and whole-genome sequencing (prior to short-read high-throughput sequencing) was determined in BLAT (Kent, 2002), similarly for low-error rate short-reads in RazerS (Weese et al., 2009), and high-error rate long-reads in BLASR (Chaisson and Tesler, 2012). The fragments 38 20 40 60 80 100 140 75 80 85 90 Fragment length (bp) Unique percent 120 160 Figure 4.1: The fraction of genome that is uniquely mappable decreases with fragment length. could be as short as 40 bp in a SMURF-seq read, and aligning these reads requires identifying can- didate locations for these fragments in the presence of the characteristic error profile of nanopore machines. The other aspect of aligning a SMURF-seq read to split the read into its constituent fragments. As the fragments become shorter it becomes crucial and challenging to determine the number of fragments and fragment boundaries on a read. As the fragments become shorter, the fraction of the reference genome that is uniquely mappable would decrease. For the human reference genome (hg19), when mismatches and indels are not allowed, the fraction of the genome that is uniquely mappable reduces by 0.06% when going from 150 to 145 bp, whereas it reduces by 2.02% when going from 40bp to 35bp (Fig. 4.1). The unique fraction of the genome decreases further when mismatches are allowed (Derrien et al., 2012). Due to shorter fragments, the probability of a frag- ment that originated from a unique location on the genome to misalign to an ambiguous location (or vice versa) due to having incorrect fragment boundaries increases. Thus, as the fragments be- 39 come shorter, the odds of inaccurate fragment boundaries leading to misalignment of fragments or missing fragments entirely increases. Although both seeding and determining the accurate fragmentation is crucial to align a SMURF- seq read, in this study we focus only on the second. To this end, we define the fragment identifica- tion problem for identifying the number of fragments and the fragment boundaries on a SMURF- seq read. We approach the fragment identification problem by defining a score function for aligning a SMURF-seq read, provide algorithms to align a read maximizing this score function, study the null score distribution of aligning reads and reference generated at random, and estimate the num- ber of fragments in a SMURF-seq read by comparing its alignment score with the null distribution. Finally, we discuss the challenges and future directions to utilize this method for aligning real SMURF-seq reads to large reference genomes. 4.2 Background In the early days of DNA sequencing, as the number of nucleotides sequenced grew, comparison of DNA sequences became an indispensable tool to a biologist. DNA sequence comparison can be broadly classified into global alignment (Needleman and Wunsch, 1970) and local alignment (Smith et al., 1981). A global alignment seeks an optimal alignment between two sequences such that each base of one sequence is aligned to each base of the other sequences. On the other hand, a local alignment seeks an optimal alignment between any subsequences of the sequences being compared. Comparison of two sequences, even unrelated or random sequences, always produces an op- timal alignment. This motivated the development of approaches to differentiate a “meaningful” alignment from an alignment of unrelated sequences. These methods determine the significance of an alignment between two sequences by comparing the alignment score with a null distribution of alignment scores of unrelated sequences. Determining the appropriate null distribution was the 40 subject of an enormous amount of research, some of which are summarized below. In the context of local alignment, at the time of the initial studies on the score distribution of unrelated sequences, mathematical tools to understand the null distributions were still lacking, and these studies relied on empirical distributions generated from aligning unrelated sequences. In Smith et al. (1985), it is shown that the similarity score is proportional to the logarithm of the length of the sequences being compared, and the standard deviation is independent of the sequence length. The significance of an alignment was determined from the number of standard deviations over mean of the alignment score. These studies (Lipman et al., 1984) also highlighted that statistical properties (Smith et al., 1983), such as nucleotide frequencies or codon usage, of the sequences affect the distribution of the alignment scores. Generating a null distribution from an incorrect model could lead to an alignment of unrelated sequences being dubiously declared significant. Several methods are available to generate random sequences preserving these statistical properties (Altschul and Erickson, 1985; Fitch, 1983). Erdos and Renyi presented results for the length of the longest headrun in a the firstn tosses of a biased coin (Erd¨ os and R´ ev´ esz, 1975). The length of the longest headrun in coin tosses is equiva- lent to the number of matches between two DNA sequences when shifts in the starting and ending positions of the sequences are not allowed, with the probability of head equal to the probability of match between letters of the DNA alphabet. In Arratia and Waterman (1985), this is generalized to matches between DNA sequences, while allowing shifts. These results indicate that allowing shifts doubles the length of the longest headrun. Results for the longest headrun allowing for up to k mismatches and sequences generated from a Markov chain are also considered. In Arratia et al. (1986) and Gordon et al. (1986) the distribution of the longest matches is shown to a have an extreme value distribution with mean that is proportional to the logarithm of the sequences lengths and variance independent of sequence length. Here, when considering only matches, the asymp- totic extreme value distribution is shown by considering a maximum of geometric distributions, and when mismatches are allowed, it is shown by considering a maximum of negative binomial 41 distributions. An alternate approach is a Poisson approximation for the distribution of the longest match (Arratia et al., 1989). A crucial aspect of in aligning nucleic acid and protein sequences is using an appropriate score function. For example, PAM (Dayhoff et al., 1978) and BLOSUM (Henikoff and Henikoff, 1992) matrices are commonly used for protein sequences. The score function used alters the score of the aligned sequences and thus the alignment score distribution. However, the approach based on the length of the headruns does not consider the score function used for an alignment. In Karlin and Altschul (1990) and Karlin et al. (1990), it is shown that the maximal score of aligning unrelated sequences using a general score function (that has at least one positive score and the expected score is negative) takes the form of an extreme value distribution, and explicit formulas for its parameters are provided. Further, the number of high-scoring alignments with score exceeding a given value is closely approximated by a Poisson distribution. Thus, enabling the calculation of the probability that an alignment of random sequences having a score greater than any given value (usually the score of aligning two sequences of interest). These analytical formulas enable the calculation of parameters for ungapped alignment. How- ever, these formulas do not hold for gapped alignments. Empirical studies showed that similar theory holds for gapped alignments as well (Altschul and Gish, 1996; Pearson, 1998; Smith et al., 1985). For gapped alignments, the parameters for the distribution are estimated empirically from the comparison of random sequences (Altschul and Gish, 1996; Waterman and Vingron, 1994a,b). These approaches are employed in database search tools such as BLAST (Altschul et al., 1990, 1997) and FASTA (Pearson, 1995; Pearson and Lipman, 1988) to determine the significance of matches of a query to a database of sequences. Multiple sequence alignments of families of similar proteins can be summarized in position specific scoring table called a profile (Gribskov et al., 1987). It has been shown that the maximum alignment score of aligning a profile to a genome generated at random is well approximated by an extreme value distribution (Goldstein and Waterman, 1994). 42 4.3 Fragment Identification problem A SMURF-seq molecule consists of short fragments of DNA from distinct regions of the genome concatenated into a longer molecule. After sequencing, aligning these reads requires splitting them into their constituent fragments, each aligning to a distinct region on the genome. Here, we define the fragment identification problem for identifying the number of fragments and the fragment boundaries on a SMURF-seq read. Let be an alphabet. A string X is a sequence of letters a 0 a 1 :::a n1 , where a i 2 ;jXj denotes the length of the stringX; andX[i:::j] =a i :::a j1 is a substring ofX. The reference string T is generated from the DNA alphabet =fA;T;G;Cg, withjTj = n. A SMURF-seq read S is generated by concatenating substrings (called fragments) of T , with no information available a priori about the number, length, orientation (forward or reverse- complement), and the position onT of these fragments. Further,S contains sequencing errors with a rate. LetjSj =m andmn. A fragment setP is an set of start locations of fragments onS.Pf0:::m1g andjPj =k, with the rule that 0 is inP always. By convention, we consider the setP to be ordered such that if i < j thenP i < P j . For a fragment setP , P k i=1 P i+1 P i = m and we say that thei th fragment ofS is the substringS[P i :::P i+1 ], withP k+1 =m. For a givenT andS, the fragment identification problem is to determine the elements of the fragment setP such that it corresponds to the start locations of fragments contained inS. 4.3.1 Approach to the fragment identification problem We approach the fragment identification problem by defining a score function as follows: for a given fragment setP , we define the score of aligningS toT as: score T (S;P ) = k X i=1 maxfs(T [u:::v];S[P i :::P i+1 ]) : 0u<vng; 43 wheres is the score of aligning a two strings under a general scoring scheme. This allows us to consider the fragment identification problem as two inter-related problems: (1) Determiningk, the size of the fragment set, and (2) givenk, determining the elements ofP such thatscore T (S;P ) is maximized. By the score function defined above, determining the elements of the fragment setP requires the knowledge of the number of fragmentsk, and this is not known a priori. Further, thek that maximizes the score function would almost never correspond to the optimal fragment set. As an example, takingk =m 1 which corresponds to taking each base as a fragment would maximize the score, however, this alignment would be non-informative. We propose to estimate the optimal number of fragments by aligning a SMURF-seq read to the reference genome with different values ofk. For each of these fragmentations, we determine the p-value by comparing the alignment score with the null distribution generated from aligning reads generated at random to a reference genome generated at random. Finally, we choose the fragmentation with lowest p-value as the optimal fragmentation. The following sections describe each of these steps in detail. The fragment identification problem differs from the alignment problems described in section 4.2 in a crucial manner. For the fragment identification problem we have the reference genome, and it is assumed that the reads always originate from this genome; the score distribution of sequences generated at random is used to determine the optimal number of fragments on a SMURF-seq read. Whereas in the context of local alignment the score distribution of aligning random reads are used to determine a “meaningful” alignment by comparing the alignment score of sequences with the null distribution. 44 4.4 Identifying fragment boundaries on a SMURF-seq read A SMURF-seq read of lengthm can be fragmented intok = 1 tok =m 1 fragments. For each of these fragmentations, the k 1 fragment boundaries on the read can be at any of the m1 k1 positions. Aligning a SMURF-seq read requires identifying both the number of fragmentsk and the fragment boundaries such that the alignment score is maximized. In this section we present algorithms for identifying the optimal fragment boundaries for any given value ofk. First, an algorithm that does not allow for mismatches or indels is described, and then an algorithm for generalized score functions. Limitations of these algorithms and how the optimal fragment boundaries can be identified on a SMURF-seq read in practice are then presented. 4.4.1 Fragment boundary identification under exact matching We first examine the fragment boundary identification problem assuming the score function re- quires exact matching s(a;b) = 8 > > < > > : 1 ifa =b 1 otherwise. The fragment identification problem then becomes an exact matching problem where the goal is to minimize the number of fragments such thatscore T (S;P ) is maximized. A simple linear time solution to this problem can be obtained as follows. First, we assume some data structure forT has been constructed in linear time and allows for longest prefix matches to be computed in time proportional to the length of the query string. The data structure could be a suffix tree (McCreight, 1976), or a more space efficient and a modern structure like an FM- index (Ferragina and Manzini, 2000). The principle of the algorithm can be seen by starting at the beginning ofS, and identifying the longest prefix match ofS inT . Then retainj as the first position of where this longest prefix matches in T , and denote the first mismatching position on S as i. Repeat the procedure solving the subproblem of fragment identification forS[i:::m]. Repeating 45 these steps, the algorithm iteratively solves the longest prefix match problems, retaining as P i+1 the position of mismatch that terminates matching during iterationi. The following pseudocode describes the procedure: Algorithm 1 ExactFragmentMatching(T;S): 1: i 0 2: whilei<m do 3: P P[fig 4: i LongestMatchLength(S;i;T ) 5: return P Proof: Consider an optimal solution to this problem, where the identified fragment set P opt has minimal size. To prove the optimality of our algorithm we need to show that it finds the same number of fragments as the optimal solution, i.e.jPj =jP opt j. The first iteration of the greedy algorithm will find the longest prefix match. If the optimal solution has its first fragment ending beforeP 1 , i.e. P opt1 < P 1 . Then the longest match starting atP opt1 will end at or beforeP 2 , the end of the second fragment found by the greedy algorithm. If it ends atP 2 then the greedy algorithm has the same number of fragments as the optimal solution so far. And it cannot end beforeP 2 , because then the optimal solution will have more fragments than found by the greedy algorithm. Moreover, we cannot haveP opt1 >P 1 as this would imply a longer prefix than found by the longest prefix match exists. With this reasoning we can say that this greedy approach will find just as little fragments as the optimal solution. 4.4.2 Fragment boundary identification allowing mismatches and indels Now we examine fragment boundary identification for a generalized score function allowing mis- matches and indels. A dynamic programming approach is presented. This algorithm is based on and is inspired by the local alignment algorithm (Smith et al., 1981) and the local alignment with inversions problem (Sch¨ oniger and Waterman, 1992). 46 LetM denote a table withm+1 rows,n+1 columns andk+1 dimensions, wherek is the max- imum number of fragments (1 k m). max 0jn M(i;j;l) represents the best fragmentation ofS[1:::i] withl fragments. The entries ofM are computed as follows: Algorithm 2 FragBoundaryIdentification(T;S;k) 1: M(i;j; 0) 1 for all 0im; 0jn 2: M(0;j; 1) 0 for all 0jn 3: M(i; 0; 1) M(i 1; 0; 1) +s(S[i]; ) for all 0im 4: M(l 1;j;l) 1 for all 2lk; 0jn 5: M(i; 0;l) 1 for all 2lk;lim 6: forl 1 tok do 7: fori l tom do 8: forj 1 ton do 9: M(i;j;l) max 8 > > > < > > > : M(i 1;j 1;l) +s(S[i];T [j]) M(i 1;j;l) +s(S[i]; ) M(i;j 1;l) +s( ;T [j]) max 0hn M(i 1;h;l 1) +s(S[i];T [j]): Time and space complexity: Each entry ofM is computed in constant time by storing the value of max 0jn M(i 1;j;l 1) for every row of M in a separate array. The algorithm runs in O(knm) time and uses O(knm +km) space, where the additional O(km) is used to store the values of max 0jn M(i 1;j;l 1). The optimal alignment and fragment boundaries are determined from the usual traceback pro- cedure starting from max 1jn M(m;j;k) and ending in M 1jn (1;j; 1), with the exception of storing if a new fragment maximized the score at a cell. Intuitively, this algorithm is similar to the local alignment algorithm but instead of picking an empty alignment when the score of an extension is negative, this algorithm starts a new fragment when the score of extending a fragment is less than score of starting a new fragment. In terms of an alignment graph, each node has a zero-weight incoming edge from the node corresponding to max 0jn M(i1;j;l1), in addition to the weighted match/mismatch and indel edges (Fig. 4.2). Although this algorithm provides the exact solution to determine optimal fragment boundaries 47 k = 1 k = 2 k = 3 Weighted edge Zero weight edge Max. score in row Start End Figure 4.2: Alignment graph for fragment boundary identification algorithm with a general score function. The direction of arrows are omitted for clarity. The horizontal edges are directed from left to right and all other edges are directed from top to bottom. for eachk, it has several limitations for use with real SMURF-seq reads in practice: (1) This algo- rithm always starts atk = 1, and for eachk builds an alignment based on the best fragmentation ofk 1. However, in practice the approximate number of fragments and the fragment boundaries are expected to be known after the initial seeding step, and starting atk = 1 is inefficient. (2) The time and space complexity is much too large to be used with any reference genome. Further, after initial seeding, the approximate locations of fragments on the genome would also be known, which is not taken into consideration here to reduce the time or space. (3) This algorithm does not allow for a separate gap open cost for affine gap alignments. (4) Not all bases of a real SMURF-seq read can be expected to align to a reference genome, or fragments on a read could align equally well to multiple locations. 4.4.3 Identifying fragment boundaries in practice The initial step in any traditional mapping algorithm is to determine the approximate mapping locations of the read on the reference genome using a seeding approach. Some algorithms further generate proto alignments using the initial seeds. For example, BWA-MEM joins seeds that are 48 close on the read and reference coordinates and have the same orientation into a “chain” (Li, 2013). Typically, the final step is to perform a Smith-Waterman alignment between the read and a small region on the reference genome. Any algorithm to aligning SMURF-seq reads can be expected to have similar steps. After seeding a SMURF-seq read, seeds from different fragments on a read would cluster at different locations on the genome; likely several different locations for each fragment due to repeats or seeding parameters (such as seed length). These seed hits could be further processed to generate preliminary alignments between parts of the read and the reference genome. Thus, after these steps it is reasonable to assume that the approximate number of fragments and the boundaries of these fragments are known. The final step of the algorithm would be finalize the number of fragments and fragment bound- aries on a read. For a given number of fragments, the fragment boundaries can be easily determined such that the alignment score is maximized (this problem is equivalent to finding a longest path in a single-source directed acyclic graph). For finding the number of fragments on a SMURF-seq read, for each fragmentation, we use the alignment score distribution of random sequences with the same fragment set to determine the p-value for the fragmentation using the procedure described in section 4.6. 4.5 Alignment score of a SMURF-seq read The alignment score of a SMURF-seq read is defined as the sum of the alignment score of each of individual fragment for a particular fragmentation with k fragments, and algorithms given in section 4.4 determine the best fragmentation so that the alignment score is maximized. The alignment score of a SMURF-seq read is a non-decreasing function with the number of fragments, as the score function does not penalize for increasing the number of fragments on a read. Thus, a SMURF-seq read aligned withk fragments can always be aligned withk + 1 fragments 49 with a score at least as high as withk fragments. As an example of how the alignment score of a SMURF-seq read grows as a function of k, consider a read consisting off fragments (typicallyf > 20 for a SMURF-seq read), however,f is not known a priori. If the read is aligned to the reference genome as one fragment, it is likely going to align to a random location on the reference genome; alternatively, one of the fragments in the read could align close to its true location, and the flanking fragments would align to flanking regions on the genome (essentially aligning these flanking fragments to a random location on the genome). Similarly, if the read is aligned as two fragments, it is likely going to align to two random location on the genome, or at most two fragments could align to regions close to their true location. With the alignment score for two fragments at least as high as with one fragment. As the read is aligned with increasingk up tof 1, the number of the number of bases on a read that are aligned to its true location on the genome would increase, and number of based aligned to random location would decrease. Although there will always bases that are not mapped to its true location. At k =f all the fragments on the read would be aligned to their true locations with no bases aligned to random locations. Ask increased beyondf, each fragment will be split into shorter fragments, the fragmentation location likely being at locations of sequencing errors (with a small increase in the alignment score). And if there are no more sequencing errors, the split can be anywhere on the read, with the alignment score remaining the same. Atk =m, i.e. each base on the read aligns as one fragment, the alignment would have the highest possible score. Consider a simulated SMURF-seq read from a genome of length 50 kb, the read has 20 frag- ments each of length 40 bp, and the read does not have any sequencing errors. Fig. 4.3a shows the alignment score as a function of the number of fragmentsk, with scoring a match as 1, a mismatch as 0, and not allowing indels. Atk = 1, the score is at the lowest, and increases withk. Atk = 20 every fragment is mapped to its true location, and since the read does not have any sequencing errors, the score is equal to the read length. Fork> 20, the score remains at the maximum. When a SMURF-seq read has sequencing errors, the alignment score increases with the number 50 0 5 10 20 30 200 400 600 800 Fragments (k) Alignment score 0 5 10 20 30 200 400 600 800 Fragments (k) Alignment score 0 5 10 20 30 200 400 600 800 Fragments (k) Alignment score 0 5 10 20 30 200 400 600 800 Fragments (k) Alignment score n = 50k, m = 800, frag. len. = 40, err. = 0% n = 50k, m = 800, frag. len. = 40, err. = 10% n = 50k, m = 800, frag. len. = 40, err. = 20% n = 50k, m = 753, frag. len. = 20 + Geom(20), err. = 20% 15 25 15 25 15 25 15 25 a b c d Figure 4.3: Alignment score of SMURF-seq read as a function of number of fragments. (a) Alignment score of a SMURF-seq read with 20 fragment (40 bp each) that does not have any errors. (b) Alignment score with 10% errors. (c) Alignment score with 20% errors. (d) Alignment score of a read with 20 fragments generated from a 20 + geometric(20) distribution and with 20% errors. of fragments, but would not reach the peak atk = 20 (Fig 4.3b; the SMURF-seq read is similar to 4.3a, but with 10% mismatch errors). The alignment score continues to increase beyondk = 20 but at a lower rate, and would eventually equal the read length as the number of fragments increase. Similarly, Fig. 4.3c shows the alignment score for a read with 20% mismatch errors. However, the 51 due to higher errors, the score is lower atk = 20, and the curve “flattens”. For a real SMURF-seq read, the fragment lengths cannot expected to be constant. Fig. 4.3d shows the increase in alignment score of read with 20 fragments, with each fragment length sam- pled from a 20+ geometric(20) (mean fragment length of 40 bp) distribution. The alignment score increases withk, but the change in slope atk = 20 is not prominent. 4.6 Score distribution under a random model Calculation of p-value for aligning a SMURF-seq read with a given fragmentation requires the null distribution of aligning reads generated at random with the same fragmentation. The problem of finding the null distribution is defined as follows: consider strings T and S are generated by drawing letters independently from the same distribution from an alphabeta2 with probability p a such that P a2 p a = 1. For a given fragment setP containingk elements, we need to determine the distribution ofscore T (S;P ). We use the following score function to obtain the distribution of score T (S;k) (we usescore T (S;k) as short-hand notation forscore T (S;P ) withjPj =k): s(a;b) = 8 > > > > > > < > > > > > > : 1 ifa =b 0 ifa6=b 1 otherwise. To determine the distribution of score T (S;k), we first consider the score distribution when k = 1, i.e. the entire read aligns as one fragment. Then, we consider the score distribution when k> 1 as the sum ofk = 1 distributions. 52 4.6.1 Score distribution of one fragment The score distribution ofscore T (S; 1) has similarities to the score distribution of local alignment scores (Altschul and Gish, 1996; Smith et al., 1983) and profile alignment scores (Goldstein and Waterman, 1994), but also differs from these. The distribution of score T (S; 1) differs from the local alignment as we require an end-to-end alignment ofS to a substring ofT , and also differs from the profile score distribution since the letters ofS are generated at random. Based on these results, we hypothesized that the distribution ofscore T (S; 1) is likely to follow an extreme value distribution. LetX j denote the score of aligningS withT [j:::j +m 1], a substring ofT starting atj, then X j = m1 X i=0 s(S[i];T [j +i]);j = 0;:::;nm + 1: Since the letters of T and S are i.i.d., we have X j binomial(m;p) wherep = P a= p 2 a . For a large enoughm,X j can be approximated by a normal distribution as X j normal(mp;mp(1p)): score T (S; 1) is the maximum score over all positions inT , score T (S; 1) = max 0jnm+1 X j : score T (S; 1) is a maximum of normal distributions, which follows an extreme value distribution (EVD) (Kotz and Nadarajah, 2000). We verified that score distribution ofscore T (S; 1) is well approximated by an extreme value 53 n = 50k, m = 40, k = 1 n = 50k, m = 100, k = 1 Alignment score Density Density EVD theoretical quantiles EVD theoretical quantiles Alignment score Sample quantiles Sample quantiles a b 20 24 28 0.0 0.2 0.4 20 24 28 32 20 22 24 26 28 30 40 45 50 55 0.0 0.1 0.2 0.3 45 50 55 42 46 50 54 Figure 4.4: Extreme value approximation for score T (S; 1). Empirical score distribution of score T (S; 1) with a fitted EVD using the method of moments estimator. Q-Q plot comparing the theoretical and empirical distributions are shown. (a)m = 40. (b)m = 100. distribution by generating a random genome of length 50 kb from the DNA alphabet with equal probabilities, and reads of length 40 bp and 100 bp. For each read length, we determined the score distribution by aligning 100,000 reads generated at random (Fig. 4.4a, b). The parameters for the EVD was estimated using the method of moments. Further, the score distribution for increas- ing read lengths shows an increasing trend in the mean and standard deviation of the distribution (Fig. 4.5a, b). 54 20 30 40 50 60 30 50 70 90 110 Fragment length (bp) Alignment score 850 950 1050 1150 3100 3400 3700 4000 Fragment length (bp) Alignment score a b Figure 4.5: Empirical score distribution approximation for score T (S; 1). (a) Empirical score distribution for m corresponding to shorter fragments. (b) Empirical score distribution form m corresponding to longer fragments. 4.6.2 Score distribution for a given fragment set The distribution ofscore T (S;k) fork > 1 and a given fragment setP is the sum ofk indepen- dent distributions ofscore T (S; 1), i.e the distribution ofscore T (S;k) is the sum ofk independent extreme value distributions score T (S;k) = k X i=1 score T (S[P i :::P i+1 ]; 1): The independence of the distributions for each fragment is justified because it is required that n>>m, and the probability of two fragments aligning to overlapping location onT is extremely small, and the mapping location of one fragment does not influence the mapping location of other fragments. The distribution of the sum and linear combination of extreme value distributions has been stud- ied (Cetinkaya et al., 2001; Loaiciga and Leipnik, 1999; Marques et al., 2015; Nadarajah, 2008). In Loaiciga and Leipnik (1999) the exact distribution of two independent Gumbel distributions is 55 Normal theoretical quantiles Normal theoretical quantiles Alignment score Alignment score Density Sample quantiles Sample quantiles Density n = 50k, m = 800, k = 20, frag. len. = 40 n = 50k, m = 1600, k = 40, frag. len. = 40 a b 440 450 460 470 0.00 0.04 0.08 435 445 455 465 435 445 455 465 880 900 920 0.00 0.02 0.04 0.06 880 900 920 880 900 920 Figure 4.6: Normal approximation forscore T (S;k) with equal fragment lengths. Empirical score distribution of score T (S;k) with a fitted normal using the method of moments estimator. All fragments are 40 bp. Q-Q plot comparing the theoretical and empirical distributions are shown. (a)m = 800;k = 20. (b)m = 1600;k = 40. given and in Nadarajah (2008) the exact distribution of the linear combination of Gumbel distri- butions is given. However, these distributions are not easily tractable and do not follow a standard named distribution. Here, we take the alternate approach of using the central limit theorem for approximatingscore T (S;k) as sum of independent distributions. The fragments in a SMURF-seq read align independent of one another, and when the fragments are of equal lengths, the distribution ofscore T (S;k) is a sum of i.i.d. random variables. Thus, we 56 Alignment score Density Alignment score Density Normal theoretical quantiles Sample quantiles Normal theoretical quantiles Sample quantiles n = 50k, m = 744, k = 20, frag. len. = 20 + geom(20) a b n = 50k, m = 1627, k = 40, frag. len. = 20 + geom(20) 410 420 430 440 0.00 0.04 0.08 410 420 430 440 410 420 430 440 880 900 920 0.00 0.02 0.04 0.06 880 900 920 880 900 920 Figure 4.7: Normal approximation for score T (S;k) with random fragment lengths. Empirical score distribution of score T (S;k) with a fitted normal using the method of moments estimator. The fragment lengths are generated from a distribution with mean 40. Q-Q plot comparing the theoretical and empirical distributions are shown. (a)m = 744;k = 20. (b)m = 1627;k = 40. can apply the central limit theorem to approximate the score distribution to a normal distribution as k!1. To test the approximation ofscore T (S;k) to a normal, we compared the empirical score distribution to the normal distribution fork = 20 and 40, values that are typical for a SMURF-seq read (Fig. 4.6). The fragments lengths for all the comparisons were kept constant at 40 bp, and the parameters for the normal distribution was determined using the method of moments estimator. In aligning a SMURF-seq read, we cannot expect the fragment lengths to be equal. The distri- 57 bution ofscore T (S;k), when the fragment lengths differ, is a sum of independent, but not identical, random variables. We empirically verified that this distribution is well approximated by a normal (Fig. 4.7). For each k, the fragment lengths were generated from at random from a distribution with mean 40. 4.7 Estimating the optimal fragment set The score distribution of aligning a random readS rand to a random genomeT rand can be used to estimate the optimalk for aligning a SMURF-seq readS SMURF to a reference genomeT ref . For a SMURF-seq read, find the best alignment scorek score and the fragment setk P for allk from 1 to m using algorithms given in section 4.4. In practice,k can be restricted to a much smaller subset of possible values, and the goal is to determine a value ofk that best represents the fragments that were ligated to generate the read. As described in the previous section, the null score distribution of aligning a read generated at random fork = 1 follows an extreme value distribution, and a normal distribution for sufficiently largek. For eachk, the parameters for the null distributions can be estimated from the empirical distribution by aligning random reads with the fragment setk P . The p-value for eachk can be de- termined by finding the probability of a score greater thank score from the null distribution. Finally, the optimalk for a read is the one with the lowest p-value (Algorithm 3). Algorithm 3 OptimalK (T;S) 1: k opt 1 2: pval opt 1 3: fork 1 tom 1 do 4: k score ;k P FragBoundaryIdentification(T ref ;S SMURF ;k) 5: k pval Pr(score T rand (S rand ;k P )>k score ) 6: ifk pval <pval opt then 7: pval opt k pval 8: k opt k 9: return k opt 58 0 5 10 15 20 25 30 300 400 500 600 700 −2000 −1500 −1000 −500 0 Fragments Alignment score log(p−value) n = 50k, m = 800, k = 20, err. = 10% Figure 4.8: Determining the optimal fragmentation of a SMURF-seq read. The black line is the alignment score of a simulated SMURF-seq read with 20 fragments of 40 bp each, and with 10% mismatch errors. The violin plots are the empirical null distributions using fragment set corresponding the best alignment score for eachk. The red line is the p-value for eachk determined from the alignment score and the null distribution. The optimal fragmentation has the lowest p- value. As discussed in section 4.5, as k increases up to k opt the number of bases on a SMURF-seq that are aligned to its true location on the reference genome would increase and the bases aligned to random locations would decrease. Thus, getting further away from a random alignment with an expected decrease in p-value. Atk =k opt , all bases are aligned to their true locations, and would have the lowest p-value. Ask gets farther away fromk opt , fragments on a read are further split into smaller fragments with a small increase in alignment score; eventually, with fragment aligning to random locations on the reference, and thus, getting closer to a random alignment with an expected increase in p-value. As an example, we simulated a reference genome of length 50 kb with the DNA alphabet having equal probabilities. A simulated SMURF-seq read with 20 fragments each of length 40 bp was generated from this genome, and 10% mismatch sequencing errors were introduced. This 59 read was then mapped back to the reference genome for values ofk = 1 to 30 using algorithm 2 (scoring a match as 1, a mismatch as 0, and not allowing indels), yielding the fragment set that maximizes the alignment score for each k. These fragment sets were then used to generate the null distribution by simulating a random reference genome with the same base probabilities, and aligning 10,000 random reads with fragment start locations based on the fragment set. The p-value for each fragmentation was determined using an EVD for k = 1 and normal distributions for k> 1 with parameters estimated using the method of moments from the simulated reads (although the normal approximation does not hold for small values of k, we have included it for clarity). The fragmentation with the smallest p-value was considered as the optimal fragmentation, and as expected, k = 20 has the lowest p-value; with the p-value increasing on either side of k = 20 (Fig. 4.8). The example considered here is simplified to illustrate the procedure to determine the opti- mal fragmentation of a SMURF-seq read. In aligning real SMURF-seq reads: (1) The reference genome would be significantly larger (e.g. the human genome) with different base (or dinucleotide) probabilities. (2) The fragments lengths cannot expected to be constant. (3) The sequencing error model will depend on the properties of the sequencing technology used. (4) The alignment score used would depend on the sequencing error model and would allow indels (with possibly different penalties for gap open and gap extend). (5) Parameters for the null distribution were determined empirically, which cannot be done in practice. Some of these issues are considered in the following sections. 4.7.1 Fast computation of p-values When aligning a SMURF-seq read, p-values need to calculated for a read to determine an optimal fragmentation among the potential fragmentations. It is thus important to be able to determine the p-value using an efficient procedure requiring the least amount of computation. However, the procedure used above requires generating an empirical distribution for each fragmentation to 60 Alignment score Density Alignment score Density Alignment score Density Alignment score Density n = 50k, m = 800, k = 20, frag. len. = 40 n = 50k, m = 1600, k = 40, frag. len. = 40 n = 50k, m = 744, k = 20, frag. len. = 20 + geom(20) n = 50k, m = 1627, k = 40, frag. len. = 20 + geom(20) a b c d Empirical distribution Analytical distribution 440 450 460 470 0.00 0.04 0.08 880 900 920 0.00 0.02 0.04 0.06 410 420 430 440 0.00 0.04 0.08 880 900 920 0.00 0.02 0.04 0.06 Figure 4.9: Fast computation of p-values as the as the sum of mean and variance of thek single fragment (k = 1) distributions corresponding to the fragment set. Red line shows the empirical null distribution, and blue line the sum ofk distributions. (a-b) Reads withk = 20 andk = 40 fragments of constant 40 bp length. (c-d) Reads with k = 20 and k = 40 fragments of length generated at random from a distribution with mean 40. determine the parameters for the null distribution. This process is computationally intensive and cannot be used in practice. For anyk > 1, the score distribution is the sum ofk independent extreme value distributions. 61 An approach for fast computation of p-values is to compute the mean and variance for thek = 1 distribution for all possible values of fragment lengths (i.e. from 1 bp to the maximum possible read length). Then the mean and variance for the normal distribution, for any fragmentation with k> 1 fragments, can be calculated from the sum of the mean and variance of thek single fragment (k = 1) distributions corresponding to the fragment set. As an example, for a fragmentation with three fragments of length 40, 100, 70 bp, the parameters for the normal distribution can be calculated as the sum of the mean and variance of the distribution of aligning 40 bp, 100 bp, 70 bp fragments individually to a random genome. Thus, for a reference genome and a given score function, the parameters for thek = 1 distribu- tion needs to determined just once. The parameters for the null distribution for any fragmentation can be computed by looking-up the table and simply adding the values. We tested the effectiveness of this procedure by computing the parameters for a genome of length 50 kb generated at random. Figure 4.9 compares the null distribution generated by aligning 100,000 random reads and the null distribution computed as the sum ofk = 1 distributions. 4.8 Limitations and future directions As SMURF-seq evolves, the fragments can be expected to get shorter challenging the existing mapping tools. A crucial element of aligning a SMURF-seq read is to determine the number of fragments on a SMURF-seq read and to determine the optimal fragment boundaries on a SMURF- seq read. To this end, we defined a score function for a SMURF-seq read, suggested algorithms to find fragment boundaries, and provided a statistical procedure to estimate the number of fragments on a read. However, there are still several questions that remain to be answered in this regard, both in terms of using this procedure for real SMURF-seq reads and in terms of understanding the score distribution of aligning SMURF-seq reads. Some of these are discussed in below. 62 Aligning with a general score function In the analysis of aligning a SMURF-seq read and the random distributions, we used a simple score function of scoring a match as 1, a mismatch as 0, and not allowing indels. However, such a score function is too simplified for use in practice. A score function for aligning a real SMURF-seq read would depend on several factors such as the error profile of the sequencing technology used and the algorithms used to align these reads. When scoring a mismatch with a negative penalty and allowing indelsX j , which denotes the score of aligning a read S to a substring of the reference T [j:::j +m 1], will not follow a binomial distribution. However, the extreme value distribution can still be used to approximate the maximum of any independent and identical distributions (i.e. not just the normal distribution) (Kotz and Nadarajah, 2000). Thus, we hypothesized that score T (S; 1) = max 0jnm+1 X j could be approximated with an EVD irrespective of the score function used. In the context of local alignment, the theoretical score distributions for a random model were derived when indels were not allowed (Karlin and Altschul, 1990; Karlin et al., 1990); and it was shown empirically that allowing indels follows similar distributions as allowing only mismatches, although a theoretical proof does not exist (Altschul and Gish, 1996; Pearson, 1998; Smith et al., 1985). We verified empirically that the score distribution of for k = 1, score T (S; 1), can still be approximated using an EVD when scoring a match as 1, and mismatch and indels scored as1. We generated a random genome of length 50 kb from the DNA alphabet with equal probabilities, and reads of length 40 bp and 100 bp. For each read length, we determined the score distribution by aligning 15,000 reads generated at random (Fig. 4.10a, b). The parameters for the EVD was estimated using the method of moments. 63 Then, we also verified empirically that the score distribution fork > 1, which is still the sum ofk independent extreme value distributions, is well approximated by a normal distribution. The fragments are 40 bp with 20 and 40 fragments per read (Fig. 4.11a, b). Although the extreme value and normal distributions are good approximations for the score function tested above, we have not tested other score functions. These approximations need to tested for the score function that is used to align SMURF-seq reads. Further, we have also not n = 50k, m = 40, k = 1 n = 50k, m = 100, k = 1 Alignment score Density Density EVD theoretical quantiles EVD theoretical quantiles Alignment score Sample quantiles Sample quantiles a b 12 16 20 24 0.0 0.1 0.2 0.3 15 20 25 30 12 14 16 18 20 22 20 25 30 35 40 0.0 0.1 0.2 25 30 35 40 45 25 30 35 Figure 4.10: Extreme value approximation for score T (S; 1) with a general score function. A match is scored as 1, mismatch and indels as1. Empirical score distribution of score T (S; 1) with a fitted EVD using the method of moments estimator. Q-Q plot comparing the theoretical and empirical distributions are shown. (a)m = 40. (b)m = 100. 64 Normal theoretical quantiles Normal theoretical quantiles Alignment score Alignment score Density Sample quantiles Sample quantiles Density n = 50k, m = 800, k = 20, frag. len. = 40 n = 50k, m = 1600, k = 40, frag. len. = 40 a b 310 330 350 0.00 0.02 0.04 0.06 310 320 330 340 350 310 330 350 630 650 670 690 0.00 0.02 0.04 630 650 670 630 650 670 Figure 4.11: Normal approximation forscore T (S;k) with a general score function. A match is scored as 1, mismatch and indels as1. Empirical score distribution ofscore T (S;k) with a fitted normal using the method of moments estimator. All fragments are 40 bp. Q-Q plot comparing the theoretical and empirical distributions are shown. (a)m = 800;k = 20. (b)m = 1600;k = 40. verified the validity of these approximation when gap open penalties are used in addition to gap extend and mismatch penalties. Error bounds for extreme value and normal approximations Any approximation for the null score distributions are just approximations, and would have asso- ciated limitations. We have not assessed the limitations and error bounds in making these approx- 65 imations. Aligning a SMURF-seq read as one fragment is approximated with an extreme value distribu- tion. Convergence to an EVD depends on the length of the genome,n. The minimum length of the genome for convergence, the error in making this approximation, and its dependence on fragment length needs to be determined. Aligning a SMURF-seq read with more than one fragment (k> 1) is approximated as a normal distribution, and the natural question is how large does k have to be for a good approximation. Further, understanding the dependence of the number of fragments on the fragment lengths are also crucial for determining the effectiveness of calculating the p-value for a fragmentation. For example, aligning two reads with a samek but one with shorter fragments and the other with longer could have to different error bounds for the normal approximation. In aligning a real SMURF-seq read, the dependence of these approximation on the choice of score function used also needs to evaluated. Effectiveness of the p-value procedure We determined the optimal fragmentation of a SMURF-seq read as the one with the lowest p-value. The effectiveness of this procedure for predicting the optimal fragmentation needs to be evaluated. This could be done with simulated SMURF-seq reads for which the number of fragments and fragment boundaries are known. (Simulated SMURF-seq reads can be generated by sampling substrings from a reference genome and then adding sequencing errors. They can also be generated by concatenating substrings from a long read sequenced on a nanopore machine without SMURF- seq, and thus, preserving the error profile.) Shorter fragments on a SMURF-seq read would lower the rate of increase of the alignment score with increasing the number of fragments of the read. This makes predicting the optimal number of fragments on a read challenging. Similarly, an increase in sequencing errors would also make the prediction challenging. Thus, the mispredictions of the optimal fragmentation of a read 66 needs to evaluated based on these factors. The random model used to determine the null distribution would have a significant effect on the p-value, and so on the effectiveness of the procedure to determine the optimal number of fragments on a read. Random models preserving nucleotide frequencies, dinucleotide frequencies, or local base composition could be used. The effect of the these random models for aligning SMURF-seq reads needs to be assessed. Extending to large genomes Here, we used a genome of length 50 kb generated with letters of the DNA alphabet. However, a reference genome for use with SMURF-seq can be expected to be significantly larger (e.g. the human genome). After determining the appropriate random model and the score function use use, the parameters for the null distribution needs to be estimated. For example, if the method described in section 4.7.1 is used, the mean and variance of aligning fragments of different lengths needs to be estimated. However, this requires aligning reads generated at random to a large genome to determine the maximum alignment score for a read. An alternate approach is to determine the dependence of the null distribution parameters on the length of the reads and length of the reference genome. Then the parameters for the required genome length and fragment length could be extrapolated from aligning reads to smaller genomes. Theoretical proof for score distributions We have shown empirically that the alignment score distribution of one fragment and multiple fragments are closely approximated by the extreme value and normal distributions, respectively. However, we have not theoretically proved the convergence to these distributions. A better theoretic understanding of these approximations would enable, among others, deter- 67 mining the error bounds in making these approximations. Similar to BLAST statistics (Karlin and Altschul, 1990; Karlin et al., 1990), it could also provide an efficient analytical procedure for determining the p-value of aligning a SMURF-seq read. SMURF-seq read mapping tool Finally, as SMURF-seq evolves and the fragments get shorter, accuracy in identifying fragment boundaries will begin to impact the ability of aligners to recover fragments, and algorithms de- signed specifically to map SMURF-seq reads will become essential. Mapping tools, in general, typically consists of a “search” step and an “alignment” step. The search step finds the candidate mapping locations of a read on the reference genome. The algorithm and the parameters used in this step determines the number of candidate locations that are passed on to the downstream steps. A “lenient” algorithm would pass too many candidates leading to an increase in the time spent in the subsequent steps. This could also lead to ambiguity in the downstream steps by obscuring a true alignment. On the other extreme, a too “stringent” algorithm would lead to a decrease in the sensitivity of the alignments found. The candidate regions found in the search step could be further refined, typically generating an intermediate alignment to make the final align step efficient. The align step performs a more detailed evaluation of similar regions on the read and the reference found in the previous steps, typically using a some form of modified Smith-Waterman algorithm. A mapping tool designed for SMURF-seq reads would vary substantially in both of these steps. The search step would require finding the candidate location of short fragments in the presence of high sequencing errors that is typical of nanopore sequencers. The align step for SMURF-seq reads would require incorporating a procedure to find accurate fragment boundaries and to estimate the optimal number of fragments on a read. A SMURF-seq read could have fragments that do not or originate from the reference genome (due to DNA molecules from contaminants or regions of high-sequencing errors on a read), or a 68 SMURF-seq read could have too few fragments for the normal approximation to hold. Developing a mapping tool for SMURF-seq reads would require taking into consideration these and several other factors, such as fragments aligning to multiple locations due to repeats on the genome, that we have not considered here. With shorter fragments on a SMURF-seq read and with a dedicated mapping tool for these reads, we expect the number of fragments in a read to increase. Thus, improving the efficiency of short-read sequencing on long-read machines, and potentially expanding the utility of nanopore machines beyond read-counting applications. 69 Chapter 5 Conclusions Long-read sequencers, especially the Oxford Nanopore Technologies MinION instrument, has widened the horizons of genome sequencing applications. SMURF-seq pushes this boundary a lit- tle bit more by allowing such a technology to be more efficiently leveraged in short-read sequencing for read-counting applications, as required for copy number profiling. The SMURF-seq approach sequences highly fragmented DNA molecules by concatenating them into longer molecules, and thus, utilizing long-read sequencers as optimized for long-read sequencing. Copy number profiling as a diagnostic and prognostic tool to evaluate cancer is expanding. With a fast and simple preparation method and a turnaround time measured in hours, the SMURF- seq approach could provide a highly efficient methodology for research and clinical laboratories where access to large-scale sequencing is limited. Multiplexing samples in a single run would drive down the cost of profiling significantly, and thereby, we hope, help translate the research on clinical significance of copy number profiling into patient outcomes. Optimizing SMURF-seq for shorter fragments would enable generating significantly higher read-counts from a single run of the MinION instrument. However, with shorter fragments, ac- curacy in identifying fragment boundaries will begin to impact the ability of aligners to recover fragments, and algorithms designed specifically to map SMURF-seq reads will become essential. 70 We envision a broadening of the applications of SMURF-seq as the underlying sequencing technology evolves and as SMURF-seq itself improves by continual decrease in fragment lengths, increase in sequenced read length, and data analysis methods optimized for SMURF-seq resulting in an increase in information yield per nucleotide sequenced. 71 References Viktor A Adalsteinsson, Gavin Ha, Samuel S Freeman, Atish D Choudhury, Daniel G Stover, Heather A Parsons, Gregory Gydush, Sarah C Reed, Denisse Rotem, Justin Rhoades, et al. Scalable whole-exome sequencing of cell-free dna reveals high concordance with metastatic tumors. Nature communications, 8(1):1324, 2017. Daniel Aird, Michael G Ross, Wei-Sheng Chen, Maxwell Danielsson, Timothy Fennell, Carsten Russ, David B Jaffe, Chad Nusbaum, and Andreas Gnirke. Analyzing and minimizing pcr amplification bias in illumina sequencing libraries. Genome biology, 12(2):R18, 2011. Mark Akeson, Daniel Branton, John J Kasianowicz, Eric Brandin, and David W Deamer. Microsecond time-scale discrimination among polycytidylic acid, polyadenylic acid, and polyuridylic acid as homopolymers or as segments within single rna molecules. Biophysical journal, 77(6):3227–3233, 1999. Joan Alexander, Jude Kendall, Jean McIndoo, Linda Rodgers, Robert Aboukhalil, Dan Levy, Asya Stepansky, Guoli Sun, Lubomir Chobardjiev, Michael Riggs, et al. Utility of single-cell ge- nomics in diagnostic evaluation of prostate cancer. Cancer research, 78(2):348–358, 2018. Stephen F Altschul and Bruce W Erickson. Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Molec- ular biology and evolution, 2(6):526–538, 1985. Stephen F Altschul and Warren Gish. [27] local alignment statistics. In Methods in enzymology, volume 266, pages 460–480. Elsevier, 1996. Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, 1990. Stephen F Altschul, Thomas L Madden, Alejandro A Sch¨ affer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research, 25(17):3389–3402, 1997. Stylianos E Antonarakis, Robert Lyle, Emmanouil T Dermitzakis, Alexandre Reymond, and Samuel Deutsch. Chromosome 21 and down syndrome: from genomics to pathophysiology. Nature reviews genetics, 5(10):725–738, 2004. 72 Richard Arratia and Michael S Waterman. An erd¨ os-r´ enyi law with shifts. Advances in mathemat- ics, 55(1):13–23, 1985. Richard Arratia, Louis Gordon, Michael Waterman, et al. An extreme value theory for sequence matching. The annals of statistics, 14(3):971–993, 1986. Richard Arratia, Michael S Waterman, et al. The erd¨ os-r´ enyi strong law for pattern matching with a given proportion of mismatches. The Annals of Probability, 17(3):1152–1169, 1989. Nurit Ashkenasy, Jorge S´ anchez-Quesada, Hagan Bayley, and M Reza Ghadiri. Recognizing a single base in an individual dna strand: a step toward dna sequencing in nanopores. Angewandte Chemie International Edition, 44(9):1401–1404, 2005. Yann Astier, Orit Braha, and Hagan Bayley. Toward single molecule dna sequencing: direct identi- fication of ribonucleoside and deoxyribonucleoside 5 -monophosphates by using an engineered protein nanopore equipped with a molecular adapter. Journal of the American Chemical Society, 128(5):1705–1710, 2006. Alberto Bardelli, Simona Corso, Andrea Bertotti, Sebastijan Hobor, Emanuele Valtorta, Giulia Sir- avegna, Andrea Sartore-Bianchi, Elisa Scala, Andrea Cassingena, Davide Zecchin, et al. Ampli- fication of the met receptor drives resistance to anti-egfr therapies in colorectal cancer. Cancer discovery, 3(6):658–673, 2013. Timour Baslan, Jude Kendall, Linda Rodgers, Hilary Cox, Mike Riggs, Asya Stepansky, Jennifer Troge, Kandasamy Ravi, Diane Esposito, B Lakshmi, et al. Genome-wide copy number analysis of single cells. Nature Protocols, 7(6):1024, 2012. Timour Baslan, Jude Kendall, Brian Ward, Hilary Cox, Anthony Leotta, Linda Rodgers, Michael Riggs, Sean D’Italia, Guoli Sun, Mao Yong, et al. Optimizing sparse sequencing of single cells for highly multiplex copy number profiling. Genome Research, 25(5):714–724, 2015. Hagan Bayley. Nanopore sequencing: from imagination to reality. Clinical chemistry, 61(1): 25–31, 2015. Yuval Benjamini and Terence P Speed. Summarizing and correcting the gc content bias in high- throughput sequencing. Nucleic acids research, 40(10):e72–e72, 2012. Seico Benner, Roger JA Chen, Noah A Wilson, Robin Abu-Shumays, Nicholas Hurt, Kate R Lieberman, David W Deamer, William B Dunbar, and Mark Akeson. Sequence-specific de- tection of individual dna polymerase complexes in real time using a nanopore. Nature nan- otechnology, 2(11):718, 2007. Rameen Beroukhim, Gad Getz, Leia Nghiemphu, Jordi Barretina, Teli Hsueh, David Linhart, Igor Vivanco, Jeffrey C Lee, Julie H Huang, Sethu Alexander, et al. Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proceedings of the National Academy of Sciences, 104(50):20007–20012, 2007. 73 Rameen Beroukhim, Craig H Mermel, Dale Porter, Guo Wei, Soumya Raychaudhuri, Jerry Dono- van, Jordi Barretina, Jesse S Boehm, Jennifer Dobson, Mitsuyoshi Urashima, et al. The land- scape of somatic copy-number alteration across human cancers. Nature, 463(7283):899, 2010. Jesse L Berry, Liya Xu, A Linn Murphree, Subramanian Krishnan, Kevin Stachelek, Emily Zolfaghari, Kathleen McGovern, Thomas C Lee, Anders Carlsson, Peter Kuhn, et al. Poten- tial of aqueous humor as a surrogate tumor biopsy for retinoblastoma. JAMA ophthalmology, 135(11):1221–1230, 2017. Jesse L Berry, Liya Xu, Irsan Kooi, A Linn Murphree, Rishvanth K Prabakar, Mark Reid, Kevin Stachelek, Bao Han A Le, Lisa Welter, Bibiana J Reiser, et al. Genomic cfdna analysis of aque- ous humor in retinoblastoma predicts eye salvage: The surrogate tumor biopsy for retinoblas- toma. Molecular Cancer Research, 16(11):1701–1712, 2018. Graham R Bignell, Chris D Greenman, Helen Davies, Adam P Butler, Sarah Edkins, Jenny M Andrews, Gemma Buck, Lina Chen, David Beare, Calli Latimer, et al. Signatures of mutation and selection in the cancer genome. Nature, 463(7283):893, 2010. Rory Bowden, Robert W Davies, Andreas Heger, Alistair T Pagnamenta, Mariateresa de Cesare, Laura E Oikkonen, Duncan Parkes, Colin Freeman, Fatima Dhalla, Smita Y Patel, et al. Se- quencing of human genomes with nanopore technology. Nature communications, 10(1):1–9, 2019. Daniel Branton, David Deamer, Andre Marziali, Hagan Bayley, Steven Benner, Thomas Butler, Slaven Garaj, Andrew Hibbs, Xiaohua Huang, Stevan Jovanovich, Predrag Krstic, and Stuart Lindsay. The potential and challenges of nanopore. Nature biotechnology, 26:1146, 08 2009. Clive G Brown and James Clarke. Nanopore development at oxford nanopore. Nature biotechnol- ogy, 34(8):810–811, 2016. Tom Z Butler, Mikhail Pavlenok, Ian M Derrington, Michael Niederweis, and Jens H Gundlach. Single-molecule dna detection with an engineered mspa protein nanopore. Proceedings of the National Academy of Sciences, 105(52):20647–20652, 2008. Peter J Campbell, Philip J Stephens, Erin D Pleasance, Sarah O’Meara, Heng Li, Thomas San- tarius, Lucy A Stebbings, Catherine Leroy, Sarah Edkins, Claire Hardy, et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired- end sequencing. Nature genetics, 40(6):722, 2008. Jean-Michel Carter and Shobbir Hussain. Robust long-read native dna sequencing using the ont csgg nanopore system. Wellcome open research, 2, 2017. Nigel P Carter. Methods and strategies for analyzing copy number variation using dna microarrays. Nature genetics, 39(7):S16–S21, 2007. 74 Scott L Carter, Kristian Cibulskis, Elena Helman, Aaron McKenna, Hui Shen, Travis Zack, Pe- ter W Laird, Robert C Onofrio, Wendy Winckler, Barbara A Weir, et al. Absolute quantification of somatic dna alterations in human cancer. Nature biotechnology, 30(5):413, 2012. Coskun Cetinkaya, Vikram Kanodia, and Edward W Knightly. Scalable services via egress admis- sion control. IEEE Transactions on multimedia, 3(1):69–81, 2001. Mark J Chaisson and Glenn Tesler. Mapping single molecule sequencing reads using basic local alignment with successive refinement (blasr): application and theory. BMC bioinformatics, 13 (1):238, 2012. KC Allen Chan, Peiyong Jiang, Yama WL Zheng, Gary JW Liao, Hao Sun, John Wong, Shing Shun N Siu, Wing C Chan, Stephen L Chan, Anthony TC Chan, et al. Cancer genome scanning in plasma: detection of tumor-associated copy number aberrations, single-nucleotide variants, and tumoral heterogeneity by massively parallel sequencing. Clinical chemistry, 59(1):211–224, 2013. Themoula Charalampous, Gemma L Kay, Hollian Richardson, Alp Aydin, Rossella Baldan, Christopher Jeanes, Duncan Rae, Sara Grundy, Daniel J Turner, John Wain, et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nature Biotechnology, 37(7):783–792, 2019. Yangho Chen, Tade Souaiaia, and Ting Chen. Perm: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics, 25(19):2514–2521, 2009. Gerald M Cherf, Kate R Lieberman, Hytham Rashid, Christopher E Lam, Kevin Karplus, and Mark Akeson. Automated forward and reverse ratcheting of dna in a nanopore at 5-˚ a precision. Nature biotechnology, 30(4):344, 2012. Derek Y Chiang, Gad Getz, David B Jaffe, Michael JT O’kelly, Xiaojun Zhao, Scott L Carter, Carsten Russ, Chad Nusbaum, Matthew Meyerson, and Eric S Lander. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nature Methods, 6(1):99, 2009. Dhananjay Chitale, Yixuan Gong, Barry S Taylor, Stephen Broderick, Cameron Brennan, Romel Somwar, Benjamin Golas, Lu Wang, Noriko Motoi, Janos Szoke, et al. An integrated genomic analysis of lung cancer reveals loss of dusp4 in egfr-mutant tumors. Oncogene, 28(31):2773, 2009. Atish D Choudhury, Lillian Werner, Edoardo Francini, Xiao X Wei, Gavin Ha, Samuel S Freeman, Justin Rhoades, Sarah C Reed, Gregory Gydush, Denisse Rotem, et al. Tumor fraction in cell- free dna as a biomarker in prostate cancer. JCI insight, 3(21), 2018. John Chu, Marcos Gonz´ alez-L´ opez, Scott L Cockroft, Manuel Amorin, and M Reza Ghadiri. Real-time monitoring of dna polymerase function and stepwise single-nucleotide dna strand translocation through a protein nanopore. Angewandte Chemie International Edition, 49(52): 10106–10109, 2010. 75 James Clarke, Hai-Chen Wu, Lakmal Jayasinghe, Alpesh Patel, Stuart Reid, and Hagan Bayley. Continuous base identification for single-molecule nanopore dna sequencing. Nature nanotech- nology, 4(4):265, 2009. Scott L Cockroft, John Chu, Manuel Amorin, and M Reza Ghadiri. A single-molecule nanopore device detects dna polymerase activity with single-nucleotide resolution. Journal of the Ameri- can Chemical Society, 130(3):818–820, 2008. Stephen Cristiano, Alessandro Leal, Jillian Phallen, Jacob Fiksel, Vilmos Adleff, Daniel C Bruhm, Sarah Østrup Jensen, Jamie E Medina, Carolyn Hruban, James R White, et al. Genome-wide cell-free dna fragmentation in patients with cancer. Nature, 570(7761):385–389, 2019. Emily Crowley, Federica Di Nicolantonio, Fotios Loupakis, and Alberto Bardelli. Liquid biopsy: monitoring cancer-genetics in the blood. Nature reviews Clinical oncology, 10(8):472, 2013. Angel E Dago, Asya Stepansky, Anders Carlsson, Madelyn Luttgen, Jude Kendall, Timour Baslan, Anand Kolatkar, Michael Wigler, Kelly Bethel, Mitchell E Gross, et al. Rapid phenotypic and genomic change in response to therapeutic pressure in prostate cancer inferred by high content analysis of single circulating tumor cells. PloS one, 9(8), 2014. Romina D’Aurizio, Tommaso Pippucci, Lorenzo Tattini, Betti Giusti, Marco Pellegrini, and Al- berto Magi. Enhanced copy number variants detection from whole-exome sequencing data using excavator2. Nucleic acids research, 44(20):e154–e154, 2016. M Dayhoff, R Schwartz, and B Orcutt. 22 a model of evolutionary change in proteins. In Atlas of protein sequence and structure, volume 5, pages 345–352. National Biomedical Research Foundation Silver Spring MD, 1978. David Deamer, Mark Akeson, and Daniel Branton. Three decades of nanopore sequencing. Nature biotechnology, 34(5):518, 2016. Daniel P Depledge, Kalanghad Puthankalam Srinivas, Tomohiko Sadaoka, Devin Bready, Yasuko Mori, Dimitris G Placantonakis, Ian Mohr, and Angus C Wilson. Direct rna sequencing on nanopore arrays redefines the transcriptional complexity of a viral pathogen. Nature communi- cations, 10(1):1–13, 2019. Thomas Derrien, Jordi Estell´ e, Santiago Marco Sola, David G Knowles, Emanuele Raineri, Roderic Guig´ o, and Paolo Ribeca. Fast computation and applications of genome mappability. PloS one, 7(1), 2012. Achilles Dugaiczyk, Herbert W Boyer, and Howard M Goodman. Ligation of ecori endonuclease- generated dna fragments into linear and circular structures. Journal of molecular biology, 96(1): 171–184, 1975. Paul Erd¨ os and P´ al R´ ev´ esz. On the length of the longest head-run. Topics in information theory, 16:219–228, 1975. 76 Dariush Etemadmoghadam, Anna Defazio, Rameen Beroukhim, Craig Mermel, Joshy George, Gad Getz, Richard Tothill, Aikou Okamoto, Maria B Raeder, Paul Harnett, et al. Integrated genome-wide dna copy number and expression analysis identifies distinct mechanisms of pri- mary chemoresistance in ovarian carcinomas. Clinical cancer research, 15(4):1417–1427, 2009. Philipp Euskirchen, Franck Bielle, Karim Labreche, Wigard P Kloosterman, Shai Rosenberg, Mai- lys Daniau, Charlotte Schmitt, Julien Masliah-Planchon, Franck Bourdeaut, Caroline Dehais, et al. Same-day genomic and epigenomic diagnosis of brain tumors using real-time nanopore sequencing. Acta Neuropathologica, 134(5):691–703, 2017. M Fanciulli, E Petretto, and TJ Aitman. Gene copy number variation and common human disease. Clinical genetics, 77(3):201–213, 2010. Nuno Rodrigues Faria, Ester C Sabino, Marcio RT Nunes, Luiz Carlos Junior Alcantara, Nicholas J Loman, and Oliver G Pybus. Mobile real-time surveillance of zika virus in brazil. Genome medicine, 8(1):97, 2016. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Foun- dations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 390–398. IEEE, 2000. Lars Feuk, Andrew R Carson, and Stephen W Scherer. Structural variation in the human genome. Nature Reviews Genetics, 7(2):85–97, 2006. Walter M Fitch. Random sequences. Journal of Molecular Biology, 163(2):171–176, 1983. Jennifer L Freeman, George H Perry, Lars Feuk, Richard Redon, Steven A McCarroll, David M Altshuler, Hiroyuki Aburatani, Keith W Jones, Chris Tyler-Smith, Matthew E Hurles, et al. Copy number variation: new insights in genome diversity. Genome research, 16(8):949–961, 2006. Daniel R Garalde, Elizabeth A Snell, Daniel Jachimowicz, Botond Sipos, Joseph H Lloyd, Mark Bruce, Nadia Pantic, Tigist Admassu, Phillip James, Anthony Warland, et al. Highly parallel direct rna sequencing on an array of nanopores. Nature methods, 15(3):201, 2018. Erik Gerdtsson, Milind Pore, Jana-Aletta Thiele, Anna Sandstr¨ om Gerdtsson, Paymaneh D Mal- ihi, Rafael Nevarez, Anand Kolatkar, Carmen Ruiz Velasco, Sophia Wix, Mohan Singh, et al. Multiplex protein detection on circulating tumor cells from liquid biopsies using imaging mass cytometry. Convergent Science Physical Oncology, 4(1):015002, 2018. Larry Goldstein and Michael S Waterman. Approximations to profile score distributions. Journal of Computational Biology, 1(2):93–104, 1994. Enrique Gonzalez, Hemant Kulkarni, Hector Bolivar, Andrea Mangano, Racquel Sanchez, Gabriel Catano, Robert J Nibbs, Barry I Freedman, Marlon P Quinones, Michael J Bamshad, et al. The influence of ccl3l1 gene-containing segmental duplications on hiv-1/aids susceptibility. Science, 307(5714):1434–1440, 2005. 77 Sara Goodwin, John D McPherson, and W Richard McCombie. Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics, 17(6):333, 2016. Jacqueline Goordial, Ianina Altshuler, Katherine Hindson, Kelly Chan-Yam, Evangelos Marcole- fas, and Lyle G Whyte. In situ field sequencing and life detection in remote (79 26’ n) canadian high arctic permafrost ice wedge microbial communities. Frontiers in microbiology, 8:2594, 2017. Louis Gordon, Mark F Schilling, and Michael S Waterman. An extreme value theory for long head runs. Probability Theory and Related Fields, 72(2):279–287, 1986. Mercedes E Gorre, Mansoor Mohammed, Katharine Ellwood, Nicholas Hsu, Ron Paquette, P Nagesh Rao, and Charles L Sawyers. Clinical resistance to sti-571 cancer therapy caused by bcr-abl gene mutation or amplification. Science, 293(5531):876–880, 2001. Michael Gribskov, Andrew D McLachlan, and David Eisenberg. Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences, 84(13):4355–4358, 1987. Arief Gusnanto, Henry M Wood, Yudi Pawitan, Pamela Rabbitts, and Stefano Berri. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics, 28(1):40–47, 2012. Brett Gyarfas, Felix Olasagasti, Seico Benner, Daniel Garalde, Kate R Lieberman, and Mark Ake- son. Mapping the position of dna polymerase-bound dna templates in a nanopore at 5 ˚ a resolu- tion. ACS nano, 3(6):1457–1466, 2009. Gavin Ha, Andrew Roth, Jaswinder Khattra, Julie Ho, Damian Yap, Leah M Prentice, Nataliya Melnyk, Andrew McPherson, Ali Bashashati, Emma Laks, et al. Titan: inference of copy num- ber architectures in clonal cell populations from tumor whole-genome sequence data. Genome research, 24(11):1881–1893, 2014. Philip J Hastings, James R Lupski, Susan M Rosenberg, and Grzegorz Ira. Mechanisms of change in gene copy number. Nature Reviews Genetics, 10(8):551–564, 2009. Ellen Heitzer, Imran S Haque, Charles ES Roberts, and Michael R Speicher. Current and future perspectives of liquid biopsies in genomics-driven oncology. Nature Reviews Genetics, 20(2): 71–88, 2019. Steven Henikoff and Jorja G Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, 89(22):10915–10919, 1992. James Hicks, Alexander Krasnitz, B Lakshmi, Nicholas E Navin, Michael Riggs, Evan Leibu, Diane Esposito, Joan Alexander, Jen Troge, Vladimir Grubor, et al. Novel patterns of genome rearrangement and their association with survival in breast cancer. Genome research, 16(12): 1465–1479, 2006. 78 Edward J Hollox, Ulrike Huffmeier, Patrick LJM Zeeuwen, Raquel Palla, Jes´ us Lascorz, Diana Rodijk-Olthuis, Peter CM Van De Kerkhof, Heiko Traupe, Gys De Jongh, Martin Den Heijer, et al. Psoriasis is associated with increased-defensin genomic copy number. Nature genetics, 40(1):23, 2008. Min Hu and Kornelia Polyak. Serial analysis of gene expression. Nature protocols, 1(4):1743, 2006. A John Iafrate, Lars Feuk, Miguel N Rivera, Marc L Listewnik, Patricia K Donahoe, Ying Qi, Stephen W Scherer, and Charles Lee. Detection of large-scale variation in the human genome. Nature genetics, 36(9):949–951, 2004. Nicholas T Ingolia, Sina Ghaemmaghami, John RS Newman, and Jonathan S Weissman. Genome- wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. science, 324(5924):218–223, 2009. Miten Jain, Hugh E Olsen, Benedict Paten, and Mark Akeson. The oxford nanopore minion: delivery of nanopore sequencing to the genomics community. Genome biology, 17(1):239, 2016. Miten Jain, Sergey Koren, Karen H Miga, Josh Quick, Arthur C Rand, Thomas A Sasani, John R Tyson, Andrew D Beggs, Alexander T Dilthey, Ian T Fiddes, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature Biotechnology, 2018a. Miten Jain, Sergey Koren, Karen H Miga, Josh Quick, Arthur C Rand, Thomas A Sasani, John R Tyson, Andrew D Beggs, Alexander T Dilthey, Ian T Fiddes, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads, 2018b. URLhttps://github.com/ nanopore-wgs-consortium/NA12878/blob/master/Genome.md. Tanjina Kader, David L Goode, Stephen Q Wong, Jacquie Connaughton, Simone M Rowley, Lisa Devereux, David Byrne, Stephen B Fox, Gisela Mir Arnau, Richard W Tothill, et al. Copy number analysis by low coverage whole genome sequencing using ultra low-input DNA from formalin-fixed paraffin embedded tumor tissue. Genome Medicine, 8(1):121, 2016. Samuel Karlin and Stephen F Altschul. Methods for assessing the statistical significance of molec- ular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences, 87(6):2264–2268, 1990. Samuel Karlin, Amir Dembo, and Tsutomu Kawabata. Statistical composition of high-scoring segments from molecular sequences. The Annals of Statistics, 18(2):571–581, 1990. John J Kasianowicz, Eric Brandin, Daniel Branton, and David W Deamer. Characterization of individual polynucleotide molecules using a membrane channel. Proceedings of the National Academy of Sciences, 93(24):13770–13773, 1996. Jude Kendall and Alexander Krasnitz. Computational Methods for DNA Copy-Number Analysis of Tumors, pages 243–259. Springer New York, New York, NY , 2014. 79 W James Kent. Blat–the blast-like alignment tool. Genome research, 12(4):656–664, 2002. Szymon M Kiełbasa, Raymond Wan, Kengo Sato, Paul Horton, and Martin C Frith. Adaptive seeds tame genomic sequence comparison. Genome research, 21(3):487–493, 2011. Martin Kircher and Janet Kelso. High-throughput DNA sequencing–concepts and limitations. Bioessays, 32(6):524–536, 2010. Jan O Korbel, Alexander Eckehart Urban, Jason P Affourtit, Brian Godwin, Fabian Grubert, Jan Fredrik Simons, Philip M Kim, Dean Palejev, Nicholas J Carriero, Lei Du, et al. Paired- end mapping reveals extensive structural variation in the human genome. Science, 318(5849): 420–426, 2007. Samuel Kotz and Saralees Nadarajah. Extreme value distributions: theory and applications. World Scientific, 2000. Niklas Krumm, Peter H Sudmant, Arthur Ko, Brian J O’Roak, Maika Malig, Bradley P Coe, Aaron R Quinlan, Deborah A Nickerson, Evan E Eichler, NHLBI Exome Sequencing Project, et al. Copy number variation detection and genotyping from exome sequence data. Genome research, 22(8):1525–1532, 2012. Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina An- tonescu, and Steven L Salzberg. Versatile and open software for comparing large genomes. Genome biology, 5(2):R12, 2004. Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome biology, 10(3):R25, 2009. Andrew H Laszlo, Ian M Derrington, Brian C Ross, Henry Brinkerhoff, Andrew Adey, Ian C Nova, Jonathan M Craig, Kyle W Langford, Jenny Mae Samson, Riza Daza, et al. Decoding long nanopore sequencing reads of natural dna. Nature biotechnology, 32(8):829, 2014. Rebecca J Leary, Jordan Cummins, Tian-Li Wang, and Victor E Velculescu. Digital karyotyping. Nature protocols, 2(8):1973, 2007. Rebecca J Leary, Mark Sausen, Isaac Kinde, Nickolas Papadopoulos, John D Carpten, David Craig, Joyce OShaughnessy, Kenneth W Kinzler, Giovanni Parmigiani, Bert V ogelstein, et al. Detection of chromosomal alterations in the circulation of cancer patients with whole-genome sequencing. Science translational medicine, 4(162):162ra154–162ra154, 2012. Richard M Leggett, Cristina Alcon-Giner, Darren Heavens, Shabhonam Caim, Thomas C Brook, Magdalena Kujawska, Samuel Martin, Ned Peel, Holly Acford-Palmer, Lesley Hoyles, et al. Rapid minion profiling of preterm microbiota and antimicrobial-resistant pathogens. Nature Microbiology, 5(3):430–442, 2020. Heng Li. Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv preprint arXiv:1303.3997, 2013. 80 Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094– 3100, 2018. Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows–wheeler trans- form. bioinformatics, 25(14):1754–1760, 2009. Heng Li and Richard Durbin. Fast and accurate long-read alignment with burrows–wheeler trans- form. Bioinformatics, 26(5):589–595, 2010. Jian Li, Tielin Yang, Liang Wang, Han Yan, Yinping Zhang, Yan Guo, Feng Pan, Zhixin Zhang, Yumei Peng, Qi Zhou, et al. Whole genome distribution and ethnic differentiation of copy number variation in caucasian and asian populations. PloS one, 4(11), 2009. Jian Li, Rachel L Dittmar, Shu Xia, Huijuan Zhang, Meijun Du, Chiang-Ching Huang, Brooke R Druliner, Lisa Boardman, and Liang Wang. Cell-free dna copy number variations in plasma from colorectal cancer patients. Molecular oncology, 11(8):1099–1111, 2017. Kate R Lieberman, Gerald M Cherf, Michael J Doody, Felix Olasagasti, Yvette Kolodji, and Mark Akeson. Processive replication of single dna molecules in a nanopore catalyzed by phi29 dna polymerase. Journal of the American Chemical Society, 132(50):17961–17972, 2010. William M Lin, Alissa C Baker, Rameen Beroukhim, Wendy Winckler, Whei Feng, Jennifer M Marmion, Elisabeth Laine, Heidi Greulich, Hsiuyi Tseng, Casey Gates, et al. Modeling ge- nomic diversity and tumor dependency in malignant melanoma. Cancer research, 68(3):664– 673, 2008. David J. Lipman, W. John Wilbur, Temple F. Smith, and Michael S. Waterman. On the statistical significance of nucleic add similarities. Nucleic Acids Research, 1984. Bo Liu, Dengfeng Guan, Mingxiang Teng, and Yadong Wang. rHAT: fast alignment of noisy long reads with regional hashing. Bioinformatics, 32(11):1625–1631, 2015. Bo Liu, Yan Gao, and Yadong Wang. LAMSA: fast split read alignment with long approximate matches. Bioinformatics, 33(2):192–201, 2017. Huanle Liu, Oguzhan Begik, Morghan C Lucas, Jose Miguel Ramirez, Christopher E Mason, David Wiener, Schraga Schwartz, John S Mattick, Martin A Smith, and Eva Maria Novoa. Accurate detection of m 6 A rna modifications in native rna sequences. Nature communications, 10(1):1–9, 2019. HA Loaiciga and RB Leipnik. Analysis of extreme hydrologic events with gumbel distributions: marginal and additive cases. Stochastic Environmental Research and Risk Assessment, 13(4): 251–259, 1999. Nicholas J Loman, Joshua Quick, and Jared T Simpson. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nature methods, 12(8):733–735, 2015. 81 Tzu-Pin Lu, Liang-Chuan Lai, Mong-Hsun Tsai, Pei-Chun Chen, Chung-Ping Hsu, Jang-Ming Lee, Chuhsing Kate Hsiao, and Eric Y Chuang. Integrated analyses of copy number variations and gene expression in lung adenocarcinoma. PloS one, 6(9):e24829, 2011. Robert Lucito, John Healy, Joan Alexander, Andrew Reiner, Diane Esposito, Maoyen Chi, Linda Rodgers, Amy Brady, Jonathan Sebat, Jennifer Troge, et al. Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation. Genome research, 13(10):2291–2305, 2003. Bin Ma, John Tromp, and Ming Li. Patternhunter: faster and more sensitive homology search. Bioinformatics, 18(3):440–445, 2002. Geoff Macintyre, Teodora E Goranova, Dilrini De Silva, Darren Ennis, Anna M Piskorz, Matthew Eldridge, Daoud Sie, Liz-Anne Lewsley, Aishah Hanif, Cheryl Wilson, et al. Copy number signatures and mutational processes in ovarian carcinoma. Nature Genetics, 50(9):1262, 2018. Alberto Magi, Davide Bolognini, Niccol´ o Bartalucci, Alessandra Mingrino, Roberto Semeraro, Luna Giovannini, Stefania Bonifacio, Daniela Parrini, Elisabetta Pelo, Francesco Mannelli, et al. Nano-gladiator: real-time detection of copy number alterations from nanopore sequencing data. Bioinformatics, 35(21):4213–4221, 2019. Giovanni Maglia, Marcela Rincon Restrepo, Ellina Mikhailova, and Hagan Bayley. Enhanced translocation of single dna molecules through-hemolysin nanopores by manipulation of inter- nal charge. Proceedings of the National Academy of Sciences, 105(50):19720–19725, 2008. Paymaneh D Malihi, Michael Morikado, Lisa Welter, Sandy T Liu, Eric T Miller, Radu M Cadaneanu, Beatrice S Knudsen, Michael S Lewis, Anders Carlsson, Carmen Ruiz Velasco, et al. Clonal diversity revealed by morphoproteomic and copy number profiles of single prostate cancer cells at diagnosis. Convergent Science Physical Oncology, 4(1):015003, 2018. Elizabeth A Manrao, Ian M Derrington, Mikhail Pavlenok, Michael Niederweis, and Jens H Gund- lach. Nucleotide discrimination with dna immobilized in the mspa nanopore. PloS one, 6(10), 2011. Elizabeth A Manrao, Ian M Derrington, Andrew H Laszlo, Kyle W Langford, Matthew K Hopper, Nathaniel Gillgren, Mikhail Pavlenok, Michael Niederweis, and Jens H Gundlach. Reading dna at single-nucleotide resolution with a mutant mspa nanopore and phi29 dna polymerase. Nature biotechnology, 30(4):349, 2012. Filipe J Marques, Carlos A Coelho, and Miguel De Carvalho. On the distribution of linear com- binations of independent gumbel random variables. Statistics and Computing, 25(3):683–701, 2015. Hideo Matsumura, Stefanie Reich, Akiko Ito, Hiromasa Saitoh, Sophien Kamoun, Peter Winter, G¨ unter Kahl, Monika Reuter, Detlev H Kr¨ uger, and Ryohei Terauchi. Gene expression analysis of plant host–pathogen interactions by supersage. Proceedings of the National Academy of Sciences, 100(26):15718–15723, 2003. 82 Hideo Matsumura, Kentaro Yoshida, Shujun Luo, Eiji Kimura, Takahiro Fujibe, Zayed Albertyn, Roberto A Barrero, Detlev H Kr¨ uger, G¨ unter Kahl, Gary P Schroth, et al. High-throughput supersage for digital gene expression analysis of multiple samples using next generation se- quencing. PloS one, 5(8), 2010. Edward M McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM (JACM), 23(2):262–272, 1976. Amit Meller, Lucas Nivon, Eric Brandin, Jene Golovchenko, and Daniel Branton. Rapid nanopore discrimination between single polynucleotide molecules. Proceedings of the National Academy of Sciences, 97(3):1079–1084, 2000. Craig H Mermel, Steven E Schumacher, Barbara Hill, Matthew L Meyerson, Rameen Beroukhim, and Gad Getz. Gistic2. 0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome biology, 12(4):R41, 2011. Eli L Moss, Dylan G Maghini, and Ami S Bhatt. Complete, closed bacterial genomes from micro- biomes using nanopore sequencing. Nature Biotechnology, pages 1–7, 2020. Florent Mouliere, Dineika Chandrananda, Anna M Piskorz, Elizabeth K Moore, James Morris, Lise Barlebo Ahlborn, Richard Mair, Teodora Goranova, Francesco Marass, Katrin Heider, et al. Enhanced detection of circulating tumor dna by fragment size analysis. Science translational medicine, 10(466):eaat4921, 2018a. Florent Mouliere, Richard Mair, Dineika Chandrananda, Francesco Marass, Christopher G Smith, Jing Su, James Morris, Colin Watts, Kevin M Brindle, and Nitzan Rosenfeld. Detection of cell- free dna fragmentation and copy number alterations in cerebrospinal fluid from glioma patients. EMBO molecular medicine, 10(12), 2018b. M Muthukumar. Theory of capture rate in polymer translocation. The Journal of Chemical Physics, 132(19):05B605, 2010. Saralees Nadarajah. Exact distribution of the linear combination of p gumbel random variables. International Journal of Computer Mathematics, 85(9):1355–1362, 2008. Yasuhito Nannya, Masashi Sanada, Kumi Nakazaki, Noriko Hosoya, Lili Wang, Akira Hangaishi, Mineo Kurokawa, Shigeru Chiba, Dione K Bailey, Giulia C Kennedy, et al. A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer research, 65(14):6071–6079, 2005. Nicholas Navin, Jude Kendall, Jennifer Troge, Peter Andrews, Linda Rodgers, Jeanne McIndoo, Kerry Cook, Asya Stepansky, Dan Levy, Diane Esposito, et al. Tumour evolution inferred by single-cell sequencing. Nature, 472(7341):90, 2011. Saul B Needleman and Christian D Wunsch. A general method applicable to the search for similar- ities in the amino acid sequence of two proteins. Journal of molecular biology, 48(3):443–453, 1970. 83 Layla Oesper, Ahmad Mahmoody, and Benjamin J Raphael. Theta: inferring intra-tumor hetero- geneity from high-throughput dna sequencing data. Genome biology, 14(7):R80, 2013. Adam B Olshen, ES Venkatraman, Robert Lucito, and Michael Wigler. Circular binary segmenta- tion for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557–572, 2004. William R Pearson. Comparison of methods for searching protein sequence databases. Protein Science, 4(6):1145–1160, 1995. William R Pearson. Empirical statistical estimates for sequence similarity searches. Journal of molecular biology, 276(1):71–84, 1998. William R Pearson and David J Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85(8):2444–2448, 1988. David G Peters, Elisa Heidrich O’Hare, Robert E Ferrell, Amin B Kassam, Howard Yonas, and Adam M Brufsky. Comprehensive transcript analysis in small quantities of mrna by sage-lite. Nucleic acids research, 27(24):e39–e44, 1999. Daniel Pinkel, Richard Segraves, Damir Sudar, Steven Clark, Ian Poole, David Kowbel, Colin Collins, Wen-Lin Kuo, Chira Chen, Ye Zhai, et al. High resolution analysis of dna copy number variation using comparative genomic hybridization to microarrays. Nature genetics, 20(2):207– 211, 1998. Jonathan R Pollack, Therese Sørlie, Charles M Perou, Christian A Rees, Stefanie S Jeffrey, Per E Lonning, Robert Tibshirani, David Botstein, Anne-Lise Børresen-Dale, and Patrick O Brown. Microarray analysis reveals a major direct role of dna copy number alteration in the transcrip- tional program of human breast tumors. Proceedings of the National Academy of Sciences, 99 (20):12963–12968, 2002. Rishvanth K Prabakar, Liya Xu, James Hicks, and Andrew D Smith. Smurf-seq: efficient copy number profiling on long-read sequencers. Genome biology, 20(1):134, 2019. Robert F Purnell and Jacob J Schmidt. Discrimination of single base substitutions in a dna strand immobilized in a biological nanopore. ACS nano, 3(9):2533–2538, 2009. Joshua Quick, Nicholas J Loman, Sophie Duraffour, Jared T Simpson, Ettore Severi, Lauren Cow- ley, Joseph Akoi Bore, Raymond Koundouno, Gytis Dudas, Amy Mikhail, et al. Real-time, portable genome sequencing for ebola surveillance. Nature, 530(7589):228–232, 2016. Arthur C Rand, Miten Jain, Jordan M Eizenga, Audrey Musselman-Brown, Hugh E Olsen, Mark Akeson, and Benedict Paten. Mapping dna methylation with high-throughput nanopore sequenc- ing. Nature methods, 14(4):411, 2017. Richard Redon, Shumpei Ishikawa, Karen R Fitch, Lars Feuk, George H Perry, T Daniel Andrews, Heike Fiegler, Michael H Shapero, Andrew R Carson, Wenwei Chen, et al. Global variation in copy number in the human genome. nature, 444(7118):444–454, 2006. 84 Saurabh Saha, Andrew B Sparks, Carlo Rago, Viatcheslav Akmaev, Clarence J Wang, Bert V o- gelstein, Kenneth W Kinzler, and Victor E Velculescu. Using the transcriptome to annotate the genome. Nature biotechnology, 20(5):508–512, 2002. Ulrich Schlecht, Janine Mok, Carolina Dallett, and Jan Berka. ConcatSeq: A method for increasing throughput of single molecule sequencing by concatenating short DNA fragments. Scientific Reports, 7(1):5252, 2017. Michael Sch¨ oniger and Michael S Waterman. A local algorithm for dna sequence alignment with inversions. Bulletin of mathematical biology, 54(4):521–536, 1992. Heidi Schwarzenbach, Dave SB Hoon, and Klaus Pantel. Cell-free nucleic acids as biomarkers in cancer patients. Nature Reviews Cancer, 11(6):426–437, 2011. Jonathan Sebat, B Lakshmi, Jennifer Troge, Joan Alexander, Janet Young, P¨ ar Lundin, Susanne M˚ an´ er, Hillary Massa, Megan Walker, Maoyen Chi, et al. Large-scale copy number polymor- phism in the human genome. Science, 305(5683):525–528, 2004. Jonathan Sebat, B Lakshmi, Dheeraj Malhotra, Jennifer Troge, Christa Lese-Martin, Tom Walsh, Boris Yamrom, Seungtai Yoon, Alex Krasnitz, Jude Kendall, et al. Strong association of de novo copy number mutations with autism. Science, 316(5823):445–449, 2007. Venkatraman E Seshan, Adam B Olshen, et al. DNAcopy: a package for analyzing DNA copy data, 2010. David L Shattuck, Jamie K Miller, Kermit L Carraway, and Colleen Sweeney. Met receptor con- tributes to trastuzumab resistance of her2-overexpressing breast cancer cells. Cancer research, 68(5):1471–1477, 2008. Jared T Simpson, Rachael E Workman, PC Zuzarte, Matei David, LJ Dursi, and Winston Timp. Detecting dna cytosine methylation using nanopore sequencing. Nature methods, 14(4):407, 2017. Temple F Smith, Michael S Waterman, et al. Identification of common molecular subsequences. Journal of molecular biology, 147(1):195–197, 1981. Temple F Smith, Michael S Waterman, and Christian Burks. The statistical distribution of nucleic acid similarities. Nucleic Acids Research, 13(2):645–656, 1985. TF Smith, MS Waterman, and JR Sadler. Statistical characterization of nucleic acid sequence functional domains. Nucleic acids research, 11(7):2205–2220, 1983. Ivan Sovi´ c, Mile ˇ Siki´ c, Andreas Wilm, Shannon Nicole Fenlon, Swaine Chen, and Niranjan Na- garajan. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nature Communications, 7:11307, 2016. 85 Mircea Cretu Stancu, Markus J Van Roosmalen, Ivo Renkens, Marleen M Nieboer, Sjors Mid- delkamp, Joep De Ligt, Giulia Pregno, Daniela Giachino, Giorgia Mandrile, Jose Espejo Valle- Inclan, et al. Mapping and phasing of structural variation in patient genomes using nanopore sequencing. Nature communications, 8(1):1–13, 2017. Paweł Stankiewicz and James R Lupski. Structural variation in the human genome and its role in disease. Annual review of medicine, 61:437–455, 2010. David Stoddart, Andrew J Heron, Ellina Mikhailova, Giovanni Maglia, and Hagan Bayley. Single- nucleotide discrimination in immobilized dna oligonucleotides with a biological nanopore. Pro- ceedings of the National Academy of Sciences, 106(19):7702–7707, 2009. Michael R Stratton, Peter J Campbell, and P Andrew Futreal. The cancer genome. Nature, 458 (7239):719, 2009. Darrin Stuart and William R Sellers. Linking somatic genetic alterations in cancer to therapeutics. Current opinion in cell biology, 21(2):304–310, 2009. John R Tyson, Nigel J O’Neil, Miten Jain, Hugh E Olsen, Philip Hieter, and Terrance P Snutch. Minion-based long-read sequencing and assembly extends the caenorhabditis elegans reference genome. Genome Research, 28(2):266–274, 2018. Hunter R Underhill, Jacob O Kitzman, Sabine Hellwig, Noah C Welker, Riza Daza, Daniel N Baker, Keith M Gligorich, Robert C Rostomily, Mary P Bronner, and Jay Shendure. Fragment length of circulating tumor dna. PLoS genetics, 12(7), 2016. E Van Binsbergen. Origins and breakpoint analyses of copy number variations: up close and personal. Cytogenetic and genome research, 135(3-4):271–276, 2011. Peter Van Loo, Silje H Nordgard, Ole Christian Lingjærde, Hege G Russnes, Inga H Rye, Wei Sun, Victor J Weigman, Peter Marynen, Anders Zetterberg, Bjørn Naume, et al. Allele-specific copy number analysis of tumors. Proceedings of the National Academy of Sciences, 107(39): 16910–16915, 2010. Victor E Velculescu, Lin Zhang, Bert V ogelstein, and Kenneth W Kinzler. Serial analysis of gene expression. Science, 270(5235):484–487, 1995. ES Venkatraman and Adam B Olshen. A faster circular binary segmentation algorithm for the analysis of array cgh data. Bioinformatics, 23(6):657–663, 2007. Jessie Villanueva, Jeffrey R Infante, Clemens Krepler, Patricia Reyes-Uribe, Minu Samanta, Hsin- Yi Chen, Bin Li, Rolf K Swoboda, Melissa Wilson, Adina Vultur, et al. Concurrent mek2 mutation and braf amplification confer resistance to braf and mek inhibitors in melanoma. Cell reports, 4(6):1090–1099, 2013. 86 Tian-Li Wang, Christine Maierhofer, Michael R Speicher, Christoph Lengauer, Bert V ogelstein, Kenneth W Kinzler, and Victor E Velculescu. Digital karyotyping. Proceedings of the National Academy of Sciences, 99(25):16156–16161, 2002. Zihua Wang, Peter Andrews, Jude Kendall, Beicong Ma, Inessa Hakker, Linda Rodgers, Michael Ronemus, Michael Wigler, and Dan Levy. Smash, a fragmentation and sequencing method for genomic copy number analysis. Genome Research, 26(6):844–851, 2016. Meni Wanunu, Jason Sutin, Ben McNally, Andrew Chow, and Amit Meller. DNA translocation governed by interactions with solid-state nanopores. Biophysical Journal, 95(10):4716–4725, 2008. Michael S Waterman and Martin Vingron. Rapid and accurate estimates of statistical significance for sequence data base searches. Proceedings of the National Academy of Sciences, 91(11): 4625–4628, 1994a. Michael S Waterman and Martin Vingron. Sequence comparison significance and poisson approx- imation. Statistical Science, pages 367–381, 1994b. David Weese, Anne-Katrin Emde, Tobias Rausch, Andreas D¨ oring, and Knut Reinert. Razers–fast read mapping with sensitivity control. Genome research, 19(9):1646–1654, 2009. Chia-Lin Wei, Patrick Ng, Kuo Ping Chiu, Chee Hong Wong, Chin Chin Ang, Leonard Lipovich, Edison T Liu, and Yijun Ruan. 5’ long serial analysis of gene expression (longsage) and 3’ longsage for transcriptome characterization and genome annotation. Proceedings of the National Academy of Sciences, 101(32):11701–11706, 2004. Barbara A Weir, Michele S Woo, Gad Getz, Sven Perner, Li Ding, Rameen Beroukhim, William M Lin, Michael A Province, Aldi Kraja, Laura A Johnson, et al. Characterizing the cancer genome in lung adenocarcinoma. Nature, 450(7171):893, 2007. Isaac P Witz and Orlev Levy-Nissenbaum. The tumor microenvironment in the post-paget era. Cancer letters, 242(1):1–10, 2006. Rachael E Workman, Alison D Tang, Paul S Tang, Miten Jain, John R Tyson, Roham Razaghi, Philip C Zuzarte, Timothy Gilpatrick, Alexander Payne, Joshua Quick, et al. Nanopore native rna sequencing of a human poly (a) transcriptome. Nature methods, 16(12):1297–1305, 2019. Bin Xu, J Louw Roos, Shawn Levy, EJ Van Rensburg, Joseph A Gogos, and Maria Karayiorgou. Strong association of de novo copy number mutations with sporadic schizophrenia. Nature genetics, 40(7):880, 2008. Seungtai Yoon, Zhenyu Xuan, Vladimir Makarov, Kenny Ye, and Jonathan Sebat. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome research, 19 (9):1586–1592, 2009. 87 Travis I Zack, Steven E Schumacher, Scott L Carter, Andrew D Cherniack, Gordon Saksena, Bar- bara Tabak, Michael S Lawrence, Cheng-Zhong Zhang, Jeremiah Wala, Craig H Mermel, et al. Pan-cancer patterns of somatic copy number alteration. Nature genetics, 45(10):1134, 2013. Ivan A Zaporozhchenko, Anastasia A Ponomaryova, Elena Yu Rykova, and Pavel P Laktionov. The potential of circulating cell-free rna as a cancer biomarker: challenges and opportunities. Expert review of molecular diagnostics, 18(2):133–145, 2018. Mehdi Zarrei, Jeffrey R MacDonald, Daniele Merico, and Stephen W Scherer. A copy number variation map of the human genome. Nature reviews genetics, 16(3):172–183, 2015. Adam M Zawada, Kyrill S Rogacev, S¨ oren M¨ uller, Bj¨ orn Rotter, Peter Winter, Danilo Fliser, and Gunnar H Heine. Massive analysis of cdna ends (mace) and mirna expression profiling identifies proatherogenic pathways in chronic kidney disease. Epigenetics, 9(1):161–172, 2014. 88 Appendix A Supplemental methods DNA samples The normal diploid female DNA was purchased from Promega (Cat. no. G1521). Breast cancer cell line SK-BR-3 (American Type of Culture Collection (ATCC), Cat. no. HTB-30) was cultured in RPMI-1640 medium (Thermo Fisher Scientific, Cat. no. 11875093) supplemented with 10% fetal bovine serum (FBS) (Thermo Fisher Scientific, Cat. no. 35011CV) and was maintained at 37 in a humidified chamber supplied with 5% CO 2 and was regularly tested for mycoplasma infection. Cell lysis and DNA purification The DNA from SK-BR-3 cells was extracted and purified with the QIAamp DNA Blood Mini Kit (Qiagen, Cat. no. 51104) following the protocol for cultured cells given by the manufacturer. RNA and proteins in the cells were degraded using RNase A stock solution (100 mg/ml) (Qiagen, Cat. no. 19101) and Protease-K (Qiagen, Cat. no. 19133) respectively. Both purchased female diploid DNA and extracted SK-BR-3 DNA were treated with the same downstream processes. 89 Fragmenting genomic DNA 2-3 μg of genomic DNA was fragmented with restriction enzyme Anza 64 SaqAI (Thermo Fisher Scientific, Cat. no. IVGN0644) for 30 min at 37 . The fragmented DNA was cleaned with the QIAquick PCR purification kit (Qiagen, Cat. no. 8106) and eluted with 34 μl nuclease-free water. The concentration of DNA was quantified on a Qubit Fluorometer v3 (Thermo Fisher Scientific, cat. no. Q33216) with the Qubit dsDNA HS assay kit (Thermo Fisher Scientific, cat. no. Q32854). Ligation of fragmented DNA 500 ng of fragmented DNA in 10 μl nuclease-free water was mixed with 10 μl Anza T4 DNA Ligase Master Mix (Thermo Fisher Scientific, Cat. no. IVGN210-4) and incubated for 30 min at room temperature. The ligated DNA was cleaned with 2 volume Ampure XP beads (Beckman Coulter, Cat. no. A63881) and eluted in nuclease-free water. This step was done in multiple tubes if more than 500 ng of fragmented DNA was needed to be ligated. The concentration of DNA was quantified on a Qubit Fluorometer v3 with the Qubit dsDNA HS assay kit to ensure1 μg ( 400 ng, if the Rapid kit was used for library preparation) remained. The size of the ligated DNA molecules were assessed with 1% agarose gel electrophoresis run at 90 V for 30 min. Library preparation (SQK-LSK108 1D DNA by ligation) 1 μg of re-ligated DNA in 45 μl of nuclease-free water was end-repaired and dA-tailed (New Eng- land Biolabs (NEB), Cat. no. E7546), followed by elution in nuclease-free water after 1:5 volume Ampure XP beads clean-up. Sequencing adapters (AMX1D) were ligated with Blunt/TA Ligase Master Mix (NEB, Cat.no. M0367) and cleaned with 0:4 volume Ampure XP beads and eluted using 15 μl Elution Buffer (ELB) following the manufacturer’s protocol (Oxford Nanopore Technologies (ONT), 1D genomic DNA by ligation protocol). 90 Multiplexed library preparation (EXP-NBD103 and SQK-LSK108) 700 ng of each re-ligated sample in 45 μl of nuclease-free water was end-repaired, dA-tailed (NEB, Cat. no. E7546), cleaned with 1:5 volume Ampure XP beads and eluted in nuclease-free water. Different Native Barcodes (NB-x) for each sample was ligated with Blunt/TA Ligase Master Mix (NEB, Cat.no. M0367), cleaned with 2 volume Ampure XP beads and eluted in nuclease-free water. Equimolar amounts of each sample was pooled to have 700 ng of DNA in 50 μl water. Barcode adapters (BAM) were ligated with Quick T4 DNA Ligase (NEB, Cat. no. E6056), cleaned with 0:4 volume Ampure XP beads and eluted using 15 μl Elution Buffer (ELB) following the manufacturer’s protocol (ONT, 1D native barcoding genomic DNA). Library preparation (SQK-RAD003 Rapid sequencing) 400 ng of re-ligated DNA was concentrated with 2 volume Ampure XP beads to 7:5 μl nuclease- free water. DNA was tagmented with Fragmentation Mix (FRA), and Rapid 1D Adapter (RPD) was attached following the manufacturer’s protocol (ONT, rapid sequencing). MinION sequencing and base-calling All the prepared libraries were loaded on R9.5 Flowcells following the manufacturer’s protocol (ONT) and sequenced for up to 48 hours using the script specific to library preparation protocol. Base-calling and de-multiplexing barcoded reads were performed using ONT Guppy (2.3.5) with the appropriate parameters based on the library preparation kit. 91 Sequencing RE digested normal diploid genome 1 μg of genomic DNA was fragmented with restriction enzyme Anza 64 SaqAI (Thermo Fisher Scientific, Cat. no. IVGN0644) for 30 min at 37 . The fragmented DNA was cleaned with the QIAquick PCR purification kit (Qiagen, Cat. no. 8106) and eluted with 31 μl nuclease-free water. The concentration of DNA was quantified on a Qubit Fluorometer v3 (Thermo Fisher Scientific, cat. no. Q33216) with the Qubit dsDNA HS assay kit (Thermo Fisher Scientific, cat. no. Q32854). 0:5 μg of restriction enzyme digested DNA in 45 μl of nuclease-free water was end-repaired and dA-tailed (New England Biolabs (NEB), Cat. no. E7546), followed by elution in nuclease- free water after 1:5 volume Ampure XP beads clean-up. Sequencing adapters (AMX1D) were ligated with Blunt/TA Ligase Master Mix (NEB, Cat.no. M0367) and cleaned with 1:0 volume Ampure XP beads (manufacturer’s protocol uses 0:4 volume XP beads, we increased to 1:0 to get as many short molecules as possible) and eluted using 15 μl Elution Buffer (ELB) following the manufacturer’s protocol (Oxford Nanopore Technologies (ONT), 1D genomic DNA by ligation protocol). The prepared library was loaded on R9.4 Flowcell following the manufacturer’s protocol (ONT) and sequenced for 48 hours. Base-calling was performed using ONT Guppy (2.3.5). Estimation of copy number variations CNV profiles were generated using the procedure described in Baslan et al. (2012) and Kendall and Krasnitz (2014) with the modification employed in Gerdtsson et al. (2018) and Malihi et al. (2018). Briefly, the human reference genome (hg19) was split into 5,000 (20,000 or 50,000) bins containing an equal number of uniquely mappable locations and the bin counts were determined using uniquely mapped fragments. Bins with spuriously high counts (‘bad bins’, typically around centromeric and telomeric regions) were masked for downstream analysis (Kendall and Krasnitz, 2014). This procedure normalizes bin counts for biases correlated with GC content by fitting 92 a LOWESS curve to the GC content by bin count, and subtracting the LOWESS estimate from each bin (Kendall and Krasnitz, 2014). Circular binary segmentation (CBS) (Olshen et al., 2004), implemented in DNAcopy (Seshan et al., 2010) package, then identifies breakpoints in the normal- ized bin counts. Following Gerdtsson et al. (2018) and Malihi et al. (2018), after CBS, spurious segmentation calls were removed. Comparison with Illumina WGS of SK-BR-3 genome. DNA from SK-BR-3 cells was used to construct WGS library with the NEBNext UltraII FS DNA Library Prep Kit (NEB, Cat. no. E7805) following the manufacturer’s instructions. After library quality and quantity assessment with Qubit 3.0 HS dsDNA assay and BioAnalyzer HS dsDNA assay (Agilent), libraries were sequenced on HiSeq 2500 (Illumina) with single-end 130 cycles mode. The reads were mapped with BWA-MEM using the default parameters, PCR duplicates were removed, and CNV profiles were generated using exactly the same method as used for SMURF- seq reads. The scatter plots and Pearson correlations comparing the CNV profiles were produced using R. 93 Appendix B Mapping SMURF-seq reads with long-read aligners Simulating SMURF-seq reads to evaluate mapping programs To test long-read mapping tools, we chose to create simulated reads with the technical character- istics we expect in idealized SMURF-seq data. We first selected a fragment length` and a number k of fragments per read. Then, for a given WGS nanopore data set, we took the set of mapped long reads as determined by BWA-MEM (with -x ont2d option). Each of the mapped reads was split into fragments of length ` (with a random offset of 0 to ` 1 at the start of the long read). Each fragment was validated by requiring that it did not overlap a deadzone in the genome (as determined by the deadzone program available from https://github.com/smithlabcode/utils for 40 bp). The reason for excluding deadzones is that even when a short fragment has a “known” mapping location when it is part of a longer read, we cannot compare its reported mapping loca- tion as a short fragment with that known location, since we expect any good mapping algorithm to identify that the fragment maps ambiguously. Among these validated fragments, subsets ofk were sampled uniformly at random and concatenated (in random order and orientation) to form 94 simulated SMURF-seq reads. The first and last fragments in a read should be slightly easier to identify and map than the rest, since one of their boundaries is known. Using the above procedure, we selectk = 20 so that the simulated reads have a sufficient number of fragments to eliminate the influence of the first and last fragments in each read on the results. There is no need to have largek otherwise. By lowering` and making the fragments shorter, the task of mapping the fragments becomes more challenging. Real SMURF-seq reads have fragment lengths determined by restriction site density, size selection, and other aspects of the experiments. But in testing mapping algorithms and optimizing parameters, there is no disadvantage to making the task more challenging. We only need to be able to distinguish the relative performance of different mapping tools and parameter combinations. Real SMURF-seq reads have varying fragment lengths, but in evaluating mapping tools, there is no need to randomize fragment lengths. None of the algorithms we evaluated are capable of either deducing or leveraging the fact that all simulated fragments have the same length. We selected` = 100, which begins to challenge the various mapping strategies. These values of` are slightly lower than the average in real SMURF-seq data. Evaluating performance using simulated SMURF-seq reads Within the simulated reads, the boundaries of each fragment are known a priori, as are their map- ping locations. We used this information to evaluate mapping tools in terms of (1) how well they identify fragments purely for the purpose of counting molecules, which is the primary informa- tion used in CNV analysis, and (2) how well they identify individual mapping bases within reads. The latter criteria becomes important in challenging cases and will be increasingly important as fragment sizes are reduced. Performance on identifying fragments: After mapping these simulated reads, each mapping result is called a predicted fragment. Each predicted fragment is considered a positive prediction, 95 and we assume an arbitrary order over positive predictions. A positive prediction is a true positive if: The predicted fragment maps uniquely. The mapping locations of at least half the bases in the predicted fragment are equal to the original mapping locations for those bases, and those bases are all part of the same original fragment (we assume that it is unlikely for two fragments on a simulated read to have the same mapping location but opposite orientation, and thus do not check for the orientation of a fragment). In this case, we say the predicted fragment is associated with that original fragment. The predicted fragment is the first among predicted fragments associated the same original frag- ment. False positives are predicted fragments that are not true positives. Any original fragment with no associated predicted fragment is a false negative. These criteria penalize splitting one original fragment or merging two original fragments. By defining true positives, false positives and false negatives we are able to calculate precision, recall, and F-score for a particular mapping strategy. Performance on identifying individual mapping bases: After mapping simulated reads, each mapping result is decomposed into individual nucleotides and associated with a location in the genome. Those locations are retained. We keep multiplicities, so when two mapped fragments overlap in the genome we count certain nucleotides twice. These are the predicted positive bases in the reference. The condition positive bases are those known a priori from the simulation. The original fragment mapping locations may overlap in the reference genome, leading to multiplicities in the condition positive bases, but with low probability. The true positives are the intersection of the condition positive and the predicted positive bases. When there are multiplicities of mapped fragments and simulated fragments overlapping the same bases in the reference genome, this is determined by taking the smaller of the two values. After removing the true positives bases, the remaining predicted positive bases are false positives, and the remaining condition positive bases are false negatives. These criteria penalize mapping approaches that do not cover the entire simu- 96 lated SMURF-seq reads, and also penalize approaches that predict fragments that overlap within the read. The true positives, false positives, and false negatives here allow us to assign precision and recall in terms of individual bases and corresponding F-scores. Although the reference bases for both predicted positive and condition positive could involve multisets, since our simulations used relatively low coverage this almost never happened. To generate simulated reads we used the standard long reads from four sequencing runs (Flow- cell ID: FAB42704, FAB42810, FAB49914, and FAF01253) in the public dataset available at https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md (Jain et al., 2018a,b). We downloaded the raw data from EBI (Run accession: ERR2184696, ERR2184704, ERR2184712, and ERR2184722) and base-called these with Guppy (version: 2.3.5). Initial selection of mapping tools We tested the following mapping tools: BWA-MEM (Li, 2013), Minimap2 (Li, 2018), LAST (Kiełbasa et al., 2011), GraphMap (Sovi´ c et al., 2016), BLASR (Chaisson and Tesler, 2012), rHAT (Liu et al., 2015), and LAMSA (Liu et al., 2017). These were selected either because they are known to perform well on certain mapping tasks or have unique properties that plausibly could help in mapping SMURF-seq reads. We tested each of these using default parameters on simulated reads and downsampled real SMURF-seq reads (data not shown). Among these BWA-MEM, Minimap2, and LAST had higher accuracy on simulated data, and the other tools identified at most 15 fragments per read on real data. Thus, we explored performance of BWA-MEM (0.7.17), LAST (963), and Minimap2 (2.15) in more detail, varying parameters to improve performance. We remark that none of these tools were designed to map SMURF-seq reads; results we report here do not reflect the overall performance of the various mapping tools, only that the three afore- mentioned tools happened to perform relatively well on a task for which they were not directly designed for. 97 Determining the optimal Smith-Waterman score for SMURF- seq reads In order to determine the optimal alignment score, we kept the seeding related parameters constant, and varied the alignment score combinations to perform a grid search. We varied the mismatch penalty from 1 to 6, gap open penalty from 0 to 4, and gap extend penalty form 1 to 4. The match score was fixed at 1. Thus for each tool we tested 120 (654) combinations of alignment scores. The seeding and chaining related parameters for each tool was set at follows (along with the four alignment scores): BWA-MEM:-x ont2d -k 12 -W 12 -T 30 Minimap2: -w 1 -m 10 -s 30 LAST (NEAR):lastal -Q0 -e 20 andlast-split -m 1 -s 30 We set the seeding and chaining parameters in a liberal manner to allow for higher sensitivity than the default parameter of each tool, and the minimum alignment score to output was set at 30. After aligning the simulated reads, we calculated the average precision and recall, each for the mapped fragment locations and nucleotides, for the four datasets. The F-score was computed for each, and the mean of the F-scores was used to determine the optimal alignment parameter for each tool. Based on these results BWA-MEM outperformed other tools for aligning SMURF-seq reads. BWA-MEM performed best with a mismatch, open, and extension penalty of 2, 1, 1 respectively. To further refine the optimal alignment parameter for BWA-MEM, we aligned the simulated reads with parameter values around the value described above with a higher resolution. We varied the mismatch penalty from 1.5 to 2.5, and open and extend penalties from 0.5 to 1.5 in increments of 0.25. However, BWA-MEM does not accept floating point values for alignment score param- eters. To overcome this, we scaled the alignment score proportionately to have integer values, i.e we varied the mismatch penalty from 6 to 10, open and extend penalties from 2 to 6, and fixed the match score at 4 (125 combinations). Based on these results, the highest accuracy was obtained 98 with the mismatch, open, and extension penalty of 2.5, 1.5, and 0.75, respectively (correspond- ing scaled values are 10, 6, and 3). We used these optimal alignment scores for mapping real SMURF-seq read, and all the CNV profiles presented are based on these. 99 Appendix C Data availability and summary of sequencing runs Sample Kit Reads Mean length Fragments Accession Diploid SQK-LSK108 270.82k 6.8 kb 7.28M SRX5893474 Diploid SQK-LSK108 497.92k 3.7 kb 7.55M SRX5893475 SK-BR-3 SQK-LSK108 146.98k 7.6 kb 4.52M SRX5893478 SK-BR-3 SQK-LSK108 132.64k 7.3 kb 4.02M SRX5893479 Diploid SQK-RAD003 213.38k 3.9 kb 2.81M SRX5893473 Multiplexed run EXP-NBD103 + 442.9k Diploid (BC01) SQK-LSK108 138.19k 4.8 kb 2.95M SRX5893472 SK-BR-3 (BC02) 144.57k 7.7 kb 4.97M SRX5893476 Diploid (short-read) SQK-LSK108 2.58M 630.9 bp SRX5893480 SK-BR-3 (WGS) Illumina WGS 5.56M 130 bp SRX5893477 Table C.1: Summary of sequencing run. Samples are processed with the SMURF-seq protocol, unless indicated otherwise. Sequence data generated during the study are available in SRA with the accession number PRJNA454059. 100
Abstract (if available)
Abstract
We present SMURF-seq, a protocol to efficiently sequence short DNA molecules on a long-read sequencer by randomly ligating them to form long molecules. Applying SMURF-seq using the highly portable and inexpensive Oxford Nanopore MinION yields up to 30 fragments per read, providing an average of 6.2 and up to 7.5 million mappable fragments per run, increasing information throughput for read-counting applications. ❧ Somatic copy number alterations play a significant role in cancer, and can be leveraged for diagnostic and personalized approaches to treatment. High-throughput short-read sequencing has been extremely efficient in copy number profiling
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Efficiently scaling PerM read mapping system in a computing cloud
PDF
Efficient algorithms to map whole genome bisulfite sequencing reads
PDF
Comparative genomics of translational regulation
PDF
A generalization of the accumulation curve in species sampling and its applications to high‐throughput sequencing
PDF
Computational methods for translation regulation analysis from Ribo-seq data
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Computational algorithms for studying human genetic variations -- structural variations and variable number tandem repeats
PDF
Feature engineering and supervised learning on metagenomic sequence data
PDF
Integrating high-throughput sequencing data to study gene regulation
PDF
danbing-tk: a computational genetics framework for studying variable number tandem repeats
PDF
Techniques for de novo sequence assembly: algorithms and experimental results
PDF
Clustering 16S rRNA sequences: an accurate and efficient approach
PDF
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
PDF
Sharpening the edge of tools for microbial diversity analysis
PDF
Analyzing disease progression using cross-sectional data
PDF
Big data analytics in metagenomics: integration, representation, management, and visualization
PDF
Non-parametric models for large capture-recapture experiments with applications to DNA sequencing
PDF
Fast search and clustering for large-scale next generation sequencing data
PDF
Breaking the plateau in de novo genome scaffolding
PDF
Orthogonal shared basis factorization: cross-species gene expression analysis using a common expression subspace
Asset Metadata
Creator
Kaliappan Prabakar, Rishvanth
(author)
Core Title
Efficient short-read sequencing on long-read sequencers
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
06/12/2020
Defense Date
04/21/2020
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
copy number alteration,long-read sequencing,nanopore sequencing,OAI-PMH Harvest,read-counting applications
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Smith, Andrew (
committee chair
), Chaisson, Mark (
committee member
), Hicks, James (
committee member
)
Creator Email
kaliappa@usc.edu,rishvanth@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-318408
Unique identifier
UC11664357
Identifier
etd-KaliappanP-8586.pdf (filename),usctheses-c89-318408 (legacy record id)
Legacy Identifier
etd-KaliappanP-8586.pdf
Dmrecord
318408
Document Type
Dissertation
Rights
Kaliappan Prabakar, Rishvanth
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
copy number alteration
long-read sequencing
nanopore sequencing
read-counting applications