Efficient algorithms to map whole genome bisulfite sequencing reads

by Guilherme de Sena Brandine

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY & BIOINFORMATICS)

August 2022

I would rather have questions that cannot be answered than answers that cannot be questioned.
Richard Feynman

Acknowledgements

I am deeply grateful to all the people with whom I shared this adventure of pursuing a Ph.D. Throughout these seven years at USC, I had the honor and privilege to make many good friends and meet so many wonderful scientists and aspiring scientists in all stages of their careers, from undergraduate students to scientists whose works I studied in college courses. Without you, this Ph.D. would not have been possible.

I would first like to start by sincerely thanking Prof. Andrew Smith for welcoming me to his lab as a first-year Ph.D. student and for subsequently teaching me essentially everything I know today about computational biology, programming, statistics and science in general. Not only will I carry this knowledge with me for the rest of my life, I will also carry the optimistic and rigorous approach to scientific problems you adopt and expect the Smith lab members to adopt. I have seen you and my lab mates solve problems I was convinced were impossible more times than I can remember. I aspire to continue to approach challenging problems the same way throughout my entire future career. I am honored to have you as my Ph.D. advisor and owe all my accomplishments to your mentorship.

I would also like to sincerely thank Prof. Mark Chaisson, Prof. Michael "Doc" Edge and Prof. Aiichiro Nakano for kindly agreeing to be part of my dissertation committee, as well as Prof. Fengzhu Sun, Prof. Nils Lindstrom and Prof. Peter Calabrese for participating in my qualifying committee. I have a profound admiration for all your works, and I hope to have the opportunity to contribute to some of them in the future. Thank you very much for helping me produce and communicate the best possible science; you are a major inspiration to me as an aspiring scientist and bioinformatician.

I am grateful to Prof. Mike Waterman for his contributions to the field of computational biology and for shaping the way it is taught at USC and many other bioinformatics programs across the world. I discovered the QCB program by learning the Smith-Waterman algorithm. It is still surreal to see how an algorithm influenced so significantly the direction my life would take. I am very lucky to have discovered Prof. Waterman's works when I was deciding what to do after graduating from college.

Thank you to all the Smith lab members, past and present, for all the amazing interactions through group meetings, journal clubs and conversations at the 408 quad. I had the privilege to see many of your ideas go from conception to publication, and that made me convinced that I could also do the same. A special thank you to my good friend Amal Thomas, who was my classmate, lab mate and roommate for several years, and someone I always knew I could rely on (I hope vice-versa). You are the absolute best! Thank you to the Smith Lab members Meng Zhou, Masaru Nakajima, Jenny Qu, Liz Ji, Ben Decato, Wenzheng Li, Chao Deng, Terence Li, Egor Dolzhenko, Hoda Mirsafian, Rishvanth Prabakar and Saket Choudhary for creating such an amazing academic environment from which I learned so much.
Thank you also to the Chaisson lab members, especially Jingwen Ren, Tony Lu and Keon Rabbani, for the insightful discussions in group meetings and for the valuable comments on my manuscripts. Thank you to the 2015 Molecular and Computational Biology cohort for being the best group I could hope to meet in this program. I moved to Los Angeles by myself, but you guys provided such a strong sense of community that I never felt alone. Thank you to my CBB colleagues Amal Thomas, Liana Engie and Jitin Singla for being amazing friends and for all the study nights prior to exams and assignment deadlines, and thank you to my MCB friends Veronica Haro-Acosta, Anne Nguyen, Amy Hammerquist, Kevin Manage, Lisa Welter, Michael Lough-Stevens and Hans Sebastian for helping me study for the molecular biology courses and for kindly showing me how to survive as a clueless young adult in America. I will forever cherish these memories!

I would also like to thank Prof. Andrew McMahon and the members of the McMahon lab at the Keck School for welcoming me as a collaborator and patiently teaching me everything I know about kidney biology today. I would especially like to thank Prof. Nils Lindstrom for providing me the opportunity to work with cutting-edge data, for patiently going through my analysis results as I learned how to do things myself, for kindly agreeing to be part of my qualifying committee and for overall just being a great friend and collaborator. I would also like to thank Tracy Tran, Andrew Ransick, Lisa Rutledge, Qiuyu Guo, Helena Bugacov, Pooja Hor, Rosa Sierra and Riana Parvez for truly making me feel like part of the Broad CIRM Center group. The time at the McMahon lab was an amazing experience of routine interactions with brilliant molecular biologists in which I learned a lot about where bioinformatics fits in modern-day stem cell biology research.

I would like to acknowledge Katie Boeck, Luigi Manna, Doug Burleson, Rokas Oginskis, Adolfo Dela Rosa, Miguel Franco, Ashley Tozzi, Hayley Peltz and the entire QCB staff for all the amazing work making sure the students' needs are always met. It is a lot of work to keep the program running smoothly so we can focus on doing our best research, and I certainly do not take that for granted. Thank you very much for everything you do!

Lastly, I would like to thank my family for providing the emotional support without which I would certainly not be here today. Thank you to my amazing mom Rita Sena for always providing for me, guaranteeing my education and a stable home, and encouraging me to pursue greater challenges, even if it meant letting your only child go. Thank you to the entire Sena family, my uncles, aunts and cousins, for always welcoming me home in my visits to Fortaleza and for always being there for me, no matter what happens or where we go. Thank you to my amazing wife Alexandra Sena, who, ever since we met, has been the foundation of my life. You have been beyond supportive of all the challenges we faced, and life is brighter and happier when we get to celebrate accomplishments and pick ourselves up from failures together. I love you more than words can describe! Thank you also to Gaby Pelayo, José Pelayo, Edgar Pelayo, Daniel Pelayo and Maria Pelayo for welcoming me to the family and making sure we are always taken care of. I am very proud to be part of this family!
Table of Contents

Epigraph
Acknowledgements
Abstract
1 Introduction
  1.1 Motivation and objectives
  1.2 Contributions
  1.3 List of variables
2 Whole genome DNA methylation analysis using short reads
  2.1 Biological relevance of DNA methylation
    2.1.1 Molecular mechanisms of DNA methylation
    2.1.2 DNA methylation correlates with transcriptional repression
    2.1.3 Methylation dynamics in embryogenesis
    2.1.4 Clinical applications of DNA methylation analysis
    2.1.5 Epigenome-wide association studies (EWAS)
  2.2 Bisulfite sequencing
    2.2.1 Whole genome bisulfite sequencing (WGBS)
    2.2.2 Methylation arrays and the Infinium platform
    2.2.3 Short reads are ideal for methylation analysis
  2.3 Computational analysis of bisulfite sequencing data
    2.3.1 Quality control of WGBS reads
    2.3.2 Methylation quantification from mapped reads
    2.3.3 Inference of methylation patterns within samples
    2.3.4 DNA methylation comparison between samples
  2.4 Challenges in whole genome bisulfite sequencing data analysis
    2.4.1 High volumes of sequencing data
    2.4.2 WGBS analysis often requires genome binning
    2.4.3 Methylome features depend on the reference genome
    2.4.4 Challenges in mapping reads to repetitive DNA
  2.5 Downstream analysis depends on accurate read mapping
3 Mapping bisulfite sequencing reads using a two-letter alphabet
  3.1 Definition of the read mapping problem
    3.1.1 Global and local alignments
    3.1.2 Read mapping: filtration and alignment of reads to large genomes
    3.1.3 Filtration algorithms for short read mapping
    3.1.4 Filtration of bisulfite-converted reads
  3.2 Algorithms to map bisulfite-converted reads
    3.2.1 State-of-the-art mappers for bisulfite-converted reads
    3.2.2 Every WGBS mapper uses a three-letter alphabet
  3.3 The two-letter alphabet encoding
    3.3.1 Indexing with a two-letter alphabet
    3.3.2 Theoretical analysis of the two-letter encoding
  3.4 Quantifying efficiency of an encoding scheme: the expected hit rate
    3.4.1 Definition of the expected hit rate of an encoding scheme
    3.4.2 Bounds on the expected hit rate
    3.4.3 Comparison of expected hit rates under two- and three-letter alphabets
  3.5 Another bisulfite mapping algorithm (abismal)
  3.6 Assessing performance of the abismal algorithm
    3.6.1 Comparison criteria: sensitivity, specificity and speed
    3.6.2 Comparison results
  3.7 Concluding remarks
4 Filtration for bisulfite sequencing read mapping using a hybrid alphabet
  4.1 The two-letter alphabet can be locally inefficient
  4.2 Hybrid alphabet index construction
  4.3 Memory storage requirement for the hybrid index
  4.4 Theoretical analysis of partition sizes
    4.4.1 The additional memory for hybrid indices in real genomes is marginal
  4.5 Concluding remarks
5 Constructing genome indices for efficient filtration
  5.1 Memory-efficient data structures for genome indices
    5.1.1 Indexing uniformly spaced positions
    5.1.2 Indexing through winnowing schemes
    5.1.3 The Burrows-Wheeler transform and the FM index
    5.1.4 Compression strategies provide opportunity for efficient filtration
  5.2 Efficient filtration through index compression
    5.2.1 Generalizing the expected hit rate for arbitrary indices
    5.2.2 Linear time construction of the minimum hit rate index
    5.2.3 The minimum hit rate winnowing scheme
    5.2.4 Empirical comparison of indexing strategies
  5.3 Empirical comparison of indexing and alphabet encoding strategies
  5.4 Concluding remarks
6 Future opportunities in short bisulfite sequencing read mapping
  6.1 Strategies to map paired-end reads
  6.2 Efficient implementations of the Smith-Waterman algorithm
  6.3 Choosing a local alignment scoring scheme
  6.4 Constructing a substitution scoring matrix
  6.5 Concluding remarks
7 Conclusions
Bibliography
Appendix
A The abismal algorithm
  A.1 Indexing a genome
  A.2 Mapping reads to an indexed genome
    A.2.1 Selecting hits through Hamming distance comparison
    A.2.2 Local alignment of hits
  A.3 The abismal output in a SAM file
  A.4 Mapping all combinations of bisulfite bases and genome strands
  A.5 Important implementation choices
B Supplementary Figures

List of Tables

1.1 List of variables used in this dissertation
3.1 Datasets used to benchmark abismal with other mappers
4.1 Hybrid alphabet partition sizes for k_2 = 25 and k_3 = 16
5.1 Comparison of mapping efficiency under various indexing strategies for k = 25 and w = 10
5.2 Comparison of mapping efficiency under various indexing strategies for k = 25 and w = 20
6.1 Optimal alignment scores for some real datasets
A.1 Four-bit encoding of each nucleotide letter

List of Figures

2.1 Comparison of WGBS datasets of varying quality
2.2 Methylome visualization of several cell phenotypes
3.1 Example comparison of alphabet encodings
3.2 Comparison of expected hit rate in model organism genomes
3.3 Fingerprint comparison between two-letter and three-letter encodings
3.4 An example of the abismal indexing and mapping algorithms
4.1 Example of the hybrid index construction algorithm
4.2 Hybrid size partitions in theoretical genomes
5.1 An example of the minimum expected hit rate problem
5.2 Comparison of k-mer frequencies under various encodings
5.3 Empirical comparison of combinations of alphabet and indexing strategies
6.1 Distribution of concordant pairs found by chance
6.2 Reads with fewer unconverted bases are less likely to map
B.1 Sensitivity comparison in simulated reads
B.2 Specificity comparison in simulated reads
B.3 F1 score comparison in simulated reads
B.4 Sensitivity comparison in real paired-end datasets
B.5 Sensitivity comparison in real RPBAT datasets
B.6 Specificity comparison in real paired-end datasets
B.7 Specificity comparison in real RPBAT datasets
B.8 Mapping speed comparison in real paired-end datasets
B.9 Mapping speed comparison in random PBAT datasets
Abstract

Mapping short reads to a reference genome is a classic problem in computational biology. Efficient solutions to the read mapping problem motivated decades of research in the fields of mathematics and computer science. Results of these research efforts are the theoretical basis for mapping software tools that are fundamental to most modern molecular biology research. With the increasing amount of data generated in high-throughput sequencing, software tools to map reads efficiently are increasingly necessary.

Quantifying DNA cytosine methylation is a common application of high-throughput sequencing for epigenome-wide studies. Methylation of cytosines in a DNA molecule is a stable and heritable epigenetic trait that is strongly correlated with various phenotypes. Changes in DNA methylation are associated with many biological processes during the lifespan of metazoan organisms, including stem cell development, aging, X-chromosome inactivation, genomic imprinting and cancer progression. Cross-species analysis of methylation in specific tissues can also demonstrate the evolutionary changes in gene activation or silencing, thus providing greater insights on the history of functional elements beyond sequence alone.

Bisulfite sequencing is the gold standard to quantify methylation genome-wide. Through bisulfite treatment coupled with sequencing, cytosines that do not carry the methyl mark are sequenced as thymines. Mapping bisulfite sequencing reads to a reference genome allows DNA cytosine methylation to be estimated at single-base resolution. Most bisulfite sequencing read mapping algorithms adapt methods developed for unconverted DNA to account for bisulfite conversion. These direct adaptations can be particularly inefficient due to the uneven frequency of letters resulting from the biochemical conversion of nucleotides in bisulfite sequencing reads.

In this dissertation, we provide efficient solutions for the bisulfite-converted short read mapping problem. To attain this, we construct the mathematical framework to quantify the expected number of alignments required to map a read, after which we construct a formal methodology to encode reads and the reference genome to minimize this value. The main contribution of this dissertation is the introduction of a two-letter alphabet encoding that distinguishes purines and pyrimidines. We show that this encoding has various applications to reduce the time and memory required to map bisulfite-converted reads. We used the two-letter alphabet to create another bisulfite mapping algorithm (abismal), a novel software tool to map short bisulfite sequencing reads. Using data from various public repositories, we show that abismal is faster, uses less RAM, and attains more accurate results than mappers commonly used in epigenetics research. These results allow mapping efficiency to keep pace with the high volumes of bisulfite sequencing data being generated periodically by epigenetics studies. The improved speed and memory efficiency of the abismal software tool makes genome-wide methylation data analysis more accessible to the scientific community.

Chapter 1

Introduction

1.1 Motivation and objectives

High-throughput sequencing technologies provide unprecedented opportunities to deepen our understanding of molecular mechanisms driving the biology of cells. Among its numerous applications, sequencing is a primary tool to quantify genetic and epigenetic information encoded in purified DNA.
The word "epigenetics" is used to describe any heritable changes to the DNA that do not alter the DNA sequence itself. Examples include DNA cytosine methylation, histone modifications, chromatin accessibility and pairwise physical co-localizations of sequences within chromosomes. These modifications often correlate with activation or repression of functional elements in the genome and may have causal relationships to transcription, disease progression and many other relevant biological mechanisms and phenotypes.

Of particular importance among epigenetic traits is DNA cytosine methylation, which is characterized by the presence of a methyl radical (-CH3) on the fifth carbon of the pyrimidine ring in cytosines. DNA cytosine methylation is observed in many prokaryote and eukaryote organisms, including all vertebrates. The location of methylated cytosines in DNA is a blueprint of the cell's identity, as the methylome is often correlated with other traits in the epigenome. These traits often regulate gene expression at the transcription level, thus defining the cell's phenotype. DNA cytosine methylation is a heritable trait, and enzymes that maintain methylation of daughter cells during cell division are necessary for healthy development. For humans, the disease role of DNA cytosine methylation is becoming evident, with the methylation state of certain cytosines being a strong predictor of age for any given tissue, and certain methylation events being consistently observed across tissues prior to the progression of many diseases. With the constantly decreasing cost of high-throughput sequencing, we expect methylation analysis to become an accessible tool for health diagnostics and several molecular biology research studies in the very near future.

Over the last decade, whole genome bisulfite sequencing (WGBS) became the standard method to quantify methylation levels in cytosines at the whole genome scale. When DNA is isolated from tissue cells, the addition of bisulfite converts unmethylated cytosines to uracils, which are copied as thymines through PCR, while not affecting methylated cytosines. The converted and amplified DNA can then be sequenced using high-throughput sequencing technologies. Short DNA sequences are then analyzed in silico through reconstruction of the sample's methylome, which can then be used for various application-specific analyses to attain novel biological insights.

WGBS is experimentally similar to whole genome sequencing (WGS) assays, which sequence DNA directly, but downstream analysis methods often differ from WGS. In both cases, however, analysis of sequencing data starts with mapping sequenced reads to the organism's reference genome. This step recovers each read's most likely location of origin, which is used to estimate the fraction of methylated bases for each cytosine in the genome. Read mapping algorithms are ubiquitous in bioinformatics, but many of the existing mapping tools require a large amount of time and specialized hardware to map entire datasets, especially for WGBS reads. The large number of reads necessary to attain sufficient coverage in animal genomes requires mapping algorithms to be simultaneously time-efficient and sensitive to nucleotide differences relative to the reference genome in the form of mismatches, insertions and deletions. Coupled with the information loss due to bisulfite treatment, this often poses major limitations in methylation analysis using WGBS.
1.2 Contributions

This dissertation introduces three algorithms that improve the efficiency of WGBS read mapping compared to other tools commonly used in scientific studies. The algorithms described in this dissertation address the bisulfite sequencing read mapping problem. In loose terms, this problem is to find a genome subsequence with maximum similarity to a read, while allowing bisulfite-converted letters in the read to match two possible letters in the reference genome. We will describe novel bisulfite read mapping algorithms and evaluate them both theoretically and through empirical comparisons between our implementation and existing software tools. Comparisons with state-of-the-art tools are done in terms of mapping speed, memory requirements to run the algorithm to completion and correctness of the output in the form of sensitivity and specificity.

Before introducing the formal problem statement, we describe the main motivations and objectives of WGBS data analysis. In Chapter 2, we describe processes in molecular biology that can be better understood by quantifying DNA cytosine methylation. We start with a brief history of the discovery of methyl marks in nucleotide sequences and the subsequent discoveries of the role of DNA cytosine methylation in transcriptional repression and genome stability. We then show examples of various applications of high-throughput sequencing, in particular WGBS, in historically important studies. We then discuss WGBS data analysis pipelines, the quality control issues of concern, the methods to map high-quality reads to a reference genome and the various features that can be quantified once WGBS reads are mapped. We finish the chapter by addressing the existing challenges in mapping short WGBS reads. Challenges regarding mapping speed will be addressed in subsequent chapters, while challenges related to other aspects of WGBS data analysis will be left open for future research and described in more detail in Chapter 6.

In Chapter 3, we discuss the statistical properties of letter frequencies in WGBS data and how bisulfite conversion makes the short read mapping problem more challenging, in many aspects, than mapping unconverted DNA reads. To mitigate these challenges, we propose the use of a two-letter alphabet encoding for bisulfite sequencing reads and show how this alphabet can be used to construct efficient algorithms to map WGBS reads to a reference genome. We will analyze the theoretical advantages of this two-letter encoding relative to the three-letter encoding that is commonly used to map WGBS reads. In addition to the theoretical analysis, this chapter introduces another bisulfite mapping algorithm (abismal), a novel mapping software tool implemented based on the two-letter alphabet. We will describe many implementation choices that result in abismal's improved speed and memory efficiency relative to existing WGBS mappers. We will then conclude with empirical comparisons between abismal and various mappers, along with avenues for improvements in speed and accuracy that can be explored.

In Chapter 4, we expand the theory devised in Chapter 3 to address some of the limitations of the two-letter alphabet. We show that, if a slightly higher amount of memory is available, we can locally decide, based on the genome sequence, which alphabet encoding is most effective.
We then introduce the concept of a hybrid alphabet, where parts of the genome sequence are encoded in the two-letter alphabet and other parts are encoded in the three-letter alphabet. We will devise the mathematical framework that expresses the expected mapping time of this hybrid alphabet as an objective function and show how to construct the partition that minimizes this objective. We show that, as expected from the theory constructed in Chapter 3, the hybrid alphabet results in most of a real reference genome being encoded in two letters. We will show that this means the hybrid alphabet improves mapping times at the expense of a very marginal increase in memory.

In Chapter 5, we construct algorithms to improve mapping efficiency through the use of compressed genome indices. We will show that these compression strategies also provide opportunities for new algorithms that improve mapping efficiency by accounting for k-mer frequency when deciding which positions to index. We then introduce the concept of the minimum hit rate index, which is the reference index that, under certain constraints, minimizes the expected number of alignments of a read given an encoding scheme. We will conclude by showing that the minimum hit rate index and the hybrid alphabet from Chapter 4 can be easily combined into an algorithm that compounds the advantages of both strategies.

In Chapter 6, we discuss other steps in a WGBS read mapping algorithm that affect speed and the accuracy of results. We focus on three elements of mapping that provide future opportunities for improvement. First, we discuss the data structures used in paired-end mapping, where the two ends of a DNA fragment are sequenced and are expected to map on opposite strands and in proximity. Second, we discuss optimized implementations of the Smith-Waterman algorithm that use hardware-specific instructions to perform several calculations of the algorithm in parallel with a single instruction. Third, we discuss the effects of scoring matrices on the accuracy of results, and how scores for different combinations of substitutions, as well as insertions and deletions (indels), can be obtained from real reads. The overarching goal of this chapter is both to justify the decisions made to overcome these problems in the abismal algorithm and to discuss future avenues for improvement.

Finally, in Chapter 7, we summarize the main results of Chapters 3 to 6. We describe the potential impact of the research produced in this dissertation and how it fits in the fast-paced field of bioinformatics, where both data and algorithms are produced at an accelerating rate.

The overarching goal of this dissertation is to provide the theoretical basis and the algorithms that make DNA cytosine methylation data analysis efficiently feasible on machines with low computing power and to pave the short-term roadmap to the upcoming opportunities in bisulfite sequencing short read mapping. The improved empirical mapping efficiency attained using the abismal software tool increases the accessibility of in silico DNA cytosine methylation analysis to experimental biologists who may not have access to high-performance computing centers and provides efficient solutions to open problems in associating bisulfite sequencing data with meaningful biological knowledge.

1.3 List of variables

Most of this dissertation will use mathematical symbols that are shared across chapters. For convenience, we list below the set of letters used in mathematical formulations and what they represent (Table 1.1).
These variables will be formally defined when they are first introduced, and subscripts will be used to describe their context-specific applications.

Table 1.1: List of variables used in this dissertation

Σ : the nucleotide alphabet Σ = {A, C, G, T}.
T, Q : arbitrary strings, or ordered sets of characters over Σ.
s(T, Q) : the alignment score of strings T and Q.
M, X, I, D : non-negative scores for matches (M), mismatches (X), insertions (I) and deletions (D) in the scoring functions.
G : the reference genome, an element over Σ^m.
m : the number of characters in G, also represented as |G|.
r : a query read to be mapped to the reference genome; an element in Σ^n.
n : the number of characters in r, also called read length and represented by |r|.
k : the length of a substring of r or G, with 0 ≤ k < n or 0 ≤ k < m.
u : an arbitrary k-mer, or a string in Σ^k that is a substring of r or G.
h(·) : a character encoding function that takes a character in Σ and returns a numerical value.
H(·) : a fingerprint encoding function that takes a k-mer in Σ^k and returns a numerical value.
v : a fingerprint of an arbitrary k-mer; the output of H(u) for some k-mer u.
Ω : the set of all possible fingerprints, or all possible values of v.
S : a genome index; a subset of {0, ..., m - 1}.
N(v) : the number of k-mers with fingerprint v in G.
C(S, v) : the number of k-mers with fingerprint v in G whose starting position is an element in S. When the set S is implied by the context, we will simply say C(v).
Z(S) : the expected hit rate of an index S; the expected value of C(S, v), where v is the fingerprint of a uniformly sampled k-mer in G.
z : the theoretical expected hit rate of an infinite i.i.d. genome.
w : a window size, used to refer to a set of w consecutive positions in G or r.
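As a brief illustration of how these symbols combine (a restatement using only the definitions in Table 1.1; the formal development appears in Chapters 3 and 5): since G contains m - k + 1 k-mers, of which N(v) share the fingerprint v, the expected hit rate of an index S expands to

Z(S) = sum over v in Ω of [N(v) / (m - k + 1)] · C(S, v),

that is, an average of the per-fingerprint counts C(S, v) weighted by how often each fingerprint occurs in the genome.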
This dissertation focuses exclusively on computational analyses of DNA cytosine methylation, which we will subsequently refer to as simply DNA methylation, where methylation of cytosines will be implied. All vertebrates exhibit DNA methylation, and the location of methylated cytosines correlates with the functional role of cells within tissues. In these organisms, methylation occurs almost exclusively in the palindromic cytosine-guanine (CpG) dinucleotides (here “p” represents the phosphate group, indicating the directionality in which the nucleotides are read), with methylation in cytosines of opposing strands often being identical due to the role of enzymes that ensure methylation symmetry. In some non-metazoan organ- isms, such as the plant Arabidopsis thaliana, methylation of cytosines outside of CpGs is also observed, but the majority of methylated cytosines are still within the CpG context. Methylated CpGs are prone to sponta- neously de-aminate into thymines, causing CpGs to be the least frequent of all dinucleotides in most animals, as well as the most prone to mutation and genetic variability within populations. The human genome, for instance, has only 28 million Cs in CpGs, or about 1% of all the nucleotides. This is a stark contrast with the product of relative frequencies of Cs and Gs, which is about 4.4% [3]. The distribution of these CpGs in the DNA, however, is not uniform, with certain regions, named CpG islands, having clusters of higher-than- average density of CpGs (yet also often less than 4.4%) [4]. These CpG islands are often part of functional elements of the DNA, such as gene promoters, enhancers and insulator regions. Somatic cell CpGs are predominantly (usually 60-80%) methylated, whereas the methylation of CpGs within CpG islands is much lower (often around 10%), and the methylation state of CpGs is dependent on the cell phenotype [5]. Genetic and epigenetic information varies in stability and dynamics. In one extreme, DNA sequences are identical in almost all cells of an organism. In the other extreme, RNA molecules are constantly being transcribed or destroyed based on cell-to-cell interactions, external environmental factors or the cell’s life cycle. The methylome falls between these two extremes. DNA methylation is arguably the most stable epigenetic mark that correlates with developmental regulation and disease. Despite many enzymes existing with the specialized purpose of precisely replicating methylation during cell division, coordinated methyla- tion changes in specific regions of the DNA are essential for healthy development. Through an organism’s lifespan, passive gain and loss of methylation may occur, potentially leading to observable phenotypes like aged tissues and cancer development. In the next section, we describe how methylation is gained, replicated, maintained and lost. We then 7 describe the various biological processes in vertebrates in which dynamic changes in methylation are nec- essary mechanisms. Understanding these biological processes motivates the development of computational tools to analyze methylomes. 2.1.1 Molecular mechanisms of DNA methylation In mammalians, three enzymes of the DNMT3 family - DNMT3A, DNMT3B and DNMT3L - are respon- sible for de novo methylation of CpGs, whereas proteins from the TET family are responsible for targeted active loss of methylation [6]. 
A gene knockdown of any of these enzymes is lethal in the pre-implantation stage of embryogenesis, indicating that the role of DNA methylation is indispensable for development. In vitro studies suggest that enzymes from the DNMT3 family interact with H3 histones whose fourth lysine (H3K4) are not methylated [7]. These observations indicate that DNA targets of methylation occur through chemical modifications in specific nucleosomes that contain condensed DNA. During cell division, enzymes DNMT1 and its obligate partner, UHRF1, copy the methylation state of CpGs during mitosis. UHRF1 preferentially binds to hemi-methylated CpGs in nascent DNA strands [8], so newly synthesized CpGs are continuously scanned to ensure correct replication of the template strand’s methylome. Loss of methylation occurs either passively or actively. Passive loss can occur through in- action of the DNMT1/UHRF1 enzymes during cell division, leading to unmethylated CpGs in the newly synthesized DNA. Active loss of DNA methylation is thermodynamically a much less favorable reaction, but proteins of the TET family can attain it through successive reactions of oxidation, formilation and car- boxilation of 5-methylcytosines [6]. The roles of active loss of methylation are still not fully understood, but proteins of the TET family are also essential for post-implantation development [9]. An important biochemical property of methylated cytosines is their proneness to mutation. Spontaneous de-amination of pyrimidines in unmethylated cytosines convert them to uracils, which are recognized by mismatch repair enzymes as exogenous bases. Conversely, de-amination of a methylated cytosine converts it to a thymine (T). The resulting T to G base pairing is non-conforming and therefore prone to mismatch repair, but there is an equal probability that a T is repaired to a C or a G is repaired to an A. If the latter occurs, a CpG is permanently mutated into a TpG. This simple model is the most prevalent theory to explain the evolutionary loss of CpGs in mammalian genomes, as well as their preserved co-localization in functional elements that may lead to unviable cellular function if mutated. Mutations in CpGs account, in average, for two-thirds of all point mutations between any two human individuals [10], making them an important part of 8 genome-wide association studies (GWAS), which seek to find mutations that are correlated with phenotypes like physical traits and proneness to genetic disease. 2.1.2 DNA methylation correlates with transcriptional repression With few exceptions, heavily methylated CpGs in vertebrate promoters is associated with transcriptional silencing of the gene, a property that can be validated experimentally by transferring exogenous genes into cultured cells and observing that they are expressed only when their promoters are not methylated [11]. Similarly, methylation through distal regulatory regions, such as enhancers, tends to be associated with inactivity of the region [12]. These gene regulation mechanisms often occur as a consequence of co- localization with nucleosome-bound DNA regions where the twenty-seventh lysine in histone 3 is tri- methylated (H3K27Me3), The limited physical accessibility of DNA in these regions likely leads to tran- scriptomic repression [13]. 
2.1.2 DNA methylation correlates with transcriptional repression

With few exceptions, heavy methylation of CpGs in vertebrate promoters is associated with transcriptional silencing of the gene, a property that can be validated experimentally by transferring exogenous genes into cultured cells and observing that they are expressed only when their promoters are not methylated [11]. Similarly, methylation throughout distal regulatory regions, such as enhancers, tends to be associated with inactivity of the region [12]. These gene regulation mechanisms often occur as a consequence of co-localization with nucleosome-bound DNA regions where the twenty-seventh lysine of histone 3 is tri-methylated (H3K27me3); the limited physical accessibility of DNA in these regions likely leads to transcriptional repression [13]. Conversely, in vitro studies suggest that approximately equal numbers of human transcription factors have positive, negative and neutral affinity for their binding motifs when CpGs within the motif sequence are methylated [14], further suggesting that the correlation between methylation and transcriptional repression is not causal, but rather that methylation correlates with the spatial organization of DNA. It is theorized, however, that the methylome of newly synthesized DNA serves as a template for chromosomal reorganization after cell division. If true, this implies that there is a mutual dependence between chromatin formation and the methylome.

2.1.3 Methylation dynamics in embryogenesis

X chromosome inactivation was the first biological role of DNA methylation uncovered in mammals [15], where the existence of enzymes that target hemi-methylated cytosines in a double strand, specifically present in maternally-inherited X chromosomes, was theorized. Later, it would be discovered that DNA methylation occurs at later stages during X inactivation, with much of the silencing occurring through histone modifications mediated by the X-inactive specific transcript (XIST) gene, with methylation likely playing a role in the maintenance of the inactivated state. It is now also known that the global methylation levels of the active and inactive X chromosomes are largely similar, but certain CpG islands are consistently methylated only in the inactive X chromosome allele [16]. More generally, DNA methylation is a mechanism through which genomic imprinting occurs, with imprinting being established in gametes via DNMT3A in a parent-of-origin-specific manner [17]. At least 150 genes are known to be imprinted via DNA methylation in either sperm or egg cells [18]. Germ cells are as heterogeneous in their methylomes as in the genetic content of haploid chromosome sequences. It is not yet fully understood how the individual-specific methylation landscape of germline cells is related, if at all, to offspring phenotype, but germline methylomes are accurate predictors of embryo viability in humans [19].

Coordinated changes in DNA methylation are part of important steps in healthy mammalian development. There are major waves of methylation change in pre-implantation embryo cells that are necessary for normal differentiation. Mammalian germ cells are predominantly methylated, but global loss of methylation occurs in the zygote upon fertilization, leading to globally unmethylated pre-implantation embryo cells and primordial germ cells. This occurs through inactivity of the DNMT1/UHRF1 enzymes, which is likely a necessary condition to maintain totipotency [20] and to erase parent-of-origin-specific imprints in the developing germline [7]. Somatic cells then gradually regain methylation, attaining specialized biological roles through loss of methylation in phenotype-specific genes [9]. The targeted gain of methylation in stem cells is likely the mechanism through which cells attain specialization, in particular through silencing of pluripotency genes upon differentiation [21].

Methylation of DNA repeats is a blueprint of their evolution and a mechanism through which retroelements are silenced in early development [22]. Retroelements are DNA sequences that create copies of themselves through transcription and subsequent retrotransposition. During the embryonic waves of de novo methylation rewriting, pathway cascades methylate retroelement copies to ensure transcriptomic silencing.
Retroelements are silenced through the piRNA pathways as a mechanism to control self-replication, but active copies with certain motifs escape this methylation wave [23]. Silencing of retroelements plays an important role in genome stability by preventing further retrotransposition activity, which, in some cases, may cause cells to become unviable [24].

Once organisms reach adulthood, their tissues can be accurately distinguished based on CpG methylation alone [25]. This is analogous to classifying tissues based on gene expression. Through the lifespan of mammals, methylation changes through passive gains and losses genome-wide, resulting in less co-localized hypo- and hyper-methylated regions. Surprisingly, this process is not stochastic and seems to differ across cell types in rate of change and likelihood of mutations at specific loci [26].

2.1.4 Clinical applications of DNA methylation analysis

In clinical settings, there is great potential to use DNA methylation as a biomarker for early disease prognostics, as well as a predictor of successful drug response and precision medicine. For cancer detection, early tumors often undergo hypomethylation in cancerous retroelements and genes. Quantification of methylation levels in specific cytosines of bulk tissues allows the early steps of tumorigenesis to be identified prior to observable tumor formation [27].

DNA methylation studies also have important applications for drug development. Targeted changes in methylation are easier to attain than CRISPR-based genetic mutations, so commercial medicines can potentially induce recovery of methylation in target regions [28]. Drugs are developed to add or remove methylation of functional elements, thus "toggling" their repression state. Epigenetic drugs that incorporate genetic sequences targeted by DNMTs into DNA and RNA have been approved by the Food and Drug Administration to treat several hematologic malignancies [29].

Aging treatment and rejuvenation are also promising targets of epigenetic drugs. The methylome of nearly any tissue can be used to accurately predict a person's age, and only about 300 CpGs are necessary for highly accurate predictions to be made [30]. The regression model that assigns an age based on these CpGs is called the Horvath clock. Targeting CpGs that are prone to loss of methylation maintenance with age has shown promising potential to treat age-related diseases like macular degeneration [31, 32]. However, the mechanisms that relate the aged cell phenotype to changes in methylation at these CpGs are not fully understood. It is also not known whether these CpGs are causal to aging or simply correlate with other unobserved epigenetic mechanisms.
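The Horvath clock mentioned above is, at its core, a penalized regression on CpG methylation levels. As a rough sketch of how such an age predictor can be trained (toy data and parameter choices; this is not the published model or its CpG set), an elastic net drives most CpG coefficients to zero, leaving a small informative subset:

# Hypothetical sketch of fitting an epigenetic age predictor in the spirit
# of the Horvath clock. X is an (individuals x CpGs) matrix of methylation
# levels in [0, 1]; y is chronological age in years. Data here are random
# placeholders, so the fitted model is purely illustrative.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(1)
n_samples, n_cpgs = 200, 5000
X = rng.uniform(0, 1, size=(n_samples, n_cpgs))   # methylation levels
y = rng.uniform(0, 90, size=n_samples)            # ages (toy values)

# Elastic net combines L1 and L2 penalties; the L1 part zeroes out most
# CpG coefficients, selecting a small predictive panel of CpGs.
model = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(f"CpGs with nonzero weight: {selected.size}")
print(f"predicted age of first sample: {model.predict(X[:1])[0]:.1f}")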
To assess potential CpGs that correlate with the phenotype, 11 generalized linear models are used, with the phenotype being the outcome and all CpG methylation values being the covariates. The models must account for confounding factors such as gender, age, ethnicity and other genetic and lifestyle traits in order to quantify the statistical significance of the correlation between individual CpGs and the outcome of interest. This is similar to GWAS statistical tests, with the difference that CpG methylation values are continuous between 0 and 1, whereas genotype can only have a discrete set of possible combinations based on possible haplotypes. Longitudinal studies quantify the correlation between changes in methylation and phenotype throughout the lifespan. For instance, during the first five years of life, the global methylation landscape of human blood cells undergoes drastic remodeling toward hypermethylation, especially in genes related to development, such as tissue morphogenesis and neuron development [34]. Longitudinal studies can also demonstrate how lifestyle choices, such as dietary habits or smoking, affect our epigenome through long-term methylome quantification in twins and other individuals with similar habits [35]. Although no longitudinal studies were yet conducted in a controlled manner, it is known that the epigenome-based age estimate (based on the Horvath clock) of tobacco smokers is higher than that of non-smokers [36], and that caloric restriction simultaneously delays aging and decreases the epigenetic age in rhesus monkeys [37]. In the coming decades, we expect to obtain comprehensive methylation data through the entire lifespan of individuals, which can aid in constructing more robust models to associate traits with phenotypes, including the mid-life progression of diseases and the environmental effects that may affect the aging process. 2.2 Bisulfite sequencing Bisulfite sequencing is the gold standard to analyze cytosine methylation [38]. Bisulfite treatment converts an unmethylated cytosine to uracil while not affecting a methylated cytosine. The converted DNA is then amplified in a polymerase chain reaction, which copies uracil as thymine (T). The resulting DNA is used as input for a wide range of sequencing technologies. Bisulfite treatment causes the input DNA to be fragmented, often resulting in fragments of at most a few thousand base pairs. For this reason, bisulfite sequencing libraries are often analyzed through short read sequencing technologies, predominantly Illumina machines. 12 2.2.1 Whole genome bisulfite sequencing (WGBS) Whole genome bisulfite sequencing (WGBS) couples bisulfite treatment with high-throughput sequencing, which generates hundreds of millions of short reads ranging from 50 to 150 bases. WGBS data analysis allows cytosine methylation estimates genome-wide at single-base resolution [39, 40]. Bisulfite conversion leads to information loss in the sequenced DNA, as one of the bases in the observed reads may originate from two possible bases in the unconverted DNA (e.g. a sequenced T can originate from a C or a T in a DNA molecule). The resulting WGBS reads can be T-rich or A-rich, with one letter composing nearly half of the sequenced bases. 
2.2 Bisulfite sequencing

Bisulfite sequencing is the gold standard to analyze cytosine methylation [38]. Bisulfite treatment converts an unmethylated cytosine to uracil while not affecting a methylated cytosine. The converted DNA is then amplified in a polymerase chain reaction, which copies uracil as thymine (T). The resulting DNA is used as input for a wide range of sequencing technologies. Bisulfite treatment causes the input DNA to be fragmented, often resulting in fragments of at most a few thousand base pairs. For this reason, bisulfite sequencing libraries are often analyzed through short read sequencing technologies, predominantly Illumina machines.

2.2.1 Whole genome bisulfite sequencing (WGBS)

Whole genome bisulfite sequencing (WGBS) couples bisulfite treatment with high-throughput sequencing, which generates hundreds of millions of short reads ranging from 50 to 150 bases. WGBS data analysis allows cytosine methylation to be estimated genome-wide at single-base resolution [39, 40]. Bisulfite conversion leads to information loss in the sequenced DNA, as one of the bases in the observed reads may originate from two possible bases in the unconverted DNA (e.g. a sequenced T can originate from a C or a T in a DNA molecule; see the sketch after this section). The resulting WGBS reads can be T-rich or A-rich, with one letter composing nearly half of the sequenced bases. The traditional WGBS protocol, which requires microgram quantities of DNA, generates T-rich reads in single-end sequencing and A-rich reads in the second end of paired-end sequencing, whereas random priming post-bisulfite adapter tagging (RPBAT) protocols generate randomly sampled A-rich and T-rich reads with complementary bisulfite-converted bases in paired ends [41, 42]. RPBAT protocols reduce the DNA degradation caused by bisulfite treatment, allowing WGBS to be applied to nanogram quantities of DNA, often needed for tissues with low cell counts like pre-implantation embryos and circulating tumor cells [43].
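A toy illustration of the information loss described above (a hypothetical simulator, not part of any real pipeline): bisulfite conversion maps every unmethylated C to T, so distinct genomic sequences can yield identical T-rich reads.

# Toy bisulfite conversion: unmethylated Cs read out as Ts, methylated Cs
# stay as Cs. 'methylated' holds the positions (e.g. CpG sites) that carry
# the methyl mark and therefore resist conversion.
def bisulfite_convert(seq: str, methylated: set[int]) -> str:
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )

# Two different sequences become identical after conversion, which is why
# a sequenced T may originate from either a C or a T in the genome.
print(bisulfite_convert("ACGTCCG", methylated=set()))   # ATGTTTG
print(bisulfite_convert("ATGTTTG", methylated=set()))   # ATGTTTG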
Both technolo- gies provide long read single-molecule sequencing, with PacBio attaining reads in the order of magnitude of tens of thousands [46], whereas ONT can sequence up to millions of nucleotides per molecule [47]. The two major advantages of long read-based sequencing are (i) the possibility of detecting long structural variation which cannot be captured by Illumina read lengths and (ii) the more uniform error rate in molecules, unlike Illumina’s proneness to exponentially decreasing sequencing quality. Despite the advantages of long read sequencing, WGBS is still predominantly analyzed through Illumina short read technologies. The first reason for this is cost, with Illumina sequencing still being considerably less expensive per nucleotide than PacBio and ONT. This, however, may not be true in the long term. The second reason is that bisulfite treatment creates small DNA fragments which, even in long read mapping protocols, would not generate large sequence fragments. The third reason is unbiased global representation of a tissue. When a single long read molecule is sequenced, all the methylation information from that read originates from a single cell in the tissue. This means that the epigenome will be reconstructed by as many cells as there are reads in the sample. While PacBio libraries provide only a few thousand reads per library, Illumina sequencers provide hundreds of millions of sequenced reads, all of which come from 14 uniformly sampled cells in the tissue. An Illumina library is therefore much less likely to be biased by individual outlier cells, and the quantified methylome represent the global average methylome of the tissue. For instance, consider a 10000 bp interval in a genome and a 1x coverage sequencing experiment using PacBio sequencing of reads of length 10000 bp. All the information in this region will likely originate from a single molecule, whereas an Illumina experiment with reads of length 100 bp would contain, about 100 reads from 100 different cells in the same region. For the rest of this dissertation, we will focus on the downstream analysis of Illumina-generated WGBS reads to the reference genome. We will henceforth assume that the reads from our sequencing experiment are short, ranging from 50 to 200 bases. These assumptions will be used when developing algorithms for read mapping in Chapters 3, 4 and 5, as many of the parameters, such as k-mer lengths, will be selected based on the assumption that they comprise a sufficiently large fraction of the reads. Assumptions on read length will also be used to decide the number of bits used to store variables like read length, number of mismatches and alignment scores. 2.3 Computational analysis of bisulfite sequencing data The rest of this dissertation will focus on data analysis of WGBS reads. In this section, we describe the usual steps to extract biological information starting with a set of WGBS reads. To attain this, we will assume that a high-quality reference genome for the organism from which the reads originated is available. Broadly speaking, WGBS analysis pipelines are divided in four steps (i) quality control, (ii) read mapping, (iii) quantification of cytosine methylation levels and (iv) downstream analysis. Some quality control metrics for WGBS require reads to be mapped to the reference genome, so (i) and (ii) are intertwined. Downstream analysis involves quantification and statistical tests that depend on the biological question of interest. 
In this section, we dedicate a subsection each to (i) and (iii). Chapters 3, 4 and 5 are entirely dedicated to the intricacies of step (ii). We will finish this section by describing several biologically meaningful downstream analyses that can be applied to a genome whose cytosine methylation levels have been quantified. This will give a broad overview of the state-of-the-art in (iv), which will serve as the main motivation to construct efficient read mapping algorithms, as mapping is the starting point for these biologically meaningful analyses.

2.3.1 Quality control of WGBS reads

The WGBS data analysis workflow is similar to that of other short read high-throughput sequencing data, with some additional analysis steps being necessary due to bisulfite conversion. Sequenced reads undergo quality control evaluations prior to any further computation. These include, among other metrics, assessing whether (i) any individual sequence composes a large fraction of sequenced reads, (ii) adapters were frequently sequenced at the ends of reads, and (iii) the base content at each position is uniform, with the amount of A+G being approximately equal to the amount of C+T in every position. Counting over-represented sequences (criterion (i)) may detect biases in the PCR amplification step, which lead to non-uniform sequencing libraries and will severely bias downstream analysis toward the over-amplified reads. Sequenced Illumina adapters (criterion (ii)) make the problem of mapping a read to the reference genome more difficult. For WGBS reads, most protocols methylate cytosines within adapters prior to adding them to the DNA and applying bisulfite treatment. If adapters are mapped to a genome location, the methylation level of certain cytosines may be overestimated (this problem is addressed in more detail in Chapter 6). Finally, the condition C+T = A+G (criterion (iii)) ensures that the sequencing process is not biased toward specific sequence patterns and motifs. In a high-quality WGBS library, every read has uniform probability of being sequenced from any location in the genome. Therefore, the proportions of base frequencies in any individual position across reads must reflect the underlying base frequencies of the genome. Figure 2.1 compares public WGBS datasets of good and bad quality based on criterion (iii), showing their base composition at each read position.

Analyzing summary statistics on a set of WGBS reads can identify serious problems prior to any time-consuming downstream analysis. However, satisfactory quality control at the read level is only a necessary condition for a WGBS sample to be of high quality. Some statistics are only quantifiable after mapping reads to the reference genome and estimating methylation levels for each cytosine. In what follows, we will temporarily skip the mapping step and assume each read was effectively mapped to its most likely location of origin in the reference genome. We will also assume that, if reads map to more than one location with equal probability, they will be subsequently discarded, so our analysis is restricted to reads that map uniquely. The issue of ambiguity will be addressed in more detail at the end of this chapter, in section 2.4.4.
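The per-position base composition check of criterion (iii) can be implemented directly. Below is a minimal sketch, assuming an uncompressed FASTQ file; it is an illustration only, not the implementation used by falco, and the function name is hypothetical.

    from collections import Counter

    def purine_fraction_by_position(fastq_path, max_len=200):
        # tally base counts at each read position; sequence lines are the
        # second line of every four-line FASTQ record
        counts = [Counter() for _ in range(max_len)]
        with open(fastq_path) as fastq:
            for i, line in enumerate(fastq):
                if i % 4 == 1:
                    for pos, base in enumerate(line.strip()[:max_len]):
                        counts[pos][base] += 1
        # fraction of A+G at each covered position; values far from 0.5
        # indicate a violation of the C+T = A+G criterion
        fractions = []
        for c in counts:
            total = sum(c[b] for b in "ACGT")
            if total == 0:
                break
            fractions.append((c["A"] + c["G"]) / total)
        return fractions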
Figure 2.1: Comparison of WGBS datasets of varying quality based on nucleotide content quantified using falco (https://github.com/smithlabcode/falco). (a) A WGBS dataset where base content matches expectations. Typical high-quality WGBS datasets contain low frequencies of Cs (or Gs) and high frequencies of Ts (or As). The sum of As and Gs must equal the sum of Cs and Ts in every position. (b) An example of a WGBS dataset with content biases on the 5' end of reads. These may be due to sequencing errors or adapter content in the 5' end. (c) A low-quality WGBS dataset, where the A+G content is higher than C+T. This often occurs when bisulfite treatment causes biases in the sequenced nucleotides, making DNA fragments with many cytosines more likely to be damaged and less likely to be sequenced. This is an extremely common phenomenon in current public WGBS data.

2.3.2 Methylation quantification from mapped reads

Once reads are mapped, a methylation level can be assigned to each cytosine in the reference, provided that it was covered by a sufficient number of read bases. The methylation level of each cytosine in the reference genome is a feature taking a value between 0 and 1. These values represent the estimated fraction of cells within a sample that carry the methyl mark at that cytosine. The estimated methylation level is a function of the number of read Cs and Ts that map to reference Cs. For instance, one could estimate each level as the fraction n_C/(n_C + n_T), where n_C and n_T are the number of read nucleotides sequenced as Cs and Ts, respectively, covering a given genomic C. Bayesian methods that use prior knowledge of an organism's expected methylation levels are also commonly used [48], and posterior estimates can be calculated based on the observed number of Cs and Ts. The beta distribution is the most common prior for underlying methylation levels due to its flexibility, small number of parameters, easily quantifiable estimates and the fact that it takes continuous values between 0 and 1. Many of the downstream analysis methods described in the next section model the number of Cs and Ts through the beta-binomial distribution.

Genome-wide methylation summary statistics from mapped reads can provide valuable information about data quality and identify potential biases in the sequencing library. For instance, animal genomes are not expected to have methylated cytosines outside of the CpG context. Methylation of non-CpGs is often used as an estimate of bisulfite conversion success. If too many non-CpGs are not converted, we cannot trust the methylation estimates of CpGs, which will likely be over-estimated. Conversely, de novo methylation by DNMT3 enzymes occurs through an initial transient step of non-CpG methylation [49]. DNA sequences whose cytosines are fully unconverted may provide interesting biological insights. If treated as low-quality unconverted reads, they may underestimate the bisulfite conversion success rate of a sample. Similarly, mitochondrial DNA is known to be devoid of methylation, including in CpGs [50]. If a sufficient number of reads map to the mitochondrial chromosome, the fraction of cytosines within these reads can also be used as an estimate of the bisulfite conversion success rate.
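Both estimators above are simple to state explicitly. The following is a minimal sketch under the assumptions just described: the posterior mean of the methylation level under a Beta(alpha, beta) prior, which approaches the fraction n_C/(n_C + n_T) as the prior becomes uninformative, and a conversion rate estimated from non-CpG cytosines. The function names are illustrative, not from any specific tool.

    def methylation_level(n_C, n_T, alpha=1.0, beta=1.0):
        # posterior mean under a Beta(alpha, beta) prior on the methylation
        # level; as alpha and beta approach 0 this recovers n_C / (n_C + n_T)
        return (n_C + alpha) / (n_C + n_T + alpha + beta)

    def conversion_rate(non_cpg_C, non_cpg_T):
        # fraction of non-CpG cytosines sequenced as Ts; values near 1.0
        # indicate successful bisulfite conversion in animal genomes
        return non_cpg_T / (non_cpg_C + non_cpg_T)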
However, the human mitochondrial DNA is much smaller than any nuclear chromosome. The current mitochondrial assembly contains about 16,000 nucleotides, which is orders of magnitude smaller than the total human genome size (3 billion bases). The number of reads that map to mitochondrial DNA is often small enough that high variance in bisulfite conversion success estimates may occur simply due to the small sample size.

It is also possible to estimate bisulfite conversion through additional experimental steps. Exogenous DNA spike-ins can be mixed with the biological DNA, and the expected spike-in sequences can be added as additional chromosomes to the reference genome. A sufficiently large set of synthetic spike-in chromosomes circumvents the short length of mitochondrial DNA. The limitation of spike-in DNA is the potential for bisulfite conversion to not be uniform across all possible sequence motifs. Since spike-in DNA must be sufficiently different from any human sequence to map uniquely to its reference, the bisulfite treatment may affect human and spike-in sequence patterns differently. Furthermore, the additional effort required to sequence spike-in reads may lower the resulting coverage attained for the reference genome. Ultimately, the criteria to estimate bisulfite conversion success are largely dependent on the experiment, but non-CpG methylation estimates are the most common approach in metazoan genomes.

2.3.3 Inference of methylation patterns within samples

The estimated methylation values for each cytosine can be used for in silico analysis both within a sample and between samples (e.g., biological replicates and case-control comparisons). The most common analysis step within samples is detecting contiguous genomic regions with a certain methylation pattern. Generally, and with few exceptions, spatially adjacent CpGs tend to correlate in methylation. Clustering these CpGs based on methylation patterns and spatial proximity may provide biologically relevant information about the sample's epigenomic landscape. Examples of within-sample analyses include the inference of hypomethylated regions (HMRs), partially methylated domains (PMDs), allele-specifically methylated regions (AMRs) and methylation entropy. Figure 2.2 shows a typical visualization of CpG methylation levels in a chromosome region, with computationally inferred HMRs and PMDs displayed along with the methylation levels.

HMRs are contiguous regions in the DNA where methylation levels of CpGs are low. Identification of these regions stems from known properties of CpG methylation in metazoan genomes [25]. In particular, CpG islands tend to have low methylation, whereas CpG-poor genomic regions often have fully methylated CpGs. The exceptions to these general rules are often interesting biological observations. For instance, regions distal to transcription start sites in mouse genomes display consistently low CpG methylation of about 30%. These HMRs are evolutionarily conserved and significantly overlap with DNase-I hypersensitivity sites, which often contain histone marks associated with enhancers and insulators [12, 51]. Separation of genomic regions based on bins of similar methylation levels is often attained by modeling methylation with a multiple-state Hidden Markov Model. The hidden states are the methylation level regions (e.g., 0% for no methylation, 30% for low methylation and 100% for high methylation). The classification of each CpG is based on the likelihood of observing the CpG's methylation level given each possible state and the transition probabilities.
These probability values can be estimated from the data by maximum likelihood using the Baum-Welch algorithm [52].

PMDs are hallmarks of immortalized cell lines, but surprisingly also characterize most of the genome of cancer cells and the placenta [53]. PMDs are defined as megabase-scale genome regions where methylation levels are consistently at about 50%. They generally reside in gene-sparse domains that replicate late during cell division. PMDs can be experimentally produced by addition of the Epstein-Barr Virus to cell lines and removed by inducing pluripotency with Yamanaka factors [54]. In immortalized and cancer cells, PMDs originate through passive loss of methylation in late-replicating regions. The mechanisms through which they originate in vivo are not yet fully understood, but they are known to correlate with particular histone marks like H3K36me3 [53], suggesting potential binding of proteins that recognize this mark.

Allele-specific methylated regions (AMRs) are regions with differing methylation states in the two copies of diploid chromosomes. They often originate from imprinted epigenomes of parental germline cells [55]. AMRs can be identified de novo akin to heterozygous genotyping through SNP identification, but the likelihood model must account for the continuous underlying methylation levels, which go beyond the three possible genotypes (i.e., homozygous matching reference, heterozygous, or homozygous not matching reference). AMR identification is a likelihood ratio test that compares the total data likelihood of two possible methylation states versus one. The best model is selected using the Bayesian information criterion.

Methylation entropy is a local measure of heterogeneity in neighboring CpGs. It is analogous to the concept of entropy of a probability distribution. Loosely speaking, if a set of n adjacent CpGs have identical methylation states in every read, the entropy is zero. Conversely, if every one of the 2^n possible methylation combinations is seen with equal probability, the methylation entropy attains its maximum value of n. Most regions in the genome have consistent methylation entropy across phenotypes, but entropy is significantly higher in certain tumors like primary ependymomas. Furthermore, tumor aggressiveness is correlated with higher global methylation entropy genome-wide [56]. Another notable property of entropy is its high variability within copies of certain retroelements. Within elements of the Alu family in primates, copies with more GC content but fewer CpGs are associated with higher methylation entropy, possibly due to greater genome stability that is less likely to be disrupted by methylation or mutation of the remaining CpGs. Therefore, identifying regions of high methylation entropy may yield insights into genomic regions that are prone to transcriptomic instability and subsequent disease progression.
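Methylation entropy has a direct formulation as the Shannon entropy of the observed distribution of methylation patterns ("epialleles") across reads covering n adjacent CpGs. A minimal sketch, assuming each epiallele is represented as a string of C (methylated) and T (unmethylated) calls:

    import math
    from collections import Counter

    def methylation_entropy(epialleles):
        # Shannon entropy (in bits) of the epiallele distribution; 0 when all
        # reads share one pattern, n when all 2^n patterns are equally likely
        counts = Counter(epialleles)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values())

    methylation_entropy(["CC", "CC", "CC", "CC"])  # 0.0
    methylation_entropy(["CC", "CT", "TC", "TT"])  # 2.0, the maximum for n = 2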
Figure 2.2: UCSC genome browser (http://genome.ucsc.edu) view of CpG methylation levels in a two megabase interval of chromosome 10 (positions 131,905,980 to 133,676,692). Samples were processed from the Roadmap epigenomics project [5]. Depicted are somatic cells (liver, fetal heart, mesenchymal, natural killer cells and placenta) and the IMR90 cell line of immortalized fibroblasts. The methylation of each CpG is depicted by the height of vertical beige lines. Blue bars represent hypo-methylated regions (HMRs) and green bars represent partially methylated domains (PMDs) in the placenta methylome.

2.3.4 DNA methylation comparison between samples

The patterns described in the previous section are informative in isolation, but differences in these patterns across controlled experimental conditions may provide insights into the possible mechanisms causing observable differences. Differential tests are possibly the most common tool for unsupervised inference of significant molecular drivers of phenotype [57]. Differential tests compare quantifiable traits genome-wide between two or more conditions. When only two conditions are compared, they are often referred to as "case" (where an experimental treatment is applied) and "control" (where no treatment is applied). Statistical tests are applied to each trait to assess if the null hypothesis that the traits are identical in case and control can be rejected. Traits with low probability under the null (often corrected for multiple testing) are reported along with the estimated fold change of the trait.

The same workflow applies for WGBS. For a given annotated region of interest (e.g., an individual CpG, a CpG island or a repeat copy), hypothesis tests compare differential average methylation [48]. Unlike testing for differential expression or SNPs, the features to compare in methylomes may be less obvious. For example, differential tests in HMRs can be used to identify regions that expanded, contracted or shifted in genomic coordinates. However, each sample has its unique set of HMRs, which may not be comparable across samples. An HMR in sample 1 may contain multiple smaller HMRs in sample 2. If a single HMR in sample 2 was compared in size with its containing HMR in sample 1, the difference may be significant, but if every smaller HMR in sample 2 was treated as a single feature, the null hypothesis of no expansion may not be rejected because of insufficient data between HMRs. In other words, the lack of one-to-one correspondence between features in different samples reduces the power of differential tests in these settings.

2.4 Challenges in whole genome bisulfite sequencing data analysis

WGBS libraries provide accurate reads that are uniformly sampled from the genome, providing enormous opportunity for biological discoveries associated with the epigenome. However, there are theoretical and practical limitations on what is possible to quantify using WGBS. Below we briefly describe these limitations. For some of these, we will outline solutions, whereas others will be left as open computational problems for the short-term future.

2.4.1 High volumes of sequencing data

To attain sufficient coverage in larger mammalian genomes, hundreds of millions of sequences must be produced per sample in an experiment. The large number of sequences to be analyzed requires a large amount of computational effort. This holds true for most high-throughput sequencing technologies, but bisulfite sequencing poses particular challenges due to the severe loss in nucleotide entropy caused by bisulfite conversion. Solving this problem requires efficient computational methods to convert a set of short reads into quantifiable methylation levels. As we will discuss in more detail in Chapter 3, great progress has been attained in developing highly efficient software for short read mapping of DNA. These methods, however, are not equally effective for WGBS read mapping due to the uneven distribution of nucleotides in bisulfite sequencing reads.
Besides the challenges of accounting for bisulfite bases, short read mapping of WGBS reads must also address sequence differences to the reference in the form of mismatches, indels and sequenced adapters. Bisulfite treatment is also a major contributor of sequence differences, often due to the nicks and abasic sites it produces in the isolated DNA [58].

2.4.2 WGBS analysis often requires genome binning

A major challenge in WGBS is the great level of "quantization" of individual CpG methylation estimates. It is often the case that WGBS does not provide sufficient information about an individual CpG, and most methylation level estimates are fractions of small numbers. For this reason, most of the identifiable patterns described in the previous section require splitting the genome into bins, assuming the property that neighboring CpGs are correlated and using information from every CpG within each bin to summarize the methylation level of that genomic region. The choice of bin size is a trade-off in resolution. If bins are too small, most of them will not have enough data. If bins are too large, local signals may disappear and resolution may be lost. Furthermore, the non-uniform distribution of CpGs in the genome may cause both problems to occur simultaneously. Many algorithms for pattern discovery require dynamic bin selection, that is, finding the smallest bin size for which every bin contains sufficient information [53]. Algorithms may also integrate continuous-time Markov chains into the Hidden Markov Model parametrization of the genome. The distance between CpGs can be treated as a continuous parameter and incorporated into models that estimate transition probabilities.

2.4.3 Methylome features depend on the reference genome

Reference genomes represented as sequences of characters cannot capture the full variability of individuals, even if the sequence is of perfect quality. Because CpG locations vary so frequently between individuals, the set of CpGs of the static reference genome will not be the same as the set of CpGs in the individual. This is problematic even assuming a perfect reference, and reference "perfection" in mammalian organisms is also far from the truth in practical settings. As it stands, 70% of the human reference genome was based on a single individual, and nearly the entire genome is based on six individuals [59]. This means that the existing CpGs used in most studies are limited to this individual's CpGs. When mapping reads to the reference genome, read CpGs that align to other dinucleotides are treated as sequencing errors, so any sample CpG that does not exist in the reference is lost when quantifying methylation from mapped reads. It has been shown that creating a consensus genome, where each nucleotide is the most common allele in a given population, improves sensitivity of RNA-Seq read mapping [60]. The same may be true for WGBS.

Discussions on how to represent reference genomes to encompass all possible genotypes are ongoing. The most widely accepted solution is the use of pan-genome graphs [61]. In this type of representation, genomes are not single sequences, but rather directed graphs. In these graphs, nodes are sequences; there exists a node without incoming edges (the starting node) and nodes without outgoing edges (terminal nodes). The graph is such that any path from the starting node to a terminal node represents a possible individual genome.
This abstraction allows the representation of much more complex structural variation than individual mutations, such as variable number tandem repeats and individual-specific occurrences of SINEs/VNTRs/Alus. Although these data structures allow expansion of genome sequences such that a small enough graph is still representative of the most common genetic variations within a population, they still represent a static subset of possible sequences, and problems relating to aligning a new unobserved genome to it and treating variability as errors still exist, albeit to a lesser degree. CpGs are particularly challenging to represent because their variability goes beyond large structural variation, and it may not even be possible to create a non-degenerate graph that is sufficiently close to every possible set of observable CpGs in any population. The ideal solution, discussed in more detail in Chapter 6, is to allow read CpGs to map everywhere in the genome and adjust scoring schemes accordingly so as not to penalize CpG mismatches. Then, at the methylation quantification step, nucleotide positions that mutated into CpGs can be identified if they contain sufficient coverage in both strands to confidently identify the mutation. Another solution is to allow the reference genome to change during mapping. After reads are mapped to the available genome, with differences being treated as errors, locations with frequent mutations in both strands can be changed, and reads can be re-mapped to this corrected genome, as reads can have different optimal mapping locations in this new genome. This can be repeated until convergence.

2.4.4 Challenges in mapping reads to repetitive DNA

We previously outlined the importance of quantifying DNA methylation in retroelements. Many repeat families span large portions of the genome. A prominent example is the long interspersed nuclear element (LINE) family of retrotransposons. The LINE1 (L1) retroelement in this family comprises 17% of the human genome. Typical L1 elements contain about 6,000 bases, and many have multiple identical copies in the genome. Illumina sequencers can often produce a maximum insert size of 600 bases, which is much shorter than the L1 length. Thus, if a short WGBS read from an Illumina sequencer is fully within an L1 with at least one identical copy, it often cannot be uniquely assigned to an individual element, and methylation of this element cannot be studied, thus posing major limitations in understanding the biology of certain regions. About 15% of the human genome cannot be uniquely mapped with single-end reads of length 100, and although this is somewhat remedied with paired-end mapping, no study has ever attained full coverage of the human genome.

Although ambiguous reads are routinely discarded in downstream analysis, it is possible, in certain cases, to estimate methylation levels of these regions based on reads that map ambiguously. Consider two copies of an unmappable region in the genome to which a set of reads map ambiguously. If all CpGs in these two copies are methylated, reads originating from these copies are expected to have all their CpGs sequenced as Cs. In this case, even if we do not know where the reads originate from, we can infer that both copies are likely methylated. Conversely, if half of the reads have CpGs sequenced as Cs and the other half have CpGs sequenced as TpGs, we cannot distinguish whether the two copies have different methylation (e.g.,
100% in one copy and 0% in the other), or whether both have 50% methylation, among other possibilities.

2.5 Downstream analysis depends on accurate read mapping

Methylation quantification, and consequently every subsequent downstream analysis in WGBS, depends on accurately mapping a read to the reference genome. It is therefore of utmost importance that, for each read, the reference subsequence with greatest similarity to the read is found. This is a challenging problem for WGS data, and many difficulties are aggravated by bisulfite conversion. In the next chapter, we define the read mapping problem in more detail and address how the uneven nucleotide frequency caused by bisulfite conversion makes the read mapping problem challenging. In simple terms, nearly half of sequenced WGBS bases are the same letter (either a T or an A), causing the entropy of the letter distribution to be low. This means that many k-mers are repeatedly sampled, and these k-mers also appear often in the genome (reads and reference have approximately the same relative k-mer frequencies). The next chapter will describe this problem more formally. We will then quantify how many comparisons are necessary to map a read to the reference genome. Using this framework, we introduce a binary encoding that has both theoretical and practical advantages for WGBS read mapping.

Chapter 3

Mapping bisulfite sequencing reads using a two-letter alphabet

In this chapter, we introduce another bisulfite mapping algorithm (abismal), a fast and accurate read mapping method based on filtration on a two-letter alphabet. Before describing the algorithm, we will define the read mapping problem and explain the limitations of current WGBS mapping software tools. In particular, we will show that the uneven frequency of nucleotides caused by bisulfite conversion leads to a high number of false-positive alignments necessary to find the optimal mapping location of a read. The main purpose of this chapter is to show that the seemingly problematic property of WGBS nucleotide frequencies can be converted into an efficient read mapping strategy.

Loosely speaking, the read mapping problem is defined as follows. We are given a read r of length n, a genome G of length m ≫ n, as well as a scoring scheme for the alignment of two strings. We seek to find a substring g of G that maximizes the local alignment score between g and r. The WGBS read mapping problem must further allow one of the read letters (either a T or an A) to match two possible genome letters (a T matching a C or a T, or an A matching a G or an A). In the next section, we will devise a more rigorous definition of this problem.

3.1 Definition of the read mapping problem

A string T is a sequence of letters in an alphabet Σ (in this dissertation Σ = {A, C, G, T}). The string length is |T|. The characters of T are denoted by T_0, ..., T_{|T|−1}. The substring of T starting at position i and ending at position j (and including both i and j) is T[i, j].

3.1.1 Global and local alignments

Global alignment of two strings. The global alignment of a template string T to a query string Q is the process of replacing, inserting or deleting characters in Q to make it identical to T. An alignment requires a scoring scheme, which is a set of four numbers (M, X, I, D), with M, X, I, D ≥ 0. The score of an alignment with x matches, y replacements of characters (mismatches), i insertions and d deletions is

s(T, Q) = xM − yX − iI − dD.

Many alignments exist to convert Q to T.
We are almost always interested in the one that maximizes the score. The maximum score alignment (and its optimal score) can be constructed using the Needleman-Wunsch algorithm in O(|T||Q|) time and memory [62]. Throughout this dissertation, we will use the term "alignment" between two strings to mean the maximum scoring alignment. Note that there are cases in which many alignments yield the same maximum score. In these cases, we choose the one with the fewest operations. Although theoretically possible, alignment ties in the short read mapping context occur very infrequently.

Local alignment of two strings. Of particular interest in the context of read mapping is the concept of local alignment, where we are interested in the substrings T[i_t, j_t] and Q[i_q, j_q] whose global alignment score is maximum. Local alignment methods were first devised to identify conserved subsequences in homologous proteins. Biologically, subsequence similarity is likely related to preservation of function. Local alignment is also the predominant strategy to map reads to a reference genome. The local alignment of T and Q can also be constructed in O(|T||Q|) time and memory using the Smith-Waterman algorithm [63].

We note that the scoring scheme s defined above is very "simple", meaning that we do not penalize different types of mismatches (e.g., transitions or transversions) differently, and we do not penalize gaps (i.e., consecutive runs of insertions or deletions) differently based on their size. In read mapping software tools, more complex alignment scoring schemes are used in application-specific manners. For the purposes of studying the complexity of read mapping algorithms, this simplified formulation suffices. In this dissertation, we will focus predominantly on the number of alignments needed to map a read as opposed to the complexity of constructing an individual alignment. Efficient implementations of the Smith-Waterman algorithm with affine gap penalties are discussed in Chapter 6.

The Smith-Waterman algorithm uses dynamic programming to construct the alignment of T and Q based on the alignment of their prefixes. For i, j ≥ 0, the algorithm fills a dynamic programming matrix A of dimensions (|T|+1) × (|Q|+1). The element A_{i,j} is the optimal local alignment score of prefix T[0, i−1] with prefix Q[0, j−1]. The matrix A can be filled using the edge cases A_{0,i} = A_{j,0} = 0 and, for i, j > 0, the relationship

A_{i,j} = max{ A_{i−1,j−1} + t(T_{i−1}, Q_{j−1}),  A_{i−1,j} − D,  A_{i,j−1} − I,  0 },    (3.1)

where t(a, b) = +M if a = b and −X otherwise. Indices j_t and j_q are given by the indices of the maximum element of A, and indices i_t and i_q can be inferred by keeping a traceback, for every row i and column j, of the element that attained the maximum in equation (3.1). Indices j_t and j_q are decremented based on matches, insertions and deletions until the first zero element in A is found.

The complexity O(|T||Q|) of constructing the local alignment is prohibitive in many practical applications, such as database searches or mapping long reads. Accelerating alignments is still an active field of research, and many optimizations can be attained either by imposing specific alignment scores [64], incorporating heuristics such as maximum gap size [65] or using hardware-specific instructions to fill several cells of the dynamic programming matrix in parallel [66, 67, 68].
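Equation (3.1) translates directly into a few lines of code. The following is a minimal, unoptimized sketch of the Smith-Waterman recurrence under the simple scoring scheme (M, X, I, D); it returns only the optimal local alignment score and omits the traceback.

    def smith_waterman_score(T, Q, M=1, X=1, I=1, D=1):
        # fill the (|T|+1) x (|Q|+1) matrix A of equation (3.1)
        A = [[0] * (len(Q) + 1) for _ in range(len(T) + 1)]
        best = 0
        for i in range(1, len(T) + 1):
            for j in range(1, len(Q) + 1):
                t = M if T[i - 1] == Q[j - 1] else -X
                A[i][j] = max(A[i - 1][j - 1] + t,  # match or mismatch
                              A[i - 1][j] - D,      # advance in T only (gap)
                              A[i][j - 1] - I,      # advance in Q only (gap)
                              0)                    # restart the local alignment
                best = max(best, A[i][j])
        return best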
3.1.2 Read mapping: filtration and alignment of reads to large genomes

The problem of mapping a read r to the reference genome G is the problem of finding the local alignment of r and G. Defined slightly differently, mapping r to G means finding indices i_G, j_G, with 0 ≤ i_G, j_G < |G|, such that the local alignment score of r and G[i_G, j_G] is maximized. We will often refer to the optimal mapping location or optimal mapping position in the genome as the index i_G for read r. The optimal mapping location is interpreted as the most likely location in the reference genome from which r originates, and can be attained by tracing back the highest scoring cell in the dynamic programming matrix A. If the maximum score is attained in more than one cell, the read maps ambiguously to more than one genome location.

The typically large size of G and the large number of reads r to be mapped in traditional Illumina datasets make this implementation prohibitively slow, and it is never used in practice. Instead, mapping algorithms filter a set of substrings of size approximately |r| from G and align r to these substrings, keeping the one(s) with maximum alignment score and reporting them if their score is above a defined cut-off. This removes the factor of |G| in the mapping complexity, making the amount of computation to map a read practically feasible. The process of selecting these substrings is called filtration, and the substrings selected from G are called hits or candidates. Extensive research in read mapping algorithms focuses on data structures used to attain efficient filtration in terms of time, memory and accuracy, and this will also be the overarching goal of this entire dissertation.

Aligning reads using filtration often implies that we lose the guarantee of finding the optimal alignment for arbitrary reads. In practice, we are only interested in mapping reads that attain a sufficient minimum degree of similarity to their optimal match. Reads for which this property does not hold are often assumed to have too many sequencing errors or to originate from contamination by DNA from other organisms. Neither of these types of reads is useful for downstream analyses.

The filtration step in read mapping reduces alignments to a small set of reference subsequences that are likely, in some sense, to be optimal candidates. Most filtration methods assume that the read and its optimal mapping location contain a common subsequence of minimum size. A read is mapped by selecting a set of subsequences from it, then aligning the read to all locations in the reference that match a selected subsequence. We call the subsequences selected from reads for filtration seeds. The algorithmic efficiency of both filtration and comparison between the read and each candidate together determines the speed of a mapping algorithm. Data structures used in practice for filtration are often able to query exact matches in G efficiently, either in O(|r|) time (e.g., using direct hashing), in O(|r| log |G|) time (e.g., through successive binary searches of genome suffix arrays), or a combination of both.

Reads can originate from either of two complementary genome strands. The most common way to map a read to both strands is to map the sequence and its reverse-complement to one strand. This strategy bypasses the need to have information on both strands of the genome available in-memory at the same time. However, for WGBS data, reverse-complementing a T-rich read makes it A-rich, so any filtration method that stores only one strand of the reference genome must allow filtration under the assumption of both T-rich and A-rich reads.
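To make the filtration step concrete, the following is a minimal sketch of seed-based filtration using direct hashing: a table maps every k-mer in G to its positions, and the candidates for a read are the genome positions implied by its exactly matching seeds. Real mappers use far more compact data structures, but the logic is the same; the function names are illustrative.

    from collections import defaultdict

    def build_index(G, k):
        # map every k-mer of the genome to the list of positions where it occurs
        index = defaultdict(list)
        for i in range(len(G) - k + 1):
            index[G[i:i + k]].append(i)
        return index

    def candidate_positions(read, index, k):
        # every hit of every k-mer seed in the read, shifted so that the
        # candidate position corresponds to the start of the read
        hits = set()
        for offset in range(len(read) - k + 1):
            for pos in index.get(read[offset:offset + k], ()):
                hits.add(pos - offset)
        return hits  # each candidate is then scored by local alignment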
However, for WGBS data, reverse-complementing a T-rich read makes it A-rich, so any filtration method that stores only one strand of the reference genome must allow filtration under assumption of both T-rich and A-rich reads. 3.1.3 Filtration algorithms for short read mapping Mapping algorithms must be sensitive to differences between read and reference sequences, which may originate from, among other sources, sequencing errors or true genetic variation. The large number of reads 30 necessary to attain sufficient coverage in large genomes requires mapping algorithms to be simultaneously fast and sensitive to various sequence differences between read and reference. To attain this, filtration steps must be efficient in reducing the number of alignments while maximizing the probability that the optimal candidate is not discarded. The two most common filtration approaches are (i) maximum exact matches and (ii) selecting fixed seed patterns from reads. Maximum exact matches A maximum exact match (MEM) is the largest substring in r that matches some substring ofG exactly. The goal of filtration through MEMs is maximizing specificity and minimizing the number of comparisons necessary to map a read. If a substring of lengthk ≤ n is selected from string over an alphabet Σ , the number of hits decreases proportionally to|Σ | k . The exponential decrease often reduces the number of alignments significantly. Alignments can also be accelerated using MEMs through the seed-and-extend paradigm [69]. Ifr[i,j] is the MEM forr, then only the prefix r[0,i− 1] and the suffix r[j +1,n− 1] need to be aligned, as the MEM is, by definition, a series of matches. Larger MEMs result in fewer alignment cells to be filled. Conversely, a common consequence of using very large exact matches is lower sensitivity. If a substring r[i,j] matches a set of locations in G but the character r j+1 is an error relative to its true mapping location, then extending the MEM to r[i,j +1] will only select false-positive hits and discard the true-positive hit. False-positive hits in this context can often happen in genome repeats, where a single mismatch may, by chance, match another repeat copy exactly, even if the rest of the sequence outside the MEM is very different from the matching repeat copy. It is often more appropriate to stop extending the MEM once the number of hits is below a specified cut-off. This attains a trade-off between sensitivity and speed. Filtration using fixed seed patterns As an alternative to MEMs, selecting fixed seed patterns involves using a fixed procedure to select subsequences from r, finding the candidate locations in G that match the subsequences exactly, then mapping r to the candidates retrieved from these locations. The procedure to select seeds from r creates a trade-off between sensitivity and specificity. If seeds are too large, higher specificity is attained but the optimal candidate may be lost. If seeds are too small, finding the optimal candidate may require a very large number of alignments. Seeds can be designed around the maximum amount of difference accepted between the read and its optimal alignment location. For instance, if no indels and at mostx mismatches are allowed, the readr can be partitioned intox+1 substrings of length ⌊ n x+1 ⌋. If an acceptable candidate exists, then the candidate shares an exact match to one of the substrings 31 by the pigeonhole principle. Comparingr to every exact match of all thesex+1 substrings guarantees the best match to be found. 
This partitioning is the foundation of the Baeza-Yates-Perleberg (BYP) algorithm [70]. Fixed seeds can also reduce the number of dynamic programming cells to be filled through the seed-and-extend paradigm, but the reduction is often less effective than what is attained with MEMs.

Under the assumption of no indels, the number of comparisons can be further reduced by using spaced seeds, which motivates the seed design problem [71, 72]. Given a maximum number of mismatches x, an optimal seed set is a set of subsequences from r such that, for any of the \binom{n}{x} possible combinations of mismatch locations, there exists at least one seed that does not contain any of the mismatch positions. We are interested in constructing an optimal seed set where the average seed length is maximized. Seed design algorithms are decoupled from read mapping. Designing a set of seeds only requires n and x as input and can be pre-computed a priori assuming a maximum length for a short input read and a maximum acceptable number of mismatches.

The assumptions of fixed seeds and MEMs fall short when indels are allowed in reads. While it is still reasonable to assume that a read contains a subsequence that matches its optimal mapping location exactly, the construction of seed patterns that guarantee exact matches with x edits (mismatches or indels) is a much more complex problem for which little progress has been attained. The most common strategy in read mapping is using every k-mer in the read and focusing on highly efficient implementations of local alignments. In the following sections, we will describe the construction of a read mapping algorithm. In order to attain sensitivity to mismatches and indels, we will also follow the paradigm of selecting every k-mer and aligning the read to every location that matches the k-mer exactly in the genome. This means we are restricted to mapping reads where the optimal alignment contains an exact match of minimum size k. This is a reasonable assumption for sufficiently small k. If the best alignment for a read does not meet this criterion, too many edits are necessary to align it, implying that the read does not originate from the reference genome to which it is being mapped. Reads from non-reference organisms often originate from contamination and can be identified through "dumpster-diving" methods that search for unmapped reads in large databases [73]. For our applications, unmapped reads will be summarized as part of mapping accuracy summary statistics and subsequently discarded in downstream analyses.

Complexity of the read mapping algorithm using fixed seeds. An important property of read mapping through fixed seeds is that the worst case complexity of aligning a read to a set of candidates retrieved from exact matches can be O(mn^2). A simple construction of a worst case is setting G = AAAAAAA··· as a sequence of As of size m, r = AAAAAT and k = 5. Consider the first k-mer from r (that is, AAAAA). By selecting exact match locations in G, we will retrieve almost every position, all of which must be subsequently compared to r. We cannot stop mapping because we do not know a priori if any of the candidates will be an exact match to r, so we must align r to m − k + 1 substrings of G. This artificial case is obviously not reflective of real complex genomes, but most genome sequences contain large substrings composed mostly of a single letter. Selecting seeds that match substrings of repetitive patterns may lead to a large number of hits.
The most common way to handle these seeds is to set an upper bound c on the maximum number of acceptable hits. If a seed retrieves more than c candidates, then the seed is ignored and other seeds are used instead. This means that the optimal alignment is no longer guaranteed to be found. In the BYP algorithm, for instance, the guarantee requires us to compare the read to every exact match of each substring used as a seed. We will address the issue of minimizing the number of false-positive hits without losing the guarantee of mapping reads with a minimum exact match length in Chapter 5. Until then, we will focus on minimizing the average time complexity of mapping reads using seeds. By the law of large numbers, if the number of reads to be mapped is sufficiently high, the total time required to map a set of reads is dictated by the average number of hits per seed. In the next sections, we will describe how to calculate the average time complexity as a function of k-mer frequencies in G and the number of bits available to encode k-mers numerically. We will use this as a basis to compare the efficiency of different filtration strategies.

3.1.4 Filtration of bisulfite-converted reads

Bisulfite conversion makes the problem of mapping WGBS reads different from the traditional DNA read mapping problem. WGBS reads are either T-rich (where most Cs are sequenced as Ts) or A-rich, with the latter occurring when the complementary strand of a converted read is sequenced. We refer to Ts in T-rich reads and As in A-rich reads as bisulfite bases, signifying that they can map to two possible reference bases. The reverse-complement of a T-rich read is A-rich, so mapping a T-rich read to the complementary strand is equivalent to mapping an A-rich read to the original strand. Multiple four-letter sequences can generate a given T-rich or A-rich sequence once converted, so WGBS mappers must allow both bisulfite bases to each match two bases in the reference to quantify the similarity between a read and a reference sequence. In other words, any data structure used in filtration for WGBS read mapping must treat bisulfite bases in reads as "wildcards" that can match two possible letters in the genome.

In the following section, we will analyze the effect of bisulfite treatment on letter frequencies, as well as its effect on filtration efficiency when small k-mers are used for filtration based on exact wildcard matches to the genome. In broad terms, we show that the uneven letter frequency leads to less efficient filtration and longer mapping times. We then provide an alternative encoding of read and reference letters that recovers uniformity of letter frequencies, leading to more efficient filtration and empirically faster mapping times, among other practical advantages.

3.2 Algorithms to map bisulfite-converted reads

Any algorithm constructed for WGS reads can be easily adapted to map WGBS reads through the following procedure:

1. Create two copies of G: a copy G_T where all Cs are converted to Ts and a copy G_A where all Gs are converted to As.

2. Similarly, to map a read r, create two copies r_T and r_A following the same procedure. Also construct r'_T and r'_A as the reverse-complements of r_T and r_A, respectively.

3. If the read is T-rich, align r_T and r'_A to G_T using the existing read mapping algorithm. If the read is A-rich, align r_A and r'_T to G_A. If the bisulfite base is unknown, try both possibilities.
4. Report the location with the best alignment score among all converted reads mapped in the previous step.

Several software tools exist to map bisulfite-converted reads to a reference genome. Most tools leverage highly efficient WGS mapping software and use the aforementioned procedure, but other algorithms were also constructed to account for the specific properties of bisulfite conversion. In what follows, we will describe the most commonly used WGBS mapping software tools and provide a surface-level description of their algorithmic steps. The software tools described here will later be used as benchmarks to compare the performance of the abismal algorithm proposed in this chapter.

3.2.1 State-of-the-art mappers for bisulfite-converted reads

WGBS mappers can be divided into two categories: (i) wrappers of short unconverted DNA read mapping algorithms adapted for bisulfite sequencing and (ii) mappers designed specifically for bisulfite-converted reads. Tools within category (i) operate by creating two copies of the reference genome, one that converts all Cs to Ts and one that converts all Gs to As, then mapping the input reads by also replacing any unconverted read Cs with Ts. Bismark [74] and BWA-meth [75] are two commonly used tools that adopt this approach. Bismark is a wrapper for Bowtie 2 [76], whereas BWA-meth wraps the BWA-MEM program [77]. Both underlying aligners compress the genome through its Burrows-Wheeler Transform [78], described in greater detail in Chapter 5. The main difference between the two algorithms is that BWA-MEM filters reads through the MEM procedure, whereas Bowtie 2 selects seeds of size 22 at regularly spaced intervals from the read. The number of seeds selected by Bowtie 2 is proportional to the square root of the read length.

Tools within category (ii) often incorporate filtration and alignment that account for bisulfite conversion in their implementation, which allows for potential algorithmic optimizations that may result in faster and more memory-efficient mapping. BSMAP [79] and WALT [80] are mappers designed specifically for WGBS data. These two mappers implement similar algorithms: a direct address table of reference subsequences is constructed using the same letter encoding for bisulfite bases, and subsequences are selected from reads to retrieve exact matches using the table as a filtration step, also reducing bisulfite bases to the same letter. BSMAP indexes contiguous reference sequences of size 16 in the direct address table and maps reads by also partitioning them into k-mers of size 16. WALT encodes periodically spaced subsequences spanning most of the genome. Spaced seeds are constructed so that every sequence within two mismatches is retrieved. WALT only calculates the Hamming distance between the read and each candidate, so it does not report alignments with indels.
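The in silico conversions used by category (i) tools (steps 1-3 of the procedure above) are simple string transformations. A minimal sketch, assuming a T-rich read r and a genome G held as Python strings:

    def c_to_t(s):
        # simulate bisulfite conversion on the original strand (G_T, r_T)
        return s.replace("C", "T")

    def g_to_a(s):
        # conversion for the complementary strand (G_A, r_A)
        return s.replace("G", "A")

    def revcomp(s):
        comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
        return "".join(comp[b] for b in reversed(s))

    # to map a T-rich read r with an existing DNA mapper, align c_to_t(r)
    # and revcomp(g_to_a(r)) against an index built from c_to_t(G)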
3.2.2 Every WGBS mapper uses a three-letter alphabet

Despite their algorithmic and conceptual differences, a common property of WGBS mappers is filtering candidates by allowing bisulfite bases in seeds to match two possible reference letters, but requiring all other seed bases to match reference letters exactly. Under these conditions, bisulfite bases represent half of the bases seen in both reads and the reference genome, while the two non-bisulfite bases, individually, appear less frequently. The uneven frequency of letters results in less efficient filtration in every filtration method presented thus far. In particular, if a seed is selected for filtration based on exact matches, the number of candidates retrieved from the genome will likely increase with the number of bisulfite bases in the selected subsequence, resulting in a greater number of false-positive matches. Furthermore, mappers that use the three-letter encoding are often implemented by keeping two copies of the reference, one for each bisulfite base. In practice, this often requires mappers to use more memory.

In the next section, we will describe the concept of a purine-pyrimidine encoding for DNA sequences. We will show that filtration under this encoding is efficient for bisulfite-converted reads. Specifically, we will show that accounting for the statistical properties of nucleotide frequencies after bisulfite conversion provides an avenue for more efficient filtration and potential improvements in the speed of WGBS read mapping, as well as in the memory required to map reads to larger genomes.

3.3 The two-letter alphabet encoding

Strand symmetry in eukaryotic genomes results in complementary bases being equally frequent in both strands. This symmetry is a result of millions of years of accumulated genomic inversions in eukaryotes, including every organism that exhibits DNA methylation [81]. As a consequence of this symmetry, purines (As and Gs) and pyrimidines (Cs and Ts) are also equally represented both before and after bisulfite conversion. This motivates the use of filtration using a two-letter alphabet encoding, which simultaneously converts purines to one letter (R) and pyrimidines to another letter (Y). In this resulting alphabet, the letters R and Y appear with equal frequency, so the k-mer distribution under the two-letter alphabet has higher entropy. This encoding can be used for filtration of both T-rich and A-rich reads.

We show that, compared to the commonly adopted three-letter encoding, filtration using the two-letter alphabet increases specificity when a fixed number of bits is available to encode an arbitrary subsequence selected from a read. We describe theoretical results for fingerprint-based filtration [82] in genomes formed by independent and identically distributed (i.i.d) letters. We will then show that many eukaryotic genomes have properties comparable to those derived theoretically assuming i.i.d sequences, despite not being i.i.d themselves. We also introduce an implementation of this approach as a novel algorithm and software tool, named abismal, that is optimized for WGBS read mapping. Using publicly available data, we show that abismal attains accuracy equivalent to WGBS mappers commonly used in scientific research, but requires less time and memory to generate its results.

3.3.1 Indexing with a two-letter alphabet

Mapping bisulfite sequencing reads typically involves converting the genome into a three-letter sequence, with C→T or G→A to simulate the bisulfite conversion process. The approach proposed in this section is motivated by the observation that if the reads and the genome are converted into two-letter sequences, with both C→T and G→A applied simultaneously, a match in the two-letter converted sequences is a necessary condition for there to be a match between the three-letter sequences. Moreover, the same two-letter encoding works for both T-rich reads and A-rich reads, which, in some increasingly important protocols, must be considered simultaneously and on a per-read basis [41, 83].
If this necessary condition is specific enough, and if it can be tested efficiently, the approach may have advantages. Beyond the ability to use the same encoding to map both A-rich and T-rich reads, we present the statistical rationale for the benefits of this strategy.

We define a k-mer as a sequence of k consecutive letters. We encode sequences by converting each k-mer to a numerical representation that uses one bit per position, distinguishing purines from pyrimidines. In particular, with h_2(A) = h_2(G) = 0 and h_2(C) = h_2(T) = 1, the two-letter fingerprint for a k-mer u = u_0 ··· u_{k−1} is defined as

H_2(u) = \sum_{i=0}^{k−1} h_2(u_i) × 2^i.    (3.2)

Under this scheme, k-mers over the original DNA alphabet are associated with k-bit binary numbers, and 2^k distinct four-letter k-mers have the same fingerprint. This same strategy can be applied in the context of spaced seeds [72], encoding ordered but non-contiguous sets of k letters. The advantages of spaced seeds are most pronounced when they span large portions of the read, which renders them less sensitive to insertions and deletions (indels). We therefore restrict our analyses to contiguous sequences.

Figure 3.1: An example comparison between the expected hit rates. A small i.i.d "genome" of size 25 is shown, as well as its two-letter and three-letter conversions, according to equations (3.2) and (3.3). Fingerprints can take values in Ω = {0, ..., 8}, and each value is represented as k-mers both in radix-2 and radix-3 encodings. The expected hit rates for both encodings Z_2 and Z_3 are calculated using equation (3.4). Red boxes show example k-mers that are counted in both encodings.

3.3.2 Theoretical analysis of the two-letter encoding

Here we analyze the theoretical efficiency of the two-letter fingerprint strategy, defined in equation (3.2), as a function of the number of bits in the fingerprint. This will be done by comparison with the three-letter encoding used in most bisulfite sequencing mappers, which reflects the chemical process of bisulfite conversion. In particular, the three-letter strategy encodes letters as h_3(A) = 0, h_3(G) = 1 and h_3(C) = h_3(T) = 2. This encoding simply equates T and C. Using h_3, the fingerprint for sequence u can be constructed as

H_3(u) = \sum_{i=0}^{k−1} h_3(u_i) × 3^i.    (3.3)

The function H_3 evaluates a radix-3 polynomial, while in practice it has been more common to use a radix-4 variation, defined as

R(u) = \sum_{i=0}^{k−1} h_3(u_i) × 4^i.

The two main distinctions between radix-3 and radix-4 are (i) the size, in bits, of the resulting fingerprint, and (ii) the operations required to evaluate the fingerprint function. Note that R can be evaluated using logical operations, while H_3 requires multiplication and modular arithmetic. Since H_3(u) = H_3(u') if and only if R(u) = R(u'), the radix-3 and radix-4 encodings are equivalent with respect to their accuracy characteristics as fingerprint functions, but when bounded by a common maximum value, the image of the radix-3 encoding contains more elements.
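Equations (3.2) and (3.3) are straightforward to evaluate. A minimal sketch of both fingerprint functions, with the encoding tables written out explicitly:

    H2 = {"A": 0, "G": 0, "C": 1, "T": 1}  # purines vs. pyrimidines
    H3 = {"A": 0, "G": 1, "C": 2, "T": 2}  # C and T equated

    def fingerprint(u, h, radix):
        # evaluate the radix polynomial of equations (3.2) and (3.3)
        return sum(h[base] * radix ** i for i, base in enumerate(u))

    fingerprint("CAT", H2, 2)  # 1*1 + 0*2 + 1*4 = 5
    fingerprint("CAT", H3, 3)  # 2*1 + 0*3 + 2*9 = 20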
3.4 Quantifying efficiency of an encoding scheme: the expected hit rate

3.4.1 Definition of the expected hit rate of an encoding scheme

As previously outlined, the time required by a mapping algorithm depends on the total number of fingerprint hits, which, for a particular fingerprinting scheme, is proportional to the expected hit rate for a random k-mer. We assume a random genome sequence of infinite size with i.i.d letters, and will address relaxations of this assumption later. We further assume that Pr(A) = Pr(T) = p and Pr(C) = Pr(G) = q, reflecting strand symmetry in eukaryotic genomes. This symmetry ensures that p + q = 1/2. We define the expected hit rate of a fingerprint function as the expected fraction of positions whose fingerprint is H(u), where u is a k-mer starting at a uniformly sampled genome position. This emulates the process of selecting fingerprints based on seeds from a large number of reads that were sampled from the reference sequence, as is the case with real short read sequencing data (with the exception of sample contamination and PCR biases). The expected hit rate Z can be calculated as

Z = \sum_u Pr(u) \sum_{u'} Pr(u') × I(H(u) = H(u')),

for indicator function I. If k-mers u and u' are sampled at random from the genome, this is equivalent to

Z = Pr(H(u) = H(u')).

If we assign fingerprints to each genome position using two functions H and H' with the same codomain Ω, the one with the lower expected hit rate is preferred. For a fixed reference genome, let v ∈ Ω be a fingerprint and N(v) be the number of k-mers in G whose fingerprint is v. The empirical expected hit rate is calculated as

Z = \sum_{v∈Ω} N(v)^2 / (\sum_{v∈Ω} N(v))^2 = (1/m^2) \sum_{v∈Ω} N(v)^2.    (3.4)

We will also define the theoretical hit rate z to be the expected hit rate in the specific case of infinite i.i.d genomes where p + q = 1/2.

3.4.2 Bounds on the expected hit rate

In this section, we devise bounds on the expected hit rate and show that these bounds can be attained under specific fingerprint distributions across genome positions. We formalize these bounds as a proposition.

Proposition 1. The expected hit rate satisfies 1/|Ω| ≤ Z ≤ 1.

Proof. The expected hit rate is the squared norm of a vector of numbers between 0 and 1 over all possible fingerprints in Ω. Let m = \sum_{v∈Ω} N(v) and let α be a vector of dimension |Ω|, for which we represent each element as α(v) = N(v)/m for every v ∈ Ω. The expected hit rate is given by Z = \sum_{v∈Ω} α(v)^2 under the constraints that 0 ≤ α(v) ≤ 1 and \sum_{v∈Ω} α(v) = 1. The fact that α(v)^2 ≤ α(v) means that Z is bounded by 1 (by summing over all possible values of v on both sides of the inequality). The maximum value of Z is attained when α(v_0) = 1 for some v_0 ∈ Ω and α(v) = 0 for v ≠ v_0. Intuitively, this means that every position in the genome has the same fingerprint, and any sampled position will retrieve the entire genome as candidate hits (this is the worst-case outcome described in section 3.1.3).

The minimum of Z can be found through the use of Lagrange multipliers. The function Z(α) = \sum_{v∈Ω} α(v)^2 is minimized under the constraint γ(α) = \sum_{v∈Ω} α(v) = 1 when the gradients of Z and γ are parallel, that is, for some λ ∈ R and for any fingerprint v ∈ Ω,

∂Z(α)/∂α(v) = λ ∂γ(α)/∂α(v)  →  α(v) = λ/2.

This means that α(v) is the same for every v ∈ Ω, and the constraint γ(α) = 1 implies α(v) = 1/|Ω|. The second partial derivative is ∂²Z(α)/∂α(v)² = 2 > 0, so this is the local minimum of Z.
This lower bound is attained when the entropy of fingerprints is maximized across Ω, that is, when the genome positions are divided equally among the fingerprints (exact equality is not possible if m > |Ω| and m is not a multiple of |Ω|). The value of the lower bound of Z is then given by

Z = \sum_{v \in \Omega} \alpha(v)^2 \ge \sum_{v \in \Omega} \frac{1}{|\Omega|^2} = \frac{|\Omega|}{|\Omega|^2} = \frac{1}{|\Omega|},

which concludes the proof.

3.4.3 Comparison of expected hit rates under two- and three-letter alphabets

We can directly devise bounds for the theoretical expected hit rate under the two-letter encoding, z_2, and the three-letter encoding, z_3, both of which assume infinite i.i.d. genomes. First, consider the two-letter fingerprint strategy H_2 defined in equation (3.2). Our statistical assumptions on the base frequencies in the genome imply that the theoretical hit rate is z_2 = (1/2)^k regardless of the values of p and q. This is also the theoretical minimum of Z according to Proposition 1.

We can also deduce that the lowest possible theoretical hit rate z_3 for H_3 is (3/8)^k. This can be proven as follows. Consider u and u' to be two independently sampled k-mers. The independence allows the following rearrangement:

\Pr(H_3(u) = H_3(u')) = \prod_{i=0}^{k-1} \Pr(h_3(u_i) = h_3(u'_i)) = \prod_{i=0}^{k-1} \sum_{j=0}^{2} \Pr(h_3(u_i) = j)^2 = \left((p+q)^2 + p^2 + q^2\right)^k = (1/4 + p^2 + q^2)^k,

by our assumption that p + q = 1/2. The minimal value of (3/8)^k is attained when p = q = 1/4.

If we evaluate fingerprints for k-mers within reads, the specificity is greater for the three-letter strategy, which should be intuitive. However, our capacity to index fingerprints for efficient filtration depends on the number of distinct values taken by the fingerprint function. In other words, it is more appropriate to compare the two strategies when the number of fingerprints is fixed, rather than the number of letters in the fingerprint. We will therefore devise the bounds for z_2 and z_3 when the codomain Ω is shared by both encoding strategies. In this case, the two-letter encoding has the lower expected hit rate. We formalize this through the following proposition.

Proposition 2. Let G be an infinite i.i.d. genome with Pr(A) = Pr(T) = p and Pr(C) = Pr(G) = q. For i ∈ {2,3}, let z_i be the expected hit rate of G when fingerprints are assigned based on the encoding function H_i using k-mers starting at each position under the codomain Ω = {0,...,2^b − 1}. Then z_2/z_3 ≤ 0.93^b.

Proof. Assume fingerprints are b-bit non-negative integers. For the two-letter encoding, b letters can be encoded using k_2 = b bits. For the three-letter encoding, the number of letters that can be represented using b bits is k_3 = ⌊b/log_2(3)⌋. In this case, the expected hit rate is lower for the two-letter encoding. This can be proven by analyzing the ratio between z_2 and the lowest possible value of z_3, since

\frac{z_2}{z_3} \le \frac{(1/2)^b}{(3/8)^{\lfloor b/\log_2 3 \rfloor}} \le \left(\frac{1}{2} \times \left(\frac{3}{8}\right)^{-\log_3 2}\right)^b < 0.93^b,    (3.5)

which concludes the proof.
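The quantities in Proposition 2 are straightforward to check numerically. The sketch below is our own illustration, not code from the dissertation's software: it computes the empirical expected hit rate of equation (3.4) for each encoding on a simulated i.i.d. genome and compares the ratio to the 0.93^b bound.

# Illustrative check of Proposition 2 on a simulated i.i.d. genome;
# our own sketch, not part of the abismal code base.
import math
import random
from collections import Counter

H2_CODE = {'A': 0, 'G': 0, 'C': 1, 'T': 1}
H3_CODE = {'A': 0, 'G': 1, 'C': 2, 'T': 2}

def empirical_hit_rate(genome, k, code, radix):
    """Empirical expected hit rate of equation (3.4): (1/m^2) * sum N(v)^2."""
    counts = Counter()
    for i in range(len(genome) - k + 1):
        v = sum(code[c] * radix**j for j, c in enumerate(genome[i:i + k]))
        counts[v] += 1
    m = sum(counts.values())
    return sum(n * n for n in counts.values()) / (m * m)

random.seed(1)
genome = ''.join(random.choice("ACGT") for _ in range(200_000))
b = 12                                   # bits available for the fingerprint
k2, k3 = b, int(b / math.log2(3))        # k-mer sizes sharing the codomain
Z2 = empirical_hit_rate(genome, k2, H2_CODE, 2)
Z3 = empirical_hit_rate(genome, k3, H3_CODE, 3)
print(Z2 / Z3, 0.93**b)                  # the ratio should fall below the bound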
Figure 3.2: Comparison between the ratio of empirical expected hit rates Z_2/Z_3 and the theoretical ratio z_2/z_3, which assumes infinite i.i.d. genomes, for H. sapiens, P. troglodytes, M. musculus, D. rerio, G. gallus and A. thaliana. Each panel shows the empirical and theoretical ratios together with the 0.93^b bound. Empirical expected hit rates Z_2 and Z_3 were calculated using equation (3.4). The k-mer sizes are given by k_2 = b and k_3 = ⌊b/log_2(3)⌋, where b is the number of available bits, ranging from 2 to 28.

Proposition 2 shows that any specificity lost in the two-letter encoding, compared with the three-letter encoding, is compensated by increased uniformity in the distribution of fingerprint values (Figure 3.1). This result, however, assumes k-mers are sampled with i.i.d. letters. Consider a genome containing an extreme abundance of poly-purine sequences. All k-mers sampled from this genome would have the same fingerprint under H_2 but would be distinguishable under H_3. This means that, in real genomes, expected hit rates under two- and three-letter alphabet encodings must be calculated empirically using equation (3.4). We compared the empirical and theoretical hit rates, under both the two-letter and three-letter encodings, for six species: H. sapiens, M. musculus, D. rerio, G. gallus, P. troglodytes and A. thaliana. For 2 ≤ b ≤ 28, and in all six genomes, the empirical and theoretical ratios between expected hit rates are below the bound derived in equation (3.5), which suggests that the k-mer distribution in real genomes favors specificity in the two-letter encoding at least as much as in genomes generated by infinite i.i.d. sequences (Figure 3.2). Based on the theoretical analysis of Z from section 3.4.2, the distribution of fingerprint frequencies is expected to be more uniform under two letters. This can be observed by direct comparison of fingerprint counts using fixed values of k_2 and k_3 (Figure 3.3). These results show that, in theory, the two-letter encoding can be used as an efficient filtration strategy for read mapping in various model organisms used for DNA methylation studies.

3.5 Another bisulfite mapping algorithm (abismal)

Both two-letter and three-letter encodings can be incorporated as an intermediate step in any DNA read mapping algorithm adapted for bisulfite sequencing reads. Wrappers for three-letter encoding adaptations of existing DNA read mapping software often operate by creating two indices of the original genome that simulate bisulfite conversion, then mapping reads using the resulting indices and combining the results. It is more challenging, however, to directly write wrappers around existing DNA read mapping code to use the two-letter encoding. This is because encoded reads are used for filtration, but both read and reference must be available as four-letter sequences in order to calculate alignment scores.
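For concreteness, the wrapper strategy described above can be sketched in a few lines. This is a simplified illustration under our own naming, not the code of any particular wrapper: it builds the two converted references and converts a read the same way before handing everything off to a standard four-letter mapper.

# Sketch of the three-letter "wrapper" strategy: build two in-silico
# bisulfite-converted references, convert reads the same way, and map
# with an ordinary DNA mapper. Illustrative only; names are our own.

def convert_c_to_t(seq):
    """Simulate bisulfite conversion of the forward strand (C -> T)."""
    return seq.replace('C', 'T')

def convert_g_to_a(seq):
    """Simulate conversion as seen on the reverse strand (G -> A)."""
    return seq.replace('G', 'A')

genome = "GCCAATCCTGAAGTTTTAGA"
ref_ct = convert_c_to_t(genome)   # index 1: C-to-T converted reference
ref_ga = convert_g_to_a(genome)   # index 2: G-to-A converted reference

read = "GTTAATTTTG"               # a T-rich read (its Cs read as Ts)
# A wrapper would convert the read (C -> T for T-rich reads), map it to
# both converted references, combine the candidate locations, and score
# final alignments against the original four-letter genome.

The final scoring step is the reason the original four-letter genome must be retained, as noted above.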
To demonstrate the use of the two-letter encoding for WGBS read mapping, we introduce another bisulfite mapping algorithm (abismal), a specific implementation of a bisulfite sequencing mapper that uses the two-letter encoding as a filtration strategy. Abismal is summarized in Figure 3.4 and explained in detail in Appendix A. The overarching idea of abismal is to index the genome through a direct address table that associates fingerprints with genome positions. For a k-mer size k_2, there are 2^{k_2} possible fingerprints, each one of which maps to the set of genome positions where the corresponding fingerprint is observed in the genome. Reads are mapped by similarly selecting every k-mer in the read as a seed, converting k-mers to their two-letter fingerprints, retrieving locations in the genome where each fingerprint is observed, and aligning the read to all hits associated with each fingerprint.

Figure 3.3: Comparison of fingerprint frequencies (log counts versus log rank) for the two-letter and three-letter encodings in the human genome. Fingerprint values of every k-mer were counted according to the radix-2 and radix-3 encodings. Fingerprints were stored using b = 24 bits, so values range from 0 to 2^b − 1. Fingerprint counts were sorted in decreasing order. The k-mer sizes are k_2 = 24 for the two-letter encoding and k_3 = ⌊24/log_2(3)⌋ = 15 for the three-letter encoding.

Figure 3.4: An example of the abismal indexing and mapping algorithms. (a) Indexing of an example genome composed of M = 20 bases using k = 3 sorting positions. The genome is converted to a two-letter sequence where purines (As and Gs) are encoded as 0s and pyrimidines (Cs and Ts) are encoded as 1s. The index then stores three arrays: a counts array of size 2^k + 1 = 9, where each position stores the cumulative sum of k-mer counts lexicographically below the position's binary representation; an index array of size M − k + 1 = 18, where a permutation of the genome positions is stored based on two-letter strings of size k starting at each base; and a copy of the genome string, which is kept in the index for subsequent alignments between reads and candidates. (b) Mapping of an example T-rich read with n = 6 bases to the reference genome shown in (a). The read is initially encoded in two letters, and substrings of size l = 3 are used to retrieve candidates from the genome using the counts and index arrays. Two such substrings are highlighted in dashed rectangles. Genome positions where fingerprints are observed are retrieved, and Smith-Waterman alignments are performed between the read and candidates. Mismatches to the input read are represented as gray letters.

3.6 Assessing performance of the abismal algorithm

The mapped reads are used to estimate the methylation level at each CpG site (or each cytosine) in the genome. The methylation level at a site is the fraction of original molecules (e.g. 2 per cell for a diploid sample) carrying the methyl mark at that site.
This situation is similar to genotyping a site using mapped reads from a whole genome sequencing experiment. Unlike genotypes, however, methylation levels may take values beyond the 1.0 and 0.5 frequencies associated with homozygous and heterozygous sites, and may instead take any value between 0 and 1. Consequently, systematic errors in mapping may lead to incorrect biological conclusions.

In this section, we describe the framework used to compare state-of-the-art WGBS mappers with abismal. The comparison includes metrics to assess correctness of the output mapping locations, as well as the computational efficiency of the mapping algorithms, specifically the run time and maximum memory use of each software tool. The public datasets used to benchmark abismal are also used in subsequent chapters to measure the performance improvement of various algorithmic steps proposed in this dissertation.

3.6.1 Comparison criteria: Sensitivity, specificity and speed

To evaluate performance, we used both simulated data and fifteen public high-quality datasets obtained from the Sequence Read Archive (SRA) [103]. These datasets include single-end and paired-end reads from the traditional WGBS protocol (i.e. with T-rich reads on the first end and A-rich reads on the second end) and from the random-priming post-adapter bisulfite tagging (RPBAT) protocol [41]. RPBAT is an amplification-free protocol that randomly sequences T-rich and A-rich reads with the property that paired ends have complementary bisulfite bases. Traditional single-end datasets were selected from human and four model organisms (mouse, chicken, zebrafish and arabidopsis), whereas RPBAT datasets originate from human samples (Table 3.1).

We show performance results on three distinct metrics: simulation accuracy (in the form of sensitivity and specificity), number of reads (or read pairs) mapped at least once, and average error rate between reads and the reference genome. Simulation accuracy for reads sampled with error from the reference genome is an objective assessment of mapper performance, even if simulated reads do not always reflect every property of real data. Mapped reads can be compared to their true locations of origin to assess the sensitivity and specificity of each mapper. Sensitivity is defined as the fraction of correctly mapped reads relative to the total number of simulated reads, and specificity is the fraction of reported reads that are correctly mapped.

Table 3.1: Datasets used to compare performance of WGBS mappers and other algorithmic strategies throughout this dissertation. Datasets are identified by their Sequence Read Run (SRR) accession, study citation, species, protocol (Trad. = traditional or directional; RPBAT = random PBAT, non-directional), a short study ID, read length, number of reads, and the Illumina instrument used to sequence the dataset.

Accession    Study  Species         Protocol    ID         Length  Reads  Instrument
SRR3498383   [84]   H. sapiens      Trad (SE)   Brocks     100     163 M  HiSeq 2000
SRR2096734   [85]   M. musculus     Trad (SE)   Manakov    100     308 M  HiSeq 2500
SRR10606701  [86]   D. rerio        Trad (SE)   Mendoza    96      168 M  HiSeq 1500
SRR5015166   [87]   G. gallus       Trad (SE)   Zhang      125     139 M  HiSeq 4000
SRR12075121  [88]   A. thaliana     Trad (SE)   Shahryary  151     76 M   NextSeq 500
SRR1067551   [5]    H. sapiens      Trad (PE)   Bernstein  100     168 M  HiSeq 2000
SRR10165530         H. sapiens      Trad (PE)   Tycko      144     52 M   NovaSeq 6000
SRR17509863         H. sapiens      Trad (PE)   LCL        150     29 M   NovaSeq 6000
SRR6020687          H. sapiens      Trad (PE)   Shenzhen   150     534 M  HiSeq X Ten
SRR1104864   [5]    H. sapiens      Trad (PE)   Roadmap    100     134 M  HiSeq 2000
SRR949196    [89]   H. sapiens      Trad (PE)   Meissner   99      217 M  HiSeq 2000
SRR5590025   [90]   P. troglodytes  Trad (PE)   Gokhman    101     107 M  HiSeq 2000
SRR948779    [91]   M. musculus     Trad (PE)   Low        101     220 M  HiSeq 2000
SRR11179759  [92]   M. musculus     Trad (PE)   Yamada     110     98 M   HiSeq 2500
SRR3897178   [93]   M. musculus     Trad (PE)   Decato     90      124 M  HiSeq 2000
SRR5337988   [94]   M. musculus     Trad (PE)   Baubec     126     175 M  HiSeq 2500
SRR13202262  [95]   M. musculus     Trad (PE)   Dura       100     518 M  NovaSeq 6000
SRR5756263   [96]   D. rerio        Trad (PE)   Kamstra    150     121 M  HiSeq 4000
SRR8307571   [97]   D. rerio        Trad (PE)   Skvortsova 140     148 M  HiSeq X Ten
SRR5644125   [98]   G. gallus       Trad (PE)   Lee        125     59 M   HiSeq 2500
SRR2296821   [99]   A. thaliana     Trad (PE)   Yong       90      21 M   HiSeq 2000
DRR019425    [41]   H. sapiens      RPBAT (PE)  Miura 1    100     69 M   HiSeq 2500
SRR7461526   [100]  H. sapiens      RPBAT (PE)  Bian       150     199 M  HiSeq X Ten
SRR7757863   [83]   H. sapiens      RPBAT (PE)  Miura 2    151     99 M   HiSeq X Ten
SRR2013804   [101]  H. sapiens      RPBAT (PE)  Guo        125     91 M   HiSeq 2500
SRR9646129   [102]  H. sapiens      RPBAT (PE)  Leng       126     71 M   HiSeq X Ten

To quantify both metrics, paired-end reads were simulated, both in traditional and RPBAT modes, using Sherman [104] and setting various error rates from 0% to 5%. We also simulated 1% and 70% methylation levels for cytosines outside and inside of the CpG context, respectively. These parameters simulate well-established methylation rates from most healthy mammalian somatic cells estimated from WGBS data with high conversion rates [5]. We varied read lengths from 50 to 150, a range that encompasses most read lengths generated by Illumina sequencers to date.

For real data, the true location of origin for each read is not known, but other metrics can be used to assess the relative quality of mapper outputs. The percentage of reads mapped at least once is an estimate of mapper sensitivity. In particular, paired-end reads that map concordantly (here defined as at most 3,000 bases apart and on opposite strands) are likely to be correct candidates. Many mappers, including abismal, map each end of read pairs independently. Since all tested reference genomes are large (the smallest with 135 Mbp), it is unlikely that incorrectly mapped reads are concordant by chance alone. To estimate specificity, we measured the error rate of mapped reads, defined as the ratio between the total number of edits (mismatches, insertions and deletions) and the total number of bases aligned to the reference. We expect more specific mappers to have lower error rates.

We interpreted abismal's accuracy, time and memory metrics by comparing it with five WGBS mappers. In addition to abismal, we mapped simulated and real reads with Bowtie 2 (through Bismark), BSMAP, BWA-MEM (through BWA-meth), HISAT-3N and WALT. We mapped RPBAT datasets using abismal, Bismark and BSMAP, since these are the only mappers with modes that allow T-rich and A-rich reads to be mapped interchangeably. Abismal, Bismark, HISAT-3N and BWA-meth perform local alignment. BSMAP and WALT count mismatches based on Hamming distance, with BSMAP also allowing one gap of size up to 3 in the read. Parameters for mappers were chosen to maximize the similarity between algorithms (e.g. similar alignment scores and similarity cut-offs to consider a read "mapped"). We ran each mapping tool on machines with identical hardware, system software and user environments. We instructed mappers to use 16 threads, which favors mappers that make efficient use of parallelism.
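As a minimal illustration of how the two simulation metrics are computed, the sketch below is our own and its names are hypothetical; it takes the true and reported location of each simulated read and applies the definitions above.

# Sketch of the sensitivity/specificity computation for simulated reads;
# names are illustrative, not from the benchmarking scripts.

def accuracy_metrics(true_locations, reported_locations, tolerance=0):
    """true_locations: dict read_id -> position; reported_locations:
    dict read_id -> position for reads the mapper reported."""
    correct = sum(1 for r, pos in reported_locations.items()
                  if r in true_locations
                  and abs(pos - true_locations[r]) <= tolerance)
    sensitivity = correct / len(true_locations)      # correct / simulated
    specificity = correct / len(reported_locations)  # correct / reported
    return sensitivity, specificity

truth = {"read1": 100, "read2": 2500, "read3": 900}
reported = {"read1": 100, "read3": 47}               # read2 unmapped
print(accuracy_metrics(truth, reported))             # (1/3, 1/2)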
3.6.2 Comparison results

Abismal version 1.0.0 was compared to Bismark, BSMAP, BWA-meth, HISAT-3N and WALT in a previous publication [105]. The implementation of abismal 1.0.0 was fully centered around the two-letter encoding. In this dissertation, we will present further optimizations attained through algorithmic improvements. A two-letter implementation of abismal is consistently faster than other mappers in human, mouse, zebrafish and chicken, and is as fast as other mappers in arabidopsis. We also quantified various quality statistics of mapped reads to ensure that the abismal output does not cause global biases in methylation. Bisulfite conversion rates and average methylation levels were quantified for each dataset used for comparison, as described in section 2.3.2. We concluded that these statistics are nearly identical across all mappers. Abismal requires less than 4 GB of RAM to map reads to the human genome. This creates future opportunities to use the mapper on smaller devices and makes WGBS data analysis more accessible to researchers with limited computing resources.

3.7 Concluding remarks

In this chapter, we showed that the information loss caused by bisulfite conversion provides opportunities for novel encoding strategies that are efficient for filtration in WGBS read mapping. We proposed the use of the two-letter encoding, which distinguishes purines and pyrimidines in both reads and the reference genome. We showed, through theoretical analysis of filtration efficiency, that the two-letter encoding is expected to retrieve fewer hits in random genomes than the three-letter encoding when a fixed number of bits is available to encode fingerprints. We also showed that abismal, a WGBS mapper that is based on the two-letter encoding, is faster and more accurate than most WGBS mappers used in scientific research, demonstrating that the two-letter encoding has practical applications.

In the next chapter, we will show that, despite real genomes attaining similar properties to i.i.d. genomes in terms of relative filtration efficiency of the two encodings, certain "artificial" genomes do not. We will analyze the worst-case relative efficiency of the two-letter encoding compared to the three-letter encoding and describe a strategy that mixes both alphabet encodings based on global genome statistics to maximize filtration efficiency.

Chapter 4

Filtration for bisulfite sequencing read mapping using a hybrid alphabet

In the previous chapter, we used the equal frequency of purines and pyrimidines in most genomes to construct a two-letter alphabet encoding that attains higher k-mer entropy and, consequently, more efficient filtration to map T-rich and A-rich reads. Comparisons between strategies were carried out in both i.i.d. and real genomes from various model organisms used for whole genome methylation studies. We concluded that, both in i.i.d. genomes and in real genomes of eukaryote model organisms, the two-letter encoding globally provides more efficient filtration.

The improved efficiency of the two-letter encoding is, however, not always guaranteed for every genome, even if we assume equal frequency of purines and pyrimidines. In particular, large consecutive runs of poly-purines or poly-pyrimidines can have high nucleotide entropy in four letters but completely lose that entropy when encoded in two letters. In this chapter, we propose a solution to this problem that attains the guarantee that the expected hit rate is minimized in any genome sequence.
We do this by partitioning the genome into two subsets of positions. In the first subset, we encode k-mers under the two-letter alphabet, whereas positions in the second subset are encoded using two fingerprint functions, both with three letters, each one of which accounts for a different bisulfite base conversion (i.e. C→T or G→A). The choice of encoding for each position is based on which alphabet attains the lowest fingerprint frequency in the genome. In this chapter, we will lay the theoretical foundation of the hybrid alphabet and describe a simple algorithm to construct it. Results on empirical comparisons between the three-letter, two-letter and hybrid alphabets will be described in more detail at the end of Chapter 5.

4.1 The two-letter alphabet can be locally inefficient

An "adversary" who seeks to construct a genome that minimizes the two-letter encoding efficiency, relative to the three-letter encoding, can proceed as follows. Consider a genome composed of m i.i.d. As and Gs, followed by m i.i.d. Cs and Ts. Clearly p = q = 1/4 for this genome, but it is globally not composed of i.i.d. letters. Under the two-letter encoding, and for sufficiently large m, only two fingerprints exist (excluding the ones that contain the midpoint of the genome), with m − k + 1 hits for each fingerprint. Under the three-letter encoding, the k-mers in the first half of this genome are distinguishable, with approximately m × (3/8)^k positions in each, whereas the second half also contains m positions with identical fingerprints. The ratio Z_2/Z_3 of the empirical expected hit rates can be approximated by

\frac{Z_2}{Z_3} \approx \frac{\frac{1}{2m}\,(m^2 + m^2)}{\frac{1}{2m}\left(m^2\left(\frac{3}{8}\right)^{2k} + m^2\right)} \approx \frac{1}{2}\left(\frac{8}{3}\right)^{2k},

indicating that the efficiency of the three-letter encoding increases exponentially with the three-letter k-mer size. This reflects the low specificity of the two-letter encoding in genome regions with high poly-purine or poly-pyrimidine content.

While this adversary genome is unlikely to represent a functioning organism, it may represent a significant portion of an organism's genome. Low complexity regions comprise a significant part of the DNA in many eukaryotes, and the low entropy of the two-letter alphabet may locally compromise mapping specificity for reads that originate from these regions. The human genome, for instance, contains over 6.5 million bases (or over 2% of the genome) in regions annotated as either A-enriched, G-enriched or AG-enriched [106]. Any k-mer selected from a read that contains many As and Gs will retrieve multiple positions from these regions under the two-letter encoding, causing filtration to be locally inefficient. Similarly, a spurious k-mer comprised of poly-purines from an otherwise heterogeneous read will retrieve a large fraction of low complexity regions as hits. When we further assume reads can come from either strand of the genome, reads that come from the reverse strand in CT-rich regions of the reference genome will also attain very low filtration efficiency.

It is intuitively desirable to encode this adversary genome entirely in a three-letter alphabet, since this allows reads from the forward strand in the first half and the reverse strand in the second half to be mapped with high specificity through the use of the additional letters. We will define "specificity" more formally when we have a choice of the encoding for each genome position.

Consider a k-mer sampled from a read which is itself uniformly sampled from the genome.
Assume the read has probability 1/2 of being either T-rich or A-rich (note that we map the reverse-complement of a T-rich read as A-rich, so mapping to both strands means we are mapping an equal number of T-rich and A-rich reads to the forward strand). We can further assume that only the forward strand of the genome is indexed, with k-mers from the reverse strand being encoded using the G→A bisulfite conversion.

Consider a uniformly sampled position and the k-mer sequence starting at that position. Let v, v_t and v_a be the fingerprints of the position's k-mer under the two-letter encoding and the three-letter encodings that convert C→T and G→A, respectively, and let C(v), C_t(v_t) and C_a(v_a) be the numbers of positions in the genome with the same fingerprints. We define the hybrid expected hit rate as

Z_h = \sum_{v, v_t} \frac{1}{2}\left(C(v) + C_t(v_t)\right) \Pr(H_2(i) = v, H_3(i) = v_t) + \sum_{v, v_a} \frac{1}{2}\left(C(v) + C_a(v_a)\right) \Pr(H_2(i) = v, H_3(i) = v_a),

for a uniformly sampled position i from the genome. Formally, Z_h is the expected value of the number of hits for a k-mer uniformly sampled from the genome if it is encoded in both two and three letters, with the three-letter encoding having probability 1/2 of coming from the forward strand (encoded as C→T) or the reverse strand (encoded as G→A). As shown in Chapter 3, the probability of a k-mer having a certain fingerprint is given by the relative frequency of the fingerprint in the genome, so Z_h can be rewritten as

Z_h = \frac{1}{m^2} \sum_v N(v) C(v) + \frac{1}{2m^2} \sum_{v_t} N_t(v_t) C_t(v_t) + \frac{1}{2m^2} \sum_{v_a} N_a(v_a) C_a(v_a),

where v iterates over values between 0 and 2^{k_2} − 1 and both v_t and v_a iterate over values from 0 to 3^{k_3} − 1. Note that if only one alphabet encoding is used, Z_h is the empirical expected hit rate Z defined in equation (3.4). For instance, if only the two-letter encoding is used, then C_t(v_t) = C_a(v_a) = 0 and C(v) = N(v), leading to the sum-of-squares definition of empirical hit rates (equation (3.4)). This means that Z_h is comparable to Z.

We are interested in constructing a genome index that minimizes Z_h when uniformly sampled reads from the genome are mapped back to it. We will do this by partitioning genome positions into two subsets. The first will index a set of substrings of the genome under the two-letter encoding, and the second will index genome positions under the three-letter encoding. The read mapping algorithm will then select seeds in both two and three letters, and the read will be aligned to the union of all retrieved hits in both encodings. While some positions can be missed for a k-mer in one encoding, we guarantee that every position will be retrieved by exactly one encoding function. We seek the partition for which the number of false-positive alignments is minimized.

In the next section, we will construct the partition that minimizes Z_h. We will also show that the optimal partition for most genome sequences, as expected from the results of Chapter 3, indexes most k-mers in two letters. Through these results, we can show that, with a marginal increase in the size of the index, mapping reads to most model organism genomes can be done more efficiently and accurately using a hybrid alphabet.

4.2 Hybrid alphabet index construction

In this section, we will construct an index where each position is indexed using its two-letter encoding or added to two separate indices, one assuming C→T and one assuming G→A. We assume the size of two-letter k-mers is k_2 and the size of three-letter k-mers is k_3. Note that they do not have to be bound by the relationship devised in Chapter 3 (i.e.
k_3 = ⌊k_2/log_2(3)⌋). For position i, let v(i) be its fingerprint in two letters, v_t(i) its fingerprint in three letters assuming C→T, and v_a(i) its fingerprint under G→A. For simplicity, we will define the fingerprint triplet of position i as (v, v_t, v_a). We must decide how to encode i based on the frequency with which v, v_t and v_a are seen in the genome.

Similar to the framework previously devised, for fingerprint v, we define N(v), N_t(v) and N_a(v) as the numbers of positions in the genome that have the same fingerprint as v in two letters and in both three-letter encodings. This means that we count three different fingerprint values for each position in the genome, and thus \sum_v N(v) = \sum_{v_t} N_t(v_t) = \sum_{v_a} N_a(v_a) = m. Next, we define C(v), C_t(v) and C_a(v) as the frequencies of the fingerprint v in an index assumed to contain a hybrid mixture of encodings. We impose two constraints:

1. \sum_v C(v) + \sum_{v_t} C_t(v_t) = m: this constraint means that every position is indexed either in two letters or in three letters, but not both.

2. \sum_{v_t} C_t(v_t) = \sum_{v_a} C_a(v_a): a position encoded in three letters will be indexed through both its fingerprints v_t and v_a, so the total sums of fingerprint frequencies in C→T and G→A must be identical.

In this framework, the probability of sampling a fingerprint from the read and the number of hits retrieved by a fingerprint are no longer the same. The probability of observing a fingerprint in a read is always dictated by the global genome statistics, regardless of the index. In other words, the probability of fingerprint v is N(v)/m, and the number of retrieved hits for this fingerprint is C(v). The values C, C_t and C_a are dictated by the genome partition, whereas N, N_t and N_a are fixed based on the genome sequence.

Let S = {0, ..., m−1} be the set of all genome positions and let S_2 and S_3 be two subsets that partition S, meaning that S_2 ∩ S_3 = ∅ and S_2 ∪ S_3 = S. Subsets S_2 and S_3 contain the positions in the genome encoded with the alphabets indicated by the subscripts. Consider a position i in the genome with fingerprint triplet (v, v_t, v_a). If we add i to S_2, then C(v) increments by 1 and the contribution of encoding i in two letters to Z_h is N(v). Similarly, if i ∈ S_3, both C_t(v_t) and C_a(v_a) increment by 1 and the contribution of encoding i in three letters is (N_t(v_t) + N_a(v_a))/2. This leads to a very simple algorithm to construct the partition (S_2, S_3) that minimizes Z_h:

1. Initialize S_2 = S_3 = ∅.

2. Scan the genome G and count the frequencies of fingerprints v, v_t and v_a, with 0 ≤ v < 2^{k_2} and 0 ≤ v_t, v_a < 3^{k_3}. This calculates N(v), N_t(v_t) and N_a(v_a) for all possible fingerprint values.

3. For 0 ≤ i < m, let (v, v_t, v_a) be the fingerprint triplet at position i. If N(v) ≤ (N_t(v_t) + N_a(v_a))/2, then S_2 ← S_2 ∪ {i}, otherwise S_3 ← S_3 ∪ {i}.

4. Return the partition (S_2, S_3).

Once the genome is partitioned, an index can be constructed through three counting sort algorithms similar to Figure 3.4. There are three counting arrays, counter, counter_t and counter_a, as well as three index arrays: index, index_t and index_a. The latter two indices contain the same elements, but they are ordered differently based on the numerical values of fingerprints, which differ depending on the numerical values assigned to nucleotides. In the next section, we discuss the effect on memory of these extra arrays.
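The partition algorithm above amounts to two passes over the genome. The sketch below is our own illustration of steps 1-4, not the abismal implementation; it uses the small example genome from Figure 3.1.

# Sketch of the hybrid-alphabet partition (steps 1-4 above); our own
# illustration, not the abismal implementation.
from collections import Counter

CODE2  = {'A': 0, 'G': 0, 'C': 1, 'T': 1}   # two-letter
CODE3T = {'A': 0, 'G': 1, 'C': 2, 'T': 2}   # three-letter, C->T (C and T equated)
CODE3A = {'A': 0, 'G': 0, 'C': 1, 'T': 2}   # three-letter, G->A (A and G equated)

def fingerprint(kmer, code, radix):
    return sum(code[c] * radix**j for j, c in enumerate(kmer))

def hybrid_partition(genome, k2, k3):
    n2, nt, na = Counter(), Counter(), Counter()
    # first pass: count fingerprint frequencies N, N_t and N_a
    for i in range(len(genome) - max(k2, k3) + 1):
        n2[fingerprint(genome[i:i+k2], CODE2, 2)] += 1
        nt[fingerprint(genome[i:i+k3], CODE3T, 3)] += 1
        na[fingerprint(genome[i:i+k3], CODE3A, 3)] += 1
    # second pass: assign each position to the cheaper encoding (step 3)
    s2, s3 = [], []
    for i in range(len(genome) - max(k2, k3) + 1):
        two = n2[fingerprint(genome[i:i+k2], CODE2, 2)]
        three = (nt[fingerprint(genome[i:i+k3], CODE3T, 3)] +
                 na[fingerprint(genome[i:i+k3], CODE3A, 3)]) / 2
        (s2 if two <= three else s3).append(i)
    return s2, s3

# the example genome of size 25 from Figures 3.1 and 4.1, with k2=3, k3=2
s2, s3 = hybrid_partition("GTCCCAAGTTCCGATATATAGCTAC", k2=3, k3=2)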
4.3 Memory storage requirement for the hybrid index

One of the major advantages of the two-letter encoding is the significant memory reduction in most data structures used to index the genome, relative to the three-letter encoding. When we move to a hybrid alphabet encoding, we are again "doubling" the index size for positions encoded in three letters, since we index each of these positions under both the C→T and G→A bisulfite conversions. The potential increase in memory may hinder the possible advantages of faster mapping times. In this section, we calculate the total memory requirement to store a genome index using the hybrid alphabet. We will later analyze, for various model organisms, what proportion of the genome is indexed in each encoding to assess if the practical applications of the hybrid alphabet are justified.

Using the direct address table strategy proposed in Chapter 3, the hybrid index requires a two-letter k-mer counting table (counter) of size 2^{k_2}, as well as two direct address tables (counter_t and counter_a) of size 3^{k_3}. The size of the index array is |S_2| and the sizes of the arrays index_t and index_a are both |S_3|, so the total size of the index arrays is |S_2| + |S_3| + |S_3| = m + |S_3|, whereas the size of the counter arrays is given by 2^{k_2} + 2 × 3^{k_3}. Similar to Chapter 3, the counter arrays use a "baseline" amount of memory that does not depend on the genome being indexed.

If only a two-letter encoding is used to index the genome, then the index size, in bits, is proportional to 2^{k_2} + m. Conversely, if only the three-letter encoding is used, the index size is proportional to 2 × 3^{k_3} + 2m. The difference between the hybrid alphabet index size and the two-letter alphabet index size is 2 × 3^{k_3} + |S_3|, meaning that the size increases proportionally to the number of positions indexed in three letters. It is possible that the hybrid index degenerates to the three-letter encoding, as in the "adversary" genome described at the start of the chapter, where the average fingerprint frequency in three letters is lower than the frequency in two letters at every position.

In the next section, similar to Chapter 3, we calculate the theoretical sizes of the partitions for infinite i.i.d. genomes. This allows us to assess the conditions in which genomes may "degenerate" to a fixed alphabet encoding. After setting theoretical expectations, we will quantify partition sizes in real model organism genomes.
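To make the bookkeeping concrete, the following back-of-the-envelope sketch (ours, assuming 32-bit entries and the human |S_3| fraction from Table 4.1) estimates the three index sizes for a human-scale genome:

# Rough index-size estimates (in GB) for the two-letter, three-letter
# and hybrid strategies; our own back-of-the-envelope sketch.

def gb(entries, bits_per_entry=32):
    return entries * bits_per_entry / 8 / 1e9

m, k2, k3 = 3_000_000_000, 25, 16
s3 = int(0.1387 * m)   # three-letter fraction for H. sapiens (Table 4.1)

two_letter   = gb(2**k2 + m)                       # counter + index
three_letter = gb(2 * 3**k3 + 2 * m)               # two counters + two indices
hybrid       = gb(2**k2 + 2 * 3**k3 + m + s3)      # all counters + m + |S3|
print(f"{two_letter:.1f} {three_letter:.1f} {hybrid:.1f}")  # ~12.1 24.3 14.1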
Figure 4.1: Example of the hybrid index construction algorithm. The same genome of size 25 from Figure 3.1 is shown, along with its two-letter encoding and both possible three-letter encodings with C→T and G→A. Using k_2 = 3 and k_3 = 2, the frequencies of fingerprints in each encoding are counted (shown at the bottom of the figure). Positions are partitioned based on whether the two-letter encoding frequency is lower than the average frequency of the two fingerprints under the three-letter encodings. This results in the partition given by S_3 = {11, 12, 20, 23} (red circles) and S_2 = {0,...,22} − S_3.

4.4 Theoretical analysis of partition sizes

We will devise the relative sizes of S_2 and S_3 in an infinite i.i.d. genome with Pr(A) = Pr(T) = p and Pr(C) = Pr(G) = q as a function of p, k_2 and k_3. This is done by iterating over all possible four-letter k-mers of size k_3 and counting the expected average frequency of their three-letter fingerprints. k-mers that are expected to appear less frequently than a uniformly sampled k-mer of size k_2 will be included in S_3.

Consider a random k-mer of size k_3 generated by this procedure. The numbers of nucleotides A, C, G and T in this k-mer, represented by x_A, x_C, x_G and x_T, follow a multinomial distribution with

\Pr(x = (x_A, x_C, x_G, x_T)) = \frac{k_3!}{x_A!\, x_C!\, x_G!\, x_T!}\, p^{x_A + x_T} q^{x_C + x_G}.

For a given k-mer with three-letter fingerprints v_t and v_a and nucleotide counts x, the expected numbers of positions with the same C→T and G→A fingerprints are

E[N_t(v_t)] = m\, \frac{p^{x_A} q^{x_G}}{2^{x_C + x_T}}, \qquad E[N_a(v_a)] = m\, \frac{p^{x_T} q^{x_C}}{2^{x_A + x_G}},

whereas the expected number of positions with the same two-letter fingerprint is E[N(v)] = m/2^{k_2}, since purines and pyrimidines are equally frequent. This means that the k-mers indexed in three letters are all the k-mers whose tuples x satisfy

\frac{1}{2}\left(\frac{p^{x_A} q^{x_G}}{2^{x_C + x_T}} + \frac{p^{x_T} q^{x_C}}{2^{x_A + x_G}}\right) < \frac{1}{2^{k_2}}.    (4.1)

By iterating over all possible tuples x_A, x_C, x_G, x_T, with x_A + x_C + x_G + x_T = k_3, we can count the expected number of k-mers in an infinite genome that would be encoded in three letters. Figure 4.2 summarizes the fraction of positions encoded in each alphabet for various values of p and q. For this comparison, we chose k-mer sizes that would realistically be used in short read mapping algorithms and represent minimum exact match sizes for reads. We chose k_2 = 25 and k_3 = 16, which yields a comparable number of fingerprints for both encodings, with more fingerprints in the three-letter encoding (2^25 = 33,554,432 and 3^16 = 43,046,721). For these values, approximately 2.8% of the genome would be indexed in three letters. When we increment k_3 by one (k_3 = 17), the fraction increases to 45%, due to the much larger number of possible three-letter fingerprints, each one of which will appear less frequently in the infinite genome.

Figure 4.2: Examples of fractions of the genome partitioned into S_2 and S_3 in infinite i.i.d. genomes. Selected are three pairs of values for (k_2, k_3), namely (25,16), (25,17) and (26,17). Values of p varied from 0 to 0.5 in steps of 0.05. The dashed vertical gray line represents p = 0.3, the most common letter frequency of weak bases (As and Ts) in eukaryotic genomes.

4.4.1 The additional memory for hybrid indices in real genomes is marginal

Table 4.1 shows the results of hybrid partitions for the same model organisms studied in Chapter 3. Fractions of the genome encoded in three letters vary from 5.40% (G. gallus) to 13.87% (H. sapiens). There are two important conclusions from these results: (i) there is a non-negligible subset of positions in most genomes that benefit from higher specificity under the three-letter encoding, and (ii) the index size under the hybrid alphabet for these genomes is close to the two-letter index size, with a marginal increase from the regions indexed with three letters. This is not surprising, since we have shown, in Chapter 3, that the global expected hit rate in real genomes closely resembles the i.i.d. theory.
However, it is noteworthy that, in every genome, the size of S_3 far exceeds the theoretical fraction of 2.8% expected in i.i.d. genomes with p = 0.3. This suggests that the additional three-letter index is favorable for low complexity regions in most real genomes.

Table 4.1: Hybrid alphabet partition sizes for k_2 = 25 and k_3 = 16.

Organism      |S_2| (millions)   |S_3| (millions)
H. sapiens    2,660 (86.13%)     428 (13.87%)
M. musculus   2,394 (87.86%)     330 (12.14%)
D. rerio      1,489 (88.70%)     189 (11.30%)
G. gallus     1,007 (94.60%)     57 (5.40%)
A. thaliana   113 (94.30%)       6 (5.70%)

4.5 Concluding remarks

We demonstrated that the potential local limitations of the two-letter alphabet in low complexity regions can be amended by the use of a hybrid alphabet, where the encoding is chosen based on the least frequent fingerprint under each alphabet. Genomes can be partitioned by alphabet encoding using a very simple criterion, which compares the frequency of the two-letter encoding of each k-mer with the average frequency of the two fingerprints for the two possible three-letter encodings. Using this criterion, most real vertebrate genomes are predominantly encoded using the two-letter alphabet, but the fraction of real genomes encoded in three letters is larger than what is theorized assuming i.i.d. genomes. This suggests that the hybrid encoding may favor specificity for reads originating from these regions. Comparisons of mapping speed between the hybrid, two-letter and three-letter alphabets are shown at the end of Chapter 5.

Chapter 5

Constructing genome indices for efficient filtration

In this chapter, we will describe how reducing the set of indexed positions in the reference genome can be used to attain more efficient filtration, at the expense of a slightly larger exact match size necessary for correctness of the mapping location. Unlike Chapters 3 and 4, we will explore avenues for filtration efficiency using data structures to represent the reference genome, rather than choosing a specific encoding function for k-mers in the genome. We will formulate this goal as an objective function based on the expected hit rate proposed in Chapters 3 and 4. Our approach expands on various strategies that were previously used in mapping software tools to reduce index size.

Similar to previous chapters, we are given an input genome G of size m as a string over the alphabet Σ = {A, C, G, T}. We are also given a read r of size n ≪ m. For a string T, we say T[i, j] is the substring starting at i and ending at j, for 0 ≤ i, j < |T|. We assume r is a substring uniformly sampled from G with some added error in the form of mismatches and indels. We seek to minimize the time to map read r to genome G. As in Chapters 3 and 4, this will be attained by reducing the expected number of alignments needed to map the read. We will assume that every position in the genome is assigned a numerical fingerprint v based on the k-mer starting at that position (e.g. through the hybrid alphabet encoding, although the use of this particular encoding is not imposed). We will also assume, as in previous chapters, that reads are selected from genome positions with equal probability.

5.1 Memory-efficient data structures for genome indices

A genome index is a data structure stored in memory and used to efficiently retrieve positions (filtration) using subsequences of short query reads. For the purposes of this chapter, our mathematical definition of an index of G will be a set S ⊆ {0, ..., m−1}.
As in Chapter 3, the data structure assumed to represent this index is a direct address table that associates fingerprints with the positions in the genome where the fingerprint is observed.

An index is allowed to skip some genome positions, which is important in reducing the memory necessary to run a mapping algorithm. We say a position i is indexed in S if i ∈ S. We define b as the number of bits necessary to represent a position in the genome, which we calculate as b = 2^{⌈log_2(log_2(m))⌉}. For most genomes used in methylation analysis, which range from hundreds of millions to 3 billion bases, b = 32, but if a genome has more than 2^32 bases (which is approximately 4.29 billion bases), then b = 64. The size of S, in bits, is given by b|S|. Since memory is almost always the limiting factor on which machines can run a mapper (not considering how long it takes for the program to finish running), reducing the index size without affecting mapping speed or accuracy is a problem of great importance. We define the compression factor α of index S as α(S) = m/|S|, so 1 ≤ α ≤ m. We seek to construct indices that attain a certain compression factor while maintaining the guarantee that any position in the genome is mappable, even if it is not indexed.

In this section, we describe data structures often adopted by read mapping tools to create a compressed genome index. This will serve as the basis for section 5.2, where we use some of these compression strategies to simultaneously compress the index and reduce mapping time. We will assume the genome index is used to map reads using an algorithm similar to the one proposed in Chapters 3 and 4. Each position i indexed in S is given a fingerprint v(i). To map a query read r, we select positions from r and retrieve their fingerprints. Let v_r(j) be the fingerprint of position j in r, for 0 ≤ j < |r|. We use the direct address table to find every position i in S for which v(i) = v_r(j). Finally, we align r to every substring of length |r| in G starting at position i − j filtered through this procedure. By selecting fingerprints at various positions in r, we align r to various substrings in G and report the position whose substring has the highest alignment score to r.

To provide a brief motivation for index compression methods, we calculate a lower bound on the size of the uncompressed human genome index. We define the trivial index S_t of G to be the entire set of positions, S_t = {0, ..., m−1}, so |S_t| = m and α(S_t) = 1. For human specifically, m ≈ 3 × 10^9. Using b = 32, the size of S_t in bits is b|S| ≈ 32 × 3 × 10^9, or approximately 11.17 GB. Since we also need to store the genome sequence itself, which requires at least two bits per letter, an additional 750 MB of storage is required. Other data structures, such as the direct look-up table, must also be stored as part of the index, and when reads are mapped, a batch of reads is loaded and stored in RAM. A 12 GB baseline memory requirement severely limits the number of machines on which mapping can be performed. For instance, if an index takes 12 GB of RAM on a 16 GB RAM machine, only 4 GB are available to load the reads to be mapped to the index. With the addition of other operating system processes that also need RAM, this amount may become prohibitive. Finally, many studies generate libraries that contain mixtures of DNA from different species, often as an assessment of read quality or for posterior computational demultiplexing. Mapping reads to both human and mouse genomes simultaneously, where m ≈ 6 × 10^9 and b = 64, would require over 40 GB of RAM.
While mapping reads to this genome may be prohibitive if it is fully indexed, mapping to a compressed representation of the index may be possible, depending on how efficient the compression strategy is. Reducing memory use is therefore of great interest for the accessibility of mapping algorithms, for the throughput of mapped reads under arbitrary memory availability, and to make mapping to combinations of reference genomes feasible. Below we outline some strategies to attain reduced memory and discuss their potential effects on accuracy.

5.1.1 Indexing uniformly spaced positions

A simple approach to reduce index size is to only store positions at uniformly spaced steps. We define the uniform index for window length w to be the set S_u(w) = {0, w, 2w, ..., w⌊m/w⌋}. The compression factor of this index is α(S_u) ≈ w. The trivial index is a special case where w = 1, since S_u(1) = S_t.

The index S_u can only be used in practice if, when mapping reads, either the value of w used to index the genome is known or every k-mer is used as a seed. If only the fingerprint of position j in a read is used, the true mapping location may not be retrieved simply because it is not indexed. To correct for this, the mapping algorithm must also use the additional fingerprints at positions j + 1, ..., j + w − 1. This correction guarantees that, if there is an exact match of sufficiently large length in r starting at position j, the correct mapping location is retrieved. For instance, if fingerprints are calculated based on k-mers starting at each position, then an exact match of length w + k − 1 starting at j with the true mapping position is a sufficient condition for r to be correctly mapped. Conversely, any false-positive hit on the additional fingerprints used will require additional mapping time. On average, the number of indexed positions for each fingerprint will also be divided by w, so the expected number of comparisons in aligning a read to candidates from position j using S_t and candidates from positions j, ..., j + w − 1 in S_u(w) is the same for any w. Also note that, if mismatches are located in the w − 1 letters outside of the k-mer that retrieves the correct mapping location, then accuracy is not compromised.

The length w + k − 1 of the sufficient exact match to map a read correctly demonstrates the trade-off between compression and accuracy. The larger the value of w, the more stringent exact matches need to be, resulting in loss of sensitivity. More generally, we can retrieve an indexed position with a guarantee by selecting positions from r at uniform shifts δ, that is, positions j, j + δ, j + 2δ, ..., j + (w − 1)δ, such that gcd(w, δ) = 1. This guarantees that at least one of the seeds must retrieve the correct mapping location, as they have different remainders modulo w. The adjacent positions case previously described is the special case where δ = 1. If positions are not adjacent, the lower overlap between seeds requires the exact match between the read and the correct mapping location to be longer, resulting in a loss of sensitivity. For this reason, δ = 1 is the optimal choice for the shortest exact match length that guarantees the correct location to be retrieved.
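A minimal sketch of this scheme (ours, not abismal's) makes the guarantee visible: the index keeps every w-th fingerprint, and the read supplies w consecutive seed offsets so that at least one of them is congruent to an indexed position.

# Sketch of a uniformly spaced index with w consecutive read seeds;
# our own illustration of the guarantee described above.
from collections import defaultdict

def build_uniform_index(fingerprints, w):
    """fingerprints[i] is the fingerprint of the k-mer at position i."""
    table = defaultdict(list)
    for i in range(0, len(fingerprints), w):   # keep every w-th position
        table[fingerprints[i]].append(i)
    return table

def candidate_starts(table, read_fps, j, w):
    """Candidate read-start positions using seeds j, j+1, ..., j+w-1."""
    starts = set()
    for d in range(w):
        for i in table.get(read_fps[j + d], []):
            starts.add(i - (j + d))            # shift hit back to read start
    return starts

If the read matches the genome exactly over w + k − 1 bases starting at j, one of the w consecutive seed positions necessarily falls on a multiple of w, so its hit is in the table and the true start is among the candidates.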
Winnowing schemes simultaneously reduce the index size of large genomes while not requiring much additional effort to find the location of query reads. A winnowing scheme is a tuple(w,≺ ), where≺ is an operator that induces an ordering over elements in the fingerprint codomain Ω (note that, for v,v ′ ∈ Ω , v < v ′ does not necessarily imply v ≺ v ′ ). The winnowing indexS m (w) works as follows. For adjacent positionsi,...i+w− 1 inG, we index inS m (w) the position with lowest fingerprint rank in the ordering defined by ≺ . The indexed positions is called the minimizer of the window. If there are ties in a window, all lowest-rank positions are indexed and considered minimizers. As a result, each window of length w has at least one minimizer, and various overlapping windows may share the same minimizer. To map reads using this index, the mapping algorithm must also have knowledge of the winnowing scheme(w,≺ ) that was used to index the genome. For positionsj,...j+w− 1 inr, we select minimizers using the same ordering≺ and retrieve positions in the index whose fingerprint matches the minimizers in 65 the window. Similar to the uniformly spaced index, if fingerprints are calculated based on k-mers starting at each position, then an exact match of sizew+k− 1 between the read and its correct mapping location is a sufficient condition for correct mapping. Unlike uniformly spaced positions, however, only one fingerprint is selected in each window, so fewer fingerprints are used for filtration if adjacent windows share a common minimizer. The compression factor ofS m depends on≺ and can vary between1 andw. It is conjectured that the maximum compression factor attainable by winnowing schemes is2w/3, but this has not yet been proven or disproven. In the vast majority of practical cases, the compression factor α (S m (w)) is approximately (w + 1)/2, which implies that, relative to S u , winnowing schemes require approximately twice the disk space for the samew. Minimizer selection is linear on the string length, so it is fast to both index the genome and select fingerprints from r. It is expected that winnowing schemes will continue to attain larger compression factors through strategies to devise the ordering≺ . Note that there are|Ω |! possible orderings of fingerprints, so it is infeasible to exhaustively test the compression factor of each possible ordering. There are various approaches that can aid in the construction of minimizers with better compression factors than(w+1)/2. One example is the DOCKS algorithm, which attains an empirical compression factor of(w+1)/1.77 in many animal genomes [110]. The idea of the algorithm is to construct a universal hitting set (UHS). A hitting setΩ ′ (k,w), is a subset ofΩ with the property that any string of lengthw+k− 1 contains at least onek-mer whose fingerprint is in Ω ′ . The universal hitting set is the hitting set with lowest cardinality (note that a universal hitting set is not dependent on the genomeG). The problem of constructing a UHS is NP-complete. The DOCKS algorithm is a greedy heuristic that constructs an approximately optimal UHS. The algorithm consecutively removes the k-mers whose fingerprint appears in the most strings of length w +k− 1. This can be done efficiently through the construction of a De Bruijn graph, where nodes are k-mers and edges connectk-mers that are substrings of any string of lengthk+1 (in other words, the two k-mers have an overlap of sizek− 1). 
It is expected that winnowing schemes will continue to attain larger compression factors through strategies used to devise the ordering ≺. Note that there are |Ω|! possible orderings of fingerprints, so it is infeasible to exhaustively test the compression factor of each possible ordering. There are various approaches that can aid in the construction of minimizer orderings with better compression factors than (w + 1)/2. One example is the DOCKS algorithm, which attains an empirical compression factor of (w + 1)/1.77 in many animal genomes [110]. The idea of the algorithm is to construct a universal hitting set (UHS). A hitting set Ω′(k, w) is a subset of Ω with the property that any string of length w + k − 1 contains at least one k-mer whose fingerprint is in Ω′. The universal hitting set is the hitting set with the lowest cardinality (note that a universal hitting set is not dependent on the genome G). The problem of constructing a UHS is NP-complete. The DOCKS algorithm is a greedy heuristic that constructs an approximately optimal UHS. The algorithm consecutively removes the k-mers whose fingerprints appear in the most strings of length w + k − 1. This can be done efficiently through the construction of a De Bruijn graph, where nodes are k-mers and edges connect k-mers that are both substrings of some string of length k + 1 (in other words, the two k-mers have an overlap of size k − 1). An approximate UHS construction can be done by removing nodes of this graph until there are no more paths of length w. At each step, the DOCKS algorithm removes the node with the most paths passing through it, which can be calculated as the product of the number of paths starting at the node and the number of paths ending at it. The construction of a UHS can be incorporated into a winnowing scheme by placing k-mers in Ω′ first in the ordering ≺, that is, if v ∈ Ω′ and v′ ∉ Ω′, then v ≺ v′. This ensures that only k-mers in Ω′ are indexed since, by definition, every window contains a k-mer in Ω′. The low cardinality of Ω′ (by construction) results in the higher sparsity of the selected k-mers and the empirically observed compression rate. Note also that only k, w and the alphabet size are necessary to construct a UHS, so a mapper can construct the UHS for the values it uses prior to indexing the reference genome and simply read the k-mers in the UHS when constructing S_m(w) or mapping reads.

In winnowing schemes, the compression factor α affects both the number of k-mers indexed and the number of k-mers selected from the reads. This means that the expected number of comparisons scales with 1/α². An effective winnowing scheme for filtration must simultaneously index low-frequency k-mers and attain high sparsity. In section 5.2.3, we will define more objectively how to quantify the filtration efficiency of arbitrary winnowing schemes.

5.1.3 The Burrows-Wheeler transform and the FM index

The Burrows-Wheeler transform (BWT) is a data structure initially constructed to attain lossless compression properties similar to those of the Lempel-Ziv encoding for texts with small alphabets, but allowing decompression that is linear in the string length [78]. The BWT of string G is a permutation of the characters in G constructed by the following procedure: m strings are constructed from G through rotations (cyclic shifts) of G. These rotations are sorted lexicographically (essentially forming the suffix array of G), and the i-th character of the resulting Burrows-Wheeler transform L = BWT(G) is the last character of the i-th sorted cyclic rotation. The Burrows-Wheeler transform of G has desirable compression properties. Specifically, there is a high probability that several runs of consecutive characters in the BWT are grouped together, especially when m is much larger than the alphabet size, as is the case for genomic sequences.

The BWT is lossless because only knowledge of the index I of G in the sorted list of rotations and L are necessary to reconstruct G. This is done through a backwards search as follows. The search starts by creating a new string F that is defined as the sorted characters of L (or, equivalently, the sorted characters of G, although G is not available). Another vector T points the index of each element in L to the index of that same element in F. In other words, L(T(i)) = F(i). Then G can be constructed backwards by starting at L(I), then finding the preceding character as L(T(I)), then L(T²(I)) = L(T(T(I))), and so on until L(T^m(I)). This procedure works because of the property that, for a given character, the ranks of its occurrences are permuted the same way in both L and F, so the mapping can be constructed without any additional knowledge. In practice, the knowledge of index I is replaced by the addition of a special character $ that precedes every character in the alphabet. Then, to reconstruct G, we can start at the position in which $ occurs in L.
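A compact sketch of both directions follows. It is our own didactic version, quadratic time for clarity (production indices build the BWT via suffix arrays), and it reconstructs in the forward direction using the successor mapping, which is equivalent to the backwards procedure described above.

# Naive BWT construction and inversion with a '$' sentinel; our own
# didactic sketch, not how FM-index-based mappers build it in practice.

def bwt(s):
    """BWT via sorted cyclic rotations of s + '$'."""
    s = s + '$'                                      # sentinel precedes all letters
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(rot[-1] for rot in rotations)

def inverse_bwt(L):
    """Rebuild the text using the successor (next-row) mapping."""
    n = len(L)
    F = sorted(L)                                    # first column of sorted rotations
    # stable sort: equal characters keep the same relative order in L and F,
    # so nxt[j] is the row whose rotation starts one position later
    nxt = sorted(range(n), key=lambda i: (L[i], i))
    out, j = [], 0                                   # row 0 starts with '$'
    for _ in range(n - 1):
        j = nxt[j]
        out.append(F[j])
    return ''.join(out)

assert inverse_bwt(bwt("GCCAATCCTGA")) == "GCCAATCCTGA"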
The BWT can also be used to efficiently count the number of occurrences of a substring W in G through some auxiliary data structures. For character a, let K(a) be the number of symbols in G lexicographically smaller than a, and let O(a, i) be the number of occurrences of a in the prefix L[0, i] of the BWT. Ferragina and Manzini [111] proved that the first occurrence q(aW) of string aW in the F string is given by q(aW) = K(a) + O(a, q(W) − 1) + 1, and the last occurrence Q(aW) is given by Q(aW) = K(a) + O(a, Q(W)). This leads to a linear time backwards search algorithm that counts the occurrences of the entire word W in G by successively reducing the interval of G in which it can be matched.

Many mapping algorithms, such as BWA, HISAT and Bowtie, use the BWT of the reference genome to create an index that allows the extraction of exact matches from substrings of a query read. In order to use the BWT, some additional data structures are also necessary to store certain locations of the reference. Note that the backwards search only counts the number of occurrences of a match; it does not retrieve their locations. An initial, inefficient solution is to store, along with the L and F strings, the suffix array of the reference genome, which is much larger than either of the strings individually and would compromise any compression attained (the size of the suffix array is m). The FM index [111] solves this problem by proposing that only a subset of the suffix array should be indexed, specifically the location of every w-th position in the genome, so that every stored location in the suffix array is a multiple of w. The compression factor of the FM index is also w, but additional data structures must be stored along with the index. Assume, for simplicity, that we are able to know whether a certain position in the suffix array was indexed. After doing the backwards search and retrieving an interval of F with the exact matches, we would test, for each position, whether the suffix array location is available. If it is, then we are done. If it is not, then we can continue the backwards search until an indexed position is found. The number of additional jumps in the backwards search necessary to find an indexed position tells us the location of the original query. In other words, if y additional jumps are necessary to retrieve position xw in the suffix array, then our initial hit in G occurs at position xw + y.

Assessing whether a position in the suffix array was indexed requires yet another data structure. Ferragina and Manzini proposed the following solution to this query. The positions in G are partitioned in buckets of size Θ(n log_2 m). For each bucket, every indexed position is stored in a packed B-tree, using as key its distance to the start of the bucket. It can be proven that queries in the FM index can be answered in O(n log m) time. Further refinements in the backwards search can improve this bound to O(n + log^ε m) for any ε > 0, thus attaining both space and time improvements over suffix arrays for large values of m.
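The counting step of the backwards search is short enough to sketch directly from the two recurrences above. Below is our own illustration over a plain Python BWT, using 0-based half-open intervals rather than the 1-based convention of the formulas, with K and the occurrence counts computed on the fly.

# Backwards-search counting on the BWT; our own sketch of the
# Ferragina-Manzini recurrences, counting occurrences only (no locations).
from bisect import bisect_left

def count_occurrences(L, W):
    """Number of occurrences of W in the text whose BWT (with '$') is L."""
    F = sorted(L)
    K = {c: bisect_left(F, c) for c in set(L)}   # symbols smaller than c
    lo, hi = 0, len(L)                           # interval of matching rows
    for c in reversed(W):
        if c not in K:
            return 0
        lo = K[c] + L.count(c, 0, lo)            # occurrences of c before lo
        hi = K[c] + L.count(c, 0, hi)            # occurrences of c before hi
        if lo >= hi:
            return 0
    return hi - lo

s = "GCCAATCCTGA" + "$"
L = ''.join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))
print(count_occurrences(L, "CC"))                # 2 occurrences of "CC"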
5.1.4 Compression strategies provide opportunity for efficient filtration

Most research on genome index compression focuses on reducing the memory required to store reference sequences without compromising the accuracy of finding the best-scoring match of smaller query sequences. In the next section, we show that we can also attain improvements in mapping speed when constructing genome indices. To attain this, we modify the expected hit rate framework introduced in Chapter 3 to account for the distribution of fingerprints in an arbitrary index. We then use the expected hit rate as an objective function and find the index that minimizes its value. While this framework does not explicitly minimize the index size, we will also show that a near-optimal index size is attained as a byproduct of minimizing the expected number of comparisons.

5.2 Efficient filtration through index compression

One of the major challenges to the efficiency of read mapping is that seeds must be small enough to allow sensitivity to mismatches and indels. Small seeds in large genomes may lead to a large number of false-positive hits that must be aligned to find the best hit. This problem is aggravated in genomes that contain many repeat copies and low-complexity sequences. If a k-mer selected from a read appears very frequently in the genome, then no k-mer encoding function will map that read efficiently. We can, however, index the genome in such a way that the number of indexed positions at which any such k-mer is seen is kept to a minimum. In this section we develop this idea formally by defining the expected hit rate of an index S.

5.2.1 Generalizing the expected hit rate for arbitrary indices

We assume reads are uniformly sampled from the genome, and that each of the m positions in the genome has a fingerprint. Consider a uniformly sampled position i with fingerprint v, and let N(v) be the number of positions in the genome with fingerprint v. The probability of sampling a position with fingerprint v is Pr(H(i) = v) = N(v)/m. If every position is indexed, the expected number of hits from a uniformly sampled position is therefore

Z' = E[N(v(i)) | i ~ Uniform({0, ..., m−1})] = Σ_v N(v) Pr(H(i) = v) = (1/m) Σ_v N(v)².

This is similar to the expected hit rates in Chapters 3 and 4, but we drop the quadratic exponent on m because we are calculating the number of hits rather than the fraction of hits relative to the genome size. The quadratic factor in N(v) means that frequent fingerprints are simultaneously more likely to be sampled and retrieve more positions. Consider now an arbitrary index S where, for fingerprint v, C(v) is the number of positions in S whose fingerprint is v. The expected hit rate Z is defined as

Z(S) = Σ_v C(v) Pr(H(i) = v) = (1/m) Σ_v C(v) N(v).    (5.1)

The definition of Z(S) is a slight adaptation of the expected hit rate defined in Chapter 3 to arbitrary indices. Informally, Z decreases when the index concentrates on positions whose fingerprints are rare in the genome, that is, when C(v) is high only where N(v) is low. For the trivial index S_t, Z(S_t) = Z'. We seek the index S that minimizes Z(S), but we impose that any index S must obey a mappability constraint: among any w consecutive positions, at least one is in S. This is similar to the constraint imposed by uniform and winnowing scheme indices. Among indices that satisfy this constraint, we define S_Z(w) = argmin_S Z(S) to be the minimum hit rate index.

5.2.2 Linear time construction of the minimum hit rate index

In this section, we show how to construct the minimum hit rate index S_Z(w). To attain this, we use an equivalent but slightly different formulation of the problem. For 0 ≤ i < m, we assign to position i a penalty P_i > 0. The total penalty of a subset S of {0, ..., m−1} is Z(S) = Σ_{i∈S} P_i. We seek the set that minimizes Z(S) under the constraint that {i, i+1, ..., i+w−1} ∩ S ≠ ∅ for 0 ≤ i < m − w + 1.
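As a concrete illustration of this objective, a minimal sketch that evaluates Z(S) for a candidate index, assuming per-position fingerprints are precomputed; this is the penalty formulation with P_i = N(v), and enforcing the mappability constraint is left to the construction algorithm below.

    from collections import Counter

    def expected_hit_rate(fingerprints, index_positions):
        # N(v): genome-wide frequency of each fingerprint.
        m = len(fingerprints)
        N = Counter(fingerprints)
        # Z(S) = (1/m) * sum over indexed positions of N(fingerprint(i)).
        return sum(N[fingerprints[i]] for i in index_positions) / m

    fp = [0, 1, 0, 2, 0, 1, 3, 0]
    print(expected_hit_rate(fp, range(len(fp))))  # trivial index: Z(S_t) = Z' = 2.75
    print(expected_hit_rate(fp, [1, 3, 6]))       # low-frequency picks: Z = 0.5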
Figure 5.1 shows a graphical depiction of the minimum cost subset on a small example for three values of w. This formulation can be used to construct S_Z(w) by setting P_i = N(v), where v is the fingerprint of position i. It follows directly that the sum of penalties of an index is the expected number of hits from equation (5.1), multiplied by the constant factor m. We will use the penalty formulation to construct a linear time algorithm to find S_Z(w). For 0 ≤ i < m, let Opt(i) be the minimum penalty index for the genome prefix G[0, i] under the constraint that i ∈ Opt(i). For 0 ≤ i < w, Opt(i) = {i}: since we obey the mappability constraint we must index i, and any additional position would increase the value of Z. For i ≥ w, we have the following relationship:

Opt(i) = {i} ∪ Opt(j̃),  where  j̃ = argmin_{i−w ≤ j < i} Z(Opt(j)).
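A minimal sketch of this construction, assuming penalties P[i] = N(fingerprint at position i); a monotonically increasing deque provides the window minimum over dp[i−w .. i−1] in O(1) amortized time per position, making the whole construction linear in m.

    from collections import deque

    def min_hit_rate_index(P, w):
        # Return a subset S minimizing sum(P[i] for i in S) such that every
        # w consecutive positions contain at least one member of S.
        m = len(P)
        dp = [0.0] * m       # dp[i]: best penalty for G[0, i] with i indexed
        parent = [-1] * m    # previous indexed position on the optimal path
        window = deque()     # indices j < i with dp[j] increasing
        for i in range(m):
            while window and window[0] < i - w:
                window.popleft()             # admissible j satisfy j >= i - w
            if i < w:
                dp[i] = P[i]                 # base case: index i alone
            else:
                j = window[0]                # argmin of dp over [i-w, i-1]
                dp[i], parent[i] = P[i] + dp[j], j
            while window and dp[window[-1]] >= dp[i]:
                window.pop()                 # keep deque values increasing
            window.append(i)
        # The last indexed position must fall inside the final window.
        end = min(range(max(0, m - w), m), key=dp.__getitem__)
        S = []
        while end != -1:
            S.append(end)
            end = parent[end]
        return sorted(S)

    print(min_hit_rate_index([4, 2, 4, 1, 4, 2, 1, 4], w=3))  # [1, 3, 6]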
[...]

The function ν′ is then used as the weighted ordering and accounts for the frequency of fingerprints in the genome. In this section, we define an expected hit rate metric for winnowing schemes that allows us to compare two schemes and decide which is more efficient for mapping. As in previous sections, we assume a small read is sampled from the genome. We slide through the read's k-mers selecting minimizers in windows of length w, then compare the read with every position in the genome where the minimizer was indexed. We seek the ordering that minimizes the expected number of comparisons performed by this mapping algorithm. The key difference between this framework and the framework used to compute S_Z is that the frequency of k-mers sampled from the read depends on the full genome, whereas the number of hits is determined by the k-mers selected in the index. In winnowing schemes, the ordering affects the frequency with which k-mers are selected both in the reads and in the reference. Furthermore, different orderings can have vastly different compression rates, and a poor compression rate may result in a larger number of comparisons because more k-mers are selected from the read.

We define the expected hit rate for winnowing schemes in a manner similar to previous chapters and sections. Consider a winnowing scheme (w, ≺). Once we slide through G and select minimizers based on fingerprints and the ordering ≺, let C(v) be the number of positions with fingerprint v selected as a minimizer in at least one window. When mapping reads, the fraction of windows of size w + k − 1 in which fingerprint v is a minimizer will once again resemble the genome statistics, so among all (w + k − 1)-mers in reads, the probability that fingerprint v is selected as a minimizer is proportional to C(v), and it will retrieve C(v) hits in the genome. In particular, setting y = Σ_v C(v) to be the number of indexed positions in the genome, the probability that a (w + k − 1)-mer from a uniformly sampled read has v as a minimizer is C(v)/y, so the expected hit rate for minimizers is

Z_m = (1/y) Σ_v C(v)².

Note that Z_m is the expected number of comparisons for a single k-mer selected from a read. In reality, minimizers select fewer k-mers from reads, in proportion to the compression factor α = m/y. Specifically, a read of length n contains n − k − w + 2 windows of size w + k − 1, from which, on average, (n − k − w + 2)/α minimizers will be selected, since the compression factor for the genome also applies to the reads. In comparison, the uniform and minimum expected hit rate compression strategies require all of the n − k + 1 k-mers to be used as seeds.

If we wish to compare the normalized total number of comparisons between strategies, the more appropriate expected hit rate Z'_m is Z_m normalized by the compression factor:

Z'_m = (1/(αy)) Σ_v C(v)² = (1/m) Σ_v C(v)².

Using Z'_m, we can decide which of two winnowing schemes is expected to yield faster mapping times by comparing the sums of squares of indexed fingerprint counts. Minimizing Z'_m, however, is a more complex problem than maximizing the sparsity of winnowing schemes, which is itself currently a very active field of research. For this reason, we will not delve into possible solutions, but we propose a conjecture. The frequencies N(v) of each fingerprint in the genome induce an ordering of fingerprints: we can construct a winnowing scheme where the frequency condition applies, that is, v ≺ v′ if N(v) < N(v′). We conjecture that the minimum expected hit rate winnowing scheme is close, in permutation distance, to this induced winnowing scheme. In real genomes, this frequency-based winnowing scheme has a very poor compression factor, meaning that, although less frequent k-mers are favored, the relative number of indexed positions is much larger than 2/(w + 1). If we allow certain permutations that violate the frequency condition (e.g., v ≺ v′ with N(v) = 2 and N(v′) = 1), we may obtain lower values of Z'_m than by strictly prioritizing the lowest-frequency k-mers.
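To make such comparisons concrete, a minimal sketch that selects minimizer positions under a given ordering and evaluates Z'_m; the random and frequency-induced orderings below are the two schemes discussed above, and the fingerprints are simulated i.i.d. purely for illustration.

    import random
    from collections import Counter

    def minimizer_positions(fingerprints, w, order):
        # Positions whose fingerprint is the minimizer of at least one window,
        # breaking ties by leftmost position.
        selected = set()
        for start in range(len(fingerprints) - w + 1):
            window = range(start, start + w)
            selected.add(min(window, key=lambda i: (order(fingerprints[i]), i)))
        return selected

    def z_prime(fingerprints, w, order):
        # C(v): indexed positions per fingerprint; Z'_m = (1/m) * sum C(v)^2.
        C = Counter(fingerprints[i]
                    for i in minimizer_positions(fingerprints, w, order))
        return sum(c * c for c in C.values()) / len(fingerprints)

    fp = [random.randrange(16) for _ in range(2000)]
    rnd = {v: i for i, v in enumerate(random.sample(range(16), 16))}
    freq = Counter(fp)
    print(z_prime(fp, 10, rnd.__getitem__))   # random ordering
    print(z_prime(fp, 10, freq.__getitem__))  # frequency-induced ordering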
In the following sections, we use the traditional random ordering of fingerprints to compare the efficiency of minimizers with the other indexing strategies proposed thus far.

Table 5.1: Comparison of mapping efficiency under various indexing strategies for k = 25 and w = 10

index     bases indexed   reads mapped (%)   mapping time (s)   RAM (GB)
S_u(w)    308,828,638     96.94              4,393              2.736
S_m(w)    541,500,923     96.72              7,116              3.603
S_Z(w)    334,760,651     96.94              3,470              2.833

Table 5.2: Comparison of mapping efficiency under various indexing strategies for k = 25 and w = 20

index     bases indexed   reads mapped (%)   mapping time (s)   RAM (GB)
S_u(w)    160,867,345     97.271             2,306              2.245
S_m(w)    305,215,908     97.106             3,142              2.783
S_Z(w)    180,670,579     97.276             1,433              2.318

Figure 5.2: Comparison of two-letter k-mer frequencies under three indexing strategies: uniform S_u(w), minimum hit rate S_Z(w) and minimizers S_m(w), for k = 25 and w = 10.

5.2.4 Empirical comparison of indexing strategies

We compared the sensitivity, time and memory required to map one million error-free read pairs of 100 bases each, simulated uniformly from the human genome. We mapped the same set of reads to three indices, S_u(w), S_Z(w) and S_m(w), for k = 25 and both w = 10 and w = 20. Despite fewer k-mers being selected from reads, mapping using minimizers is slower, less sensitive and uses more memory. This is due to the increased number of indexed positions in S_m(w), which also results in high-frequency k-mers being indexed twice as often (Figure 5.2). The comparisons suggest that the difference in mapping speed between the indices S_u(w) and S_Z(w) increases with w (Tables 5.1 and 5.2). This is because larger values of w allow more options to select positions with low-frequency fingerprints in S_Z(w), so the ratio Z(S_Z(w))/Z(S_u(w)) is a decreasing function of w for any genome.

5.3 Empirical comparison of indexing and alphabet encoding strategies

In this section, we provide a comprehensive comparison of the strategies proposed in Chapters 3, 4 and 5. Using the abismal algorithm, we mapped 1 million randomly sampled reads from various public datasets under identical hardware settings. Reads were mapped under six combinations: two indexing strategies and three alphabet encodings. The two-letter encoding imposes S₃ = ∅ and the three-letter encoding imposes S₂ = ∅. The minimum hit rate strategy uses the index S_Z(w), whereas the uniform strategy uses the index S_u(w); we set w = 20 in both cases. These comparisons provide objective results on the individual contributions of each strategy proposed in this dissertation.

The time to map one million reads was compared for twelve datasets. Two of these are simulated error-free reads from human and mouse. The remaining ten datasets are from six organisms: human, mouse, chimpanzee, zebrafish, chicken and arabidopsis. We selected three human and three mouse datasets from different sequencing instruments (Table 3.1). For all datasets, we set the cut-off to c = ∞, meaning that reads are compared to every hit regardless of the number of hits per fingerprint. Mapping times are consistently faster with the minimum hit rate index (Figure 5.3), whereas sensitivity and specificity are nearly identical, given that the overall algorithm is the same except for the encoding function. Interestingly, the improvement in mapping speed relative to the uniform index is more pronounced when the three-letter alphabet encoding is used. Within human and mouse, mapping speeds vary across datasets, with the relative efficiencies of the two-letter and three-letter encodings differing substantially. Consistently, however, the hybrid alphabet attains superior mapping speed, often similar to the better of the two fixed encodings. Overall, the results suggest that the combination of the hybrid alphabet and the minimum hit rate index is the fastest implementation for WGBS read mapping.

[Figure 5.3: grouped bar charts of mapping speed, in million mapped reads per hour, for the hybrid, two-letter and three-letter encodings under the minimum hit rate and uniform indices, one panel per dataset.]

Figure 5.3: Empirical comparison of the hybrid, two-letter and three-letter alphabets using both uniform and minimum hit rate indices on two simulated error-free sets of reads and ten real datasets. Datasets are represented by the IDs described in Table 3.1.

5.4 Concluding remarks

We constructed an indexing strategy that minimizes the expected hit rate under the constraint that at least one among every w consecutive genome positions must be indexed. The minimum hit rate index can be constructed in linear time, and its compression rate is approximately w, similar to the uniform index and the best possible winnowing scheme of length w.
Empirical comparisons in various model organisms show that this index is faster than both uniform indices and winnowing schemes for the same window size w. Both uniform and minimum hit rate indices require a minimum exact match of length w + k − 1 with the optimal mapping location to ensure correctness, but the minimum hit rate index attains these results significantly faster due to fewer false-positive hits.

The framework used to calculate expected hit rates assumes that the read will be compared to every candidate retrieved by a seed. In practice, many implementation strategies in a mapping algorithm may reduce the number of comparisons without affecting mapping accuracy. For instance, if a mapper only reports unique reads, the search can stop once two exact matches are found in different locations (in paired-end mapping, multiple candidates should be kept because they can be resolved by the corresponding read mate). For contiguous seeds in a genome indexed either through winnowing schemes or uniformly spaced positions, these exact matches will necessarily be found using seeds from the first w read positions. Reducing the number of false-positive hits helps find exact matches faster, since the number of comparisons from other seeds that do not retrieve exact matches is also reduced. Another common heuristic is to skip seeds that retrieve more than c candidates, where c is a cut-off estimated from genome fingerprint statistics (e.g., an upper quantile of the fingerprint frequency distribution). In this case, for any value of c, the minimum expected hit rate index will contain more k-mers below the cut-off c (as it favors less frequent k-mers). This means that, in the minimum hit rate index, fewer seeds will be skipped, possibly resulting in more comparisons. The practice of skipping frequent k-mers is known to cause biases in the mappable regions of the genome, resulting in regions with zero reads even when the rest of the genome has high coverage [112]. For this reason, one major focus of future improvements in read mapping software is attaining efficient read mapping without resorting to arbitrary cut-offs for seeds, instead designing algorithms that minimize mapping effort while guaranteeing that a minimum exact match length is sufficient for correct mapping. This is especially challenging for WGBS. We empirically demonstrated that abismal's mapping speed is maximized when a combination of the hybrid alphabet and the minimum hit rate index is used, in agreement with theoretical expectations.

Chapter 6
Future opportunities in short bisulfite sequencing read mapping

So far we have constructed three strategies to improve filtration efficiency in bisulfite sequencing read mapping. All of these strategies focus on reducing the number of hits retrieved by fingerprints selected from reads and, consequently, the number of alignments necessary to map a read. Albeit essential, filtration is only one of many steps in a read mapping algorithm that must be efficient. If the alignment algorithm or the process of storing alignment results is inefficient or poorly implemented, the mapper will perform poorly despite efficient filtration. Certain choices of parameters may also lead to incorrect mapping locations.
In this chapter we discuss three important aspects of a fast and accurate short bisulfite sequencing mapping algorithm, specifically (i) which data structures should be used for a paired-end read mapping algorithm, (ii) efficient implementations of the Smith-Waterman algorithm and (iii) the choice of the scoring scheme used to align reads to candidates.

6.1 Strategies to map paired-end reads

In paired-end read mapping, the ends of DNA fragments that typically range from 100 to 1,000 bases are sequenced. Since fragments are small, the two resulting sequenced reads (also called “mates”) are expected to map to nearby locations in the genome and on opposite strands. If these conditions are met, we say the mates map concordantly. Pairs in which these conditions are not met but where each read maps uniquely with a high alignment score are called discordant mates. While these reads do not match standard expectations, they can provide valuable information about structural variation within the sample, such as DNA inversions, translocations and kilobase-scale insertions and deletions [113].

Unlike single-end mapping, the position with the best alignment score for each independent read is not necessarily the position that will be reported, because the best sum of concordant alignment scores is not necessarily the sum of the independent best scores at each end. An optimal paired-end mapping strategy must find, among all possible concordant candidates, the pair with the highest sum of alignment scores. Depending on the biological application and the underlying assumptions about the sequenced reads, the mapper should also decide, based on this resulting score, whether the pair should be reported or whether discordant alignments should be sought for each end independently. Discordant pairs arise from genuine biological differences between the read and the reference, but they are expected to comprise a small minority of read pairs; specifically, they should not exist if reads originate from a genome identical to the available reference. We set aside issues relating to these decisions and henceforth focus on the algorithms used to find concordant pairs.

There are mainly two strategies used by mapping algorithms to find concordant locations for paired-end reads (“mating”), although other strategies are possible: (i) map each mate independently, then look for concordant pairs, or (ii) map one mate and, for each of its candidate mapping locations, search for a location for the second mate in the concordant vicinity. Strategy (i) is implemented by BWA-MEM, BSMAP and HISAT-3N, among others; strategy (ii) is implemented by Bowtie2. We refer to strategy (i) as independent mating and strategy (ii) as vicinity mating. In this section, we describe the advantages and disadvantages of both strategies.

Independent mating. In this strategy, reads r1 and r2 are mapped independently. For each read, filtration is applied and the read is aligned to its filtered candidates; for each candidate, the alignment score, location and strand are stored. After both reads are aligned to their candidates, all pairs of candidates from the two ends are tested for concordance, and the concordant pair with the highest sum of alignment scores is selected. The simplest implementation of this approach is to store every mapped candidate in one array per end. After aligning candidates independently, the arrays can be sorted and iterated forward and backward to find all concordant pairs.
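A minimal sketch of that sweep, assuming each candidate is a (position, strand, score) tuple and that concordance here simply means opposite strands within max_insert bases; a real implementation must also respect the relative orientation of the mates.

    def best_concordant_pair(cands1, cands2, max_insert=1000):
        # Sort both candidate arrays by genome position.
        a = sorted(cands1, key=lambda c: c[0])
        b = sorted(cands2, key=lambda c: c[0])
        best, lo = None, 0
        for pos1, strand1, score1 in a:
            # Advance the left edge of the admissible window for mate 2.
            while lo < len(b) and b[lo][0] < pos1 - max_insert:
                lo += 1
            j = lo
            while j < len(b) and b[j][0] <= pos1 + max_insert:
                pos2, strand2, score2 = b[j]
                if strand1 != strand2:  # opposite strands: concordant
                    if best is None or score1 + score2 > best[0]:
                        best = (score1 + score2, pos1, pos2)
                j += 1
        return best

    r1 = [(1000, '+', 90), (50000, '+', 95)]
    r2 = [(1400, '-', 88), (70000, '-', 99)]
    print(best_concordant_pair(r1, r2))  # (178, 1000, 1400)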
This implementation is particularly inefficient in fingerprint-based filtration because small contiguous seeds often lead to many false-positive hits with very low alignment scores, which must then be linearly scanned in the mating step. This problem is aggravated in WGBS due to the reduced alphabet. Problems arising from too many false-positive hits can be circumvented by only keeping an alignment if (i) its score is above a certain threshold or (ii) the only available alignments have scores below the threshold. These criteria are heuristics that will lead to inaccuracies over a sufficiently large number of reads, specifically in cases where the best concordant alignment has a low score but the array is already full of discordant high-scoring alignments. Alternatively, this approach can be efficiently implemented using priority queue data structures, which provide the three necessary operations: retrieval of the score cut-off, insertion into the queue, and deletion of the lowest-scoring queue element.

[Figure 6.1: histograms of the number of concordant pairs found by chance when 100, 300, 1,000 or 10,000 position pairs are stored.]

Figure 6.1: Distribution of concordant pairs found by chance as a function of the number of position pairs stored using the independent mating strategy. Positions were uniformly sampled from a genome of size m = 3 × 10⁹. The number of sampled positions is equal to the priority queue size. Pairs of positions at most 1,000 bases apart were counted as concordant pairs.

Vicinity mating. The drawback of independent mating is the potentially large number of unnecessary comparisons that must be performed. If one mate has a single match in the genome and the other has thousands, the number of comparisons for the second mate can be narrowed by only comparing against the region that is concordant with the match of the first end. In vicinity mating, the candidates of one end (usually the one with the smaller total number of candidates retrieved from all seeds) are used as the basis to map candidates of the second end. Then, with prior information on the maximum insert size, the intersection between the second end's fingerprints and the fingerprints within the maximum insert size of each candidate's position is computed, and the second end is aligned to the positions within the insert limit that match that intersection.

Independent mating is the biologically accurate approach. The vicinity mating approach relies strongly on the assumption that reads are concordant with respect to the reference genome to which they are mapped. If genetic variation is considered, reads mapped discordantly to the reference genome may be the result of structural variation. In particular, kilobase-scale insertions and deletions can generate read pairs that originate from a small genome fragment but map discordantly to the reference genome. For these fragments, the vicinity mating strategy may favor sub-optimal concordant alignments with various mismatches and indels over a more likely pair of discordant exact matches. Conversely, the independent mating strategy may produce false-positive concordant pairs by chance if too many candidates are stored (Figure 6.1).
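A minimal sketch of the priority-queue idea described above, assuming we keep only the K highest-scoring candidates per end; Python's heapq is a min-heap, so its root is exactly the score cut-off and the lowest-scoring element is the one evicted on insertion (abismal's actual heap, described next, orders candidates by mismatch count instead).

    import heapq

    class CandidateHeap:
        def __init__(self, k=32):
            self.k, self.heap = k, []

        def cutoff(self):
            # Score a new candidate must beat once the heap is full.
            return self.heap[0][0] if len(self.heap) == self.k else None

        def insert(self, score, position):
            if len(self.heap) < self.k:
                heapq.heappush(self.heap, (score, position))
            elif score > self.heap[0][0]:
                # Delete the lowest-scoring element, then insert.
                heapq.heapreplace(self.heap, (score, position))

    h = CandidateHeap(k=3)
    for score, pos in [(80, 10), (95, 20), (60, 30), (90, 40)]:
        h.insert(score, pos)
    print(sorted(h.heap, reverse=True))  # [(95, 20), (90, 40), (80, 10)]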
The abismal approach for paired-end mapping. To mitigate false-positive concordant pairs, abismal allows the number of stored independent candidates to grow depending on the number of “good candidates” found through seeds. Initially, an array of size 32, organized as a heap, is used to store candidates. The root of the heap is located at the first element of the array and stores the maximum number of mismatches among the stored candidates. If the heap has fewer than 32 elements, no cut-off is used. Candidates are successively added to the heap. If a “good” candidate, defined as one with at most 10% mismatches to the read, is added to a heap that is itself full of good candidates, the heap is allowed to “overflow” and increase in size. The maximum heap size, pre-allocated before mapping reads, is 32,768. Note that the heap will only attain extremely large sizes if there are good hits across the entire genome (e.g., in cases of repeats or low-complexity sequences), in which case storing all these locations is necessary for subsequent mating with candidates from the other read.

6.2 Efficient implementations of the Smith-Waterman algorithm

The Smith-Waterman algorithm is the most efficient algorithm guaranteed to find the best local alignment between two sequences. In the context of read mapping, its quadratic complexity in the query length makes it slow in practice, especially in WGBS mappers where many alignments are necessary. Queries in large genomics databases suffer a similar drawback, which has motivated several proposed solutions to accelerate the algorithm. Faster approaches include a banded Smith-Waterman, which prohibits more than B consecutive indels for some fixed value of B, and hardware-specific instructions that fill several cells of the alignment matrix A in parallel. The banded alignment can be calculated in O(Bn) time for a read of length n, but if the optimal alignment has more than B consecutive indels, the banded alignment will have a sub-optimal score.

Typical implementations of Smith-Waterman fill each cell either row by row or column by column. Given the dependency of each cell on its adjacent left, top and top-left neighbors, this order makes it infeasible to fill matrix elements in parallel, as we cannot guarantee that all values needed to fill a given cell have already been calculated. Conversely, filling A through its minor diagonals allows cells within a given diagonal to be calculated independently. This independence can be exploited by moving groups of elements, one diagonal at a time, through single-instruction, multiple-data (SIMD) instructions in modern processors: each diagonal is loaded from memory into the processor, and the next diagonal is calculated as the maximum over three elements of the two previous diagonals plus the penalties for mismatches and indels. At best, very efficient SIMD implementations of Smith-Waterman divide the number of operations by up to 16 using 128-bit registers. If two sequences of length 100 are aligned, the number of operations to fill the entire matrix A is still at least 100²/16 = 625. This is substantially slower than less sensitive Hamming distance comparisons: for four-letter DNA sequences, encoding letters using two bits allows the same 128-bit register to count up to 64 matches/mismatches using logical operations, so the Hamming distance of two sequences of length 100 can be calculated with two register operations.
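The register trick can be imitated with arbitrary-precision integers; a minimal sketch, assuming the two-bit encoding A=00, C=01, G=10, T=11 (hardware would do the same with one XOR, one OR, one AND and a popcount per 64 bases).

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack(seq):
        # Pack the sequence two bits per base into a single integer.
        word = 0
        for base in seq:
            word = (word << 2) | CODE[base]
        return word

    def hamming(seq1, seq2):
        assert len(seq1) == len(seq2)
        # XOR leaves a nonzero two-bit field exactly at mismatching bases.
        x = pack(seq1) ^ pack(seq2)
        low_bits = int("01" * len(seq1), 2)  # low bit of each two-bit field
        mismatches = (x | (x >> 1)) & low_bits
        return bin(mismatches).count("1")    # popcount

    print(hamming("ACGTACGT", "ACGTTCGA"))   # 2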
The abismal algorithm leverages the speed of Hamming distance calculations to cope with the large number of hits from WGBS seeds. Abismal reduces the number of Smith-Waterman alignments by splitting the mapping algorithm into two steps. First, seeds are used and reads are compared to genome hits using the Hamming distance. Second, the set of candidates with the lowest Hamming distance (in single-end mapping) or the set of all concordant pairs (in paired-end mapping) is fully aligned to the read. By encoding read and reference letters using four bits packed into 64-bit integers, the Hamming distance in the first step can be counted sixteen nucleotides at a time, and the comparison stops if the number of mismatches exceeds 40% of the read length. The second step uses a banded Smith-Waterman algorithm that allows at most B consecutive indels, with band width B = min(d, 30), where d is the Hamming distance of the candidate. This is because, if there are d mismatches, the optimal alignment is guaranteed to have at most d edits and therefore cannot stray farther than d cells from the main diagonal. This dynamic adjustment of the band width leverages the speed of banded alignment while guaranteeing that the optimal alignment is found, except in extreme cases where a read has more than 30 consecutive indels relative to its optimal mapping location.
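A minimal sketch of the banded scoring pass with the (2, 3, 4) scheme discussed below, assuming the band half-width B is supplied as B = min(d, 30); cells outside the band are never filled, so the cost is O(Bn) rather than O(nm), and a production version would store only the band plus traceback information.

    def banded_sw(read, ref, B, M=2, X=3, D=4):
        n, m = len(read), len(ref)
        # A[i][j]: best local score ending at read[:i], ref[:j]; out-of-band
        # cells stay zero, which local alignment treats as a restart.
        A = [[0] * (m + 1) for _ in range(n + 1)]
        best = 0
        for i in range(1, n + 1):
            for j in range(max(1, i - B), min(m, i + B) + 1):
                match = M if read[i - 1] == ref[j - 1] else -X
                A[i][j] = max(0,
                              A[i - 1][j - 1] + match,  # (mis)match
                              A[i - 1][j] - D,          # deletion
                              A[i][j - 1] - D)          # insertion
                best = max(best, A[i][j])
        return best

    # One mismatch (d = 1), so B = min(1, 30) = 1 suffices: 7*2 - 3 = 11.
    print(banded_sw("ACGTACGT", "ACGAACGT", B=1))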
6.3 Choosing a local alignment scoring scheme

The universal strategy of short-read mapping algorithms is to filter candidates and align the read to each filtered candidate. Short-read sequencing technologies often introduce non-biological bases at the ends of reads. In Illumina sequencing machines, for instance, these can originate from (1) non-calibrated cameras at the start of the sequencing reaction, leading to biases at the 5' end of reads, (2) the exponential decrease in quality toward the 3' end of reads resulting from asynchrony between the identical molecules sequenced in parallel, and (3) sequencing adapters at either end of the read, among various other causes. Since biases often occur at the ends of the read, local alignment is an important step in read mapping, as it allows low-scoring read ends to be removed after mapping and disregarded in downstream analyses. Bases removed in local alignments are said to be soft-clipped. While originally devised to find conserved subsequences among large homologous biological sequences, local alignments are also extremely useful for identifying the non-biological portions of sequenced reads. The soft-clipping information is encoded in the mapper output. Failure to discard soft-clipped bases may cause downstream problems in methylation analysis. For instance, adapters are often methylated prior to addition to the DNA library, so Cs in adapters will likely be sequenced as Cs. If, by chance, these Cs map to reference Cs, methylation will be over-estimated if they map to a CpG, or bisulfite conversion will be under-estimated if they map to a non-CpG.

The soft-clipped portion of a query is the prefix or suffix for which the global alignment score against the target's corresponding prefix or suffix is at most zero. The length of the soft-clipped portions depends on the alignment score, which has important implications for falsely excluding biological bases. For instance, assume the match score is M = 1 and the mismatch score is −X = −5. If the first 4 bases of a query match the target, a single mismatch at the fifth base will soft-clip all 5 bases of the prefix. Conversely, if the match score is +1 and the mismatch score is −1, it is possible that, by chance alone, an adapter will have more matches than mismatches and be analyzed as biological bases. When hundreds of millions of reads are mapped, the resulting errors compound. Since datasets vary in their biases and adapter content, optimal alignment scores may vary between datasets.

Alignment scores are typically devised based on prior knowledge of nucleotide or amino acid divergence [114]. Given letters i and j, let P_ij be the probability that i is substituted by j, and let q_i and q_j be the frequencies of i and j, respectively. The score of a match between i and j is log(P_ij/(q_i q_j)), reflecting a log-likelihood ratio between the hypothesis of substitution and the hypothesis that these bases co-occur by chance [69]. The total alignment score is then the total probability of observing the optimal alignment under both the null and alternative hypotheses. Choices of alignment scores (directly or indirectly parametrized by P_ij, q_i and q_j) have been shown to yield significantly different results in database searches. For instance, the FASTA database uses a +5/−4 scoring scheme for matches and mismatches, leading to sensitive homology results that are up to 65% similar, whereas the +1/−2 scoring scheme of the NCBI BLASTN database leads to homology results that are 95% similar [114]. The choice of alignment scores ultimately controls the trade-off between sensitivity and specificity at the per-base level.

In the abismal algorithm described in Chapter 3, we select, given a scoring scheme, the best alignment score among all mapped candidates, but the decision to report a read is based on the edit distance, defined as the ratio between the number of edits (mismatches and indels) and the number of mapped bases (excluding soft-clipped bases). If the best alignment soft-clips too many bases, the resulting read must have fewer edits to be reported. We propose using this framework, along with public data, to devise optimal alignment scores for short bisulfite sequencing read mapping. First, we focus on a simple alignment score with fixed penalties for matches, mismatches and indels. Next, we discuss the issue of penalizing certain types of mismatches differently based on substitution frequencies and the underlying uneven frequency of bisulfite-converted letters.

Let M be the match score, X the mismatch penalty and D the indel penalty (we treat insertions and deletions symmetrically, so I = D). To find the values of M, X and D, we used five public paired-end datasets of different read lengths (Table 6.1) and quantified the alignment quality of concordant pairs. We impose two rules for mapped reads: (i) at least 90% of the read must align to the reference, that is, at most 10% of the read may be soft-clipped (e.g., assumed to be low-quality bases or adapters); and (ii) the number of edits (mismatches and indels) must be at most 5% of the mapped portion of the read, meaning we do not allow too many differences from the reference genome within a read. Under this framework, we vary M, X and D as integer values between 0 and 20, resulting in 21³ = 9,261 combinations of alignment scores. For each triplet (M, X, D), we map 1 million randomly sampled reads from each dataset using the given scoring scheme, and we choose the triplet that maximizes the number of correctly mapped nucleotides among paired-end reads that obey criteria (i) and (ii).
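A minimal sketch of the grid search loop; count_correct_bases is a hypothetical stand-in for the full map-and-evaluate step (map the sampled reads under (M, X, D) and count correctly mapped nucleotides among pairs satisfying criteria (i) and (ii)), and the toy objective below merely illustrates the shape of the search.

    from itertools import product

    def grid_search(dataset, count_correct_bases, lo=0, hi=20):
        # Exhaustively test all (hi - lo + 1)^3 scoring triplets.
        best_value, best_triplet = None, None
        for M, X, D in product(range(lo, hi + 1), repeat=3):
            value = count_correct_bases(M, X, D, dataset)
            if best_value is None or value > best_value:
                best_value, best_triplet = value, (M, X, D)
        return best_triplet, best_value

    # Toy stand-in objective (not real mapping): favors X = 1.5*M and
    # D = 2*M, with a slight preference for larger scores.
    def toy_objective(M, X, D, dataset):
        return -((1.5 * M - X) ** 2 + (2 * M - D) ** 2) + 0.01 * M

    print(grid_search("toy", toy_objective))  # ((10, 15, 20), 0.1)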
Results in Table 6.1 show that the optimal alignment scores revolve around near-multiples of (1, X, 2), where X is 1, 1.5 or 2. Most tested alignment scores attain at least 99.9% of the correctly mapped bases achieved by the top score. This suggests that X = 1.5 is an optimal compromise. Abismal uses the (2, 3, 4) scoring scheme as its default, leading to a near-optimal trade-off between correctly mapped bases and small soft-clipped portions. A similar grid search can be extended to more references and datasets, preferring the scheme that maximizes the combined number of correctly reported bases across species. Furthermore, a fourth dimension of the grid search can account for the separation of gap-open and gap-extension penalties. Affine and convex gap penalties define a parameter E as the cost of opening a gap in either the read or the reference. This penalty accounts for the biological observation that indels often occur consecutively, reflecting the biochemical addition or removal of multiple nucleotides simultaneously. A run of x consecutive indels has penalty E + xD; for example, with E = 5 and D = 2, a run of three indels costs 5 + 3 × 2 = 11, compared to 3 × (5 + 2) = 21 for three separately opened single-base gaps. In convex gap alignment methods, the penalty is E + x f(D), where f is a convex function of D. Different convex functions can also be incorporated into the search space to empirically devise the optimal scoring function for WGBS reads.

Table 6.1: Optimal alignment scores for some real datasets

dataset     read length   species        top score
Bernstein   100           H. sapiens     (12, 18, 19)
Meissner    100           H. sapiens     (12, 15, 18)
Tycko       144           H. sapiens     (10, 18, 19)
Decato      90            M. musculus    (10, 14, 18)
Kamstra     150           D. rerio       (9, 18, 20)

6.4 Constructing a substitution scoring matrix

In the previous section, we used a brute-force grid search to find the best scoring scheme for real datasets. The search iterated over integer values for matches, mismatches and indels from 0 to 20. This simple scoring scheme penalizes every type of mismatch equally whereas, in reality, some substitutions are more or less common than others and should possibly be penalized differently. For instance, transitions (purine-purine or pyrimidine-pyrimidine substitutions) are more common than transversions (purine-pyrimidine or pyrimidine-purine substitutions). Score values should reflect the probability of observing the substitution relative to the probability of observing the co-occurrence of the two nucleotides by chance.

A substitution scoring matrix is a 4 × 4 matrix X, where X_ij represents the score of matching read nucleotide i with reference nucleotide j. In most nucleotide alignment algorithms, read and reference are treated interchangeably, meaning that X is a symmetric matrix. In WGBS, bisulfite bases (Ts or As) appear more frequently, whereas the bases converted through bisulfite treatment (Cs or Gs) are extremely rare. For this reason, it may be advantageous to score substitutions in a way that reflects their underlying statistical properties.

A common approach to constructing substitution scoring matrices for WGS nucleotide sequences is through the log-odds ratio [115, 116]. A set of K similar sequences of equal length N is placed one on top of the other (similar to a multiple sequence alignment without gaps). In each column, the co-occurrences of each pair of nucleotides are counted; there are τ = (K choose 2) × N such pairs. Let Y_ij be the number of co-occurrences of nucleotides i and j. The counts are converted to probabilities through a 4 × 4 matrix P, where P_ij = Y_ij/τ.
To construct the scores, the nucleotide frequencies in the reads and in the reference must also be considered. Let q = (q_A, q_C, q_G, q_T) be the relative background frequencies of nucleotides in the genome. For example, q = (0.29, 0.21, 0.21, 0.29) for most mammalian genomes, but this can be quantified objectively from the reference sequence and stored in the genome index. The probability that nucleotides i and j co-occur by chance is q_i q_j. The scoring matrix X is then constructed through the relationship

X_ij = log(P_ij/(q_i q_j)).

Note that both P and X are symmetric, because this approach makes no distinction between read and reference sequences. A similar method can be used to construct a substitution scoring matrix for WGBS reads. We must, however, distinguish between the read and reference sequences, as the two have vastly different relative nucleotide frequencies. We describe an adjustment of the approach as follows. Consider a set of K WGBS reads of length N for which the true mapping location is assumed known. Construct the non-symmetric matrix Y, where Y_ij counts the number of co-occurrences of read nucleotide i with reference nucleotide j. The total count is τ = Σ_ij Y_ij = KN, and the matrix P of probabilities is again defined by P_ij = Y_ij/τ. Next, we distinguish the background frequencies q_r of read nucleotides from the background frequencies q_t of reference nucleotides (recall that reads are bisulfite-converted but the reference is not). Let q_r = (q_rA, q_rC, q_rG, q_rT) and, similarly, q_t = (q_tA, q_tC, q_tG, q_tT). The scoring matrix is then given by

X_ij = log(P_ij/(q_ri q_tj)).    (6.1)

To illustrate the asymmetry of scoring matrices for WGBS reads, consider the values of X_CT and X_TC. First, note that q_r and q_t differ significantly. For the human genome, q_t = (0.29, 0.21, 0.21, 0.29). Conversely, q_r depends on the underlying methylation level of the dataset and on the bisulfite conversion rate, both of which can be estimated from the reads (Figure 2.1). Assume, for instance, q_r = (0.29, 0.01, 0.21, 0.49), reflecting common nucleotide frequencies of T-rich reads. The value of P_TC should be close to q_rT q_tC, since most Ts will match either a C or a T in the genome with probability proportional to the background frequencies. Conversely, the value of P_CT reflects the frequency with which a C simultaneously remains unconverted and mutates into a T, which is much lower than q_rC q_tT. Assuming a liberal 10% probability that a methylated C mutates into a T, this implies X_CT = log(0.1), whereas X_TC ≈ 1. Also note that, since methylated CpGs are biologically more prone to mutation, this approach would likely penalize a C→T mismatch less than, for example, an A→T mismatch.
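A minimal sketch of equation (6.1), with toy counts chosen to echo the X_CT versus X_TC discussion above; the Y matrix and frequency vectors here are illustrative, not measured from real data.

    from math import log

    def scoring_matrix(Y, q_read, q_ref):
        # P_ij = Y_ij / tau; X_ij = log(P_ij / (q_read_i * q_ref_j)), eq. (6.1).
        tau = sum(sum(row) for row in Y)
        return [[log((Y[i][j] / tau) / (q_read[i] * q_ref[j]))
                 for j in range(4)] for i in range(4)]

    # Rows and columns ordered A, C, G, T. Read C over reference T (an
    # unconverted C atop a reference T) is made deliberately rare.
    Y = [[2600,  100,  200,  100],   # read A
         [  10,   90,    1,    2],   # read C
         [ 150,  100, 1800,   50],   # read G
         [ 300, 2000,  100, 2500]]   # read T
    q_read = (0.29, 0.01, 0.21, 0.49)
    q_ref = (0.29, 0.21, 0.21, 0.29)
    X = scoring_matrix(Y, q_read, q_ref)
    print(round(X[1][3], 2), round(X[3][1], 2))  # X_CT != X_TC: asymmetric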
Strategies developed to construct human-mouse alignment scores can similarly be used to construct a dataset-specific substitution scoring matrix [116]. The human-mouse approach retains sequences whose pairwise edit distances fall between a minimum and a maximum threshold, discarding uninformative exact matches to the reference as well as reads that map very poorly. For this initial selection of sequences, a simple scoring scheme of (M, X, D) = (1, 1, 1) can be used. The scores are then re-calculated based on equation (6.1), and the procedure can be repeated until the scoring matrix converges. We note that, although a proof of convergence for this procedure is not known, several real datasets converge rapidly for various starting scoring schemes [117]. Stopping criteria can also be imposed if too many iterations are necessary for convergence, in which case the initial alignment score is used instead.

Biases in mapping bisulfite sequencing bases. While accurate alignment scores can be inferred from nucleotide frequencies in the reads and the reference, a major concern in bisulfite read mapping is potential bias in methylation estimates under a given scoring scheme. Consider a simple scoring scheme with +1 for matches and −1 for mismatches and indels. If Cs in reads are only allowed to match Cs in the reference, reads with poorer bisulfite conversion are more likely to map uniquely due to the additional letter [118], which may bias methylation estimates if the read also contains CpGs. Interestingly, however, we often observe the opposite trend, where reads with less successful conversion are also less likely to map (Figure 6.2). This may be due, among other reasons, to (i) contamination from DNA that was not bisulfite-converted, (ii) reads with high adapter content, (iii) biases in the sequence patterns that are more or less likely to be successfully converted, or (iv) a higher probability that reads with more unconverted bases contain more mismatches and fail the edit distance cut-off. While WGBS mappers must adapt their scoring systems to avoid biases in methylation estimates, these adjustments can also be made downstream, where reads with too many unconverted bases can be excluded from the analysis or down-weighted when estimating CpG methylation.

[Figure 6.2: for each dataset, overlaid histograms (log frequency) of the number of unconverted bisulfite bases per read in the input versus the mapped reads; panels: Bernstein (H. sapiens), Meissner (H. sapiens), Tycko (H. sapiens), Gokhman (P. troglodytes), Decato (M. musculus), Passaro (M. musculus), Kamstra (D. rerio), Lee (G. gallus).]

Figure 6.2: Comparison between the distribution of cytosines in input reads and the distribution of cytosines in mapped (unique or ambiguous) reads for various real datasets (Table 3.1).

6.5 Concluding remarks

This chapter provided an overview of additional steps, beyond filtration, that must be properly rationalized and efficiently implemented to guarantee speed and output correctness in a read mapping algorithm. For paired-end reads, mapping read mates independently and subsequently finding concordant pairs is preferable, as it accounts for possible structural variation relative to the reference genome. Implementations of the independent mating approach must avoid false-positive concordant pairs that arise by chance. A solution is to keep a small number of high-scoring candidates for each end and, to account for repetitive elements, allow this number to increase if too many high-scoring candidates are found. Abismal keeps a max-heap of 32 candidates per end, but if candidates with more than 90% matches to the read are found, the heap is allowed to “overflow”.
If more than 32,768 candidates are found for either end, mating is not attempted and the read is treated as ambiguous (although it is reported as unmapped and no location is provided). Reads are aligned to candidates using the Smith-Waterman algorithm. Albeit quadratic in the read length, the algorithm can be accelerated using SIMD instructions that divide the number of operations by up to 16. Given the large number of alignments in WGBS, however, even efficient implementations are prohibitive. The same SIMD operations can be used to calculate Hamming distances between the read and each candidate hit much faster, providing scalability to the large number of hits retrieved for WGBS reads. Abismal only aligns candidates whose Hamming distance is below 40% of the read length, and the Hamming distance also bounds the band width for the read, yielding faster alignments through the banded Smith-Waterman algorithm.

Alignment scores can significantly affect how mapped reads are interpreted downstream, including which regions of the read are assumed to be biological. Alignment scores comprise substitution scoring matrices and indel penalties, possibly with separate gap-open and gap-extension scores. The substitution scores can be estimated through the log-odds ratio, but certain scoring choices can cause downstream methylation biases because reads with less successful bisulfite conversion are more likely to map uniquely. The indel penalty can be estimated using real data by testing many combinations of alignment scores and choosing the one that maximizes the number of mapped nucleotides while only accepting reads whose nucleotides are predominantly matches to the reference. Abismal uses a scoring scheme with M = 2, X = 3 and D = 4, and only allows unconverted bases to match the same letter in the reference. Empirically, reads with fewer converted bases are less likely to map.

Chapter 7
Conclusions

The overarching goal of this dissertation is to reduce the number of alignments necessary to map a bisulfite sequencing read and, consequently, to improve the speed of bisulfite sequencing mapping software. This is attained through novel encoding strategies for bisulfite-converted DNA sequences and novel methods to index the reference genome, justified by statistical properties of nucleotide frequencies in the DNA sequences of most eukaryotes before and after bisulfite conversion.

The two-letter alphabet that distinguishes purines from pyrimidines is an efficient encoding function for bisulfite-converted reads. It leverages the information loss due to bisulfite treatment by analyzing reads in an alphabet that is invariant to the treatment and in which both letters have identical frequencies. Although not the main motivation for this encoding, a significant advantage of the two-letter alphabet is that seeds are robust to transitions relative to their optimal alignment location. Transitions relative to the reference genome are the most common type of observed substitution in nearly all organisms [119], due to the similar biochemical composition of the two purines (A and G) and of the two pyrimidines (C and T). Whole-genome sequencing databases have previously used explicit two-letter alphabet seeds to account for these frequent mutations [120, 121, 122].

The two-letter alphabet also has theoretical advantages when we assign a numerical value (fingerprint) to k-mers in the genome. Two-letter fingerprints require one bit per letter.
If fingerprints take values in a set Ω, we can use k-mers of size ⌊log₂|Ω|⌋, and we expect their frequencies to be closer to uniform than fingerprints under the three-letter alphabet, where k-mers are of size ⌊log₃|Ω|⌋ and the bisulfite base represents half of the observed bases. Uniformly represented fingerprints minimize the expected number of hits in the genome when seeds are selected from reads. The two-letter alphabet encoding attains the theoretical minimum hit rate for infinite i.i.d. sequences, and several real genomes have fingerprint frequency distributions similar to those of i.i.d. genomes. Where genomes have low-entropy subsequences in two letters, such as consecutive runs of poly-purines or poly-pyrimidines, the three-letter alphabet can be used locally for higher specificity in read mapping.

A major advantage of the two-letter alphabet is that calculating k-mer fingerprints requires only logical operations (bit shifts and ORs), unlike the modular arithmetic needed to convert k-mers to numbers in base 3. This significantly accelerates the encoding of k-mers as fingerprints prior to filtration. These simple logical operations also provide opportunities for the development of dedicated hardware for read mapping algorithms. Field-programmable gate arrays (FPGAs) have been successful in accelerating various steps of read mapping, such as the Smith-Waterman algorithm [123] and the estimation of bounds on the edit distance between a read and a reference sequence [124]. Research in computer architecture for genome analysis, and in algorithms that leverage the speed and parallelism of computer hardware, is at the forefront of bioinformatics research [125].

The abismal software tool is a proof of concept of the theoretical contributions of this dissertation. Mapping large WGBS datasets is feasible in a small number of CPU hours and on nearly all 64-bit devices. The ease of bisulfite sequencing data analysis provides opportunities to construct larger epigenetic databases for cross-species and cross-phenotype studies. Such databases are a natural progression of the massive amounts of available biological data. For instance, querying cell phenotypes based on gene expression is now common with the availability of single-cell atlases of complex organisms [126, 127]. Similarly, we expect large single-cell resolution methylome databases to emerge as we improve the accuracy of low-input bisulfite sequencing library construction methods [83] and incorporate these methods into single-cell multi-omics technologies at scale [128]. Just as sequence databases became a necessary tool in modern molecular biology [69], we expect multi-omics phenotype databases to be central to biological research in the near future. Efficient mapping of genome sequences will likely remain at the core of the algorithmic efforts behind these “biological search engines”, not only to produce new entries for the databases but also to query them with high speed and accuracy.
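As a closing illustration of these logical operations, a minimal sketch of a rolling two-letter fingerprint (the bit assignment and function name here are ours): each new base contributes one bit via a shift, an OR and a mask, and the resulting fingerprint is invariant to the bisulfite C-to-T conversion.

    PURINE_PYRIMIDINE = {"A": 0, "G": 0, "C": 1, "T": 1}

    def two_letter_fingerprints(seq, k):
        # Yield the k-bit fingerprint of every k-mer using only bit operations.
        mask = (1 << k) - 1
        fp = 0
        for pos, base in enumerate(seq):
            fp = ((fp << 1) | PURINE_PYRIMIDINE[base]) & mask
            if pos >= k - 1:
                yield fp

    print(list(two_letter_fingerprints("ACGTCA", 3)))  # [2, 5, 3, 6]
    print(list(two_letter_fingerprints("ATGTTA", 3)))  # same: Cs read as Ts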
[65] William R Pearson and David J Lipman. Improved tools for biological sequence comparison. Pro- ceedings of the National Academy of Sciences, 85(8):2444–2448, 1988. 100 [66] Mengyao Zhao, Wan-Ping Lee, Erik P Garrison, and Gabor T Marth. SSW library: an SIMD Smith- Waterman C/C++ library for use in genomic applications. PloS One, 8(12):e82138, 2013. [67] Andrzej Wozniak. Using video-oriented instructions to speed up sequence comparison. Bioinformat- ics, 13(2):145–150, 1997. [68] Torbjørn Rognes. Faster smith-waterman database searches with inter-sequence simd parallelisation. BMC bioinformatics, 12(1):1–11, 2011. [69] Stephen F Altschul. Amino acid substitution matrices from an information theoretic perspective. Journal of Molecular Biology, 219(3):555–565, 1991. [70] Ricardo A Baeza-Yates and Chris H Perleberg. Fast and practical approximate string matching. In Annual Symposium on Combinatorial Pattern Matching, pages 185–192. Springer, 1992. [71] Uri Keich, Ming Li, Bin Ma, and John Tromp. On spaced seeds for similarity search. Discrete Applied Mathematics, 138(3):253–263, 2004. [72] Bin Ma, John Tromp, and Ming Li. PatternHunter: faster and more sensitive homology search. Bioinformatics, 18(3):440–445, 2002. [73] Serghei Mangul, Harry Taegyun Yang, Nicolas Strauli, Franziska Gruhl, Hagit T Porath, Kevin Hsieh, Linus Chen, Timothy Daley, Stephanie Christenson, Agata Wesolowska-Andersen, et al. ROP: dump- ster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues. Genome Biology, 19(1):1–12, 2018. [74] Felix Krueger and Simon R Andrews. Bismark: a flexible aligner and methylation caller for Bisulfite- Seq applications. Bioinformatics, 27(11):1571–1572, 2011. [75] Brent S Pedersen, Kenneth Eyring, Subhajyoti De, Ivana V Yang, and David A Schwartz. Fast and accurate alignment of long bisulfite-seq reads. arXiv preprint arXiv:1401.1129, 2014. [76] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4):357, 2012. [77] Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997, 2013. 101 [78] Michael Burrows and David Wheeler. A block-sorting lossless data compression algorithm. In Digital SRC Research Report. Citeseer, 1994. [79] Yuanxin Xi and Wei Li. Bsmap: whole genome bisulfite sequence mapping program. BMC bioinfor- matics, 10(1):1–9, 2009. [80] Haifeng Chen, Andrew D Smith, and Ting Chen. WALT: fast and accurate read mapping for bisulfite sequencing. Bioinformatics, 32(22):3507–3509, 2016. [81] Mark Kirkpatrick. How and why chromosome inversions evolve. PLoS Biology, 8(9), 2010. [82] Richard M Karp and Michael O Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987. [83] Fumihito Miura, Yukiko Shibata, Miki Miura, Yuhei Sangatsuda, Osamu Hisano, Hiromitsu Araki, and Takashi Ito. Highly efficient single-stranded DNA ligation technique improves low-input whole- genome bisulfite sequencing by post-bisulfite adaptor tagging. Nucleic Acids Research, 47(15):e85– e85, 2019. [84] David Brocks, Christopher R Schmidt, Michael Daskalakis, Hyo Sik Jang, Nakul M Shah, Daofeng Li, Jing Li, Bo Zhang, Yiran Hou, Sara Laudato, et al. DNMT and HDAC inhibitors induce cryptic transcription start sites encoded in long terminal repeats. Nature Genetics, 49(7):1052–1060, 2017. [85] Sergei A Manakov, Dubravka Pezic, Georgi K Marinov, William A Pastor, Ravi Sachidanandam, and Alexei A Aravin. 
MIWI2 and MILI have differential effects on piRNA biogenesis and DNA methylation. Cell Reports, 12(8):1234–1243, 2015. [86] Alex de Mendoza, Ryan Lister, and Ozren Bogdanovic. Evolution of DNA methylome diversity in eukaryotes. Journal of Molecular Biology, 432(6):1687–1705, 2020. [87] Meng Zhang, Feng-Bin Yan, Fang Li, Ke-Ren Jiang, Dong-Hua Li, Rui-Li Han, Zhuan-Jan Li, Rui- Rui Jiang, Xiao-Jun Liu, Xiang-Tao Kang, et al. Genome-wide DNA methylation profiles reveal novel candidate genes associated with meat quality at different age stages in hens. Scientific Reports , 7:45564, 2017. [88] Yadollah Shahryary, Aikaterini Symeonidi, Rashmi R Hazarika, Johanna Denkena, Talha Mubeen, Brigitte Hofmeister, Thomas van Gurp, Maria Colom´ e-Tatch´ e, Koen Verhoeven, Gerald Tuskan, et al. 102 AlphaBeta: Computational inference of epimutation rates and spectra from high-throughput DNA methylation data in plants. bioRxiv, page 862243, 2020. [89] Michael J Ziller, Hongcang Gu, Fabian M¨ uller, Julie Donaghey, Linus T-Y Tsai, Oliver Kohlbacher, Philip L De Jager, Evan D Rosen, David A Bennett, Bradley E Bernstein, et al. Charting a dynamic DNA methylation landscape of the human genome. Nature, 500(7463):477–481, 2013. [90] David Gokhman, Nadav Mishol, Marc de Manuel, David de Juan, Jonathan Shuqrun, Eran Meshorer, Tomas Marques-Bonet, Yoel Rak, and Liran Carmel. Reconstructing denisovan anatomy using DNA methylation maps. Cell, 179(1):180–192, 2019. [91] Saher Sue Hammoud, Diana HP Low, Chongil Yi, Douglas T Carrell, Ernesto Guccione, and Bradley R Cairns. Chromatin and transcription transitions of mammalian adult germline stem cells and spermatogenesis. Cell Stem Cell, 15(2):239–253, 2014. [92] Masaki Yagi, Mio Kabata, Akito Tanaka, Tomoyo Ukai, Sho Ohta, Kazuhiko Nakabayashi, Masahito Shimizu, Kenichiro Hata, Alexander Meissner, Takuya Yamamoto, et al. Identification of distinct loci for de novo DNA methylation by DNMT3A and DNMT3B during mammalian development. Nature Communications, 11(1):1–14, 2020. [93] Benjamin E Decato, Jorge Lopez-Tello, Amanda N Sferruzzi-Perri, Andrew D Smith, and Matthew D Dean. DNA methylation divergence and tissue specialization in the developing mouse placenta. Molecular Biology and Evolution, 34(7):1702–1712, 2017. [94] Massimiliano Manzo, Jo¨ el Wirz, Christina Ambrosi, Rodrigo Villase˜ nor, Bernd Roschitzki, and Tun- cay Baubec. Isoform-specific localization of DNMT3A regulates DNA methylation fidelity at bivalent CpG islands. The EMBO journal, 36(23):3421–3434, 2017. [95] Mathilde Dura, Aur´ elie Teissandier, M´ elanie Armand, Joan Barau, Cl´ ementine Lapoujade, Pierre Fouchet, Lorraine Bonneville, Mathieu Schulz, Michael Weber, Laura G Baudrin, et al. DNMT3A- dependent DNA methylation is required for spermatogonial stem cells to commit to spermatogenesis. Nature Genetics, pages 1–12, 2022. [96] Jorke H Kamstra, Selma Hurem, Leonardo Martin Martin, Leif C Lindeman, Juliette Legler, Deborah Oughton, Brit Salbu, Dag Anders Brede, Jan Ludvig Lyche, and Peter Alestr¨ om. Ionizing radiation 103 induces transgenerational effects of DNA methylation in zebrafish. Scientific Reports , 8(1):1–13, 2018. [97] Ksenia Skvortsova, Katsiaryna Tarbashevich, Martin Stehling, Ryan Lister, Manuel Irimia, Erez Raz, and Ozren Bogdanovic. Retention of paternal DNA methylome in the developing zebrafish germline. Nature Communications, 10(1):1–13, 2019. [98] Isac Lee, Bejan A Rasoul, Ashton S Holub, Alannah Lejeune, Raymond A Enke, and Winston Timp. 
Whole genome DNA methylation sequencing of the chicken retina, cornea and brain. Scientific Data , 4:170148, 2017. [99] Lenin Yong-Villalobos, Sandra Isabel Gonz´ alez-Morales, Kazimierz Wrobel, Dolores Guti´ errez- Alanis, Sergio Alan Cervantes-Per´ ez, Corina Hayano-Kanashiro, Araceli Oropeza-Aburto, Alfredo Cruz-Ram´ ırez, Octavio Mart´ ınez, and Luis Herrera-Estrella. Methylome analysis reveals an impor- tant role for epigenetic changes in the regulation of the Arabidopsis response to phosphate starvation. Proceedings of the National Academy of Sciences, 112(52):E7293–E7302, 2015. [100] Shuhui Bian, Yu Hou, Xin Zhou, Xianlong Li, Jun Yong, Yicheng Wang, Wendong Wang, Jia Yan, Boqiang Hu, Hongshan Guo, et al. Single-cell multiomics sequencing and analyses of human col- orectal cancer. Science, 362(6418):1060–1063, 2018. [101] Fan Guo, Liying Yan, Hongshan Guo, Lin Li, Boqiang Hu, Yangyu Zhao, Jun Yong, Yuqiong Hu, Xi- aoye Wang, Yuan Wei, et al. The transcriptome and DNA methylome landscapes of human primordial germ cells. Cell, 161(6):1437–1452, 2015. [102] Lizhi Leng, Jiya Sun, Jinrong Huang, Fei Gong, Ling Yang, Shuoping Zhang, Xuye Yuan, Fang Fang, Xun Xu, Yonglun Luo, et al. Single-cell transcriptome analysis of uniparental embryos reveals parent- of-origin effects on human preimplantation development. Cell Stem Cell, 25(5):697–712, 2019. [103] Rasko Leinonen, Hideaki Sugawara, Martin Shumway, and International Nucleotide Se- quence Database Collaboration. The sequence read archive. Nucleic Acids Research, 39(suppl 1): D19–D21, 2010. [104] Felix Krueger. Sherman - bisulfite-treated read FASTQ simulator, 2017. URL http://www. bioinformatics.babraham.ac.uk/projects/sherman. 104 [105] Guilherme de Sena Brandine and Andrew D Smith. Fast and memory-efficient mapping of short bisulfite sequencing reads using a two-letter alphabet. NAR Genomics and Bioinformatics, 3(4): lqab115, 2021. [106] Nansheng Chen. Using Repeat Masker to identify repetitive elements in genomic sequences. Current Protocols in Bioinformatics, 5(1):4–10, 2004. [107] Heng Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14):2103–2110, 2016. [108] Jingwen Ren and Mark JP Chaisson. lra: A long read aligner for sequences and contigs. PLOS Computational Biology, 17(6):e1009078, 2021. [109] Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3):1–12, 2014. [110] Yaron Orenstein, David Pellow, Guillaume Marc ¸ais, Ron Shamir, and Carl Kingsford. Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLoS Com- putational Biology, 13(10):e1005777, 2017. [111] P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 390–398, 2000. [112] Chirag Jain, Arang Rhie, Haowen Zhang, Claudia Chu, Brian P Walenz, Sergey Koren, and Adam M Phillippy. Weighted minimizer sampling improves long read mapping. Bioinformatics, 36(Supplement 1):i111–i118, 2020. [113] Ken Chen, John W Wallis, Michael D McLellan, David E Larson, Joelle M Kalicki, Craig S Pohl, Sean D McGrath, Michael C Wendl, Qunyuan Zhang, Devin P Locke, et al. BreakDancer: an algo- rithm for high-resolution mapping of genomic structural variation. Nature Methods, 6(9):677–681, 2009. [114] Sean R Eddy. Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology, 22(8):1035–1036, 2004. 
Appendix A

The abismal algorithm

A.1 Indexing a genome

Indexing of an input genome sequence of length $m$ is performed by grouping positions based on fingerprints of the $k$-mer sequence starting at each position. We define the $k$-mer sequence of a position $i$, with $0 \le i \le m - k$, to be the two-letter sequence of the $k$-mer starting at position $i$ and ending at position $i + k - 1$. A fixed value of $k$ is assumed given; it is set to $k = 256$ in the abismal software tool, which is longer than the longest Illumina read currently possible and can be adjusted if read lengths increase. We seek to build a genome index, defined as an ordered subset of $S = \{0, \ldots, m-1\}$, with positions grouped together based on identical $k$-mer sequences (we add $k - 1$ random nucleotides to the end of the sequence so every position has a $k$-mer). The abismal index is built in the following steps, which are described in more detail in the subsequent paragraphs.
1. counting fingerprint sequences in two and three letters
2. partitioning $S$ into sets $S_2$ and $S_3$, the positions encoded in two and three letters, respectively
3. construction of the minimum hit rate index
4. creating direct address tables
5. final sorting of sets $S_2$ and $S_3$

First step: counting fingerprints. The indexing algorithm begins by counting the frequency of fingerprints under the two- and three-letter encodings. This is done by setting two $k$-mer sizes, $k_2$ and $k_3$; we use $k_2 = 25$ and $k_3 = 16$ in abismal. For position $i \in S$, let $(v, v_t, v_a)$ be the fingerprint triplet of position $i$, with $0 \le v < 2^{k_2}$ and $0 \le v_t, v_a < 3^{k_3}$. We allocate three 32-bit vectors $N$, $N_t$ and $N_a$, of sizes $2^{k_2}$, $3^{k_3}$ and $3^{k_3}$, respectively, that store the fingerprint frequencies over every position in the genome. These are used in the second step to create the hybrid partitioning.

Second step: partitioning the genome. In the second step, sets $S_2$ and $S_3$ are constructed as partitions of $S$ according to the algorithm described in section 4.2. For position $i \in S$ with fingerprint triplet $(v, v_t, v_a)$, if $2 N(v) \le N_t(v_t) + N_a(v_a)$, then $i$ is added to $S_2$; otherwise $i$ is added to $S_3$. The partitioning is stored in a vector $K$ of $m$ 1-bit booleans, where $K(i) = 1$ if $i \in S_2$ and $K(i) = 0$ otherwise.

Third step: construction of the minimum hit rate index. The minimum hit rate index is constructed according to section 5.2.2, using a window size of $w = 20$. The cost associated with indexing position $i$ with fingerprint triplet $(v, v_t, v_a)$ is $2 N(v)$ if $K(i) = 1$ and $N_t(v_t) + N_a(v_a)$ if $K(i) = 0$. A double-ended queue of size $w$ stores up to the last $w$ minimum hit rate costs, for positions $i - w$ to $i - 1$, with the position at the back of the queue lower than the position at the front (a code sketch of this procedure is given below, after the fourth step). Insertions into the queue work as follows. When the index cost $Z(\mathrm{Opt}(i))$ of prefix $G[0, \ldots, i]$ is calculated, it is compared to the element at the front of the queue, at position $j$. If $Z(\mathrm{Opt}(i)) < Z(\mathrm{Opt}(j))$, the front of the queue is popped: we will never select $j$ over $i$ to extend any subsequent position, as $i$ is part of every subsequent window that also contains $j$. This procedure guarantees that the queue costs are sorted from back to front, so the optimal prefix to which $i$ should be appended is at the back of the queue. Since every element is inserted and deleted at most once, this leads to a linear time algorithm to construct the minimum hit rate index. Indexing is done in blocks of size $10^7$, since each subproblem requires a 64-bit cost and a 32-bit pointer to the position indexed prior to $i$ for future tracebacks, so the algorithm uses $96 \times 10^7$ bits of temporary extra storage. At the end of the algorithm, the subset $S_Z(w) \subseteq S$ contains approximately $m/w$ positions to be stored in the genome index.

Fourth step: creating direct address tables. In the fourth step, a counting sort algorithm groups genome index positions into three index arrays $B$, $B_t$ and $B_a$, with $B \cup B_t = B \cup B_a = S_Z(w)$. Initially, arrays $N$, $N_t$ and $N_a$ are converted into their cumulative sums. A second linear pass through the genome populates arrays $B$, $B_t$ and $B_a$. For position $i$ with fingerprint triplet $(v, v_t, v_a)$, if $K(i) = 1$, then $i$ is added to $B$ at position $N(v)$, after which $N(v)$ is decremented. Otherwise, $i$ is added to both $B_t$ and $B_a$, at positions $N_t(v_t)$ and $N_a(v_a)$, respectively, after which both $N_t(v_t)$ and $N_a(v_a)$ are decremented.
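The sliding-window recurrence in the third step is a classic monotonic-deque minimum. The sketch below is a minimal illustration of the description above, not the abismal source: the names are illustrative, traceback pointers and block-wise processing are omitted, and positions in the first window are allowed to start the index with no predecessor.

```cpp
// Minimal sketch of the minimum hit rate recurrence using a monotonic deque:
// Opt(i) = cost(i) + min{ Opt(j) : i - w <= j <= i - 1 }, computed in O(m).
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

std::vector<uint64_t>
min_hit_rate_costs(const std::vector<uint64_t> &cost, const std::size_t w) {
  const std::size_t m = cost.size();
  std::vector<uint64_t> opt(m);
  std::deque<std::size_t> q; // positions; Opt increases from back to front
  for (std::size_t i = 0; i < m; ++i) {
    // evict the back once it falls outside the window [i - w, i - 1]
    while (!q.empty() && q.back() + w < i)
      q.pop_back();
    // the optimal predecessor is at the back of the queue; the first w
    // positions may start the index, so they take no predecessor
    opt[i] = cost[i] + (i < w ? 0 : opt[q.back()]);
    // pop dominated entries: i beats them in every later window they share
    while (!q.empty() && opt[i] < opt[q.front()])
      q.pop_front();
    q.push_front(i);
  }
  return opt;
}
```

Because each position enters and leaves the deque at most once, the whole pass is linear in $m$, matching the complexity claimed above.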
The resulting arrays $B$, $B_t$ and $B_a$ contain positions with identical fingerprints in consecutive entries. Furthermore, fingerprints of positions in these arrays are sorted in increasing numerical order. This allows arrays $N$, $N_t$ and $N_a$ to be used as direct look-up tables during mapping, when fingerprints are retrieved from reads.

Fifth step: final sorting of equal fingerprint positions. In the final step, arrays $B$, $B_t$ and $B_a$ are sorted based on longer $k$-mer sequences. In particular, for fingerprint $v$, positions $B(N(v))$ to $B(N(v+1) - 1)$ share fingerprint $v$ based on $k$-mers of size $k_2 = 25$; these positions are further sorted based on $k$-mer positions $k_2 + 1 = 26$ to $k = 256$. This allows maximum exact matches to be retrieved using longer seeds through successive binary searches (similar to a suffix array search). The same is done in arrays $B_t$ and $B_a$, from positions $k_3 + 1 = 17$ to $k = 256$.

The final abismal index. The final abismal index is composed of vectors $N$, $N_t$, $N_a$, $B$, $B_t$ and $B_a$, together with a copy of the reference genome stored using four bits per nucleotide (Table A.1). The total size of the index vectors is given, in bits, by

$$\text{index size} = 32 \left( 2^{k_2} + 2 \cdot 3^{k_3} + \frac{m}{w} + |B_t| \right).$$

For the human genome, with $m \approx 3 \times 10^9$, the index vectors and the packed genome copy together occupy approximately 2.8 GB. This is approximately the amount of memory needed to map reads to this genome.

A.2 Mapping reads to an indexed genome

Reads are mapped in two steps: (i) the Hamming distance step collects likely candidates based on the number of mismatches, and is further subdivided into a sensitive and a specific step; (ii) the local alignment step performs banded Smith-Waterman alignments between the read and the candidates collected in the first step. In what follows we use $k$ to refer to the $k$-mer size of contiguous subsequences used for filtration in the first step.

A.2.1 Selecting hits through Hamming distance comparison

The Hamming distance step retrieves seeds for $k$-mers of varying sizes from the read, both under two and three letters. The index is used to associate seeds with indexed genome positions. Seeds of both small and large sizes are used to account both for cases in which reads map to repetitive elements and for cases in which reads contain many errors and do not map exactly to any genome position. For the Hamming distance step, a cut-off $c$ on the maximum number of hits is provided (but not always used), as described below. We further assume a read $r$ of length $n$ is given as input.

Specific step: maximum exact matches. In the specific step, we use seeds starting at positions 0 to $\lfloor n/2 \rfloor$. We expand seeds through successive binary searches until the number of retrieved candidates is below $c$. If the seed size is at least $n/2$, the value $c$ is ignored and the read is compared to every retrieved position. This step is repeated for both the two-letter and three-letter encodings.

Sensitive step: small seeds. If the read is not guaranteed to map ambiguously in the specific step, we proceed to the sensitive step, where seeds of sizes $k_2$ or $k_3$ are retrieved from positions 0 to $n - k_2$ (or $n - k_3$ for three-letter seeds). If the number of hits for a seed is below $c$, the read is compared to every hit; otherwise the seed is skipped.

In both the sensitive and specific steps, the Hamming distance between the read and the collected hits is counted. When counting mismatches, Ts in T-rich reads are allowed to match both Cs and Ts, and As in A-rich reads are allowed to match both As and Gs. Every other base must match the reference exactly.
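This wildcard comparison is what the four-bit encoding of section A.5 (Table A.1) makes fast. A minimal sketch, assuming both 64-bit words hold 16 valid bases and using an illustrative function name:

```cpp
// Sketch of wildcard Hamming distance over 16 packed bases (4 bits each),
// assuming the encoding of Table A.1: in a T-rich read, T = 0b1010 shares a
// bit with reference C = 0b0010 and with reference T = 0b1000, so it matches
// either; all other bases match only themselves.
#include <bit> // std::popcount (C++20)
#include <cstdint>

inline int mismatches16(const uint64_t read_word, const uint64_t ref_word) {
  // reference nibbles are one-hot, so each nibble of the AND contributes at
  // most one active bit, and that bit marks a match
  return 16 - std::popcount(read_word & ref_word);
}
```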
When calculating Hamming distance, abismal requires the number of mismatches to the read to be below a certain value for a candidate location to be considered a valid hit and subsequently aligned. We set this value to $Fn$, where $F = 0.4$ by default; that is, each candidate may mismatch at most 40% of the read's bases. This is motivated by the fact that, for typical values of $n$ originating from short Illumina reads, i.e. $50 \le n \le 150$, the relative number of mismatches between two i.i.d. sequences with $\Pr(A) = 0.3$, $\Pr(G) = 0.2$ and $\Pr(T) = 0.5$, conditioned on an identical two-letter match of size $k = 25$, is greater than 0.4 with $> 90\%$ probability. This can be verified by simulating a large number of random reads for all possible values of $n$. If a read matches a candidate significantly but contains indels, there is a high probability that the number of mismatches will remain below the 40% threshold, as at least one contiguous substring will contain predominantly matches to the candidate. We keep up to $\eta = 50$ valid hits for subsequent alignment, but we favor hits with fewer mismatches. To attain this, for each read we keep a max heap whose elements are triplets $(p, d, s)$. Here $p$ is the position in the genome as a 32-bit integer, $d$ is the Hamming distance as a 16-bit integer, and $s$ is a set of SAM flags according to the SAM file format documentation, also as a 16-bit integer. Each triplet totals 64 bits, which keeps the elements memory-aligned. At the root of the max heap is the candidate with the highest Hamming distance; when a new candidate is compared, if its Hamming distance is below the root's distance $d$, the element is added to the heap, and the root is discarded if the heap size exceeds $\eta$. In single-end mapping, if at least two exact matches are found, the comparison is stopped, as the read is guaranteed to be ambiguous.

Hamming distance step for paired-end reads. Abismal maps paired-end reads by assuming that the best concordant pair is the most likely mapping location. Two reads can map ambiguously to several locations in the genome, but if only one pair of such locations is concordant in paired-end mapping, the read is said to map uniquely. Conversely, two read pairs can have equal similarity, with reads from individual ends having unique mapping locations. For this reason, in paired-end mapping we continue to collect hits even when several exact matches are found. After finding the best approximate matches independently and populating each read end's heap, we test all pairs of heap elements for concordance. The maximum heap size for paired-end mapping is $\eta' = 32{,}768$. Given two reads $r_1$ and $r_2$ whose local alignments to the genome start and end at positions $P_1 = (s_1, e_1)$ and $P_2 = (s_2, e_2)$, respectively, where $s_1 \le e_1$ and $s_2 \le e_2$, we say $P_1$ and $P_2$ are concordant if they map to the same chromosome and $d \le e_2 - s_1 \le D$, where $d$ and $D$ are user-definable minimum and maximum fragment sizes, set to $d = 32$ and $D = 3000$ by default. Note that the fragment length can be smaller than the individual read lengths, that is, $e_2 - s_1 < \min(e_1 - s_1, e_2 - s_2)$. These "dovetail reads" are infrequent but possible in WGBS datasets, and are therefore accepted by abismal. The best pairs are chosen independently of fragment size, based solely on the sum of the alignment scores of the two fragments. Concordance tests between pairs of candidates are performed by sorting candidates from both ends by genome position $p$.
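A minimal sketch of the bounded candidate heap and the concordance test, assuming the triplet layout described above; the type and function names are illustrative, not abismal's:

```cpp
// Sketch of the max heap of candidate triplets (p, d, s) and the paired-end
// concordance test; the worst candidate (largest d) sits at the root so a
// better candidate can displace it.
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

struct Candidate {
  uint32_t pos;   // genome position p
  uint16_t diffs; // Hamming distance d
  uint16_t flags; // SAM flags s
};

struct CompareDiffs {
  bool operator()(const Candidate &a, const Candidate &b) const {
    return a.diffs < b.diffs; // priority_queue keeps the max at the top
  }
};
using CandidateHeap =
    std::priority_queue<Candidate, std::vector<Candidate>, CompareDiffs>;

// keep at most max_size candidates (eta = 50 single-end), favoring small d
inline void add_candidate(CandidateHeap &heap, const Candidate &c,
                          const std::size_t max_size) {
  if (heap.size() < max_size)
    heap.push(c);
  else if (c.diffs < heap.top().diffs) {
    heap.pop(); // discard the current worst candidate
    heap.push(c);
  }
}

// ends on the same chromosome with fragment [s1, e2) are concordant if
// d <= e2 - s1 <= D (d = 32 and D = 3000 by default)
inline bool concordant(const uint32_t s1, const uint32_t e2,
                       const uint32_t min_frag, const uint32_t max_frag) {
  return s1 <= e2 && e2 - s1 >= min_frag && e2 - s1 <= max_frag;
}
```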
If the best concordant pair has one unique end and one ambiguous end, only the unique end is reported. If no concordant pairs are found, abismal reports the end with the highest alignment score, provided its edit distance is below the acceptable cut-off. Abismal favors paired-end mappings over single-end mappings even if the ends in the best pair have more mismatches than any end in the best single-end mappings.

A.2.2 Local alignment of hits

The $\eta$ best candidates retrieved in the approximate matching step (in single-end mapping), or all concordant pairs (in paired-end mapping), are aligned to the reference genome if the number of mismatches is less than $Fn$. Local alignments are performed using the banded Smith-Waterman algorithm. The band width is given by the Hamming distance calculated in the seeding step, as this distance is an upper bound on the number of edits in the resulting alignment. We use a scoring system of +1 for matches, −2 for mismatches and −3 for indels. This scoring scheme implies that the optimal score is equivalent to the minimum edit distance. Alignments are only accepted if the aligned portion covers at least 60% of the read length, to exclude mapping locations with an extreme abundance of low-quality bases.

A.3 The abismal output in a SAM file

For each read that maps successfully, abismal reports the highest-scoring location, meaning that the edit distance of the mapped portion of the read (excluding soft-clipping) is below a user-defined cut-off. Reads are reported as unique if the candidate that maximizes the alignment score (single-end mapping) or the sum of alignment scores (paired-end mapping) is unique and has edit distance below a fraction $f$ of the read length (single-end), or if both ends are simultaneously below a fraction $f$ of their read lengths. We set $f = 0.1$ by default, but this value can be defined by the user if more or less error is acceptable. Reporting locations for ambiguous reads can lead to incorrect methylation estimates at CpGs in downstream analyses, but may be needed for application-specific global statistics on a dataset. By default abismal does not report ambiguous reads, but users can choose to output a random location using the boolean -A parameter, in which case ambiguity is reflected in the SAM flags. Mapped reads are reported in the Sequence Alignment/Map (SAM) format (https://samtools.github.io), which is recognized by most downstream analysis programs for short reads.

A.4 Mapping all combinations of bisulfite bases and genome strands

The framework above is used to map both traditional and RPBAT reads, in both single-end and paired-end mapping. Mapping traditional single-end reads assumes the input to be T-rich and its reverse-complement (when mapping to the reverse strand of the genome) to be A-rich. Mapping single-end PBAT reads, or the second end of paired-end reads, assumes the input read is A-rich and maps its reverse-complement as T-rich. Single-end random PBAT datasets are mapped by trying all possibilities, that is, first mapping the original read as T-rich (and its reverse-complement as A-rich), then mapping the original read as A-rich (and its reverse-complement as T-rich). The information for strand and bisulfite base is kept for all candidates.
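These combinations can be written out explicitly. A minimal sketch, with illustrative names (abismal's internal representation may differ):

```cpp
// Sketch of the four (orientation, conversion) combinations attempted for a
// single-end random PBAT read: the original read under each conversion
// assumption, with the reverse-complement under the complementary one.
#include <array>
#include <utility>

enum class Conversion { t_rich, a_rich };     // reported as CV:A:T / CV:A:A
enum class Orientation { original, revcomp };

constexpr std::array<std::pair<Orientation, Conversion>, 4> rpbat_attempts = {{
    {Orientation::original, Conversion::t_rich}, // traditional assumption
    {Orientation::revcomp, Conversion::a_rich},  // its reverse-complement
    {Orientation::original, Conversion::a_rich}, // PBAT assumption
    {Orientation::revcomp, Conversion::t_rich},  // its reverse-complement
}};
```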
For every bisulfite sequencing protocol, the strand is reported through SAM flags, and bisulfite conversion is reported through an additional SAM tag, which displays CV:A:T for reads mapped assuming T-rich and CV:A:A for reads mapped assuming A-rich. Mapping paired-end random PBAT datasets is done similarly to single-end, but corresponding reads in read pairs are always assumed to have complementary bisulfite bases, so abismal always reports corresponding reads in pairs with complementary CV tag values.

A.5 Important implementation choices

Because the goal of abismal is to map bisulfite-converted reads efficiently, many implementation choices were made to accelerate critical parts of the read mapping routine. We specifically highlight our choice of encoding used to decide whether a reference base and a read base match. For both read bases and reference bases, we encode each letter using four bits per letter. The binary encoding of each base is shown in Table A.1.

Table A.1: Four-bit encoding of each nucleotide letter

                         A     C     G     T     N
  reference encoding     0001  0010  0100  1000  0000
  read encoding (T-rich) 0001  0010  0100  1010  0000
  read encoding (A-rich) 0101  0010  0100  1000  0000

This encoding has the following property: read base $x$ and reference base $y$ are a match if and only if $x \& y$ has exactly one active bit. We use this property to efficiently count mismatches between encoded read and reference sequences. Our encoding of the reference genome groups 16 consecutive letters into 64-bit integers, using 4 bits per letter and a total of $\lceil m/16 \rceil$ integers. We encode reads similarly, using $\lceil n/16 \rceil$ integers per read. For read integer $w$ and genome integer $u$, the number of matches between the two sequences is the number of active bits in $w \& u$, so the number of mismatches is 16 minus this count; both can be computed efficiently using bit-wise operations.

Appendix B

Supplementary Figures

[Plots omitted from this transcript; the figure captions are retained below.]

Figure B.1: Comparison of sensitivity (correct reads / total reads) across various mappers, with error rates varying from 0 to 1.8% and read lengths varying from 50 to 150 bases.

Figure B.2: Comparison of specificity (correct reads / reported reads) across various mappers, with error rates varying from 0 to 1.8% and read lengths varying from 50 to 150 bases. In the 0% error rate case, abismal and BWA-METH attain a specificity of 1. This is guaranteed by the Hamming distance algorithmic steps, which compare the read to every two-letter exact match regardless of the number of hits.
Figure B.3: Comparison of F1 score (harmonic mean of sensitivity and specificity) across various mappers, with error rates varying from 0 to 1.8% and read lengths varying from 50 to 150 bases.

Figure B.4: Percentage of reads mapped at least once in various real paired-end datasets from the traditional WGBS protocol (datasets: Bernstein, Tycko, Roadmap, Meissner, Low, Yamada, Decato, Baubec, Kamstra, Skvortsova, Lee and Yong; mappers: BSMAP, BWA-meth, HISAT-3N, WALT, abismal and Bismark).

Figure B.5: Percentage of reads mapped at least once in various real paired-end datasets from the random-priming PBAT protocol (datasets: Miura 1, Bian, Miura 2, Leng and Guo; mappers: BSMAP, abismal and Bismark).

Figure B.6: Error rate (fraction of mapped nucleotides as mismatches and indels) comparison across various real paired-end datasets.

Figure B.7: Error rate (fraction of mapped nucleotides as mismatches and indels) comparison across various real paired-end random-priming PBAT datasets.

Figure B.8: Mapping speed (million reads mapped per hour) across various real paired-end datasets.

Figure B.9: Mapping speed (million reads mapped per hour) across various real paired-end random-priming PBAT datasets.
Abstract
Mapping short reads to a reference genome is a classic problem in computational biology. The search for efficient solutions to the read mapping problem has motivated decades of research in mathematics and computer science, and the results of these efforts form the theoretical basis of mapping software tools that are fundamental to most modern molecular biology research. With the increasing amount of data generated by high-throughput sequencing, software tools that map reads efficiently are ever more necessary.
Quantifying DNA cytosine methylation is a common application of high-throughput sequencing for epigenome-wide studies. Methylation of cytosines in a DNA molecule is a stable and heritable epigenetic trait that is strongly correlated with various phenotypes. Changes in DNA methylation are associated with many biological processes during the lifespan of metazoan organisms, including stem cell development, aging, X-chromosome inactivation, genomic imprinting and cancer progression. Cross-species analysis of methylation in specific tissues can also reveal evolutionary changes in gene activation or silencing, providing greater insight into the history of functional elements beyond sequence alone.
Bisulfite sequencing is the gold standard to quantify methylation genome-wide. Through bisulfite treatment coupled with sequencing, cytosines that do not carry the methyl mark are sequenced as thymines. Mapping bisulfite sequencing reads to a reference genome allows DNA cytosine methylation to be estimated at single-base resolution. Most bisulfite sequencing read mapping algorithms adapt methods developed for unconverted DNA to account for bisulfite conversion. These direct adaptations can be particularly inefficient due to the uneven frequency of letters resulting from the biochemical conversion of nucleotides in bisulfite sequencing reads.
In this dissertation, we provide efficient solutions to the bisulfite-converted short read mapping problem. To attain this, we construct a mathematical framework to quantify the expected number of alignments required to map a read, and we then develop a formal methodology to encode reads and the reference genome so as to minimize this value. The main contribution of this dissertation is the introduction of a two-letter alphabet encoding that distinguishes purines from pyrimidines. We show that this encoding has various applications in reducing the time and memory required to map bisulfite-converted reads. We use the two-letter alphabet to create another bisulfite mapping algorithm (abismal), a novel software tool to map short bisulfite sequencing reads. Using data from various public repositories, we show that abismal is faster, uses less RAM and attains more accurate results than mappers commonly used in epigenetics research. These results allow mapping efficiency to keep pace with the high volumes of bisulfite sequencing data continually generated by epigenetics studies. The improved speed and memory efficiency of the abismal software tool make genome-wide methylation data analysis more accessible to the scientific community.