Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Fast search and clustering for large-scale next generation sequencing data
(USC Thesis Other)
Fast search and clustering for large-scale next generation sequencing data
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
FAST SEARCH AND CLUSTERING FOR LARGE-SCALE NEXT-GENERATION SEQUENCING DATA by Haifeng Chen A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS) December 2016 Copyright 2016 Haifeng Chen Dedication To my son, wife, mom and dad. ii Acknowledgments First and foremost, I would like to thank my advisor, Professor Ting Chen, for his guidance and support. He always gives me freedom to explore variety of topics in the field of computational biology and guides me in the right direction. I would like to thank my qualifying exam and dissertation committee mem- bers, Professor Remo Rohs, Professor Aiichiro Nakano, Professor Fengzhu Sun and Professor Liang Chen, for their valuable comments and suggestions on my PhD studies. I would like to thank Professor Andrew Smith for initiating the bisulfite map- ping project and guiding me in this project. I also thank Benjamin Decato and Meng Zhou for valuable comments and helps in this project. I would like to thank my good friends, Tsu-Pei Chiu and Satyanarayan Rao. They always give me help whenever I need, not only research but also life. I would like to thank Misagh Bagherian, Zohreh Baharvand and Yichao Dong, for their help in discussion. I would like to thank Hayley Peltz, Linda Bazilian and Douglas Burleson, for their suggestions in registering courses and preparing qualifying exam and and dissertation defense. Last but not least, I would like to thank my wife who gives me tremendous support in daily life, and let me have enough time to focus on my PhD studies. iii I thank my son who brings me so much fun and love. I thank my parents, who support me through everything. iv Contents Dedication ii Acknowledgments iii List of Tables vii List of Figures ix Abstract xi 1 Introduction 1 1.1 Read Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Protein Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Bisulfite Read Mapping 6 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.1 DNA Methylation . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.2 Bisulfite Sequencing . . . . . . . . . . . . . . . . . . . . . . 8 2.1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.4 Review of Short Read Mapping Algorithms . . . . . . . . . . 14 2.1.5 Review of Bisulfite Sequencing Mapping Algorithms . . . . . 21 2.2 Algorithms of WALT . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2.2 Spaced Seeds . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.2.3 Build Index Table . . . . . . . . . . . . . . . . . . . . . . . . 30 2.2.4 Bisulfite Mapping . . . . . . . . . . . . . . . . . . . . . . . . 31 2.2.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.3.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.3.2 Comparison of Runtime . . . . . . . . . . . . . . . . . . . . 35 2.3.3 Comparison of Accuracy . . . . . . . . . . . . . . . . . . . . 40 2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 v 3 Protein Motif Finding 44 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.2 Algorithm of HSEARCH . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2.2 Protein Sequence to Coordinate . . . . . . . . . . . . . . . . 50 3.2.3 Locality-Sensitive Hashing . . . . . . . . . . . . . . . . . . . 53 3.2.4 Motif Finding . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.3.1 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.3.2 Comparison of Runtime and Accuracy . . . . . . . . . . . . 63 3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.4.1 Protein Motif Clustering . . . . . . . . . . . . . . . . . . . . 64 4 Conclusion 71 Reference 73 A Periodic Spaced Seeds (10100)* and (1110100)* 83 B Key and Value Pairs for a Reference Genome 84 C Comparison of Runtime, Accuracy and Memory Usage 85 D Runtime on Different Lengths of Reads 88 E Uniquely Mapped Reads on Different Number of Mismatches 89 F Test Environment and Command Lines 90 G WALT Output Formats 92 H Notes of Mathematical Symbols for Bislfite Read Mapping 94 I Twenty Amino Acids and 1 Letter Code 95 vi List of Tables 2.1 Memoryusage(GB)andaveragenumberofcandidatesasafunction of k on the human genome . . . . . . . . . . . . . . . . . . . . . . . 16 2.2 A list of published bisulfite mapping programs . . . . . . . . . . . . 22 2.3 Three spaced seeds used in WALT . . . . . . . . . . . . . . . . . . . 28 2.4 Information about data sets used . . . . . . . . . . . . . . . . . . . 35 2.5 Runtime of WALT, Bismark and BSMAP (hours) . . . . . . . . . . 36 2.6 RuntimeofWALTandBSMAPondifferentnumberofthreads(hours) 37 2.7 Accumulated percentage of uniquely mapped reads as a function of bucket size for WALT-all in single-end mapping . . . . . . . . . . . 39 2.8 Accumulated percentage of uniquely mapped reads as a function of bucket size for WALT-all in paired-end mapping . . . . . . . . . . . 39 2.9 Runtime and percentage of uniquely mapped reads for WALT-all and WALT in single-end mapping (hours) . . . . . . . . . . . . . . 40 2.10 Runtime and percentage of uniquely mapped reads for WALT-all and WALT in paired-end mapping (hours) . . . . . . . . . . . . . . 40 2.11 Recall (R) and precision (P) of WALT, Bismark and BSMAP. . . . 42 3.1 The BLOSUM62 matrix . . . . . . . . . . . . . . . . . . . . . . . . 47 vii 3.2 Runtime(hours)ofHSEARCHonpercentageofresults(p-value:10 −6 ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.3 Runtime(hours) of HSEARCH on precentage of results (p-value:10 −4 ) 64 3.4 Recall(R), Precision(P), and F1 for CD-HIT, UCLUST and kClust 69 3.5 Runtime for CD-HIT, UCLUST and kClust (hours) . . . . . . . . . 69 3.6 Runtime (hours) and accuracy for HSEARCH in motif clustering . . 69 A.1 Five spaced seeds for (10100)* . . . . . . . . . . . . . . . . . . . . 83 A.2 Seven spaced seeds for (1110100)* . . . . . . . . . . . . . . . . . . 83 B.1 Key and value pairs for the reference genome in Figure 2.15 . . . . 84 C.1 Comparison of runtime on single-end reads (hours) . . . . . . . . . 86 C.2 Comparison of runtime on paired-end reads (hours) . . . . . . . . . 86 C.3 Comparison of accuracy on single-end read mapping . . . . . . . . . 86 C.4 Comparison of accuracy on paired-end read mapping . . . . . . . . 87 C.5 Comparison of memory usage on single-end reads (GB) . . . . . . . 87 C.6 Comparison of memory usage on paired-end reads (GB) . . . . . . . 87 D.1 Runtime on different lengths of reads (hours) . . . . . . . . . . . . . 88 E.1 The number and percentage of uniquely mapped reads as a function of the number of mismatches in WALT . . . . . . . . . . . . . . . . 89 I.1 Twenty Amino Acids and 1 Letter Code . . . . . . . . . . . . . . . 95 viii List of Figures 1.1 DNA structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.1 Cytosine DNA methylation . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 CpG sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Bisulfite treatment of genomic DNA . . . . . . . . . . . . . . . . . . 8 2.4 Estimate methylation levels . . . . . . . . . . . . . . . . . . . . . . 9 2.5 BS-Seq analysis workflow . . . . . . . . . . . . . . . . . . . . . . . . 10 2.6 C→T asymmetric mapping . . . . . . . . . . . . . . . . . . . . . . . 11 2.7 Best uniquely mapped and ambiguously mapped . . . . . . . . . . . 12 2.8 Single-end and paired-end bisulfite read mapping paradigms . . . . 13 2.9 Hash table for read mapping . . . . . . . . . . . . . . . . . . . . . . 15 2.10 Suffix tree and suffix array for sequence abbcbab# . . . . . . . . . . 19 2.11 Time Complexity as a function of p on different read lengths . . . . 25 2.12 Time complexity comparison on sorting with and without partitioning 26 2.13 An example of applying periodic spaced seed (010)* . . . . . . . . . 28 2.14 An example to show how WALT obtains three seeds from a read . . 29 2.15 Applying periodic spaced seed (010)* to a genome sequence . . . . . 30 2.16 Hash table for the reference genome shown in Figure 2.15 . . . . . . 31 2.17 Spaced seed applied to a read and searching the index table . . . . 32 ix 2.18 The workflow of WALT for bisulfite read mapping . . . . . . . . . . 33 2.19 The hash table data structure in WALT . . . . . . . . . . . . . . . 34 2.20 Runtime as a function of read length . . . . . . . . . . . . . . . . . 36 2.21 An example of an ambiguously mapped read . . . . . . . . . . . . . 41 3.1 A motif with multiple alignment . . . . . . . . . . . . . . . . . . . . 45 3.2 Position weight matrix (PWM) for the motif in Figure 3.1 . . . . . 45 3.3 Percentage of 25-mers conserved during similarity to distance con- version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4 Projection of points to random line . . . . . . . . . . . . . . . . . . 54 3.5 Projection of points to random line . . . . . . . . . . . . . . . . . . 55 3.6 Probability of two points hashed to the same bucket as a function of distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.7 Probability of two points hashed to the same bucket with different K and L . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.8 Percentage of data points inside and outside motifs to the centers as a function of distance . . . . . . . . . . . . . . . . . . . . . . . . 60 3.9 The number of data point pairs as a function of distance . . . . . . 61 3.10 Pairwise distance inside and outside motifs for 25-mers . . . . . . . 67 x Abstract With the development of next-generation sequencing technology, a massive amount of DNA and protein sequences have been produced. Fast analyzing these huge sequencing data to obtain important information becomes an important step for many biological applications. One of the most significant and computational intensive methods is to align these DNA and protein sequences to a reference genome or a annotated protein database in order to know locations of DNA sequences or predict functions of protein sequences. Read mapping is the problem to align DNA sequences to a reference genome for locating positions where they originalcomefrom. Proteinalignmentistheproblemtoalignaminoacidsequences toalargeannotatedproteindatabaseforpredictingfunctionsofpeptidesequences. During my PhD studies, I designed a new read mapping algorithm (called WALT) for bisulfite sequencing and a protein alignment algorithm (called HSEARCH) for protein motif finding. Bisulfite sequencing is the golden standard for DNA methy- lation study. The most computational intensive step in DNA methylation study is to map billions of short DNA reads to the reference genome. WALT uses a strat- egy of hashing periodic spaced seeds, which leads to significant speedup compared with the most efficient methods currently available. Protein motifs are conversed sequences which has a specific function, for example, the active site of an enzyme. HSEARCH convertedk-mers to data points in high dimensional space, and applies xi locality-sensitive hashing technique to fast search motifs database which is signif- icantly faster than brute force algorithm with the same accuracy. Additionally, HSEARCH supports protein motif sequence clustering, and it is significantly more sensitive than existing protein sequence clustering programs. xii Chapter 1 Introduction Deoxyribonucleic acid (DNA) is a genetic material in humans and almost all other organisms on the earth (Watson, 1970). DNA contains the instructions needed for an organism to develop, survive and reproduce (Sciences, 2006). DNA is made up of four chemical bases: adenine (A), guanine (G), cytosine (C) and thymine(T). Each baseis attachedto a sugarand a phosphateto form anucleotide (Figure 1.1). Nucleotides are arranged in two long strands that form a double helix (Watson, 2003). DNA bases pair up with each other, A with T and C with G, to form units called based pairs. One strand can serve as a template for producing the other one. The order, or sequence, of these bases determines the information available for building and maintaining an organism (Watson, 1970). A gene is a segment of DNA that contains particular instructions to encode a specific protein. A genome is a complete DNA with all its genes. The genome size and the number of genes vary widely between organisms. For example, the human genome has about 3 billion bases with 20,000 to 25,000 genes, and fruit fly genome has about 130 million bases with about 13,600 genes (Akalin et al., 2012). ThroughtheadvancementsofNext-GenerationSequencing(NGS)technologies, the cost of sequencing a human genome has been significantly reduced (Wetter- strand,2016), whichdramaticallyimprovesthestudyofgenomics, andmanybioin- formatics tools have been developed to analyze these DNA and protein sequencing data for variety of applications (Buermans & den Dunnen, 2014). 1 Figure 1.1: DNA structure 1.1 Read Mapping In most applications in NGS, the first step is to align or map the NGS reads back to the reference genome (Li & Homer, 2010) to obtain the location where each read original come from. For example, in genome assembly, with the location of each read could dramatically improve the speed and accuracy (Miller et al., 2010). In additional, to look for single nucleotide polymorphism (SNP) in DNA sequence, each read should map back to the reference genome, and identify the difference between the read and the reference genome. Many more applications are relied on read mapping results. Meanwhile, more than hundred short read mapping programs were developed to align reads to reference genome. For example, Needleman-Wunsch algorithm 2 (Needleman & Wunsch, 1970) and Smith-Waterman algorithm (Smith & Water- man, 1981) are the earliest alignment methods to align two sequences. However, the time complexity is quadratic to the sequence length, which is not faster enough to handle the large amount of sequencing data. In 1990, BLAST (Altschul et al., 1990) was developed, which filters out candidates that do not share k-mer with the query sequence. BLAST is significantly faster than pairwise alignment algo- rithms and maintains good accuracy. With the improvement of NGS technologies, BLAST is still too slow for data analysis. Many more alignment algorithms were published in the last decade, and they are reviewed in Section 2.1.4. Short read alignment is no longer the bottleneck of NGS data analysis (Li & Homer, 2010), while alignment algorithms for long reads and specific biologi- cal applications could be needed more concern. For example, in DNA methylation studies, whole-genome bisulfite sequencing (WGBS) is the golden standard for pre- dicting single nucleotide methylation levels. Since the issue of C→T asymmetric mapping (more details in Section 2.1.2), bisulfite read mapping is more computa- tional complicate than normal short read mapping. Furthermore, in one bisulfite experiment, it produces billion of reads in order to achieve single nucleotide resolu- tion. Bismark (Krueger & Andrews, 2011), one of the most popular used bisulfite mapping programs, takes more than 800 hours to map these billion of reads in one experiment. During my PhD studies, I developed a new bisulfite mapping algorithm, called WALT. WALT (Chen et al., 2016) uses a strategy of hashing periodic spaced seeds, which leads to significant speedup compared with the most efficient methods cur- rently available. Importantly, these speed gains do not sacrifice accuracy. WALT algorithm and program are described in Chapter 2. 3 1.2 Protein Alignment Protein alignment is more complicated than DNA read mapping in several aspects. First of all, DNA read mapping normally only consider the number of mismatches or gaps (insertion and deletion). However, for protein alignment, not only mismatches or gaps should be counted, but also different pairs of amino acids have different similarity scores, as shown in Table 3.1. For example, in read mapping, A to T is a mismatch, and G to A is another type of mismatch, but they are counted as the same. However, in protein alignment, the scores for Glycine (G) to Methionine (M) and G to Serine (S) are different. Secondly, there are only 4 types of nucleotides, but it has 20 different amino acids. For the same length of DNA sequence and amino acid sequence, there are extremely more combinations of amino acid sequences than DNA sequences 1 . Thirdly, DNA read mapping is a semi-global alignment which the whole read should be matched to the reference genome, but protein alignment is local alignment, and a fragment of the protein sequence matched to the protein database should be reported, too. Therefore, protein alignment is more challenging than DNA read mapping. In some cases, DNA sequences could align to protein database in order to improve sensitive. In the genetic code (Zhang & Yu, 2011), three nucleotides encode one amino acid, and more than one combinations of three-nucleotide for one amino acid. In this case, two DNA sequences is quite different, but their coded amino acid sequences may be the same. Needleman-Wunsch algorithm (Needleman & Wunsch, 1970), Smith-Waterman algorithm (Smith & Waterman, 1981) and BLAST (Altschul et al., 1990) are sup- ported for both DNA and protein alignment. However, they are too slow for the 1 Twenty amino acids and 1 letter code are shown in Appendix I 4 huge amount of protein sequences and the protein database doubled every two years (Hauser et al., 2013). In 2015, Buchfink et al. (2015) published a new pro- tein alignment tool, called DIAMOND, which is about 20,000× faster than BLAST and still reports about 80∼ 90% of matches. DIAMOND has alleviated the need of fast protein alignment tools. However, protein alignment for specific application is still a big challenging, for example, the protein motif finding. So far, there are only few tools are developed for protein motif finding, and most of them are web server-based which cannot scale up for large data set. During my PhD studies, I developed a new protein motif finding and cluster- ing algorithm, called HSEARCH. HSEARCH converts k-mers to data points in high dimensional space, and applies locality-sensitive hashing to fast search known motif database. HSEARCH is significantly faster than brute force algorithm and achieves the same accuracy. Furthermore, HSEARCH for protein motif clustering is dramatically more sensitive than existing programs. HSEARCH algorithm and program are described in Chapter 3. 5 Chapter 2 Bisulfite Read Mapping 2.1 Introduction 2.1.1 DNA Methylation DNA is the genetics material which contains the information for an organism to grow, survive and produce its next generation (Sciences, 2006). All cells carry the same DNA in an organism, but the cell types vary in different tissues and organs (O’Connor et al., 2010). The behavior of genes does not only depend on the sequence of genes, but also on environmental factors. Epigenetics is the study of the factors that affect the expression of genes (Egger et al., 2004). Epigenetics involves genetic control by environmental factors rather than the DNA sequence. The change of these factors can switch genes on and off and determine which genes are expressed. Among these factors, DNA methylation is one of most intensely studied epigenetics mechanisms (Conerly & Grady, 2010; De Zhu, 2005; Jaenisch & Bird, 2003; Kulis & Esteller, 2010). Figure 2.1: Cytosine DNA methylation 6 DNA methylation is a chemical process that adds a methyl group at 5-carbon of the cytosine forming 5-methylcytosine (5-mC) (Moore et al., 2013), as shown in Figure 2.1. It is highly specific and always happens in a region in which a cytosine nucleotide is located next to a guanine nucleotide (Figure 2.2), and this is called a CpG site (Egger et al., 2004; Jones & Baylin, 2002). CpG sites are methylated by a family of enzymes, called DNA methyltransferases (DNMTs) (Egger et al., 2004; Robertson, 2002). Figure 2.2: CpG sites Cells from different tissues and organs have variety of DNA methylation pat- terns (Brunner et al., 2009). Identifying variation in DNA methylation patterns is important for understanding gene regulation and expression in these cells. DNA methylation alters chromatin structure, which enables a single cell to grow into a complex organism with different tissues and organs (Lorincz et al., 2004). DNA methylation also plays an important part in the development of cancer and is a key regulator of gene transcription (Jaenisch & Bird, 2003). Studies have shown that genes with a promoter region that contains a high concentration of 5- methylcytosine are transcriptionally silent (Robertson, 2001). Additionally, DNA hypermethylation is linked to the activation of genes and DNA hypomethylation has been associated with the development of cancer (Esteller, 2005). Thus, it is 7 important to study how DNA methylation occur and where they are located on DNA in different cell types (Bock, 2012). 2.1.2 Bisulfite Sequencing In the past decade, whole-genome bisulfite sequencing (WGBS) technique has become a gold-standard approach for studying Cytosine DNA methylation at base-pair resolution (Frommer et al., 1992). During bisulfite treatment (Fig- ure 2.3), unmethylated Cs are converted to Uracils (U), and methylated Cs remain unchanged (House, 2013). After polymerase chain reaction (PCR), unmethylated Cs are converted to Ts, while methylated Cs are unaffected (Krueger et al., 2012). Bisulfite treatment effectively distinguishes the unmethylated Cs and methylated Cs. In Figure 2.3, OT represents the sequenced reads from the original top strand (C→T forward), CTOT represents the sequenced reads from the strand comple- mentary to the original top strand (G→A reverse complement), OB represents Figure 2.3: Bisulfite treatment of genomic DNA 8 the sequenced reads from the original bottom strand (C→T reverse complement) and CTOB represents the sequenced reads from the strand complementary to the original bottom strand (G→A forward) (Krueger et al., 2012). Figure 2.4: Estimate methylation levels By mapping reads from bisulfite sequencing, which we refer to here as "bisulfite reads", to the reference genome, methylation rates of the genomic DNA could therefore be inferred (Bock, 2012; Krueger et al., 2012). Figure 2.4 is an example to show how to estimate DNA methylation rate for one base. Seven mapped reads cover the CpG site, six of them are Cs, and one of them is T (Song et al., 2013). During bisulfite treatment, unmethylated Cs are converted to Ts (House, 2013), here assuming the number of Ts mapped equals to the number of unmethylated Cs in the genomic DNA, and the number of Cs mapped equals to the number of methylated Cs. Then, the methylation rate for this CpG site is 6/7 (0.86). Similarly, for the non-CpG site, only Ts are mapped to this position which means 9 that in the genomic DNA, all are unmethylated Cs. The methylation rate for this non-CpG site is 0. Normally, in human DNA, the methylation rate for CpG site is close to 1, and non-CpG site is close to 0. Figure 2.5: BS-Seq analysis workflow The analysis of bisulfite sequencing (BS-Seq) data generally includes five steps (Figure 2.5): quality control, adaptor trimming, read mapping, DNA methylation rates predication and biological analysis (Bock, 2012; Krueger et al., 2012). Reads with poor quality are discarded in quality control and the trimming step filters out the adaptor sequences from the reads (Martin, 2011). In read mapping step, bisulfite reads are aligned with the reference genome to identify the most likely genomic positions that the reads originally come from. The methylation rates for each CpG site and non-CpG site are calculated in the methylation rates step (Figure 2.4). The last step biological analysis analyzes how the methylation rates affect the gene regulation and expression (Akalin et al., 2012; Jiang et al., 2014; 10 Song et al., 2013). My work mainly focuses on the read mapping part which is the most computationally intensive step in the workflow. 2.1.3 Problem Statement The genomic DNA position for a read is missing after sequencing. In order to estimate the methylation rate for CpG sites and non-CpG sites, bisulfite reads should be mapped back to the reference genome. Each bisulfite read could be sequenced from one of the four strands, OT (C→T forward), OB (C→T reverse complement), CTOT (G→A reverse complement) or CTOB (G→A forward), as shonw in Figure 2.3. The process of read mapping for bisulfite sequencing is to identify the most likely genomic DNA position and the strand information which we refer to here as "true location", for each read. Figure 2.6: C→T asymmetric mapping Since unmethylated Cs are converted to Ts during the bisulfite treatment, Ts in the reads could be mapped to either Cs or Ts in the reference genome. However, Cs in the reads could only be mapped to Cs in the reference genome, and Ts in the reference genome could only be mapped to Ts in bisulfite reads, so that C to T (C→T) mapping is asymmetric (Chatterjee et al., 2012; Li & Li, 2009), as shown in Figure 2.6. This makes the bisulfite read mapping more complicated than normal 11 short read mapping. To avoid the issue of C→T asymmetric mapping, commonly, Cs in the reads and reference genome are converted to Ts (Bock, 2012; Krueger et al., 2012). For reads from CTOT and CTOB, all Gs are converted to As, same to the reference genomes. To avoid biased estimating methylation rate for each base, best uniquely mapped position is identified as genomic DNA position (true location) for each read (Krueger et al., 2012). Here best uniquely mapped position means that a read is mapped to only one genomic position with the minimal number of mis- matches. The reads that have only one genomic position with the minimal num- ber of mismatches are best uniquely mapped reads. On the contrary, the reads mapped to multiple positions with the minimal number of mismatches are consid- ered as ambiguously mapped reads. Ambiguously mapped reads are not used for estimating methylation rates so they are discarded in the mapping results. Figure 2.7: Best uniquely mapped and ambiguously mapped As shown in Figure 2.7, suppose the fragments with the same color are the same bisulfite reads and the numbers at the right end of these fragments are the number of mismatches. The blue read is mapped to 3 genomic positions: one position is exact mapped and two positions are mapped with 1 mismatch. Blue read is treated as best uniquely mapped read since only one position is mapped without mismatch and that position is identified as the genomic DNA position or true location for the read. The yellow read is mapped to two genomic positions and both are exact matched, so the yellow read is ambiguously mapped read. Ambigu- ously mapped reads are not used for estimating methylation rates in downstream 12 analysis. Similarly, the red read is best uniquely mapped read. The black read is only mapped to one genomic position with 3 mismatches, so this read is also best uniquely mapped read and the position is identified as the true location. Figure 2.8: Single-end and paired-end bisulfite read mapping paradigms Figure 2.8 shows the paradigm for single-end and paired-end bisulfite read mapping. For single-end mapping, one T-rich read is mapped to OT and OB strands, and the best uniquely mapped position is identified as the true location. For paired-end mapping, one T-rich read is mapped to OT and OB strand, and the other A-rich read is mapped to CTOT and CTOB strands. Two reads are mapped independently and a list of top matching positions is kept for each of them. Then the two list of positions are merged with quadratic algorithm. For any pair of positions, if the sum of the number of mismatches and the distance between them are satisfied the user-defined parameters, then the pair of positions is a valid matching. Finally, the valid matching position with the minimal number of mismatches is identified as true location for the pair. ProblemStatement: GivenasetofreadsQ={q 1 ,q 2 ,q 3 ,...,q n }, andareference genome sequenceG, find a substring inG which alignsq i with the minimal number of mismatches, 1<i<n. The alphabet for Q and G is{A,G,T}. 13 There are two main computationally challenges for mapping bisulfite reads: the C→T asymmetric mapping reduces the alphabet size and billions of bisulfite reads are produced in a single DNA methylation experiment (Krueger & Andrews, 2011; Li & Li, 2009). Firstly, during bisulfite treatment, unmethylated Cs are converted to Ts, C→T mapping cannot be count as a mismatch. If all Cs in the bisulfite reads and reference genome are converted to Ts, the alphabet size is reduced to 3. The average number of candidate matching genomic positions is 2L 4 l for normal read mapping, butitis 2L 3 l forbisulfitemapping. Thenumberoffalsepositivecandidates for bisulfite mapping is about ( 4 3 ) l times of the normal read mapping. Need to be clarified, in this dissertation 1 , L is the length of reference genome G, and l is the length of a read q i . Secondly, for bisulfite sequencing, ideally, every genomic position averagely should be covered by about 30 reads. Considering a sample from the human genome (L = 3×10 9 ), the number of reads to be sequenced is about 30×3×10 9 l , which is one billion, and two billion for paired-end sequencing, when the length for a read is 90. Bismark (Krueger & Andrews, 2011), one of the most popular used bisulfite mapping programs, takes about 800 hours to map these one billion bisulfite reads, about one month, day and night. 2.1.4 Review of Short Read Mapping Algorithms With the advancements of NGS technologies, massive sequencing data has been produced for variety of biological applications. For most of these applications, the first step is to map these sequenced reads to the reference genome. In the last decade, many short read mapping programs have been published (Fonseca et al., 1 For all notations, please refer to Appendix H. 14 2012;Li&Homer,2010). Mostofthemarebasedoneitherahashtable(Cormenet al., 2001) or a suffix tree/array (Manber & Myers, 1993; Ukkonen, 1995) methods. Hash Table-based Methods Hash table based mapping programs follow the seed-and-extend paradigm. They keep genomic positions started with a specifick-mer in a hash bucket. When mapping a read, one or multiple k-mers are selected from the read, called seeds, for locating hash buckets to obtain candidate genomic positions, as shown in Fig- ure 2.9. After obtaining all candidate genomic positions, the more time-consuming and accurate methods are applied, such as direct alignment Needleman-Wunsch algorithm (Needleman & Wunsch, 1970) or Smith-Waterman algorithm (Smith & Waterman, 1981). Figure 2.9: Hash table for read mapping MAQ (Li et al., 2008), SOAP (Li et al., 2008), BLAST (Altschul et al., 1990, 1997), BFAST (Homer et al., 2009), PerM (Chen et al., 2009), RMAP (Smith et 15 al., 2008), ZOOM (Lin et al., 2008), MrsFast (Hach et al., 2010, 2014), SeqMap (Jiang & Wong, 2008), Mosaik (Lee et al., 2014), RAPSearch (Ye et al., 2011; Zhao et al., 2012) and DIAMOND (Buchfink et al., 2015) are hash table based mapping programs. Memory Usage: LetL denote the length of the reference genome, and eachk- mer is encode as a unsigned integer which takes 4 bytes memory, and each genomic position takes 4 bytes memory as well, the total memory usage for the hash table is roughly 4 k ×4+4×L. The memory usage is about 11.2 gigabytes (GB) for human genome when k is 12. Runtime Analysis: The average number of candidates in each hash bucket is 2L 4 k . If the runtime for validating each candidate isO(l),l is the length of a read, then the runtime for mapping a read is O( 2L 4 k ×l). Table 2.1: Memory usage (GB) and average number of candidates as a function of k on the human genome k Memory usage Average # of candidates Average # of candidates for bisulfite mapping (4 k ×4+4×L) ( 2L 4 k ) ( 2L 3 k ) 12 11.2 358 11,290 13 11.4 90 3,763 14 12.2 22 1,254 15 15.2 6 418 16 27.2 1.4 140 Table 2.1 shows the memory usage for a hash table and average number of candidates in each hash bucket. For bisulfite mapping, since all Cs are converted to Ts, or all Gs are converted to As, only 3 k hash buckets are filled with genomic positions, and the other 4 k −3 k hash buckets are empty. Suppose the lengths of reference genome and reads are constant, then the runtime only depends on k. In 16 order to improve the mapping speed, longer k-mers should be used. However, as shown in Table 2.1, k cannot be too large since the hash table takes much larger memory. In practice, k usually is around 13 in order to maintain the balance of runtime and memory usage. Disadvantages: Firstly, as shown in Table 2.1, the average number of can- didates in each hash bucket is about 90 for normal read mapping, and 3,763 for bisulfite mapping, when k is 13. In order to identify the only one true location, most of the runtime is spent on validating a large number of false positive candi- dates. Secondly, the genomic positions are not evenly distributed in the 4 k or 3 k hash buckets, and some of the buckets are extremely large, for example,with mil- lion of genomic positions. If a read hits a large bucket, validating these candidate positions take a large amount of time. Furthermore, such buckets will be hit by millions of reads as each of these candidate positions can be sequenced. Thirdly, hash table-based methods use the k-mer to locate hash buckets. However, if the selected k-mer contains internal mismatches, it locates to the wrong hash bucket and waste time to validate all candidates without the true location. Many pro- grams applied spaced seed to consider mismatches insider k-mers or seeds. A spaced seed is a k-mer or seed with internal mismatches (Fonseca et al., 2012). More detail on spaced seed, please refer to Section 2.2.2. Advantages: Firstly, hash table-based method quickly obtain candidates by storing all candidate genomic positions started with a specific k-mer in the same hash bucket. Secondly, after locating hash buckets, the runtime only depends on the number of candidate genomic positions, it does not depend on the number of mismatches. 17 To sum up, hash table-based methods are slow for exact matching but more flexible for inexact matching 2 . Suffix Tree-based Methods Suffix tree is a tree data structure which stores all suffixes of a sequence for fast searching. Every suffix of the sequence is a path from the root node to a leaf node in the suffix tree (Manber & Myers, 1993; Ukkonen, 1995). Figure 2.10A shows a suffix tree of the reference sequence abbcbab#. The # sign in the tree indicates the end of a suffix. For example, to search a read bcba in the reference sequence abbcbab#, it traverses the path b→c→b→a from the root node. Here the first letter isb, it goes to the branch labeledb. The next letter isc, and only one of the branches starts with the letter c, which is a leaf node. It only needs to validate the rest lettersba with the label of the leaf node. Thus, the substring started from position 2 is bcba. The time complexity to search all occurred locations of a read in the reference sequence is O(l). Advantages: Suffix tree-based method is faster than hash-table based method since it traverses through the tree from the root node to an internal node only once for searching all occurred locations, and no matter how many times the read occurs in the reference sequence, while hash table-based methods need to validate all candidate positions in hash buckets. Inexact Matching: Suffix tree-based methods identify inexact matching sup- ported by exact matching (Li & Homer, 2010). For inexact matching, suffix tree- based methods enumerate all combinations of possible mismatches in the read (Li & Homer, 2010), and the number of candidates grows exponentially on the length 2 In this dissertation, without particular explanation, inexact matching stands for mapping with mismatches, while gaps (insertion or deletion) are not included. 18 Figure 2.10: Suffix tree and suffix array for sequence abbcbab# of the read and the number of mismatches. For example, to search the read baba with 1 mismatch, there are 4 different positions for 1 mismatch. If we consider 2 mismatches, there are 4 2 ! = 12 different pairs of positions for 2 mismatches. The time complexity to search a read is O( l m ! ×l), where m is the number of mismatches. Memory Usage: The number of nodes in a suffix tree is about 2×L (Mehta & Sahni, 2004), where L is the length of the reference genome. Each node has 4 pointers for linking to its children nodes, and at least one pointer or integer for linking to the genomic position. Thus, the total memory usage for a suffix tree is at leastO(2×L×5×4) bytes. It is 40 bytes for every genomic position, which is about 112 GB memory for a human genome. 19 Suffix Array: Since suffix tree for the human genome takes too huge memory, a much more compressed data structure called suffix array (Abouelhoda et al., 2004) is used for preprocessing the human genome. As shown in Figure 2.10B, the table with genomic positions is the suffix array for the reference sequence abbcbab#. A suffix array is an array which stores start positions for all suffixes, and all start positions are sorted by their suffixes (Gusfield, 1997). Binary search is applied to search a read in the suffix array. For example, to search the the bcba, the first letter b is used to shrink the region to [2, 5], as shown in Figure 2.10B. The second letter c is used to shrink the region to [5, 5] in the suffix array, and only one position left, it only needs to validate the rest part ba in the read. The time complexity to search a read in a suffix array is O(l×log(L)). A enhanced suffix array method used longest common prefix to reduce the time complexity to O(l+log(L)) (Abouelhoda et al., 2004; Manber & Myers, 1993). Each genomic position takes 4 bytes memory in the suffix array and it is about 11.2 GB memory for the human genome. Even more memory efficient methods are published based on Burrows-Wheeler transform (Burrows & Wheeler, 1994; Salson et al., 2009) and/or FM-index (Ferragina & Manzini, 2000; Simpson & Durbin, 2010) in the last decade. For inexact mapping, similarly to suffix tree-based methods, suffix array or Burrows-Wheeler transform based algorithms enumerate all combinations of pos- sible mismatches in the read (Li & Homer, 2010), and the runtime is exponentially on the length of the read and the number of mismatches. BWA (Li, 2013; Li & Durbin, 2009), Bowtie (Langmead & Salzberg, 2012; Langmead et al., 2009), SOAP2 (Li et al., 2009), segemehl (Hoffmann et al., 2009; Otto et al., 2012a) and ALFALFA (Vyverman et al., 2015) are suffix tree, suffix array or Burrows-Wheeler transform based mapping programs. 20 2.1.5 Review of Bisulfite Sequencing Mapping Algorithms Although many short read mapping programs have been developed, these pro- grams cannot deal with the issue of C→T asymmetric mapping. Several bisulfite sequencing mapping programs have been published, as shown in Table 2.2. Most of the bisulfite mapping programs convert Cs in both reads and reference genome to Ts to handle the issue of C→T asymmetric mapping. Bismark (Krueger & Andrews, 2011), which is one of the most popular used bisulfite mapping programs, first converts Cs to Ts for reads in the mate 1 file, and Gs to As for reads in the mate 2 file. The reference genome also converted to 4 copies, as shown in Figure 2.3, C→T forward (CT), C→T reverse complement (OB), G→A forward (CTOB) and G→A reverse complement (CTOT). Then, Bis- mark applies Bowtie (Langmead & Salzberg, 2012; Langmead et al., 2009) to map the converted reads to the converted reference genomes. BS-Seeker2 (Chen et al., 2010a; Guo et al., 2013), MethylCoder (Pedersen et al., 2011a), Lister et al. (2008)’s tool and GPU-BSM (Manconi et al., 2014a) employ the similar strategy as Bismark. In recently years, several independent bisulfite mapping programs were pub- lished. BSMAP (Li & Li, 2009) and BRAT (Harris et al., 2010) are based on hash table methods. BSMAP (Li & Li, 2009) builds a pre-compiled hash table to minimize the cache-miss latency when searching possible genome positions for k-mers. It also applies a bitwise masking method to efficiently count the number of mismatches for asymmetric conversion of C→T. BRAT-BW (Harris et al., 2012), BRAT-nova (Harris et al., 2016), segemehl (Otto et al., 2012a) and HPG-Methyl (Tarraga et al., 2015) are based on suffix array or Burrows-Wheeler transform methods. 21 Table 2.2: A list of published bisulfite mapping programs Program Source code Citation WALT https://github.com/smithlabcode/walt (Chen et al., 2016) Bismark https://github.com/FelixKrueger/Bismark (Krueger & Andrews, 2011) BSMAP https://github.com/genestack-open/bsmap (Li & Li, 2009; Xi et al., 2012) BS-Seeker2 https://github.com/BSSeeker/BSseeker2 (Chen et al., 2010b; Guo et al., 2013) BRAT-BW http://compbio.cs.ucr.edu/brat/ (Harris et al., 2016, 2012, 2010) BWA-meth https://github.com/brentp/bwa-meth (Pedersen et al., 2011b; Pedersen et al., 2014) BatMeth https://github.com/jingquanlim/batmeth (Lim et al., 2012) PASS-bis http://pass.cribi.unipd.it/cgi-bin/pass.pl (Campagna et al., 2013) segemehl http://www.bioinf.uni-leipzig.de/Software/segemehl/ (Otto et al., 2012b) GNUMAP-bs http://dna.cs.byu.edu/gnumap/ (Hong et al., 2013) Pash http://brl.bcm.tmc.edu/pash/index.rhtml (Coarfa et al., 2010) GPU-BSM https://pypi.python.org/pypi/GPU-BSM/ (Manconi et al., 2014b) Bison https://github.com/dpryan79/bison (Ryan & Ehninger, 2014) MOABS http://www.deqiangsun.org/software/ (Sun et al., 2014) 22 Although many bisulfite mapping programs are available, fast and accurate alignment of massive bisulfite reads remains a challenging problem (Bock, 2012; Krueger et al., 2012; Tran et al., 2014). During my PhD studies, I developed a new bisulfite mapping program, called WALT (Wildcard ALignment Tool). WALT (Chen et al., 2016) uses a strategy of hashing periodic spaced seeds, which leads to significant speedup compared with the most efficient methods cur- rently available. Although many existing WGBS mapping programs slow down with read length, WALT improves in speed. Importantly, these speed gains do not sacrifice accuracy. WALT is open source and can be downloaded from https://github.com/smithlabcode/walt. 23 2.2 Algorithms of WALT 2.2.1 Introduction With the rapid improvement of NGS technologies, sequencing reads are getting longer and longer (Li, 2014; Li & Durbin, 2010; McCoy et al., 2014; Reuter et al., 2015). For example, Illumina HiSeq 3000/HiSeq 4000 Systems can produce 150 bp length reads for whole-genome bisulfite sequencing data (Illumina, 2016). Nowadays, most WGBS reads are in the length of 90 bp to 110 bp. When mapping bisulfite reads, normally, the number of mismatches to be considered is about 6. Thus, the total different combination of positions for 6 mismatches is about 100 6 ! ≈ 1.2×10 9 , when the read length is 100. Even if only 2 mismatches are considered, there are 100 2 ! =4,950 combinations. Hash Table and Suffix Array Suffix array based methods search all combinations of mismatches to handle inexact matching, which is not a computationally reasonable algorithm for full sensitivity of 6 mismatches. Therefore, suffix array based methods often follow the seed-and-extend paradigm as well. They partition a read to several segments, and each segment is mapped separately with only 1 or 2 mismatches. After obtaining candidategenomicpositionsforeachsegment,thesecandidatepositionsaremerged for identifying the true location for the whole read. Suppose l is the length of the read, and it is partitioned to p segments, the average length for each segment is l =b l p c. and the average number of candidate genomic positions for each segment is about L 3 l . Suppose the time complexity to merge all candidate genomic positions is linear to the number of candidates. Then 24 the time complexity to search a read is roughly O( l m ! ×l×log(L)×p+ L 3 l ×p), which is O( l 2 p ×log(L)+ L 3 l ×p), when only one mismatch is considered in each segment. Herem is the number of mismatches andL is the length of the reference genome. Suppose the lengths of reference genome and reads are constant, then the runtime only depends on p. Figure 2.11 shows the time complexity as a function of p on four different read lengths. When p is small, the runtime decreases with the increasing of p. However, after certain turning point, when p increases, the runtime grows exponentially. The turning point increases with the increasing of the read length. 0 5 10 0 0.5 1 1.5 2 x 10 5 Time Complexity Read Length: 75 0 5 10 0 0.5 1 1.5 2 x 10 6 Read Length: 90 0 5 10 0 2 4 6 x 10 5 p: Number of segments for a read Time Complexity Read Length: 100 0 5 10 15 0 2 4 6 8 x 10 5 p: Number of segments for a read Read Length: 120 Figure 2.11: Time Complexity as a function of p on different read lengths 25 AsdiscussedinSection2.1.4, thetimecomplexitytosearchaquerysequencefor hash table-based methods is O( L 3 k ×l). Even if we consider internal mismatches in the hash key or seed using periodic spaced seeds (Sectoin 2.2.2), which is to add constant in the time complexity, the runtime of hash table-based methods is significantly smaller than suffix array-based methods (O( L 3 k ×l)O( l 2 p ×log(L)+ L 3 l ×p)). Suffix array-based methods are more time-consuming than hash table-based methods even we only consider one mismatch in each segment. As we mentioned in Section 2.1.4, spaced seeds are applied to consider internal mismatch in the hash key for hash table-based methods. Then what about we apply spaced seed (Section 2.2.2) to suffix array methods for consider mismatches? 10 11 12 13 14 2 3 4 5 6 7 8 9 10 x 10 10 k Time Complexity Time Complexity Comparison on Sorting Partition non−Partition Figure2.12: Time complexity comparisononsorting withandwithout partitioning In fact, hash table-based methods is faster than suffix array-based methods if both of them apply spaced seed to consider mismatches. 26 First of all, in the building index step, hash table-based methods partition the 3 billion genomic positions to 3 k hash buckets, and then sort genomic positions in each hash buckets. The time complexity is roughly O(3 k × L 3 k log( L 3 k )). For suffix array-based methods, all candidates are sorted together, the time complexity is roughly O(Llog(L)). As shown in Figure 2.12, O(3 k × L 3 k log( L 3 k ))O(Llog(L)). Secondly, in the bisulfite mapping step, hash table-based methods can directly go to the corresponding hash bucket in constant time, but suffix array-based meth- ods takesO(klog(L)) time for shrinking to the same number of candidates. There- fore, WALT is designed based on hash table methods. Binary Search As we discussed in Section 2.1.4, hash table-based methods store genomic posi- tions started with a specific k-mer in a hash bucket, and the runtime (O( 2L 3 k ×l)) only depends on k if the lengths of reference genome and reads are constant. However, k cannot be too large since the hash table takes much larger memory (Table 2.1). With small k, the average number of candidates in a hash bucket is large. It takes a lot of runtime to validate all candidates in a hash bucket. All these candidate genomic positions should be validated since they are stored in a hash bucket without a specific order, which means that they are randomly in a hash bucket. WALT sorts all candidate genomic positions in each hash bucket and applies binary search method to reduce the number of false positive candidates. For a particular seed (Section 2.2.2) exacted from a read, the firstk nucleotides are usedtolocatethehashbucket, andbinarysearchisappliedforthenucleotidesafter the first k in the seed. In this case, WALT uses longer seeds without increasing memory. 27 2.2.2 Spaced Seeds WALT employs periodic spaced seed (010)* originally described by Chen et al. (2009). The ones ("1") in the spaced seeds indicate positions that must match, while the zeros ("0") indicate positions that are not required to match (Ma & Li, 2007). The star ("*") indicates repeating of the spaced seed pattern 010 until the spaced seed reaches the desired weight, or the end of a sequence to which it is applied. When the spaced seed is applied to a sequence, the bases aligned to ones ("1") are extracted and concatenated to form a subsequence. As shown in Figure 2.13, the subsequence ATTTTGTGTATTATTT is formed when the spaced seed is applied to the sequence. Figure 2.13: An example of applying periodic spaced seed (010)* WALT applies three periodic spaced seed (010)*, 0(010)* and 00(010)*, each with 0, 1, and 2 shifts of the original spaced seed (010)*, respectively, to each read to extract three subsequences. These three subsequences for each read are the three seeds used for mapping, as shown in Table 2.3. Table 2.3: Three spaced seeds used in WALT (1) 010010010010010010010010010010010010010010010010010010010... (2) 001001001001001001001001001001001001001001001001001001001... (3) 000100100100100100100100100100100100100100100100100100100... The seed weight for a read is determined by the length of the third spaced seed that reaches the end of the read, and the formula is shown in Equation 2.1, where sw is the seed weight and rl is the read length. Figure 2.14 shows an example of how WALT applies three spaced seeds to a read to extract three subsequences, which are three seeds used for mapping. 28 sw =b rl−2 3 c (2.1) Figure 2.14: An example to show how WALT obtains three seeds from a read WALT preprocesses a reference genome with spaced seed (010)* applied to each genomic position to build an index table, applies three spaced seeds, (010)*, 0(010)* and 00(010)* (each with 0, 1, and 2 shifts to the spaced seed (010)*) to each read, respectively, and searches the index table three times. Therefore, the genomic positions that match the read exactly will be found by the first seed, the genomic positions that have one mismatch to the read will be found by the first or second seed, and the genomic positions that have two mismatches or more to the read will be found by the first, second, or third seed. Based on this observation, we design a simple strategy: once WALT finds an exact match using the first seed, it stops searching because the best unique solution has been found. Otherwise, WALT continues searching with the second seed: if it finds a genomic position with only one mismatch during the first two searches, it stops searching because the best unique solution with one mismatch has been found. Otherwise, WALT combines results from all three searches to find the best solution. In this case, the best solution will have at least two mismatches to the read. WALT also supports two other periodic spaced seeds (10100)* and (1110100)*, as shown in Appendix A. 29 2.2.3 Build Index Table WALT is a hash table based mapping program, and four hash tables are built on four copies of reference genomes, OT, OB, CTOT and CTOB (Figure 2.3). Figure 2.15: Applying periodic spaced seed (010)* to a genome sequence Figure 2.15 illustrates how WALT builds a hash table using the periodic spaced seed(010)*. Foreverygenomicpositioninthereferencegenome, WALTappliesthe spaced seed (010)* to extract a subsequence starting from that position. For exam- ple, for genomic position 0, the extracted subsequence is TAGATTTTATTAA..., and for genomic position 4, it is AATTTGATATAT.... In this illustration, the first k = 3 nucleotides of the subsequences are used as the hash key to index the genomic position in the corresponding hash bucket, but in practice, WALT uses k =12. Figure 2.16 shows the hash table for the reference genome shown in Figure 2.15. The full hash table is shown in Appendix B. The total number of hash keys is 3 k . In each hash bucket, WALT sorts all candidates (extracted subsequences) in the dictionary order, and applies a binary search to identify exact matches. In practice, only genomic positions are stored in the index table, instead of the actual 30 Figure 2.16: Hash table for the reference genome shown in Figure 2.15 subsequences. As shown in Figure 2.16, there are 5 genomic positions, 29, 24, 14, 7 and 9, in the hash bucket with key ATT, and these positions are sorted in dictionary order, ATT, ATTAA, ATTATATT, ATTTGATATAT, ATTTTATTAA, respectively. 2.2.4 Bisulfite Mapping WALT uses the first k nucleotides in the seed to locate the hash bucket and binary search is applied for nucleotides after the first k. As shown in Figure 2.17, the first seed for the read AATATGATTTTGTTAAATTTAATA is ATTTTATT. The first 3 nucleotides ATT is used as the hash key to locate the hash bucket. The nucleotides after the first 3 in the seed are TTATT. There are 5 positions in the reference genome in the same hash bucket, and they are sorted (Figure 2.16). WALT applies a binary search to find exact matches for TTATT. As shown in Figure2.16, outof5positions, onlythesuffixofthegenomicposition9withspaced subsequence TTATTAA matches TTATT. Then, WALT validates position 9 by 31 aligning the whole read (Figure 2.17) against the genome sequence (Figure 2.15) to count the number of mismatches. Figure 2.17: Spaced seed applied to a read and searching the index table Since WALT applies three spaced seeds to each read, WALT searches the index table using the three extract subsequences of the read, and computes the number of mismatches. 2.2.5 Implementation Figure 2.18 shows the workflow of WALT for mapping bisulfite reads. There are two main parts. The first part includes generating four converted genomes and building indexes for each of them. The other part is to map converted bisulfite reads to the converted genomes using index tables. Ns in the reads and reference genomes are randomly converted to As, Cs, Gs and Ts with equal probability. Each index is a hash table data structure which is a list of (key, value) pairs. The hash key is a unsigned integer number in [0, 4 k −1], and the hash value is the genomic positions in the reference genome started with a specific k-mer. For each genomic position, the first k nucleotides in the spaced seed suffix are converted to a unsigned integer which is treated as a hash key to put the genomic position to the corresponding hash bucket. Similarly, for a read, the first k nucleotides in the spaced seed suffix are converted to a unsigned integer which is treated as a hash 32 key to locate the hash bucket. Each nucleotide takes 2 bits, 00 for A, 01 for C, 10 for G and 11 for T. Figure 2.18: The workflow of WALT for bisulfite read mapping As we discussed in Table 2.1, the memory for human genome hash table is 11.2 GB, when k is 12. Additionally, the human genome takes about 3 GB memory. So the total memory for each index is about 14.2 GB, which is 56.8 GB for four indexes. In order to put indexes in a 16 GB memory computer, WALT loads n (n is 1,000,000 by default) reads or pair of reads to memory, then loads one index at a time, and then WALT maps thesen reads to the index. After finishing mapping, WALT loads another index and maps. In this cases, only one index is loaded in the memory, which is smaller than 16 GB. Hash table data structure is a list of (key, value) pairs. Since WALT loads the four index tables multiple times, here we used two arrays to implement the hash table data structure in order to reduce the load time. Figure 2.19 shows the two arrays in WALT: indicator array and positions array. The indicator array indicates the starting index for each hash bucket, and its length is 4 k , where k 33 Figure 2.19: The hash table data structure in WALT is the length of the hash keys. The index of the indicator array represents the hash key. The positions array stores all the genomic positions in each bucket, and its length equals to the length of genome. For example, if the indicator array is{I 0 ,I 1 ,I 2 ,...,I 4 k −1 }, and the positions array is{P 0 ,P 1 ,P 2 ,...,P L }, then the genomic positionsP I i ,P I i +1 ,P I i +2 ,...,P I i+1 −1 are in theith hash bucket, and their hash key is i. The genomic positions with the same hash key are in a consecutive region in the positions array. 2.3 Results 2.3.1 Data Sets The performance of WALT was assessed in comparison with the state-of-the- art bisulfite mapping programs Bismark (Krueger & Andrews, 2011) and BSMAP (Xi et al., 2012). Three other programs, BS-Seeker2 (Guo et al., 2013), BatMeth (Lim et al., 2012) and BWA-meth (Pedersen et al., 2014) were also included in this comparison, and their results were put in Appendix C. We used three data sets: (1) SRR1532534 (Gao & Xi, 2014), (2) SRR948855 (Fortin et al., 2014) and (3) SRR2296821 (Yong-Villalobos et al., 2015) for the comparison, evaluating both speed and accuracy. All of them are paired end data 34 by Illumina HiSeq 2000. The read length, reference genome and number of reads in each data set are shown in Table 2.4. These three data sets cover bisulfite reads from different species with multiple read lengths. Table 2.4: Information about data sets used Data set Read length Genome # of reads (1) SRR1532534 90 Human (hg19) 50,000,000 (2) SRR948855 100 Human (hg19) 50,000,000 (3) SRR2296821 90 Arabidopsis (TAIR10) 21,543,659 PASS-bis (Campagna et al., 2013) and segemehl (Otto et al., 2012b) were not included because they did not complete the single-end read mapping within 100 hours for date sets SRR1532534 and SRR948855. MethylCoder (Pedersen et al., 2011b) was superseded by BWA-meth. BRAT-BW (Harris et al., 2012) did not include read names (id) in the output file, and BRAT-nova (Harris et al., 2016) did not complete building index for human genome hg19 within 100 hours, so they were not included in comparisons. All tests were run using identical Intel Xeon processors using a single core. The details on test environment and command lines for running each program are shown in Appendix F. 2.3.2 Comparison of Runtime For the two human data sets, Table 2.5 shows a comparison of runtime for WALT, Bismark and BSMAP. WALT is about 36× faster than Bismark and 7× faster than BSMAP in single-end mapping. For paired-end mapping, WALT is approximately 15× faster than Bismark and 7× faster than BSMAP. WALT does a look-up for 3 seeds when reads have more than 1 mismatch, but only one seed for exact matches, and two for one or two mismatches (Figure 2.14). 35 Table 2.5: Runtime of WALT, Bismark and BSMAP (hours) Single-end Paired-end WALT Bismark BSMAP WALT Bismark BSMAP (1) 0.71 26.85 4.14 2.43 38.72 12.32 (2) 0.85 29.13 6.09 2.61 37.10 21.55 (3) 0.09 3.11 0.09 0.32 7.46 0.22 Since about 61.8% and 11.1% of reads are matched with 0 and 1 mismatch, respec- tively, the average number of lookups per read is only 1.49 (Table E.1). The A. thaliana genome is relatively small, leading to fewer candidates in each hash bucket. Even with this small genome, WALT nearly matches the speed of BSMAP and is still approximately 25× faster than Bismark. 50 60 70 80 90 100 0 1 2 3 4 5 6 Read Length Runtime (hours) Single−end 50 60 70 80 90 100 0 1 2 3 4 5 6 7 8 9 Read Length Runtime (hours) Paired−end WALT Bismark BSMAP WALT Bismark BSMAP Figure 2.20: Runtime as a function of read length 36 Runtime on Reads with Different Lengths Figure 2.20 shows the runtime on reads with different lengths generated from SRR948855. The full comparison of 6 bisulfite mapping programs are shown in Table D.1. WALT runs faster on longer reads, while Bismark and BSMAP slow down significantly. WALT and BSMAP both are based on hash table method. The time required to validate each candidate genomic position is proportional to the read length. However, WALT uses heavier seeds for longer reads and gets less false positive candidates, while BSMAP uses fixed seed length. Bismark uses Bowtie to map the reads, which essentially conducts an inexact search in a compressed suffix array, which is also more costly for longer reads (Section 2.1.4). Table 2.6: Runtime of WALT and BSMAP on different number of threads (hours) Single-end Paired-end # of WALT BSMAP WALT BSMAP threads Runtime Speedup Runtime Speedup Runtime Speedup Runtime Speedup 1 20.54 1.00 71.56 1.00 60.43 1.00 268.76 1.00 2 12.58 1.63 39.79 1.80 34.70 1.74 135.59 1.98 4 8.14 2.52 20.64 3.47 23.99 2.52 58.24 4.61 8 3.90 5.27 11.12 6.44 20.86 2.90 32.17 8.35 16 3.49 5.88 5.73 12.48 16.00 3.78 18.93 14.20 Runtime on Multiple Threads Table 2.6 shows the runtime of WALT and BSMAP on different number of threads. In this test, all 988,453,793 reads pair in data set SRR153253 are used. SinceBismarkdoesnotreallysupportmultiplethreads, theresultofBismarkisnot included. As shown in the table, WALT is faster than BSMAP in 1, 2, 4, 8, and 16 threadsforbothsingle-endandpaired-endmapping. For2or4threads, WALThas agoodspeedup, butupto8ormore, WALTdoesnotshowagoodspeedup. WALT 37 uses openMP, and BSMAP uses pThread for multi-thread programming. pThread could manfully optimize each thread and get good speedup, while openMP could not. Furthermore, in WALT, all threads use the same index, memory collision becomes a serious problem when the number of threads is getting more. Uniquely Mapped Reads in Small Hash Buckets We have observed that genomic positions are not evenly distributed in the hash buckets. Some hash buckets are extremely large, for example, with millions of genomic positions. If a read hits a large bucket, validating these candidate positions takes a large amount of time. Furthermore, such buckets will be hit by millions of reads as each of these candidate positions can be sequenced. Here we discovered that by using periodic spaced seeds (010)*, 0(010)* and 00(010)*, the majority of uniquely mapped reads are located in small hash buckets. Tables 2.7 and 2.8 show the accumulated percentage of uniquely mapped reads located in different bucket sizes when considering candidates in all hash buckets 3 . More than 87% of uniquely mapped reads are located in the hash buckets with only one candidate. More than 99% of uniquely mapped reads are located in the hash buckets with less than 5,000 candidates. If we only consider hash buckets with less than 5,000 candidates, we could identify about 99% of uniquely mapped reads. Tables2.9and2.10showthecomparisonofruntimeandthenumberofuniquely mapped reads between WALT-all and WALT. WALT-all keeps all candidates in 3 WALT-all is the program kept all hash buckets, and WALT is the program only kept the hash buckets with size less than 5,000. 38 Table 2.7: Accumulated percentage of uniquely mapped reads as a function of bucket size for WALT-all in single-end mapping Uniquely mapped reads 1 5 10 50 100 500 1000 5000 (1) 42,259,476 0.877 0.922 0.935 0.958 0.966 0.981 0.985 0.993 (2) 41,701,812 0.893 0.932 0.944 0.964 0.971 0.984 0.987 0.995 (3) 15,667,881 0.910 0.967 0.977 0.991 0.996 1.000 1.000 1.000 Table 2.8: Accumulated percentage of uniquely mapped reads as a function of bucket size for WALT-all in paired-end mapping Uniquely mapped reads pair 1 5 10 50 100 500 1000 5000 (1) 39,825,389 T-rich 0.863 0.910 0.922 0.946 0.954 0.970 0.976 0.989 A-rich 0.861 0.909 0.921 0.945 0.953 0.970 0.976 0.988 (2) 38,977,964 T-rich 0.885 0.926 0.937 0.957 0.964 0.977 0.981 0.992 A-rich 0.885 0.926 0.937 0.957 0.965 0.978 0.982 0.992 (3) 15,509,508 T-rich 0.882 0.956 0.969 0.989 0.995 1.000 1.000 1.000 A-rich 0.882 0.957 0.969 0.989 0.994 1.000 1.000 1.000 hash buckets, while WALT filters out hash buckets with more than 5,000 can- didates. WALT not only is significantly faster than WALT-all, but also obtains almost the same percentage of uniquely mapped reads. For data sets SRR1532534 and SRR948855, WALT-all obtained a little less uniquely mapped reads than WALT since the second or the third seeds for a few reads are located in the hash bucket with more than 5,000 candidates. If these candidatesarefilteredout,thesereadsareuniquelymapped,butthesereadsmaybe ambiguously mapped if considering more candidates. For data set SRR2296821, the number of candidates for WALT-all and WALT is quite similar since genome Arabidopsis (TAIR10) is relatively smaller, and almost all hash buckets are less than 5,000. 39 Table 2.9: Runtime and percentage of uniquely mapped reads for WALT-all and WALT in single-end mapping (hours) Runtime Uniquely mapped reads # of candidates WALT-all WALT WALT-all WALT WALT-all WALT (1) 1.58 0.71 42,259,476 (84.52%) 42,276,826 (84.55%) 27,621,265,999 7,073,731,529 (2) 1.24 0.85 41,701,812 (83.40%) 41,713,985 (83.43%) 18,668,641,433 6,107,274,297 (3) 0.11 0.09 15,667,881 (72.73%) 15,666,896 (72.72%) 386,743,773 379,015,237 Table 2.10: Runtime and percentage of uniquely mapped reads for WALT-all and WALT in paired-end mapping (hours) Runtime Uniquely mapped reads pair # of candidates WALT-all WALT WALT-all WALT WALT-all WALT (1) 6.32 2.43 39,825,389 (79.65%) 39,526,217 (79.05%) 103,040,472,355 24,093,187,956 (2) 4.66 2.61 38,977,964 (77.96%) 38,790,909 (77.58%) 67,198,887,286 20,674,726,455 (3) 0.35 0.32 15,509,508 (71.99%) 15,508,389 (71.99%) 1,183,486,797 1,157,661,023 2.3.3 Comparison of Accuracy Mapping accuracy is essential in WGBS as incorrect mapping may lead to bias in estimates of methylation levels and other epigenomic features in downstream analyses. Ambiguouslymappingreads–thoseforwhichthebestmatchinggenomic position is not unique – are excluded from most analyses as their interpretation requires project-specific considerations. Only the uniquely mapped reads which have only one matching genomic position with the minimal number of mismatches are used for downstream analyses. Previouslythenumberofuniquelymappedreadswasusedtoevaluatetheaccu- racyofbisulfitemappingprograms. Inseveralresearchpapers, theauthorsclaimed that their programs have better accuracy since they obtained more uniquely mapped reads than others. However, it is not true that the program reported more uniquely mapped reads is better. For example, as shwon in Figure 2.21, the read is matched to two genomic posi- tions with 0 mismatch, and this read should be reported as ambiguously mapped. If a bisulfite mapping program A only finds one of the 2 genomic positions, and 40 Figure 2.21: An example of an ambiguously mapped read it reports this read is uniquely mapped. Another bisulfite mapping program B finds both genomic positions and reports this read is ambiguously mapped. In this case, program A reports more uniquely mapped reads than program B. However, apparently, program B is more accurate than program B. To know whether a read is uniquely mapped or not, the only way is to obtain all mapping genomic positions within certain number of mismatches. If there is only one genomic position with minimal number of mismatches, then the read is uniquely mapped, otherwise, it is not. We used the program mrsFAST (Hach et al., 2010), which is a program designed to produce all mapping positions, with C→T converted reads and refer- ence genome, to obtain all possible mapping genomic positions for each read. This provided a ground truth in measuring mapping accuracy. For a uniquely mappable read (as determined by mrsFAST), if the bisulfite mapping program reports it as uniquely mapping, then it counts as true positive (TP), otherwise false negative (FN). For a read found to be ambiguously mapped or unmapped, if the bisulfite mappingprogramreportsitassuchitcountsastruenegative(TN), otherwisefalse positive (FP). We then compute precision and recall for each bisulfite mapping program. Recall = TP TP +FN (2.2) 41 Precision= TP TP +FP (2.3) As shown in Table 2.11, WALT had the best precision while BSMAP had the best recall on all data sets. Bismark had the lowest recall and precision. BSMAP reported 27% and 293% 4 more incorrect uniquely mapped reads than WALT in single- and paired-end mapping, respectively. Table 2.11: Recall (R) and precision (P) of WALT, Bismark and BSMAP. Single-end Paired-end WALT Bismark BSMAP WALT Bismark BSMAP (1) R 0.982 0.973 0.994 0.965 0.968 0.991 P 0.991 0.976 0.988 0.984 0.943 0.967 (2) R 0.984 0.974 0.995 0.971 0.972 0.993 P 0.993 0.969 0.991 0.989 0.935 0.973 (3) R 0.987 0.974 0.995 0.986 0.959 0.994 P 0.995 0.978 0.993 0.996 0.960 0.967 2.4 Discussion WALT also supports different periodic spaced seeds, as shown in Table A.1. The periodic spaced seed (010)* achieves the fastest speed since it only searches one seed for exact matching, two seeds for one mismatch and three seeds for two or more mismatches, and the majority of reads are matched with zero or one mismatch, so the average seed search is only about 1.5 for spaced seed (010)*. Additionally, we tried to convert all Cs to Ts and Gs to As simultaneously for both T-rich reads and A-rich reads. However, since the alphabet size is only two, 4 The percentage of more incorrect uniquely mapped reads is calculated by this formula: (1−P BSMAP )−(1−P WALT ) 1−P WALT . 42 the number of false positive candidates increases significantly which leads to low speed of mapping. 43 Chapter 3 Protein Motif Finding 3.1 Introduction Protein motifs are conserved fragment sequences occurred frequently in protein sequences. Protein motifs have specific functions, such as active site of an enzyme, or a structural unit necessary for proper folding of that protein (Grant et al., 2011). Protein motifs are a basic functional units, so in order to predict the function of a protein, normally, the first step is to identify the motifs in the protein, and according to the function of these motif to predict the function of the protein. A protein motif is a set of protein sequences with the same length and aligned together by multiple alignment methods. Figure 3.1 shows an example of a motif, and this motif contains 10 protein sequences with length 25. A motif could also be represented as a position weight matrix (PWM), which was first introduced by Stormo et al. (1982). A PWM is a matrix in which each row represents a column of the multiple alignment, and each element represents the percentage of an amino acid in the corresponding column. Figure 3.2 shows the PWM for the motif in Figure 3.1. If a motif contains n protein sequences, s 1 ,s 2 ,s 3 ,...s n , and each protein sequence s i = s i1 s i2 s i3 ...s il , l is the length of each protein sequence. Then, the PWM of the motif can be calculated by Equation 3.1. In the Equation, AA k represents the k th animo acids. 44 Figure 3.1: A motif with multiple alignment Figure 3.2: Position weight matrix (PWM) for the motif in Figure 3.1 M jk = 1 n n X i=1 I(s ij =AA k ),(1≤j≤l,1≤k≤20) (3.1) Pr[s 1 s 2 s 3 ...s l ]= l Y i=1 M ip ,(p is the index of animo acid s i ) (3.2) 45 Given the PWM matrix, for any protein sequence with length l, we could calculate the probability that the protein sequence belongs to this motifs, as shown in Equation 3.2. Nowadays, more and more protein sequences have been produced by next- generation sequencing machines, especially metagenomic studies. For example, IGC data set in human gut microbiome (Qin et al., 2010) has about 10 million protein sequences, and these sequences are from different species. Protein motifs finding could fast annotate these protein sequences and figure out what species are in the sample. Protein motifs finding is to search each known motif again these unknown proteins to predict which motifs are occurred in these proteins, and then figure out the function of the protein. For protein sequences, different pairs of amino acids have different scores, as shown in Table 3.1. For a given protein sequence, protein motif finding is to search all substring of this protein which has a higher similar score to a motif sequence using the BLOSUM62 score matrix. Problem Statement: Given a protein sequence S = s 1 s 2 s 3 ...s t and a protein motif PWMM =[m ij ], (1≤i≤l,1≤j≤20), wherem ij denotes the probability that amino acidj could occur at positioni. Find a substring ofS which has length l with the probability higher than thresholdT or withp-value less than a threshold τ. There are several motif finding algorithms have been developed in the past, such as MotifScanner (Mele, 2016), MotifViz (Fu et al., 2004), STORM (Schones et al., 2007), PoSSuMsearch (Beckstette et al., 2006) and FIMO (Grant et al., 2011). PoSSuMsearch and FIMO support protein motif finding, and the others only support DNA motif finding. FIMO scans each protein position and calculates 46 Table 3.1: The BLOSUM62 matrix A R N D C Q E G H I L K M F P S T W Y V A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 theprobabilityandp-valuethatthepositionbelongstothismotifusingPWM.The time complexity isO(l p lm), wherel p andl m are the lengths of protein sequence and motif, receptively. Most of existed motif finding programs are web server-based which could not analyze a huge amount of protein sequences. However, the IGC data set (Qin et al., 2010) has about 10 million protein sequences and total length is about 2.4 billion. It is not computational practical to search all these protein sequences on web server. During my PhD studies, I developed a new protein motif finding algorithm and program, called HSEARCH. HSEARCH converts fixed length protein sequences to data points in high dimensional space, and applies locality-sensitive hashing to fast search protein sequences near a motif . HSEARCH is significantly faster than 47 the brute force algorithm with the same accuracy. HSEARCH is open source and can be downloaded from https://github.com/acgtun/hclust. 3.2 Algorithm of HSEARCH 3.2.1 Introduction Previous protein motif finding methods go through each protein position and calculate the similarity score or p-value using PWM or consensus sequence of the motif. The time complexity is O(l p lm), where l p and l m are lengths of protein sequence and motif, receptively. PoSSuMsearch (Beckstette et al., 2006) builds index for all k-mers in protein sequences and then searches consensus sequence of each motif in the index table to find out all similark-mers for the motif. However, as we discussed in Section 2.1.4, longerk-mers speedup the algorithm but decrease the sensitivity and take huge memory, while shorterk-mers achieve high sensitivity but decrease the speed significantly. In our program, called HSEARCH, first convertsk-mers to data points in high dimensional space, then uses locality-sensitive hashing technique to partition data points near to each other in the same hash bucket, and data points far from each other to different hash buckets. The protein sequences of each motif are also converted to data points, and all data points in the same hash bucket as the center of a motif are candidates. There are two steps in HSEACH algorithm. The fist step converts fixed length protein sequences to data points in high dimensional space and keep the property that two sequences are similar to each other and their corresponding data points are near in high dimensional space. The second step is to fast obtain data points whose distance is less than a threshold T to a given data point q. 48 Problem 1 (Protein Sequence to Coordinates): Given a protein sequence s i =s i1 s i2 s i3 ...s il (1≤i≤n), find a data pointp i =(p i1 ,p i2 ,p i3 ,...,p id ), subject that for any i, j and k, the following four conditions hold. 1. sim(s i ,s j )>sim(s i ,s k )⇔kp i −p j k<kp i −p k k 2.kp i −p j k=0⇔p i =p j 3.kp i −p j k=kp j −p i k 4.kp i −p j k≤kp i −p k k+kp k −p j k Herel is the length of protein sequences i , andd is the dimension of data pointp i . sim(s i ,s j ) denotes the similarity score between two protein sequences s i and s j , and it is defined in Equation 3.3.kp i −p j k denotes the Euclidean distance between two data points p i and p j , and it is defined in Equation 3.4. sim(s i ,s j )= l X k=0 BLOSUM62(s ik ,s jk ) (3.3) kp i −p j k= d X k=1 q (p ik −p jk ) 2 (3.4) The first condition guarantees that two sequences are more similar to each other and their corresponding data points are closer in high dimensional space. The other three conditions keep converted data points in metric space. Problem 2 (T-Nearest Neighbor Search): Given a set of data points P = {p 1 ,p 2 ,p 3 ,...,p n }, and a query data point q in R d , find all p i in P, subject that kq−p i k≤T. 49 3.2.2 Protein Sequence to Coordinate The most straightforward method to convert a protein sequence to a data point is to count the number ofk-mers. The protein sequence is converted to a20 k dimensional vector and each dimension is the number of a specific k-mers in the protein sequence. However, firstly, the order of these k-mers or amino acids are lost during this conversion. Secondly, even for a smallk, the number of dimension 20 k is a huge number, which could be memory and time consuming for downstream data analysis. Last but not least, in the conversion, the BLOSUM62 substitution matrix has not been applied, and all pair of amino acid are treated as the same weight. In order to improve the sensitivity of protein motif finding algorithm, BLO- SUM62 should be included in this conversion. Here we first converted each animo acid to a d A dimensional vector according to BLOSUM62 similarity score matrix. For a protein sequence, a d = d A ×l vector is generated by concatenating the vector of each amino acid. Here d A is the number of dimension for a single amino acid, and l is the length of protein sequence. For a single amino acid, we could think as a protein sequence with a single amino acid and convert it to a vector which satisfied four conditions in Problem 1. Thus, we firstly convert BLOSUM62 similarity score matrix to a distance matrix, and then convert the distance matrix to 20 data points in R d A . Blosum62 Score Matrix to Distance Matrix Let the set AA ={A 1 ,A 2 ,A 3 ,...,A 20 } denote 20 animo acids in Table I.1. BLOSUM62(A i ,A j ) is the score shown in BLOSUM62 score matrix (Table 3.1) for amino acid A i and A j , andkA i −A j k is the distance between amino acid A i 50 and A j . Then, for any 1≤ i,j≤ n, the conversion is defined as the following equation. kA i −Ajk=sim(A i ,Ai)+sim(A j ,A j )−2×sim(A i ,A j ) (3.5) Firstly, since BLOSUM62 is a symmetric matrix, it is easy to prove that Con- ditions 2 and 3 hold in Problem 1. We manually validated 8,000 different combi- nations of triangle inequality for Condition 4 in Problem 1. Fortunately, Equa- tion 3.5 satisfies all 8,000 combinations. Condition 1 in Problem 1 is not fully satisfied. We ramdonly generated 100,00025-merstoestimatethedistortionduring thisconversion. Foreach25-mer, we first used BLOSUM62 similarity score matrix to obtain top q most similar 25- mers, then used distance matrix got from Equation 3.5 to obtain topq most closed candidates. After that, we got the percentage of candidates by using BLOSUM62 and also in the candidates by using distance matrix. As shown in Figure 3.3 shown, when q increases, the percentage decreases a little bit, but not much. More than 80% of 25-mers are conserved during this conversion, which is good enough since BLOSUM62 similarity score matrix itself is an estimation. Multidimensional Scaling After we obtained the 20×20 distance matrix, the next step converts it to 20 vectors in high dimensional space. Multidimensional scaling (MDS) is the tech- nique to convert distance matrix to coordinates (Borg & Groenen, 2005; Groenen et al., 2005; Kruskal & Wish, 1978). 51 1 5 10 20 30 50 80 100 150 200 300 500 1000 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 Number of top candidates to be considered Percentage Figure 3.3: Percentage of 25-mers conserved during similarity to distance conver- sion Problem 3 (Multidimensional Scaling): Given a distance matrix D =[kA i −A j k], (1≤i,j≤20), find x 1 ,x 2 ,x 3 ,...x 20 ∈R d A , subject that σ 2 (x 1 ,x 2 ,x 3 ,...x 20 )= X 1≤i<j≤20 (kA i −A j k−kx i −x j k) 2 (3.6) is minimized. This is also refereed as the least-squares MDS model (Groenen et al., 2005). The least-squares MDS cannot be solved in closed form and it was solved by iterative numerical algorithm, such as SMACOF algorithm (De Leeuw, 1988, 2005; De Leeuw & Heiser, 1980; De Leeuw & Stoop, 1984). In our experiment, we used Matlab mdscale function to obtain x 1 ,x 2 ,x 3 ,...x 20 . 52 3.2.3 Locality-Sensitive Hashing Locality-Sensitive Hashing (LSH) is a popular used method for searching T- nearest neighbor Problem, which finds all data points whose distance is less than a threshold T to a given data point q in R d (Andoni & Indyk, 2004; Datar et al., 2004; Har-Peled et al., 2012). Normally, hash functions are designed for collision as less as possible for fast searching. However, LSH functions are designed that data points closer to each other have more chance to be hashed into the same bucket than data points far from each other, which is shown in Equation 3.7. Here H denotes a LSH function which hashes a data point to an integer, and Pr denotes the probability. Pr[H(p i )=H(p j )]>Pr[H(p s )=H(p t )]⇔kp i −p j k<kp s −p t k (3.7) LSH originally is designed for Hamming Distance in binary data, and later Datar et al. (2004) extended LSH to support for Euclidean Distance in high dimen- sional data points. LSH function for Euclidean Distance is based on random pro- jection, as shown in Figure 3.4. Two data points is closer in high dimensional space intuitively could be projected near to each other in a random line. As shown in the figure, data point A is near to data point B, and the projection of A is also closer to the projection of B. However, there are counter examples as shown in Figure 3.5. In high dimen- sional space, data point B is near to data point A, but the projection of B is near to the projection of C. Thus, in order to reduce the false positive data points hashed to the same bucket, normally, multiple random lines are chosen for projection. h a,b (p)=b a·p+b w c (3.8) 53 Figure 3.4: Projection of points to random line Equation 3.8 shows the LSH function family for Euclidean Distance. a is a d-dimensional random vector from Gaussian distribution, and b is a real number uniformly randomly drawn from [0,w), where w is the bucket width, as shown in Figure 3.4. p is any data point from P ={p 1 ,p 2 ,...,p n }. Since a is drawn from Gaussian distribution which is a 2-stable distribution. For any vectora from Gaussian distribution,a·p i −a·p j has the same distribution askp i −p j kX, where X follows Gaussian distribution (Zolotarev, 1986). For any two data points p i and p j projected to the same bucket, there are two conditions. The first one is|a·p i −a·p j | < w, which is the projection distance less than the bucket width. The second one is that no bucket boundary locates between the projection of p i and the projection of p j on the random line, and the probability is 1− |a·p i −a·p j | w . 54 Figure 3.5: Projection of points to random line For the first condition, |a·p i −a·p j |<w⇔|a·(p i −p j )|<w (2-stable distribution) ⇔|kp i −p j kX|<w ⇔kp i −p j k|X|<w ⇔|X|< w kp i −p j k (3.9) 55 SinceX∼N(0,1), then Y=|X| is a half normal distribution, whose probability density function is shown below f Y (y)= 2 √ 2π e − y 2 2 ,(y>0) (3.10) According to Equation 3.9, the first condition isY < w kp i −p j k . Letc=kp i −p j k, then Pr[Y < c w ]= Z w c 0 f Y (y)dy (let t=cy) = Z w 0 t c f Y ( t c )dt = 1 c Z w 0 f Y ( t c )dt (3.11) For the second condition, letPr[c no ] denote the probability that no bucket bound- ary locates between the projection of p i and the projection of p j is, Pr[c no ]=1− |a·p i −a·p j | w ] =1− kp i −p j k|X| w =1− cY w =1− t w (3.12) Therefore, for any data points p i , p j and c =kp i −p j k, the probaility that they are hashed to the same bucket is Pr[c]=Pr[Y < c w ]Pr[c no ] = 1 c Z w 0 f Y ( t c )dt(1− t w ) = Z w 0 1 c f Y ( t c )(1− t w )dt (3.13) 56 Pr[c] is monotonically decreasing inc when the bucket width is fixed, as shown in Figure 3.6. 0 100 200 300 400 500 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Euclidean Distance Probability w=50 Figure 3.6: Probability of two points hashed to the same bucket as a function of distance In order to reduce the chance that two points far from each other hash to the same bucket, normally, multiple random lines are used, if two data points are projected to the same bucket in all random lines, then the two data points are put in the same bucket in the hash table. Additionally, to increase the chance that two points near to each other could hash to the same bucket, generally, multiple hash tables are built on the data set. Thus, for any data point p, the hash valueh(p) is shown as fellow: h(p)=(h a 1 ,b 1 (p),h a 2 ,b 2 (p)...,h a K ,b K (p)) (3.14) Then for searching data points p i in P wherekq−p i k < T, all data points in hash buckets g 1 (q),g 2 (q),...,g L (q) should be validated, where g i (q) denotes a 57 hash bucket with date points that has the same hash value as q in i th hash table. Then the probability that two points p i and p j hash to the same bucket in of the of L hash table is Pr K,L [c]=1−(1−Pr[c] K ) L (3.15) 30 40 50 60 70 80 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 Euclidean Distance Probability K8L16W50 K8L32W50 K8L64W50 K16L16W50 K16L32W50 K16L64W50 K32L16W50 K32L32W50 K32L64W50 Figure 3.7: Probability of two points hashed to the same bucket with different K and L Figure 3.7 shows several curves of probability that two points have the same hash value with different K and L. As we can see, the increasing of K could decrease the probability, and the increasing of L could increase the probability. Definition (Locality-sensitive hashing): A familyH is called (R 1 ,R 2 ,P 1 ,P 2 )- sensitive if for any p i ,p j ∈P •kp i −p j k≤R 1 then Pr[kp i −p j k]≥P 1 58 •kp i −p j k>R 2 then Pr[kp i −p j k]≤P 2 Normally, R 1 < R 2 , P 1 > P 2 and R 2 = R 1 (1+ε), where ε is a smaller real number and P 1 =Pr[R 1 ] and P 2 =Pr[R 2 ]. Time Complexity: To search the near neighbors, there are two steps: build LSH tables and search. The time complexity for the first step is O(nKLd), where n is the number of data points, and d is the dimension of each data point. The time complexity for the second step to search one query isO(L#collisions), where #collisions is the number of data points in the hash bucket in which the query located. Let K =log 1/P 2 n , ρ= log(1/P 1 ) log(1/P 2 ) and L=n ρ , then Pr K,L [R 1 ]=1−(1−Pr[R 1 ] K ) L =1−(1−P K 1 ) L =1−(1−P log 1/P 2 n 1 ) n ρ =1−(1−n − log1/P 1 log1/P 2 ) n ρ =1−(1−n −ρ ) n ρ =1− 1 e (3.16) Pr[R 2 ]=Pr[R 1 ] K =P log 1/P 2 n 2 = 1 n (3.17) From Equation 3.16, if K and K =log 1/P 2 n and L=n ρ , two data points with distance less thanR 1 have more than 1 2 probability to be in the same hash bucket. From Equation 3.17, for a query data pointq, the expected number of data points with distance larger than R 2 to q is at most 1 in one hash table, then the total expected number of data points with distance larger than R 2 is at most L. 59 3.2.4 Motif Finding For a given protein database, each protein sequence is converted to l−k+1 k-mers and eachk-mer is converted to ad-dimensional data point. LSH tables are built on these data points. In order to search motifs in these protein sequences, we obtained motif sequences from Pfam database (Bateman et al., 2004). Each protein sequence in a motif is also converted to a data point, and the center data point of all protein sequences in a motif is set as a query data point. The query q represents a motif and uses the LSH technique to obtain all data point p where kp−qk≤T. 0 20 40 60 80 100 120 0 1 2 3 4 5 6 7 8 9 Percentage Pairwise Distance 25−mers Inside data points Outside data points Figure 3.8: Percentage of data points inside and outside motifs to the centers as a function of distance To set the distance threshold T, we took all motifs in Pfam database and investigated distance to the center from data points inside motifs and random 60 data points (Figure 3.8). Additionally, we randomly selected 100,000 25-mers from IGC protein database, and plotted the frequency of pairwise distance from these 25-mers to centers of motifs, as shown in Figure 3.9. The frequency roughly follows a normal distribution N(54.45,4.58 2 ), and the probability that pairwise distance less than 32.66 is 1.0×10 −6 . Thus, if a data points in LSH table has a distance less than 32.66 to the query, the corresponding 25-mer belongs to the motif represented by the query with the p-value 1.0×10 −6 . Therefore, for fixed length motif sequences, we set the thresholdT is 32.66 for 25-mers. In our results, we also tested the p-vale 1.0×10 −4 with threshold T 34.90. Different lengths of k-mers have different threshold which are obtained from Pfam database. 30 40 50 60 70 80 0 0.5 1 1.5 2 2.5 3 3.5 x 10 4 Pairwise Distance Number of pairs N(54.45, 4.58 2 ) x=32.66 p−value=10 −6 Figure 3.9: The number of data point pairs as a function of distance 61 3.3 Results 3.3.1 Data sets We compared the performance of HSEARCH with brute force algorithm on a data set with 18,000,000 k-mers. The length of k-mers is 25, and all of them are randomly selected from IGC data set (Qin et al., 2010). Furthermore, we randomly selected 574 motifs from Pfam-A.seed database (Bateman et al., 2004), and each motif with at least 50 protein sequences. All motif sequences are trimmed to 25-mers and converted to d-dimensional data points. The center of each motif sequences are treated as queries for searching k-mers from IGC data set, and a k-mer with a distance less than thresholdT to a center of a motif, then thek-mer is reported as a sequence of that motif. Let the set P ={p 1 ,p 2 ,p 3 ,...,p n } are the k-mers and C ={c 1 ,c 2 ,c 3 ,...,c m } are the centers of motifs. The result of the brute force algorithm is a set of pairs R b ={(p i ,c j )|kp i −c j k≤ T,1≤ i≤ n,1≤ j≤ m} and the result of HSEARCH R H is a subset of R b . weight(p i ,c j )= 1 kp i −c j k−λ kp i −c j k−λ>1.0 1.0 else (3.18) TP = n X i=0 m X j=0 weight(p i ,c j )I[(p i ,c j )∈R b ∧(p i ,c j )∈R H ] (3.19) FN = n X i=0 m X j=0 weight(p i ,c j )I[(p i ,c j )∈R b ∧(p i ,c j ) / ∈R H ] (3.20) Recall = TP TP +FN (3.21) 62 Bruteforcealgorithmobtainsallpairsof(p i ,c j ), wherekp i −c j k≤T. Weaccess the accuracy of HSEARCH use recall defined in Equation 3.21. True positive (TP) and False negative (FN) are defined in Equations 3.19 and 3.20. Each pair has a weight which is defined as shown in Equation 3.18, wehreλ is the minimal distance among all pairs. 3.3.2 Comparison of Runtime and Accuracy We assessed HSEARCH runtime on different percentages (Recall defined in Equation 3.21) of brute force results. Brute force algorithm took 1.72 hours for searching 18,000,000 k-mers on 574 moitfs. Tables 3.2 and 3.3 show the runtime on different percentages of brute force results forp-value1.0×10 −6 and1.0×10 −4 . With fixed K and L, we used different bucket width W in the LSH method to obtainvarietyofpercentageofbruteforceresults. Asthetablesshown, HSEARCH is about 1.5× faster than brute force method when got the same result. With decreasing of the percentage, the speedup of HSEARCH increased significantly. Importantly, since in LSH method, the data points closer to the query has higher probability to be located in the same hash bucket, with decreasing of the percent- age, most of data points with larger distance to the query could be excluded in HSEARCH result, and the data points near the query be still in the HSEARCH result. Table 3.2: Runtime(hours) of HSEARCH on percentage of results (p-value:10 −6 ) % 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Runtime 0.07 0.10 0.15 0.19 0.23 0.37 0.53 0.60 0.78 0.92 Speedup 24.6 17.2 11.5 9.1 7.5 4.6 3.2 2.9 2.2 1.9 63 Table 3.3: Runtime(hours) of HSEARCH on precentage of results (p-value:10 −4 ) % 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Runtime 0.07 0.10 0.15 0.20 0.26 0.46 0.54 0.58 1.05 1.11 Speedup 24.6 17.2 11.5 8.6 6.6 3.7 3.2 3.0 1.6 1.5 3.4 Discussion 3.4.1 Protein Motif Clustering Sequence clustering is a fundamental technique which has many important applications. Firstly, inmetagenomicstudies, DNAorproteinsequencescrossvari- ety of species (Sharpton, 2014), and clustering these sequences based on sequence similarity could predict how many different species in the sample and what they are. For example, clustering 16S rRNA sequences into Operational Taxonomic Units (OTUs) is one of the most important applications of sequence clustering. Secondly, the size of sequence database is getting larger and larger, such as NCBI (National Center for Biotechnology Information) non-redundant database doubles every two years (Hauser et al., 2013). Clustering sequences in the database could reduce the size of the database and improve the speed for sequence search without lose much sensitivity. Thirdly, in motif discovery, clustering sequences into differ- ent groups, if the size of certain group is significantly larger than the expected size, it has high probability that this group is a sequence motif with specific function. Several sequence clustering programs have been developed in the past decades. For example, CD-HIT (Li & Godzik, 2006) uses incremental greedy methods. The first sequence is set as a representative sequence for the first cluster. Then each query sequence is compared with representative sequences in existing clusters. If the similarity between the query sequence and the current compared representative 64 sequence is higher than certain threshold, the query sequence is added to that clus- ter. If no existing representative sequence is found, a new cluster is generated and the query sequence is the representative sequence for the new cluster. There are several drawbacks for this algorithm. Firstly, the clustering results depend on the order of the input sequences. Secondly, the algorithm stops searching the represen- tative sequences when there is a representative sequence whose similarity is higher than the threshold. In this case, probably there is a more similar representative sequence in one of the existing clusters. Thirdly, if the number of clusters is larger, the time complexity is nearly quadratic to the number of sequences. UCLUST (Edgar, 2010) is similar to CD-HIT but uses different lengths of k-mers to reduce false positive candidates. kClust (Hauser et al., 2013) improves the accuracy by the following strategies. Firstly, it sorts all sequences by sequence lengths in descending order. Then, it uses the similar method as CD-HIT, but kClust searches all existing representative sequences and if the similarity between the query sequence and the most similar representative sequence is higher than certain threshold, the query sequence is addedtothatcluster. Otherwise, anewclusterisgeneratedandthequerysequence is the representative sequence for the new cluster. However, all these existing clustering algorithms are not faster enough for clus- tering larger data set, such as IGC data set in human gut microbiome (Qin et al., 2010). Importantly, all these algorithms has a very low accuracy in clustering pro- tein sequences. The majority pair of sequences should be in one cluster, but they clustered to different clusters. During my PhD studies, I developed a new protein sequence clustering algorithm based on locality-sensitive hashing technique which is dramatically more accurate than CD-HIT, UCLUST and kClust. 65 Problem Statement: Given a set of protein sequences S ={s 1 ,s 2 ,s 3 ,...,s n }, partition n protein sequences into a set of clusters C ={c 1 ,c 2 ,c 3 ,...,c m }, where each cluster is a set of protein sequences, c i ={s i 1 ,s i 2 ,s i 3 ,...,s im i }, 1≤ i≤ m. Here n is the number of protein sequences, m is the number of clusters, and m i is the number of protein sequences in cluster c i . Let sim(s i ,s j ) denotes the simi- larity between two protein sequences s i and s j . The objective of protein sequence clustering is to maximize P m i=1 P 1≤j,k≤m i ,j6=k sim(s i j ,s i k ). Unfortunately, to achieve the objective of protein sequence clustering is a NP- hard problem. Thus, many heuristic algorithms are developed, such as to set a threshold T, and similarity between two sequences in one cluster should be larger than T. However, this methods cannot guarantee the strategy. For example, sim(s i ,s j ) > T, sim(s i ,s k ) > T and sim(s j ,s k ) < T, according to the objective, s i and s j should be in the same cluster, and s i and s k should be in the same cluster. Then, s j and s k is in the same cluster, but sim(s j ,s k ) < T, which is a contradiction. Thus, sometimes, it is hard to set a strict boundary among clusters, and a specific sequence can be in one cluster and also can be in another cluster. Since it is hard to claim that a sequence in a certain cluster is correct or not, it isnoteasytoevaluatetheperformancefordifferentsequenceclusteringalgorithms. In our study, we built a ground truth data set from Pfam database (Finn et al., 2013), and compared clustering results from different algorithms with the ground truth data set and used recall and precision to evaluate the performance (Rand, 1971). Here we applied a heuristic algorithm in HSEARCH. For a protein sequence s i which has not been added to any cluster, it is set as the representative sequence of a new clusterc j . Protein sequences that have not added to any cluster and whose alignment similarity score with s i are higher than a threshold T are added to c j . 66 The most important step is to search all protein sequences with similarity score larger than T for each representative sequence. As we talked in Section 3.2.3, all protein sequences are converted to data points in high dimensional space, and we used locality-sensitive hashing technique to search near neighbors for each repre- sentative sequence. 20 30 40 50 60 70 80 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 Percentage Pairwise Distance 25−mers Inside motifs Outside motifs Figure 3.10: Pairwise distance inside and outside motifs for 25-mers We assessed the performance of HSERACH for clustering protein sequences on a ground truth data set selected from Pfam database. This data set contains 1,000 motifs with total 316,097 protein sequences. All protein sequences are trimmed to length 25. We ran HSEARCH, CD-HIT, UCLUST and kClust on this data set. If a pair of protein sequences are from one moitf, and the program clustered them to the same cluster, this pair is counted as true positive (TP), otherwise false negative (FN). Similarly, if a pair of protein sequences are from different motifs, and the 67 program clustered them to the same cluster, this pair is counted as false positive (FP), otherwise true negative (TN). The recall, precision defined in Equations 2.2 and 2.3 and F1 (Equation 3.22) are used to evaluate the results. F 1 = 1 2 Recall + 1 Precision = 2×Recall×Precision Recall+Precision (3.22) Additionally, normalized mutual information (NMI) is also used to assess clus- tering accuracy. Let the set of clustersG={g 1 ,g 2 ,g 3 ,...,g s } is the ground truth, and a set of clusters C ={c 1 ,c 2 ,c 3 ,...,c t } is the clustering results from a pro- gram. LetN denote the total number of protein sequences. Then, NMI is defined as follow, NMI(G,C)= 2I(G,C) H(G)+H(C) (3.23) where I(G,C), H(G) and H(C) are defined I(G,C)= s X i=1 t X j=1 |g i ∩c j | N log |g i ∩c j | N |g i | N |c j | N (3.24) H(G)= s X i=1 |g i | N log |g i | N (3.25) H(C)= t X i=1 |c i | N log |c i | N (3.26) We investigated the ground truth data set by plotting pairwise distances for data points inside motifs, and between motifs, as shown in Figure 3.10. Since HSEARCH is a heuristic algorithm, if two data points have distance to the center less thanT, it dose not guarantee that the distance between these two data points is less thanT. Therefore, we did not set the thresholdT to the intersection of these two plots in Figure 3.10, but a little smaller than the intersection. In HSEARCH, for 25-mers, the threshold T is set to 50. 68 Table 3.4: Recall(R), Precision(P), and F1 for CD-HIT, UCLUST and kClust Similarity 10% 20% 30% 40% 50% 60% 70% 80% 90% CD-HIT R - - - - - - 0.025 0.011 0.002 P - - - - - - 0.998 0.999 1.000 F1 - - - - - - 0.049 0.021 0.004 NMI - - - - - - 0.752 0.731 0.710 UCLUST R 0.092 0.092 0.092 0.091 0.080 0.055 0.023 0.007 0.002 P 0.960 0.960 0.960 0.965 0.985 0.997 0.998 0.999 0.999 F1 0.168 0.168 0.168 0.166 0.148 0.104 0.044 0.014 0.004 NMI 0.799 0.799 0.799 0.798 0.793 0.778 0.751 0.725 0.710 kClust R - 0.075 0.074 0.071 0.068 0.060 0.029 0.011 0.002 P - 0.998 0.998 0.998 0.998 0.997 0.997 0.999 0.999 F1 - 0.140 0.137 0.133 0.128 0.114 0.056 0.022 0.004 NMI - 0.792 0.791 0.789 0.787 0.780 0.753 0.730 0.708 Table 3.5: Runtime for CD-HIT, UCLUST and kClust (hours) Similarity 10% 20% 30% 40% 50% 60% 70% 80% 90% CD-HIT - - - - - - 0.002 0.002 0.002 UCLUSTR 0.005 0.005 0.005 0.005 0.012 0.015 0.010 0.008 0.007 kClust - 0.027 0.022 0.015 0.012 0.009 0.007 0.012 0.016 Tables 3.4 and 3.5 show the accuracy and runtime comparison for CD-HIT, UCLUST and kClust on different sequence similarity identities. CD-HIT does not support sequence identity less than 70% for 25-mers. kClust does not support sequence identity less than 20%. Table 3.6 shows the runtime and accuracy for HSEARCH in motif clustering. All four programs ran very fast in the ground truth data set. However, CD-HIT, UCLUST and kClust have recall less than 10% for all different sequence identities. As we talked above, clustering protein sequences Table 3.6: Runtime (hours) and accuracy for HSEARCH in motif clustering L 4 8 16 Runtime 0.032 0.036 0.030 Rcall 0.286 0.399 0.496 Precision 0.827 0.823 0.815 F1 0.425 0.537 0.617 NMI 0.859 0.880 0.897 69 based on sequence identity is not sensitive enough, since two protein sequences may be homologous even they have low sequence identity. The NMI score for HSEARCH is also significant higher than CD-HIT, UCLUST and kClust. 70 Chapter 4 Conclusion With the development of of next-generation sequencing technologies, a massive amount of DNA and protein sequences have been produced. Fast sequence analysis tools are a big need for many biological applications based on DNA and protein sequences. Sequence search and clustering are the most fundamental methods for many studies in computational biology. Previous sequence search and clustering algorithms are either too slow for large-scale data sets or not sensitive enough for specific applications. During my Phd studies, I developed a bisulfite mapping algorithm and a protein motif finding and clustering algorithm. Challenges of bisulfite read mapping are the small size of alphabet in bisulfite reads and billions of bisulfite reads. Previous programs take more than 800 hours for mapping bisulfite reads from one experiment. Our program, called WALT, builds hash table for k-mers using periodic spaced seeds. WALT is 36× faster the most popular used bisulfite mapping program Bismark. The speedup of WALT causes by the specific designed periodic spaced seeds and taking the advantage that the majority of uniquely mapped bisulifte reads are in small size of hash buckets. WALT is significantly faster than the existing programs and maintains similar accuracy. Importantly, many existing tools slow down with read length, while WALT improves in speed. Protein motifs are conserved fragment sequences occurred frequently in pro- tein sequences. There are many DNA motif finding algorithm were developed in 71 the past decades. However, very few of them support protein motif finding. Pro- tein motif finding and clustering is more complicated than DNA motif finding and clustering. First of all, the alphabet size for DNA is 4, but for protein is 20. Sec- ondly, different pairs of amino acid has different similarity scores, while for DNA, normally, only consider match or mismatch. Therefore, protein motif finding and clustering is more computational intensive than DNA motif finding and clustering. Most of the protein motif finding programs are web server-based which cannot analyze for large data sets. The existing protein or DNA clustering programs are based on the percentage of sequence identity which is not sensitive enough for protein sequence clustering since two protein sequences may be homologous even they have low sequence identity. Our program, called HSEARCH, converts protein k-mers to data points in high dimensional space and applies locality-sensitive hash- ing (LSH) technique to search homologous protein sequences. For protein motif finding, HSEARCH is significantly faster than brute force algorithm and maintains the same accuracy. For protein motif clustering, HSEARCH achieves dramatically higher accuracy compared to CD-HIT, UCLSUT and kClust. 72 Reference Abouelhoda MI, Kurtz S, Ohlebusch E (2004) Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2:53–86. Akalin A, Kormaksson M, Li S, Garrett-Bakelman FE, Figueroa ME, Melnick A, Mason CE (2012) methylkit: a comprehensive r package for the analysis of genome-wide dna methylation profiles. Genome Biology 13. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local align- ment search tool. Journal of Molecular Biology 215:403–410. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res 25:3389–402. Andoni A, Indyk P (2004) E2lsh: Exact euclidean locality-sensitive hashing (http://web.mit.edu/andoni/www/lsh/index.html, 2004.). Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL et al. (2004) The pfam protein families database. Nucleic acids research 32:D138–D141. BeckstetteM,HomannR,GiegerichR,KurtzS(2006) Fastindexbasedalgorithms and software for matching position specific scoring matrices. BMC bioinformat- ics 7:389. Bock C (2012) Analysing and interpreting dna methylation data. Nat Rev Genet 13:705–19. Borg I, Groenen PJ (2005) Modern multidimensional scaling: Theory and appli- cations Springer Science & Business Media. Brunner AL, Johnson DS, Kim SW, Valouev A, Reddy TE, Neff NF, Anton E, Medina C, Nguyen L, Chiao E, Oyolu CB, Schroth GP, Absher DM, Baker JC, Myers RM (2009) Distinct dna methylation patterns characterize differen- tiated human embryonic stem cells and developing human fetal liver. Genome Res 19:1044–56. 73 Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using diamond. Nature methods 12:59–60. Buermans HP, den Dunnen JT (2014) Next generation sequencing technology: Advances and applications. Biochim Biophys Acta 1842:1932–1941. Burrows M, Wheeler DJ (1994) A block-sorting lossless data compression algo- rithm. Technical Report 124 . Campagna D, Telatin A, Forcato C, Vitulo N, Valle G (2013) Pass-bis: a bisul- fite aligner suitable for whole methylome analysis of illumina and solid reads. Bioinformatics 29:268–70. Chatterjee A, Stockwell PA, Rodger EJ, Morison IM (2012) Comparison of alignment software for genome-wide bisulphite sequence data. Nucleic Acids Research 40. Chen H, Smith AD, Chen T (2016) Walt: fast and accurate read mapping for bisulfite sequencing. Bioinformatics p. btw490. Chen PY, Cokus SJ, Pellegrini M (2010a) Bs seeker: precise mapping for bisulfite sequencing. Bmc Bioinformatics 11. Chen PY, Cokus SJ, Pellegrini M (2010b) Bs seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 11:203. Chen YH, Souaiaia T, Chen T (2009) Perm: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics 25:2514–2521. Coarfa C, Yu FL, Miller CA, Chen ZZ, Harris RA, Milosavljevic A (2010) Pash 3.0: A versatile software package for read mapping and integrative analysis of genomic and epigenomic variation using massively parallel dna sequencing. Bmc Bioinformatics 11. Conerly M, Grady WM (2010) Insights into the role of dna methylation in disease through the use of mouse models. Disease Models & Mechanisms 3:290–297. Cormen TH, Stein C, Rivest RL, Leiserson CE (2001) Introduction to Algorithms McGraw-Hill Higher Education, 2nd edition. Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions In Proceedings of the twentieth annual symposium on Computational geometry, pp. 253–262. ACM. De Leeuw J (1988) Convergence of the majorization method for multidimensional scaling. Journal of classification 5:163–180. 74 De Leeuw J (2005) Applications of convex analysis to multidimensional scaling. Department of Statistics, UCLA . De Leeuw J, Heiser WJ (1980) Multidimensional scaling with restrictions on the configuration. Multivariate analysis 5:501–522. De Leeuw J, Stoop I (1984) Upper bounds for kruskal’s stress. Psychome- trika 49:391–402. De Zhu J (2005) The altered dna methylation pattern and its implications in liver cancer. Cell Res 15:272–80. Edgar RC (2010) Search and clustering orders of magnitude faster than blast. Bioinformatics 26:2460–2461. Egger G, Liang G, Aparicio A, Jones PA (2004) Epigenetics in human disease and prospects for epigenetic therapy. Nature 429:457–63. Esteller M (2005) Dormant hypermethylated tumour suppressor genes: questions and answers. J Pathol 205:172–80. Ferragina P, Manzini G (2000) Opportunistic data structures with applica- tions. 41st Annual Symposium on Foundations of Computer Science, Proceed- ings pp. 390–398. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J et al. (2013) Pfam: the protein families database. Nucleic acids research p. gkt1223. Fonseca NA, Rung J, Brazma A, Marioni JC (2012) Tools for mapping high- throughput sequencing data. Bioinformatics 28:3169–77. Fortin JP, Labbe A, Lemire M, Zanke BW, Hudson TJ, Fertig EJ, Greenwood CM, Hansen KD (2014) Functional normalization of 450k methylation array data improves replication in large cancer studies. Genome Biol 15:503. Frommer M, Mcdonald LE, Millar DS, Collis CM, Watt F, Grigg GW, Molloy PL, Paul CL (1992) A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual dna strands. Proceedings of the National Academy of Sciences of the United States of America 89:1827–1831. Fu Y, Frith MC, Haverty PM, Weng Z (2004) Motifviz: an analysis and visualiza- tion tool for motif discovery. Nucleic acids research 32:W420–W423. Gao F, Xi Y (2014) Genomic and epigenomic architectures of monozygotic twins discordant for child acute lymphoblastic leukemia (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse59869). 75 Grant CE, Bailey TL, Noble WS (2011) Fimo: scanning for occurrences of a given motif. Bioinformatics 27:1017–1018. Groenen F, Patrick J, Velden M (2005) Multidimensional scaling Wiley Online Library. Guo W, Fiziev P, Yan W, Cokus S, Sun X, Zhang MQ, Chen PY, Pellegrini M (2013) Bs-seeker2: a versatile aligning pipeline for bisulfite sequencing data. BMC Genomics 14:774. Gusfield D (1997) Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology EBL-Schweitzer. Cambridge University Press. Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I, Eichler EE, Sahinalp SC (2010) mrsfast: a cache-oblivious algorithm for short-read mapping. Nature Methods 7:576–577. HachF,SarrafiI,HormozdiariF,AlkanC,EichlerEE,SahinalpSC(2014) mrsfast- ultra: a compact, snp-aware mapper for high performance sequencing applica- tions. Nucleic Acids Research 42:W494–W500. Har-Peled S, Indyk P, Motwani R (2012) Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of computing 8:321–350. Harris EY, Ounit R, Lonardi S (2016) Brat-nova: fast and accurate mapping of bisulfite-treated reads. Bioinformatics . HarrisEY,PontsN,LeRochKG,LonardiS(2012) Brat-bw: efficientandaccurate mapping of bisulfite-treated reads. Bioinformatics 28:1795–1796. Harris EY, Ponts N, Levchuk A, Le Roch K, Lonardi S (2010) Brat: bisulfite- treated reads analysis tool. Bioinformatics 26:572–573. Hauser M, Mayer CE, Söding J (2013) kclust: fast and sensitive clustering of large protein sequence databases. BMC bioinformatics 14:1. Hoffmann S, Otto C, Kurtz S, Sharma CM, Khaitovich P, Vogel J, Stadler PF, Hackermuller J (2009) Fast mapping of short sequences with mismatches, inser- tions and deletions using index structures. Plos Computational Biology 5. Homer N, Merriman B, Nelson SF (2009) Bfast: an alignment tool for large scale genome resequencing. PLoS One 4:e7767. Hong C, Clement NL, Clement S, Hammoud SS, Carrell DT, Cairns BR, Snell Q, Clement MJ, Johnson WE (2013) Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformat- ics 14:337. 76 House SH (2013) Epigenetics in adaptive evolution and development: the interplay betweenevolvingspeciesandepigeneticmechanisms: extractfromtrygvetollefs- bol (ed.) (2011) handbook of epigenetics–the new molecular and medical genet- ics. chapter 26. amsterdam, usa: Elsevier, pp. 423-446. Nutr Health 22:105–31. Illumina (2016) Whole-genome bisulfite sequencing on the hiseq 3000/hiseq 4000 systems (http://www.illumina.com/documents/products/appnotes/hiseq3000- hiseq4000-wgbs-application-note-770-2015-052.pdf). Jaenisch R, Bird A (2003) Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nature Genet- ics 33:245–254. Jiang H, Wong WH (2008) Seqmap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 24:2395–2396. Jiang P, Sun K, Lun FM, Guo AM, Wang H, Chan KC, Chiu RW, Lo YM, Sun H (2014) Methy-pipe: an integrated bioinformatics pipeline for whole genome bisulfite sequencing data analysis. PLoS One 9:e100360. Jones PA, Baylin SB (2002) The fundamental role of epigenetic events in cancer. Nat Rev Genet 3:415–28. Krueger F, Andrews SR (2011) Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 27:1571–1572. Krueger F, Kreck B, Franke A, Andrews SR (2012) Dna methylome analysis using short bisulfite sequencing data. Nature methods 9:145–151. Kruskal JB, Wish M (1978) Multidimensional scaling, Vol. 11 Sage. Kulis M, Esteller M (2010) Dna methylation and cancer. Adv Genet 70:27–56. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with bowtie 2. Nature Methods 9:357–U54. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory- efficient alignment of short dna sequences to the human genome. Genome Biol- ogy 10. Lee WP, Stromberg MP, Ward A, Stewart C, Garrison EP, Marth GT (2014) Mosaik: A hash-based algorithm for accurate next-generation sequencing short- read mapping. Plos One 9. Li H (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv e-prints . 77 Li H (2014) Fast construction of fm-index for long sequence reads. Bioinformat- ics 30:3274–5. Li H, Durbin R (2009) Fast and accurate short read alignment with burrows- wheeler transform. Bioinformatics 25:1754–1760. LiH,DurbinR(2010) Fastandaccuratelong-readalignmentwithburrows-wheeler transform. Bioinformatics 26:589–95. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S (2009) The sequence align- ment/map format and samtools. Bioinformatics 25:2078–9. Li H, Homer N (2010) A survey of sequence alignment algorithms for next- generation sequencing. Brief Bioinform 11:473–83. Li H, Ruan J, Durbin R (2008) Mapping short dna sequencing reads and calling variants using mapping quality scores. Genome Research 18:1851–1858. Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J (2009) Soap2: an improved ultrafast tool for short read alignment. Bioinformatics 25:1966–7. LiRQ,LiYR,KristiansenK,WangJ(2008) Soap: shortoligonucleotidealignment program. Bioinformatics 24:713–714. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–1659. Li YX, Li W (2009) Bsmap: whole genome bisulfite sequence mapping program. Bmc Bioinformatics 10. LimJQ,TennakoonC,LiG,WongE,RuanY,WeiCL,SungWK(2012) Batmeth: improved mapper for bisulfite sequencing reads on dna methylation. Genome Biol 13:R82. Lin H, Zhang ZF, Zhang MQ, Ma B, Li M (2008) Zoom! zillions of oligos mapped. Bioinformatics 24:2431–2437. Lister R, O’Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR (2008) Highly integrated single-base resolution maps of the epigenome in arabidopsis. Cell 133:523–536. Lorincz MC, Dickerson DR, Schmitt M, Groudine M (2004) Intragenic dna methy- lation alters chromatin structure and elongation efficiency in mammalian cells. Nat Struct Mol Biol 11:1068–75. 78 Ma B, Li M (2007) On the complexity of the spaced seeds. Journal of Computer and System Sciences 73:1024–1034. Manber U, Myers G (1993) Suffix arrays: a new method for on-line string searches. siam Journal on Computing 22:935–948. Manconi A, Orro A, Manca E, Armano G, Milanesi L (2014a) Gpu-bsm: A gpu- based tool to map bisulfite-treated reads. Plos One 9. Manconi A, Orro A, Manca E, Armano G, Milanesi L (2014b) Gpu-bsm: a gpu- based tool to map bisulfite-treated reads. PLoS One 9:e97277. Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. journal 17:pp–10. McCoy RC, Taylor RW, Blauwkamp TA, Kelley JL, Kertesz M, Pushkarev D, Petrov DA, Fiston-Lavier AS (2014) Illumina truseq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements. PLoS One 9:e106689. Mehta D, Sahni S (2004) Suffix trees and suffix arrays In Handbook of Data Structures and Applications, Chapman & Hall/CRC Computer and Information Science Series. CRC Press. Mele G (2016) Arabidopsis motif scanner. BMC bioinformatics 17:1. Miller JR, Koren S, Sutton G (2010) Assembly algorithms for next-generation sequencing data. Genomics 95:315–327. Moore LD, Le T, Fan G (2013) Dna methylation and its basic function. Neuropsy- chopharmacology 38:23–38. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–53. O’Connor C, Adams JU, Fairman J (2010) Essentials of cell biology. NPG Edu- cation, Cambridge . Otto C, Stadler PF, Hoffmann S (2012a) Fast and sensitive mapping of bisulfite- treated sequencing data. Bioinformatics 28:1698–1704. Otto C, Stadler PF, Hoffmann S (2012b) Fast and sensitive mapping of bisulfite- treated sequencing data. Bioinformatics 28:1698–704. Pedersen B, Hsieh TF, Ibarra C, Fischer RL (2011a) Methylcoder: software pipeline for bisulfite-treated sequences. Bioinformatics 27:2435–2436. 79 Pedersen B, Hsieh TF, Ibarra C, Fischer RL (2011b) Methylcoder: software pipeline for bisulfite-treated sequences. Bioinformatics 27:2435–6. Pedersen BS, Eyring K, De S, Yang IV, Schwartz DA (2014) Fast and accurate alignment of long bisulfite-seq reads. ArXiv e-prints . Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T et al. (2010) A human gut microbial gene catalogue established by metagenomic sequencing. nature 464:59–65. Rand WM (1971) Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association 66:846–850. Reuter JA, Spacek DV, Snyder MP (2015) High-throughput sequencing technolo- gies. Mol Cell 58:586–97. Robertson KD (2001) Dna methylation, methyltransferases, and cancer. Onco- gene 20:3139–55. Robertson KD (2002) Dna methylation and chromatin - unraveling the tangled web. Oncogene 21:5361–79. Ryan DP, Ehninger D (2014) Bison: bisulfite alignment on nodes of a cluster. BMC Bioinformatics 15:337. Salson M, Lecroq T, Leonard M, Mouchard L (2009) A four-stage algo- rithm for updating a burrows-wheeler transform. Theoretical Computer Sci- ence 410:4350–4359. Schones DE, Smith AD, Zhang MQ (2007) Statistical significance of cis-regulatory modules. BMC bioinformatics 8:1. Sciences N (2006) The New Genetics NIH publication. U.S. Department of Health and Human Services, Public Health Service, National Institutes of Health, National Institute of General Medical Sciences. Sharpton TJ (2014) An introduction to the analysis of shotgun metagenomic data. Front Plant Sci 5:209. Simpson JT, Durbin R (2010) Efficient construction of an assembly string graph using the fm-index. Bioinformatics 26:i367–i373. Smith AD, Xuan ZY, Zhang MQ (2008) Using quality scores and longer reads improves accuracy of solexa read mapping. Bmc Bioinformatics 9. SmithTF,WatermanMS(1981) Identificationofcommonmolecularsubsequences. Journal of Molecular Biology 147:195–197. 80 Song Q, Decato B, Hong EE, Zhou M, Fang F, Qu JH, Garvin T, Kessler M, Zhou J, Smith AD (2013) A reference methylome database and analysis pipeline to facilitate integrative and comparative epigenomics. Plos One 8. Stormo GD, Schneider TD, Gold L, Ehrenfeucht A (1982) Use of the perceptron algorithm to distinguish translational initiation sites in e. coli. Nucleic Acids Research 10:2997–3011. Sun D, Xi Y, Rodriguez B, Park HJ, Tong P, Meong M, Goodell MA, Li W (2014) Moabs: model based analysis of bisulfite sequencing data. Genome Biol 15:R38. Tarraga J, Perez M, Orduna JM, Duato J, Medina I, Dopazo J (2015) A paral- lel and sensitive software tool for methylation analysis on multicore platforms. Bioinformatics . Tran H, Porter J, Sun MA, Xie H, Zhang L (2014) Objective and compre- hensive evaluation of bisulfite short read mapping tools. Adv Bioinformat- ics 2014:472045. Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14:249–260. Vyverman M, Baets BD, Fack V, Dawyndt P (2015) A long fragment aligner called alfalfa. BMC Bioinformatics 16:159. Watson J (2003) The double helix 50 years later: implications for psychiatry. Am J Psychiatry 160:614. Watson JD (1970) Molecular biology of the gene Pearson, 6th edition. Wetterstrand K (2016) Dna sequencing costs: Data from the nhgri genome sequencing program (gsp) (https://www.genome.gov/sequencingcostsdata/). XiY,BockC,MullerF,SunD,MeissnerA,LiW(2012) Rrbsmap: afast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 28:430–2. Ye Y, Choi JH, Tang H (2011) Rapsearch: a fast protein similarity search tool for short reads. BMC Bioinformatics 12:159. Yong-VillalobosL,Gonzalez-MoralesSI,WrobelK,Gutierrez-AlanisD,Cervantes- Perez SA, Hayano-Kanashiro C, Oropeza-Aburto A, Cruz-Ramirez A, Martinez O, Herrera-Estrella L (2015) Methylome analysis reveals an important role for epigenetic changes in the regulation of the arabidopsis response to phosphate starvation. Proc Natl Acad Sci U S A 112:E7293–302. 81 Zhang Z, Yu J (2011) On the organizational dynamics of the genetic code. Genomics, proteomics & bioinformatics 9:21–29. Zhao Y, Tang H, Ye Y (2012) Rapsearch2: a fast and memory-efficient protein sim- ilarity search tool for next-generation sequencing data. Bioinformatics 28:125–6. Zolotarev VM (1986) One-dimensional stable distributions, Vol. 65 American Mathematical Soc. 82 Appendix A Periodic Spaced Seeds (10100)* and (1110100)* Table A.1: Five spaced seeds for (10100)* (1) 1010010100101001010010100101001010010100101001010010100... (2) 0101001010010100101001010010100101001010010100101001010... (3) 0010100101001010010100101001010010100101001010010100101... (4) 0001010010100101001010010100101001010010100101001010010... (5) 0000101001010010100101001010010100101001010010100101001... Table A.2: Seven spaced seeds for (1110100)* (1) 11101001110100111010011101001110100111010011101001110100... (2) 01110100111010011101001110100111010011101001110100111010... (3) 00111010011101001110100111010011101001110100111010011101... (4) 00011101001110100111010011101001110100111010011101001110... (5) 00001110100111010011101001110100111010011101001110100111... (6) 00000111010011101001110100111010011101001110100111010011... (7) 00000011101001110100111010011101001110100111010011101001... 83 Appendix B Key and Value Pairs for a Reference Genome Table B.1: Key and value pairs for the reference genome in Figure 2.15 AAA 8 AAATTATATT AAATATTATTTTGTTAAATTTAATTTAATAT AAT 11 AATTATATT TATTATTTTGTTAAATTTAATTTAATAT AAT 4 AATTTGATATAT AATGAAATATTATTTTGTTAAATTTAATTTAATAT AGA 3 AGATTTTATTAA AAATGAAATATTATTTTGTTAAATTTAATTTAATAT ATA 2 ATAAATTATATT TAAATGAAATATTATTTTGTTAAATTTAATTTAATAT ATA 28 ATAT TAATTTAATAT ATA 22 ATATAT TAAATTTAATTTAATAT ATA 23 ATATT AAATTTAATTTAATAT ATT 29 ATT AATTTAATAT ATT 24 ATTAA AATTTAATTTAATAT ATT 14 ATTATATT TATTTTGTTAAATTTAATTTAATAT ATT 7 ATTTGATATAT GAAATATTATTTTGTTAAATTTAATTTAATAT ATT 9 ATTTTATTAA AATATTATTTTGTTAAATTTAATTTAATAT GAT 19 GATATAT TGTTAAATTTAATTTAATAT GAT 6 GATTTTATTAA TGAAATATTATTTTGTTAAATTTAATTTAATAT TAA 5 TAAATTATATT ATGAAATATTATTTTGTTAAATTTAATTTAATAT TAA 1 TAATTTGATATAT TTAAATGAAATATTATTTTGTTAAATTTAATTTAATAT TAG 0 TAGATTTTATTAA ATTAAATGAAATATTATTTTGTTAAATTTAATTTAATAT TAT 25 TATAT ATTTAATTTAATAT TAT 20 TATATT GTTAAATTTAATTTAATAT TAT 26 TATT TTTAATTTAATAT TAT 21 TATTAA TTAAATTTAATTTAATAT TGA 16 TGATATAT TTTTGTTAAATTTAATTTAATAT TTA 27 TTAA TTAATTTAATAT TTA 17 TTATATT TTTGTTAAATTTAATTTAATAT TTA 18 TTATTAA TTGTTAAATTTAATTTAATAT TTG 13 TTGATATAT TTATTTTGTTAAATTTAATTTAATAT TTT 15 TTTATTAA ATTTTGTTAAATTTAATTTAATAT TTT 10 TTTGATATAT ATATTATTTTGTTAAATTTAATTTAATAT TTT 12 TTTTATTAA ATTATTTTGTTAAATTTAATTTAATAT 84 Appendix C Comparison of Runtime, Accuracy and Memory Usage Here shows comparisons of runtime, accuracy and memory usage for WALT, Bismark (Krueger & Andrews, 2011), BSMAP (Xi et al., 2012), BS-Seeker2 (Guo et al., 2013), BatMeth (Lim et al., 2012) and BWA-meth (Pedersen et al., 2014). PASS-bis (Campagna et al., 2013) and segemehl (Otto et al., 2012b) were not included because they did not complete the single-end read mapping within 100 hours for date sets SRR1532534 and SRR948855. MethylCoder (Pedersen et al., 2011b) was superseded by BWA-meth. BRAT-BW (Harris et al., 2012) did not include read names (id) in the output file, and BRAT-nova (Harris et al., 2016) did not complete building index for human genome hg19 in 100 hours, so they were not included in comparisons. As shown in Tables C.1 and C.2, for single-end read mapping, WALT is about 6× faster than BSMAP, 12× faster than BWA-meth, and more than 35× faster than Bismark, BS-Seeker2 and BatMeth 1 . For pair-end read mapping, WALT is about 6× faster than BSMAP, 10× faster than BWA-meth, 15× faster than Bismark and more than 35× faster than BS-Seeker2. As shown in Tables C.3 and C.4, WALT, Bismark, BSMAP, BS-Seeker2 and BatMeth have very high recall and precision, while BWA-meth has low recall and precision in paired-end read mapping. 1 BatMath does not support pair-end bisulfite mapping. 85 Tables C.5 and C.6 show the memory usage for each dataset. For human genome (the first two data sets), WALT uses about 15 Gb memory for single-end mapping, and 17 Gb for paired-end mapping. Table C.1: Comparison of runtime on single-end reads (hours) WALT Bismark BSMAP BS-Seeker2 BatMeth BWA-meth (1) 0.71 26.85 4.14 39.59 40.87 8.36 (2) 0.85 29.13 6.09 42.86 30.13 9.08 (3) 0.09 3.11 0.09 10.32 4.00 2.06 Table C.2: Comparison of runtime on paired-end reads (hours) WALT Bismark BSMAP BS-Seeker2 BatMeth BWA-meth (1) 2.43 38.72 12.32 102.51 22.78 (2) 2.61 37.10 21.55 91.03 23.40 (3) 0.32 7.46 0.22 21.02 6.09 Table C.3: Comparison of accuracy on single-end read mapping WALT Bismark BSMAP BS- Seeker2 BatMeth BWA- meth (1) R 0.982 0.973 0.994 0.961 0.990 0.991 P 0.991 0.976 0.988 0.934 0.993 0.852 (2) R 0.984 0.974 0.995 0.964 0.990 0.991 P 0.993 0.969 0.991 0.927 0.995 0.839 (3) R 0.987 0.974 0.995 0.965 0.994 0.993 P 0.995 0.978 0.993 0.923 1.000 0.733 86 Table C.4: Comparison of accuracy on paired-end read mapping WALT Bismark BSMAP BS- Seeker2 BatMeth BWA- meth (1) R 0.965 0.968 0.991 0.955 0.495 P 0.984 0.943 0.967 0.886 0.702 (2) R 0.971 0.972 0.993 0.960 0.494 P 0.989 0.935 0.973 0.878 0.674 (3) R 0.986 0.959 0.994 0.946 0.489 P 0.996 0.960 0.967 0.862 0.628 Table C.5: Comparison of memory usage on single-end reads (GB) WALT Bismark BSMAP BS-Seeker2 BatMeth BWA-meth (1) 14.94 9.55 7.83 11.67 16.55 12.30 (2) 15.05 9.58 7.84 11.71 16.55 12.33 (3) 1.02 0.53 1.02 2.52 7.77 2.07 Table C.6: Comparison of memory usage on paired-end reads (GB) WALT Bismark BSMAP BS-Seeker2 BatMeth BWA-meth (1) 16.89 9.64 7.88 9.47 12.52 (2) 16.95 9.64 7.89 9.72 12.48 (3) 2.89 0.54 1.06 3.25 2.30 87 Appendix D Runtime on Different Lengths of Reads Table D.1 compares the runtime of WALT, Bismark, BSMAP, BS-Seeker2, BatMeth and BWA-meth for data sets with different read lengths. Six data sets with read length L=50, 60, 70, 80, 90 and 100 respectively were generated by truncating the reads from SRR948855 (Fortin et al., 2014). Each data set takes the firstL nucleotides from the first 10 million read pairs. WALT runs faster when the read length increases. It is because spaced seeds for longer reads have heavier weights, which help to reduce the number of random hits. Table D.1: Runtime on different lengths of reads (hours) Read Length 50 60 70 80 90 100 Single-end WALT 0.71 0.27 0.21 0.19 0.18 0.18 Bismark 1.90 2.57 2.84 3.32 4.30 5.61 BSMAP 0.33 0.46 0.65 0.43 0.62 1.03 BS-Seeker2 7.14 7.36 7.75 8.27 8.59 9.09 BatMeth 46.52 21.47 15.06 11.04 8.46 6.39 BWA-meth 1.55 1.50 1.52 1.61 1.71 1.88 Paired-end WALT 2.66 1.20 0.85 0.63 0.59 0.61 Bismark 4.60 5.50 6.63 6.69 7.82 8.09 BSMAP 0.62 1.11 1.79 1.40 2.09 3.08 BS-Seeker2 20.98 23.88 22.83 20.88 24.04 22.43 BatMeth BWA-meth 5.38 4.47 4.47 4.62 4.93 5.48 88 Appendix E Uniquely Mapped Reads on Different Number of Mismatches Table E.1: The number and percentage of uniquely mapped reads as a function of the number of mismatches in WALT # of SRR1532534 SRR948855 SRR2296821 mismatches Single-end Paired-end Single-end Paired-end Single-end Paired-end 0 34,689,403 27,200,475 34,743,554 27,559,957 13,728,491 12,697,635 (69.38%) (54.40%) (69.49%) (55.12%) (63.72%) (58.94%) 1 5,489,829 8,090,762 5,211,452 7,388,497 1,279,528 1,809,749 (10.98%) (16.18%) (10.42%) (14.78%) (5.94%) (8.40%) 2 1,131,020 2,394,739 898,452 2,342,240 287,268 473,105 (2.26%) (4.79%) (1.80%) (4.68%) (1.33%) (2.20%) 3 448,474 983,247 361,270 810,523 178,251 190,001 (0.90%) (1.97%) (0.72%) (1.62%) (0.83%) (0.88%) 4 250,467 511,419 228,968 404,315 107,945 161,688 (0.50%) (1.02%) (0.46%) (0.81%) (0.50%) (0.75%) 5 155,276 289,866 156,000 238,442 56,045 128,729 (0.31%) (0.58%) (0.31%) (0.48%) (0.26%) (0.60%) 6 112,357 55,709 114,289 46,935 29,368 47,482 (0.22%) (0.11%) (0.23%) (0.09%) (0.14%) (0.22%) SUM 42,276,826 39,526,217 41,713,985 38,790,909 15,666,896 15,508,389 (84.55%) (79.05%) (83.43%) (77.58%) (72.72%) (71.99%) 89 Appendix F Test Environment and Command Lines All tests were run on the USC HPC cluster with Intel Xeon E5-2600 processors at clock speed 2.4 GHz. All experiments were run in single thread on single proces- sor. WALT 1.0, Bismark 0.13.1, BSMAP 2.88, BS-Seeker2 2.0.10, BatMeth 1.04 and BWA-meth 0.10 were used in these experiments. Here shows the command lines for them. WALT single-end mapping walt -i genome.dbindex -r reads_file_1.fastq -o output_single.sam WALT paired-end mapping walt -i genome.dbindex -1 reads_file_1.fastq -2 reads_file_2.fastq -o output_paired.sam Bismark single-end mapping bismark -N 1 -L 32 –path_to_bowtie bowtie2-2.2.4 –bowtie2 gnome_files reads_file_1.fastq Bismark paired-end mapping bismark -N 1 -L 32 –path_to_bowtie bowtie2-2.2.4 –bowtie2 gnome_files reads_file_1.fastq -2 reads_file_2 90 BSMAP single-end mapping bsmap -d gneome_files -a reads_file_1.fastq -p 1 -v 6 -r 0 -o output_single.sam BSMAP paired-end mapping bsmap -d gneome_files -a reads_file_1.fastq -b reads_file_2.fastq -p 1 -v 6 -r 0 -o output_paired.sam BS-Seeker2 single-end mapping bs_seeker2-align.py –aligner=bowtie2 -g genome_files -m 6 –bt2-p 1 -d ./bs_utils/reference_genomes/ -f sam -i read_file_1 -o output_single.sam BS-Seeker2 paired-end mapping bs_seeker2-align.py –aligner=bowtie2 -g genome_files -m 6 –bt2-p 1 -d ./bs_utils/reference_genomes/ -f sam -1 read_file_1 -2 read_file_2 -o output_paired.sam BatMeth single-end mapping batmeth -g genome_files -i read_file_1 -n 6 -p 1 -o output_single.out BWA-meth single-end mapping bwameth.py –reference genome_files read_file_1 -t 1 –prefix PREFIX_SINGLE BWA-meth paired-end mapping bwameth.py –reference genome_files read_file_1 read_file_2 -t 1 –prefix PREFIX_PAIR 91 Appendix G WALT Output Formats WALTsupportsTab-delimitedSAM(Lietal.,2009)andMR(Songetal.,2013) output formats. In WALT, SAM and MR formats have 12 and 8 mandatory fields, respectively. One mapping result has one line and different fields are separated by a Tab. Here shows the mandatory fields in order for SAM and MR formats. SAM Format • QNAME (read name) • FLAG (bitwise FLAG) • RNAME (chromosome name) • POS (start position, 1-based, 0 means unmapped) • MAPQ (255 in WALT) • CIGAR (CIGAR string) • RNEXT (chromosome name of the mate read) • PNEXT (start position of the mate read) • TLEN (observed segment length) • SEQ (read sequence) • QUAL (quality sequence) • NM-tag (number of mismatches) 92 MR Format • RNAME (chromosome name) • SPOS (start position, 0-based) • EPOS (end position, 0-based) • QNAME (read name) • MISMATCH (number of mismatches) • STRAND (forward or reverse strand) • SEQ (read sequence) • QUAL (qualify sequence) 93 Appendix H Notes of Mathematical Symbols for Bislfite Read Mapping L: the length of the reference genome l: the length of a query sequence or a read k: the lenght of a k-mer m: the number of mismatches p: the number of segments for a read l: the average length of each segment O: approximation of time complexity log: log function based on 2 Q: a set of read sequences q i : one of the read sequences n: the number of reads G: reference genome sequence 94 Appendix I Twenty Amino Acids and 1 Letter Code Table I.1: Twenty Amino Acids and 1 Letter Code Amino acid One letter code Alanine A Arginine R Asparagine N Aspartic acid D Cysteine C Glutamic acid E Glutamine Q Glycine G Histidine H Isoleucine I Leucine L Lysine K Methionine M Phenylalanine F Proline P Serine S Threonine T Tryptophan W Tyrosine Y Valine V 95
Abstract (if available)
Abstract
With the development of next-generation sequencing technology, a massive amount of DNA and protein sequences have been produced. Fast analyzing these huge sequencing data to obtain important information becomes an important step for many biological applications. One of the most significant and computational intensive methods is to align these DNA and protein sequences to a reference genome or a annotated protein database in order to know locations of DNA sequences or predict functions of protein sequences.Read mapping is the problem to align DNA sequences to a reference genome for locating positions where they original come from. Protein alignment is the problem to align amino acid sequences to a large annotated protein database for predicting functions of peptide sequences. During my PhD studies, I designed a new read mapping algorithm (called WALT) for bisulfite sequencing and a protein alignment algorithm (called HSEARCH) for protein motif finding. Bisulfite sequencing is the golden standard for DNA methylation study. The most computational intensive step in DNA methylation study is to map billions of short DNA reads to the reference genome. WALT uses a strategy of hashing periodic spaced seeds, which leads to significant speedup compared with the most efficient methods currently available. Protein motifs are conversed sequences which has a specific function, for example, the active site of an enzyme. HSEARCH converted k-mers to data points in high dimensional space, and applies locality-sensitive hashing technique to fast search motifs database which is significantly faster than brute force algorithm with the same accuracy. Additionally, HSEARCH supports protein motif sequence clustering, and it is significantly more sensitive than existing protein sequence clustering programs.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Efficient algorithms to map whole genome bisulfite sequencing reads
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Whole genome bisulfite sequencing: analytical methods and biological insights
PDF
Understanding protein–DNA recognition in the context of DNA methylation
PDF
Profiling transcription factor-DNA binding specificity
PDF
Breaking the plateau in de novo genome scaffolding
PDF
Computational analysis of the spatial and temporal organization of the cellular environment
PDF
Machine learning of DNA shape and spatial geometry
PDF
Deciphering protein-nucleic acid interactions with artificial intelligence
PDF
Techniques for de novo sequence assembly: algorithms and experimental results
PDF
DNA shape at transcription factor binding sites: from purifying selection to a new alphabet
PDF
Mapping 3D genome structures: a data driven modeling method for integrated structural analysis
PDF
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
PDF
Clustering 16S rRNA sequences: an accurate and efficient approach
PDF
Efficient short-read sequencing on long-read sequencers
PDF
Data-driven approaches to studying protein-DNA interactions from a structural point of view
PDF
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
PDF
Feature engineering and supervised learning on metagenomic sequence data
Asset Metadata
Creator
Chen, Haifeng
(author)
Core Title
Fast search and clustering for large-scale next generation sequencing data
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
09/23/2016
Defense Date
08/19/2016
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
bisulfite read mapping,OAI-PMH Harvest,protein motif finding,protein sequence search
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Chen, Ting (
committee chair
), Nakano, Aiichiro (
committee member
), Rohs, Remo (
committee member
)
Creator Email
chenhaifeng88888@gmail.com,haifengc@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-305494
Unique identifier
UC11280505
Identifier
etd-ChenHaifen-4803.pdf (filename),usctheses-c40-305494 (legacy record id)
Legacy Identifier
etd-ChenHaifen-4803.pdf
Dmrecord
305494
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Chen, Haifeng
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
bisulfite read mapping
protein motif finding
protein sequence search