GEOMETRIC INTERPRETATION OF BIOLOGICAL DATA: ALGORITHMIC SOLUTIONS FOR NEXT GENERATION SEQUENCING ANALYSIS AT MASSIVE SCALE

by

Ehsan Behnam

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)

May 2015

Copyright 2015 Ehsan Behnam

Dedication

Dedicated to the ones who taught me what dedication means. To my first and best teachers. To my mother and father.

Acknowledgments

Foremost, I would like to express my deepest gratitude to my supervisor, Prof. Smith, for his patience and support during these years. I had the chance to gradually learn about many fascinating topics under his guidance, which made my Ph.D. a fruitful journey. While he provided me with the freedom and flexibility to develop my ideas, we had many interactive discussions that enormously helped me recognize the most promising ones for overcoming research challenges.

Next, I would like to thank Prof. Waterman for his endless support. Long before I joined the computational biology program at USC, I was fascinated by his excellent work in modeling the alignment of biological sequences. The combination of algorithmic approaches with rigorous statistical models was truly inspiring. This was a key factor in shaping my enthusiasm for research in computational biology. I was one of the few lucky students who benefited from his enlightening advice. This gave me the opportunity to learn far beyond mathematical modeling from a great scientist.

I also thank my proposal and defense committee, Prof. Sun, Prof. Baxendale and Prof. Ehrenreich, for their valuable time reading my proposal and thesis and for sharing their constructive comments and feedback with me. In particular, Prof. Sun kindly offered his advice several times while I was dealing with a research problem, which helped me find the appropriate solution.

During these years, I have been privileged to join a lab with a friendly and exciting ambience. My colleagues in Smith's lab always offered great help. We shared our thoughts and ideas in a warm and social environment. Dr. Uren helped me extensively during my final defense and the writing of my dissertation, going well beyond what one could ask of a colleague. Emad Bahrami is a close friend of mine who has always been ready to help me with any problem. I also had many useful discussions with Dr. Dolzhenko. We exchanged ideas and I profoundly enjoyed our scientific conversations. Jenny Qu is a great friend who kindly assisted me with proofreading my papers. Wenzheng Li worked closely with me during the last year of my Ph.D. and I benefited from his talents and skills. I would also like to extend my appreciation to the other lab members, Dr. Song, Dr. Daley, Meng Zhou and many others, for their help. Very special appreciation goes to my office mate Ben Decato, who is an accurate and skillful writer and has always offered me his friendly help. I am very pleased to have been able to work with all of them and to gradually build a long-lasting friendship with them.

The MCB program at USC has brought the best scientists from different backgrounds together and has created one of the best collaborative environments between mathematicians, computer scientists and biologists. I would like to express my gratitude to people like Prof. Waterman, Prof. Tavare, Prof. Sun, Prof. Chen, Prof. Rohs and many more, without whose tireless efforts this would not have been possible.
I also very much appreciate the help of the MCB staff, including Hayley Peltz, Dawn Burke, Linda Bazillian and others, during these years.

I sincerely thank my fellow classmates and friends Hossein Asgharian, Misagh Bagherian, Zohreh Baharvand, Natalia Rodchenko, Lin Yang and many more for creating such a unique experience of life for me in the past few years. I was also privileged to have my long-time friend Reza Ardekani here at USC. With his insatiable appetite for science, he has always been open to discussing ideas and delving into the details of a problem. There are no words to show my appreciation for him.

Last but certainly not least, my deepest gratitude goes to my parents and my lovely sisters, without whose kindness and support I would not have been able to keep moving forward all these years. Despite the fact that we live far apart, they are a part of my everyday life and I know for a fact that their warm and endless encouragement has always been with me and in many ways enabled me to complete my course of study.

Table of Contents

Dedication
Acknowledgments
Table of Contents
List of Figures
List of Tables
List of Algorithms
1 Introduction
2 Background
  2.1 Problem formulation
    2.1.1 Local alignment-free sequence comparison
    2.1.2 A scalable database for next generation sequencing data
  2.2 Geometric representation of biological sequences
  2.3 Similarity search in high dimensional space
  2.4 Nearest neighbor graphs
  2.5 Locality sensitive hashing
3 A geometric interpretation for local alignment-free sequence comparison
  3.1 Local similarity problem revisited
  3.2 A geometric framework for maximizing similarity
  3.3 Solving bichromatic closest pair problem in high dimension
  3.4 Performance evaluation
    3.4.1 Simulation setup
    3.4.2 Study of the naive algorithm
    3.4.3 Performance of the approximation algorithm
    3.4.4 Empirical evaluation of bucket occupancies
  3.5 Final remarks
4 The Amordad database engine for metagenomics
  4.1 Engine structure
    4.1.1 Geometric interpretation of metagenomes
    4.1.2 Indexing metagenomic points with locality-sensitive hashing for angles
    4.1.3 Supplemental indexing with a developing nearest neighbor graph
    4.1.4 A heuristic to model the effect of the graph
  4.2 Performance evaluations
    4.2.1 Dataset and implementation details
    4.2.2 Rate of convergence for the nearest neighbor graph
    4.2.3 Assignment of significance scores to query responses
    4.2.4 Evaluation of the graph effect on the query space requirement
    4.2.5 Measuring query time in a large-scale database
    4.2.6 Relationships captured in the nearest neighbor graph
  4.3 Final remarks
5 Perspectives and future applications
  5.1 Fusing metagenomic data
  5.2 Localization of environmental metagenomic samples
    5.2.1 Metagenome-to-metagenome matching (MG-MG)
  5.3 Extending Amordad to other biological data types
6 Summary and conclusion
A Public datasets in Amordad
B Significant biological associations discovered by Amordad

List of Figures

2.1 Approximate nearest neighbor problem
2.2 Constructing a nearest neighbor graph from count vectors
3.1 Accuracy study of different statistics for weakly planted motifs
3.2 Accuracy study of different statistics for moderately planted motifs
3.3 Accuracy study of different statistics for strongly planted motifs
3.4 Comparative accuracy study between the naive and approximate solutions
3.5 Wall-time comparison between naive and approximation algorithms
3.6 Study of the maximum and the average bucket occupancies
4.1 The incorporation of the graph for augmenting the nearest neighbor search
4.2 Schematic representation of the indexing scheme of Amordad
4.3 Fidelity of the nearest neighbor graph as a function of LSH iterations
4.4 The probability density function of angles resulting from real metagenomes
4.5 The average recall value for a query set; the left and right plots correspond to the conservative and liberal maximum proximity radius respectively
4.6 (A) The average wall-time of the query as a function of the log of the database size. (B) The required number of distance computations per query as a function of the log of the database size

List of Tables

4.1 Top 10 most similar pairs of metagenomes from different projects, with the scope of the corresponding projects, the distance between metagenomic points (i.e. the angle in degrees) and associated significance scores. (*) From a cohort of normal and atherosclerosis samples. (**) Metagenomes are sampled from European women with normal, impaired and diabetic glucose control
A.1 List of the projects used in Amordad
B.1 List of significant biological discoveries found by Amordad

List of Algorithms

1 Approximation for MDP via SLF with an oracle for BCP
2 Similarity search in Amordad

Chapter 1
Introduction

Biology has become a big data science. High throughput technologies have provided amounts of data that were unthinkable a decade ago, and this data challenges every existing infrastructure. Comparatively little attention has been given to the analysis of this data. As a specific example, it is well known that the pace of improvement in the speed and cost of DNA sequencing data generation has exceeded Moore's law. Practically, this means that data will soon become so large that any algorithm with super-linear processing complexity must be avoided if possible. This creates ample incentive for alternative approaches to addressing central problems in computational biology.
For example, real-time organization of sequence data into biologically related clusters, and novel lightweight big data analytics that recognize alterations in DNA sequence patterns, might be essential when the number of nucleotides reaches the peta scale.

Unfortunately, arriving at algorithmic solutions with sub-linear or even linear complexity is extremely difficult. For instance, any sequence comparison framework must be equipped with suitable string matching algorithms. Due to sequencing errors and existing natural variations, the results of exact string matching algorithms can be misleading. This may explain the voluminous literature on alignment algorithms and on heuristic methods to break their quadratic behavior. Alignment-free algorithms serve as another alternative and define a new paradigm by trying to represent biological sequence objects with a set of descriptive features. This is an appealing strategy for dealing with massive amounts of data. However, an accurate grasp of biological properties from sequences typically entails an ultra high dimensional representation. Therefore, the original computational barrier has been transformed into what is often called "the curse of dimensionality".

This dissertation establishes an algorithmic foundation for handling biological data sets that are massive both in terms of the number of data elements and the dimensionality of those data elements. The work described in this dissertation focuses on alignment-free sequence comparison, but the techniques developed in this context are much more general. By casting alignment-free sequence comparison as a geometric problem, we uncovered a set of techniques that can be leveraged in a host of other contexts.

The starting point for this work was the problem of computing local alignment-free pairwise sequence similarity. Our approach was to transform fixed-length fragments of each sequence into feature vectors, which placed this "sequence" problem in a geometric space. We designed an efficient randomized data structure to detect the most similar feature vectors, corresponding to the sequence regions with the strongest biological signals. Although the local alignment-free sequence comparison problem is novel, the perspective we gained in studying this problem turned out to have exceptional value.

The geometric approaches we adopted in local alignment-free sequence comparison immediately suggested other applications in areas where advances require algorithmic innovations. We next developed a database engine for organizing next generation sequencing data at massive scale, but in terms of the content of that data rather than its meta-data. The fundamental contribution was to structure an alignment-free representation of sequence data such that natural similarities between biologically related samples could be rapidly identified. To date this has not been done for next-generation sequencing data that has not been mapped or assembled. We designed a robust indexing system that reorganizes itself over the course of time and adapts to the accumulation of new data – an essential property for coping with the fast pace of data generation.

We expect our approach to organizing high-throughput sequencing data to find a broad range of applications in different domains. We discuss some future directions and perspectives in which our comparative tool could be extremely beneficial.
With the scalability achieved by the indexing scheme of our database engine, numerous biological sequences can be searched in seconds, and meaningful inferences can be made based on the similarity between a sample of interest and those in the database.

Chapter 2
Background

DNA sequencing is the core technology in a variety of biological fields, including genomics and molecular biology. The application of next generation sequencing (NGS) has led to discoveries ranging from the characterization of single genes (Koboldt et al., 2013) to systematic studies of complicated environmental ecosystems (Shokralla et al., 2012). Owing to technologies with massive parallelism, the cost of sequencing has continuously dropped in recent years (von Bubnoff, 2008). This results in a class of emerging computational problems stemming from the fact that both the number and the length of the accessible genomes and genomic fragments are significantly increasing. In this dissertation, we establish a geometric interpretation of genomic data by transforming instances of such problems into similarity search in a high dimensional metric space. We then utilize computational geometry techniques and spatial data structures to solve them efficiently.

We begin by exploring the alignment-free sequence comparison problem, which asks to quantify the relatedness between two sequences while, as its name suggests, avoiding the traditional alignment-based algorithms (Vinga & Almeida, 2003). One problem is to identify regions with conserved patterns that do not extend to the entire sequence. This local alignment-free sequence comparison can be defined as a natural extension of the problem that searches for regional similarity signals. We represent local alignment-free comparison as an optimization problem in a very high-dimensional search space. In other words, we formulate this problem so as to find the maximum of a function $f(p,q)$, where $p$ and $q$ are a pair of feature vectors in some $d$-dimensional search space describing local regions in the sequences, and the function $f : \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$ represents the pairwise similarity measure. Details of this problem formulation, along with our framework to solve it, are given later.

The success of the geometric interpretation in finding local patterns of similarity in biological sequences motivated the use of the same approach for sequence similarity search in next generation sequencing data at massive scale. With rapidly evolving omics technologies, data production is no longer the main bottleneck of many life science projects (Schatz et al., 2010; Marx, 2013). Instead, the processing pipelines reportedly fail to cope with the biological data deluge (Schadt et al., 2010; Fuller et al., 2013; Hu & Bajorath, 2014). In close accordance with our approach to finding local similarity, we use a geometric "fingerprint" of NGS samples to build a high-dimensional content-based database handling a large number of such samples. Our approach creates a concise sequence-based signature for each sample and represents it as a point in high-dimensional space. We then organize such points using a combined data structuring strategy that results in a scalable database with flexible data manipulation and efficient querying processes. This approach overcomes the complexity of similarity search in NGS data and provides a practical tool for comparative analysis.
We will demonstrate that our approach can lead to very efficient (and yet reasonably accurate) solutions by recasting string matching problems in a geometric context. In what follows, we provide brief descriptions of the aforementioned problems, with emphasis on the connection between the sequence data and its geometric transformation. Detailed solutions and evaluations are explained in the following chapters.

2.1 Problem formulation

We present two separate problems, both dealing with sequence similarity search at massive scale. The first problem considers a pair of sequences, but as we will see, the length of each sequence might be large enough that no exhaustive solution is applicable. The nature of the second problem is different. The input is composed of a large number of samples, where each sample is a collection of numerous fragmentary sequences. In both problems, the central issue is to handle the complexity of big data and arrive at algorithmic solutions that identify biological relatedness using efficient data structures.

2.1.1 Local alignment-free sequence comparison

Sequence alignment is perhaps the most well-known and intensively studied computational problem in modern molecular biology. The need to quickly and reliably identify relations between sequences has been a driving force in the field of computational biology, on both the statistical (Dayhoff et al., 1978; Karlin & Altschul, 1990; Waterman & Vingron, 1994) and algorithmic fronts (Smith & Waterman, 1981; Pearson & Lipman, 1988; Altschul et al., 1990; Zhang et al., 1998; Johnson et al., 2008). Sequence alignment arises in a wide variety of data analysis contexts, ranging from theoretical studies of evolution to the practical use of sequencing instrumentation.

Alignment-based sequence similarity makes a correspondence between letters (bases or residues) in sequences, but also requires this correspondence to preserve the order of the letters. Alignment-free sequence comparison ignores positional information in strings, and compares sequences without regard for the order (or orientation) of elements within the sequences. The use of such methods is motivated partially by the technical goal of more rapidly screening candidates in very large-scale database searching (Haubold et al., 2011; Mahmood et al., 2012), and also by biological evidence suggesting that more functional information can be leveraged when the order and orientation of sequence elements are not constrained (Sims & Kim, 2011).

To date, every optimal sequence alignment algorithm requires a core step, based on dynamic programming, that takes quadratic time in the length of the sequences. Although heuristics have been extensively applied to improve the computational efficiency (Altschul et al., 1990, 1997), the inherent quadratic behavior remains unchanged when the sequence similarity is weak and there is no prior knowledge available about the sequences. On the other hand, for global sequence comparison, alignment-free methods can run in time that is a linear function of the sequence lengths. Even in cases when alignment-free similarity is not as biologically meaningful as alignment-based similarity, the additional speed makes alignment-free approaches attractive for large-scale database filtering, to reduce the number of pairwise sequence alignments that must be computed (Altschul et al., 1997; Lippert et al., 2002).

There are also certain biological phenomena that are overlooked by alignment and hence motivate the use of alignment-free methods as a substitute.
Among these phenomena is horizontal gene transfer (HGT), the transfer of genetic material between two organisms to acquire new traits. HGT is believed to be a vital step in bacterial adaptation and the colonization of virulent niches (Alm et al., 2006). A previous study suggests that the HGT frequency in oceanic bacteria is up to a hundred million times greater than estimated earlier, implying that the diversity gained by HGT might be dramatically underestimated (McDaniel et al., 2010). On the other hand, horizontal gene transfer, as a localized recombination phenomenon, scatters DNA fragments and makes homologous sequences difficult to align (Domazet-Lošo & Haubold, 2011). Recent results have shown that alignment-free methods are able to accurately reconstruct evolutionary relationships between metagenomic populations (Song et al., 2012).

Just as local alignment is used to identify locally similar parts of sequences, the concept of alignment-free comparison may be applied in a local sense. The utility of such methods emerges when conserved functional elements are harboured in certain regions of otherwise divergent sequences, making it impossible for global methods to discover them. One clear example is the identification of transcription factor binding sites in gene regulatory regions with a low level of sequence conservation (Berman et al., 2002). Gene expression is often controlled by regulatory modules, which are typically short stretches of DNA sequence at highly variable distances from the target gene. Each of these modules, called a cis-regulatory module (CRM), contains one binding site, or a combined set of binding sites, recognized by transcription factors (Kazemian et al., 2011). Identification of CRMs has been recognized as a challenging computational problem, especially for organisms in which high binding site turnover takes place (Sinha & Siggia, 2005; Meader et al., 2010; Venkataram & Fay, 2010). This decreases the performance of alignment-based methods significantly, because the order of functional elements is not necessarily preserved between similarly functioning sequences (Taher et al., 2011).

As described in Chapter 3, we place this problem in the geometric context and present an algorithm for local alignment-free sequence comparison (Liu et al., 2011) using a class of similarity measures that we later describe as dot product measures. Our framework provides a general means of transforming dot product similarity into metric distance in the context of many similarity optimization problems, and we expect this framework will find applications outside of alignment-free sequence comparison.

2.1.2 A scalable database for next generation sequencing data

Next, we present our analytic tool for comparative analysis of large scale NGS sequencing data. We assume that the constraints on the volume of data and the complexity of the fragmentary elements preclude any computationally intensive processing, including the assembly of short reads into longer contigs and scaffolds (Li & Durbin, 2009; Li et al., 2010). For this study, we focus on the application of NGS sequencing technology to the study of microbial communities taken directly from their natural habitats. Such studies belong to a new scientific discipline called metagenomics (Handelsman et al., 1998). Metagenomics has revolutionized our knowledge about bacterial communities.
Such studies have discovered inter-community rules governing the behavior of microbial networks, and different aspects of their symbiotic or antagonistic life with their hosts, in several ecosystems (Daniel, 2005; Qin et al., 2010; Belda-Ferre et al., 2011; Nesme et al., 2014). Whole genome shotgun sequencing has been applied to study microbial communities. This emerging technology, called whole metagenome shotgun (WMS) sequencing (Schloissnig et al., 2013), has been applied in many different projects, from the direct study of simple biofilms in specific niches (Tyson et al., 2004) to complicated bacterial consortia like the human gut microbiota (Zhang et al., 2009; Qin et al., 2012). This is different from alternative approaches such as amplicon sequencing of conserved genes (e.g. 16S ribosomal RNA), since it gives more information about the genes present within an ecosystem of micro-organisms (Petrosino et al., 2009; Kuczynski et al., 2011). Metagenomic studies are also greatly motivated by the crucial impact of microbes on human health (Cho & Blaser, 2012). The microbial organisms hosted by the human body constitute the human microbiome, with at least $10^{14}$ cells in a healthy individual (Clemente et al., 2012). The first phase of the human microbiome project (HMP) (Turnbaugh et al., 2007) provides an unprecedented opportunity to explore our "other genome" (Le Chatelier et al., 2013). However, making sense of metagenomic data has proven tremendously challenging. At the root of this challenge is the immense size and number of metagenomes. Deeper sampling of a particular microbiome seems always to reveal additional information, and the number of distinct metagenomes is virtually limitless (Handelsman et al., 2007).

Not surprisingly, even storing raw metagenomic data requires significant resources and poses a variety of challenges. The result of a WMS project is typically a cohort of metagenomic samples bundled with a set of descriptive information called metadata. Considering that the size of a single sample representing a sophisticated ecosystem like the human gut microbiota can be several gigabases, it is apparent that a systematic study targeting a population of such ecosystems would create terabytes of data (Gilbert et al., 2010; Human Microbiome Project Consortium & others, 2012).

There are several data resources that are completely or partially devoted to metagenomics. The DoE's Joint Genome Institute database is a general genomics database, often called the "genomes online database" and abbreviated as GOLD (Pagani et al., 2012). To date, it covers 544 metagenomics projects, split into three major classes: host-associated (or organismal), environmental and engineered metagenomes. Engineered metagenomes refer to samples that do not exist naturally and are taken from environments like bioreactors. Significant effort was devoted to associating a comprehensive set of descriptive metadata with every metagenome. Unfortunately, this information remains incomplete. For environmental samples, more refined data is also available. For example, some metagenomes are associated with "fresh" water, or with soil within a certain pH range. The Metagenomics RAST server (MG-RAST) is another major data resource. This metagenomic analysis server includes several tools for analyzing and annotating data sets (Meyer et al., 2008; Glass et al., 2010). To our knowledge, the repository associated with MG-RAST is the largest repository of metagenomes.
It includes 140,916 metagenomes, of which 20,389 are publicly available. MG-RAST does not require the deposition of complete metadata. Consequently, some samples lack such information or contain inconsistencies, particularly in the higher resolution information. For example, the collection location (i.e., latitude and longitude coordinates) is absent in some samples, and the country does not match the provided geographical information for many samples. Another database, supported by the European Bioinformatics Institute, is EBI-Metagenomics (Hunter et al., 2014). A unique feature of this database is the powerful comparative framework provided within its tool set. Users can access a variety of characterizing information about each metagenome, such as taxonomic information or functional assessments. More importantly, different samples within a specific project can be compared with respect to this characterizing information. Unfortunately, this comparative tool is not available for samples from different projects, which limits its use to analyzing metagenomes within a single cohort. It is also worth mentioning that the number of publicly accessible metagenomic projects stored in this database is relatively small – only 77 public projects as of 10/20/2014.

These databases and repositories offer a variety of computational tools to empower the analysis of metagenomic data. These tools rely heavily on alignment-based algorithms at different stages of the analysis pipeline. Assembly of raw reads into longer contigs (Flicek & Birney, 2009), binning of sequencing reads (Chan et al., 2008) and homology searching to characterize the predicted operational taxonomic units (OTUs) (Hao et al., 2011) are among the tasks that require sequence alignment (Wooley et al., 2010). The computational requirements of these tasks, however, have reduced their usefulness in large-scale projects. Alignment-free approaches have been successfully applied to overcome the scalability problems associated with alignment and assembly (Grabherr et al., 2011; Chan et al., 2013). In metagenomics, with the growth of accessible data, such approaches have demonstrated their efficiency in a variety of problems. Phylogenetic characterization of metagenomes (Nalbantoglu et al., 2011), rapid abundance profiling of bacterial species (Meinicke et al., 2011) and comparative analysis of metagenomic samples within a cohort (Jiang et al., 2012) can all be addressed by alignment-free algorithms.

In Chapter 4, we introduce Amordad, a content-based database engine for metagenomics designed to support rapid indexing and retrieval even for massive data volumes. In its most basic form, the user makes a query by providing a metagenomic sample sequence and asks for the most similar metagenomes in the database. Similarity between two metagenomes is computed, and the query responses are ranked according to their similarity to the query. Amordad does not attempt sequence alignment or assembly, and focuses on the raw metagenome itself as the fundamental data element. Indexing is based on feature vectors and exploits a combination of random hashing and a regular nearest neighbor graph. Together these dual indices result in a low-complexity querying process, both in terms of time and memory usage.
Furthermore, as we will elucidate later, this strategy for data structuring allows the database to reorganize itself continually as new data is added, an essential property that is difficult to achieve in high-dimensional geometric databases. We demonstrate the efficiency of Amordad in handling millions of metagenomes by conducting a series of experiments using both simulated and real data.

Before further explanation, we first discuss our geometric approach to analyzing sequence data. We outline a correspondence between the string data, where data elements are biological sequences, and the Euclidean search space into which such data is uniquely mapped as high dimensional points. In this setting, alignment-free sequence comparison can be naturally placed in a geometric context. This reformulation provides us with an opportunity to apply computational geometry techniques to organize sequence data and solve a variety of problems in the alignment-free comparison context. We briefly survey such techniques, along with supporting data structures, with emphasis on ones designed to handle high dimensional geometric data. In the following chapters, we incorporate such data structures and introduce efficient algorithms to solve the discussed problems.

2.2 Geometric representation of biological sequences

Much of the literature on alignment-free sequence comparison addresses the statistical features of word frequencies in sequences. Sequences are often represented by word count vectors, and inferences are made using similarity scores defined for those vectors. In its simplest form, the number of $k$-letter words ($k$-mers) that a pair of sequences have in common is counted for small values of $k$. This results in a well-studied and often applied statistic called $D_2$ (Torney et al., 1990). Several investigations have analyzed the properties of the $D_2$ statistic and its variants (Burden et al., 2008; Liu et al., 2011). For instance, Kantorovitz et al. (2007) introduce $D_2^z$, the normalized version of $D_2$, and show that the new measure considerably outperforms $D_2$ when sequences are generated from different models. In another example, Göke et al. (2012) propose $N_2$ by stretching the exact matching between short $k$-mers to include those with slight differences. The idea is to define a "neighborhood" $n(w)$ for each word $w$, containing all oligonucleotides similar to $w$, and to compute a standardized similarity score based on neighborhoods instead of words.

In such studies, one important concern is to estimate the asymptotic distribution of the statistical measure under the null hypothesis that two sequences are generated by a particular model (e.g. i.i.d. (Forêt et al., 2009) or Markovian (Kantorovitz et al., 2007) sequences). Once the distribution of the statistic is approximated for such sequences, meaningful thresholds can be established to measure the deviation from the null hypothesis signalling biological relatedness (Lippert et al., 2002; Reinert et al., 2009; Wan et al., 2010). Due to the simplicity and time efficiency of the calculation of $D_2$, this statistic and its transformations have previously been used for EST sequence searches (Lippert et al., 2002). Next, we give a formal treatment of $D_2$ and two of its particular variants, $D_2^*$ and $D_2^S$, along with technical concepts used throughout subsequent sections.

We assume all strings are over a fixed alphabet $\mathcal{A}$. For any string $S$, let $S[i..j]$ denote the substring of $S$ beginning at position $i$ and having length $j-i+1$.
A $k$-mer is a string of length $k$, and we let $\mathcal{A}^k$ denote the set of possible $k$-mers. Define the count of $k$-mer $z$ in string $S_i$ as
\[ v_{iz} = |\{\, j : S_i[j..j+k-1] = z \,\}|. \]
The $k$-mer count vector associated with $S_i$ is
\[ V_i = \{ v_{iz} : z \in \mathcal{A}^k \}, \]
with order implied. The similarity measure $D_2$ between strings $S_1$ and $S_2$ is defined as
\[ D_2(S_1, S_2) = V_1 \cdot V_2 = \sum_{z=1}^{d} v_{1z} v_{2z}, \quad (2.1) \]
i.e. the dot product between the $k$-mer count vectors for $S_1$ and $S_2$. Here $d = |\mathcal{A}|^k$ is the dimension of each vector and we treat count vectors as points in $\mathbb{R}^d$. Dot products are computed as usual, the norm of a point $p$ is $\|p\| = \sqrt{p^T p}$, and the angle between points $p$ and $q$ is $\theta_{pq} = \arccos\big(p^T q / (\|p\|\|q\|)\big)$.

The $D_2$ similarity measure was first applied to molecular sequences by Torney et al. (1990). Neglecting to account for statistical properties of $k$-words inherent in various kinds of molecular sequences has proven problematic (Lippert et al., 2002), and augmented measures have been designed to improve upon $D_2$. If we view $S_i$ as a random sequence, letting $p_{ia}$ denote the probability of drawing letter $a \in \mathcal{A}$ when generating $S_i$ and ignoring existing correlations, then for any $z = z_1 z_2 \cdots z_k \in \mathcal{A}^k$,
\[ p_{iz} = \prod_{j=1}^{k} p_{i z_j}, \]
so $\bar{n} p_{iz} = (n-k+1) p_{iz}$, with $n = |S_i|$, approximates the expected number of occurrences of $z$ in $S_i$. The associated standard deviation is denoted $\sigma_{iz}$. The $D_2^*$ measure introduced by Reinert et al. (2009) is defined as
\[ D_2^*(S_1, S_2) = \sum_z \frac{v^c_{1z} v^c_{2z}}{\sqrt{\sigma^2_{1z} \sigma^2_{2z}}}, \quad (2.2) \]
where $v^c_{iz} = v_{iz} - \bar{n} p_{iz}$. Since frequencies of $k$-mers in sequences can be approximated using a Poisson approximation, $\sigma^2_{iz} \approx \bar{n} p_{iz}$, and we can simplify Equation (2.2). Let
\[ \tilde{v}_{iz} = \frac{v_{iz} - \bar{n} p_{iz}}{\sqrt{\bar{n} p_{iz}}}, \]
and $\tilde{V}_i = \{ \tilde{v}_{iz} : z \in \mathcal{A}^k \}$. Then we can also express $D_2^*$ as a dot product of vectors:
\[ D_2^*(S_i, S_j) = \tilde{V}_i \cdot \tilde{V}_j. \quad (2.3) \]
We will also consider $D_2^S$, a variant of $D_2$ introduced by Reinert et al. (2009):
\[ D_2^S(S_1, S_2) = \sum_z \frac{v^c_{1z} v^c_{2z}}{\sqrt{\sigma^2_{1z} + \sigma^2_{2z}}} \approx \sum_z \frac{v^c_{1z} v^c_{2z}}{\sqrt{\bar{n} p_{1z} + \bar{n} p_{2z}}}. \quad (2.4) \]

Although $D_2$ and its variants are powerful in comparative studies of genomic sequences, their application to NGS sequencing data is limited. One common problem of such statistics is their dependency on the length of the underlying sequences and the frequency of nucleotides (Song et al., 2013a). To tackle such problems, Song et al. (2013b) suggested further normalization and showed that the resulting statistics are more applicable to the analysis of unassembled raw reads.

The frequency count vectors can be viewed as points in $\mathbb{R}^d$, where $d = 4^k$ is the number of dimensions. Additionally, several statistical measures find an explicit geometric interpretation, and therefore creating a correspondence between alignment-free sequence comparison and geometric optimization problems is straightforward in many cases. In particular, the calculation of $D_2$ (and some of its variants) often translates to finding the dot product between a pair of vectors representing the sequence count vector, or a linearly transformed version of it. The geometric conceptualization of sequence data, as we shall see in later chapters, leads to a dramatic reduction in the computational complexity of the solution of similarity search problems. Furthermore, we note that it is applicable to both completely assembled genomes and NGS short read data. We leave problem-specific details to the upcoming chapters, where we comment on them more formally. To make the definitions above concrete, a small computational sketch follows.
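The following is a minimal Python sketch of $k$-mer counting and the $D_2$ dot product of Equation (2.1). It is a naive illustration of the definitions, not the implementation used in this dissertation, and the example strings are invented.

```python
from itertools import product

def kmer_count_vector(s, k, alphabet="ACGT"):
    """k-mer count vector V over all |alphabet|^k k-mers, lexicographic order."""
    index = {"".join(w): i for i, w in enumerate(product(alphabet, repeat=k))}
    v = [0] * len(index)
    for j in range(len(s) - k + 1):   # every window of length k contributes one count
        v[index[s[j:j + k]]] += 1
    return v

def d2(s1, s2, k):
    """D2: the dot product of the two k-mer count vectors (Equation 2.1)."""
    v1, v2 = kmer_count_vector(s1, k), kmer_count_vector(s2, k)
    return sum(a * b for a, b in zip(v1, v2))

# The 2-mer "AC" occurs twice in each string and dominates the score.
print(d2("ACACGT", "TACACG", 2))   # prints 6
```

Note that each count vector has $\ell_1$ norm $|s| - k + 1$, a property exploited later when bounding the $\ell_2$ norms of window vectors.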
Next, we briefly explain algorithmic concepts and spatial data structures used later in this dissertation to address sequence comparison problems within our alignment-free frameworks.

2.3 Similarity search in high dimensional space

Geometric similarity search appears at the heart of many analytic challenges. With a broad range of applications, from geographical information systems (van Oostrum, 1999) to querying multimedia databases (Faloutsos et al., 1994), it has been recognized as a fundamental problem in computational geometry (Preparata & Shamos, 1985). Similarity search can often be formulated as the problem of range searching: let $S$ be a set of $n$ points in $d$-dimensional space. For a collection $T$ of $\mathbb{R}^d$ subspaces called "ranges", we aim to preprocess $S$ and generate a data structure $D$ such that for any range query $q \in T$, the elements of $S \cap q$ can be efficiently identified (Agarwal & Erickson, 1999). "Efficient identification" of such elements usually means a sub-linear time procedure with regard to the input size $n$, since the naive algorithm can check every element of $S$ and answer the query in linear time.

For a point $p \in \mathbb{R}^d$, define $B_r(p)$ as the $d$-dimensional ball ($d$-ball) centered at $p$ with radius $r$:
\[ B_r(p) = \{ x \in \mathbb{R}^d \mid d(x, p) \le r \}, \]
where $d(x, p)$ is a metric distance measure between $x$ and $p$. Here, unless otherwise stated, we use the Euclidean distance. When no confusion arises, by a $d$-ball we mean $B_r(0)$. For each query point $q \in \mathbb{R}^d$, let $T = B_r(q)$; then range searching reduces to the prominent problem of near neighbor search.

Near neighbor search
Input: A set $S$ containing $n$ points in $\mathbb{R}^d$, the query point $q$ and a parameter $r$ called the search radius.
Question: What are the elements of $S \cap B_r(q)$?

The optimization version of the near neighbor problem seeks, among the elements of $S \cap B_r(q)$, the point with the smallest distance to the query, and is hence named the nearest neighbor search (NNS) problem. We present several sequence comparison problems that will be converted to instances of NNS, and henceforth we focus on algorithmic solutions for this problem. The nearest neighbor search problem has satisfactory solutions for low dimensional spaces. Particularly for data in $\mathbb{R}^2$ and $\mathbb{R}^3$, several elegant data structures are available with sublinear complexity (Finkel & Bentley, 1974; Cormen et al., 2001; Samet, 2006). One popular example is the $k$-dimensional tree (or kd-tree), a space partitioning data structure first introduced in the classic work of Bentley (1975).

Construction of a kd-tree is a recursive procedure. First, for a randomly selected $1 \le i \le d$, the median of the $i$th coordinate values of all points in $S$ is calculated and stored in $M$. Then $S$ is partitioned into two disjoint subsets $S_L$ and $S_R$, where the subset $S_L$ contains only points with $i$th coordinate values less than $M$, and $S_R = S - S_L$. Since we took $M$ to be the median value, the resulting subsets are "balanced". This procedure is then repeated on both subsets with an increment of the coordinate number $i$ (i.e. $i \leftarrow i+1$, or $i \leftarrow 1$ for the last coordinate). The recursion stops when a newly generated subset has only one element. Intuitively, the result is a binary tree of height $\log n$ representing the points of $S$ at the leaves. Note that since the median of an array with $n$ elements can be computed in $O(n)$, the time complexity of this procedure is $O(n \log n)$. Several algorithms have been suggested for solving NNS using a kd-tree (Shakhnarovich et al., 2006), with a logarithmic dependency on the size of the input in many practical situations; a sketch of the construction just described follows.
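Here is a minimal sketch of the recursive construction, assuming points are tuples of equal dimension. For brevity it finds the median by sorting (rather than a linear-time selection) and cycles the splitting coordinate deterministically instead of choosing the first one at random.

```python
def build_kdtree(points, axis=0):
    """Recursively split the point set on the median of the current axis."""
    if len(points) <= 1:
        return {"leaf": points[0]} if points else None
    pts = sorted(points, key=lambda p: p[axis])     # median via sorting
    mid = len(pts) // 2
    nxt = (axis + 1) % len(pts[0])                  # cycle the coordinates
    return {"axis": axis,
            "median": pts[mid][axis],
            "left": build_kdtree(pts[:mid], nxt),   # values below the median
            "right": build_kdtree(pts[mid:], nxt)}  # values at or above it

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```

Because each level halves the point set, the tree has height about $\log n$, which is what gives NNS queries their logarithmic behavior in favorable settings.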
In particular, as the analysis of Friedman et al. (1977) demonstrates, if the data points are randomly distributed in the search space and the query is independent of them, the nearest neighbor is identified in $O(f(d) \log n)$ time. This means that for a fixed number of dimensions, the kd-tree gives a logarithmic solution for the NNS problem.

Despite the success in solving a variety of range search problems, and in particular NNS, in lower dimensional spaces, these problems are mostly open for high dimensional data. The central issue is that the analysis of the complexity of such algorithms reveals an exponential dependence on the dimension. This is recognized as one aspect of the curse of dimensionality (Arya et al., 1998; Andoni & Indyk, 2006), which has shifted efforts from solving the near neighbor search problem towards its approximate version, defined below.

Figure 2.1: An instance of the nearest neighbor search problem with a search radius $r$ and the performance ratio $c = 1.5$.

Approximate nearest neighbor search
Input: A set $S$ containing $n$ points in $\mathbb{R}^d$, the query point $q$ and parameters $r > 0$, $c > 1$ and $\delta > 0$.
Question: For every point $p$ in $S \cap B_r(q)$, return either $p$ or a candidate point $p' \in S$ such that $d(q, p') \le cr$, with probability not less than $1 - \delta$.

In words, the approximate near neighbor problem will accept solutions that lie within the ball centered at $q$ with an extended radius of $cr$. The parameter $c$ can be viewed as the performance ratio and determines the admissible extension to the proximity of the query. Figure 2.1 depicts an instance of the near neighbor search problem with the candidates and the real solution.

2.4 Nearest neighbor graphs

One fundamental data structure for dealing with spatial data is the nearest neighbor graph. This is a concise representation of a set of interrelated objects that is concerned with the geometric distances between them. When a fixed number $m$ of nearest neighbors is stored for each object, the term $m$ nearest neighbor graph, or $m$-NNG, refers to the directed graph with as many nodes as the number of objects, in which each node is connected to its $m$ closest neighbors (Samet, 2006). In our context, and in its most general form, an $m$-NNG contains $n$ nodes corresponding to the $n$ sequence data elements in $S$. Each node is a $k$-mer count vector and is connected to its $m$ "closest" $k$-mer count vectors with respect to a specific similarity measure. Figure 2.2 illustrates the construction of a 2-NNG for a set $S$ containing 4 sequences, where $k = 1$. The corresponding $k$-mer counts are points in $\mathbb{R}^4$ whose coordinates show the frequencies of the letters A, C, G and T respectively. For simplicity, we choose the Euclidean distance as the similarity measure for constructing this 2-NNG.

Figure 2.2: (A) Set $S$ of sequences. (B) Corresponding count vectors for $k = 1$. (C) Euclidean pairwise distances between vectors; only the upper triangle is given because of symmetry. (D) The resulting 2-NNG.

Nearest neighbor graphs have various applications in different areas, including pattern recognition (Beis & Lowe, 1997) and computational geometry (Miller et al., 1997), and efficient algorithms have been proposed to construct them (Dong et al., 2011); a brute-force construction is sketched below.
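As an illustration, the brute-force construction below computes all pairwise Euclidean distances, which is exactly the $O(n^2)$ work the efficient algorithms cited above are designed to avoid. The four 1-mer count vectors are invented stand-ins in the spirit of Figure 2.2, not the figure's actual values.

```python
import math

def m_nng(vectors, m):
    """Brute-force m-nearest-neighbor graph: for each vector, the indices
    of its m closest vectors under Euclidean distance."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    graph = {}
    for i, p in enumerate(vectors):
        others = [(dist(p, q), j) for j, q in enumerate(vectors) if j != i]
        graph[i] = [j for _, j in sorted(others)[:m]]
    return graph

# 1-mer count vectors (A, C, G, T) for four hypothetical short sequences.
counts = [(2, 1, 1, 0), (1, 2, 0, 1), (2, 0, 2, 0), (0, 1, 1, 2)]
print(m_nng(counts, m=2))   # each node points to its 2 nearest neighbors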
2.5 Locality sensitive hashing

Locality sensitive hashing (LSH) is a randomized algorithm for dimension reduction in high dimensional data. Since its introduction by Indyk & Motwani (1998), it has found applications in different contexts such as computer vision (Shakhnarovich et al., 2006), image and audio retrieval (Yu et al., 2009; Kulis & Grauman, 2009) and document similarity search (Haveliwala et al., 2000), to name a few. Closer to biological sequence comparison, Buhler (2001) applies LSH to efficiently estimate the edit distance between a pair of sequences in both global and local regimes, in an effort to break the quadratic behavior of alignment algorithms.

LSH offers an approximate solution to the NNS problem with guaranteed bounds on the error probability, provided some conditions on the similarity measure are satisfied (Gionis et al., 1999; Charikar, 2002; Datar et al., 2004). The basic idea is to hash points using a function that ensures nearby points are more likely to hash into the same bucket. More formally, a family $\mathcal{H}$ of hash functions is "locality sensitive" with respect to parameters $r > 0$ (the search radius) and $c > 1$ (the performance ratio) if for any two points $p$ and $q$ in $\mathbb{R}^d$ and any $h \in \mathcal{H}$, the following conditions hold:

- if $\|p - q\| \le r$ then $\Pr(h(p) = h(q)) \ge P_1$,
- if $\|p - q\| \ge cr$ then $\Pr(h(p) = h(q)) \le P_2$,

where $P_1 > P_2$ in any practical case. In effect, the hash function $h$ measures, probabilistically, the relative locality of any point $p$ (presumably in a large database) with regard to a point $q$ (the query). In the context of alignment-free sequence comparison, we may make a direct correspondence between the relative locality of two points and the similarity between two $k$-mer count vectors, and write
\[ \Pr(h(p) = h(q)) = \mathrm{sim}(p, q). \]

One particular family, used in later chapters, provides locality sensitive hash functions when the similarity measure is the angle between points. Suppose $u$ is a point randomly sampled from the boundary of the $d$-dimensional ball centered at the origin. Assuming a source of random bits, simple algorithms are known for generating such points (Muller, 1959). For any $p \in \mathbb{R}^d$ define
\[ h_u(p) = \begin{cases} 1 & \text{if } p \cdot u \ge 0, \\ 0 & \text{otherwise.} \end{cases} \quad (2.5) \]
It is shown by Goemans & Williamson (1995) that for any two points $p$ and $q$ with angle $0 \le \theta_{pq} \le \pi$,
\[ \Pr\big( h_u(p) = h_u(q) \big) = 1 - \theta_{pq} / \pi. \quad (2.6) \]
It is easy to see that $h_u$ satisfies the requirements of a locality sensitive hash function when distances are measured as angles:

- if $\theta_{pq} \le \theta$ then $\Pr(h(p) = h(q)) \ge 1 - \theta/\pi$,
- if $\theta_{pq} \ge c\theta$ then $\Pr(h(p) = h(q)) \le 1 - c\theta/\pi$.

This function originated in the theoretical work of Goemans & Williamson (1995), prior to its use in LSH (Charikar, 2002). Subsequently it has been applied in different contexts (Ravichandran et al., 2005). We skip the theoretical details of LSH and refer the interested reader to the excellent review article by Andoni & Indyk (2008). In later chapters, we describe how LSH can be effectively applied to solve approximate NNS and its close variants. A minimal sketch of the hash family of Equation (2.5) appears below.
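The sketch below implements Equation (2.5) and empirically checks Equation (2.6); the sampling of $u$ follows Muller (1959) by normalizing a vector of independent standard normals. It is an illustration only, with invented example points.

```python
import math
import random

def random_unit_vector(d):
    """Sample u uniformly from the surface of the d-ball (Muller, 1959)."""
    g = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def h(u, p):
    """Equation (2.5): which side of the hyperplane with normal u holds p."""
    return 1 if sum(a * b for a, b in zip(p, u)) >= 0 else 0

# Equation (2.6): the collision probability is 1 - angle(p, q) / pi.
p, q = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]   # angle pi/4, so expect ~0.75
hits = 0
for _ in range(10000):
    u = random_unit_vector(3)             # the same u must hash both points
    hits += (h(u, p) == h(u, q))
print(hits / 10000)                       # close to 1 - (pi/4)/pi = 0.75
```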
Chapter 3
A geometric interpretation for local alignment-free sequence comparison

In this chapter, we present our solution for the local alignment-free sequence comparison problem under $D_2$, and its standardized variant $D_2^*$, as the similarity measure. We show that the geometric representation of sequence data transforms this problem into finding the maximum dot-product value between two vectors, each selected from a large set of vectors. We introduce a spatial data structure to efficiently reduce this problem to the bichromatic closest pair (BCP) problem, a well-known problem in computational geometry. Challenges arising from the high dimensionality of our data motivate an approximate solution for BCP. We describe our algorithm, based on locality sensitive hashing (see Section 2.5), for solving BCP with a guaranteed upper bound on the error probability. Finally, the behavior of the algorithm is empirically evaluated under different scenarios. We show that local relatedness between sequences can be identified by the proposed algorithm with a time complexity inversely related to the similarity between the sequences. In other words, when two sequences do not harbor any regions of high sequence similarity, our algorithm shows only slight improvement over the naive solution. However, as the strength of the local biological signal increases, the improvement becomes quite significant.

3.1 Local similarity problem revisited

We are interested in local similarity between two sequences, analogous to local alignment. While local alignments seek to optimize both the locations and the lengths of the aligned substrings, we restrict ourselves to windows of fixed size. So in pairwise sequence comparison, we seek to identify windows of fixed width, one in each of two strings, such that the similarity between the pair of windows is maximal over all possible pairs of substrings.

Local alignment-free pairwise sequence similarity
Input: Two strings $S_1$ and $S_2$, both of length $n$, a similarity measurement score $s$ and positive integers $w \le n$ and $k \le w$.
Question: What is the maximum value of $s(S_1[i..i+w-1], S_2[j..j+w-1])$ over all $1 \le i, j \le n-w+1$?

Here $w$ is the window size, which is fixed, and for all practical applications $k = o(w)$ and $w = o(n)$. The similarity measure $s$ is assumed to be based on the $k$-mer count vectors. In particular, we study dot-product based similarity measures, for example $D_2$ and $D_2^*$.

For each of the similarity measures in this study, the naive or brute-force algorithm for solving the local pairwise similarity problem iterates over all possible pairs of windows in the two given sequences, and explicitly computes a dot product for each. For sequences of length $n$, this requires $\Theta(\lambda n^2)$ time, where $\lambda$ is the time required to compute each dot product. This time depends on how we represent count vectors (including the standardized vectors), but will generally be $\Theta(d)$.

When we think of each window $S[i..i+w-1]$ as a point in $k$-mer count space, the sequence of such points has an interesting geometric interpretation. Each point in the sequence differs from the previous point by at most 1 unit (i.e. a single count) in at most 2 dimensions. One new $k$-mer is included, and the corresponding count increases. Another, possibly the same, $k$-mer is excluded, and the corresponding count decreases. We refer to this dynamic of our points as the "sliding property" and use it later to improve the running time of our algorithm. The naive algorithm can also take advantage of this property. For each update it identifies the "in" and "out" $k$-mers separately in logarithmic time and modifies the corresponding counts. Since $\Theta(n^2)$ updates are required to find the largest local dot product between two sequences, the naive algorithm takes $\Theta(k \log(|\mathcal{A}|) n^2)$ time using the sliding property. By encoding the sequences as suffix trees and making clever use of suffix links, the factor of $k$ can be eliminated. A sketch of the sliding update appears below.
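To illustrate the sliding property, here is a minimal sketch that maintains the count vector of a moving window, touching at most two coordinates per slide. The sparse dictionary representation is a choice made for this illustration, not the representation used in the algorithms of this chapter.

```python
def window_counts(s, w, k):
    """Yield the k-mer counts of every length-w window of s, updating
    incrementally: each slide changes at most two coordinates by one."""
    counts = {}
    for j in range(w - k + 1):                 # counts of the first window
        z = s[j:j + k]
        counts[z] = counts.get(z, 0) + 1
    yield dict(counts)
    for i in range(1, len(s) - w + 1):
        out, new = s[i - 1:i - 1 + k], s[i + w - k:i + w]
        counts[out] -= 1                       # k-mer leaving on the left
        if counts[out] == 0:
            del counts[out]
        counts[new] = counts.get(new, 0) + 1   # k-mer entering on the right
        yield dict(counts)

for counts in window_counts("ACGTACGGT", w=5, k=2):
    print(counts)
```

Each update is constant time per changed coordinate, which is what reduces the naive algorithm's per-window cost from $\Theta(d)$ to the logarithmic bookkeeping described above.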
3.2 A geometric framework for maximizing similarity

In this section we place the local alignment-free sequence comparison problem in a geometric context that can transform a large class of similarity measures into distances satisfying the triangle inequality. Our framework is sufficiently general that it can be used for many global alignment-free similarity optimization problems. In this section we assume that the similarity measure is the basic $D_2$. We remark that our framework can be applied to other variants of this statistic with only slight modification. First we recast the local alignment-free similarity problem as the maximal dot product problem.

Maximal dot product (MDP)
Input: Two sets $R$ and $B$ of vectors in $\mathbb{R}^d$.
Question: What is the maximum value of $r \cdot b$ over any pair $(r, b) \in R \times B$?

The transformation from local alignment-free pairwise similarity under $D_2$ is clear: the vectors in $R$ are the count vectors for the length-$w$ windows in $S_1$, and the vectors in $B$ are the count vectors for the length-$w$ windows in $S_2$. When referring to these sets of points, unless otherwise stated we assume $|R| = |B| = n$, which is equivalent to assuming $|S_1| = |S_2|$ in the original problem instance.

We can say a few things immediately about MDP problem instances obtained from instances of pairwise local alignment-free similarity. First, the dimensions correspond to the $k$-mers over the sequence alphabet, $d = |\mathcal{A}|^k$. This should immediately raise concerns among those familiar with high-dimensional optimization problems, which frequently require time that is exponential in the dimension of the space. Second, when the similarity measure is $D_2$, the $\ell_1$ norm of each of these vectors is precisely determined by the values of $w$ and $k$:
\[ \|u\|_1 = w - k + 1, \]
for all $u \in R \cup B$, since each possible $k$-mer in a window contributes one count to the vectors. More generally, for other similarity measures in the original problem, the $\ell_1$ norms can usually be bounded. We can also bound the $\ell_2$ norms of these vectors. In the case of $D_2$, the maximum possible norm is achieved when all counts are for the same $k$-mer. This can only happen for $|\mathcal{A}|$ distinct substrings, corresponding to runs of the same letter. We have the following bound under $D_2$, which will be useful later:
\[ \sqrt{w - k + 1} \;\le\; \|u\|_2 \;\le\; w - k + 1. \quad (3.1) \]

The naive solution for MDP is the quadratic time evaluation of all pairs of elements from the two sets. Nothing is trivially gained by transforming local alignment-free similarity into MDP, and our definition of MDP does not reflect the potentially useful structural property relating count vectors of consecutive windows. The dot product between vectors is related to the angle between those vectors, and optimization involving angles between vectors has received attention in the context of comparing documents based on the cosine similarity metric (Tan et al., 2006). Identifying the pair of points with maximum cosine similarity is equivalent to finding the pair with minimum angle, but ignores the magnitudes of those points. Instead of directly solving the MDP problem, we transform it into the well-studied bichromatic closest pair problem.

Bichromatic closest pair (BCP)
Input: Two sets $R$ and $B$ of vectors in $\mathbb{R}^d$.
Question: What is the minimum value of $\|r - b\|$ over any pair $(r, b) \in R \times B$?

The BCP problem was first addressed by Yao, under the name nearest foreign neighbor, in the context of geometric spanning trees (Yao, 1982). We will show how to efficiently approximate MDP if an oracle for BCP is available, and in later sections we will explain how to rapidly solve BCP for our problem. We begin with the following straightforward observation, which provides motivation for our approach. Suppose $\|r\| = x_r$ for all $r \in R$ and $\|b\| = x_b$ for all $b \in B$.
Then, by the cosine law,
\[ \|r - b\|^2 = x_r^2 + x_b^2 - 2 x_r x_b \cos(\theta_{rb}), \]
where $\theta_{rb}$ is the angle between $r$ and $b$ relative to the origin. So if the $\ell_2$ norms of all points in $R$ and $B$ are fixed, any bichromatic pair of points with maximal dot product also has minimal Euclidean distance:
\[ \arg\max_{r \in R,\, b \in B} (r \cdot b) = \arg\min_{r \in R,\, b \in B} \|r - b\|. \]

As explained above, the $\ell_1$ norm is fixed for all points when the original similarity measure is $D_2$, but the $\ell_2$ norm can vary (Equation (3.1)). We introduce the sequential layering framework (SLF) to transform the original vectors in a way that constrains their $\ell_2$ norms. We now explain how the SLF is used to partition the set $R$. The procedure is identical for $B$, but the two sets must be partitioned separately. Define the set $\mathcal{S}_R = \mathcal{S} = \{S_1, \ldots, S_{m+1}\}$ of hyperspheres centered at the origin. Note that we have dropped the subscript from $\mathcal{S}_R$ for convenience, and $\mathcal{S}$ with no subscript is assumed to be defined relative to $R$. We will explain below how we compute the value of $m$. The radii of the hyperspheres are defined recursively as follows:
\[ \mathrm{radius}(S_1) = \min_{r \in R} \|r\|, \]
and for $2 \le i \le m+1$,
\[ \mathrm{radius}(S_i) = \rho \cdot \mathrm{radius}(S_{i-1}). \]
We will explain later how the constant $\rho > 1$ is determined. For each $r \in R$, define the orthogonal projection
\[ \mathrm{proj}(r, \mathcal{S}) = \frac{\mathrm{radius}(S_{i^*})}{\|r\|}\, r, \qquad i^* = \max\{\, 1 \le i \le m : \mathrm{radius}(S_i) \le \|r\| \,\}, \]
which effectively "shrinks" $r$ by the smallest amount so that it resides on a hypersphere in $\mathcal{S}$. Then define, for $1 \le i \le m$,
\[ R_i = \{\, \mathrm{proj}(r, \mathcal{S}) : \|\mathrm{proj}(r, \mathcal{S})\| = \mathrm{radius}(S_i) \,\}, \]
and let $\mathcal{R} = \{R_1, \ldots, R_m\}$. We define $\mathcal{B} = \{B_1, \ldots, B_m\}$ similarly, based on $B$ and $\mathcal{S}_B$.

Our procedure is as follows. As mentioned previously, we assume an oracle for BCP. The $m$-partitions of $R$ and $B$ produced by the SLF are used to create $m^2$ subproblems. We solve each of these subproblems by identifying pairs of points from each $R_i \times B_j$ that solve $\mathrm{BCP}(R_i, B_j)$. Dot products are only actually computed for the point pairs returned as solutions for BCP from each of the $m^2$ subproblems. We retain the maximum dot product among these $m^2$ as the solution to MDP. Pseudocode for this procedure is presented in Algorithm 1.

Algorithm 1 Approximation for MDP via SLF with an oracle for BCP
Input: Sets $B$ and $R$ of points and a constant $\rho > 1$.
Output: A pair $(r, b) \in R \times B$ with approximately maximal dot product.
1: Construct $\mathcal{B} = \{B_1, \ldots, B_m\}$ and $\mathcal{R} = \{R_1, \ldots, R_m\}$ from $B$ and $R$
2: $(r, b) \leftarrow (\vec{0}, \vec{0})$
3: for $i = 1$ to $m$ do
4:   for $j = 1$ to $m$ do
5:     $(p, q) \leftarrow \mathrm{BCP}(B_i, R_j)$
6:     $(p', q') \leftarrow \big(\mathrm{proj}^{-1}(p, \mathcal{S}_B),\ \mathrm{proj}^{-1}(q, \mathcal{S}_R)\big)$
7:     if $p' \cdot q' > r \cdot b$ then
8:       $(r, b) \leftarrow (p', q')$
9:     end if
10:   end for
11: end for
12: return $(r, b)$

In the pseudocode we use the notation $\mathrm{proj}^{-1}$ to denote the inverse of the projections used above to define $\mathcal{R}$ and $\mathcal{B}$ from $R$ and $B$. What remains is to determine an appropriate value for $m$, and to explain why the bichromatic point pair produced by this algorithm is guaranteed to have a dot product within a certain factor of the optimal.

Proposition 1. If we select $\rho = \sqrt{\epsilon}$ for some $\epsilon > 1$ in Algorithm 1, then the optimal dot product value is at most $\epsilon$ times the value returned by the algorithm.

Proof. Denote the optimal pair $(r_{\mathrm{opt}}, b_{\mathrm{opt}})$ and suppose $r_{\mathrm{opt}} \in R_i$ and $b_{\mathrm{opt}} \in B_j$. Let $(r, b)$ be the output of BCP for the same sub-problem. For convenience we write $u' = \mathrm{proj}(u, \mathcal{S})$ for every point $u$. For any point $u$ we have $\|u'\| \le \|u\| < \rho \|u'\|$. Since the angle between any pair of points is preserved when both are projected towards the origin,
\[ r'_{\mathrm{opt}} \cdot b'_{\mathrm{opt}} \;\le\; r_{\mathrm{opt}} \cdot b_{\mathrm{opt}} \;<\; \epsilon\, (r'_{\mathrm{opt}} \cdot b'_{\mathrm{opt}}). \]
On the other hand, BCP guarantees $r'_{\mathrm{opt}} \cdot b'_{\mathrm{opt}} \le r' \cdot b'$.
Proposition 1. If we select $\gamma = \sqrt{\alpha}$ for some $\alpha > 1$ in Algorithm 1, then the optimal dot product value is at most $\alpha$ times the value returned by the algorithm.

Proof. Denote the optimal pair $(r_{\mathrm{opt}}, b_{\mathrm{opt}})$ and suppose $r_{\mathrm{opt}} \in R_i$ and $b_{\mathrm{opt}} \in B_j$. Let $(r, b)$ be the output of BCP for the same subproblem. For convenience we write $u' = \mathrm{proj}(u, \mathcal{S})$ for every point $u$. For any point $u$ we have $\|u'\| \le \|u\| < \gamma \|u'\|$. Since the angle between any pair of points is preserved when both are projected towards the origin,
$$r'_{\mathrm{opt}} \cdot b'_{\mathrm{opt}} \;\le\; r_{\mathrm{opt}} \cdot b_{\mathrm{opt}} \;<\; \alpha\, (r'_{\mathrm{opt}} \cdot b'_{\mathrm{opt}}).$$
On the other hand, BCP guarantees $r'_{\mathrm{opt}} \cdot b'_{\mathrm{opt}} \le r' \cdot b'$. Therefore
$$r_{\mathrm{opt}} \cdot b_{\mathrm{opt}} \;<\; \alpha\, (r' \cdot b') \;\le\; \alpha\, (r \cdot b).$$
If $(r, b)$ is replaced in a subsequent iteration, the resulting dot product only increases, and the above inequality holds for the replacing pair of points. □

We now explain how the size $m$ of the partitions of $R$ and $B$ is determined based on our desired performance ratio.

Proposition 2. There exists a set of $\Theta(\log_\alpha \omega)$ hyperspheres partitioning $B$ such that for all $b \in B$,
$$\|b\| \,/\, \|\mathrm{proj}(b, \mathcal{S}_B)\| \;<\; \sqrt{\alpha}.$$

Proof. We show that $m = \Theta(\log_\alpha \omega)$ suffices for the orthogonal projection of all points in $B$ with performance ratio $\alpha > 1$. Denote by $b_{\min}$ ($b_{\max}$) the point with the minimum (maximum) norm in $B$. As we previously described (Equation (3.1)), $\|b_{\min}\| \ge \sqrt{\omega}$ and $\|b_{\max}\| \le \omega$, where $\omega = w - k + 1$. Since $b_{\min}$ and $b_{\max}$ are projected to $S_1$ and $S_m$ respectively, $\|b_{\max}\| / \|b_{\min}\|$ cannot be greater than $\mathrm{radius}(S_{m+1}) / \mathrm{radius}(S_1)$, and thereby
$$\frac{\omega}{\sqrt{\omega}} \;\le\; \alpha^{m/2},$$
and the desired bound for $m$ follows. In fact, $m = \lceil \log_\alpha \omega \rceil + 1$ hyperspheres suffice for partitioning $B$ with the performance ratio. □

Note that since $k = o(w)$, we can remove the dependency of $m$ on $k$ and write $m = \Theta(\log_\alpha w)$. In the statement of the next result we explicitly bound the constant $c$ between 1 and 2. While this may seem artificial, it simplifies the proof compared with a more general statement, and also corresponds to actual bounds: we can never do better than linear time, and the naive algorithm takes quadratic time.

Theorem 1. Given an oracle that solves BCP($R_i$, $B_j$) in $O\!\left((|R_i| + |B_j|)^c\right)$ time, for some constant $1 \le c \le 2$, MDP can be approximated with performance ratio $\alpha$ in time $O(\log_\alpha(w)\, n^c)$.

Proof. Let $T_{\mathrm{BCP}}$ be the time complexity of BCP and define $T_{\mathrm{SLF}}(R, B)$ similarly for the runtime of solving MDP via SLF. Partitioning $R$ (or $B$) takes linear time. Therefore
$$T_{\mathrm{SLF}}(R, B) = \sum_{i=1}^{m} \sum_{j=1}^{m} T_{\mathrm{BCP}}(R_i, B_j) + \Theta(n) \;\le\; a \sum_{i=1}^{m} \sum_{j=1}^{m} (|R_i| + |B_j|)^c + \Theta(n),$$
for some constant $a$. Let $X = (x_1, \ldots, x_m)$ and $Y = (y_1, \ldots, y_m)$ and define
$$f(X, Y) = \sum_{i=1}^{m} \sum_{j=1}^{m} (x_i + y_j)^c,$$
subject to $x_i \ge 0$, $y_i \ge 0$ and $\sum x_i = \sum y_i = n$. Using the method of Lagrange multipliers, we obtain
$$f(X, Y) \;\le\; (2m + 2^c - 2)\, n^c.$$
Since the sizes of the members of $\mathcal{R}$ and $\mathcal{B}$ are under the same constraints as $X$ and $Y$, we can substitute the bound on $f(\mathcal{R}, \mathcal{B})$ and conclude
$$T_{\mathrm{SLF}}(R, B) \;\le\; a(2m + 2)\, n^c + \Theta(n) \;=\; O(m n^c) \;=\; O(\log_\alpha(w)\, n^c),$$
since $\log_\alpha(w)$ is the number of hyperspheres required to ensure the performance ratio. □

The sequential layering framework transforms MDP into the well-studied BCP problem with a performance ratio of $\alpha$ and an additional factor of $\log_\alpha w$ in time. BCP can be considered well solved when the number of dimensions is low. Elegant and efficient $o(n^2)$ time algorithms have been designed specifically for points in $\mathbb{R}^2$ and $\mathbb{R}^3$ (Preparata & Shamos, 1985). However, few options exist for solving the general BCP efficiently in high dimensions. Recall that the dimensions of our problem instances are already exponential in a natural parameter of our original problem: $d = 4^k$. The time complexities of existing algorithms are either exponential in the number of dimensions, or rapidly approach quadratic as $d$ grows. The randomized algorithm of Khuller & Matias (1995) is an example of the former, requiring $\Theta(3^d)$ time for a filtering step. As an example of the latter, Agarwal et al. (1991) proposed a randomized algorithm to solve BCP in expected $O(n^{2 - f(d) + \epsilon})$ time, for any positive $\epsilon$, where $f(d) = 2/(\lceil d/2 \rceil + 1)$. In the following section, we describe an algorithm based on hashing for BCP and show that it provides the sub-quadratic oracle required by our algorithm. This is, however, just one approach to solving BCP, and SLF can be regarded as a general framework for finding the maximum local alignment-free score under dot-product-based similarity measures.
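To give a sense of scale for $m$, an illustrative calculation with parameter values that match the simulations reported later ($w = 1000$, $k = 5$, $\alpha = 1.05$; the numbers are our own arithmetic, not taken from the experiments):
$$\omega = w - k + 1 = 996, \qquad m = \lceil \log_{1.05} 996 \rceil + 1 = \left\lceil \frac{\ln 996}{\ln 1.05} \right\rceil + 1 = 143,$$
so the SLF creates $m^2 \approx 2 \times 10^4$ BCP subproblems, each typically far smaller than the original instance.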
3.3 Solving the bichromatic closest pair problem in high dimensions

In this section we present an algorithm for solving BCP in high dimensions based on random hashing. We show that under certain assumptions on the statistical properties of the inputs, the algorithm has a sub-quadratic time complexity. Later we investigate the behavior of the algorithm when relaxing some constraints on the input.

We conduct our analysis under the assumption of i.i.d. sequences, an assumption commonly used in sequence analysis. We assume all of the sequences are generated under the null hypothesis that each alphabet letter is identically and independently distributed and all sequences are independent. We call points extracted from such sequences i.i.d. induced points.

In our context, even if the sequences are i.i.d., we have two sources of dependency: within a count vector, the overlapping $k$-mers in the underlying string are not independent, and between count vectors from the same original string there is an overlap of up to the window size $w$, making consecutive points highly dependent. We refer to the former as the dependency associated with parameter $k$, and the latter as the dependency associated with parameter $w$. We first conduct our analysis ignoring these two sources of dependence, and then explicitly address the dependency associated with $w$. For the dependency associated with $k$, we present simulation results indicating that in practice the behavior of our algorithm is not affected by this form of dependency for a broad range of values of $k$.

A previous study showed that under $D_2^*$, the count of each $k$-mer has an asymptotic standard normal distribution for i.i.d. sequences (Reinert et al., 2009). It is well known that if the coordinates of a $d$-dimensional vector have independent standard normal distributions, then the normalized vector is uniformly distributed on the surface of the $d$-ball (Muller, 1959). Therefore, ignoring all types of dependencies reduces our problem to solving BCP when the input is two independent sets of i.i.d. induced points, each uniformly distributed on the surface of the unit $d$-ball.

Our algorithm relies on the concept of locality sensitive hashing as described in Chapter 2. Recall from Algorithm 1 that we need to solve BCP($R_i$, $B_j$), where $R_i$ and $B_j$ are obtained by transforming subsets of $R$ and $B$. The solution to BCP($R_i$, $B_j$) is obtained by performing $L$ hashing iterations as follows. In one iteration, a set $U = \{u_1, \ldots, u_v\}$ of random unit vectors is generated. Each $p \in R_i \cup B_j$ is hashed using the function $h_U(p)$ below:
$$h_U(p) = \sum_{i=1}^{v} h_{u_i}(p)\, 2^{i-1}, \tag{3.2}$$
where each $h_{u_i}$ is given by Equation (2.5). Then, for every bucket that contains points of both colors, we solve the BCP problem restricted to the (unprojected) pre-images of the points that were hashed into that bucket. The maximum dot product is retained over all buckets and over each of the $L$ iterations.

This is a randomized algorithm, and the parameters $L$ and $v$ determine the relationship between the time complexity of the algorithm and the probability of missing the closest bichromatic pair. In what follows we analyze this algorithm and show how $L$ and $v$ can be chosen to achieve a favorable balance between running time and accuracy.
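A minimal sketch of Equation (3.2) follows (our own illustration, with the hypothetical name bucket_id; it assumes, consistent with the hyperplane description below, that $h_{u_i}(p) = 1$ when $u_i \cdot p > 0$ and 0 otherwise):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Equation (3.2): the bucket id of point p is a v-bit integer whose i-th
// bit is h_{u_i}(p), i.e. 1 iff p falls on the positive side of the random
// hyperplane orthogonal to u_i (p . u_i > 0), as in Equation (2.5).
static std::uint64_t bucket_id(const std::vector<double> &p,
                               const std::vector<std::vector<double>> &U) {
  std::uint64_t id = 0;
  for (std::size_t i = 0; i < U.size(); ++i) {  // assumes v <= 64
    double dp = 0.0;
    for (std::size_t z = 0; z < p.size(); ++z) dp += U[i][z] * p[z];
    if (dp > 0.0) id |= std::uint64_t(1) << i;
  }
  return id;
}
```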
A random string $S$ consists of $|S|/w$ non-overlapping, and hence independent, substrings of length $w$. Under our geometric interpretation as $k$-mer count vectors, these are $|S|/w$ points without any dependency associated with $w$. First we consider such points and analyze our algorithm for BCP($R_i$, $B_j$), assuming all members of $R_i$ and $B_j$ are independent random points. Without loss of generality, let $|R_i| = |B_j| = n$ and assume each point has unit norm; thanks to the SLF of the previous section, we need only be concerned with angles between points.

According to Equation (3.2), the hashing value of a point $p$ is determined by the relative position of $p$ to the $v$ random vectors in $U$. For any $u \in U$, the geometric locus of all points $p$ such that $p \cdot u > 0$ is obtained by (1) partitioning the $d$-ball into two equal parts by the hyperplane orthogonal to $u$, and (2) selecting the half that contains $u$. We may consider this random partitioning as a Bernoulli trial, and therefore two points $p$ and $q$ are hashed to the same bucket iff they fall on the same side of a random hyperplane $v$ times, or equivalently, when the outcomes of the $v$ independent Bernoulli trials for $p$ and $q$ are identical.

Observation 1. Let $v = \log_2 n$ and hash a point set $X$ using $v$ random vectors and the hash family $h$. Then the occupancy of any bucket follows a binomial distribution with parameters $n$ and $p = 1/n$.

This observation, which follows directly from the uniformity assumption on the distribution of points, casts our analysis as a classic occupancy problem. The following upper bound on the maximum occupancy of the buckets can then be established (similar to Theorem 3.1 in Motwani & Raghavan (1995)).

Proposition 3. There exists an absolute constant $c < 2.91$ such that if $n$ independent i.i.d. induced points are hashed to $n$ buckets, the probability that any bucket occupancy exceeds $c \log n$ is at most $1/n^2$.

Consider all $L$ iterations of this hashing-based procedure. The probability that no bucket occupancy ever exceeds $c \log n$ is at least
$$(1 - 1/n^2)^L \;\ge\; 1 - L/n^2.$$
Assuming a sub-linear number of iterations is sufficient to find the bichromatic minimum angle (i.e., $L = o(n)$), the above formula establishes an important fact about our algorithm: with high probability the number of naive dot-product computations in any bucket does not exceed $O(\log^2 n)$, and thereby the BCP algorithm performs $O(nL \log^2 n)$ dot-product computations to obtain and retain the closest bichromatic pair.

It remains to bound the probability that the algorithm fails to identify the closest pair when $L = o(n)$. Suppose points $r_0 \in R_i$ and $b_0 \in B_j$ achieve the minimum angle $\theta_{\min}$ in a given instance of BCP. According to Equation (2.6), the probability that $r_0$ and $b_0$ hash to different buckets in all $L$ iterations satisfies
$$\left(1 - (1 - \theta_{\min}/\pi)^v\right)^L \;\le\; \delta,$$
for some constant error probability threshold $\delta > 0$. For $v = \log_2 n$ this inequality establishes a trade-off between $\theta_{\min}$ and $L$. In particular, it is straightforward to show that $L = O(n^\epsilon)$ with $\epsilon < 1$ is sufficient to satisfy this inequality if $\theta_{\min} < \pi/2$.
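For illustration, with numbers of our own choosing: take $n = 10^5$ (so $v = \lceil \log_2 n \rceil = 17$), $\theta_{\min} = \pi/4$ and $\delta = 0.05$. Then $p_0 = 1 - \theta_{\min}/\pi = 3/4$, the per-iteration collision probability is $p_0^v = (3/4)^{17} \approx 7.5 \times 10^{-3}$, and
$$L \;\ge\; \frac{\ln \delta}{\ln(1 - p_0^v)} \;\approx\; 400 \;\approx\; n^{0.52},$$
comfortably sub-linear, as claimed.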
Proposition 4. Suppose a set $R$ of $n$ independent points is uniformly distributed on the surface of the unit $d$-ball. For any point $b$ on the unit $d$-ball surface and any $\theta$ satisfying
$$\sin\theta \;>\; \left(c\sqrt{d-1}\,(\log n)/n\right)^{1/(d-1)},$$
for some constant $c$, with probability at least $1 - 1/n$ there exists an $r \in R$ such that $\theta_{rb} < \theta$.

Proof. Consider the $d$-ball cap $C_b$ with half angle $\theta_0$ centered at $b$. For any $r \in R$ we say $r \in C_b$ if $\theta_{rb} < \theta_0$. Define $X_i \in \{0, 1\}$ to indicate the event $r_i \in C_b$ and let $X = \sum X_i$. Since the points in $R$ are uniformly distributed,
$$\Pr(X_i = 1) = S(C_b)/S(1) = p,$$
for any $i$, where $S(C_b)$ is the surface area (i.e., the $(d-1)$-dimensional volume) of $C_b$ and $S(1)$ denotes the surface area of the unit $d$-ball. We aim to find the cap with minimum half angle $\theta_0$ such that $\Pr(X = 0) < 1/n$. Since the points in $R$ are independent, from
$$\Pr(X = 0) = (1 - p)^n < 1/n$$
we can bound
$$p \;>\; 1 - (1/n)^{1/n}.$$
Note that $(\log n)/n > 1 - (1/n)^{1/n}$ for all $n$, so it suffices to have $p > (\log n)/n$. We are interested in $S(C_b)$, which can be lower bounded by the volume of the $(d-1)$-dimensional ball with radius $\sin\theta_0$ (Feige & Schechtman, 2002). The surface area and the volume of a $d$-dimensional ball with radius $x$ are
$$S_d(x) = \frac{2\pi^{d/2}}{\Gamma(d/2)}\, x^{d-1} \qquad \text{and} \qquad V_d(x) = \frac{2\pi^{d/2}}{d\,\Gamma(d/2)}\, x^{d}.$$
Therefore,
$$p \;>\; V_{d-1}(\sin\theta_0)\, /\, S_d(1).$$
Considering $\Gamma(x + 1/2) \approx \Gamma(x)\sqrt{x}$ for large $x$, we conclude that for some constant $c$,
$$\sin\theta_0 \;>\; \left(c\sqrt{d-1}\,(\log n)/n\right)^{1/(d-1)}. \qquad \Box$$

This proposition asserts that even for small subproblems with few points, there exists a bichromatic pair $r \in R_i$ and $b \in B_j$ such that $\sin\theta_{rb} < 1$, and thus the minimum bichromatic angle is bounded away from $\pi/2$. We extend these findings to account for the dependence between points associated with $w$, and conclude a sub-quadratic algorithm for solving BCP.

Theorem 2. For i.i.d. induced points, the BCP problem with two point sets, each having $n$ points, can be solved using $O(n^{1+\epsilon} \log n)$ hashing operations and $O\!\left(n^{1+\epsilon} (w \log n)^2\right)$ dot-product computations, for some $\epsilon < 1$.

Proof. For each point the algorithm requires $Lv$ hash function evaluations overall. To bound the number of dot-product computations, we note that the maximum bucket occupancy is $O(w \log n)$: during the hashing process the number of independent i.i.d. induced points in any bucket is $O(\log n)$ with high probability, while for each point there are at most $w - 1$ points sharing some $k$-mers with it, and hence introducing dependency associated with $w$. Assuming $v = \log_2 n$, we have $n$ buckets in each iteration, and therefore the number of dot-product computations does not exceed $O(Ln(w \log n)^2)$. Substituting $L = O(n^\epsilon)$ completes the proof. □

Up to this point, we ignored the dependency associated with the parameter $k$. In Section 3.4.4 we present an empirical evaluation of the behavior of our algorithm in the presence of this dependency. More specifically, we show that the maximum bucket occupancy remains $O(\log n)$ for i.i.d. induced points if $w/d$ is sufficiently large.
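Putting the pieces of this section together, one of the $L$ hashing iterations of the BCP oracle can be sketched as follows (our own illustration; it reuses the hypothetical bucket_id helper from the earlier sketch and scans only buckets containing both colors):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Point = std::vector<double>;

// Declared in the earlier sketch: the v-bit hyperplane hash of a point.
std::uint64_t bucket_id(const Point &p, const std::vector<Point> &U);

static double dot(const Point &a, const Point &b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// One of the L iterations: hash both colors with the same v random vectors,
// then naively scan every bucket containing points of both colors. Run L
// times with fresh U and keep the overall maximum dot product.
static double bcp_iteration(const std::vector<Point> &R,
                            const std::vector<Point> &B,
                            const std::vector<Point> &U) {
  std::unordered_map<std::uint64_t,
                     std::pair<std::vector<std::size_t>,
                               std::vector<std::size_t>>> buckets;
  for (std::size_t i = 0; i < R.size(); ++i)
    buckets[bucket_id(R[i], U)].first.push_back(i);
  for (std::size_t j = 0; j < B.size(); ++j)
    buckets[bucket_id(B[j], U)].second.push_back(j);
  double best = 0.0;
  for (const auto &kv : buckets)  // only mixed buckets contribute pairs
    for (std::size_t i : kv.second.first)
      for (std::size_t j : kv.second.second)
        best = std::max(best, dot(R[i], B[j]));
  return best;
}
```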
3.4 Performance evaluation

We present empirical results to demonstrate both the accuracy and the efficiency of our method. We show that under a reasonable random data model (i.i.d. sequences) our algorithm is typically much more accurate than the theoretical guarantees established above. Moreover, our simulations show that under a planted motif model our algorithm performs almost as accurately as the naive algorithm. Therefore, to the degree that $D_2^*$ describes important sequence similarities, our algorithm is an effective means of identifying regions of local similarity.

3.4.1 Simulation setup

Our simulation experiments require specifying a triple $(n, w, k)$ of parameters. For simulations under the null hypothesis, when the sequences share no interesting local similarity, a pair $S_1$ and $S_2$ of length-$n$ sequences is randomly generated by sampling letters i.i.d. from the DNA alphabet. We use a planted motif model for the alternate hypothesis, which first generates sequences $S_1$ and $S_2$ as for the null hypothesis, but then modifies the sequences. For each of the two sequences, a random location is chosen and the corresponding window of length $w$ is replaced by a sequence that has additional planted occurrences of a specific $k$-mer (called the motif). This $k$-mer is randomly sampled from all of the $d = 4^k$ possible $k$-mers for each experiment. Non-overlapping motif occurrences are inserted randomly, and we use the parameter $\lambda$ to indicate the density of these occurrences; the selected length-$w$ window will contain $\lambda w / k$ occurrences of the motif.

When evaluating an algorithm under the alternate hypothesis, we ask whether the algorithm has identified the pair of windows in which the motif has been planted. The local alignment-free comparison involves identifying a pair of maximally similar windows within the sequences being compared. We compute the amount of overlap between the windows identified by the algorithm ($W_{O_1}$ and $W_{O_2}$, in sequences $S_1$ and $S_2$, respectively) and the windows generated during the simulation ($W_{M_1}$ and $W_{M_2}$ in $S_1$ and $S_2$, respectively). Assume $x$ is the starting position where motifs have been inserted in $S_1$ and $y$ is the analogous position in $S_2$. We define $s_D$ as follows:
$$s_D = s_{D_1} \cdot s_{D_2}, \qquad \text{with} \qquad s_{D_i} = \frac{|W_{M_i} \cap W_{O_i}|}{|W_{M_i} \cup W_{O_i}|}.$$
Here $W_{M_1}$ starts at the $x$-th position in the first sequence and ends at position $x + w - 1$. It is important to note that there is some chance in the simulation that $s_D$ will be less than 1 even for the naive algorithm, if the optimal windows are not exactly those targeted in the simulation process.
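The score $s_D$ depends only on the window start positions and the common window length, so it reduces to interval arithmetic; the following is a minimal sketch with names of our own choosing:

```cpp
#include <algorithm>
#include <cstdint>

// Jaccard overlap |W_M intersect W_O| / |W_M union W_O| of two length-w
// windows given by their start positions (one sequence at a time).
static double window_overlap(std::int64_t m_start, std::int64_t o_start,
                             std::int64_t w) {
  const std::int64_t inter = std::max<std::int64_t>(
      0, std::min(m_start + w, o_start + w) - std::max(m_start, o_start));
  const std::int64_t uni = 2 * w - inter;
  return double(inter) / double(uni);
}

// s_D = s_{D_1} * s_{D_2}: product of the per-sequence overlap scores.
static double s_D(std::int64_t x, std::int64_t o1,
                  std::int64_t y, std::int64_t o2, std::int64_t w) {
  return window_overlap(x, o1, w) * window_overlap(y, o2, w);
}
```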
3.4.2 Study of the naive algorithm

We conducted a set of experiments, repeating each one 100 times while incrementing the length of the sequences from 2000 to 20000 base pairs. Using the naive algorithm, we compare the performance of $D_2$ and two of its variants, $D_2^*$ and $D_2^S$, for different values of $\lambda$. We considered three different scenarios. (1) In a "strong" case, a sufficient number of motifs is planted; we therefore expect an appropriate statistical measure to detect the replaced window in almost all experiments. In our experimental design, this easy case is given by $\lambda = 0.1$. (2) The "moderate" case is given by $\lambda = 0.05$. The planted region contains several motifs, but it is possible (especially for longer sequences) that other windows show stronger similarity solely by random chance. Thus we do not expect flawless performance from any of the statistics; however, the comparative power study still determines their relative performance. (3) The "weak" case proposes sequences with weakly planted similarity signals; we used $\lambda = 0.02$ to construct this case. Not surprisingly, in many experiments regions with stronger $k$-mer count similarity than the planted one were identifiable, and therefore the value of $s_D$ is close to zero for most of the experiments. We let $w = 1000$ and $k = 5$ for all experiments in this simulation study.

The results for these scenarios are given in Figures 3.1-3.3.

Figure 3.1: Performance evaluation of different statistics for the weak case given by $\lambda = 0.02$.
Figure 3.2: Performance evaluation of different statistics for the moderate case, where $\lambda = 0.05$.
Figure 3.3: Performance evaluation of different statistics for the strong case, represented by $\lambda = 0.1$.

Based on the results, local $D_2^*$ achieves the best performance (the largest $s_D$) in discriminating the regions containing planted motifs from the background sequence. Furthermore, $D_2^*$ has the minimum variance among these three statistics, meaning that it is the most robust and stable in local identification of sequence relatedness.

3.4.3 Performance of the approximation algorithm

We evaluated $s_D$ for the approximation algorithm while $\lambda$ was adjusted from 0 to 0.1 in increments of 0.01. Each experiment was repeated 20 times and the results were compared to the naive algorithm. Interestingly, in almost all of the experiments the performance of the approximation algorithm is identical to that of the naive algorithm, demonstrating the accuracy of our method. The results, shown in Figure 3.4, are all based on the parameter combination $(n, w, k) = (100\text{k}, 1\text{k}, 5)$.

Figure 3.4: Comparative performance evaluation of $D_2^*$ for the naive and the approximation algorithms for $(n, w, k) = (100\text{k}, 1\text{k}, 5)$.

Evaluating the speed of the algorithm: we expect to observe quadratic behavior for the naive algorithm and a sub-quadratic running time for the approximation algorithm. Figure 3.5A shows results for both algorithms when no motif is planted and both sequences are i.i.d. In the broad range of input sequence lengths from 100 kbp to 500 kbp, these results match our expectations based on the analysis presented in Section 3.3. According to the theory, if the number of shared $k$-mers is increased, then the minimum bichromatic angle, and consequently the number of required iterations for BCP, will decrease. We compared the running time of our algorithm in the three cases $\lambda \in \{0, 0.05, 0.1\}$. The results, presented in Figure 3.5B, demonstrate that these heuristics lead to substantial improvements in running time as the local similarity increases. Interestingly, the relative error for all experiments carried out here remained below three percent, while the performance ratio of SLF was set to $\alpha = 1.05$.

Figure 3.5: (A) Wall-time comparison between the naive and the approximation algorithms for i.i.d. sequences. (B) The same analysis under the common motif model.

Overall, the running time and the accuracy of the approximation algorithm in identifying the regions with the maximum $D_2^*$ make it applicable for solving many real biological problems efficiently. For example, our algorithm enables comparisons of large orthologous intergenic regions to identify locally similar intervals of several hundred to a few thousand bases, which are candidate enhancer regions bearing similar sets of transcription factor binding sites.

3.4.4 Empirical evaluation of bucket occupancies

We showed that the maximum bucket occupancy is $O(\log n)$ when $n$ points uniformly distributed on the surface of the $d$-ball are hashed. The uniformity assumption ignores the dependency between overlapping $k$-mers in i.i.d. induced points. Here we use simulation to show that this upper bound is reasonable in practice. The experimental setup is as follows: for specific $w$ and $k$, we generate $n$ independent i.i.d. sequences, each having $w$ letters. The norms of all $k$-mer count vectors constructed from these sequences are constrained between a specific $x$ and $\sqrt{\alpha}\,x$, for some randomly selected value of $x$ and with $\alpha > 1$ as the performance ratio; in effect, we generate a point set that is projected to a specific hypersphere of radius $x$ according to the SLF. Then, for $v = \log_2 n$, we hash the point set using $v$ randomly generated vectors as described previously. We examined both the maximum and the average bucket occupancies resulting from $10^4$ iterations of this experiment. The results shown in Figure 3.6 are for $n = 10^5$, $w = 1000$ and $k = 4$.
Figure 3.6: Maximum and average bucket occupancy for $w = 1000$ and $k = 4$.

In this figure, the logarithmic curve with equation $y = 2r$ (the green curve) is depicted as an upper bound, showing that across all iterations the maximum bucket occupancy remains $O(\log n)$. Another interesting trend is seen in the average occupancy of the buckets, which is nearly a perfect line with unit slope, suggesting that on average very small bichromatic subproblems are created after hashing the point sets. We repeated this experiment using different values of $w$ and $k$ (results not shown). In general, we concluded that when the $k$-mer count vector is not too sparse (i.e., $w/d$ is not too small), the maximum bucket occupancy remains $O(\log n)$.

3.5 Final remarks

Alignment-free sequence comparison is motivated mostly by its ability to accelerate similarity searching. It can also detect biological signals that evade alignment-based methods, for example analogously functioning regulatory regions whose similarity is based on convergent evolution, with individual sites whose order, orientation and multiplicity allow flexibility. Statistical measures to describe similarity without alignment have received attention, but there has been little attention to the development of algorithms for rapid alignment-free comparison. Here, focusing on a local variant of the alignment-free sequence comparison problem, we introduce an algorithmic framework that substantially accelerates the sequence comparisons, while providing a means of achieving a balance between accuracy and speed. We designed this framework in the context of the $D_2$ and $D_2^*$ statistics, but we remark that our framework is equally applicable to other measures based on dot-product similarity (Göke et al., 2012). The essence of our framework is a transformation that maps the original string problem to a geometric problem, and then decomposes dot-product similarity measures by separating the influence of vector angles and norms. This framework, which we call SLF, converts a non-metric similarity measure to the Euclidean distance, which is accompanied by the triangle inequality. This transformation increases time complexity by a logarithmic factor. Following this transformation, the triangle inequality allows inherent locality in the data to be leveraged. Besides allowing us to draw from a large body of existing algorithmic results for searching in metric spaces, the SLF has another major advantage for local alignment-free sequence comparison: it permits heuristics in the flavor of branch-and-bound to be implemented at several stages. Such heuristics for pruning the search space are not apparent as extensions of the naive algorithm, which performs all pairwise comparisons between windows in the two sequences. Our empirical results show that these heuristics lead to substantial speed-ups even when the similarity in the sequences is weak.
In order to solve local alignment-free sequence comparison, the SLF must be coupled with a procedure to solve the BCP problem instances that result from the transformation. We described a simple method based on random hashing, and showed favorable performance both in theory and in practice. However, any approach for solving BCP could be used, and in some situations other BCP algorithms may be more appropriate. Under the null hypothesis, non-overlapping windows of sequences are transformed into a set of points that may satisfy certain sparsity constraints (e.g., Preparata & Shamos (1985)), and we believe such sparsity could form the basis for alternative efficient algorithms to find the minimum bichromatic angle.

Our approach has certain limitations, and in our analyses we made specific assumptions that are not always met in practice. First, we assumed generally that our sequence background is i.i.d. with equiprobable letters. While neither of these simplifying assumptions holds for biological sequences, both have been used effectively in the past (Lippert et al., 2002; Reinert et al., 2009) and are reasonable approximations. Second, we ignored the overlap between $k$-mers in the analysis of our algorithm. Even for i.i.d. sequences, and especially for sparse $k$-mer count vectors, the dependency between overlapping $k$-mers means the spatial distribution of hashed points will not be uniform in the domain, and therefore bucket occupancy may not be balanced. Our upper bound on the maximum bucket occupancy after hashing assumed independence between different coordinates of the vectors. Previous studies (Reinert et al., 2009) have shown that the covariance matrix describing the effects of these $k$-mer overlap dependencies does not disappear even for the asymptotic distribution of $D_2^*$. Fortunately, our empirical results indicate that this specific issue has little impact on the efficiency of our method for the values of $k$ we tested (see Figure 3.6). Finally, we note that without too much difficulty our analysis for $D_2^*$ can be extended to sequences generated by a homogeneous Markov chain, as the asymptotic distribution of $D_2^*$ has already been generalized for this case (Reinert et al., 2009).

Chapter 4
The Amordad database engine for metagenomics

We introduce Amordad, a content-based database engine for metagenomic studies. The fundamental task is to develop a comparative framework that can rapidly identify metagenomes in a database that are similar to a given sample (i.e., the query). Two main objectives are pursued in the design of Amordad: (1) the ability to handle a large number of metagenomic samples in the database, and (2) the ability to cope with the fast pace of data accumulation through an efficient maintenance system. Similar to what was proposed in the previous chapter, we rely on alignment-free principles to create a feature vector concisely representing the content of each metagenome. The indexing scheme of Amordad takes a set of such feature vectors and constructs a randomized spatial data structure that is a combination of random hashing and a regular nearest neighbor graph. We discuss different aspects of our indexing structure; in particular, we describe how this combination enables us to achieve the fundamental design objectives. A variety of experiments have been conducted to evaluate the performance of Amordad. We show that under our framework, the fuzzy problem of similarity search in metagenomic data finds an explicit geometric interpretation, allowing us to leverage efficient algorithms to solve it.
4.1 Engine structure

We first describe the geometric representation we use for raw metagenomic sequencing data. Then we introduce the two indexing strategies: a randomized strategy based on locality sensitive hashing, and the regular nearest neighbor graph. Next, we show how these two indexing strategies are integrated. Finally, we give a heuristic model to express the influence of the nearest neighbor graph on query accuracy, using a formula similar to those commonly used in the context of LSH alone. The notation and terminology used here are very similar to those of Chapter 3. Data is a collection of strings over a fixed alphabet $\mathcal{A}$, each representing a raw read from a next generation sequencing machine.

4.1.1 Geometric interpretation of metagenomes

A metagenome is the set of genomes for a population of organisms existing in some defined space at a defined time. This is commonly a complex mixture of microorganisms interacting to form a microbial community, with behaviors related to the relative frequencies of different microbes in the community. Our window into this population is by sampling parts of these genomes via sequencing. Following a WMS experiment, the totality of information resides in the unprocessed sequenced reads from a given sample (e.g., in FASTQ files). Our goal is to extract information from this data while avoiding any kind of sequence alignment, assembly (Luo et al., 2012) or assignment of OTUs (Porter & Beiko, 2013). We instead use representations from alignment-free sequence comparison. Specifically, the set of sequenced reads is transformed into a $k$-mer count vector. This is a $d = 4^k$-dimensional vector, where coordinate $i$ gives the frequency of occurrence in the metagenome of the $i$-th $k$-mer. We remark that in general these need not be of uniform length, nor do they need to be contiguous words. The key insight to applying alignment-free methods in metagenomics is that each bacterium contributes a distinct blueprint when represented by a $k$-mer count vector, while similar bacterial sequences should give vectors that are closer in this $d$-dimensional space. Note that this also depends on the measure used for comparing these $k$-mer count vectors.

Assume metagenome $\mathcal{M}$, represented as a set of reads, contains $N$ possible $k$-mers, which includes all sliding windows of length $k$ in all sequenced reads. We centralize the count of each $k$-mer to reduce the effect of background noise, as noted in the previous sections. If each read were a random sequence, letting $p_a$ denote the probability of drawing letter $a \in \mathcal{A}$ when generating $\mathcal{M}$, then for any $z = z_1 z_2 \cdots z_k \in \mathcal{A}^k$,
$$p_z = \prod_{j=1}^{k} p_{z_j},$$
so $N p_z$ is the expected number of occurrences of $z$ in $\mathcal{M}$. The centralized $k$-mer count vector is
$$\mathcal{M}(z) = X_z - N p_z,$$
where $X_z$ is the total number of occurrences of $z$ in $\mathcal{M}$. This simple representation accomplishes the transformation from the set of reads into a vector in $d = 4^k$ dimensional space. We treat these vectors as points, and hereafter, when we refer to a metagenome, we assume it is represented as a point in this space. In addition to expressing biologically meaningful relationships, appropriate distance measures provide a statistically meaningful foundation for alignment-free sequence comparison.
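As a concrete illustration (our own sketch, not the Amordad source), the centralized count vector can be computed in one pass over the reads; here the letter probabilities $p_a$ are estimated from the reads themselves, which is one reasonable choice among several.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Centralized k-mer count vector M(z) = X_z - N * p_z for a set of reads,
// with p_z the product of single-letter probabilities (i.i.d. background).
static std::vector<double>
centralized_counts(const std::vector<std::string> &reads, std::size_t k) {
  const std::size_t d = std::size_t(1) << (2 * k);  // 4^k
  auto code = [](char c) -> std::size_t {
    return c == 'C' ? 1 : c == 'G' ? 2 : c == 'T' ? 3 : 0;
  };
  // Count k-mer occurrences X_z and, simultaneously, letter frequencies.
  std::array<double, 4> base_count{0, 0, 0, 0};
  std::vector<double> v(d, 0.0);
  double N = 0.0;  // total number of k-mer positions over all reads
  for (const auto &r : reads) {
    std::size_t idx = 0;
    for (std::size_t j = 0; j < r.size(); ++j) {
      base_count[code(r[j])] += 1.0;
      idx = ((idx << 2) | code(r[j])) & (d - 1);
      if (j + 1 >= k) { v[idx] += 1.0; N += 1.0; }
    }
  }
  const double total =
      base_count[0] + base_count[1] + base_count[2] + base_count[3];
  std::array<double, 4> p;
  for (int a = 0; a < 4; ++a) p[a] = base_count[a] / total;
  // Subtract the expected count N * p_z for every k-mer z (base-4 encoded).
  for (std::size_t z = 0; z < d; ++z) {
    double pz = 1.0;
    for (std::size_t j = 0, t = z; j < k; ++j, t >>= 2) pz *= p[t & 3];
    v[z] -= N * pz;
  }
  return v;
}
```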
Metric distances are especially appealing, as they allow algorithms to take advantage of the triangle inequality, which may allow some distance relationships to be implicitly inferred. Here we use the angle between two metagenomes as our distance measure. Dot products are computed as usual, the norm of a point $x$ is $\|x\| = \sqrt{x \cdot x}$, and the angle between points $x$ and $y$ is $\theta_{xy} = \arccos\!\left(x \cdot y / (\|x\| \|y\|)\right)$. We remark that other dot-product-based similarity measures, including $D_2$ and its adaptation for next generation sequencing data, $d_2$, could be incorporated in Amordad using the SLF introduced in Chapter 3.

4.1.2 Indexing metagenomic points with locality-sensitive hashing for angles

Locality sensitive hashing is among the best known indexing methods for high dimensional data (Gionis et al., 1999; Lv et al., 2007), and provides a framework for understanding and analyzing a large class of randomized dimension-reduction techniques. We use an LSH family for which the measure of proximity is the angle between points, as introduced in Section 2.5. Let $U = \{u_1, \ldots, u_r\}$ be a set of random points on the surface of the unit $d$-dimensional ball. Equivalently, define
$$S^d = \{x \in \mathbb{R}^d \mid \|x\| = 1\},$$
and assume $u_i \in S^d$ for $1 \le i \le r$. Construct a mapping
$$H_U(x) = \left(h_{u_1}(x), \ldots, h_{u_r}(x)\right), \tag{4.1}$$
where $h_{u_i}$ belongs to the family of hash functions given by Equation (2.5). Equation (2.6) suggests that under the $r$-bit binary embedding induced by $H_U$, the "locality" of points is likely to be preserved. In other words, the closer two points are in $\mathbb{R}^d$, the more likely they are to obtain the same mapping in the Hamming space that is the codomain of $H_U$.

To achieve our desired accuracy during the query process, we index points in multiple iterations, with each iteration generating a new hash table. Therefore, the first step in our indexing procedure can be summarized as follows (the random direction step is sketched in code after this list):

- Generate $U_i$ as a set of $r$ random points on $S^d$ (using an established method (Muller, 1959)).
- For each metagenomic point $x$, find $H_{U_i}(x)$ and store the result in a hash table $T_i$.
- Repeat the above two steps $L$ times, producing the set $\mathcal{T} = \{T_1, \ldots, T_L\}$ of hash tables.
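A minimal sketch of the random-direction step (our own illustration of Muller's method): drawing $d$ independent standard normals and normalizing yields a point uniformly distributed on $S^d$.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Muller (1959): a vector of i.i.d. standard normals, normalized to unit
// length, is uniformly distributed on the surface of the unit d-ball.
static std::vector<double> random_unit_vector(std::size_t d,
                                              std::mt19937 &gen) {
  std::normal_distribution<double> gauss(0.0, 1.0);
  std::vector<double> u(d);
  double norm2 = 0.0;
  for (std::size_t i = 0; i < d; ++i) {
    u[i] = gauss(gen);
    norm2 += u[i] * u[i];
  }
  const double inv = 1.0 / std::sqrt(norm2);
  for (double &x : u) x *= inv;
  return u;
}
```

Calling this function $r$ times yields one direction set $U_i$; hashing every database point against $U_i$, as in Equation (3.2), then fills one table $T_i$.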
Although LSH provides a simple indexing scheme and guarantees sub-linear query complexity under certain conditions (Gionis et al., 1999), its use in large databases has been hindered because a large number of hash tables can be required to overcome its approximate nature, which is critical for applications that must guarantee query accuracy. Our starting point in addressing this problem in the metagenomics context was to ask how we could keep some of the useful information from each hash table, without explicitly using all the hash tables or explicitly storing them. The solution was to augment the indexing structure of Amordad by incorporating a proximity graph. This graph encodes important distance relationships that have accumulated through the application of many distinct hash functions, allowing all but a small number of hash tables to be discarded. We show how this graph is constructed and later examine its use in the query process.

4.1.3 Supplemental indexing with a developing nearest neighbor graph

The family of hash functions defined in Equation (2.5) corresponds to hyperplanes in $\mathbb{R}^d$, each partitioning the space of metagenomes. Consider the geometric locus of all points that are hashed to the same bucket as the query point. This "query region" is the intersection of $r$ half-spaces, defined by $r$ hyperplanes, that contain the query point. The query region is likely to contain the actual nearest neighbor point, but even if it does not, it is reasonable to expect that this point remains close to the query region. We grow this region by searching the neighborhood of all points inside it. Amordad encodes spatial information about the proximities of data points using a graph, and employs the graph when processing queries to expand the search region. Specifically, for a set of $n$ metagenomic points in the database, we construct an $m$-regular nearest neighbor graph ($m$-NNG). We refer the reader to Section 2.4 for a brief introduction to nearest neighbor graphs, along with their applications in different areas. Figure 4.1 provides a schematic depicting how this graph is used to expand the query region.

Figure 4.1: The query region and the incorporation of the graph for augmenting the nearest neighbor search process. The graph, partially depicted here, is a 2-NNG.

To understand how we efficiently construct the $m$-NNG, recall that LSH provides a probabilistic measure of the relative locality of the points. When two points reside in a common hash bucket, it is more likely that they should be connected in the $m$-NNG. This observation suggests an approximate algorithm to construct the $m$-NNG: given a point $x$, for each hash table $T_i$, (i) examine all points $y$ in the same bucket as $x$, (ii) keep only the $m$ closest points to $x$, and (iii) connect the corresponding nodes in the $m$-NNG. Since the information from LSH is probabilistic, the probability of misidentifying an edge never reaches zero. However, the theory of LSH guarantees that this "approximate" graph converges to the actual $m$-NNG as we increase the number of hash functions/tables employed in this process. We present results on the rate of convergence of the $m$-NNG below.

There is a major problem with this process as just described: the graph encodes no more information than already exists in the hash tables, which are also used in the query process. This redundancy, however, allows us to "retire" hash functions, replacing them with new ones. This continually refreshes the randomness in Amordad, while retaining the useful information from retired hash tables, since it is encoded in the $m$-NNG. In Amordad, as each hash table is generated, it is placed in a queue (i.e., first-in, first-out) of size $L$. At the same time, the oldest hash table in the queue is removed; we elaborate on this after defining the basic insertion and deletion operations for Amordad.

The nature of any useful large-scale metagenomic database is dynamic: users must be able to add points over time, and possibly also remove them. These basic operations are handled efficiently in Amordad as follows:

- insert($x$) computes $H_{U_i}(x)$ for $1 \le i \le L$ and modifies the $i$-th hash table by adding $x$ to the bucket corresponding to each $H_{U_i}(x)$. It also inserts a vertex into the graph and associates it with $x$.
- delete($x$) computes the bucket numbers $H_{U_i}(x)$ for $1 \le i \le L$ and removes $x$ from the appropriate buckets in all hash tables in the queue. Removal of $x$ from the graph follows a lazy deletion strategy (Jannink, 1995).

This lazy deletion is essential to the efficiency of maintaining the graph because, although each node has a fixed out-degree, the in-degree is not constrained. Nodes keep a reference count of their in-degree; upon deletion, a node is marked and, although never reported, each subsequent encounter during query or insertion operations removes one incoming edge and decrements the count. The node itself is deleted when the count reaches zero.
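A minimal sketch of the bookkeeping behind this lazy deletion (our own illustration; Amordad's actual graph representation, built on the Boost Graph Library, differs):

```cpp
#include <cstddef>
#include <vector>

// One vertex of the m-NNG with a tombstone flag and an in-degree count.
struct Node {
  std::vector<std::size_t> out;  // indices of the m approximate neighbors
  std::size_t in_degree = 0;     // number of nodes pointing here
  bool deleted = false;          // tombstone set by delete(x)
};

// Called whenever a traversal lands on node v via an incoming edge (the
// caller also drops v from its own out-list): tombstoned nodes are skipped,
// one incoming edge is accounted for, and the vertex's storage is reclaimed
// once nothing points to it any more.
static bool visit(std::vector<Node> &g, std::size_t v) {
  if (!g[v].deleted) return true;  // live node: report it
  if (g[v].in_degree > 0) --g[v].in_degree;
  if (g[v].in_degree == 0) g[v].out.clear();  // reclaim storage
  return false;                    // never report a deleted node
}
```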
Nodes have a reference count for in-degree, and upon deletion nodes are marked, and although never reported, each subsequent encounter during query or insertion operations remove an incoming edge and decrements the count; the node itself is deleted when the count reaches zero. The insertion and deletion operations are both efficient: the complexity of each opera- tion is dominated by evaluation ofL hash functions. Once the graph has converged, for each point, its m out-going edges indicate its m nearest neighbors. A problem arises because the insertion and deletion operations eliminate this property. A newly inserted point might be among them nearest neighbors of any existing point, but to check this requires processing each metagenome in the database. Similarly, when a metagenome is deleted, some metagenomes will effectively have out-degree ofm 1. The solution is continually applied maintenance operations, which one may imagine as occurring according to a regular schedule, such as once per day. The maintenance consists ofl iterations of the following procedure: A random hash function is generated and used to create a new hash tableT L+1 . T 1 is examined to update edges in the nearest neighbor graph. T L+1 is added to the queue, andT 1 is removed. At any time during this process, queries have access to L hash tables, and T 1 is not discarded until the edges indicated by bucket co-occupancy in T 1 have been added to 55 Database points Hash table ( ) Hash function( ) ... ... Queue of length L Insertion/Deletion m-NNG ... ... Time Figure 4.2: This schematic shows how the basic operations interact with the different components of the database. The continual procedure for updating the graph is also represented. the graph. The schematic representing this process is depicted in Figure 4.2. In our current implementation, we setL = 200 andl = 50. Performing a query, or similarity search, in Amordad occurs in two steps: 1. For query pointy and 1iL, computeH U i (y) and check the distance between y and all other points hashing to same bucket numberH U i (y) inT i . 2. For anyx in bucketH U i (y), evaluate them neighbors ofx in the graph, and check their distance toy. This procedure is summarized in Algorithm 2, whereN (x) refers to the set of neighbor points ofx in the graph and the function max_element(b) gives the furthest point inb to the query point. 56 Algorithm 2 Similarity search in Amordad Input: SetS ofn points, queue of sizeL,m-NNG, queryy and parametert Output: t nearest neighbors ofy inS 1: initializeb; max 1 2: fori = 1 toL do 3: computeH U i (y) 4: for allx2H U i (y) do 5: for allx 0 2N (x)[fxg do 6: if x 0 y < max then 7: b max max_element(b) 8: b b[fx 0 gnfb max g 9: update max 10: end if 11: end for 12: end for 13: end for 14: return b 4.1.4 A heuristic to model the effect of the graph We are mainly interested in answering the following question: how the time and memory complexities of the query are affected by the integration of the graph to the basic LSH algorithm for the nearest neighbor search? We do not repeat query analysis of the indexing based on the LSH and refer the interested reader to (Gionis et al., 1999) for details. For simplicity here, we consider just the single nearest neighbor (i.e. t = 1) of any given queryy and denotex2S as the closest point toy with the distance xy . First, notice the number of distance computations when using the nearest neighbor graph in the search process is at most increased by a factor of m. 
Performing a query, or similarity search, in Amordad occurs in two steps:

1. For the query point $y$ and $1 \le i \le L$, compute $H_{U_i}(y)$ and check the distance between $y$ and all other points hashing to the same bucket number $H_{U_i}(y)$ in $T_i$.
2. For any $x$ in bucket $H_{U_i}(y)$, evaluate the $m$ neighbors of $x$ in the graph, and check their distance to $y$.

This procedure is summarized in Algorithm 2, where $\mathcal{N}(x)$ refers to the set of neighbor points of $x$ in the graph and the function max_element($b$) gives the point in $b$ furthest from the query point.

Algorithm 2 Similarity search in Amordad
Input: Set $S$ of $n$ points, queue of size $L$, $m$-NNG, query $y$ and parameter $t$
Output: $t$ nearest neighbors of $y$ in $S$
1: initialize $b$; $\theta_{\max} \leftarrow \infty$
2: for $i = 1$ to $L$ do
3:   compute $H_{U_i}(y)$
4:   for all $x$ in bucket $H_{U_i}(y)$ do
5:     for all $x' \in \mathcal{N}(x) \cup \{x\}$ do
6:       if $\theta_{x'y} < \theta_{\max}$ then
7:         $b_{\max} \leftarrow$ max_element($b$)
8:         $b \leftarrow b \cup \{x'\} \setminus \{b_{\max}\}$
9:         update $\theta_{\max}$
10:       end if
11:     end for
12:   end for
13: end for
14: return $b$

4.1.4 A heuristic to model the effect of the graph

We are mainly interested in answering the following question: how are the time and memory complexities of the query affected by the integration of the graph into the basic LSH algorithm for nearest neighbor search? We do not repeat the query analysis of indexing based on LSH, and refer the interested reader to Gionis et al. (1999) for details. For simplicity, here we consider just the single nearest neighbor (i.e., $t = 1$) of any given query $y$, and denote by $x \in S$ the closest point to $y$, at distance $\theta_{xy}$.

First, notice that the number of distance computations when using the nearest neighbor graph in the search process is increased by at most a factor of $m$: in the worst case, each time we compute one distance, we do so for all neighbors of the point as well. Distance computations are the inner-most operation in Algorithm 2 and hence dominate the query time. Considering that $m$ is a small, fixed number in an instance of the database (e.g., $m = 10$), we conclude that the effect of the graph on the query time complexity is asymptotically negligible.

Next, we study the space requirement of Amordad. Equation (2.6) states that the probability of co-occurrence of $x$ and $y$ in a bucket, in any of the $L$ hash tables, is $p_{\mathrm{co}} = p_0^r$, where $p_0 = 1 - \theta_{xy}/\pi$. Let $\delta$ be the error bound on identifying $x$; then
$$(1 - p_0^r)^L \;\le\; \delta, \tag{4.2}$$
which provides a lower bound on $L$, the size of the queue. As mentioned earlier, the purpose of the graph is to effectively expand the query region, increasing the probability of success to $p_{\mathrm{co}} = \beta\, p_0^r$. We call $\beta$ the expansion coefficient after applying the graph, and revise the above inequality:
$$(1 - \beta\, p_0^r)^L \;\le\; \delta,$$
or equivalently $L \ge \log\delta / \log(1 - \beta\, p_0^r)$.

Proposition 5. For the lower bound on $L$ provided above, if $r \to \infty$ we have
$$(1 - p_0^r)^{\beta L} \,/\, (1 - \beta\, p_0^r)^{L} \;\to\; 1.$$

Proof. Taking the logarithm of the above fraction and substituting $L$ by $\log\delta / \log(1 - \beta\, p_0^r)$ reduces the problem to proving that, as $r \to \infty$, we have $\beta \log(1 - p_0^r) / \log(1 - \beta\, p_0^r) \to 1$. The correctness of this limit is easily verified by applying l'Hôpital's rule. □

For a database with $n$ points, we always let $r = \log_2 n$. Therefore, this proposition means that for a large-scale database, our indexing algorithm achieves the same error bound as if we used LSH alone, but with the number of hash tables reduced by a factor of $\beta$. We remark that $\beta$ is data dependent and should be re-evaluated (possibly at database maintenance time) if there is a significant change in the data points; however, as we show in the next section, it remains significantly greater than one for a wide range of search parameters. Note that the space required by any hash table is $\Theta(n)$ bytes. Therefore, with an efficient implementation of the graph, the space required by the indexing structure of Amordad is $\Theta(nL/\beta + mn)$ bytes. Considering that $L$ can easily reach into the hundreds to achieve reasonable accuracy, the effect of the graph is quite significant if $\beta$ is sufficiently larger than one.
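An illustrative calculation with numbers of our own choosing: for $n = 10^6$ points ($r = \log_2 n = 20$), a nearest neighbor at $\theta_{xy} = \pi/4$ (so $p_0 = 3/4$ and $p_0^r \approx 3.2 \times 10^{-3}$) and error bound $\delta = 0.05$, Equation (4.2) alone requires
$$L \;\ge\; \frac{\ln \delta}{\ln(1 - p_0^r)} \;\approx\; 943$$
hash tables, while a hypothetical expansion coefficient of $\beta = 4$ cuts this to $L \ge \ln\delta / \ln(1 - \beta p_0^r) \approx 235 \approx 943/4$, consistent with Proposition 5.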
4.2 Performance evaluations

4.2.1 Dataset and implementation details

We have two objectives in studying the performance of Amordad: (1) investigating its potential for identifying metagenomic associations in large-scale applications, and (2) demonstrating the favorable technical aspects of our database engine, with particular emphasis on scalability. We use two evaluation platforms. For the first platform, we create a relatively small database using currently available metagenomic data sets, and we use this to study the interaction between the LSH and NNG components of Amordad. We also examine how the graph serves as a metagenomic knowledge base; in particular, we investigate the quality of biological discoveries stemming from mining this graph. The second platform mixes synthetic with actual data to highlight the computational efficiency of the query process in a highly populated database. We measure both the space and time requirements of queries using a combination of real and synthetic metagenomic data.

We obtained WMS data from the European Nucleotide Archive (ENA) (Leinonen et al., 2011), including approximately 3.5 TB of compressed FASTQ files from 133 projects (08/15/13). We extracted 5-mers from each sample and obtained 5073 real metagenomic count vectors in $d = 1024$ dimensional space. We augmented this set with an additional $10^6$ randomly generated metagenomes to increase the scale of our evaluations. Each simulated metagenome contained $N = 10^7$ i.i.d. reads of length 100. Upon initiation of a query, the graph and the queue are read into memory (these could be memory resident on a dedicated server). Due to space constraints we do not load all count vectors; instead, a map from vector ids to their corresponding addresses on disk is read into RAM. For comparison purposes in evaluating accuracy, we implemented the brute-force (quadratic time) nearest neighbor graph construction and the linear-time scan for exact query results.

The implementation of the database engine and associated utilities was done in C++ with the assistance of the Boost Graph Library (http://www.boost.org/libs/graph). Our Amordad implementation includes multi-threaded query processing; however, all evaluations described here were done using a single core of a 2.4 GHz Xeon CPU and 16 GB of memory. Building Amordad is a multi-stage process, involving the construction of the hash tables and the nearest neighbor graph. The overall time to build the database with $10^6$ count vectors was 44.9 hours. We also investigated the memory requirement for building the database. Initial construction of the nearest neighbor graph required the most memory among all of the programs included in the Amordad software package: when examining $L = 200$ hash tables to construct a 20-NNG representing $10^6$ metagenomes, the peak memory was 10.4 GB.

4.2.2 Rate of convergence for the nearest neighbor graph

First we studied how accurately the underlying graph, constructed incrementally and probabilistically during database construction, reflects the true $m$-NNG. We varied $m \in \{1, 5, 10, 15, 20\}$ and used the 5073 metagenomes from the 133 projects. At each iteration, corresponding to the use of an additional hash function, the fidelity of the graph was assessed by computing the number of edges by which it differs from the ideal $m$-NNG, which we constructed for each $m$ by brute force.

Figure 4.3: Fidelity of the nearest neighbor graph as a function of LSH iterations.

As indicated in Figure 4.3, for all values of $m$ there is an exponential decrease in the number of misidentified edges, indicating that the construction via LSH converges rapidly. Also evident from Figure 4.3, as $m$ increases, the construction of the $m$-NNG requires more iterations, reflecting the fact that the $m$-th nearest neighbor of each vertex becomes more distant for larger $m$. What matters most for queries, however, is not the fraction of edges that are accurate, but the usefulness of the edges, and the less accurately identified edges tend to be those connecting more distant neighbors.

4.2.3 Assignment of significance scores to query responses

Assigning significance to query results is essential, but there should be no expectation that actual distances between metagenomes (for any reasonable representation) follow a simple statistical distribution. To gain insight into how these distances can be interpreted in the context of Amordad queries, we examined the distribution of distances for a relatively small sample of available metagenomes. Figure 4.4 shows the empirical distribution of distances between the 5073 metagenomes from the 133 projects.

Figure 4.4: The probability density function of angles resulting from real metagenomes.

Denoting by $F$ the cumulative function of this distribution, for any query point $y$ with answer $x$, we define $F(\theta_{xy})$ as the significance score of the query answer. This score shows the relative ranking of the distance between $x$ and $y$ among all of the distance measurements. Note that if $F(\theta_{xy})$ is large, the relatedness between $x$ and $y$ might be questionable. In fact, if $x$ and $y$ are two unrelated metagenomes, the centralization of the count vectors and the linearity of expectation assert that $E(\theta_{xy}) = \pi/2$. This important observation suggests that any meaningful $\theta_{xy}$ must belong to the lower tail of the empirical distribution. We therefore define the maximum proximity radius as the largest distance between the query point and a retrieved data point such that the response is still considered relevant to the query. We denote by $\theta_0$ the maximum proximity radius, and avoid reporting any answer $x$ to the query point $y$ if $\theta_{xy} > \theta_0$. In our evaluation, a liberal threshold is $F^{-1}(0.2)$, which results in $\theta_0 = 0.987$ radians, or approximately $57°$. A more conservative cut-off is $F^{-1}(0.1)$, which is equivalent to $\theta_0 = 0.752$ radians, or approximately $43°$. In other words, the conservative and liberal cut-off thresholds are selected as the first and second deciles, respectively, of the empirical cumulative distribution of distances within the database.
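A minimal sketch of this scoring (our own illustration): the empirical CDF is simply the rank of an observed angle within a sorted sample of database distances, and the proximity radius is a quantile of the same sample.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Empirical-CDF significance score: the fraction of sampled database angles
// that are <= the query-answer angle. 'angles' must be sorted ascending.
static double significance_score(const std::vector<double> &angles,
                                 double theta_xy) {
  auto it = std::upper_bound(angles.begin(), angles.end(), theta_xy);
  return double(it - angles.begin()) / double(angles.size());
}

// The maximum proximity radius theta_0 = F^{-1}(q) is the q-th quantile of
// the same sorted sample, e.g. q = 0.1 (conservative) or 0.2 (liberal).
static double proximity_radius(const std::vector<double> &angles, double q) {
  const std::size_t idx = std::size_t(q * double(angles.size() - 1));
  return angles[idx];
}
```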
4.2.4 Evaluation of the graph effect on the query space requirement

We can view the $m$-NNG as having two effects: increasing accuracy when using a fixed number of hash tables, or reducing the number of hash tables required to achieve a particular accuracy. We conducted a series of experiments to compare the performance of various instances of Amordad with the basic LSH algorithm when both use the same sets of hash tables. For LSH alone, we exclude the graph. The evaluation metric used here is the average recall on a set of queries. Given a query $y$, denoting its $t$ nearest neighbors $N_t(y)$, if $R(y)$ is the set of points reported by the algorithm, then
$$\mathrm{recall}(y) = \frac{|N_t(y) \cap R(y)|}{|N_t(y)|}.$$

The detailed experimental setup is as follows. We randomly selected 10 out of the 133 projects and regarded their 94 samples as the query set. Different instances of Amordad were then constructed, considering all other samples as database points, with $m$-NNGs for $m \in \{1, 10, 20\}$. Next, using $1 \le L \le 200$ hash tables, we compared the average recall of the different instances of Amordad with the basic LSH algorithm. For simplicity, we only considered the first nearest neighbor (i.e., $t = 1$) and performed experiments separately for both the conservative and the liberal maximum proximity radius. If the distance between any query point and its nearest neighbor exceeded the maximum proximity radius in an experiment, that query was discarded from the query set. Since hash functions are randomly generated, we repeated each experiment 10 times with different sets of hash functions and report the average recall.
Figure 4.5: The average recall for the query set as a function of the number of hash tables. The left and right plots correspond to the conservative and liberal maximum proximity radius, respectively.

Based on the results presented in Figure 4.5, Amordad reaches the same accuracy as LSH while requiring considerably fewer hash tables. When the graph is constructed for $m = 10$ or $m = 20$, there is no query error in any experiment. However, LSH is not able to reach perfect recall even when $L = 200$ hash tables are employed under the liberal maximum proximity radius.

4.2.5 Measuring query time in a large-scale database

We conducted a set of experiments to investigate the scalability of the Amordad engine. We first divided the metagenomes from the 133 projects extracted from public databases (described above). Similar to the previous experiment, we randomly selected 10 projects and used their metagenomes (a total of 183) for queries. All $n_d = 4890$ remaining metagenomic points were used, along with $n_s$ simulated points, to construct the different instances of the database. We set the value of $n_s$ such that $n_s + n_d = 2^i$ for $i$ increasing from 16 to 20. Then we created an instance of the database with $n$ points as follows. We set the number of hash functions to $r = \log_2 n$. We used the conservative threshold $\theta_0 = 0.752$ to determine the number of hash tables required by LSH to bound the probability of missing a nearest neighbor below $\delta = 0.05$. Finally, we constructed the graph with $m = 10$ outgoing edges from each node.

To evaluate the relation between the query time and the number of points in the database, we calculated the average wall-time and the average number of angle computations required for answering all queries.

Figure 4.6: (A) The average wall-time of a query as a function of the log of the database size. (B) The required number of distance computations per query as a function of the log of the database size.

As shown in Figure 4.6, the query time grows almost linearly with $\log n$. This establishes the scalability of Amordad, validating its ability to efficiently process queries in very large database instances. Note that in each experiment conducted here the query accuracy was 100 percent, again demonstrating the effect of the graph on reducing the probability of error in identifying the query's nearest neighbor.

4.2.6 Relationships captured in the nearest neighbor graph

Our original purpose for including the $m$-NNG in the design of Amordad was to eliminate the space that would otherwise be required to explicitly use enough hash tables to ensure the desired accuracy. But it can also be regarded as a persistent and adaptive knowledge base that links close points and therefore permits the detection of biologically similar metagenomes. It is expected that most of the edges of the graph link biological replicates. By biological replicates we mean samples taken from the same bacterial consortia but under different spatio-temporal conditions: for example, a cohort of sea water samples taken at different depths, or gut microbiota samples obtained at slightly different times. Detection of biological replicates within a cohort is an easy task under our framework and could be potentially useful. However, we claim our approach can cluster inherently similar metagenomes together even when the methodologies for obtaining, preparing and sequencing samples vary. To demonstrate this type of relationship captured within Amordad, we constructed the least informative graph (i.e., the 1-NNG) using the metagenomes from the 133 projects.
We identified every edge connecting metagenomes that belong to two distinct projects. Among the 194 such edges, 58 had a significance score below 0.01. Table 4.1 shows the top ten most similar pairs of samples (projects are listed by accession number from ENA). To avoid redundancy, any pair of projects is reported at most once in this table. See the supplementary material for the complete list with more detailed information.

Accession I  Accession II  Project scope I   Project scope II  Distance  Ranking
ERP001737    ERP001736     Sea water         Sea water         4.0       0.0016
ERP001736    ERP001227     Sea water         North sea water   4.6       0.0020
SRP009476    ERP001736     Sea water         Sea water         5.3       0.0023
ERP001068    SRP013944     Soil (diff. pH)   Forest soil       5.3       0.0023
SRP021115    ERP001736     Marine matter     Sea water         5.4       0.0023
SRP009476    SRP021115     Sea water         Marine matter     5.5       0.0024
SRP017582    ERP001568     Oil sands         Polluted water    6.5       0.0028
ERP001737    SRP021115     Sea water         Marine matter     6.6       0.0029
SRP016067    ERP002469     Gut microbiota*   Gut microbiota**  6.8       0.0030
SRP019913    SRP021115     Red sea water     Marine matter     7.3       0.0033

Table 4.1: Top 10 most similar pairs of metagenomes from different projects, with the scope of the corresponding projects, the distance between the metagenomic points (i.e., the angle in degrees) and the associated significance scores. (*) From a cohort of normal and atherosclerosis samples. (**) Metagenomes sampled from European women with normal, impaired and diabetic glucose control.

We emphasize the broad spectrum of applications that can be offered by mining this graph once the database is created from a comprehensive set of metagenomes. For example, new studies can take advantage of previous fully-analyzed studies to which they are linked in Amordad, providing a starting point for data analysis. We may also pool such samples and search the result for new assemblies of bacterial genomes, or even for the discovery of biomarkers based on phenotype-metagenotype associations identified by this method.

4.3 Final remarks

We presented Amordad, a database engine for whole-metagenome sequencing data. Amordad uses alignment-free principles, representing metagenomes as points in a high-dimensional geometric space. LSH is used to efficiently index the database points. To improve accuracy and efficiency beyond what is practical with LSH alone, Amordad augments this indexing with a regular nearest neighbor graph. The randomness in Amordad is continually refreshed by resampling new random hash functions. Although only a fixed number of hash functions is alive within Amordad at any given time, those that are extinct have contributed to optimizing connectivity in the graph, and thus continue to assist queries even though they are no longer used for hashing. Results from a series of experiments have demonstrated that this approach has a significant effect on query efficiency and accuracy.

The ability to cope with the rapid accumulation of data was a central design goal. LSH is a well-known approach to indexing high-dimensional data, but it is also a randomized method. A major drawback is the number of hash tables one must maintain in order to provide guarantees on the accuracy of queries, imposing time and space constraints. This is a well-known problem, and solutions have been proposed to overcome this pitfall (Panigrahy, 2006; Lv et al., 2007). These solutions mostly rely on the fact that each hash function (i.e., each bit in our scheme) provides a ranking based on the proximity of points to a certain query.
4.3 Final remarks

We presented Amordad, a database engine for whole-metagenome sequencing data. Amordad uses alignment-free principles, representing metagenomes as points in a high-dimensional geometric space. LSH is used to efficiently index the database points. To improve accuracy and efficiency beyond what is practical from LSH alone, Amordad augments this indexing with a regular nearest neighbor graph. The randomness in Amordad is continually refreshed by resampling new random hash functions. Although only a fixed number of hash functions is alive within Amordad at any given time, those that are extinct have contributed to optimizing connectivity in the graph, and thus continue to assist queries even if they are no longer used for hashing. Results from a series of experiments have demonstrated this approach to have a significant effect on query efficiency and accuracy.

The ability to cope with the rapid accumulation of data was a central design goal. LSH is a well known approach to index high-dimensional data, but it is also a randomized method. A major drawback is the number of hash tables one must maintain in order to provide guarantees on the accuracy of queries, imposing time and space constraints. This is a well-known problem and solutions have been proposed to overcome this pitfall (Panigrahy, 2006; Lv et al., 2007). These solutions mostly rely on the fact that each hash function (i.e. each bit in our scheme) provides a ranking based on the proximity of points to a certain query. Therefore, even if the closest neighbor is not hashed to the same bucket as the query, it is highly probable that buckets close to the query bucket contain the nearest neighbor. Consequently, one may avoid generating many hash tables at the cost of checking more buckets from each table. This clever strategy has a direct effect on the indexing memory consumption (Lv et al., 2007). However, as the distance between the query and its nearest neighbors increases, there is a rapid growth in the number of buckets that must be checked. This presents a challenge in our applications due to the inherent diversity of bacterial communities in many circumstances. In fact, if we view a set of biologically related metagenomes as a cluster in high dimensional space, the diameter of some complex clusters, like the human gut microbiota (Arumugam et al., 2011), might be large enough to make the process of searching buckets close to the query bucket time-consuming.

Our approach, augmenting the LSH indexing with a graph structure, avoids having to explicitly search additional buckets, and depends on transitivity of relationships encoded in the graph. Intuitively, if the underlying distance measure between metagenomes satisfies the triangle inequality, the graph should guide the search in the most appropriate directions. Additionally, the LSH provides a means of streamlining the database maintenance process. Hash tables are maintained in a fixed size queue that is continually updated, and each new hash table contributes to refining edges in the graph. As a consequence, we are able to keep the graph in a near ideal state without requiring explicit operations to maintain the graph after metagenomes are added to (or removed from) the database. This design decision acknowledges that in the near future large (and distributed) metagenome databases will be updated frequently, and efficiency of queries will be more important to users of Amordad. We modelled the effect of the graph by introducing the query expansion coefficient and experimentally validated the significance of the graph on the space requirements of the database.

We envision future applications, for example in medicine, that might use an engine like Amordad to ask whether the query metagenome "exists" in the database. The user would not be concerned with the criteria for responding that some database metagenome is equivalent to the query, but some meaningful criteria must be defined for each application. Amordad only requires that metagenomes be represented as points in a metric space, and we envision different applications using slightly different representations of metagenomes and also different distance measures. For each distance measure, there will be opportunities to optimize statistical aspects of the query criteria, impacting both accuracy and efficiency.

It is clear that many applications of Amordad will involve maintaining large metagenomic data clusters so that new metagenomes can be understood through their neighborhoods in the graph. From this perspective, the graph becomes the primary structure and the LSH serves simply to find the right neighborhood without exhaustive search.
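A minimal sketch of this division of labor is given below. It assumes a dist(query, point_id) callable, a neighbors map from each point id to its m out-neighbors, and a set of starting candidates drawn from the LSH buckets; this is our illustration of a greedy graph-guided search, not the exact routine in Amordad.

    import heapq

    def graph_guided_search(query, start_ids, neighbors, dist, budget=1000):
        # seed the frontier with candidates retrieved from the hash tables
        seen = set(start_ids)
        frontier = [(dist(query, u), u) for u in start_ids]
        heapq.heapify(frontier)
        best_d, best = frontier[0] if frontier else (float("inf"), None)
        examined = 0
        while frontier and examined < budget:
            d, u = heapq.heappop(frontier)
            examined += 1
            if d < best_d:
                best_d, best = d, u
            for w in neighbors[u]:  # follow the m out-edges of u
                if w not in seen:
                    seen.add(w)
                    heapq.heappush(frontier, (dist(query, w), w))
        return best, best_d

The budget caps the number of distance computations, which is the quantity plotted in Figure 4.6.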
Finally, we address some limitations of this work. We represented each metagenome by a feature vector that included the frequencies of all k-mers for a fixed k. We did not use feature selection, and were bound to relatively small values of k. Many of the k-mers were almost certainly irrelevant in our experiments on real data. More flexibility could be obtained if a metagenome is represented by a bag of features (Salton, 1991) composed of a variety of words (i.e. subsets of k-mers for varying k). From a statistical point of view, this is equivalent to shifting the paradigm from a full Markov chain of order k towards a variable length Markov chain (Bühlmann & Wyner, 1999). The main challenge is to find a parsimonious algorithm for feature extraction at large scale. More important than irrelevant features, however, are features representing technical artifacts, for example of the sequencing experiments.
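As a sketch of the bag-of-features alternative just described, the fragment below pools words of several lengths into one sparse vector; which subsets of k-mers to keep, the feature selection step, is exactly the open problem noted above.

    from collections import Counter

    def mixed_k_bag(seq, ks=(3, 4, 5)):
        # one sparse vector over words of several lengths, not a single k
        bag = Counter()
        for k in ks:
            for i in range(len(seq) - k + 1):
                bag[seq[i:i + k]] += 1
        return bag

    print(mixed_k_bag("ACGTACGT").most_common(3))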
Chapter 5

Perspectives and future applications

The Amordad database engine was conceived in the context of metagenomics applications, but from the outset we knew the concepts within Amordad had very broad applications in massive databases for "ultra-high" dimensional data. In this chapter we explore those applications, and start outlining a path to leverage Amordad more generally. We will discuss future perspectives in which the current trend in data accumulation leads to situations where computationally costly algorithms are unable to organize sequence data or support content-based similarity search. Our context, like the previous chapter, will primarily remain within metagenomics. However, we note that other rapidly evolving technologies such as whole genome bisulfite sequencing (Cokus et al., 2008), RNA-seq (Pepke et al., 2009) or even single cell sequencing (Eberwine et al., 2014) will eventually require lightweight geometric interpretation of gigantic string objects and the structuring of high-dimensional data for efficient similarity search.

We begin with a question: with several general purpose biological data warehouses such as NCBI, ENA, etc. or more specialized databases and web-servers like EBI-Metagenomics (Hunter et al., 2014) or MG-RAST (Meyer et al., 2008; Glass et al., 2010), how can heterogeneous metagenomic data be fused on a massive scale? By data fusion, we do not mean to simply gather information from multiple resources and construct a relational database with a trivial indexing scheme. What is potentially much more helpful is to organize data based on the content of metagenomes, such that clusters of similar data are accessible to users. It is obvious that the number and volume of such data, even currently, pose a serious challenge for content-based fusion of metagenomic samples. Subsequently, this impacts the downstream applications that consider a cohort of seemingly analogous metagenomes for statistical modelling and prefer the largest possible cluster of similar metagenomes. We propose a framework that uses Amordad as the back-end engine and results in an integrative database of metagenomic data from multiple resources while maintaining the intrinsic characteristics of each sample.

We then continue to explore the application of Amordad to unravel metagenotype-phenotype associations. We focus on the geo-specification problem, which asks to locate an environmental metagenome by analyzing its raw NGS data. This is a complicated trait and, as we later explain, we expect that existing tools fail to address it. We introduce a method to infer geographical information for a given metagenome (the query) by relying on similar metagenomes with known geographical information (the near neighbors). Finally, we discuss the extension of our geometric algorithm to treat other types of biological data. We believe that our approach is generic in the sense that, regardless of type, every complicated string object could be analyzed by our framework. However, it might be necessary for each application to adjust the geometric representation and the similarity measures to cope with the needs of the study domain.

5.1 Fusing metagenomic data

The lack of any metagenomic database accepting content-based queries means new studies have no or limited access to the results of previous projects. Comparative analysis of metagenomes often takes place between samples within a single cohort, leading to statistical inferences based on small sample sizes, with the possibility of outlier distortions. Several projects, however, are attempting to organize metagenomic data based on phenotypic traits. Some major data repositories offer relational tables containing metadata associated with each metagenome. For example in MG-RAST, the user can filter metagenomes based on the biome or the inhabiting material of the bacterial colony.

Despite the promise in this type of organization, considerable issues reduce the reliability of such ontological classifications. The first concern is the absence of a unique standard to deal with metadata. Since traditional taxonomic classification does not apply to metagenomes, a major barrier in ontological data integration is the absence of a single standardized classification scheme. One proposal for a systematic classification of metagenomic samples exists within DoE's Joint Genome Institute database (GOLD). This database is linked to NCBI and uses an ontology enhanced from the one proposed by NCBI. Another systematic classification, suggested by (Yilmaz et al., 2010), has been used in other major metagenomic data repositories such as MG-RAST and EBI-Metagenomics. There are differences between these two classification approaches, imposing a challenge in fusing multi-hosted metagenomic data. Another concern is the missing metadata currently deposited in metagenomic databases. Our investigations found that the submission guidelines and protocols of almost no major repository require users to provide complete metadata at submission time. Therefore, even within one database, descriptive information about samples frequently contains missing or inconsistent values.

Content-based data fusion, on the other hand, could mitigate such difficulties in a systematic way. The clustering of metagenomes finds statistically meaningful interpretations based on features within the data. Additionally, with an efficient similarity search engine, we might be able to infer some missing metadata systematically, based on the existing information in the cluster to which a metagenome belongs. We plan to construct a comprehensive alignment-free database containing the k-mer count vectors of all of the publicly available samples deposited in major metagenomic databases. For the creation of the database, we fix k = 5; however, the user is able to download k-mer count vectors for 1 <= k <= 10. Amordad serves as the back-end search engine for the database, providing users with real-time similarity search results. We then use the nearest neighbor graph associated with the integrated samples to infer missing metadata at large scale. We expect inconsistency in metadata fields stemming from different sources; therefore, this process might not be accomplished without human intervention. However, Amordad will significantly facilitate the inference process and present a confidence score for predicted metadata.
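One plausible realization of this idea, sketched below under our own assumptions (it is not an implemented Amordad feature), fills a missing field from the values observed among a sample's graph neighbors and reports the majority fraction as the confidence score.

    from collections import Counter

    def impute_field(node, field, graph, metadata):
        # collect the field from neighbors that actually have it recorded
        values = [metadata[v].get(field) for v in graph[node]]
        values = [x for x in values if x is not None]
        if not values:
            return None, 0.0
        value, count = Counter(values).most_common(1)[0]
        return value, count / len(values)  # predicted value, confidence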
To better illustrate this strategy for inferring missing data, we discuss a metagenotype-phenotype association problem in the next section. We use our geometric database engine to predict the location of a bacterial community in a probabilistic framework.

5.2 Localization of environmental metagenomic samples

Characterization of a bacterial community, as we previously explained, is a laborious procedure requiring significant computational resources. An alternative approach could be understanding a metagenome based on similar communities that have already been analyzed. Since we confine the concept of similarity to what can be concluded from a feature-based representation, we may not be able to predict every behavioral aspect of metagenomes. However, we can guarantee that, to the degree that a particular representation is able to encode the characteristics of the underlying metagenomes, this approach will help us to probabilistically understand unknown traits of a sample.

An example of trait identification is described in this section. We focus on environmental samples and hypothesize that the correlation between the genetic material of each sample and geographical information about existing species may lead us to specify the location of a metagenome with high resolution. In light of this hypothesis, we define the geo-specificity problem, which seeks the provenance of a metagenome. Note that in practice, a metagenome might be a mixture of several ecosystems due to casual contact of the sample with the ambient surrounding colonies of microorganisms. For example, a small piece of dust collected from someone's boot is a compendium of different dormant soil microbes gradually piled up as the person walks.

We propose a high-level approach to address the geo-specificity problem. We highlight geometric similarity search as a solid approach to dramatically lower the computational demands of this problem. We expect the integrated database of metagenomes, as described earlier, to have the capacity to unravel the connections between the genomic material of a bacterial community and its provenance information. We also note the efficiency of Amordad in determining such information even if the sample is composed of a number of microorganismal ecosystems.

5.2.1 Metagenome-to-metagenome matching (MG-MG)

It is possible to determine geo-specificity of metagenomic sequences, with some resolution and some confidence level, relying only on existing databases. Even within the context of existing databases, the degrees of resolution and confidence will increase as those databases grow.

Similarity search of a metagenome against a database of environmental metagenomic samples suggests a straightforward approach: given a query metagenome, identify the similar metagenomes in the database, and then use geo-spatial information associated with those similar metagenomes to assign geo-spatial provenance to the query. Due to the diversity of microbial life and the crucial influence of the habitat on the genomes present, it is logical to assume two similar metagenomes should have close natural habitats.
The algorithmic issues in this approach are: (1) how to represent the metagenomes, both in the database and for queries, (2) how to measure "similarity" between metagenomes, (3) how to combine geo-spatial information from metagenomes in response to an individual query, and (4) how to balance accuracy, computational demands, error modes and resolution. We already have a framework that provides candidate solutions to points (1) and (2). In fact, Amordad can serve as the search engine for the environmental database, enabling efficient comparative metagenomic studies. Remember that Amordad uses metagenomic fingerprints, based on counts of short oligonucleotides, to represent metagenomes. These representations correspond to points in high-dimensional spaces, and allow one to define "clusters" of metagenomes that correspond to environmental categories (e.g. aquatic, terrestrial or air metagenomes). The problem becomes a nearest-neighbor search problem, and the identified nearest neighbor(s) can be used to infer geo-spatial provenance; a sketch of this inference is given at the end of this section. Issues (3) and (4) depend on the precise features of the database. A database must be sufficiently large and diverse to have "representative" metagenomes for each type of query. Ideally these would be uniformly sampled from all over the globe, in representative environmental niches, and during different seasons. The ideal situation does not currently exist, but enough information is accumulating that high-specificity solutions might currently be possible.
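A bare-bones version of this MG-MG inference might look as follows. Here db is assumed to hold (vector, (lat, lon)) records, dist is any metagenome distance such as the angle used earlier, and the inverse-distance averaging of coordinates is a deliberate simplification: it ignores wraparound at the date line, which a real system would have to handle.

    def locate(query_vec, db, dist, k=5, eps=1e-9):
        # pick the k database metagenomes nearest to the query
        near = sorted(db, key=lambda rec: dist(query_vec, rec[0]))[:k]
        # inverse-distance weights; eps guards against a zero distance
        w = [1.0 / (dist(query_vec, vec) + eps) for vec, _ in near]
        lat = sum(wi * pos[0] for wi, (_, pos) in zip(w, near)) / sum(w)
        lon = sum(wi * pos[1] for wi, (_, pos) in zip(w, near)) / sum(w)
        return lat, lon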
Novel relationships will constitute important discovery themselves, but additionally, the identified relationships 77 will enable data scientists to select the most relevant data sets for deep scrutiny. Consider these scenarios: 1. Given a putatively new single-cell expression profile, can we efficiently test if a “similar” one has already been identified? 2. Given soil metagenomes before and after pesticide contamination, can we iden- tify the 10 most similar metagenomes and the projects that produced them, thus allowing all data from those projects to be downloaded for full analyses? The first scenario is a basic database query: does my data point exist in the database? In answering this query, the database must account for a radius of technical variation around the expression profiles. In some contexts the semantics of that query might become: “What cell type have I sequenced?” The second scenario, however, goes much further: the user seeks the identities of nearby points to guide data fusion. Supporting data-driven discovery in such scenarios requires a database engine designed to account for certain features common to many emerging forms of scientific data: the data points have dimensions numbering many thousands; the databases will contain many million points; and the data accumulates like the wild-west, by public submission and with public ownership. These features exist in many database contexts, but their intersection presents new challenges. The concepts introduced in the design of Amordad will help in such scenarios to efficiently execute the operations required for data-driven discovery in the contexts we have outlined. 78 Chapter 6 Summary and conclusion We viewed sequence comparison problem from a geometric perspective. The start- ing point was to interpret biological sequence data as geometric points which creates a mapping between string and geometry domains. We observed that this mapping will enable us to eventually represent a class of alignment-free based comparative studies as instances of the nearest neighbor search (NNS) problem. Depending on the similarity measure, problem conversion might not be straightforward. One widely used class of such measures is introduced byD 2 and its standardized variantD ? 2 . Under these statisti- cal measures, comparison of two sequences is equivalent to finding the dot product value between two vectors. Calculation of each dot product is linear with respect to the length of the underlying sequences, however, as we saw in local alignment-free sequence com- parison, the number of such existing calculations may impose a computational barrier in large-scale studies. We proposed a spatial data structure, SLF, to recapture the maximum dot product computation problem as a series of bichromatic closest pair (BCP) sub-problems while increasing the time complexity just by a logarithmic factor. BCP is closely related to NNS and therefore, a randomized strategy commonly used to address approximate NNS was incorporated to solve it. We relied on the locality sensitive hashing to deter- mine over a set of iterations a collection of candidate bichromatic pairs from which the closest pair is identified by exhaustive search. Our analysis showed that under cer- tain conditions, the size of the candidate set is sufficiently small such that BCP can be solved in sub-quadratic time with guaranteed upper bound on the probability of error. 
We empirically evaluated our framework under the planted motif model and observed that in practice, our algorithm can identify regions of similarity almost as accurately as the naive search, but with a considerable decrease in wall-time.

The same approach was applied to organize next generation metagenomics data and to develop a high dimensional database engine called Amordad. The idea is to have a feature-based representation of metagenomes that captures the essential information while considerably shrinking the size of the stored data. The counts of k-mers in metagenomes were considered as informative features in our work. However, we emphasized the generality of our approach in the sense that any set of features could be plugged into our framework. The curse of dimensionality easily hits our database indexing system, since the number of dimensions grows exponentially with the value of k. We elucidated that although locality sensitive hashing could be helpful for indexing and query retrieval, its use is impeded by the memory requirements when the number of metagenomes is immensely large. Another issue was the inherent dynamicity and fast pace of data accumulation, which necessitated an efficient maintenance procedure. We designed a novel spatial data structure combining the locality sensitive hash tables and the nearest neighbor graph that resolved both of these issues. We described how the graph mitigates the memory demands of the database by capturing the similarity relations that exist among data points. A "retiring" strategy was also presented for hash tables that aggregates, over time, the proximity information from hash tables into the graph and rapidly evolves it when data undergoes frequent updates. This equips Amordad with a lightweight maintenance procedure and makes the indexing system robust and flexible against changes in data.
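The retiring strategy can be sketched as a bounded queue of hash tables whose newest member proposes candidate edges before the oldest is dropped. All names below are illustrative, and the pruning shown is a simplification of the edge refinement described above.

    from collections import deque

    def refresh(tables, new_table, graph, points, dist, m=10, maxlen=16):
        # the newest table proposes edges between co-hashed points
        for bucket in new_table.values():
            for u in bucket:
                for v in bucket:
                    if u != v:
                        graph[u].add(v)
        # keep only each node's m closest out-neighbors
        for u in graph:
            graph[u] = set(sorted(graph[u],
                                  key=lambda v: dist(points[u], points[v]))[:m])
        tables.append(new_table)
        if len(tables) > maxlen:
            tables.popleft()  # retire the oldest table; its edges persist

    # tables would be held as a bounded queue: tables = deque()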
A set of experiments was conducted to evaluate the performance of different components in Amordad. We showed the nearest neighbor graph can be efficiently updated by acquiring hash table information, even for instances of Amordad with numerous data points. We assessed the efficacy of Amordad with regard to both time and memory requirements and found that while memory consumption is significantly decreased, queries are answered in reasonable time. Amordad aims to serve as an analytic tool for comparative studies in metagenomics. We briefly described some of its applications in biology and medicine. For example, we discussed finding phenotype-metagenotype associations, where a missing phenotype of a metagenome could be predicted by querying the metagenome against a large database and understanding the phenotype based on the similarity results.

Reference List

Agarwal P, Edelsbrunner H, Schwartzkopf O, Welzl E (1991) Euclidean MST and bichromic closest pairs. Disc. Comput. Geom. 6:407–422.
Agarwal PK, Erickson J (1999) Geometric range searching and its relatives. Contemporary Mathematics 223:1–56.
Alm E, Huang K, Arkin A (2006) The evolution of two-component systems in bacteria reveals different strategies for niche adaptation. PLoS Comput. Biol. 2:e143.
Altschul S, Gish W, Miller W, Myers E, Lipman D (1990) Basic local alignment search tool. J. Mol. Biol. 215:403–410.
Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.
Andoni A, Indyk P (2008) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM 51:117–122.
Andoni A, Indyk P (2006) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, pp. 459–468. IEEE.
Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, Mende DR, Fernandes GR, Tap J, Bruls T, Batto JM et al. (2011) Enterotypes of the human gut microbiome. Nature 473:174–180.
Arya S, Mount DM, Netanyahu NS, Silverman R, Wu AY (1998) An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM) 45:891–923.
Beis JS, Lowe DG (1997) Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pp. 1000–1006. IEEE.
Belda-Ferre P, Alcaraz LD, Cabrera-Rubio R, Romero H, Simón-Soro A, Pignatelli M, Mira A (2011) The oral metagenome in health and disease. The ISME Journal 6:46–56.
Bentley J (1975) Multidimensional binary search trees used for associative searching. Communications of the ACM 18:509–517.
Berman B, Nibu Y, Pfeiffer B, Tomancak P, Celniker S, Levine M, Rubin G, Eisen M (2002) Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc. Natl. Acad. Sci. 99:757–762.
Buhler J (2001) Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics 17:419–428.
Bühlmann P, Wyner AJ (1999) Variable length Markov chains. The Annals of Statistics 27:480–513.
Burden CJ, Kantorovitz MR, Wilson SR et al. (2008) Approximate word matches between two random sequences. The Annals of Applied Probability 18:1–21.
Chan CX, Ragan MA et al. (2013) Next-generation phylogenomics. Biology Direct 8:1–6.
Chan CKK, Hsu AL, Halgamuge SK, Tang SL (2008) Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics 9:215.
Charikar M (2002) Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pp. 380–388.
Cho I, Blaser MJ (2012) The human microbiome: at the interface of health and disease. Nature Reviews Genetics 13:260–270.
Clemente JC, Ursell LK, Parfrey LW, Knight R (2012) The impact of the gut microbiota on human health: an integrative view. Cell 148:1258–1270.
Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B, Haudenschild CD, Pradhan S, Nelson SF, Pellegrini M, Jacobsen SE (2008) Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature 452:215–219.
Cormen T, Leiserson C, Rivest R, Stein C (2001) Introduction to Algorithms (Chapter 33). MIT Press.
Daniel R (2005) The metagenomics of soil. Nature Reviews Microbiology 3:470–478.
Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pp. 253–262. ACM.
Dayhoff M, Schwartz R, Orcutt B (1978) A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5:345–352.
Domazet-Lošo M, Haubold B (2011) Alignment-free detection of local similarity among viral and bacterial genomes. Bioinformatics 27:1466–1472.
Dong W, Moses C, Li K (2011) Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World Wide Web, pp. 577–586. ACM.
Eberwine J, Sul JY, Bartfai T, Kim J (2014) The promise of single-cell sequencing. Nature Methods 11:25–27.
Faloutsos C, Barber R, Flickner M, Hafner J, Niblack W, Petkovic D, Equitz W (1994) Efficient and effective querying by image content. Journal of Intelligent Information Systems 3:231–262.
Feige U, Schechtman G (2002) On the optimality of the random hyperplane rounding technique for max cut. Random Struct. Algorithms 20:403–440.
Finkel RA, Bentley JL (1974) Quad trees: a data structure for retrieval on composite keys. Acta Informatica 4:1–9.
Flicek P, Birney E (2009) Sense from sequence reads: methods for alignment and assembly. Nature Methods 6:S6–S12.
Forêt S, Wilson S, Burden C (2009) Empirical distribution of k-word matches in biological sequences. Pattern Recognition 42:539–548.
Friedman JH, Bentley JL, Finkel RA (1977) An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS) 3:209–226.
Fuller JC, Khoueiry P, Dinkel H, Forslund K, Stamatakis A, Barry J, Budd A, Soldatos TG, Linssen K, Rajput AM (2013) Biggest challenges in bioinformatics. EMBO Reports 14:302–304.
Gilbert JA, Meyer F, Antonopoulos D, Balaji P, Brown CT, Brown CT, Desai N, Eisen JA, Evers D, Field D et al. (2010) Meeting report: the terabase metagenomics workshop and the vision of an Earth microbiome project. Standards in Genomic Sciences 3:243.
Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In Proceedings of 25th Int. Conf. on Very Large Databases, pp. 518–529.
Glass EM, Wilkening J, Wilke A, Antonopoulos D, Meyer F (2010) Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harbor Protocols 2010:pdb–prot5368.
Goemans M, Williamson D (1995) Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM 42:1115–1145.
Göke J, Schulz M, Lasserre J, Vingron M (2012) Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts. Bioinformatics 28:656–663.
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q et al. (2011) Full-length transcriptome assembly from RNA-seq data without a reference genome. Nature Biotechnology 29:644–652.
Handelsman J, Tiedje J, Alvarez-Cohen L, Ashburner M, Cann I, Delong E, Doolittle W, Fraser-Liggett C, Godzik A, Gordon J et al. (2007) The new science of metagenomics: revealing the secrets of our microbial planet. National Academy of Sciences, Washington DC.
Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM (1998) Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chemistry & Biology 5:R245–R249.
Hao X, Jiang R, Chen T (2011) Clustering 16S rRNA for OTU prediction: a method of unsupervised Bayesian clustering. Bioinformatics 27:611–618.
Haubold B, Reed F, Pfaffelhuber P (2011) Alignment-free estimation of nucleotide diversity. Bioinformatics 27:449–455.
Haveliwala T, Gionis A, Indyk P (2000) Scalable techniques for clustering the web. In Proceedings of the 3rd intl. workshop on web and databases, pp. 129–134.
Hu Y, Bajorath J (2014) Learning from 'big data': compounds and targets. Drug Discovery Today 19:357–360.
Human Microbiome Project Consortium et al. (2012) A framework for human microbiome research. Nature 486:215–221.
Hunter S, Corbett M, Denise H, Fraser M, Gonzalez-Beltran A, Hunter C, Jones P, Leinonen R, McAnulla C, Maguire E et al. (2014) EBI metagenomics: a new resource for the analysis and archiving of metagenomic data. Nucleic Acids Research 42:D600–D606.
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604–613.
Jannink J (1995) Implementing deletion in B+-trees. ACM SIGMOD Record 24:33–38.
Jiang B, Song K, Ren J, Deng M, Sun F, Zhang X (2012) Comparison of metagenomic samples using sequence signatures. BMC Genomics 13:730.
Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden T (2008) NCBI BLAST: a better web interface. Nucleic Acids Res. 36:W5–W9.
Kantorovitz M, Robinson G, Sinha S (2007) A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics 23:1249–1255.
Karlin S, Altschul S (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. 87:2264–2268.
Kazemian M, Zhu Q, Halfon M, Sinha S (2011) Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison. Nucleic Acids Res. 39:9463–9472.
Khuller S, Matias Y (1995) A simple randomized sieve algorithm for the closest-pair problem. Information and Computation 118:34–37.
Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER (2013) The next-generation sequencing revolution and its impact on genomics. Cell 155:27–38.
Kuczynski J, Lauber CL, Walters WA, Parfrey LW, Clemente JC, Gevers D, Knight R (2011) Experimental and analytical tools for studying the human microbiome. Nature Reviews Genetics 13:47–58.
Kulis B, Grauman K (2009) Kernelized locality-sensitive hashing for scalable image search. In Computer Vision, 2009 IEEE 12th International Conference on, pp. 2130–2137. IEEE.
Le Chatelier E, Nielsen T, Qin J, Prifti E, Hildebrand F, Falony G, Almeida M, Arumugam M, Batto JM, Kennedy S et al. (2013) Richness of human gut microbiome correlates with metabolic markers. Nature 500:541–546.
Leinonen R, Akhtar R, Birney E, Bower L, Cerdeno-Tárraga A, Cheng Y, Cleland I, Faruque N, Goodgame N, Gibson R et al. (2011) The European Nucleotide Archive. Nucleic Acids Research 39:D28–D31.
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25:1754–1760.
Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K et al. (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Research 20:265–272.
Lippert R, Huang HY, Waterman M (2002) Distributional regimes for the number of k-word matches between two random sequences. Proc. Natl. Acad. Sci. 100:13980–13989.
Liu X, Wan L, Li J, Reinert G, Waterman M, Sun F (2011) New powerful statistics for alignment-free sequence comparison under a pattern transfer model. J. Theor. Biol. 284:106–116.
Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y et al. (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1:18.
Lv Q, Josephson W, Wang Z, Charikar M, Li K (2007) Multi-probe LSH: efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd international conference on Very large data bases, pp. 950–961. VLDB Endowment.
Mahmood K, Webb G, Song J, Whisstock J, Konagurthu A (2012) Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs. Nucleic Acids Res. 40:e44.
Marx V (2013) Biology: the big challenges of big data. Nature 498:255–260.
McDaniel L, Young E, Delaney J, Ruhnau F, Ritchie K, Paul J (2010) High frequency of horizontal gene transfer in the oceans. Science 330:50.
Meader S, Ponting C, Lunter G (2010) Massive turnover of functional sequence in human and other mammalian genomes. Genome Res. 20:1335–1343.
Meinicke P, Aßhauer KP, Lingner T (2011) Mixture models for analysis of the taxonomic composition of metagenomes. Bioinformatics 27:1618–1624.
Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A et al. (2008) The metagenomics RAST server: a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9:386.
Miller GL, Teng SH, Thurston W, Vavasis SA (1997) Separators for sphere-packings and nearest neighbor graphs. Journal of the ACM (JACM) 44:1–29.
Motwani R, Raghavan P (1995) Randomized Algorithms. Cambridge University Press.
Muller M (1959) A note on a method for generating points uniformly on n-dimensional spheres. Communications of the Association for Computing Machinery 2:19–20.
Nalbantoglu OU, Way SF, Hinrichs SH, Sayood K (2011) RAIphy: phylogenetic classification of metagenomics samples using iterative refinement of relative abundance index profiles. BMC Bioinformatics 12:41.
Nesme J, Cécillon S, Delmont TO, Monier JM, Vogel TM, Simonet P (2014) Large-scale metagenomic-based study of antibiotic resistance in the environment. Current Biology 24:1096–1100.
Pagani I, Liolios K, Jansson J, Chen IMA, Smirnova T, Nosrat B, Markowitz VM, Kyrpides NC (2012) The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Research 40:D571–D579.
Panigrahy R (2006) Entropy based nearest neighbor search in high dimensions. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pp. 1186–1195. ACM.
Pearson W, Lipman D (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. 85:2444–2448.
Pepke S, Wold B, Mortazavi A (2009) Computation for ChIP-seq and RNA-seq studies. Nature Methods 6:S22–S32.
Petrosino JF, Highlander S, Luna RA, Gibbs RA, Versalovic J (2009) Metagenomic pyrosequencing and microbial identification. Clinical Chemistry 55:856–866.
Porter MS, Beiko RG (2013) Spanner: taxonomic assignment of sequences using pyramid matching of similarity profiles. Bioinformatics.
Preparata F, Shamos M (1985) Computational Geometry: An Introduction. Springer-Verlag.
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T et al. (2010) A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464:59–65.
Qin J, Li Y, Cai Z, Li S, Zhu J, Zhang F, Liang S, Zhang W, Guan Y, Shen D et al. (2012) A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490:55–60.
Ravichandran D, Pantel P, Hovy E (2005) Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 622–629.
Reinert G, Chew D, Sun F, Waterman M (2009) Alignment-free sequence comparison (I): statistics and power. J. Comp. Biol. 16:1615–1634.
Salton G (1991) Developments in automatic text retrieval. Science 253:974–980.
Samet H (2006) Foundations of Multidimensional and Metric Data Structures. Elsevier.
Schadt EE, Linderman MD, Sorenson J, Lee L, Nolan GP (2010) Computational solutions to large-scale data management and analysis. Nature Reviews Genetics 11:647–657.
Schatz MC, Langmead B, Salzberg SL (2010) Cloud computing and the DNA data race. Nature Biotechnology 28:691.
Schloissnig S, Arumugam M, Sunagawa S, Mitreva M, Tap J, Zhu A, Waller A, Mende DR, Kultima JR, Martin J et al. (2013) Genomic variation landscape of the human gut microbiome. Nature 493:45–50.
Shakhnarovich G, Indyk P, Darrell T (2006) Nearest-Neighbor Methods in Learning and Vision: Theory and Practice.
Shokralla S, Spall JL, Gibson JF, Hajibabaei M (2012) Next-generation sequencing technologies for environmental DNA research. Molecular Ecology 21:1794–1805.
Sims G, Kim S (2011) Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). Proc. Natl. Acad. Sci. 108:8329–8334.
Sinha S, Siggia E (2005) Sequence turnover and tandem repeats in cis-regulatory modules in Drosophila. Mol. Biol. Evol. 22:874–885.
Smith T, Waterman M (1981) Identification of common molecular subsequences. J. Mol. Biol. 147:195–197.
Song K, Ren J, Zhai Z, Liu X, Deng M, Sun F (2012) Alignment-free sequence comparison based on next generation sequencing reads: extended abstract. In RECOMB, pp. 272–285.
Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F (2013a) New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Briefings in Bioinformatics p. bbt067.
Song K, Ren J, Zhai Z, Liu X, Deng M, Sun F (2013b) Alignment-free sequence comparison based on next-generation sequencing reads. Journal of Computational Biology 20:64–79.
Taher L, McGaughey D, Maragh S, Aneas I, Bessling S, Miller W, Nobrega M, McCallion A, Ovcharenko I (2011) Genome-wide identification of conserved regulatory function in diverged sequences. Genome Res. 21:1139–1149.
Tan P, Steinbach M, Kumar V (2006) Introduction to Data Mining (Chapter 8). Addison-Wesley.
Torney D, Burks C, Davison D, Sirotkin K (1990) Computation of D2: a measure of sequence dissimilarity. Computers and DNA pp. 109–125.
Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, Gordon JI (2007) The human microbiome project. Nature 449:804–810.
Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428:37–43.
van Oostrum RW (1999) Geometric algorithms for geographic information systems.
Venkataram S, Fay J (2010) Is transcription factor binding site turnover a sufficient explanation for cis-regulatory sequence divergence? Genome Biol Evol. 2:851–858.
Vinga S, Almeida J (2003) Alignment-free sequence comparison: a review. Bioinformatics 19:513–523.
von Bubnoff A (2008) Next-generation sequencing: the race is on. Cell 132:721–723.
Wan L, Reinert G, Sun F, Waterman M (2010) Alignment-free sequence comparison (II): theoretical power of comparison statistics. J. Comput. Biol. 17:1467–1490.
Waterman M, Vingron M (1994) Rapid and accurate estimates of statistical significance for sequence data base searches. Proc. Natl. Acad. Sci. 91:4625–4628.
Wooley JC, Godzik A, Friedberg I (2010) A primer on metagenomics. PLoS Computational Biology 6:e1000667.
Yao A (1982) On constructing minimum spanning trees in k-dimensional spaces and related problems. SIAM J. Comput. 11:721–736.
Yilmaz P, Kottmann R, Field D, Knight R, Cole JR, Amaral-Zettler L, Gilbert JA, Karsch-Mizrachi I, Johnston A, Cochrane G et al. (2010) The "minimum information about an environmental sequence" (MIENS) specification.
Yu Y, Crucianu M, Oria V, Chen L (2009) Local summarization and multi-level LSH for retrieving multi-variant audio tracks. In Proceedings of the 17th ACM international conference on Multimedia, pp. 341–350. ACM.
Zhang H, DiBaise JK, Zuccolo A, Kudrna D, Braidotti M, Yu Y, Parameswaran P, Crowell MD, Wing R, Rittmann BE et al. (2009) Human gut microbiota in obesity and after gastric bypass. Proceedings of the National Academy of Sciences 106:2365–2370.
Zhang Z, Schaffer A, Miller W, Madden T, Lipman D, Koonin E, Altschul S (1998) Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res. 26:3986–3990.

Appendix A

Public datasets in Amordad

Amordad contains public datasets from 133 projects, as explained in the main text. For each project, the ENA accession number and a brief description are provided. Note that some projects contain cohorts of case and control metagenomes. For instance "Gut microbiota (Crohn's disease)" (Accession no: SRP015779) has both healthy samples and metagenomes from individuals with Crohn's disease.

Accession num. | Scope
DRP000446 | Human serum (Kawasaki Disease)
DRP000454 | Bioterrorism
ERP000673 | Coprolites
ERP000674 | Coprolites (in Chauvet cave)
ERP001021 | Polluted soil
ERP001038 | Gut microbiota (infants)
ERP001068 | Low and neutral PH soil
ERP001118 | Rhizosphere
ERP001178 | Polar water
ERP001227 | North sea water
ERP001239 | Agricultural soil
ERP001267 | Root bacterial communities of plants
ERP001568 | Polluted water
ERP001706 | Gut microbiota (Crohn's disease)
ERP001726 | Earth crust
ERP001737 | Sea water
ERP001934 | Rumen metagenome
ERP002047 | Neandertal fossil
ERP002170 | Acid mine drainage
ERP002361 | Deep brine sea water
ERP002469 | Gut microbiota (diff. glucose control cond.)
ERP002477 | Sea water (different depth)
ERP002583 | Sea water
ERP003267 | Phytophthora
ERP003268 | Sea water (different depth)
SRP001634 | Gut microbiota (infants)
SRP002421 | Psoriasis in human
SRP002468 | Gut with esophageal adenocarcinoma
SRP002817 | Rumen metagenome of pre-ruminant calf
SRP003198 | Kimchi (food) metagenome
SRP004544 | Rhizosphere of mangrove
SRP004633 | Forest soil
SRP004634 | Soil (hot climate)
SRP004635 | Soil
SRP004644 | Soil with crude glycerin
SRP004720 | Esponja
SRP004722 | Soil
SRP004929 | Hydrothermal vents
SRP005451 | Sea water (extreme depth)
SRP005641 | Oil polluted sediment
SRP005989 | Fungus gardens of ants
SRP006081 | Gut microbiota (human and mice)
SRP006127 | Ammonia oxidizing archeon
SRP006172 | Gut microbiota (insect)
SRP006444 | Highly acid biofilms
SRP006764 | Human gut with T1D
SRP006837 | African fruit bat
SRP006846 | Sea water
SRP006872 | Gut microbiota with whipworm infection
SRP007242 | Different arctic metagenomes
SRP007287 | Low PH soil
SRP007434 | Horse gut microbiome
SRP007507 | Raw sewage
SRP007586 | Hadopelagic zone of Mediterranean sea
SRP007599 | Rumen metagenome
SRP007664 | Rumen metagenome
SRP008126 | Swine gut microbiota
SRP009106 | Chemolithoautotrophic populations
SRP017205 | Tomato plant (diff. organs)
SRP017461 | Contaminated soil
SRP018108 | Oral microbiome (human)
SRP018736 | Hungarian mummy (with tuberculosis)
SRP021521 | Mice gut (with antibiotic treatment)
SRP024705 | Plasma virome (with HIV)
SRP026071 | Desert soil
SRP026072 | Desert soil
SRP026236 | Mouse with Salmonella (different days)
SRP026287 | Petroleum metagenome
ERP001736 | Sea water (different depth)
SRP001634 | Gut microbiota (infant)
SRP007749 | Cystic fibrosis lung
SRP008135 | Polluted soil
SRP008478 | Rumen microbiome of Svalbard reindeer
SRP008662 | Microbiome of oil fields
SRP009185 | Soil
SRP009186 | Soil
SRP009243 | Pockmarked sediment
SRP009392 | Cystic fibrosis lung
SRP009395 | Subtropical fresh water
SRP009476 | Sea water (different depth)
SRP009489 | Stromatolitic microbial mats
SRP009683 | Pine wood nematode
SRP009802 | Mud volcanoes (different depth)
SRP010132 | Air metagenome
SRP010272 | Fresh water (lake)
SRP010564 | Viral samples from hypersaline lake
SRP010567 | CRISPR
SRP010576 | Phosphorus removal bioreactors
SRP010819 | Cocoa bean box fermentation
SRP011078 | Lignocellulose degrading environments
SRP011419 | Agricultural bulk soil
SRP011572 | Drinking water (different treatments)
SRP011637 | Microbiome of uranium field
SRP011941 | Reindeer rumen microbiome
SRP012061 | Nasopharyngeal swabs of patients with influenza
SRP012104 | Different metagenomes (human mouth, sea water)
SRP012145 | Deep-sea hydrothermal plumes
SRP012152 | Bioreactor with sediment from sea (diff. time)
SRP012558 | Gut microbiota (pre-mature infant)
SRP012653 | Microbiome of milk
SRP013103 | Sponge microbiome
SRP013339 | Contaminated soil
SRP013381 | Uranium contaminated site
SRP013413 | Sequencing batch reactor from wastewater
SRP013944 | Forest soil
SRP013946 | Alkaline surface water
SRP014440 | Iberian lynx gut
SRP014720 | Hydrothermal springs
SRP014745 | Mice gut (different treatment)
SRP014764 | Ocean samples
SRP014914 | Anaerobic thermophilic cellulolytic sludge
SRP015319 | Halorespiration mixed culture
SRP015320 | Halorespiration mixed culture
SRP015679 | Study of Tenacibaculum maritimum
SRP015779 | Gut microbiota (Crohn's disease)
SRP015994 | Camel rumen
SRP016067 | Gut microbiota (with atherosclerosis)
SRP016520 | Soil surface samples
SRP016523 | Diverged synthetic communities
SRP016526 | Deep sea vents
SRP017119 | Viral fraction of surface water
SRP017582 | Oil sands
SRP017657 | Clonal isolates of E. coli
SRP017928 | Fermentation of soy sauce
SRP018411 | Snottites biofilms
SRP018745 | Sand filter of drinking water
SRP018821 | Insect gut
SRP019347 | Brucella melitensis culture
SRP019348 | Brucella melitensis culture
SRP019913 | Red sea metagenome
SRP021107 | Intestinal viral particles
SRP021115 | Marine dissolved organic matter
SRP021141 | Intertidal thrombolitic mat

Table A.1: List of the projects used in Amordad

Appendix B

Significant biological associations discovered by Amordad

Each row in the table below shows a pair of similar metagenomes. Pairs are sorted according to their distance (not shown here). In addition to the accession number of the projects that contain the metagenomes, the sample accession number for each metagenome and a brief description of the projects are also listed.

Acc # 1 | Sample # 1 | Acc # 2 | Sample # 2 | Project scope 1 | Project scope 2
ERP001737 | ERR164414 | ERP001736 | ERR315863 | Sea water | Sea water
ERP001736 | ERR315863 | ERP001737 | ERR164414 | Sea water | Sea water
ERP001737 | ERR164416 | ERP001736 | ERR315858 | Sea water | Sea water
ERP001737 | ERR164421 | ERP001736 | ERR315861 | Sea water | Sea water
ERP001736 | ERR164409 | ERP001227 | ERR091522 | Sea water | North sea water
ERP001227 | ERR091555 | ERP001736 | ERR315858 | North sea water | Sea water
ERP001737 | ERR164418 | ERP001736 | ERR315863 | Sea water | Sea water
SRP009476 | SRR385733 | ERP001736 | ERR164409 | Sea water | Sea water
ERP001068 | ERR059346 | SRP013944 | SRR516953 | Low and neutral PH soil | Forest soil
SRP013944 | SRR516953 | ERP001068 | ERR059346 | Forest soil | Low and neutral PH soil
SRP021115 | SRR828634 | ERP001736 | ERR164409 | Marine matter | Sea water
SRP009476 | SRR385732 | SRP021115 | SRR828634 | Sea water | Marine matter
SRP013944 | SRR516948 | ERP001068 | ERR059349 | Forest soil | Low and neutral PH soil
ERP001068 | ERR059349 | SRP013944 | SRR516948 | Low and neutral PH soil | Forest soil
ERP001737 | ERR164423 | ERP001736 | ERR164409 | Sea water | Sea water
ERP001736 | ERR164407 | ERP001737 | ERR164422 | Sea water | Sea water
ERP001737 | ERR164422 | ERP001736 | ERR164407 | Sea water | Sea water
SRP017582 | SRR636569 | ERP001568 | ERR149037 | Oil sands | Polluted water
ERP001737 | ERR164417 | SRP021115 | SRR828638 | Sea water | Marine matter
ERP001737 | ERR164425 | ERP001227 | ERR091532 | Sea water | North sea water
SRP016067 | SRR585728 | ERP002469 | ERR260271 | Gut microbiota | Gut microbiota
ERP001736 | ERR318620 | SRP021115 | SRR828633 | Sea water | Marine matter
SRP019913 | SRR789380 | SRP021115 | SRR828634 | Red sea | Marine matter
SRP016067 | SRR585697 | ERP002469 | ERR260218 | Gut microbiota | Gut microbiota
ERP001737 | ERR164410 | SRP021115 | SRR828634 | Sea water | Marine matter
SRP016067 | SRR585786 | ERP002469 | ERR260187 | Gut microbiota | Gut microbiota
SRP006764 | SRR203323 | ERP002469 | ERR260229 | Gut microbiota | Gut microbiota
SRP016067 | SRR585765 | ERP002469 | ERR260255 | Gut microbiota | Gut microbiota
SRP016067 | SRR585751 | ERP002469 | ERR260244 | Gut microbiota | Gut microbiota
ERP001227 | ERR091529 | SRP009395 | SRR371574 | North sea water | Subtropical water
SRP006764 | SRR203326 | ERP002469 | ERR260167 | Human gut | Gut microbiota
SRP016067 | SRR585755 | ERP002469 | ERR260143 | Gut microbiota | Gut microbiota
ERP001737 | ERR164412 | SRP009476 | SRR385735 | Sea water | Sea water
SRP009476 | SRR385735 | ERP001737 | ERR164412 | Sea water | Sea water
SRP009489 | SRR389087 | ERP001227 | ERR091529 | Stromatolitic mats | North sea water
SRP004929 | SRR653001 | SRP009476 | SRR385733 | Hydrothermal vents | Sea water
SRP017582 | SRR634694 | ERP001568 | ERR149036 | Oil sands | Polluted water
SRP013944 | SRR516952 | ERP001068 | ERR059348 | Forest soil | Low and neutral PH soil
ERP001068 | ERR059348 | SRP013944 | SRR516952 | Low and neutral PH soil | Forest soil
SRP017582 | SRR636581 | SRP013413 | SRR770018 | Oil sands | Wastewater
SRP013413 | SRR770018 | SRP017582 | SRR636581 | Wastewater | Oil sands
ERP001068 | ERR059347 | SRP013944 | SRR516950 | Low and neutral PH soil | Forest soil
SRP013944 | SRR516950 | ERP001068 | ERR059347 | Forest soil | Low and neutral PH soil
ERP001736 | ERR315859 | ERP001737 | ERR315856 | Sea water | Sea water
ERP001737 | ERR315856 | ERP001736 | ERR315859 | Sea water | Sea water
SRP017119 | SRR616198 | SRP021107 | SRR935354 | Viral (surface water) | Viral particles
ERP001227 | ERR091558 | ERP001736 | ERR315859 | North sea water | Sea water
SRP013413 | SRR770574 | SRP013944 | SRR516953 | Wastewater | Forest soil
ERP001227 | ERR091556 | SRP021115 | SRR828640 | North sea water | Marine matter
SRP021115 | SRR828640 | ERP001227 | ERR091556 | Marine matter | North sea water
SRP018411 | SRR681084 | SRP017582 | SRR636569 | Snottites biofilms | Oil sands
SRP009802 | SRR389128 | SRP017582 | SRR634695 | Mud volcanoes | Oil sands
SRP017119 | SRR611810 | ERP001227 | ERR091556 | Viral (surface water) | North sea water
SRP016067 | SRR585746 | ERP002469 | ERR260239 | Gut microbiota | Gut microbiota
SRP017582 | SRR636701 | SRP013413 | SRR770019 | Oil sands | Wastewater
SRP013413 | SRR770019 | SRP017582 | SRR636701 | Wastewater | Oil sands
SRP015779 | SRR579290 | ERP001038 | ERR056990 | Gut microbiota | Gut microbiota
ERP001038 | ERR056990 | SRP015779 | SRR579290 | Gut microbiota | Gut microbiota

Table B.1: List of significant biological discoveries found by Amordad
Abstract
Recent advances in high-throughput sequencing technologies have provided scientists with unprecedented abilities to unravel the secrets of nature. Substantial cost reduction in such technologies has eliminated a major bottleneck in biological studies and led to the initiation of projects that generate terabytes of data. This phenomenon demands novel algorithmic solutions to deal with massive amounts of data. Unfortunately, most existing algorithms suffer from scalability issues and their performance is negatively affected as data size grows.

This dissertation is focused on structuring biological data at massive scale. Alignment-free representations are used in this work due to their computational advantages. Such representations are transformed to a geometric context, enabling new interpretations of the problems and introducing a novel set of techniques to organize, handle and compare sequence data very efficiently. This approach is employed to study two different scenarios: (1) in local alignment-free sequence comparison, a pair of sequences is searched for the most similar fixed-length fragments; (2) in next generation sequencing data, numerous samples resulting from sequencing experiments are given and the goal is to build a content-based database engine such that new data can be rapidly queried for similar existing samples. To cope with the computational requirements, no downstream analysis such as mapping or assembly is undertaken.

Several examples are presented to demonstrate the ability of the designed data structures to address problems under both scenarios. Particular attention is paid to the latter due to the importance of having a rapid data retrieval framework for next generation sequencing technologies. With the flexibility in representation and the achieved scalability in the indexing scheme of the database engine, actionable inferences and recommendations can be drawn in seconds for a variety of heterogeneous biological data types.