STATISTICAL AND COMPUTATIONAL APPROACHES FOR ANALYZING METAGENOMIC SEQUENCES WITH REPRODUCIBILITY AND RELIABILITY by Xin Bai A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS) May 2022 Copyright 2022 Xin Bai Acknowledgements First of all, I would like to extend my most sincere thanks to my advisor, Dr. Fengzhu Sun, for all his guidance and support during the pursuit of my Ph.D. degree. I had the fortune to work with Dr. Sun even before I came to USC and he has always been supportive to my research and life. For example, Dr. Sun provided me extensive help when I got stuck in research, offered me frank suggestions when I was at junctions of life, and applauded me when I made accomplishments. During the journey, I have also been deeply inspired by his enthusiasm to research, persistence when facing challenges, dedication to details, and high integrity at all times. Without his help, I would not have come to USC, explored the field of computational biology, and become the man I am today. Dr. Sun has literally changed my life and I am sure what I have learned from him will continue to shed light on my life. I would also like to thank Dr. Yingying Fan for her careful guidance on my first research project, during which she has offered many useful suggestions and devoted much of her efforts towards the publication of the work. I have also learned a lot from her such as the theoretical insight of a statistician. I would also like to thank my lab senior Dr. Jie Ren for her contributions to my Ph.D. research, especially for inspiring me to explore the idea of detecting out-of-distribution genomic sequences. ii I would also like to thank my oral and dissertation committee members, Dr. Liang Chen, Dr. Doc Edge, Dr. Jed Fuhrman, and Dr. Michael Waterman, for their savvy suggestions and help during the research. I would also like to thank all the faculty who have helped me during the Ph.D. studies includ- ing Dr. Jinchi Lv, Dr. Sonia Michail, Dr. Remo Rohs, and Dr. Andrew Smith to name a few. I would also like to thank my past and current labmates at Sun lab: Dr. Mengge Zhang, Dr. Han Li, Dr. Yang Lu, Dr. Fang Zhang, Dr. Weili Wang, Dr. Kujin Tang, Zifan Zhu, Siliangyu Cheng, Yilin Gao, Tianqi Tang, Beibei Wang, Yuxuan Du, Dallace Francis, Jiawei Huang, and Wenxuan Zuo. Becoming a Ph.D. is not a piece of cake, but working with you all has made the journey joyful and unforgettable. I would also like to thank my friends and peers at USC including Nibal Arzouni, Brendon Cooper, Robel Dagnew, Wei Jiang, Dr. Jinsen Li, Dr. Wenzhang Li, Yuhui Li, Tsung-Yu Lu, Jingwen Ren, Ji Youn Seo, Bo Sun, Yingfei Wang, Qifan Yang, Qingyang Yin, Yuxiang Zhan and many others. Finally, I would like to thank my parents for their unconditional support throughout my life. Although they never had the privilege to attend college, they have always been pushing me to pursue a higher level of education and live a better life. This dissertation is dedicated to them. iii TableofContents Acknowledgements ii ListofTables vii ListofFigures ix Abstract xiii Chapter1: Introduction 1 1.1 Identifying sequence motifs from metagenomic sequences with reproducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.1.1 Research problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
6 1.2 Enhancing the reliability of classifying metagenomic sequences by detecting out-of-distribution sequences . . . . . . . . . . . . . . . . 7 1.2.1 Research problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Chapter 2: KIMI: Knockoff Inference for Motif Identification from molecular se- quenceswithcontrolledfalsediscoveryrate 10 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.1 Countingk-mer frequencies from metagenomic contigs . . . . . . . . . . 13 2.2.2 The knockoffs framework for variable selection and FDR control . . . . . 15 2.2.3 Generating knockoff k-mer frequencies . . . . . . . . . . . . . . . . . . . 19 2.2.4 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2.4.1 Simulation procedures for FDR control using KIMI . . . . . . . 21 2.2.4.2 Simulation procedures for FDR control using the BH procedure 23 2.2.5 Application in viral contig identification by identifying relevant k-mers . 25 2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.3.1 The relationship between FDR/power and various parameters in simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.3.2 The BH procedure and the q-value method produce very low power compared to KIMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3.3 Prediction accuracy of viral contigs based on relevantk-mers selected by KIMI is higher than that of VirFinder . . . . . . . . . . . . . . 32 iv 2.4 Discussion and conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Chapter3: MLR-OOD:aMarkovchainbasedLikelihoodRatiomethodforOut-Of- Distributiondetectionofgenomicsequences 36 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2 Materials and methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2.1 Data sets for detecting OOD genomic sequences . . . . . . . . . . . . . . 40 3.2.1.1 The bacterial data set . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2.1.2 The viral data set . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2.1.3 The plasmid data set . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2.2 Outline of the Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2.3 The framework of the likelihood ratio method . . . . . . . . . . . . . . . . 44 3.2.4 The framework of the new MLR-OOD method . . . . . . . . . . . . . . . 46 3.2.5 Investigating the effect of training sequence length on prediction accuracy and computational time . . . . . . . . . . . . . . . . . . . . . . . 50 3.2.6 Investigating the effect of testing contig length on prediction accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2.7 Investigating the effect of the genome distance between OOD testing sequences and ID training sequences on the prediction accuracy of MLR-OOD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.2.8 Investigating the effect of chimeric contigs on the prediction accuracy of MLR-OOD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.2.9 Comparison to LLR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.2.10 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3 Results . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . 55 3.3.1 CNN classification models fail to distinguish between ID and OOD testing sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3.2 MLR-OOD outperforms LLR on all data sets . . . . . . . . . . . . . . . . . 57 3.3.2.1 Markov order estimation for different testing data sets . . . . . 57 3.3.2.2 The optimal training length for MLR-OOD is 250bp . . . . . . . 58 3.3.2.3 The computational time of MLR-OOD . . . . . . . . . . . . . . . 59 3.3.2.4 The comparison between MLR-OOD, max-LL, and LLR on different lengths of testing sequences . . . . . . . . . . . . . . . 60 3.3.3 MLR-OOD is robust to the GC content on all data sets . . . . . . . . . . . 63 3.3.4 The prediction accuracy of MLR-OOD increases with the Mash distance threshold for choosing the OOD testing sequences . . . . . . . . 64 3.3.5 The impact of chimeric contigs on the prediction score and prediction accuracy of MLR-OOD . . . . . . . . . . . . . . . . . . . . . . 66 3.4 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Chapter4: Conclusionsandfuturework 72 4.1 Identifying sequence motifs based on deep learning . . . . . . . . . . . . . . . . . 73 4.2 Detecting OOD genomic sequences using autoencoder . . . . . . . . . . . . . . . 74 v Chapter5: Supplementarymaterials 77 5.1 Chapter 2 supplementary materials . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.1.1 Knockoff variables construction via knockoff contigs . . . . . . . . . . . . 77 5.1.2 Knockoff inference based on the log-ratio transformed k-mer frequencies for predicting viral contigs . . . . . . . . . . . . . . . . . . . . 79 5.1.3 The relationship between FDR/power and the target FDR ρ based on simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.1.4 The BH procedure and theq-value method produce almost no power based onp-values of likelihood ratio tests from joint logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.1.5 KIMI based on dropping one k-mer has higher accuracy of predicting viral contigs than based on the log-ratio transformation . . . . . . . . . . 82 5.2 Chapter 3 supplementary materials . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.2.1 The viral and plasmid data sets . . . . . . . . . . . . . . . . . . . . . . . . 84 5.2.1.1 The viral data set . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.2.1.2 The plasmid data set . . . . . . . . . . . . . . . . . . . . . . . . 85 5.2.2 Bayesian information criterion for estimating the Markov order . . . . . . 85 5.2.3 Generative models based on Markov chains achieve lower prediction accuracy than LSTM . . . . . . . . . . . . . . . . . . . . . . . . 86 5.2.4 The distribution of estimated Markov orders of the testing sequences in each data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.2.5 TheL 2 coefficient λ does not affect much on LLR for viral and plasmid experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.2.6 The distribution of the minimum Mash distance of OOD testing genomes to ID classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Bibliography 95 vi ListofTables 3.1 The data sets we use for benchmarking the detction of OOD genomic sequences. The details about the specific ID and OOD genera names and the construction of the training and testing data sets are discussed in Supplementary Materials. . . . 
41 3.2 The comparison among MLR-OOD, max-LL, and LLR in several aspects showing their strengths and weaknesses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.1 The viral families used in the viral genomic data set for the ID and OOD classes. *unclassified contains viruses that are currently not able to be classified into any existing families. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.2 The plasmid genera used in the plasmid genomic data set for the ID and OOD classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.3 The distribution of the estimated Markov orders for the bacterial sequences in theTest2016 data set. The testing data set for each contig length contains 10k ID and 10k OOD contigs. We show the number of contigs that are estimated by BIC to have a specific order. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.4 The distribution of the estimated Markov orders for the bacterial sequences in theTest2018 data set. The testing data set for each contig length contains 10k ID and 10k OOD contigs. We show the number of contigs that are estimated by BIC to have a specific order. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.5 The distribution of the estimated Markov orders for the sequences in the viral testing data set. The testing data sets for contig lengths 250bp, 500bp, and 1,000bp contain 10k ID and 10k OOD contigs. The testing data set for contig length 2,500bp contains 5k ID and 5k OOD contigs. The testing data set for contig length 5000bp contains 3k ID and 3k OOD contigs. We show the number of contigs that are estimated by BIC to have a specific order. . . . . . . . . . . . . 90 vii 5.6 The distribution of the estimated Markov orders for the sequences in the plasmid testing data set. The testing data set for each contig length contains 10k ID and 10k OOD contigs. We show the number of contigs that are estimated by BIC to have a specific order. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 viii ListofFigures 2.1 The maximum Spearman correlation betweenk-mer frequencies for different k. The contigs are simulated from a third order Markov model with a randomly generated transition probability matrix. The number of contigs is 500 and the contig length is 10kbp. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.2 The relationship between power/FDR and various parameters in simulation studies. a, The relationship between power/FDR and the number of relevant k-mersκ . b, The relationship between power/FDR and the signal amplitudea. c, The relationship between power/FDR and the number of contigs n. d, The relationship between power/FDR and the length of contigsL. The thick black horizontal line indicates the target FDR atρ = 0.2. Standard error of the mean (SEM) calculated from 100 rounds is added to each data point. . . . . . . . . . . . 29 2.3 KIMI outperforms both the BH procedure and the q-value method by simul- taneously achieveing controlled FDR and high power. Default parameters for simulation studies are used for the comparison. We implement the BH procedure and theq-value method based onp-values calculated from joint logistic regres- sion implemented by the R package "glm", marginal p-values calculated from independent WMW tests, and conditionalp-values of post selection inference for lasso by the R package "selectiveInference". The black horizontal line indicates the target FDR atρ = 0.2. 
Standard error of the mean (SEM) calculated from 100 rounds of simulations is added to each data point. . . . . . . . . . . . . . . . . . . 30 2.4 Prediction accuracy of viral contigs based on relevantk-mers selected by KIMI is higher than that of VirFinder. Thek-mer size is fixed as 3. a, The AUC values on the testing set for different contig lengths by using all k-mers (VirFinder) and subset ofk-mers selected by KIMI. Error bars indicate the standard deviation of AUC values calculated from 30 rounds. b, The fractions of 3-mers selected by knockoffs for different contig lengths. Error bars indicate the standard deviation of fractions calculated from 30 rounds. c, Among those 3-mers that are selected by KIMI in at least 1 round, the fractions of 3-mers that are consistently selected by KIMI in 100% (30 times) and 80% (24 times) of all the 30 rounds. . . . . . . . . . 34 ix 3.1 The complete workflow of MLR-OOD for predicting OOD sequences. . . . . . . . 49 3.2 The bar plots of the maximum class probabilities of 10k ID and 10k OOD sequences in each type of testing data set. The contig length of the testing sequences is 250bp. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.3 The prediction accuracy of MLR-OOD on the Test2016 bacterial data set based on different training/testing sequence lengths. Each bar shows the mean accuracy of 30 random repetitions for a particular training/testing sequence length. Error bars indicate the standard deviation. . . . . . . . . . . . . . . . . . . 59 3.4 The prediction accuracies of LLR, max-LL, and MLR-OOD for OOD genomic sequences detection on the Test2016, Test2018 bacterial data sets, viral data sets, and the plasmid data sets. (A) and (B): AUROC and AUPRC for the bacterial Test2016 data set. (C) and (D): AUROC and AUPRC for the bacterialTest2018 data set. (E) and (F): AUROC and AUPRC for the viral data set. (G) and (H): AUROC and AUPRC for the plasmid data set. Each bar shows the mean accuracy of 30 random repetitions for each method and a particular sequence length. Error bars indicate the standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.5 The relationship between the GC content and the prediction scores of LLR, max-LL and MLR-OOD on three types of testing sequences. Each subfigure contains 500 randomly selected ID and 500 randomly selected OOD 250bp testing sequences. For the bacterial data set, we use theTest2016 data set. For the viral and the plasmid data sets, we select the prediction score corresponding to theµ yielding the highest prediction, that is,µ = 0.1 for the viral data sets andµ = 0.2 for the plasmid data sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.6 The relationship between the Mash distance threshold for choosing OOD sequences and the prediction accuracy of MLR-OOD on the bacterialTest2016, viral, and the plasmid data sets. (A) and (B): AUROC and AUPRC for the bacterial Test2016 data set. (C) and (D): AUROC and AUPRC for the viral data set. (E) and (F): AUROC and AUPRC for the plasmid data set. Each point shows the mean accuracy of 30 random repetitions for each method and a particular sequence length. Error bars indicate the standard deviation. The x-axis represents the minimum Mash distance threshold for choosing OOD testing genomes. The contig length is fixed as 1,000bp. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 x 3.7 The impact of chimeric contigs on the prediction score and prediction accuracy of MLR-OOD. 
(A), (C), and (E): the violin plots of the prediction scores of bacterial Test2016, viral, and plasmid chimeric contigs. Thex-axis represents the fraction of ID sequences in the chimeric contigs. The contig length is fixed at 1,000bp. (B), (D), and (F): the AUROC for classifying bacterial Test2016, viral, and plasmid contigs containing different fractions of ID sequences. For example, "ID 0/25" represents classifying contigs containing 0% and 25% ID sequences. Each bar shows the mean accuracy of 30 repetitions based on 1k randomly drawn testing contigs from each chimeric contig set. Error bars indicate the standard deviation. 69 5.1 The relationship between power/FDR and the target FDRρ . The thick black line indicates the target FDR from 0.1 to 0.3 by the step of 0.05. Standard error of the mean (SEM) calculated from 100 rounds is added to each data point. Default parameters are used for simulations. . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.2 The power/FDR of the BH procedure and theq-value method based onp-values of LRT calculated from joint logistic regression. Default parameters for simulation studies are used. The joint logistic regression is implemented by the R package "glm". The black horizontal line indicates the target FDR atρ = 0.2. Standard error of the mean (SEM) calculated from 100 rounds is added to each data point. . 82 5.3 Prediction accuracy of viral contigs based on relevantk-mers selected by KIMI based on the relative k-mer frequency after dropping onek-mer is higher than that based on the log-ratio transformation. Thek-mer size is fixed as 3. The AUC values on the testing set for different contig lengths by using subset of k-mers selected by KIMI based on relative k-mer frequencies after dropping onek-mer and the log-ratio transformation. Error bars indicate the standard deviation of AUC values calculated from 30 rounds. . . . . . . . . . . . . . . . . . . . . . . . . 83 5.4 The prediction accuracies of max-LL and MLR-OOD on the bacterial 250bp sequences in the Test2016 and Test2018 data sets based on two different generative models: Markov chains and LSTM. (A) and (B): AUROC and AUPRC for the bacterial Test2016 data set. (C) and (D): AUROC and AUPRC for the bacterial Test2018 data set. Each bar shows the mean accuracy of 30 random repetitions for each method and a particular sequence length. Error bars indicate the standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.5 The accuracies of LLR with model-parameterλ = 1e-4 and different values of µ = 0.1, 0.15, and 0.2 on the viral and plasmid data sets. (A) and (B): AUROC and AUPRC for the viral data set. (C) and (D): AUROC and AUPRC for the plasmid data set. The numbers in the parentheses indicate another model-parameterµ in LLR. Each bar shows the mean accuracy of 30 random repetitions for each method and a particular sequence length. Error bars indicate the standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 xi 5.6 The histograms of the minimum Mash distance of OOD testing genomes to the ID classes of bacterial, viral, and plasmid genomes. The number of sketches is specified as 1,000,000 for calculating the Mash distance. . . . . . . . . . . . . . . . 94 xii Abstract Microbial communities are ubiquitous in diverse environments and play important roles in many aspects including the human health in particular. Within a microbial community, many types of microbes exist such as bacteria, viruses, plasmids, and fungi. 
To analyze the composition and diversity of microbial communities, metagenomics offers an alternative to culturing in labora- tory which is a challenge for a majority of microbial species. Particularly, in metagenomics the DNA fragments of all samples in a specific environment are extracted and sequenced. Although metagenomics is able to capture the diversity of microbial communities, the source of metage- nomic sequences is lost due to the nature of metagenomic sequencing. Many computational methods have been proposed towards deciphering metagenomic sequences including assembly, binning, and labeling the taxonomic information. This dissertation presents two novel computational methods to address reproducibility and reliability in the analysis of metagenomic sequences. In Chapter 2, we present KIMI for identi- fying sequence motifs (k-mers) from binary types of sequences with controlled false discovery rate (FDR). By adapting the model-X knockoffs framework to this particular problem, we show that high power and controlled FDR are achieved simultaneously by KIMI. We also provide a real data application example to show that the prediction accuracy of viral contigs is improved if KIMI serves as a pre-screening step for identifying releventk-mers. In Chapter 3, we present xiii a maximum likelihood ratio based method, MLR-OOD, for detecting out-of-distribution (OOD) sequences that can bias the classification results of metagenomic sequences in machine learning related methods. Based on long short term memory (LSTM) which is an advanced deep learn- ing based generative model and the Markov chain likelihood, MLR-OOD is shown to achieve markedly higher prediction accuracy than other methods on bacterial, viral, and plasmid data sets. We conclude that MLR-OOD will greatly reduce false positives caused by OOD sequences in metagenomic sequence classification. xiv Chapter1 Introduction Microbial communities are complex and diverse, consisting of various types of microbes including bacteria, archaea, viruses, plasmids, phages, fungi, and protozoa [34, 45, 150]. It is estimated that there are more than10 30 microbes in the entire microbial community, accounting for 60% of the earth’s total biomass [49, 161]. Directing earth’s biological and geochemical systems, microbial communities as a whole play vital roles in the ecosystem on earth [80, 92, 134]. Many studies have analyzed microbial communities in diverse environments including soil [27, 50, 154, 181], marine [65, 75, 85, 127], freshwater [68, 93, 125], and sediments [25, 81, 156], clearly demonstrating the ubiquity and diversity of microbial communities and their functionality interacting with the local environment. Particularly, more than 1000 bacterial species live in the human body, harboring 2-4 million uncharacterized genes which could significantly impact the human health [36]. It is estimated that there are a total of 10-100 trillion microbes, most of which are bacteria, in human intesti- nal microbiota [87, 190]. Many studies have revealed the apparent links between the microbes inhabiting the human gut and human health and disease [12, 28, 30, 71, 190]. For example, the associations between human gut microbial communities and many human diseases have been 1 studied such as type 2 diabetics [26, 56, 62], obesity [37, 41, 171], inflammatory bowel disease (IBD) [64, 69, 155], and colorectal cancer (CRC) [2, 177, 188]. 
Therefore, it is critically important to utilize modern experimental techniques and computational algorithms to investigate the hu- man microbiota and its roles in multiple human diseases. As a result, international and national consortia have been established such as the Human Microbiome Project (HMP) [33, 117]. Although microbes are ubiquitous in various environments, relatively little is yet known be- cause of the challenges to culture them under standard laboratory conditions [172]. It is reported that less than 1% of the total microbes in the environment can be cultured except human and human-related environments in which the majority of the microbial genera can be cultured [6, 103, 134]. To address the weakness of culture-based methods, metagenomics has been proposed and widely applied in recent years. In metagenomics, the DNA fragments of all environmen- tal samples are directly extracted, bypassing the culturing process in the laboratory [111, 134]. By sequencing the DNA of all microbes rather than a single marker gene, metagenomic studies have produced thorough and insightful results [124, 173], including gaining information about gene functions and the genome structures, identification of novel genes, and unveiling the re- lationships between different types of microbial communities in the biogeochemical cycle [134, 150]. Due to the complexity of microbial communities, sufficiently deep and comprehensive metage- nomic sequences are needed to capture the composition of all representative microbial species, many of which exist at low abundance [150]. The advances in sequencing technologies and the consequent reduction in sequencing costs have facilitated the generation of numerous metage- nomic sequences from extensive microbial species [55]. Depending on the sequencing platform, 2 those metagenomic sequences can have various lengths. In particular, Illumina, a leading se- quencing technology that has been widely used in genomic labs, produces relatively short reads with length up to 300 basepair [150]. Since these short reads usually contain insufficient informa- tion about the particular microbial species, assembly is then performed to join short metagenomic reads to form relatively long sequences, which are called as contigs [131, 134]. Since the contigs produced by metagenomic sequencing are from a set of species instead of a pure species isolate, a key challenge is to develop computational tools for classifying these data into various taxonomic categories, so that taxon abundance profiles can be estimated [162]. A wide variety of software tools have been developed to achieve this goal [23, 54, 86, 107, 130, 135, 152, 176, 183]. For example, some alignment-based methods employing alignment algorithms such as BLAST [5], BLAT [84], or read-mapping methods like BWA [98], BOWTIE [91]. Such methods usually depend on a complete set of reference genomes from known species, which is still lacking as a large portion of the microbial species remain poorly studied [104, 147, 165]. Another widely used category of methods is composition-based methods utilizing compositional properties like GC content, codon patterns, oligonucleotide patterns, etc [111]. These methods typically employ an initial training step to build models based on the compositional properties of some known genomes in several training categories, bypassing the requirement of a complete set of reference genomes. 
Machine learning models such as support vector machine, logistic regression, and Naive Bayes are then used to classify unknown contigs into the training cate- gories [114, 141, 149]. Although decent results based on these methods have been reported, the reproducibility and reliability of the results remain uncertain. In order to bridge the gap, in this 3 dissertation, we aim to develop novel statistical and computational approaches tackling both re- producibility and reliability in the analysis of metagenomic sequences to deeply understand the principles governing the assembly and composition of microbial communities. 1.1 Identifyingsequencemotifsfrommetagenomic sequenceswithreproducibility Recently, computational methods based on the frequencies ofk-mer (also known ask-tuples,k- grams, sequence signatures) have been shown to be highly effective for classifying metagenomic sequencing reads into different groups [53, 141, 144, 151]. However, such methods usually use all thek-mers as features for prediction without selecting relevantk-mers for the different groups of sequences, i.e. unique nucleotide patterns enriched or depleted for specific metagenomic groups. These k-mers that are relevant to the taxonomic groups of the contigs are also called as motifs which are short, recurring patterns in DNA containing presumed biological significance [35]. Identifying sequence motifs not only can potentially further increase the classification accuracy of k-mer based methods by narrowing down the set of relevant features but also promote un- derstanding of the biological functions of specific microbial groups. For example, it has been reported that viral motifs are essential for understanding the replication process of viruses [79, 185]. Viral motifs also play important roles in virus-host interactions by helping to evade clus- tered regularly interspaced short palindromic repeats (CRISPR) spacers [7]. To address the motif identification problem, various existing methods have been developed [13, 58, 94, 95, 113, 115]. Yet, reproducibility remains a critically concerning issue since biological data are usually complex 4 and contain remarkable variation [146, 163] and many studies have paid particular attention on the reproducibility issue [77, 110, 121, 133]. Addressing the reproducibility issue usually boils down to the measurement of the false dis- covery rate (FDR), a concept formally described by Yoav Benjamini and Yosef Hochberg in 1995 when they proposed the Benjamini–Hochberg (BH) procedure for FDR control [17]. The idea has been a huge success and the BH procedure has been widely used [16]. Several follow-up stud- ies as well as applications to biological data have also been proposed in line with the direction [18, 19, 59, 138, 186, 189]. Despite its effectiveness and popularity, the BH procedure relies on the assumption of independence or positive independence of p-values testing the variable sig- nificance which can be difficult to guarantee especially in non-linear or high dimensional cases [29]. In some extreme cases it is almost impossible to accurately calculate the p-values. In ad- dition to the BH procedure, the q-value method [166] has been also widely used. However, the q-value method suffers similar limitations as the BH procedure since it is also based on p-values. Recently, a completely novel concept called "knockoffs" has been developed for controlling the false discovery rate without depending on the calculation ofp-values [14]. 
The concept has been later refined to "model-X knockoffs" bypassing the restriction on the dimensionality of the data or the conditional distribution of the response variable on the covariates [29]. The basic idea of model-X knockoffs is to create a set of so-called knockoff variables mimicking the dependence structure of the original variables but are irrelevant to the response conditional on the original variables. Therefore, the knockoff variables can be treated as controls for measuring the rele- vance of the original variables to the response. Theoretical analysis has been provided in [29] to guarantee that model-X knockoffs can select covariates with controlled FDR at the target level 5 under arbitrary dimensionality and arbitrary dependence structure between the covariates and the response. 1.1.1 Researchproblems In Chapter 2, we aim to develop a general framework "KIMI" based on model-X Knockoffs for se- quence motif discovery with arbitrary target FDR level. The detailed formulation of the problem is as follows. Assume we have binary types of sequences with known labels, e.g. viruses vs. bac- teria. We start from couting the occurrence frequencies ofk-mers in these sequences and we treat thek-mer frequencies as covariates. Given an arbitrary target FDR level, our target is to identify a set of motifs (k-mers) that are relevant to the binary sequence labels with controlled FDR. The major challenge comes from the collinearity between the frequencies of k-mers, since they are naturally highly correlated with each other and as compositional data the sum of allk-mer fre- quencies is exact one. Another challenge is that the joint distribution of allk-mer frequencies, a prerequisite for generating model-X knockoff variables, is unknown. Therefore, adapting to the model-X knockoff framework for identifying sequence motifs is nontrivial. To tackle these challenges, in the KIMI framework, we drop one k-mer with the largest marginal p-value to be null to break the collinearity. Motivated by the central limit theory, we also assume multivariate normal distribution of the k-mer frequencies to conveniently gener- ate the knockoff variables. Although the assumption may be somewhat misspecified, we shown through simulation studies that KIMI is effective in simultaneously controlling FDR and yield- ing high power, outperforming the broadly used BH procedure and the q-value method which yielded almost no power. The relationship between FDR/power and various parameters has also 6 been studied. To illustrate the usefulness of KIMI in analyzing real data sets, we take the viral motif discovery problem as an example and implement KIMI on a real data set consisting of viral and bacterial contigs. We use KIMI as a screening step to select k-mers relevant to the contig labels and then train the prediction model only on relevant k-mers selected by KIMI. We show that the accuracy of predicting viral and bacterial contigs can be increased compared to direclty using allk-mers without the screening step. 1.2 Enhancingthereliabilityofclassifyingmetagenomic sequencesbydetectingout-of-distributionsequences Metagenomics reveals the taxonomic composition and diversity of microbial communities by se- quencing DNA directly from an environmental sample [176]. Due to the nature of metagenomic sequencing, the sources of the metagenomic reads are lost. 
Alignment or mapping based meth- ods provide insights into the classification of metagenomic reads but are unable to deal with sequences from unknown species due to the lack of a complete reference database of genomes. Machine learning or deep learning models have been widely used for taxonomic classification of metagenomic sequences in lieu of mappping the sequences to a known database [43, 53, 54, 100, 145]. Such models are usually trained based on sequences in several training classes in hope of accurately classifying unknown sequences into these classes and many studies have re- ported high accuracy on testing data sets. However, when deploying the classification models on real testing data sets, sequences that do not belong to any of the training classes, referred to as out-of-distribution (OOD) sequences, may be present. Due to the complexity of microbial com- munities, OOD sequences are ubiquitous in metagenomic studies but are rarely tackled based on 7 the assumption that these OOD sequences will receive low classification scores in real applica- tions. Unfortunately, this assumption is unlikely to remain true as some studies have reported that modern neural network classifiers may assign higher classification scores for OOD inputs [61, 66, 126]. Therefore, there is no guarantee for the reliability of the classification results since many OOD sequences may be falsely classified into one of the training classes. Detecting OOD samples has been an active research topic in machine learning as many meth- ods have been proposed to tackle this problem [42, 72, 76, 97, 101, 143, 159, 178]. Yet, most of the studies focus on dealing with image data and relatively little is available for genomic sequences. Given the distinctions between image data and genomic sequence data are evident such as im- age data are multidimensional and sequence data are one-dimensional, the methods proposed for image data cannot be directly applied to genomic sequence data or have much lower predic- tion accuracy. Therefore, accurate methods directly addressing the detection of OOD genomic sequences are urgently needed, so that the reliability of classifying metagenomic sequences can be enhanced by screening out OOD sequences. To the best of our knowledge, the likelihood ratio method (LLR) proposed by Ren et al. [143] is the only available method tackling the detection of OOD genomic sequences by far. Based on the likelihood ratio derived from the generative models trained for capturing the semantic and background parts of the sequences, LLR has been shown therein to be effective in detecting OOD sequences, although there is still room for improvement in terms of the prediction accuracy [143]. Another drawback of the LLR method is that it involves tuning two extra model parameters that can greatly affect the prediction accuracy [143], impeding the method to generalize well in real applications when OOD validation data are lacking. 8 1.2.1 Researchproblems In Chapter 3, we aim to develop novel computational methods that can be used to accurately detect OOD genomic sequences based on the in-distribution (ID) training sequences. We pro- pose a deep generative model-based method called "MLR-OOD" that measures the probability of a testing sequencing belonging to OOD by the likelihood ratio of the maximum of the ID class conditional likelihoods and the Markov chain likelihood of the testing sequence measuring the sequence complexity. 
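In its simplest form, the MLR-OOD score of a testing sequence $x$ is $\max_c \log p_c(x) - \log p_{\mathrm{MC}}(x)$, where $p_c$ is the generative model trained on ID class $c$ and $p_{\mathrm{MC}}$ is a Markov chain likelihood measuring the complexity of $x$ itself. The following Python sketch illustrates this score under simplifying assumptions: the names class_loglik and markov_loglik are illustrative, the Markov order is fixed here rather than selected by BIC, and the actual MLR-OOD implementation with LSTM-based class models is described in Chapter 3.

```python
# Illustrative sketch (not the MLR-OOD implementation): an MLR-OOD-style score,
# assuming `class_loglik` is a list of functions returning log p_c(seq) for each
# in-distribution (ID) class, e.g. from trained generative models.
from collections import defaultdict
from math import log

def markov_loglik(seq, order=2):
    """Log-likelihood of `seq` under a Markov chain of the given order whose
    transition probabilities are estimated from the sequence itself (a simple
    proxy for sequence complexity; BIC-based order selection is omitted)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    ll = 0.0
    for i in range(order, len(seq)):
        ctx, nxt = seq[i - order:i], seq[i]
        ll += log(counts[ctx][nxt] / sum(counts[ctx].values()))
    return ll

def mlr_ood_score(seq, class_loglik, order=2):
    """Higher score = more ID-like: the maximum ID class log-likelihood minus
    the background Markov log-likelihood of the testing sequence."""
    max_ll = max(f(seq) for f in class_loglik)
    return max_ll - markov_loglik(seq, order)
```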
Compared with the LLR method, MLR-OOD takes the detailed ID sequence labels into account and train generative models for the ID classes separately, leading to a more accurate modeling of the ID sequences. MLR-OOD is also free of tuning model parameters com- pared to LLR. We compose three different microbial data sets consisting of bacterial, viral, and plasmid sequences for comprehensively benchmarking OOD detection methods. We show that MLR-OOD achieves the state-of-the-art performance compred to LLR, demonstrating the gen- erality of MLR-OOD to various types of microbial data sets. It is also shown that MLR-OOD is robust to the GC content, which is a major confounding effect for OOD detection of genomic se- quences. In conclusion, MLR-OOD will greatly reduce false positives caused by OOD sequences in metagenomic sequence classification. 9 Chapter2 KIMI:KnockoffInferenceforMotifIdentificationfrom molecularsequenceswithcontrolledfalsediscoveryrate 2.1 Introduction With the ubiquitous data available in the era of big data, the identification of key variables that are relevant to an effect has become tremendously important but challenging because of the potentially complicated dependence structure such as the nonlinearity, high dimensionality and collinearity. Yet, despite the large body of literature on variable selection, reproducibility of the results still remains challenging, especially when dealing with biological data that often exhibit significant variation [146, 163]. Among all attempts to address the reproducibility issue, measuring the false discovery rate (FDR) has been widely used. Existing methods on FDR control typically depend on calculatingp- values measuring the significance of variables, such as the very widely used Benjamini-Hochberg (BH) procedure [17]. One basic assumption of the BH procedure is that validp-values satisfying additional conditions such as independence or positive dependencies can be calculated for testing 10 the variable significance. However, this can be difficult to guarantee especially in nonlinear or high dimensional cases. Recently, Candes et al. ([29]) proposed a new framework of model-X knockoffs to control the FDR bypassing the use of p-values. The salient idea of model-X knock- offs is to create the so-called knockoff variables that mimic the dependence structure of original variables but are irrelevant to the response conditional on the original variables. Thus, knockoff variables can act as controls for assessing importance of original variables. It is proved therein that model-X knockoffs can control FDR at the target level under arbitrary dimensionality and arbitrary dependence structure between covariates and the response. Due to the rapid development of sequencing technologies, innumerable metagenomic reads have been generated, while the understanding of the underlying principles governing the as- sembly of microbial communities remains far behind. Recent studies focusing on the microbial community have provided us deep insights into the relationship between microbial communities and human diseases [40], the environment [70], and the biosphere [164], etc. Different groups of microorganisms, such as bacteria, viruses, plasmids, etc, play distinct roles and interact with each other in the microbial community. Due to the nature of shotgun sequencing, the sources of the short reads are lost. One basic yet essential problem is to classify sequences into different categories in mixed metagenomic data sets. 
Several computational tools have been developed to solve this problem [53, 141, 144, 151]. Among all these efforts, VirFinder [141], a novel k- mer based classification tool, uses frequencies of all k-mers (ork-tuples,k-grams) as features to build the model for classifying viruses and bacteria. Although high accuracy can be achieved by VirFinder for the prediction of viral contigs, it remains important to know whichk-mers are indeed uniquely enriched or depleted in viruses. While logistic regression with lasso penalty can produce some support recovery, it is likely to miss some important ones or to have inflated FDR. 11 Therefore, the understanding towards the scientific problem of the accurate identification of vi- ral specific motifs, i.e., k-mers that have different abundances in viruses and bacteria, is lacking. The identification of specific k-mers containing particular biological significance to a group of species, also named motifs considered as patterns of residues or regions within nucleotide se- quences, is a key part of understanding function and regulation within biological systems [105]. Taking viruses as an example, understanding viral motifs could not only help us identify viral se- quences more accurately, but also promote the understanding of virus-host interactions, in which shuffled motifs help to evade clustered regularly interspaced short palindromic repeats (CRISPR) spacers [7]. Despite the fact that many existing studies are available for motif finding [58, 94, 95, 113, 115], none of them could simultaneously control the FDR. In this chapter, we start from comparing the occurrence frequencies ofk-mers in two types of sequences given the sequence labels. Our target is to identify important motifs (k-mers) that are relevant to the binary sequence types with controlled FDR. One challenge is that the frequencies of all k-mers, as compositional data, sum up to one, resulting in the covariates being perfectly collinear. A common practice to overcome this problem is to drop one k-mer. However, this does not completely solve our problem because the remaining k-mer frequencies can still have very high collinearity. Another natural way to deal with compositional covariates is to use the log-ratio transformed covariates, which is equivalent to taking log of the covariates with the constraint that the coefficients of the covariates sum up to 0, as proposed in [102]. We adopt the former one for KIMI throughout the paper and compare the performance of KIMI with the results based on the log-ratio transformation in the real data analysis. The second challenge is that the joint distribution of covariates is assumed to be known for the construction of knockoff variables, which specifically means the joint distribution of all k-mer frequencies needs to be known. Thus, 12 besides the afore-mentioned issue of high collinearity, we also face the problem of unknown joint covariate distribution. These challenges indicate that it is difficult to naively adapt the model-X knockoffs framework to our motif identification problem. We present KIMI, a general framework based on knockoff inference, for motif identification from binary types of molecular sequences with FDR control. First, we briefly review the frame- work of model-X knockoffs. The key step for FDR control and for overcoming the collinearity issue is to generate valid knockoff variables adapting to our problem. 
To tackle this, we directly generate the knockoff variables for the $k$-mer frequencies of the original sequences by assuming a multivariate normal distribution, which is shown to control FDR and obtain high power simultaneously in simulation studies. The widely used BH procedure and the $q$-value method for FDR control, on the contrary, yielded almost no power. Finally, we consider an application example of viral motif identification, in which we test KIMI on real viral and bacterial contigs from VirFinder [141]. Our results show that higher prediction accuracy can be obtained by training the prediction model only on the relevant $k$-mers selected by KIMI, although predicting the contig labels is not our primary goal. This application example demonstrates the promising use of KIMI for analyzing real data sets. The KIMI software package is publicly available at https://github.com/xinbaiusc/KIMI.

2.2 Methods

2.2.1 Counting $k$-mer frequencies from metagenomic contigs

Metagenomic reads are generally short, consisting of several hundred basepairs, which makes statistical analysis challenging. The first step in most metagenomic studies is to assemble the reads into contigs, consecutive regions of genomes covered by overlapping reads. To facilitate statistical analysis, we consider contigs with length above a certain threshold. Suppose there are $n$ contigs, each labeled as $y_i = 1$ or $y_i = 0$, $1 \le i \le n$, corresponding to two different categories, for example, viral or bacterial contigs. Our objective is to develop a method that can classify whether an observed contig comes from a virus or a bacterium and simultaneously identify the key factors driving the successful classification. We will consider similar problems for multi-type classification in future studies.

We represent the $i$-th contig as $Z_i = Z_{i,1} Z_{i,2} \cdots Z_{i,L}$, where $Z_{i,j} \in \mathcal{A} = \{A, C, G, T\}$ denotes the $j$-th nucleotide in the $i$-th contig and $L$ stands for the contig length. For any word $w = w_1 w_2 \cdots w_k$ of length $k$, we count its number of occurrences in contig $Z_i$ and denote it as $N_i(w)$. The $k$-mer counts are then normalized into frequencies, denoted $f_i(w)$, by dividing by the total count of all $k$-mers. Some previous studies used logistic regression on $k$-mers [3, 141] to distinguish viral contigs from bacterial ones, showing strong associations between the composition abundance of $k$-mers and contig labels. In this study, we focus on identifying $k$-mers that are particularly relevant to binary types of sequences, i.e., motifs carrying essential biological meaning for characterizing particular types of sequences. Hereafter, to simplify the presentation, we use $S_0$ to denote the set of such $k$-mers and refer to them as relevant $k$-mers.

For any contig $Z_i$, the $k$-mer frequencies always add up to one, i.e., $\sum_w f_i(w) = 1$. Thus, naively including all $k$-mer frequencies will lead to perfect collinearity in many computational methods. To alleviate the collinearity problem, we propose to drop the one $k$-mer with the least likelihood of being relevant to viral contig identification according to some criterion. The details for selecting the dropped $k$-mer will be discussed later in Sections 2.2.4 and 2.2.5. Thus, the total number of $k$-mers of interest will be $p = 4^k - 1$. Note that even after dropping one $k$-mer, the collinearity among $k$-mer frequencies can still be very high, which is indeed the main challenge in our study.
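As a concrete illustration of the feature construction just described, the following Python sketch counts the $k$-mer frequencies $f_i(w)$ for a set of contigs and drops the single $k$-mer least associated with the labels according to the Wilcoxon-Mann-Whitney criterion used in Sections 2.2.4 and 2.2.5. It is a minimal sketch, not the KIMI implementation: contigs are assumed to be plain A/C/G/T strings and $y$ a 0/1 NumPy array of contig labels.

```python
# Minimal sketch of k-mer frequency features and the "drop one k-mer" step.
from itertools import product
import numpy as np
from scipy.stats import mannwhitneyu

def kmer_frequencies(contigs, k):
    """Return an (n x 4^k) matrix of k-mer frequencies f_i(w) and the k-mer list."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {w: j for j, w in enumerate(kmers)}
    F = np.zeros((len(contigs), len(kmers)))
    for i, contig in enumerate(contigs):
        for start in range(len(contig) - k + 1):
            w = contig[start:start + k]
            if w in index:              # skip windows containing ambiguous bases
                F[i, index[w]] += 1
        F[i] /= F[i].sum()              # normalize counts into frequencies
    return F, kmers

def drop_least_relevant(F, y, kmers):
    """Drop the k-mer whose frequencies differ least between the two contig
    types, i.e. the one with the largest Wilcoxon-Mann-Whitney p-value."""
    pvals = [mannwhitneyu(F[y == 1, j], F[y == 0, j]).pvalue
             for j in range(F.shape[1])]
    j_drop = int(np.argmax(pvals))
    keep = [j for j in range(F.shape[1]) if j != j_drop]
    return F[:, keep], [kmers[j] for j in keep]
```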
Hereafter, to simplify notation, we use $x_i = (X_{i1}, \cdots, X_{ip})' \in \mathbb{R}^p$, $i = 1, \cdots, n$, to denote the vectors of transformed $k$-mer frequencies (the transformation will be specified later), and $y_i$'s to denote the corresponding contig labels. The problem of identifying the set of useful $k$-mers $S_0$ from the full set of candidates is termed variable selection in the statistics literature. We aim at recovering $S_0$ from the observed $(x_i, y_i)$, $i = 1, \cdots, n$, with controlled false discovery rate.

2.2.2 The knockoffs framework for variable selection and FDR control

To evaluate variable selection methods, we first introduce some performance measures. Let $\hat{S}$ be the set of $k$-mers selected by some statistical method. The associated false discovery rate (FDR) can be defined as

$\mathrm{FDR} = E[\mathrm{FDP}]$, with $\mathrm{FDP} = \dfrac{|S_0^c \cap \hat{S}|}{|\hat{S}|}$,   (2.1)

where FDP means false discovery proportion, $|\cdot|$ denotes the cardinality of a set, and $S_0^c$ is the complement of $S_0$, that is, the set of non-relevant $k$-mers. Correspondingly, the power, which is the expected fraction of selected relevant $k$-mers, is defined as

$\mathrm{Power} = E[\mathrm{TDP}]$, with $\mathrm{TDP} = \dfrac{|S_0 \cap \hat{S}|}{|S_0|}$,   (2.2)

where TDP means true discovery proportion. We aim to control FDR and achieve high power simultaneously for the motif discovery problem.

In Candes et al. (2018), a general model-X knockoffs framework was proposed to control FDR. It can be regarded as a wrapper that can be combined with any underlying variable selection method that assigns variable importance measures to achieve FDR control. It has been proved in Candes et al. (2018) that the model-X knockoffs framework controls FDR at the desired level with finite sample size for any dependence structure between the response and covariates. It also automatically adapts to the collinearity level in the covariates. Therefore, we start from the idea of model-X knockoffs and adapt it to select relevant $k$-mers. We next give a brief review of the model-X knockoffs framework.

The salient idea of model-X knockoffs can be intuitively understood as creating a set of "fake" variables, the so-called knockoff variables, that mirror the dependence structure of the original variables but are irrelevant to the dependent variable $Y$ conditional on the original variables. These knockoff variables are then used as controls for assessing the importance of the original variables. The formal definition of model-X knockoff variables [29] is given below:

Definition 1 [29] Model-X knockoffs for the family of random variables $x = (X_1, \cdots, X_p)'$ is a new family of random variables $\tilde{x} = (\tilde{X}_1, \cdots, \tilde{X}_p)'$, constructed such that
• for any subset $S \subset \{1, \cdots, p\}$, $(x', \tilde{x}')_{\mathrm{swap}(S)} \overset{d}{=} (x', \tilde{x}')$;
• conditional on the original variables $x$, the knockoff variables $\tilde{x}$ are independent of the response $Y$.

Here, $\mathrm{swap}(S)$ means we swap the entries $X_j$ and $\tilde{X}_j$ for each $j \in S$, and $\overset{d}{=}$ means that the two vectors have identical distribution.

Candes et al. ([29]) suggested a general algorithm for generating knockoff variables when the joint distribution of $x$ is known. We will discuss the details of implementation in our setting with unknown joint covariate distribution in Section 2.2.3 for ease of presentation. For now, we assume such valid model-X knockoffs have been generated. Let $\tilde{x}_i$ be the knockoff copy of $x_i$, and $x^{(i)} = (x_i', \tilde{x}_i')' \in \mathbb{R}^{2p}$ the augmented feature vector.
With $(x^{(i)}, y_i)$, $i = 1, \cdots, n$, we fit the following regularized logistic regression with the $L_1$ penalty:

$\min_{b_0 \in \mathbb{R},\, b \in \mathbb{R}^{2p}} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \left( b_0 + b' x^{(i)} \right) - \log\!\left( 1 + \exp\!\left( b_0 + b' x^{(i)} \right) \right) \right] + \lambda \|b\|_1 \right\},$   (2.3)

where $\lambda > 0$ is the regularization parameter. In implementation, we choose $\lambda$ by the $K$-fold cross validation method based on prediction error. With the obtained solution $\hat{b} = (\hat{b}_1, \ldots, \hat{b}_{2p})'$ to problem (2.3), we then calculate the difference between the magnitudes of coefficients of the original variables and their corresponding knockoff copies, that is,

$V_j = |\hat{b}_j| - |\hat{b}_{j+p}|, \quad j = 1, \cdots, p.$   (2.4)

These statistics are called knockoff statistics as in Candes et al. ([29]). A desired property of valid knockoff statistics is to measure the importance of original variables: for relevant original variables, their knockoff statistics are expected to be large and positive, and for irrelevant ones, their knockoff statistics are expected to be small in magnitude and symmetric around zero. It has been shown that the $V_j$'s defined in (2.4) are valid knockoff statistics [29].

The set of relevant $k$-mers can then be selected as $\hat{S} = \{j : V_j \ge \tau_+\}$, where $\tau_+$ is the Knockoffs+ threshold defined as

$\tau_+ = \min\left\{ t > 0 : \dfrac{1 + \#\{j : V_j \le -t\}}{\#\{j : V_j \ge t\}} \le \rho \right\},$

with $\rho$ the prespecified target FDR. It has been proved in Candes et al. ([29]) that such a procedure controls FDR at the target level in finite sample size $n$ and arbitrary dimensionality $p$.

The key step in implementing knockoffs FDR control is the construction of valid knockoff variables. Since the joint distribution of $k$-mer frequencies is generally unknown, the general algorithm in [29] is not applicable and we need to develop new methods for constructing knockoff variables. We investigate two different approaches for constructing knockoff variables by borrowing ideas from two relevant publications [158, 51]. The approach introduced in Sesia et al. ([158]) constructs knockoff Markov contigs for each original contig based on the estimated transition matrices. One can then calculate $k$-mer frequencies from the original contigs and knockoff contigs. However, this approach does not produce valid knockoffs for $k$-mer frequencies because the $k$-mer frequencies from knockoff contigs are not exchangeable in distribution with the ones from the original contigs. Due to page limitations, we discuss the knockoff Markov contig method and provide detailed explanation on why it fails in our study in the Supplementary Materials. Instead of modeling knockoff contigs, we directly generate knockoff $k$-mer frequencies by assuming a multivariate normal distribution of the $k$-mer frequencies.

2.2.3 Generating knockoff $k$-mer frequencies

An applicable approach is to directly generate knockoffs for the $k$-mer frequencies, considering that our goal is to select relevant $k$-mer frequencies. Let $F = (f_i(w_j))$, $1 \le i \le n$, $1 \le j \le p$, be the $n$ by $p$ matrix of original $k$-mer frequencies, where $p = 4^k - 1$ is the total number of $k$-mers of interest after dropping one $k$-mer. The $n$ rows of $F$, denoted by $f_1, \cdots, f_n$, can be regarded as an estimate of multinomial probabilities, where each entry of the multinomial distribution represents one $k$-mer. Classical central limit theory motivates us to model the rows of $F$ as independent normal random vectors. We further standardize each column of $F$ to have mean 0 and standard deviation 1, and denote by $X = (x_1, \cdots, x_n)' \in \mathbb{R}^{n \times p}$ the $k$-mer frequency matrix after standardization.
Thus, the rows of $X$ all have mean 0 and a covariance matrix with diagonal entries equal to 1. In addition, we assume that the distribution of the rows of $X$ is still close to multivariate normal. Assume indeed that $x_i \sim$ i.i.d. $N(0, \Sigma_X)$ with $\mathrm{diag}(\Sigma_X) = I_p$. Then a valid construction of knockoff variables is

$\tilde{x}_i \mid x_i \sim N\!\left( x_i - \mathrm{diag}(s)\,\Omega\, x_i,\; 2\,\mathrm{diag}(s) - \mathrm{diag}(s)\,\Omega\,\mathrm{diag}(s) \right),$   (2.5)

where $\Omega = \Sigma_X^{-1}$ is the $p$ by $p$ precision matrix (the inverse covariance matrix), and $s$ is a vector of $p$ nonnegative numbers such that $2\,\mathrm{diag}(s) - \mathrm{diag}(s)\,\Omega\,\mathrm{diag}(s)$ is positive definite. Here $\mathrm{diag}(s)$ is a diagonal matrix with the entries of $s$ on the diagonal. The vector $s$ measures the dissimilarity between the original variables and their knockoff copies. We denote the knockoff $k$-mer frequency matrix by $\tilde{X} = (\tilde{x}_1, \cdots, \tilde{x}_n)'$.

Notice that, due to the high dimensionality of the data, the normality assumption may be misspecified and the constructed knockoff variables may not be ideal. However, we evaluate KIMI by its variable selection and prediction performance, and show that it achieves controlled FDR and high power simultaneously in simulations, and high prediction accuracy of the contig labels in real data analysis. In addition, there have been studies in the literature showing that the knockoffs framework can achieve asymptotic FDR control even under some model misspecifications [15, 51, 52].

To construct valid knockoff variables for the $k$-mer frequencies, we need the joint distribution of the normal random vector $x_i$, which is equivalent to knowing the precision matrix $\Omega$. We assume that $p < n$, which is generally not a problem since the number of contigs is usually abundant compared to the number of $k$-mers. For example, VirFinder analyzed a real metagenomic data set from Qin et al. (2014) in which 325,020 contigs longer than 1000bp were identified, and VirFinder uses $k$ up to 8. Therefore, we estimate $\Omega$ by first calculating the sample covariance matrix $\hat{\Sigma}_X$ of $X$ and then calculating its inverse. This method is theoretically feasible because the perfect collinearity is broken after dropping one $k$-mer. However, in order to avoid numerical instability in inverting $\hat{\Sigma}_X$ caused by high collinearity, we normalize $\hat{\Sigma}_X$ by dividing by the absolute value of the median of the entries in the sample covariance matrix. The estimated precision matrix is then calculated by inverting the normalized sample covariance matrix. The original knockoff frequencies are normalized correspondingly to match the normalized covariance matrix. For ease of presentation, we slightly abuse notation and still use $x_i$'s to denote the original variables and $\tilde{x}_i$'s to denote their knockoff copies.

2.2.4 Simulation studies

2.2.4.1 Simulation procedures for FDR control using KIMI

We start from simulating contigs using Markov models, given that Markov chains have been widely used in molecular sequence analyses [4, 8, 9, 10, 21, 22, 140, 179]. Note that the Markov model assumption is only needed to facilitate the simulation studies and is not essential, since no assumptions on the data are needed for KIMI. We first generate $n$ Markov contigs with order $r$ and length $L$. The contigs are simulated from the same model and their labels will be determined later by prespecified relevant $k$-mers. The transition probability matrix $T$ of the Markov contigs, with dimension $4^r \times 4$, is randomly simulated as follows. For each row of $T$, we sample 4 random numbers following a uniform distribution on (0,1) and then normalize by dividing by the sum of these 4 numbers, ensuring that the row sums to 1. These normalized numbers are treated as the row entries of $T$.
2.2.4 Simulation studies

2.2.4.1 Simulation procedures for FDR control using KIMI

We start by simulating contigs using Markov models, given that Markov chains have been widely used in molecular sequence analyses [4, 8, 9, 10, 21, 22, 140, 179]. Note that the Markov model assumption is only needed to facilitate the simulation studies and is not essential, since KIMI makes no assumptions on the data. We first generate $n$ Markov contigs with order $r$ and length $L$. The contigs are simulated from the same model, and their labels will be determined later by prespecified relevant k-mers. The transition probability matrix $T$ of the Markov contigs, with dimension $4^r \times 4$, is randomly simulated as follows. For each row of $T$, we sample 4 random numbers from a uniform distribution on $(0,1)$ and then normalize by dividing by the sum of these 4 numbers so that the row sums to 1. These normalized numbers are treated as the row entries of $T$. We repeat the random sampling until all rows of $T$ are generated. After that, we simulate Markov contigs by starting from the initial distribution with equal probabilities $(1/4^r, \cdots, 1/4^r)$ and then extending each contig according to the transition probability matrix until the contig length reaches $L$. We count the occurrences of each k-mer in every original contig and normalize the counts into frequencies. We further standardize the k-mer frequency matrix $F$ as described at the end of Section 2.2.3, and denote by $X$ the final data matrix.

Next, we sample the binary contig labels according to the following procedure. First, we randomly pick $\kappa$ k-mers from all $4^k$ k-mers to be relevant. Second, we sample $y$ from a Bernoulli distribution with success probability given by the following logistic model
$$P(y_i = 1) = \frac{\exp(x_i^\prime \beta)}{1 + \exp(x_i^\prime \beta)},$$
where $\beta = (\beta_1, \cdots, \beta_p)^\prime$ is a coefficient vector corresponding to the effect of each k-mer on the contig labels. Irrelevant k-mers are assigned a coefficient of 0. Each relevant k-mer is assigned a coefficient of $a$ or $-a$ with equal probability, where $a$ is a positive number indicating the amplitude of the contribution to the probability of having label 1. Finally, we use the Wilcoxon-Mann-Whitney (WMW) test [112] to compare each column of $X$ between the two types of contigs and drop the k-mer with the largest p-value, i.e., the one least likely to be a relevant k-mer. A knockoff matrix $\tilde{X}$ is then constructed using equation (2.5).

Once we obtain $X$ and $y$, we can implement KIMI, starting from calculating the knockoff statistics $V_j$ using equation (2.4) for each k-mer, followed by estimating the Knockoffs+ threshold $\tau_+$ based on a target FDR $\rho$, and selecting the k-mers whose knockoff statistics are at least $\tau_+$ (a sketch of this selection step is given at the end of this subsection). We then calculate the FDP as in equation (2.1). The corresponding TDP is also calculated using equation (2.2). The above process is repeated 100 times and the means of FDP and TDP are used as estimates of FDR and power, respectively.

Several parameters may impact the overall performance of k-mer selection: 1) the number of relevant k-mers $\kappa$, 2) the signal amplitude $a$, 3) the number of contigs $n$, 4) the contig length $L$, and 5) the target FDR $\rho$. We keep the other parameters unchanged while studying the relationship between one particular parameter and power/FDR. Notice that the choice of the k-mer size of interest is usually at the user's discretion depending on the scientific problem, as long as $n > p$ is not violated; we acknowledge that a larger $k$ will likely result in higher collinearity of the variables, which may make the sample covariance matrix numerically singular, as shown in Section 2.3.1.
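As a concrete illustration of one simulation round, the sketch below carries out the KIMI selection step: it fits the $L_1$-penalized logistic regression of equation (2.3) on the augmented matrix $[X, \tilde{X}]$ with scikit-learn, forms the knockoff statistics of equation (2.4), applies the Knockoffs+ threshold, and computes the FDP and TDP of the round. The helper names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def kimi_select(X, X_knockoff, y, rho=0.2, n_folds=5):
    """Fit the L1-penalized logistic regression of equation (2.3) on [X, X_knockoff],
    compute the knockoff statistics of equation (2.4), and select k-mers with the
    Knockoffs+ threshold at target FDR rho."""
    p = X.shape[1]
    fit = LogisticRegressionCV(Cs=20, cv=n_folds, penalty="l1",
                               solver="saga", max_iter=5000).fit(
        np.hstack([X, X_knockoff]), y)
    b = fit.coef_.ravel()
    V = np.abs(b[:p]) - np.abs(b[p:])                      # knockoff statistics
    tau = np.inf
    for t in np.sort(np.abs(V[V != 0])):                   # Knockoffs+ threshold
        if (1 + np.sum(V <= -t)) / max(np.sum(V >= t), 1) <= rho:
            tau = t
            break
    return np.flatnonzero(V >= tau)

def fdp_tdp(selected, relevant):
    """False and true discovery proportions for one simulation round."""
    selected, relevant = set(selected), set(relevant)
    fdp = len(selected - relevant) / max(len(selected), 1)
    tdp = len(selected & relevant) / max(len(relevant), 1)
    return fdp, tdp
```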
2.2.4.2 Simulation procedures for FDR control using the BH procedure

The BH procedure based on p-values has been widely used for variable selection with FDR control. To adapt it to our problem, three potential approaches for calculating the p-values are applicable. The first is to compute the p-values from joint logistic regression using the R function "glm". We do not consider regularized logistic regression because, to the best of our knowledge, no available package outputs p-values in this scenario. Specifically, we carry out a logistic regression on the data $(X, y)$ to obtain, for each k-mer, the p-value indicating the statistical significance against the null hypothesis that its regression coefficient is 0 conditional on all other k-mers. Therefore, a very small p-value suggests that the corresponding k-mer is relevant. Here the data $(X, y)$ are the same as in Section 2.2.4.1.

The second is to compute the p-value for each column independently, which we refer to as marginal p-values. Note that these marginal p-values test the marginal independence of each k-mer with the response, which is generally different from our goal of testing the conditional independence of one k-mer with the response given all others. Nevertheless, this alternative has been very popular, especially in high dimensions when joint regression is difficult. We use the Wilcoxon-Mann-Whitney (WMW) test [112] to compare the transformed k-mer frequencies between the two types of contigs in this scenario.

Post-selection inference provides another perspective on data-driven variable selection by first selecting potentially relevant predictor variables and then carrying out statistical inference on those selected variables [20]. We use the R package "selectiveInference" to calculate p-values of conditional (post-selection) hypothesis tests for lasso [170] as the third option for producing p-values.

After obtaining the joint p-values from logistic regression, the marginal p-values from the WMW test, or the conditional post-selection p-values for lasso for each of the k-mers, we use the BH procedure to adjust the p-values at target FDR $\rho = 0.2$. The selected k-mers and the corresponding FDR and power are then computed from 100 rounds of repetitions. We also compare KIMI with the q-value method that has been commonly used for FDR control [166]. The details are the same as those for the BH procedure except for the p-value adjustment and are thus omitted.

The existing literature only shows that the BH and q-value approaches work well under independence or certain types of dependence assumptions on the p-values [59, 82, 138, 167]. Since the k-mer frequencies depend strongly on each other, these approaches are not really applicable in such situations. We acknowledge that the above-mentioned p-values are not ideal; to the best of our knowledge, no valid p-value calculation method fully adapts to the BH and q-value approaches for k-mer selection.
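For concreteness, the marginal-p-value variant of the BH procedure can be sketched as follows, using scipy's WMW test and a hand-rolled BH step at target FDR $\rho$. The function names are ours, and this sketch mirrors only the second of the three p-value options described above.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def marginal_wmw_pvalues(X, y):
    """Marginal WMW p-value for each k-mer column, comparing the two contig groups."""
    return np.array([
        mannwhitneyu(X[y == 1, j], X[y == 0, j], alternative="two-sided").pvalue
        for j in range(X.shape[1])
    ])

def bh_select(pvalues, rho=0.2):
    """Benjamini-Hochberg: reject the hypotheses with the smallest p-values up to the
    largest i such that p_(i) <= i * rho / p."""
    p = len(pvalues)
    order = np.argsort(pvalues)
    sorted_p = pvalues[order]
    below = np.flatnonzero(sorted_p <= (np.arange(1, p + 1) / p) * rho)
    if below.size == 0:
        return np.array([], dtype=int)
    return order[: below[-1] + 1]

# selected = bh_select(marginal_wmw_pvalues(X, y), rho=0.2)
```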
2.2.5 Application in viral contig identification by identifying relevant k-mers

We next present the usefulness of KIMI in real data analysis by applying KIMI to the real data sets used in VirFinder [141] for viral motif identification. Due to the lack of ground truth, we evaluate the performance of KIMI by measuring the prediction accuracy of contig labels based only on the k-mers selected by KIMI. In Ren et al. [141], all but one k-mer were used in lasso-penalized logistic regression for viral contig prediction. Although lasso can produce a sparse model with a set of selected k-mers, there is no guarantee of FDR control.

For each contig length $L$, we use the same training and testing sets as in Ren et al. [141]. Next, we fix the k-mer size $k$, count the number of occurrences of each k-mer, and normalize the counts into frequencies for each contig in the training set. The k-mer with the highest p-value of the WMW test comparing the distributions of k-mer frequencies in viral and bacterial contigs is then dropped. We further normalize the design matrix following the procedure discussed at the end of Section 2.2.3 and denote the resulting data as $(X, y)$. Then we apply KIMI to select relevant k-mers $\hat{S}$ based on $(X, y)$ from the training set according to the details presented in Section 2.2.3. We set the target FDR as $\rho = 0.2$. In addition to the current framework of KIMI, which drops one k-mer to avoid perfect collinearity, we also select relevant k-mers using knockoff inference on the log-ratio transformed data matrix and compare the prediction accuracy with that of KIMI. The details are shown in the Supplementary Materials.

KIMI serves as a screening step for selecting relevant k-mers that will be useful for viral contig prediction. For prediction purposes, we only include the k-mers in $\hat{S}$ as predictors. Using these selected features, a logistic regression model with the $L_1$ penalty, which is identical to the one VirFinder used, is trained on the training data and is then applied to the testing data to predict the probability of each contig in the testing set coming from the viral group. A threshold is then chosen, and contigs with predicted probabilities above it are classified into the viral group. The true positive rate (TPR) and false positive rate (FPR) vary with the threshold. To better compare the prediction performances based on KIMI and VirFinder, we plot the receiver operating characteristic (ROC) curve by varying the threshold. The area under the ROC curve (AUROC) is calculated as the measure of accuracy. For VirFinder, AUROC values based on 30 rounds of bootstrap of the prediction scores in the testing set are presented. For KIMI, we generate the knockoff k-mer frequency matrix $\tilde{X}$ 30 times and calculate the AUROC value each time. We compare the AUROC values based on KIMI with those presented in VirFinder.

2.3 Results

2.3.1 The relationship between FDR/power and various parameters in simulation studies

In this section, we present simulation results answering the following questions. First, will KIMI control FDR and yield high power simultaneously for different simulated scenarios? Second, how do FDR and power change with the different parameters involved in the simulations?

The results shown below are based on the selection of 4-mers following the simulation procedures described in Section 2.2.4.1 with the Markov order $r$ specified as 3. Hence, the dimensionality is $p = 255$ after dropping one k-mer. Although KIMI only requires $n > p$, a larger $k$ will generally increase the collinearity between k-mer frequencies, as shown in Figure 2.1. We focus on the maximum correlation because it is the key factor dominating the numerical behavior of computing the knockoff statistics by penalized logistic regression and of constructing the knockoff variables by inverting the covariance matrix to obtain the precision matrix.

Figure 2.1: The maximum Spearman correlation between k-mer frequencies for different $k$. The contigs are simulated from a third-order Markov model with a randomly generated transition probability matrix. The number of contigs is 500 and the contig length is 10kbp.

To study the relationship between FDR/power and a parameter, we vary that particular parameter while keeping all others fixed. Unless otherwise stated, we set the following default values for the parameters: 1) the number of relevant k-mers $\kappa = 20$, 2) the signal amplitude $a = 0.2$, 3) the number of contigs $n = 3000$, 4) the contig length $L = 1$kbp, and 5) the target FDR $\rho = 0.2$.

We start by exploring how the power and FDR change with the number of relevant k-mers $\kappa$. All other parameters are fixed at their default values. The power and FDR for different values of $\kappa$ are shown in Figure 2.2a.
As shown therein, the power decreases with $\kappa$, which is reasonable given that selecting all relevant k-mers becomes harder when the target set grows larger. The power nevertheless remains generally high, i.e., around or above 0.9. In addition to achieving high power, we also observe that the FDR is mostly controlled.

Next, we explore the relationship between FDR/power and the signal amplitude $a$. Figure 2.2b shows an obvious increasing trend of the power with $a$. The power at $a = 0.1$ is around 0.25 and increases steeply to above 0.9 at $a = 0.2$. The growth then slows down until the power reaches almost 1 at $a = 0.3$. The variation of the power also decreases with $a$, showing that a larger signal amplitude helps stabilize the performance of KIMI. Meanwhile, the FDR stays under control in all scenarios.

The relationship between FDR/power and the number of contigs in one round is presented in Figure 2.2c. The trend of the power is generally similar to that in Figure 2.2b, which is reasonable given that a larger sample size also strengthens the signal. The FDR also increases slightly with $n$ but remains safely under control.

We show in Figure 2.2d that the power also rises with the contig length $L$, which can be explained by the fact that the k-mer frequencies in longer contigs tend to become stationary. The FDR is again strictly under control for all values of $L$.

Among all 2,000 rounds of simulations whose results are shown in Figure 2.2, a relevant k-mer is dropped in only 5 rounds. In general, the chance of dropping a relevant k-mer is very small in the simulations. Due to page limitations, we present the effect of the target FDR $\rho$ on the power and FDR as Figure 5.1 in the Supplementary Materials.

Figure 2.2: The relationship between power/FDR and various parameters in simulation studies. a, The relationship between power/FDR and the number of relevant k-mers $\kappa$. b, The relationship between power/FDR and the signal amplitude $a$. c, The relationship between power/FDR and the number of contigs $n$. d, The relationship between power/FDR and the length of contigs $L$. The thick black horizontal line indicates the target FDR at $\rho = 0.2$. Standard error of the mean (SEM) calculated from 100 rounds is added to each data point.

Figure 2.3: KIMI outperforms both the BH procedure and the q-value method by simultaneously achieving controlled FDR and high power. Default parameters for the simulation studies are used for the comparison. We implement the BH procedure and the q-value method based on p-values calculated from joint logistic regression using the R function "glm", marginal p-values calculated from independent WMW tests, and conditional p-values of post-selection inference for lasso by the R package "selectiveInference". The black horizontal line indicates the target FDR at $\rho = 0.2$. Standard error of the mean (SEM) calculated from 100 rounds of simulations is added to each data point.

2.3.2 The BH procedure and the q-value method produce very low power compared to KIMI

We next investigate the FDR and power of the BH procedure and the q-value method using simulation procedures similar to those used for KIMI. All default parameters are set as in Section 2.3.1 for KIMI, the BH procedure, and the q-value method. The results comparing the performances of KIMI, the BH procedure, and the q-value method are summarized in Figure 2.3.
In contrast to KIMI, which simultaneously achieves controlled FDR and high power at around 0.9, both the BH procedure and the q-value method, regardless of whether joint, marginal, or post-selection p-values are used, fail to produce noticeable power, i.e., the power remains almost 0, making these methods uninformative even though the FDR is controlled. In addition, the variation of the FDR from both methods is larger than that of KIMI. The failure of both methods using p-values from joint logistic regression and post-selection inference for lasso is possibly due to the very high collinearity of k-mer frequencies, and the marginal p-values are also unsurprisingly unhelpful since they test the marginal independence between each k-mer and the response instead of the conditional independence. Notice that the R function "glm" by default uses Wald tests to compute p-values from the joint logistic regression. The results based on the likelihood ratio test (LRT) are similar and are presented as Figure 5.2 in the Supplementary Materials. These results show that neither the BH procedure nor the q-value method is applicable to our problem, highlighting the necessity and usefulness of KIMI for motif discovery.

2.3.3 Prediction accuracy of viral contigs based on relevant k-mers selected by KIMI is higher than that of VirFinder

To illustrate the application of KIMI to motif identification in real data analysis, we test KIMI on contigs sampled from real viral and bacterial genomes, as discussed in Section 2.2.5. We fix the k-mer size of interest at 3 as a simple example. Our objective is to carry out a pre-screening of 3-mers by KIMI, train a classification model based on logistic regression with lasso regularization on those selected relevant 3-mers, and predict viral contigs on the independent testing set, with a focus on selecting 3-mers with controlled FDR.

We next investigate the prediction accuracy of viral contigs in the testing data using the selected k-mers. The AUROC values measuring the prediction accuracy on the testing set for different contig lengths by KIMI and VirFinder are shown in Figure 2.4a, which shows that the AUROC values using the k-mers selected by KIMI are consistently higher than those of VirFinder for all contig lengths. For example, although longer contigs tend to have higher prediction accuracy in general, the prediction accuracy based on KIMI-selected k-mers at contig length 3kbp (5kbp) is higher than that based on VirFinder using all k-mers at 5kbp (10kbp), respectively. These results show the superiority of KIMI in selecting relevant k-mers, since we are using the same feature type (k-mer frequency) and prediction model (logistic regression with lasso regularization). The improved prediction accuracy provides evidence for the relevance of the k-mers selected by KIMI. We also show the fractions of selected relevant k-mers among all k-mers in Figure 2.4b. Around 0.6-0.75 of all the k-mers are selected by KIMI for different contig lengths, indicating that the majority of the k-mers may have different frequency compositions in viral and bacterial contigs. The large proportion of selected k-mers may be due to the high correlation between k-mer frequencies or the diverse compositions of k-mer abundance in viral and bacterial genomes. We also present the consistency of the k-mers selected by KIMI across rounds, since the construction of knockoff variables involves random sampling.
As shown in Figure 2.4c, among the k-mers that have been selected by KIMI in at least 1 round, around 80% have been consistently selected in all 30 rounds, showing that KIMI is highly stable for k-mer selection in this real data set. The fraction increases to around 85% to 90% if we lower the threshold to 24 rounds (80% of all rounds).

Lin et al. [102] developed a method to predict responses based on compositional data using log-ratios of the relative frequencies. We extend KIMI to this model and investigate the prediction accuracy of viral contigs on the same data set based on KIMI-selected variables. We show that the prediction accuracy using the log-ratio is lower than that presented in this chapter. Details of the results are given as Figure 5.3.

2.4 Discussion and conclusions

Identification of motifs relevant to specific types of molecular sequences is of paramount importance for understanding the composition of the genetic materials of particular species, accurately classifying metagenomic reads, and capturing the underlying mechanisms of how different types of molecules evolve, interact with other molecules, impact public health, etc. Although many existing methods may be applied to identify k-mers, or motifs, it is still crucial to develop methods that can enhance the reproducibility, or equivalently, control the FDR, and to the best of our knowledge no such methods are yet available. We develop KIMI, a wrapper framework based on knockoff inference for identifying k-mers relevant to binarily labeled contigs using their occurrence frequencies with guaranteed FDR control. Several key findings were shown in this chapter.

Figure 2.4: Prediction accuracy of viral contigs based on relevant k-mers selected by KIMI is higher than that of VirFinder. The k-mer size is fixed as 3. a, The AUC values on the testing set for different contig lengths using all k-mers (VirFinder) and the subset of k-mers selected by KIMI. Error bars indicate the standard deviation of AUC values calculated from 30 rounds. b, The fractions of 3-mers selected by knockoffs for different contig lengths. Error bars indicate the standard deviation of the fractions calculated from 30 rounds. c, Among the 3-mers selected by KIMI in at least 1 round, the fractions of 3-mers that are consistently selected by KIMI in 100% (30 times) and 80% (24 times) of all 30 rounds.

First, we adapted the Model-X knockoffs framework to the motif identification problem despite the difficulty arising from high collinearity among k-mer frequencies. Second, we showed through simulation studies that KIMI can simultaneously yield high power and controlled FDR, while the popularly used BH procedure and q-value method for FDR control suffer from producing very low power. Third, we demonstrated that KIMI can be reliably used in real data analysis by showing that, compared to VirFinder, the prediction accuracy of contig labels increased when only the k-mers selected by KIMI were used. The results are consistent for all contig lengths. We expect that KIMI can be generalized to any binarily labeled contigs assembled from real metagenomic data and will output k-mers at a guaranteed target FDR level.

In spite of these key findings, KIMI also has some limitations. First, KIMI assumes that the sample size is greater than the number of k-mers, which may be violated if the sample size is not large enough and a large $k$ is being investigated.
Second, even if $n > p$ is guaranteed, the high collinearity caused by a large $k$ may make the sample covariance matrix numerically singular, posing challenges for the construction of knockoff k-mer frequencies. Finally, KIMI currently deals with two types of molecules. In many real-world problems, there are many different types of molecules, and it is important to extend our framework to multiple types of molecules. This is a topic for future research.

Chapter 3

MLR-OOD: a Markov chain based Likelihood Ratio method for Out-Of-Distribution detection of genomic sequences

3.1 Introduction

Classification of metagenomic sequences into different taxons is an essential problem for understanding the compositions of microbial communities. Some mapping based computational software tools [24, 86, 116, 130, 182, 183] can rapidly and accurately classify mappable microbial sequences based on reference databases. However, due to the lack of a complete set of reference genomes, such mapping based approaches are not able to discover novel sequences in specific classes, which is becoming increasingly important since a large portion of the metagenomic sequences, the so-called "microbial dark matter", remain poorly understood [104, 147, 165]. For example, several studies have estimated that around 85%-99% of bacteria and archaea cannot be cultured [104] and that up to 60%-80% of bacterial sequences in some environments belong to unknown taxons [48, 123, 193]. Meanwhile, only several thousand viral species had been recognized as of 2017 [47]. Recently, many machine learning or deep learning based approaches have been developed to classify metagenomic sequences into several classes without depending on reference databases [43, 53, 54, 100, 145]. These approaches are usually based on classification models trained on sequences in several training classes. High classification accuracies have been reported in these studies, and it is believed that these approaches can generalize well to unknown sequences, meaning that novel sequences belonging to these classes can also be discovered. However, due to the complexity of the compositions of metagenomic sequences, it is inevitable for these approaches to encounter sequences that do not belong to any of the training classes, the so-called out-of-distribution (OOD) genomic sequences. Although microbial researchers usually tend to believe that these OOD sequences will receive low classification scores in real applications, this assumption is unlikely to hold true, as some studies have reported that modern neural network classifiers may assign high classification scores to OOD inputs [61, 66, 126]. Thus, the detection of OOD genomic sequences is urgently needed to ensure that only in-distribution (ID) sequences, i.e., those belonging to one of the training classes, will be classified, so that the credibility of taxonomic classification approaches can be improved.

The detection of OOD inputs is an active research topic in machine learning and many approaches have been proposed to address this problem [42, 72, 76, 97, 101, 143, 159, 178]. However, most of the current methods, either generative model based or discriminative model based, are designed for detecting OOD images, on which very high prediction accuracy can be easily achieved. Yet image data and genomic sequence data are distinct in nature.
For example, genomic sequences are one-dimensional data with only four possible nucleotides at each position, while images are multi-dimensional data with much more complex pixel values across different color channels. Therefore, most of these methods are either not directly applicable or have much lower prediction accuracy on genomic data [143]. Among all these attempts, the likelihood ratio (LLR) method proposed by Ren et al. particularly addressed the problem of detecting OOD genomic sequences [143]. The LLR method uses likelihood ratios of two generative models for the ID data and randomly perturbed ID data, based on the hypothesis that only the semantic part of a sequence is associated with its taxon while the background part, in contrast, biases the prediction of OOD sequences. For example, the semantic part can be understood as the coding genes or motifs specific to a taxon, and the background part can be understood as the repeated regions or transferred genes shared by sequences in different taxons. Ren et al. [143] compared LLR with nine different approaches and showed that their approach achieves the highest prediction accuracy for detecting OOD bacterial sequences. It is also shown in [143] that LLR is robust to the GC content of genomic sequences, which is a major confounding effect in detecting OOD genomic sequences. Another major contribution of LLR is that it composed the first data set consisting of bacterial sequences from multiple genera for benchmarking future OOD detection methods.

Despite its effectiveness, the prediction accuracy of LLR for detecting OOD genomic sequences is still not very high, leaving room for improvement based on more powerful methods. For example, the reported optimal area under the receiver operating characteristic curve (AUROC) for predicting 250bp bacterial OOD sequences using LLR is 0.755 [143], which is based on tuning two model-parameters. Besides, LLR requires a validation data set containing both ID and OOD sequences for model-parameter tuning, leading to the concern that a model trained on one data set may not easily generalize to another data set. For example, Ren et al. [143] used sequences in 10 bacterial genera as the ID sequences and constructed the OOD validation data set using OOD sequences from 60 other bacterial genera. However, in practice, we usually have knowledge of the ID sequences only, and OOD sequences can belong to any other classes such as sequences in other bacterial genera, viral sequences, contaminations from the human genome, etc. Therefore, it is unrealistic to expect that the model-parameters of LLR tuned on a specific validation data set can perform well on any other testing data set.

In this chapter, we present MLR-OOD, a Markov chain based likelihood ratio method, for detecting OOD genomic sequences. MLR-OOD detects OOD genomic sequences based on the likelihood ratio of the maximum of the ID class conditional likelihoods and the likelihood of the sequence under a Markov chain model mimicking the sequence complexity of the testing sequences. The rationale of MLR-OOD is based on two aspects. First, compared to LLR, which uses one general model for all ID sequences, MLR-OOD uses the maximum of the ID class conditional likelihoods, taking into account different models for sequences in the various ID classes and promoting a more precise modeling of the ID sequences.
Second, based on the assumption that input sequence complexity, which can be modeled by the Markov chain likelihood, is a factor that biases OOD detection, we propose to use Markov chain likelihoods to adjust the maximum of the ID class conditional likelihoods, bypassing the generation of the background null sequences used by LLR. Thus, MLR-OOD is completely free of tuning model-parameters. On the other hand, LLR depends on the optimal perturbation rate that needs to be determined by a validation set consisting of both ID and OOD sequences. In addition to the bacterial data sets composed by Ren et al. [143], we compose two more microbial data sets, for viruses and plasmids, to more comprehensively benchmark the performance of MLR-OOD. We show that MLR-OOD yields notably higher prediction accuracy than LLR. We also conclude that MLR-OOD is robust to the GC content on almost all data sets, while LLR can be somewhat biased on the viral and plasmid data sets.

In summary, we make two major contributions in this chapter. First, we construct viral and plasmid data sets that can be jointly used with the bacterial data set composed by Ren et al. [143] for comprehensively benchmarking the performances of different OOD detection methods for genomic sequences. Second, we develop MLR-OOD, a powerful method achieving the current state-of-the-art prediction accuracy for OOD detection of genomic sequences without tuning model-parameters on validation data sets. We believe that MLR-OOD will improve the credibility of machine learning based metagenomic taxonomic classification. The MLR-OOD software package is publicly available at https://github.com/xinbaiusc/MLR-OOD.

3.2 Materials and methods

3.2.1 Data sets for detecting OOD genomic sequences

Ren et al. [143] composed a genomic sequence data set consisting of bacterial sequences in different genera that can serve as a benchmark for detecting OOD genomic sequences. Although bacteria are abundantly distributed in microbial communities, there are many other types of molecules including viruses, plasmids, fungi, archaea, etc. [34, 45]. To benchmark the prediction accuracy of OOD detection on different types of molecules, we construct a new testing data set for bacteria and two more data sets for viruses and plasmids, respectively. A brief summary of the data sets we use is shown in Table 3.1.

Table 3.1: The data sets we use for benchmarking the detection of OOD genomic sequences. The details about the specific ID and OOD genera names and the construction of the training and testing data sets are discussed in the Supplementary Materials.

  Type                            Bacteria                Virus            Plasmid
  Name                            Test2016, Test2018      N/A              N/A
  Phylogenetic level for classes  Genus                   Family           Genus
  Number of ID classes            10                      6                6
  Number of OOD classes           60                      20               20
  Description                     Test2016: the same      Viruses whose    Plasmids whose
                                  data set as used in     hosts are        hosts are
                                  [143]; Test2018: the    bacteria and     bacteria and
                                  same genera but more    archaea          archaea
                                  novel genomes

3.2.1.1 The bacterial data set

Ren et al. have already constructed a comprehensive bacterial data set, publicly available in [143], for OOD detection. We use the same data set and add another new testing data set to demonstrate the prediction accuracy of MLR-OOD.

Specifically, Ren et al. downloaded 11,672 bacterial genomes from the National Center for Biotechnology Information (NCBI) and chopped the genomes into short 250bp sequences [143]. Different genera of these genomes were used to define the class labels (both ID and OOD classes).
For example, genomes in 10 particular bacterial genera that were discovered before 01/01/2011 were used to construct the ID training data set. The validation data set contained ID sequences in these 10 genera that were discovered between 01/01/2011 and 01/01/2016, and OOD sequences in 60 other genera discovered in the same time period. The testing data set similarly contained ID sequences in the 10 genera discovered between 01/01/2016 and September 2018, and OOD sequences in the 60 genera that did not overlap with the validation OOD genera, discovered in the same time period. We collected the accession numbers of the original bacterial genomes and downloaded them from NCBI in order to chop the bacterial genomes into non-overlapping sequence fragments of different training and testing lengths, as discussed in Sections 3.2.5 and 3.2.6. To distinguish it from our newly constructed testing data set, we refer to the testing data set consisting of the bacterial genomes used by Ren et al. as the Test2016 data set. For the details, please refer to [143].

We also constructed another testing data set consisting of ID and OOD sequences discovered between 10/01/2018 and 10/01/2021 to benchmark OOD detection methods on relatively more novel sequences. The ID and OOD sequences were from the same ID and OOD genera used by Ren et al. [143] for constructing the Test2016 data set. We refer to this new testing data set as the Test2018 data set.

3.2.1.2 The viral data set

We downloaded 1,295 viral genomes whose hosts are bacteria and archaea from NCBI and then constructed an ID training data set and a testing data set containing both ID and OOD sequences. Since viral sequences from NCBI are much fewer and shorter compared to bacterial sequences, we used different viral families to define the ID and OOD classes. Viral genomes that were in 6 particular families and were discovered before 01/01/2016 were used to construct 6 ID viral classes. Viral genomes that were in these families but were discovered between 01/01/2016 and 10/01/2021 were treated as ID testing sequences. The OOD testing data set contained viral genomes in 20 other randomly chosen families that were also discovered between 01/01/2016 and 10/01/2021. Since potentially highly similar sequences will bias the training and testing process, we use CD-HIT [57] with parameters "-c 0.95 -n 10 -M 0 -T 16" to cluster all the sequences and then remove the duplicate ones. The viral genomes were then chopped into non-overlapping sequence fragments of length 250bp for training and various lengths for testing. The details about the specific ID and OOD family names and the construction of the training and testing data sets are discussed in the Supplementary Materials.

3.2.1.3 The plasmid data set

We also downloaded 818 plasmid genomes whose hosts are bacteria and archaea from NCBI to build a data set for detecting OOD plasmid sequences. We used plasmid genera to define the ID and OOD classes, the same as for the bacterial data set. Similar to Section 3.2.1.2, we built an ID training data set containing plasmid genomes discovered before 01/01/2016 in 6 different classes. The testing data set contains plasmid genomes discovered between 01/01/2016 and 10/01/2021 in the 6 ID and 20 randomly chosen OOD classes. The plasmid genomes were similarly clustered by CD-HIT and then chopped into non-overlapping sequence fragments of length 250bp for training and various lengths for testing.
The details about the specific ID and OOD genera names and the construction of the training and testing data sets are discussed in the Supplementary Materials.

3.2.2 Outline of the methods

Suppose we have an in-distribution (ID) data set of genomic sequences $D(\mathcal{X}, \mathcal{Y})$. Let the pair $(x, y)$ denote an individual nucleotide sequence $x = x_1 \cdots x_d \cdots x_D$, $x \in \mathcal{X}$, $x_d \in \{A, C, G, T\}$, and its in-distribution class label $y \in \mathcal{Y} := \{1, \dots, k, \dots, K\}$, i.e., the specific taxon that the sequence belongs to, where $K$ denotes the number of in-distribution classes and $D$ denotes the sequence length. We are interested in the ID sequences belonging to these particular taxons and aim to detect other sequences that do not belong to any of the $K$ ID classes, i.e., $y \notin \mathcal{Y}$, which are referred to as OOD sequences, so that both ID and OOD sequences can be accurately analyzed downstream.

We develop MLR-OOD, a Markov chain based likelihood ratio method combining both the ID generative models and Markov models for the testing sequences, for the detection of OOD genomic sequences. Before discussing the details of our method, we first give a brief review of the LLR method proposed by Ren et al. [143], which was the first study paying particular attention to detecting OOD genomic sequences.

3.2.3 The framework of the likelihood ratio method

The LLR method proposed by Ren et al. uses the likelihood ratio of an original model and a background model to measure the chance of being OOD for the testing sequences [143]. The original model, denoted as $p_\theta(\cdot)$, is a generative model trained on all the ID sequences whose likelihood can naturally be used to score testing sequences for OOD detection. However, Ren et al. observed that the likelihood of the original model fails to separate ID and OOD testing sequences and is biased by the GC content of each testing sequence [143]. This observation can be explained by the fact that the original model captures both the semantic and the background parts of the ID sequences, but only the semantic part is helpful for OOD detection. The GC content is considered a confounding effect arising from the background part of the sequences, thus motivating them to build a background model that can adjust for the confounding effects of the original model. Based on the assumption that random perturbations corrupt the semantic part of the data, Ren et al. [143] randomly perturbed a fraction $\mu$ of the nucleotides in the original sequences, changing each selected nucleotide to one of the other three with equal probability, to build a null set of background sequences. The background model $p_{\theta_0}(\cdot)$ is then trained on the background sequences. For an incoming testing sequence $x$, the prediction score $S_{\text{LLR}}(x)$ is calculated as the log-likelihood ratio of the original and background models:
$$S_{\text{LLR}}(x) = \log \frac{p_\theta(x)}{p_{\theta_0}(x)} = \log p_\theta(x) - \log p_{\theta_0}(x). \tag{3.1}$$
As shown in equation (3.1), the information of the background part, which is contained in both the original and background likelihoods, is canceled out by taking the ratio, so that the semantic part in the original model stands out for OOD detection. A larger value of $S_{\text{LLR}}(x)$ indicates a higher chance of being ID (equivalently, a lower chance of being OOD) for sequence $x$.

The original and background models were trained using long short-term memory (LSTM) networks [74], a deep generative model that has been widely used to model genomic sequences [63, 90, 169], on the entire original or background sequences.
Specifically, Ren et al. [143] used a one-hot encoder to transform the genomic sequences from strings comprised of $\{A, C, G, T\}$ nucleotides to binary numeric vectors and then fed them into an LSTM layer, followed by a dense layer and a softmax function to predict the probability distribution over $\{A, C, G, T\}$ at each position. The LSTM model was jointly trained at all positions from 1 to $D$. Then, if we denote the LSTM model trained for the original model by $p_\theta(x)$, the log-likelihood $\log p_\theta(x)$ for any incoming testing sequence $x$ can be calculated as follows:
$$\log p_\theta(x) = \sum_{d=1}^{D} \log p_\theta(x_d \mid x_{<d}),$$
where $x_{<d}$ denotes all the nucleotides of sequence $x$ before position $d$. The calculation of the log-likelihood of the background model, $\log p_{\theta_0}(x)$, is similar.

Two model-parameters need to be carefully tuned during the training process of LLR: the perturbation rate $\mu$ and the coefficient of the $L_2$ regularization $\lambda$ added to the model weights when training the background model (optional). Ren et al. used a validation data set consisting of additional ID and OOD data to determine these two model-parameters [143].

3.2.4 The framework of the new MLR-OOD method

Although the LLR method achieves higher prediction accuracy and is more robust to the GC content compared to the method using only the original likelihood, there is still room for improvement in two aspects. First, the LLR method only considers a general model $p_\theta(\cdot)$ for all ID sequences and does not take the $K$ ID classes into account when modeling the ID training data. This may not be optimal since sequences in different ID classes may follow different models. Therefore, we borrow the idea of using likelihood ratios from [143] but utilize the information of the ID classes to further increase the prediction accuracy. Second, the LLR method relies on a validation data set for tuning model-parameters, that is, it requires access to some of the OOD data, which is unlikely to be available in real practice.

We first propose using the maximum of the LSTM likelihoods across all ID classes, $p_{\max,\theta}(\cdot)$, instead of $p_\theta(\cdot)$ for modeling the ID data. The high-level idea is to train the generative models for the data in each ID class separately and to choose the most appropriate model for new testing sequences by taking the maximum of the class conditional likelihoods across all ID classes. In addition to using LSTM, which is also used by LLR to train the generative models, we also tried Markov chains, which are generally more interpretable than LSTM, on the bacterial data set, but the prediction accuracy was much lower, with AUROC around 0.6 for 250bp bacterial sequences. We present the details and the results in the Supplementary Materials. Therefore, we only present our LSTM model and the corresponding results in the rest of the main text.

Let $p_{\theta_k}(\cdot)$ be the model we trained on the sequences in the $k$-th ID class. For each incoming sequence $x$, we calculate the maximum of the class conditional likelihoods as follows:
$$p_{\max,\theta}(x) = \max_{k \in \{1, \cdots, K\}} p_{\theta_k}(x), \tag{3.2}$$
where $p_{\theta_k}(x)$ denotes the LSTM likelihood of sequence $x$ under class $k$. We use the same model parameters as in [143] for training the LSTM models on the ID data for the methods listed above. In detail, the size of the hidden units in the LSTM model is 2,000, the number of epochs is 900,000, the learning rate is 0.0005, the batch size is 100, and the Adam optimizer is used to minimize the training loss. We refer to the method that uses $p_{\max,\theta}(x)$ as the prediction score as the max-LL method.
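The max-LL score in equation (3.2) only requires, for each ID class, the per-position next-nucleotide probabilities produced by that class's trained LSTM. The framework-agnostic sketch below uses helper names of our own; the `class_models` callables are assumed to return a $D \times 4$ array of softmax probabilities for an input sequence, which is how any trained generative sequence model could be plugged in.

```python
import numpy as np

NUC_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def sequence_log_likelihood(seq, predict_probs):
    """log p_theta_k(x) = sum_d log p(x_d | x_<d), where predict_probs(seq) returns
    a D x 4 array whose d-th row is the model's softmax distribution over {A,C,G,T}
    at position d given the preceding nucleotides."""
    probs = predict_probs(seq)                       # shape (len(seq), 4)
    idx = np.array([NUC_INDEX[b] for b in seq])
    return float(np.sum(np.log(probs[np.arange(len(seq)), idx] + 1e-12)))

def max_ll_score(seq, class_models):
    """Equation (3.2): the maximum of the class conditional log-likelihoods over
    the K in-distribution classes."""
    return max(sequence_log_likelihood(seq, m) for m in class_models)
```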
As a preliminary method for predicting OOD sequences proposed in this chapter, the max-LL method itself is a very powerful prediction statistic compared to $S_{\text{LLR}}$, even without background adjustment, as we will show in Section 3.3. Note that the mixture model likelihood $p_{\text{mix},\theta}(x) = \sum_{k=1}^{K} p_{\theta_k}(x)\,p(k)$ is another natural way to model the ID data more precisely, where $p(k)$ denotes the prior probability of each ID class. However, it is challenging to estimate $p(k)$ for real metagenomic data since a large portion of the metagenomic sequences still cannot be classified. Therefore, we adopt $p_{\max,\theta}(x)$ for our method.

However, since the prediction score of the max-LL method is still likely to be biased by confounding effects such as the GC content, a key question remains: how can we adjust $p_{\max,\theta}(x)$ for such confounding effects so that the prediction accuracy can be further increased without tuning model-parameters? We propose to adjust the maximum of the class conditional likelihoods $p_{\max,\theta}(x)$ by the Markov chain likelihood of the testing sequence, which can be regarded as a special case of sequence complexity. Serra et al. observed that input complexity can bias the likelihood, as relatively "simple" patterns generally have higher likelihoods than complex patterns [157]. Several measures have been proposed [129, 174] to model the complexity of genomic sequences. Among these sequence complexity measures, the CE complexity [129], based on the entropy of sequences, is essentially a constant times the log-likelihood of the testing sequence under the independent and identically distributed (i.i.d.) model, which is, equivalently, a zeroth-order Markov chain. The GC content is essentially related to the i.i.d. likelihood if we do not distinguish between the $G$ and $C$ (also $A$ and $T$) nucleotides. Inspired by the idea that the likelihood under the i.i.d. model may confound OOD detection, we generalize the idea and use the log-likelihood of each testing sequence under a Markov chain model to adjust $p_{\max,\theta}(x)$, given that Markov chains have been widely used in genomic sequence analyses [4, 8, 9, 10, 21, 22, 140, 179]. Specifically, assume that the testing sequence $x = x_1 \cdots x_d \cdots x_D$ is modeled by an $r$-th order Markov chain. Then the log-likelihood of $x$ is estimated as follows, where we safely disregard the initial distribution $\pi(x_1 \cdots x_r)$:
$$L_{MC}^r(x) = \log \pi(x_1 \cdots x_r) + \sum_w N_w \log P(w_{r+1} \mid w_-) \approx \sum_w N_w \log \frac{N_w}{N_{w_-}},$$
where $w = w_1 w_2 \cdots w_{r+1}$ denotes a word of nucleotides of length $r+1$, $N_w$ denotes the count of occurrences of $w$ in sequence $x$, $w_- = w_1 w_2 \cdots w_r$ denotes word $w$ with the last letter removed, and $N_{w_-}$ is the count of occurrences of $w_-$. When $r = 0$, $N_{w_-}$ degenerates to the sequence length $D$. The transition probability is denoted by $P(w_{r+1} \mid w_-)$ and is estimated by $N_w / N_{w_-}$ using maximum likelihood estimation.

Figure 3.1: The complete workflow of MLR-OOD for predicting OOD sequences.

Therefore, we propose to use $S^r_{\text{MLR-OOD}}(x)$ defined in equation (3.3) for OOD genomic sequence detection:
$$S^r_{\text{MLR-OOD}}(x) = \log p_{\max,\theta}(x) - L^r_{MC}(x). \tag{3.3}$$
A larger value of $S^r_{\text{MLR-OOD}}(x)$ indicates a higher probability of belonging to ID.
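A minimal sketch of this Markov-chain adjustment is given below: it estimates $L^r_{MC}(x)$ from the $(r+1)$-mer and $r$-mer counts of the testing sequence itself, following the approximation above, and combines it with the max-LL score of equation (3.2) to form $S^r_{\text{MLR-OOD}}(x)$. The helper names are illustrative, and `max_ll_score` refers to the sketch in the previous subsection.

```python
import numpy as np
from collections import Counter

def markov_log_likelihood(seq, r):
    """L^r_MC(x) ~= sum_w N_w * log(N_w / N_{w-}), disregarding the initial-distribution
    term; for r = 0, N_{w-} degenerates to the sequence length D."""
    if r == 0:
        counts = Counter(seq)
        D = len(seq)
        return sum(n * np.log(n / D) for n in counts.values())
    words = Counter(seq[i:i + r + 1] for i in range(len(seq) - r))
    prefixes = Counter(seq[i:i + r] for i in range(len(seq) - r))   # occurrences of w- as prefixes
    return sum(n * np.log(n / prefixes[w[:-1]]) for w, n in words.items())

def mlr_ood_score(seq, class_models, r):
    """Equation (3.3): S^r_MLR-OOD(x) = log p_max,theta(x) - L^r_MC(x)."""
    return max_ll_score(seq, class_models) - markov_log_likelihood(seq, r)
```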
Note that the calculation of $S^r_{\text{MLR-OOD}}(x)$ is completely free of tuning model-parameters, since $r$ is data-driven, determined by the most commonly estimated Markov order of the testing sequences, and has nothing to do with the training process. We use the Bayesian information criterion (BIC) [83] to estimate the Markov orders of the testing sequences. To facilitate the understanding of MLR-OOD, we present its complete workflow in Figure 3.1, which summarizes the above procedures.

3.2.5 Investigating the effect of training sequence length on prediction accuracy and computational time

Ren et al. [143] chopped the bacterial genomes into short sequences of 250bp for training the LSTM model. However, the effect of the training sequence length on prediction accuracy is yet unknown, since different lengths of training sequences may lead to different model performances and computational times. We study the effect of training sequence length on our MLR-OOD method by chopping the original bacterial genomes used by [143] into short sequences of 100bp, 250bp, and 500bp, respectively. We then train the LSTM models for the ID classes based on the different training lengths and test the corresponding model performances on the Test2016 bacterial testing data set with testing contig lengths of 1,000bp, 2,500bp, and 5,000bp. We follow the protocol in Section 3.2.6 to deal with different training and testing lengths. Our target is to select the best training sequence length for MLR-OOD, balancing both prediction accuracy and computational time. We use the selected training sequence length on the other data sets and compare with the LLR method.

3.2.6 Investigating the effect of testing contig length on prediction accuracy

Ren et al. present the usefulness of the LLR method by showing the prediction accuracy on 250bp testing sequences [143]. However, in reality, metagenomic reads of several hundred base pairs are usually assembled into contigs, i.e., consecutive regions of genomes covered by overlapping reads, to facilitate downstream analyses. These contigs usually have various lengths greater than 250bp. To show the effect of testing contig length on the prediction accuracy of MLR-OOD, we fix the training sequence length at 250bp, as it performs the best compared to 100bp and 500bp (see Figure 3.3), and chop the testing genomes into contigs of lengths 250bp, 500bp, 1,000bp, 2,500bp, and 5,000bp. For each contig length, except 2,500bp and 5,000bp for the viral data set, 10k ID and 10k OOD contigs are randomly selected to build the corresponding testing data sets. We use 5k ID and 5k OOD 2,500bp contigs and 3k ID and 3k OOD 5,000bp contigs to construct the corresponding testing data sets. It is expected that longer contigs generally contain more information about the taxons and thus should yield higher prediction accuracy for OOD detection. To overcome the discrepancy between the training and testing sequence lengths, we follow the ideas in [191] by splitting the testing contigs into non-overlapping fragments of 250bp and calculating the prediction score for each of these fragments. The final prediction score for one contig is calculated by averaging the prediction scores of all fragments in that contig. We follow the same protocol to compare MLR-OOD with other methods.
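This splitting-and-averaging protocol can be sketched in a few lines, assuming a `score_fragment` callable that returns $S^r_{\text{MLR-OOD}}$ for a single 250bp fragment; the names are illustrative.

```python
def contig_score(contig, score_fragment, fragment_length=250):
    """Split a testing contig into non-overlapping 250bp fragments, score each
    fragment, and average the fragment scores to obtain the contig-level score."""
    fragments = [contig[i:i + fragment_length]
                 for i in range(0, len(contig) - fragment_length + 1, fragment_length)]
    if not fragments:                 # contig shorter than one fragment: score it directly
        fragments = [contig]
    scores = [score_fragment(f) for f in fragments]
    return sum(scores) / len(scores)
```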
3.2.7 Investigating the effect of the genome distance between OOD testing sequences and ID training sequences on the prediction accuracy of MLR-OOD

In real applications, the OOD sequences may come from any genetic material other than the ID sequences. Therefore, it is interesting to study how the prediction accuracy of MLR-OOD changes with the overall similarity between the OOD testing sequences and our ID training sequences. Ren et al. concluded that the prediction accuracy of LLR generally becomes higher as the minimum $d_2^S$ distance [139, 142] between the OOD bacterial classes and the ID bacterial classes increases. We investigate the same problem on all of the bacterial, viral, and plasmid data sets we built for MLR-OOD as a byproduct of our analyses. First, we combine all the genomes in each ID training class into a single fasta file representing that class. Second, for each OOD testing genome, we calculate its pairwise Mash distance [128] to all the ID training classes and then take the minimum. We specify the number of sketches as "-s 1000000" and use default values for the other parameters for decent performance of the Mash distance. We choose the Mash distance because it is much faster to compute than $d_2^S$. Once we obtain the minimum distance to the ID classes for each OOD testing genome, we choose a threshold and then only select those genomes with a minimum distance greater than that threshold to build the OOD testing data set. Since currently most metagenomic contigs have length greater than or equal to 1,000bp, we chop the ID and OOD testing genomes into 1,000bp sequences and then calculate the prediction accuracy metrics (see Section 3.2.10) based on MLR-OOD to study the relationship between the Mash distance threshold and the prediction accuracy. We use the same ID testing data set while varying the OOD testing data set in the comparison. The number of available OOD testing sequences after constraining the threshold on the minimum Mash distance is naturally smaller, and we report the detailed numbers for each type of sequence in the Supplementary Materials.

3.2.8 Investigating the effect of chimeric contigs on the prediction accuracy of MLR-OOD

Chimeric contigs are unavoidable in the assembly process of metagenomic sequences, since multiple closely related species usually exist in metagenomic samples [11, 175]. In real practice, OOD detection methods may face chimeric contigs containing sequence fragments from both ID and OOD genomes. We investigate the performance of MLR-OOD when dealing with such chimeric contigs. First, for each given fraction $c$, we simulate 10k chimeric contigs in which the proportion of nucleotides belonging to ID genomes is $c$ and the proportion of nucleotides belonging to OOD genomes is $1 - c$. The contig length is fixed at 1,000bp. For example, if $c = 0.25$, we randomly sample a 250bp sequence fragment from the ID genomes in a particular testing data set, then sample another 750bp sequence fragment from the OOD genomes, and finally insert the 750bp OOD sequence fragment at a random position of the 250bp ID sequence fragment. We test different values of $c$ = 0, 0.25, 0.5, 0.75, and 1 for all of the bacterial, viral, and plasmid data sets. Second, we calculate the prediction scores of MLR-OOD on those chimeric contigs following the procedures introduced in Section 3.2.6. Third, we calculate the prediction accuracy metric AUROC (see Section 3.2.10) for classifying completely OOD contigs ($c = 0$) versus contigs containing a certain fraction of ID sequences ($c$ = 0.25, 0.5, 0.75, and 1) based on the prediction scores of MLR-OOD. We study how the distribution of the prediction scores and the prediction accuracy change with different values of $c$.
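The chimeric contigs described above can be generated with a short sketch like the following; the function name is illustrative, and the random draws determine the ID fragment, the OOD fragment, and the insertion point.

```python
import random

def simulate_chimeric_contig(id_genome, ood_genome, c, length=1000, rng=random):
    """Build a chimeric contig of the given length in which a fraction c of the
    nucleotides come from an ID genome and 1 - c from an OOD genome; the OOD
    fragment is inserted at a random position within the ID fragment."""
    id_len = int(round(c * length))
    ood_len = length - id_len
    id_start = rng.randrange(len(id_genome) - id_len + 1) if id_len > 0 else 0
    ood_start = rng.randrange(len(ood_genome) - ood_len + 1) if ood_len > 0 else 0
    id_frag = id_genome[id_start:id_start + id_len]
    ood_frag = ood_genome[ood_start:ood_start + ood_len]
    cut = rng.randrange(len(id_frag) + 1) if id_frag else 0
    return id_frag[:cut] + ood_frag + id_frag[cut:]
```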
3.2.9 Comparison to LLR

Since Ren et al. [143] already compared the LLR method with a large number of other methods [32, 72, 73, 89, 96, 97, 101], most of which are designed for detecting OOD images, and showed that the LLR method achieved the best prediction accuracy, we focus on comparing our MLR-OOD method with LLR for detecting OOD genomic sequences. Although some other OOD detection methods have been proposed since the publication of the LLR method, to the best of our knowledge none of them are designed for detecting OOD genomic sequences or can be directly applied to this problem. Therefore, we only show the prediction accuracy of the LLR method and of the two methods proposed in this chapter: 1) the max-LL method, which directly uses the maximum of the in-distribution (ID) class conditional likelihoods, and 2) the refined MLR-OOD method, on the bacterial, viral, and plasmid data sets.

Unlike the LLR method, the max-LL and MLR-OOD methods do not have model-parameters to be tuned. Therefore, we do not need a validation data set. To compare with the LLR method on the bacterial data set, we use the optimal model-parameters $\mu = 0.1$ and $\lambda = 10^{-4}$ that were already chosen by Ren et al. for the LLR method based on the validation data set [143]. We try different values of the perturbation rate $\mu \in \{0.1, 0.15, 0.2\}$ for the LLR method, as suggested by Ren et al. [143], and report all of these results. Ren et al. use another model-parameter $\lambda$, which is the coefficient of the $L_2$ regularization in training the LSTM model [143]. Since Ren et al. reported that $\lambda$ is optional and does not markedly affect the overall performance of the LLR method compared to $\mu$ [143], we fix $\lambda = 0$ while changing $\mu$ to save computational resources. We also report the results based on $\lambda = 10^{-4}$, which are very similar to the results based on $\lambda = 0$, for the viral and plasmid data sets in the Supplementary Materials.

3.2.10 Evaluation metrics

We use two widely adopted metrics to evaluate the prediction accuracy of detecting OOD genomic sequences. The first is the area under the receiver operating characteristic curve (AUROC). Assume we have calculated the prediction scores of a given method for the testing sequences. For a given threshold, we predict all the sequences having a prediction score higher than the threshold as ID sequences and the others as OOD sequences. The corresponding true positive rate (TPR) and false positive rate (FPR) are then calculated. By varying the threshold, we can draw a receiver operating characteristic (ROC) curve with FPR and TPR as the two axes. Finally, we calculate the AUROC as the evaluation metric. The second is the area under the precision-recall curve (AUPRC). The calculation of AUPRC is similar to that of AUROC except that we use precision and recall as the two axes. Higher values of both metrics indicate better model performance. Both metrics have been commonly used for evaluating the model performance of OOD detection [72, 143]. For each testing data set, we randomly select 1,000 ID and 1,000 OOD testing contigs from the testing data set and calculate the AUROC or AUPRC based on these 2,000 contigs. We repeat the process for 30 rounds and report the mean and standard deviation of both metrics.
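For one evaluation round, both metrics can be computed directly from the prediction scores with scikit-learn; the sketch below treats ID as the positive class and uses average precision as the standard estimate of the area under the precision-recall curve. The function name is ours.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_round(id_scores, ood_scores, rng=None, n_each=1000):
    """Randomly select n_each ID and n_each OOD contig scores and compute AUROC
    and AUPRC with ID as the positive class (higher scores indicate ID)."""
    rng = np.random.default_rng(rng)
    ids = rng.choice(id_scores, size=n_each, replace=False)
    oods = rng.choice(ood_scores, size=n_each, replace=False)
    scores = np.concatenate([ids, oods])
    labels = np.concatenate([np.ones(n_each), np.zeros(n_each)])
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)
```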
3.3 Results

3.3.1 CNN classification models fail to distinguish between ID and OOD testing sequences

Although Ren et al. [143] have shown that deep neural network classifiers fail to properly deal with OOD bacterial sequences, we reproduce this conclusion and extend it to viral and plasmid sequences, showing that this phenomenon is universal in metagenomic sequence classification regardless of the sequence type. We use convolutional neural networks (CNNs), as used by Ren et al. [143], to train classification models on the ID bacterial, viral, and plasmid data sets. We use the same CNN architecture as in Ren et al. [143]. Specifically, the CNNs contain one convolutional layer, one max-pooling layer, and a final dense layer with softmax activation for predicting class probabilities. The number of motifs for the convolutional layer is set as 1,000 for the bacterial data set and 100 for the viral and plasmid data sets, since there are much fewer ID training sequences in these two data sets. We monitor the ID validation loss, which is calculated every 100 epochs up to 100,000 epochs, and choose the epoch yielding the smallest validation loss. Since the viral and plasmid data sets lack an ID validation data set, we split each ID training data set and use 90% of the sequences for training the classification models and the remaining 10% for validation.

Then we use the trained classification models to calculate the maximum class probability $p(\tilde{y} \mid x) = \max_{1 \le k \le K} p(y = k \mid x)$ on the ID and OOD testing data sets. Conceptually, ID testing sequences should have a much higher maximum class probability than OOD sequences if the classifiers work well. Figure 3.2 shows the maximum class probability of ID and OOD 250bp testing sequences based on the CNN classification models in the bacterial, plasmid, and viral data sets. As shown in the figure, ID sequences have slightly higher maximum class probabilities than OOD sequences in the bacterial Test2016 and Test2018 testing data sets. On the other hand, no clear separation between the distributions of ID and OOD sequences can be observed. For the plasmid and viral data sets, on the contrary, OOD sequences even have relatively higher maximum class probabilities compared to ID sequences, indicating that OOD sequences are more likely to be classified into one of the training classes than ID sequences. This figure clearly demonstrates that the CNN models may fail to distinguish OOD sequences from ID sequences, meaning that OOD sequences are very likely to be misclassified by deep learning models with high confidence. On the other hand, we also notice that the validation loss fluctuates markedly during training, meaning that the CNN classifiers are actually not stable for classifying new sequences. Therefore, OOD detection of genomic sequences is of great significance to microbial studies to ensure the credibility of taxonomic classification.

Figure 3.2: The bar plots of the maximum class probabilities of 10k ID and 10k OOD sequences in each type of testing data set. The contig length of the testing sequences is 250bp.

3.3.2 MLR-OOD outperforms LLR on all data sets

3.3.2.1 Markov order estimation for different testing data sets

The choice of the Markov order for the MLR-OOD method depends on the testing data set and the contig length.
3.3.2 MLR-OOD outperforms LLR on all data sets

3.3.2.1 Markov order estimation for different testing data sets

The choice of Markov order for the MLR-OOD method depends on the testing data set and the contig length. For each testing data set and contig length, we use the Bayesian Information Criterion (BIC) (details shown in the Supplementary Materials) to estimate the Markov order of each contig and choose the most common estimated order for calculating $S^{r}_{\text{MLR-OOD}}(x)$ in equation (3.3). The estimated orders based on BIC have been shown to perform well in molecular sequence analyses [122]. For the Test2016 bacterial data set, zeroth order (i.i.d.) is estimated as the most common order for contigs with length less than or equal to 1,000bp. The longer 2,500bp and 5,000bp contigs are most likely to be estimated as first order Markov chains. The conclusion is the same for the Test2018 bacterial data set and the plasmid data set except that i.i.d. is the most common order for the 2,500bp contigs. For viruses, zeroth order (i.i.d.) is the most common order regardless of the contig length. The distributions of the estimated orders for all data sets are shown in the Supplementary Materials.

3.3.2.2 The optimal training length for MLR-OOD is 250bp

We first present the prediction accuracy of our MLR-OOD method for OOD genomic sequence detection on different training and testing lengths. The corresponding LSTM models are trained using ID training sequences of lengths 100bp, 250bp, and 500bp in the bacterial data set. Testing contigs of lengths 1,000bp, 2,500bp, and 5,000bp are used to evaluate the performances of the LSTM models based on the different training lengths. As shown in Figure 3.3, models trained using 250bp sequences consistently perform the best in terms of prediction accuracy. On the other hand, we observe that the computational time for training the LSTM models is proportional to the training sequence length. Models trained using 100bp sequences perform slightly worse but require the least computational time. Models trained using 500bp sequences perform poorly in terms of both prediction accuracy and computational time. The poor performance of LSTM on long sequences (500bp) is probably due to the fact that the use of activation functions may result in gradient decay over layers [99]. Therefore, in this chapter, we fix the training sequence length at 250bp, as in Ren et al. [143], which gives excellent prediction accuracy and acceptable computational time.

Figure 3.3: The prediction accuracy of MLR-OOD on the Test2016 bacterial data set based on different training/testing sequence lengths. Each bar shows the mean accuracy of 30 random repetitions for a particular training/testing sequence length. Error bars indicate the standard deviation.

3.3.2.3 The computational time of MLR-OOD

The computational time of MLR-OOD is dominated by training the generative models. Compared to training, the time for calculating the likelihoods of the LSTM and Markov models is minimal. The computational time for training is proportional to both the number of epochs and the training sequence length. For example, training 900,000 epochs for 100bp, 250bp, and 500bp sequences using an NVIDIA Tesla V100 GPU takes approximately 42, 105, and 210 hours, respectively. We recommend using GPU resources for training a large number of epochs in practice.
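Before turning to the comparison, the sketch below shows how the two prediction scores proposed in this chapter could be assembled from the per-class LSTM log-likelihoods and a background Markov chain log-likelihood. It is a schematic reading of the verbal description (max-LL takes the maximum class-conditional likelihood; MLR-OOD additionally adjusts it with a Markov chain likelihood), not the exact form of equation (3.3); the function names and the simple subtraction used for the adjustment are assumptions for illustration.

```python
import numpy as np

def max_ll_score(class_loglik):
    """max-LL: maximum of the ID class-conditional log-likelihoods.

    class_loglik: array of shape (n_classes,), holding log p_{theta_k}(x) for each ID class k.
    """
    return float(np.max(class_loglik))

def mlr_ood_score(class_loglik, markov_loglik):
    """Schematic MLR-OOD score: the maximum class-conditional log-likelihood
    adjusted by a Markov chain log-likelihood of the same contig, intended to
    absorb background composition such as GC content. The subtraction here is
    an illustrative stand-in for equation (3.3), not the published formula.
    """
    return float(np.max(class_loglik) - markov_loglik)
```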
3.3.2.4 The comparison between MLR-OOD, max-LL, and LLR on different lengths of testing sequences

Next, we compare our MLR-OOD and max-LL methods with the LLR method using different lengths of testing contigs and present the results in Figure 3.4. The ID LSTM models are trained using the training data set of Ren et al. [143], and two corresponding testing data sets (Test2016 and Test2018) are used for evaluating the performances of the different methods. As shown in Figure 3.4 (A)-(D), our MLR-OOD method greatly outperforms the LLR method for detecting OOD genomic sequences, regardless of the contig length. Note that the max-LL method already improves the prediction accuracy by a remarkable margin compared to the LLR method, revealing that considering the ID classes separately instead of using one general model for all ID classes is the major reason for the superiority of MLR-OOD. Moreover, the Markov chain likelihood further increases the prediction accuracy: MLR-OOD performs consistently better than max-LL, which can possibly be attributed to the Markov chain likelihoods reducing the effects of the confounding factors. We would like to emphasize that MLR-OOD does not need to tune model-parameters and does not require access to extra OOD validation data, while the LLR method needs to tune two extra model-parameters on the validation data set. For the comparison, we used the optimal model-parameters chosen by Ren et al. [143]. As for the contig length, all three methods show an increasing trend in prediction accuracy for longer contigs, which can be explained by the fact that longer contigs generally contain more information about the taxa they belong to. It is also worth mentioning that MLR-OOD yields very high prediction accuracy (AUROC and AUPRC > 0.9) when the testing contig length is at least 1,000bp, a length that many assembled metagenomic contigs can easily reach, clearly demonstrating that MLR-OOD is highly promising for detecting OOD bacterial sequences in practice.

We present the results for predicting OOD viral sequences in Figure 3.4 (E) and (F). As shown therein, although the prediction accuracy still increases with the testing contig length, all three methods generally yield a lower performance compared to the results on the bacterial data sets shown in Figure 3.4 (A)-(D). This can possibly be explained by the fact that viruses tend to have much shorter genomes but much higher mutation rates than their hosts [46, 132], making it more difficult to train LSTM models using a limited amount of data for the ID classes to accurately capture the underlying taxonomic characteristics. Among the three methods, the two methods proposed in this chapter, max-LL and MLR-OOD, both perform much better than LLR in all scenarios, regardless of the choice of the model-parameter µ. Although the prediction accuracy of MLR-OOD is slightly lower than that of max-LL, the differences are minimal. This phenomenon can be explained by the fact that the prediction scores of max-LL for the viral contigs are already robust to the GC content, so the adjustment by the Markov likelihood is no longer needed, as we will show later in Figure 3.5. In general, max-LL and MLR-OOD are interchangeable for the viral data sets and both of them show a noticeable improvement compared to the LLR method. Note that µ = 0.1 performs better than µ = 0.15 and µ = 0.2 among the three LLR models.
Finally, we compare the model performances of the three methods on the plasmid data sets in Figure 3.4 (G) and (H). Figure 3.4 (G) and (H) clearly show that the prediction accuracy of these methods remains relatively low for short contigs compared to the bacterial data set, but increases rapidly once the contigs become longer than 1,000bp. Similar to the results on the viral data sets, we see that both max-LL and MLR-OOD achieve higher prediction accuracy than the LLR method, even when LLR uses the model-parameters yielding its highest accuracy. Unlike on the viral data sets, MLR-OOD slightly outperforms max-LL, although the differences are almost negligible. We will see in Figure 3.5 that both max-LL and MLR-OOD are robust to the GC content compared to LLR. For LLR, it is interesting to observe that µ = 0.2 performs better than µ = 0.1 and µ = 0.15 for this data set, which is different from the viral data sets. This phenomenon indicates that the choice of the model-parameter µ using an extra validation data set is essential for LLR, as different data sets require different optimal choices of µ. In conclusion, MLR-OOD has better model performance than LLR even without using the information from extra validation data sets.

Figure 3.4: The prediction accuracies of LLR, max-LL, and MLR-OOD for OOD genomic sequence detection on the Test2016 and Test2018 bacterial data sets, the viral data sets, and the plasmid data sets. (A) and (B): AUROC and AUPRC for the bacterial Test2016 data set. (C) and (D): AUROC and AUPRC for the bacterial Test2018 data set. (E) and (F): AUROC and AUPRC for the viral data set. (G) and (H): AUROC and AUPRC for the plasmid data set. Each bar shows the mean accuracy of 30 random repetitions for each method and a particular sequence length. Error bars indicate the standard deviation.

3.3.3 MLR-OOD is robust to the GC content on all data sets

In this section, we show that the MLR-OOD method is robust to the GC content, which is a major confounding factor for detecting OOD genomic sequences. Figure 3.5 shows the relationship between the GC content and the prediction scores of LLR, max-LL, and MLR-OOD defined in equations 3.1-3.3 on the bacterial, viral, and plasmid testing data sets. For the bacterial Test2016 data set, for which Ren et al. showed that the LLR method was robust to the GC content, we observe the same pattern they presented in [143] for LLR. As for max-LL, the separation between ID and OOD sequences becomes clearer compared to LLR. However, the prediction score of max-LL is slightly biased by the GC content, as OOD sequences with GC content between 0.4 and 0.7 tend to have slightly lower prediction scores. In contrast, our MLR-OOD method is less biased by the GC content than max-LL while maintaining a much better separation than LLR, explaining its best performance among the three methods shown in Figure 3.4 (A) and (B).
The viral data sets and the plasmid testing data sets, nevertheless, display different patterns from the bacterial Test2016 data set. First, the LLR method is biased by the GC content for both data sets. For the plasmid data set, there is a slightly increasing trend between the GC content and the prediction score. For the viral data set, the trend is reversed. Second, the prediction scores of max-LL for testing sequences in both data sets are not obviously associated with the GC content, possibly explaining the decent performance of max-LL on these two data sets. Third, as shown in Figure 3.5, the prediction scores of MLR-OOD are similar to those of max-LL, which is consistent with the results shown in Figure 3.4 that max-LL and MLR-OOD have very close prediction accuracy on the viral and plasmid data sets. This is understandable because the prediction score of max-LL is less biased by GC content compared to the bacterial data sets, so the adjustment using Markov chain likelihoods does not markedly improve the prediction scores. That being said, MLR-OOD still makes the prediction score slightly more independent of the GC content for the plasmid testing data set.

Figure 3.5: The relationship between the GC content and the prediction scores of LLR, max-LL, and MLR-OOD on three types of testing sequences. Each subfigure contains 500 randomly selected ID and 500 randomly selected OOD 250bp testing sequences. For the bacterial data set, we use the Test2016 data set. For the viral and plasmid data sets, we select the prediction score corresponding to the µ yielding the highest prediction accuracy, that is, µ = 0.1 for the viral data sets and µ = 0.2 for the plasmid data sets.

3.3.4 The prediction accuracy of MLR-OOD increases with the Mash distance threshold for choosing the OOD testing sequences

After constraining the OOD testing sequences to have a minimum Mash distance to the ID classes greater than a certain threshold, we observe in Figure 3.6 that the prediction accuracy for OOD genomic sequences increases with the Mash distance threshold. For bacterial sequences, which have been studied by Ren et al. based on the $d_2^S$ distance [143], the trend is consistent, as shown in Figure 3.6 (A) and (B). It is shown therein that the prediction accuracy increases from threshold 0 (no constraint) to 0.3 and then remains stable. For viral sequences, shown in Figure 3.6 (C) and (D), a similar slightly increasing trend is observed. The plasmid sequences, shown in Figure 3.6 (E) and (F), display the most obvious increase: the AUROC and AUPRC increase by more than 0.1 from Mash distance threshold 0 to 0.3 and then remain stable. We choose different cutoff thresholds for the different types of sequences because the distributions of the Mash distances for bacterial, viral, and plasmid sequences are different; the cutoffs are chosen over the range in which the majority of OOD testing sequences have their minimum Mash distance to the ID classes. The histograms of the minimum Mash distance of OOD testing sequences to the ID classes for bacterial, viral, and plasmid sequences are shown in Supplementary Materials Figure S1.

Figure 3.6: The relationship between the Mash distance threshold for choosing OOD sequences and the prediction accuracy of MLR-OOD on the bacterial Test2016, viral, and plasmid data sets. (A) and (B): AUROC and AUPRC for the bacterial Test2016 data set. (C) and (D): AUROC and AUPRC for the viral data set. (E) and (F): AUROC and AUPRC for the plasmid data set. Each point shows the mean accuracy of 30 random repetitions for each method and a particular sequence length. Error bars indicate the standard deviation. The x-axis represents the minimum Mash distance threshold for choosing OOD testing genomes. The contig length is fixed as 1,000bp.
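The thresholding step of this analysis can be summarized with a small helper. The sketch below is an illustration under the assumption that the minimum Mash distance of each OOD genome to the ID classes has already been precomputed (for example, with the `mash sketch` and `mash dist` commands) and stored in a table; the column names and function name are our own.

```python
import pandas as pd

def filter_ood_by_mash(min_dist_table, threshold):
    """Keep only OOD testing genomes whose minimum Mash distance to the ID
    classes exceeds the threshold.

    min_dist_table: DataFrame with columns 'genome' and 'min_mash_dist', where
    'min_mash_dist' is the smallest Mash distance from that OOD genome to any
    ID training class (precomputed externally).
    """
    kept = min_dist_table[min_dist_table["min_mash_dist"] > threshold]
    return kept["genome"].tolist()

# Example: sweep the thresholds used in Figure 3.6 (bacterial data set);
# `bacterial_min_dist` is an assumed, precomputed table.
# for t in [0.0, 0.3, 0.35, 0.4, 0.45]:
#     genomes = filter_ood_by_mash(bacterial_min_dist, t)
#     # ...sample 1,000bp contigs from these genomes and compute AUROC/AUPRC...
```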
3.3.5 The impact of chimeric contigs on the prediction score and prediction accuracy of MLR-OOD

In this section, we show that the prediction score of MLR-OOD increases with the fraction of ID sequence fragments in chimeric contigs. As a result, the prediction accuracy of MLR-OOD also changes accordingly. Figure 3.7 illustrates the impact of chimeric contigs on the prediction score and prediction accuracy of MLR-OOD. The distributions of the prediction scores of MLR-OOD for bacterial, viral, and plasmid chimeric contigs are shown in Figure 3.7 (A), (C), and (E). There is an obvious increasing trend with the fraction of ID sequence fragments for both bacterial and plasmid chimeric contigs. For viral chimeric contigs the trend is not as clear, possibly because the prediction accuracy of MLR-OOD on the viral data set is generally low, as shown in Figure 3.4. However, we still see that more contigs generally receive higher prediction scores when the fraction of ID sequence fragments increases. This phenomenon is understandable since ID sequence fragments generally receive higher prediction scores than OOD sequence fragments.

We also quantify the trend by calculating the AUROC for classifying completely OOD contigs (ID fraction c = 0) versus contigs containing a certain fraction of ID sequences (c = 0.25, 0.5, 0.75, and 1) based on the prediction scores of MLR-OOD. As shown in Figure 3.7 (B) and (F), the AUROC increases remarkably from classifying completely OOD contigs against chimeric contigs with ID fraction 0.25 to classifying completely OOD against completely ID (ID fraction c = 1) contigs. For viral chimeric contigs, shown in Figure 3.7 (D), the trend is also monotonically increasing, although the improvement is smaller. These results are reasonable and consistent with the results shown in Figure 3.7 (A), (C), and (E). In conclusion, chimeric contigs receive intermediate prediction scores between completely ID and completely OOD contigs from MLR-OOD, and the prediction accuracy depends on the fraction of ID/OOD sequences in those contigs.

Figure 3.7: The impact of chimeric contigs on the prediction score and prediction accuracy of MLR-OOD. (A), (C), and (E): the violin plots of the prediction scores of bacterial Test2016, viral, and plasmid chimeric contigs. The x-axis represents the fraction of ID sequences in the chimeric contigs. The contig length is fixed at 1,000bp. (B), (D), and (F): the AUROC for classifying bacterial Test2016, viral, and plasmid contigs containing different fractions of ID sequences. For example, "ID 0/25" represents classifying contigs containing 0% and 25% ID sequences. Each bar shows the mean accuracy of 30 repetitions based on 1k randomly drawn testing contigs from each chimeric contig set. Error bars indicate the standard deviation.
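The chimeric contigs analyzed in Section 3.3.5 can be generated with a simple concatenation scheme. The sketch below is an illustrative reconstruction of such a procedure, not the exact script used in this chapter: it joins an ID fragment and an OOD fragment so that a fraction c of a 1,000bp contig comes from the ID genome, with fragment start positions drawn uniformly at random.

```python
import random

def make_chimeric_contig(id_genome, ood_genome, contig_len=1000, id_fraction=0.5, rng=None):
    """Build a chimeric contig whose first `id_fraction` of bases comes from an
    ID genome and the remainder from an OOD genome (both given as strings)."""
    rng = rng or random.Random(0)
    id_len = int(round(contig_len * id_fraction))
    ood_len = contig_len - id_len

    def random_fragment(genome, length):
        start = rng.randrange(0, len(genome) - length + 1)
        return genome[start:start + length]

    id_part = random_fragment(id_genome, id_len) if id_len > 0 else ""
    ood_part = random_fragment(ood_genome, ood_len) if ood_len > 0 else ""
    return id_part + ood_part

# Example: contigs with ID fractions 0, 0.25, 0.5, 0.75, and 1, as in Figure 3.7,
# where `id_g` and `ood_g` are assumed genome sequences.
# contigs = [make_chimeric_contig(id_g, ood_g, id_fraction=c) for c in (0, 0.25, 0.5, 0.75, 1)]
```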
3.4 Discussion and Conclusions

Machine learning and deep learning models have been gaining popularity for classifying microbial sequences because of their power in learning sequence patterns and their generality for discovering unknown sequences. However, their weakness in dealing with OOD sequences has long been neglected. Several studies show that deep learning classifiers are likely to classify OOD sequences into one of the training classes with high confidence, revealing that the detection of OOD genomic sequences is urgently needed. In this chapter, we propose MLR-OOD, a Markov chain based likelihood ratio method, to tackle this problem. We summarize the strengths and weaknesses of the three methods for detecting OOD genomic sequences, MLR-OOD and max-LL proposed by us and the LLR method proposed by Ren et al. [143], in Table 3.2.

Table 3.2: The comparison among MLR-OOD, max-LL, and LLR in several aspects showing their strengths and weaknesses.

  Methods                  MLR-OOD         max-LL              LLR
  Accuracy                 High            High                Relatively low
  Parameter tuning         No              No                  Two model-parameters to be tuned manually
  Effect of GC content     Highly robust   Relatively robust   Not robust on the viral and plasmid data sets
  Computational resource   High            High                Low

Compared to the LLR method, the first work specifically addressing the detection of OOD genomic sequences, MLR-OOD has several key advantages. First, MLR-OOD utilizes the specific ID class labels by training a generative model for each ID class separately, making it possible to capture the distribution of ID sequences more precisely. Second, MLR-OOD bypasses tuning model-parameters by using the Markov chain likelihoods to adjust the likelihoods given by the ID models. This is of paramount importance for real applications since the assumption that part of the OOD data is accessible for tuning model-parameters is questionable in reality. Third, we show that the prediction score of MLR-OOD is robust to the GC content, which is a major confounding factor for detecting OOD genomic sequences. Fourth, MLR-OOD consistently achieves remarkably higher prediction accuracy than the LLR method on all testing data sets, even when LLR uses the optimal model-parameters chosen from the validation data sets, clearly revealing the state-of-the-art performance of MLR-OOD. Fifth, in addition to the bacterial data set compiled by Ren et al. [143], we also construct an updated bacterial testing data set along with the viral and plasmid data sets for comprehensively benchmarking current and future OOD detection methods. Despite these key advantages, there are also some limitations of MLR-OOD.
First, the training of the generative models for the ID classes requires a large amount of computational time and resources, especially when there are many ID classes. Second, just like other deep learning based methods, MLR-OOD needs a large training data set to avoid overfitting. For example, we observe that the prediction accuracy of MLR-OOD on the viral data set drops compared to the bacterial data set, which can possibly be explained by the fact that the viral data set contains much fewer training sequences and thus overfitting may happen. Third, it is difficult for MLR-OOD to increase the prediction accuracy if the maximum of the class conditional likelihoods is already robust to the GC content, as shown on the plasmid and viral data sets. In the future, we hope to develop an updated version of MLR-OOD which can save computational resources without compromising the prediction accuracy. We also expect to combine novel methods for optimizing the model parameters of deep neural networks [44, 78, 108, 187] with MLR-OOD to make it more robust and powerful.

Chapter 4

Conclusions and future work

This dissertation presents two novel statistical and computational methods that can be used to improve the reproducibility and reliability in analyzing metagenomic sequences.

In particular, the KIMI method deals with the identification of motifs from binary types of metagenomic sequences with a controlled false discovery rate and high power. On the one hand, identifying sequence motifs (k-mers) provides insights into understanding the key regions or nucleotide patterns driving the underlying biological principles of complex microbial communities. The prediction accuracy of sequence classification models can also be improved based on the selected sequence motifs. On the other hand, reproducibility of the selected motifs is guaranteed because of the controlled false discovery rate.

The MLR-OOD method tackles the reliability issue by accurately detecting OOD sequences, which are highly likely to bias the classification of metagenomic sequences using machine learning or deep learning models. We construct three microbial data sets consisting of bacterial, viral, and plasmid sequences for comprehensively benchmarking the performance of detecting OOD genomic sequences. Based on these benchmarking data sets, we show that high prediction accuracy can be achieved using MLR-OOD. We hope that the development of MLR-OOD will raise awareness of OOD sequences in the field of microbiology and motivate researchers to dive deeper into this topic and propose more powerful and useful methods.

In terms of potential topics for future research, the following problems may be worth exploring.

4.1 Identifying sequence motifs based on deep learning

Although KIMI has promising performance on selecting k-mer based motifs, it also has some limitations; for example, the k-mer size needs to be fixed and prespecified, while in reality sequence motifs may have various lengths. The collinearity between the frequencies of k-mers and the dimensionality of the data also bring challenges for KIMI in real applications.

Recently, novel computational methods combining knockoff inference and deep learning have been proposed [106, 148, 192], providing more powerful tools for us to study the motif identification problem. In addition, the autoencoder, a type of artificial neural network for efficient encoding and nonlinear dimensionality reduction [88, 153], has recently been used for encoding genomic sequences [1, 67, 180].
Specifically, Agarwal et al. proposed a sequence-to-sequence autoencoder model based on LSTM to learn a latent representation of a fixed dimension for long and variable length DNA sequences [1]. Based on that model, we can start by simulating a set of Markov chains as DNA sequences, learn the autoencoder model based on the simulated sequences, and extract the latent dimensional vectors as the unsupervised representations of those DNA sequences. Next, we simulate sequence labels based on the latent dimensional vectors and randomly chosen relevant positions of the vectors using the logistic regression model in Chapter 2. We treat the latent dimensional vectors as the input and aim to select those relevant positions using our knockoff inference framework, possibly combined with the novel deep learning knockoff inference frameworks mentioned above. The corresponding knockoff matrix will then be constructed and the remaining knockoff procedures will be applied to recover the pre-defined relevant positions under a target FDR. We expect that the FDR will be controlled and high power will be obtained in the above simulation setting.

As for real data analysis, we will learn two autoencoder models separately for the binary types of sequences with variable lengths. After obtaining the latent dimensional vectors, we apply our knockoff inference framework to select relevant positions under a target FDR. Then we can use Integrated Gradients [168], DeepLIFT [160], or DeepPINK [106] to attribute the relevant positions in the latent dimensional vectors back to specific nucleotides in the original sequences. We will check the existing literature to verify whether those nucleotides carry specific biological meanings.

In general, we can potentially use the autoencoder to directly encode whole DNA sequences with variable lengths without depending on k-mers, so that we can eliminate many obstacles encountered when dealing with k-mers. Deep learning based knockoff inference can potentially facilitate the identification of motifs with more complex and high dimensional input data.

4.2 Detecting OOD genomic sequences using autoencoder

As introduced in Chapter 3, MLR-OOD is still a relatively preliminary study in the field of detecting OOD genomic sequences and we believe there is still room for improvement. The autoencoder sheds light on many research topics including the detection of OOD inputs, as many recent studies have reported promising results [38, 39, 119, 137, 184]. The autoencoder provides a way to learn a low dimensional representation of ID sequences in an unsupervised manner, which may accurately capture the characteristics of the ID classes. Meanwhile, some studies reported that autoencoders work well for relatively small data sizes [31, 118, 120], although the specific application scenarios are different.

As the first step of exploration, we plan to train the sequence-to-sequence autoencoder model of [1] on the training sequences of each ID class. Data augmentation methods such as generative adversarial networks (GANs) [60] may also be used in case the ID training data are insufficient. Next, we feed an incoming testing sequence x to each of the trained autoencoder models and extract the corresponding low dimensional vector. For the i-th ID class, we calculate the mean Mahalanobis distance [109] or Euclidean distance between the low dimensional vectors of the testing sequence and all the training sequences in that class, denoted as d_i(x). The minimum mean distance, calculated as below, can then be used as a natural indicator of how likely sequence x belongs to ID:

$$d_{\min}(x) = \min_i d_i(x).$$

A larger value of d_min(x) indicates a larger probability that sequence x belongs to OOD. We will test this method on the benchmarking data sets proposed in Chapter 3, so that the AUROC and AUPRC measuring the prediction accuracy can be calculated by varying the cutoff threshold. We can also visualize the distribution patterns of ID training and testing sequences based on the low dimensional vectors using principal coordinate analysis (PCoA) or principal component analysis (PCA). If a positive signal can be observed with the autoencoder-based method described above, we will continue refining our method along this direction. We will pay particular attention to the prediction accuracy of contigs with length above 1,000bp, as most assembled metagenomic contigs are longer than 1,000bp.
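As a rough illustration of the proposed score, the sketch below computes d_min(x) from per-class latent vectors using the Mahalanobis distance. It assumes the autoencoder embeddings have already been computed, and the use of each class's own covariance matrix is one natural reading of the description above rather than a finalized design.

```python
import numpy as np

def mahalanobis_ood_score(z_test, class_latents):
    """d_min(x): minimum over ID classes of the mean Mahalanobis distance from
    the latent vector of a testing sequence to that class's training latents.
    Larger values suggest OOD.

    z_test:        latent vector of the testing sequence, shape (d,)
    class_latents: list of arrays, one per ID class, each of shape (n_k, d)
                   holding the latent vectors of that class's training sequences
    """
    distances = []
    for latents in class_latents:
        cov_inv = np.linalg.pinv(np.cov(latents, rowvar=False))   # pseudo-inverse for stability
        diffs = latents - z_test                                  # (n_k, d)
        d2 = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)      # squared distances per point
        distances.append(float(np.sqrt(np.clip(d2, 0, None)).mean()))
    return min(distances)
```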
In summary, it would be interesting to develop computational approaches based on autoencoders that can further improve the prediction accuracy of MLR-OOD. This is a topic for future research.

Chapter 5

Supplementary materials

5.1 Chapter 2 supplementary materials

5.1.1 Knockoff variables construction via knockoff contigs

In Sesia et al. (2019), the authors developed a method to generate knockoff hidden Markov chains from a given set of hidden Markov chains. Since molecular contigs are commonly modeled using Markov chains, one naive approach to generate knockoff variables corresponding to k-mer frequencies is to first generate knockoff Markov contigs and then calculate the corresponding k-mer frequencies from the knockoff contigs as knockoff variables. The key to constructing knockoff Markov contigs is knowing the corresponding transition probability matrices of the original contigs, which can be estimated from the data using existing methods such as maximum likelihood estimation (MLE).

The detailed implementation for generating knockoff Markov contigs is described as follows. First, for the i-th contig $Z_i = Z_{i,1}\cdots Z_{i,L}$, we generate the corresponding knockoff contig $\tilde{Z}_i = \tilde{Z}_{i,1}\cdots \tilde{Z}_{i,L}$, independently from other contigs, using Algorithm 2 in Sesia et al. [158]. The transition probability matrix of interest can be either the true one (assuming the ideal case in simulation studies) or estimated from the data using MLE in real data analysis. Then each base $\tilde{Z}_{i,j}$ generated in the knockoff contig is a valid knockoff copy of the original base $Z_{i,j}$ according to [158]. Finally, for each contig $i = 1,\cdots,n$, the frequency of k-mer $w_j$ in the original contig $Z_i$ and in its knockoff counterpart $\tilde{Z}_i$ are calculated and denoted as $f_i(w_j)$ and $\tilde{f}_i(w_j)$, respectively.

However, we discover that the knockoff Markov contig method as implemented above does not yield valid knockoffs for k-mer frequencies even if true transition matrices are used, and consequently, the FDR of k-mer selection is not controlled. The main reason is that although for any subset $S \subset \{1,\cdots,L\}$ we have exchangeability at the contig level, that is,

$$(Z_{i,1}\cdots Z_{i,L},\ \tilde{Z}_{i,1}\cdots \tilde{Z}_{i,L})_{\mathrm{swap}(S)} \overset{d}{=} (Z_{i,1}\cdots Z_{i,L},\ \tilde{Z}_{i,1}\cdots \tilde{Z}_{i,L}), \qquad (5.1)$$

the exchangeability at the corresponding k-mer frequency level breaks down. To understand this, note that each k-mer frequency $f_i(w)$ is a summary statistic depending on multiple positions in $Z_{i,j}$, $1 \le j \le L$.
Thus, the positions related to $f_i(w)$ and $\tilde{f}_i(w)$ are generally different, and the exchangeability in (5.1) therefore cannot be translated into exchangeability of the corresponding k-mer frequencies, resulting in uncontrolled FDR. To make this clearer, suppose we swap the j-th position of the original and knockoff contigs, which does not violate the exchangeability at the contig level. However, the k-mer frequencies (k > 1) involving the j-th position will change, leading to different joint distributions of k-mer frequencies before and after the swap. Therefore, we do not adopt this method for our problem.

5.1.2 Knockoff inference based on the log-ratio transformed k-mer frequencies for predicting viral contigs

In Chapter 2, we drop the one k-mer having the largest p-value from the WMW test to avoid perfect collinearity among k-mer frequencies. Alternatively, Lin et al. proposed using the log-ratio of relative frequencies or constraints on the regression coefficients to deal with compositional covariates [102]. Let $F = (f_i(w_j))$, $1 \le i \le n$, $1 \le j \le p+1$, be the $n$ by $p+1$ matrix of the original k-mer frequencies, where $p = 4^k - 1$. The log-ratio transformation proposed in Lin et al. tackles the following problem for logistic regression:

$$\mathrm{logit}\, P(y_i = 1) = b_0 + \sum_{j=1}^{p+1} b_j \log(f_i(w_j)), \quad \text{subject to } \sum_{j=1}^{p+1} b_j = 0.$$

This is equivalent to, for any $t$,

$$\mathrm{logit}\, P(y_i = 1) = b_0 + b_1 \log\frac{f_i(w_1)}{f_i(w_t)} + \cdots + b_{t-1}\log\frac{f_i(w_{t-1})}{f_i(w_t)} + b_{t+1}\log\frac{f_i(w_{t+1})}{f_i(w_t)} + \cdots + b_{p+1}\log\frac{f_i(w_{p+1})}{f_i(w_t)}.$$

We compare the prediction accuracies of viral contigs using the logistic regression models based on the log-ratio transformation or on the k-mer frequencies after dropping one k-mer. We use the same knockoff inference framework for both approaches except for how the data matrix $X$ is generated. For the log-ratio transformation, we first apply the following transformation to obtain $\hat{F}$ from $F$:

$$\hat{f}_i = \left(\log\frac{N_i(w_j)+1}{N_i(w_t)+1},\ j \in \{1,\cdots,t-1,t+1,\cdots,p+1\}\right), \qquad \hat{F} = (\hat{f}_1,\cdots,\hat{f}_n)'.$$

Here $N_i(w_j) = (L-k+1)f_i(w_j)$ denotes the number of occurrences of k-mer $w_j$ in the i-th contig. Since some k-mers may not appear in a particular contig, we add 1 to both the numerator and the denominator to deal with those zero-count k-mers. The base k-mer is chosen as the one with the highest p-value from the WMW test comparing the distributions of k-mer frequencies in viral and bacterial contigs, consistent with the criterion for dropping one k-mer in the main text. Next, we implement the same transformation as discussed in Section 2.2.3 of the main text to obtain $\hat{X}$ from $\hat{F}$ to approximate the multivariate normal distribution. Then we use $(\hat{X}, y)$ to apply knockoff inference and select relevant k-mers. We then use the log-ratio transformed frequencies of those selected k-mers to train a logistic regression classification model with lasso penalty on the training set and calculate the AUROC on the testing set. See Section 2.2.5 of the main text for more details.
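A minimal sketch of this log-ratio construction is given below; the k-mer counting and the choice of the base k-mer index t are simplified placeholders, and the downstream knockoff inference steps are omitted.

```python
import numpy as np
from itertools import product

def kmer_counts(seq, k=3):
    """Count occurrences of each k-mer (lexicographic order over ACGT) in one contig."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {w: j for j, w in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:              # skip k-mers containing ambiguous bases
            counts[j] += 1
    return counts

def log_ratio_features(count_matrix, base_idx):
    """Log-ratio transform: log((N_i(w_j) + 1) / (N_i(w_t) + 1)) for all j != t.

    count_matrix: n x 4^k matrix of raw k-mer counts; base_idx is the base k-mer t.
    """
    pseudo = count_matrix + 1.0
    ratios = np.log(pseudo / pseudo[:, [base_idx]])
    return np.delete(ratios, base_idx, axis=1)
```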
5.1.3 The relationship between FDR/power and the target FDR ρ based on simulation studies

We vary the target FDR and report the changing FDR/power in Figure 5.1. As shown therein, KIMI can automatically adapt to different pre-specified target FDR levels while retaining control of the FDR. Generally, power increases with the target FDR, which is understandable since a larger target FDR allows more room for false discoveries, in turn enhancing true discoveries.

Figure 5.1: The relationship between power/FDR and the target FDR ρ. The thick black line indicates the target FDR from 0.1 to 0.3 in steps of 0.05. Standard error of the mean (SEM) calculated from 100 rounds is added to each data point. Default parameters are used for simulations.

5.1.4 The BH procedure and the q-value method produce almost no power based on p-values of likelihood ratio tests from joint logistic regression

In the main text, we present the performance of the BH procedure and the q-value method based on p-values calculated from joint logistic regression, which by default uses Wald tests. We present similar results based on likelihood ratio tests (LRT) in Figure 5.2. Figure 5.2 shows that neither the BH procedure nor the q-value method produces noticeable power, similar to the results shown in Figure 3 of the main text. Therefore, we conclude that the failure of both the BH procedure and the q-value method is possibly due to the fact that such methods are not applicable to the motif identification problem, where high collinearity between k-mer frequencies is present.

Figure 5.2: The power/FDR of the BH procedure and the q-value method based on p-values of LRT calculated from joint logistic regression. Default parameters for simulation studies are used. The joint logistic regression is implemented by the R function "glm". The black horizontal line indicates the target FDR at ρ = 0.2. Standard error of the mean (SEM) calculated from 100 rounds is added to each data point.

5.1.5 KIMI based on dropping one k-mer has higher accuracy of predicting viral contigs than based on the log-ratio transformation

Figure 5.3 shows the prediction accuracy of viral contigs based on the linear model after dropping one k-mer and based on the log-ratio transformation. Note that when the contig length is 500bp, knockoff inference based on the log-ratio transformation did not select any k-mer in some rounds. Thus, no k-mers can be used for prediction in those rounds. We assign an AUC of 0.5 in these rounds since we can only predict all contigs with the same probability of being from viruses. This explains the high variation at L = 500bp for the log-ratio transformation. In general, we see that dropping one k-mer results in higher prediction accuracy and lower variation across the 30 rounds. Therefore, we adopt dropping one k-mer for the KIMI framework.

Figure 5.3: Prediction accuracy of viral contigs based on relevant k-mers selected by KIMI using the relative k-mer frequencies after dropping one k-mer is higher than that based on the log-ratio transformation. The k-mer size is fixed as 3. The AUC values on the testing set for different contig lengths are calculated using the subset of k-mers selected by KIMI based on relative k-mer frequencies after dropping one k-mer or on the log-ratio transformation. Error bars indicate the standard deviation of AUC values calculated from 30 rounds.
5.2 Chapter 3 supplementary materials

5.2.1 The viral and plasmid data sets

5.2.1.1 The viral data set

For the viral data set, the 6 ID classes are defined based on 6 viral families. The OOD classes are defined as 20 randomly chosen viral families other than those ID classes. The specific names of the viral families used as ID and OOD classes are summarized in Table 5.1. We use the family level for defining viral ID and OOD classes because viral sequences are less abundantly available on NCBI compared to bacterial and plasmid sequences. Thus, viral genomes belonging to a specific genus would be insufficient for training an LSTM model, leading to overfitting in such scenarios. Therefore, we use the family level to define the classes for the viral data set. The ID training data set contains 358 viral genomes discovered before 01/01/2016 from the 6 viral ID families. The ID testing data set contains 530 viral genomes discovered between 01/01/2016 and 10/01/2021 from the 6 viral ID families. The OOD testing data set contains 407 viral genomes discovered between 01/01/2016 and 10/01/2021 from the 20 OOD viral families.

Table 5.1: The viral families used in the viral genomic data set for the ID and OOD classes. *unclassified contains viruses that are currently not able to be classified into any existing families.

  ID:  Autographiviridae, Microviridae, Myoviridae, Inoviridae, Podoviridae, Siphoviridae
  OOD: Ackermannviridae, Bicaudaviridae, Chaseviridae, Cystoviridae, Demerecviridae, Drexlerviridae, Fuselloviridae, Globuloviridae, Herelleviridae, Leviviridae, Lipothrixviridae, Lipothrixviridae, Pleolipoviridae, Rountreeviridae, Rudiviridae, Salasmaviridae, Schitoviridae, Sphaerolipoviridae, Tectiviridae, Tristromaviridae, unclassified*

5.2.1.2 The plasmid data set

For the plasmid data set, the 6 ID classes are defined based on 6 plasmid genera. The OOD classes are defined as 20 randomly chosen plasmid genera other than those ID classes. The specific names of the plasmid genera used as ID and OOD classes are summarized in Table 5.2. Since we have enough plasmid data for many plasmid genera, we reuse the genus level, the same as for the bacterial data set, for defining plasmid ID and OOD classes. The ID training data set contains 236 plasmid genomes discovered before 01/01/2016 from the 6 plasmid ID genera. The ID testing data set contains 217 plasmid genomes discovered between 01/01/2016 and 10/01/2021 from the 6 plasmid ID genera. The OOD testing data set contains 365 plasmid genomes discovered between 01/01/2016 and 10/01/2021 from the 20 OOD plasmid genera.

Table 5.2: The plasmid genera used in the plasmid genomic data set for the ID and OOD classes.

  ID:  Bacillus, Escherichia, Klebsiella, Rhizobium, Rhizobium, Yersinia
  OOD: Aeromonas, Caballeronia, Citrobacter, Enterobacter, Enterococcus, Haemophilus, Lacticaseibacillus, Lactobacillus, Leisingera, Mycoplasma, Photobacterium, Piscirickettsiaceae, Proteus, Pseudomonas, Ralstonia, Raoultella, Serratia, Shigella, Staphylococcus, Vibrio

5.2.2 Bayesian information criterion for estimating the Markov order

We use the Bayesian information criterion (BIC) [83] to estimate the Markov order of the testing sequences. Let $w = w_1 w_2 \cdots w_{k+1}$ denote a word of nucleotides of length $k+1$, $N_w$ denote the number of occurrences of $w$ in a given sequence, $w^- = w_1 w_2 \cdots w_k$ denote word $w$ with the last letter removed, and $N_{w^-}$ be the number of occurrences of $w^-$. When the candidate order is 0, $N_{w^-}$ degenerates to the sequence length $D$. Then we can estimate the Markov order of the sequence using BIC as follows:

$$\mathrm{BIC}(k) = -2\sum_{w} N_w \log\frac{N_w}{N_{w^-}} + (C-1)\,C^{k}\log(D-k+1), \qquad \hat{r}_{\mathrm{BIC}} = \arg\min_k \mathrm{BIC}(k),$$

where $C$ denotes the alphabet size, which is 4 for nucleic acid sequences.
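The order estimation above can be written as a short function. The sketch below is a direct transcription of the BIC formula for a single contig over a small set of candidate orders; it is meant as an illustration rather than the exact implementation used in the dissertation.

```python
import math
from collections import Counter

def estimate_markov_order_bic(seq, max_order=3, alphabet_size=4):
    """Estimate the Markov order of one contig with BIC.

    BIC(k) = -2 * sum_w N_w * log(N_w / N_{w-}) + (C - 1) * C^k * log(D - k + 1),
    where w runs over observed words of length k + 1 and w- drops the last letter.
    """
    D = len(seq)
    bics = {}
    for k in range(0, max_order + 1):
        words = Counter(seq[i:i + k + 1] for i in range(D - k))
        # Prefix counts over the same starting positions; for k = 0 the prefix count is D.
        prefixes = Counter(seq[i:i + k] for i in range(D - k)) if k > 0 else None
        loglik = 0.0
        for w, n_w in words.items():
            n_prefix = prefixes[w[:-1]] if k > 0 else D
            loglik += n_w * math.log(n_w / n_prefix)
        bics[k] = -2.0 * loglik + (alphabet_size - 1) * alphabet_size ** k * math.log(D - k + 1)
    return min(bics, key=bics.get)   # candidate order with the smallest BIC
```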
5.2.3 Generative models based on Markov chains achieve lower prediction accuracy than LSTM

We investigate the prediction accuracy of the max-LL and MLR-OOD methods proposed in this chapter when the corresponding generative models are changed from LSTM to Markov chains. The other parts of these two methods remain the same as in the main text. We fit the Markov chain models using the 250bp bacterial sequences in the training data set, and evaluate the performances of max-LL and MLR-OOD based on the fitted Markov models instead of LSTM on bacterial testing sequences of the same length.

For the generative models based on Markov chains, we first estimate the Markov order of each sequence in each ID training class based on BIC. It is found that the estimated orders for most 250bp bacterial sequences in the training data set are 0, i.e., independent and identically distributed (i.i.d.). Therefore, we simply estimate the fractions of nucleotides $f_k(A)$, $A \in \{A, C, G, T\}$, using all the sequences in each ID training class $k$. Once we obtain the fitted Markov chain model, we calculate the likelihood of a testing sequence $x$ belonging to ID class $k$, evaluated under the fitted Markov chain model for that class and denoted as $p_{\theta_k}(x)$, as follows:

$$p_{\theta_k}(x) = \prod_{A \in \{A,C,G,T\}} f_k(A)^{N_A(x)},$$

where $N_A(x)$ denotes the number of occurrences of nucleotide $A$ in testing sequence $x$.

As shown in Figure 5.4, LSTM markedly outperforms Markov chains as the generative model for max-LL and MLR-OOD. For both the Test2016 and Test2018 data sets, the AUROC and AUPRC based on Markov chains are around 0.6 while those based on LSTM are at 0.8-0.9, suggesting that the genomic sequences are better fitted by the more sophisticated LSTM model than by the simple Markov chain model. Therefore, although Markov chains are more interpretable than LSTM, we only present our LSTM model and the corresponding results in the main text.

Figure 5.4: The prediction accuracies of max-LL and MLR-OOD on the bacterial 250bp sequences in the Test2016 and Test2018 data sets based on two different generative models: Markov chains and LSTM. (A) and (B): AUROC and AUPRC for the bacterial Test2016 data set. (C) and (D): AUROC and AUPRC for the bacterial Test2018 data set. Each bar shows the mean accuracy of 30 random repetitions for each method and a particular sequence length. Error bars indicate the standard deviation.
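As a concrete illustration of the order-0 (i.i.d.) class-conditional likelihood used in Section 5.2.3 above, the minimal sketch below evaluates the log-likelihood of a contig under estimated nucleotide fractions; working in log space avoids numerical underflow for long contigs, and the dictionary-based input format is our own choice.

```python
import math
from collections import Counter

def iid_log_likelihood(contig, nucleotide_freqs):
    """log p_{theta_k}(x) = sum_A N_A(x) * log f_k(A) for an order-0 Markov model.

    nucleotide_freqs: dict mapping 'A', 'C', 'G', 'T' to the fractions f_k(A)
    estimated from the training sequences of ID class k.
    """
    counts = Counter(contig)
    return sum(counts[base] * math.log(nucleotide_freqs[base])
               for base in "ACGT" if counts[base] > 0)
```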
5.2.4 The distribution of estimated Markov orders of the testing sequences in each data set

We summarize the distribution of estimated Markov orders for each testing data set in Tables 5.3-5.6. The most commonly estimated Markov order for each contig length and data set was used for calculating the Markov chain likelihoods of the corresponding testing sequences. In general, more contigs are estimated to have higher orders when the contig length increases.

Table 5.3: The distribution of the estimated Markov orders for the bacterial sequences in the Test2016 data set. The testing data set for each contig length contains 10k ID and 10k OOD contigs. We show the number of contigs that are estimated by BIC to have a specific order.

  ID/OOD   Contig length   Order 0   Order 1   Order 2   Order 3
  ID       250bp             9,827       173         0         0
  ID       500bp             9,435       562         3         0
  ID       1,000bp           8,595     1,392        13         0
  ID       2,500bp           3,420     6,517        62         1
  ID       5,000bp             988     8,226       784         2
  OOD      250bp             9,898       101         1         0
  OOD      500bp             9,394       605         1         0
  OOD      1,000bp           7,041     2,953         6         0
  OOD      2,500bp           2,374     7,574        51         1
  OOD      5,000bp             524     8,569       902         5

Table 5.4: The distribution of the estimated Markov orders for the bacterial sequences in the Test2018 data set. The testing data set for each contig length contains 10k ID and 10k OOD contigs. We show the number of contigs that are estimated by BIC to have a specific order.

  ID/OOD   Contig length   Order 0   Order 1   Order 2
  ID       250bp             9,883       117         0
  ID       500bp             9,749       250         1
  ID       1,000bp           9,314       686         0
  ID       2,500bp           5,078     4,922         0
  ID       5,000bp             204     9,796         0
  OOD      250bp             9,922        77         1
  OOD      500bp             9,822       178         0
  OOD      1,000bp           9,492       508         0
  OOD      2,500bp           5,873     4,127         0
  OOD      5,000bp             406     9,594         0

Table 5.5: The distribution of the estimated Markov orders for the sequences in the viral testing data set. The testing data sets for contig lengths 250bp, 500bp, and 1,000bp contain 10k ID and 10k OOD contigs. The testing data set for contig length 2,500bp contains 5k ID and 5k OOD contigs. The testing data set for contig length 5,000bp contains 3k ID and 3k OOD contigs. We show the number of contigs that are estimated by BIC to have a specific order.

  ID/OOD   Contig length   Order 0   Order 1
  ID       250bp             9,946        54
  ID       500bp             9,975        25
  ID       1,000bp           9,985        15
  ID       2,500bp           4,987        13
  ID       5,000bp           2,967        33
  OOD      250bp             9,937        63
  OOD      500bp             9,980        20
  OOD      1,000bp           9,990        10
  OOD      2,500bp           4,992         8
  OOD      5,000bp           2,992         8

Table 5.6: The distribution of the estimated Markov orders for the sequences in the plasmid testing data set. The testing data set for each contig length contains 10k ID and 10k OOD contigs. We show the number of contigs that are estimated by BIC to have a specific order.

  ID/OOD   Contig length   Order 0   Order 1   Order 2
  ID       250bp             9,981        19         0
  ID       500bp             9,960        40         0
  ID       1,000bp           9,922        78         0
  ID       2,500bp           8,665     1,335         0
  ID       5,000bp           2,431     7,569         0
  OOD      250bp             9,970        28         2
  OOD      500bp             9,923        77         0
  OOD      1,000bp           9,971       229         0
  OOD      2,500bp           7,008     2,992         0
  OOD      5,000bp             871     9,129         0

5.2.5 The L2 coefficient λ does not markedly affect LLR in the viral and plasmid experiments

In Chapter 3, we compare MLR-OOD and max-LL with LLR [143]. While our methods do not need to tune model-parameters, LLR has two model-parameters: the perturbation rate µ and the L2 coefficient λ [143]. Due to the lack of validation data sets for the viral and plasmid data sets, the comparison in the main text is based on λ = 0 and multiple values of µ = 0.1, 0.15, and 0.2. We find that the LLR method [143] is not sensitive to the model-parameter λ: Figure 5.5 shows that the prediction accuracies using λ = 1e-4 are highly consistent with the results based on λ = 0 shown in Figure 3.4 of the main text. Therefore, in the main text, we present the results based on λ = 0. In conclusion, whether based on λ = 1e-4 or λ = 0, LLR achieves much lower prediction accuracy than MLR-OOD and max-LL on both the viral and plasmid data sets.

5.2.6 The distribution of the minimum Mash distance of OOD testing genomes to ID classes

In Chapter 3, we choose different threshold cutoffs of the minimum Mash distance [128] of OOD testing genomes to the ID classes for bacterial, viral, and plasmid sequences, which is due to the distinct distributions of the minimum Mash distance for these three types of sequences. We show the distributions in Figure 5.6. As shown therein, most OOD bacterial testing genomes have minimum Mash distances to ID training genomes at around 0.3 to 0.5, while most viral genomes scatter around 0.5 to 0.65.
The minimum Mash distances of the plasmid OOD testing genomes generally have a more even distribution. These observations explain our choices of the cutoff thresholds for the different types of sequences in Section 3.3.4 of the main text.

Figure 5.5: The accuracies of LLR with model-parameter λ = 1e-4 and different values of µ = 0.1, 0.15, and 0.2 on the viral and plasmid data sets. (A) and (B): AUROC and AUPRC for the viral data set. (C) and (D): AUROC and AUPRC for the plasmid data set. The numbers in the parentheses indicate the other model-parameter µ in LLR. Each bar shows the mean accuracy of 30 random repetitions for each method and a particular sequence length. Error bars indicate the standard deviation.

Constraining the OOD testing data set on the minimum Mash distance leads to fewer genomes being eligible for inclusion in the OOD testing data sets. As a result, for Figure 3.6 of the main text, we only have 3k OOD testing sequences of length 1,000bp for the viral data set when the Mash distance threshold is 0.65, and 5k OOD testing sequences when the threshold is 0.6. For the plasmid data set, we only have 1k OOD testing sequences of length 1,000bp when the Mash distance threshold is 0.7. In the other scenarios we have 10k OOD testing sequences, the same as the case where there are no restrictions on the Mash distance.

Figure 5.6: The histograms of the minimum Mash distance of OOD testing genomes to the ID classes of bacterial, viral, and plasmid genomes. The number of sketches is specified as 1,000,000 for calculating the Mash distance.

Bibliography

[1] Vishal Agarwal, N Reddy, and Ashish Anand. "Unsupervised representation learning of dna sequences". In: arXiv preprint arXiv:1906.03087 (2019).
[2] Jiyoung Ahn, Rashmi Sinha, Zhiheng Pei, Christine Dominianni, Jing Wu, Jianxin Shi, James J Goedert, Richard B Hayes, and Liying Yang. "Human gut microbiome and risk for colorectal cancer". In: Journal of the National Cancer Institute 105.24 (2013), pp. 1907–1911.
[3] Sajia Akhter, Ramy K Aziz, and Robert A Edwards. "PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies". In: Nucleic Acids Research 40.16 (2012), e126–e126.
[4] Hagai Almagor. "A Markov analysis of DNA sequences". In: Journal of Theoretical Biology 104.4 (1983), pp. 633–645.
[5] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. "Basic local alignment search tool". In: Journal of Molecular Biology 215.3 (1990), pp. 403–410.
[6] Rudolf I Amann, Wolfgang Ludwig, and Karl-Heinz Schleifer. "Phylogenetic identification and in situ detection of individual microbial cells without cultivation". In: Microbiological Reviews 59.1 (1995), pp. 143–169.
[7] Anders F Andersson and Jillian F Banfield. "Virus population dynamics and acquired virus resistance in natural microbial communities". In: Science 320.5879 (2008), pp. 1047–1050.
[8] Jonathan Arnold, A Jamie Cuticchia, David A Newsome, W Wesley Jennings III, and Robert Ivarie.
Abstract
Microbial communities are ubiquitous in diverse environments and play important roles in many processes, human health in particular. A microbial community comprises many types of microbes and genetic elements, including bacteria, viruses, plasmids, and fungi. To analyze the composition and diversity of such communities, metagenomics offers an alternative to laboratory culturing, which remains infeasible for the majority of microbial species. In metagenomics, DNA fragments from all microbes in an environmental sample are extracted and sequenced together. Although this approach captures the diversity of the community, the organism of origin of each sequence is lost in the process. Many computational methods have therefore been proposed for deciphering metagenomic sequences, including assembly, binning, and taxonomic classification.
This dissertation presents two novel computational methods that address reproducibility and reliability in the analysis of metagenomic sequences. Chapter 2 presents KIMI, a method for identifying sequence motifs (k-mers) that distinguish two classes of sequences while controlling the false discovery rate (FDR). By adapting the model-X knockoffs framework to this problem, KIMI achieves high power and FDR control simultaneously. A real-data application further shows that the prediction accuracy of viral contigs improves when KIMI serves as a pre-screening step for identifying relevant k-mers. Chapter 3 presents MLR-OOD, a maximum-likelihood-ratio-based method for detecting out-of-distribution (OOD) sequences, which can bias the results of machine-learning-based metagenomic sequence classifiers. Combining the likelihood of a long short-term memory (LSTM) generative model with a Markov chain likelihood, MLR-OOD achieves markedly higher prediction accuracy than competing methods on bacterial, viral, and plasmid data sets. We conclude that MLR-OOD can greatly reduce false positives caused by OOD sequences in metagenomic sequence classification.
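To make the likelihood-ratio idea behind MLR-OOD concrete, the Python sketch below scores a DNA sequence by the difference between its per-base log-likelihood under a trained in-distribution generative model (for example an LSTM, passed in as a callable) and its per-base log-likelihood under a simple fixed-order Markov chain background. This is a minimal illustration under stated assumptions, not the MLR-OOD implementation: the function names fit_markov_chain and ood_score, the choice of fitting the background chain to training sequences, and the per-base normalization are all assumptions made for this sketch.

import math
from collections import defaultdict

ALPHABET = "ACGT"

def fit_markov_chain(train_seqs, order=2, pseudocount=1.0):
    # Estimate P(next base | preceding `order` bases) from training sequences,
    # adding a pseudocount so every transition has nonzero probability.
    counts = defaultdict(lambda: defaultdict(float))
    for seq in train_seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0
    probs = {}
    for context, base_counts in counts.items():
        total = sum(base_counts.values()) + pseudocount * len(ALPHABET)
        probs[context] = {b: (base_counts.get(b, 0.0) + pseudocount) / total
                          for b in ALPHABET}
    return probs

def markov_log_likelihood(seq, probs, order=2):
    # Average per-base log-likelihood of seq under the fitted Markov chain;
    # contexts never seen in training fall back to a uniform distribution.
    total, n = 0.0, 0
    for i in range(order, len(seq)):
        p = probs.get(seq[i - order:i], {}).get(seq[i], 1.0 / len(ALPHABET))
        total += math.log(p)
        n += 1
    return total / max(n, 1)

def ood_score(seq, in_dist_loglik, background_probs, order=2):
    # Likelihood-ratio score: per-base log-likelihood under the in-distribution
    # generative model minus that under the background Markov chain; lower
    # scores suggest the sequence is out-of-distribution.
    return in_dist_loglik(seq) - markov_log_likelihood(seq, background_probs, order)

In practice one could fit the background chain on the classifier's training contigs and flag test contigs whose score falls below a threshold chosen on validation data, thereby filtering likely OOD inputs before classification.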