Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Computational algorithms and statistical modelings in human microbiome analyses
(USC Thesis Other)
Computational algorithms and statistical modelings in human microbiome analyses
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
COMPUTATIONAL ALGORITHMS AND STATISTICAL
MODELINGS IN HUMAN MICROBIOME ANALYSES
by
Zifan Zhu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulllment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
May 2022
Copyright 2022 Zifan Zhu
Dedication
I dedicate this dissertation to my devoted parents, Xinli Peng and Xiangyang Zhu, and to my
beloved girlfriend, Jingwen Ren.
ii
Acknowledgements
First and foremost, I would like to thank my advisor, Dr. Fengzhu Sun, for his great mentorship
and guidance of my Ph.D. journey over the past six years. Dr. Sun is a role model to me because
of his passion in research, attention to details, and perseverance through challenges. He always
respects individual research interests and spares no eort to nd a project that satises and suits
both sides. Outside of scientic research, Dr. Sun provides me with valuable suggestions in career
planning. I really appreciate that he values my research ability and keeps pushing me towards
academia. I will always bear his precious words in mind as I embark on a new journey.
I would like to thank my Ph.D. qualication and dissertation committee members, Dr. Peter
Calabrese, Dr. Liang Chen, Dr. Yingying Fan, Dr. Sonia Michail, and Dr. Michael Waterman for
their constructive suggestions and unreserved help to guide my research.
I would like to thank Dr. Yingying Fan and Dr. Jinchi Lv for leading me to the amazing eld
of statistical inference. I am really glad to work under their guidance on research projects that
explore the mystery of the knockos framework and have applications to important problems in
computational biology.
I would like to thank Dr. Sonia Michail for sharing her expertise in pediatric diseases and
their associations with microbiome. Her biological insights were invaluable to our collaborations
and truly broadened my horizons.
iii
I would like to thank Dr. Minping Qian and Dr. Hongzhe Li for introducing me to the mag-
nicient world of computational biology. I am really grateful for the opportunity to do research
under their guidance when I was still an undergraduate. I would not have achieved what I have
today without their tremendous help and support.
I would like to thank my collaborators, Dr. Yingying Fan, Dr. Yinfei Kong, Dr. Jinchi Lv, Dr.
Sonia Michail, and Dr. Jie Ren for their great support in joint publications.
I would like to thank professors in QCB, Dr. Peter Calabrese, Dr. Mark Chaisson, Dr. Liang
Chen, Dr. Doc Edge, Dr. Georey Fudenberg, Dr. Vsevolod Katritch, Dr. Adam Maclean, Dr. Remo
Rohs, and Dr. Andrew Smith for teaching me knowledge of dierent branches in computational
biology.
I would like to thank all my labmates at Sun lab: Dr. Mengge Zhang, Dr. Han Li, Dr. Yang Lu,
Dr. Jie Ren, Dr. Fang Zhang, Dr. Kujin Tang, Dr. Weili Wang, Xin Bai, Siliangyu Chen, Yilin Gao,
Tianqi Tang, Beibei Wang, Yuxuan Du, Dallace Francis, Jiawei Huang, and Wenxuan Zuo. I would
also like to thank other friends at USC: Mezmur Belew, Brendon Cooper, Robel Dagnew, Bida Gu,
Wei Jiang, Yibei Jiang, Dr. Jinsen Li, Tsung-Yu Lu, Raktim Mitra, Dandan Peng, Keon Rabbani, Ji
Youn Seo, Bo Sun, Amal Thomas, George Wang, Yingfei Wang, Xiaojun Wu, Qifan Yang, Quentin
Yang, Qingyang Yin, Yuxiang Zhan and many others. I am truly grateful for meeting you all and
we shared so many good memories over the past six years.
Last but most importantly, I would like to thank my parents, Xinli Peng and Xiangyang Zhu,
and my girlfriend, Jingwen Ren for their unwavering love and endless support. I am really grateful
that they are always by my side throughout the whole journey.
iv
TableofContents
Dedication ii
Acknowledgements iii
ListofTables viii
ListofFigures x
Abstract xii
Chapter1: Introduction 1
1.1 Research problems in human microbiome analysis . . . . . . . . . . . . . . . . . . 4
1.1.1 Predicting disease status based on human microbial proles . . . . . . . . 4
1.1.2 Identifying disease associated microbes from human microbiome proles 7
1.1.3 Investigating the gut virome change of ulcerative colitis patients with
successful fecal microbiota transplantation . . . . . . . . . . . . . . . . . . 9
1.2 Other authors and contributors to the dissertation . . . . . . . . . . . . . . . . . . 10
Chapter2: MicroPro: Usingmetagenomicunmappedreadstoprovideinsightsinto
humanmicrobiotaanddiseaseassociations 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 MicroPro: a pipeline of predicting phenotypes based on metagenomic data 17
2.2.2.1 Step 1: Reference based known microbial abundance charac-
terization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2.2 Step 2: Estimating abundance proles of unknown microbial
organisms based on reads assembly followed by contig binning 17
2.2.2.3 Step 3: Predicting phenotypes using Random Forests . . . . . . 20
2.2.3 Reference based and de-novo assembly based methods . . . . . . . . . . . 21
2.2.4 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.5 Predicting phenotypes based on virus abundance proles . . . . . . . . . 23
2.2.5.1 Step 1: Known viral abundance extraction . . . . . . . . . . . . 23
2.2.5.2 Step 2: Unknown viral feature detection . . . . . . . . . . . . . 23
v
2.2.5.3 Step 3: Predicting phenotypes based on viral abundance . . . . 24
2.2.6 Alpha diversity analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.7 Signicantly associated microbial organisms for each disease . . . . . . . 24
2.2.8 Predictive study between the two T2D data sets . . . . . . . . . . . . . . . 24
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1 MicroPro: a metagenomic disease-related prediction analysis pipeline
taking unmapped reads into consideration . . . . . . . . . . . . . . . . . . 25
2.3.2 Comparison between MicroPro, reference based method and de-novo
assembly based method on simulated data set . . . . . . . . . . . . . . . . 26
2.3.3 Application of MicroPro to four real metagenomic data sets . . . . . . . . 28
2.3.4 Analysis of the role of unknown viruses in virus-only prediction study . . 30
2.3.5 Alpha diversity analysis of the abundance proles of both microbial
organisms and viruses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.6 Signicantly associated microbial organisms for each disease . . . . . . . 33
2.3.7 Taxonomic assignments of the MAGs generated in four data sets . . . . . 34
2.3.8 Prediction analysis between two T2D data sets . . . . . . . . . . . . . . . 35
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Chapter 3: DeepLINK: Deep learning inference using knockos with applications
togenomics 41
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Deep learning based knockos inference . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.1 Variable selection with false discovery rate (FDR) control . . . . . . . . . 44
3.2.2 Model settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.3 The model-X knockos framework . . . . . . . . . . . . . . . . . . . . . . 48
3.2.4 DeepLINK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.4.1 Part 1: autoencoder for knockos construction . . . . . . . . . . 52
3.2.4.2 Part 2: MLP for feature selection . . . . . . . . . . . . . . . . . . 53
3.3 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 Simulation designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.2 Parameter settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.3 Neural network settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.4 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3.4.1 Simulation results with the linear factor model . . . . . . . . . . 58
3.3.4.2 Simulation results with the nonlinear factor model . . . . . . . 59
3.3.4.3 Robustness of DeepLINK to dierent activation functions . . . . 62
3.4 Real data applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4.1 Application to a microbiome data set . . . . . . . . . . . . . . . . . . . . . 66
3.4.2 Application to a murine single-cell RNA-seq data set . . . . . . . . . . . . 70
3.4.3 Application to a human single-cell RNA-seq data set . . . . . . . . . . . . 72
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
vi
Chapter 4: A metagenomic analysis to investigate the gut virome change of pedi-
atric ulcerative colitis patients with successful fecal microbiota trans-
plantation 76
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2.1 Sample collection and shotgun metagenomic sequencing . . . . . . . . . . 78
4.2.2 Known viral prole characterization . . . . . . . . . . . . . . . . . . . . . 79
4.2.3 Novel viral prole characterization . . . . . . . . . . . . . . . . . . . . . . 79
4.2.4 Principal coordinates analysis of viral proles . . . . . . . . . . . . . . . . 80
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.1 Virome proles shift towards those of the donors in responders after FMT 80
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Chapter5: Conclusionsandfuturework 84
5.1 Future work for the MicroPro project . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Future work for the DeepLINK project . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3 Future work for the FMT project . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Chapter6: Supplementarymaterials 89
6.1 Supplementary materials for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Supplementary materials for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.1 Computational time and memory analysis of DeepLINK . . . . . . . . . . 90
6.2.2 Signal strength measure in nonlinear models . . . . . . . . . . . . . . . . 91
6.2.3 Robustness of DeepLINK to the number of neurons in autoencoder’s
bottleneck layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.4 MSLE works well with other choice of nonlinear functions . . . . . . . . . 96
6.2.5 DeepLINK performs well with highly correlated features . . . . . . . . . . 97
6.2.6 Robustness of DeepLINK to the estimation of the error matrix . . . . . . . 99
6.2.7 Comparison of DeepLINK with random forests and IPAD . . . . . . . . . 101
6.2.8 Optimal training and validation split . . . . . . . . . . . . . . . . . . . . . 104
Bibliography 114
vii
ListofTables
2.1 Four large-scaled metagenomic data sets spanning three dierent diseases . . . . 16
2.2 Wall time and memory use for individual methods applied to the simulated data set 28
2.3 Wall time and memory use for individual methods applied to four real
metagenomic data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Number of assembled contigs from unmapped reads and number of detected
viral contigs by VirFinder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Summary of signicantly associated microbes for each data set . . . . . . . . . . . 34
3.1 Neural network parameter settings . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 Mean and standard error of the misclassication error rates for the microbiome
data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3 Top20 most selected microbial species whend = 30 for the microbiome data set . 69
3.4 Mean and standard error of the misclassication error rates for the murine
single-cell RNA-seq data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.5 Top 20 most selected genes whend = 200 for the murine single-cell RNA-seq
data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.6 Mean and standard error of the misclassication error for the human single-cell
RNA-seq data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.7 Top 20 most selected genes whend = 200 for the human single-cell RNA-seq
data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.1 Computational time and memory used by DeepLINK under the linear factor model 91
viii
6.2 MLE estimates of,, and for three real data sets . . . . . . . . . . . . . . . . . 100
6.3 Comparison of the mean and standard error of the misclassication error on the
test data for the three real data sets . . . . . . . . . . . . . . . . . . . . . . . . . . 103
ix
ListofFigures
2.1 Histograms of mapping rates of each data set . . . . . . . . . . . . . . . . . . . . . 18
2.2 Procedures of microbial abundance characterization in MicroPro . . . . . . . . . . 26
2.3 Results of simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Prediction results on four real metagenomic data sets . . . . . . . . . . . . . . . . 29
2.5 Prediction results on four real metagenomic data sets only using viral abundances 31
2.6 Cumulative probability of the alpha diversity . . . . . . . . . . . . . . . . . . . . . 33
2.7 Histograms of pairwise Mash distances for all the complete bacteria genomes in
the Centrifuge database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.8 Prediction analysis between two T2D data sets . . . . . . . . . . . . . . . . . . . . 36
3.1 The autoencoder architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2 The DeepPINK architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3 Comparisons between DeepLINK and IPAD in simulation settings with the linear
factor model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 Simulation results of DeepLINK in settings with the additive quadratic factor model 60
3.5 Simulation results of DeepLINK in settings with the logistic factor model . . . . . 61
3.6 Linear factor model simulation results of DeepLINK using dierent activation
functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.7 Additive quadratic factor model simulation results of DeepLINK using dierent
activation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
x
3.8 Logistic factor model simulation results of DeepLINK using dierent activation
functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.1 PCoA of known viral abundances based on metagenomic sequence data from the
four responders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2 PCoA of novel virus contig bin abundances based on metagenomic sequence
data from the four responders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.1 Cumulative probability of alpha diversity of known prole . . . . . . . . . . . . . 89
6.2 Cumulative probability of alpha diversity of unknown prole . . . . . . . . . . . . 90
6.3 Plot ofjh
0
(x
1
;A)j as a function ofA whenx = 0:1, 0.5, 2, and -1 . . . . . . . . . . 92
6.4 Linear factor model simulation results of DeepLINK with dierent choice on the
number of bottleneck neurons ^ r . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5 Additive quadratic factor model simulation results of DeepLINK with dierent
choice on the number of bottleneck neurons ^ r . . . . . . . . . . . . . . . . . . . . 94
6.6 Logistic factor model simulation results of DeepLINK with dierent choice on
the number of bottleneck neurons ^ r . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.7 Simulation results of DeepLINK in settings with the cubic link function . . . . . . 97
6.8 The boxplots ofv
1
(A),v
2
(B), andv
3
(C) over 100 repetitions under the linear
factor model and linear link function setting . . . . . . . . . . . . . . . . . . . . . 99
6.9 Q-Q plots of
^
E for three real data sets . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.10 Comparison of empirical FDR and power in a simulation study when the error
distribution is misspecied in the knocko variable generation . . . . . . . . . . . 101
6.11 The objective function in (6.12) as a function off . . . . . . . . . . . . . . . . . . 113
xi
Abstract
Trillions of microbes inhabit human body and play an important role in metabolism, reproduc-
tion, immune system activity, and development of disorders and diseases. Starting from the 19th
century, human microbiome analysis mainly relies on culture based methods. However, our view
of the microbial world was greatly limited since a large fraction of microbes were not cultivable.
With the rapid development and reduced cost of the sequencing technology, metagenomic stud-
ies coupled with the 16S rRNA sequencing and the whole genome shotgun (WGS) sequencing
markedly expand our understanding of the microbial community and bring the human micro-
biome analysis to a new era. Great technological revolutions come with great challenges. Many
important scientic questions in the eld of human microbiome await new ways of thinking and
new types of analysis methodology. In this dissertation, we present our eorts to address several
research problems in human microbiome anlayses using computational algorithms and statistical
modelings.
xii
Chapter1
Introduction
Microbes, including bacteria, archaea, viruses, protists and fungi, exist everywhere on earth: soil,
ocean, human body and many other enviroments [55, 60]. They play important roles in plant
productivity [160], marine biogeochemical cycles [55], and human well-beings [97]. The total
number of microbes in human body is estimated to be at least 10 times more than the aggregate
of human somatic and germ cells, and the total number of microbial genes is at least 100 times
more than that of human genes [59, 61]. It is also estimated that 500 to 1,000 bacteria species
coexist in the human body at any time, and this number can grow to 10,000 or 100,000 if we con-
sider subspecies [59]. The term “human microbiome” refers to the entire community of microbes
inhabiting human body together with their genes and “human microbiome analysis” refers to the
research study in human microbiome diversity, microbe-microbe interactions, host-microbe in-
teractions and many other aspects. Human microbiome analyses over the past three decades have
demonstrated that microbes are essential to many physiological processes including metabolism,
reproduction, immune system activity [17, 100], and the development of human disorders and dis-
eases such as allergies [18], obesity [154], diabetes [74, 124], inammatory bowel disease [107],
liver cirrhosis [125], and cancer [26, 80, 142, 170].
1
Human microbiome analyses have witnessed dramatic changes with technological revolu-
tions. In the beginning, humans had little to no knowledge of microbes’ existence since they
were not visible to naked eyes. Thanks to the invention of the microscope by Leeuwenhoek
about 350 years ago, the mystery of the microbial world began to unveil. However, our view
of microbes was still constrained due to the fact that the simple morphological classication
of microbes was infeasible even under the microscope [113]. In the late 19th century, the ad-
vent of culture-based techniques made the individual study of microbes possible. However, it is
estimated that 20% to 60% of human related microbes are uncultivable [120]; thus culture-based
techniques could only investigate a limited number of microbes and failed to provide insights into
the entire human microbiome as well as the interactions between hosts and microbes involved.
Starting from the mid 20th century, the development of the sequencing technology has signi-
cantly enhanced our capability to analyze the human microbial community. The rst generation
of sequencing technology, Sanger sequencing [129], was based on the selective incorporation
of chain-terminating dideoxynucleotides during in vitro DNA replication. This technology was
broadly used in the 2001 Human Genome Project and is still applicable to the low-throughput
DNA sequencing nowadays [97]. The classic Sanger sequencing had its limitations including low
throughput, low sensitivity, and low scalability. The advances in the next generation sequencing
(NGS) technologies overcame these limitations and brought the human microbiome analysis into
a new era. NGS is cost-eective, supports high-throughput, and requires relatively little techni-
cal expertise [97]. Major NGS platforms include Roche 454 sequencing, Illumina sequencing, Ion
Torrent sequencing, etc. Moreover, due to the rapid development and decreasing cost of the se-
quencing technology, “metagenomics”, a culture-independent method that characterizes genetic
2
materials directly from the environmental samples, is widely used in all kinds of microbial anal-
yses including the investigation of the human microbiome. The metagenomic analysis facilitates
a more comprehensive understanding of the microbial community and the dynamic interactions
therein.
Human microbiome analysis during the NGS era mainly relies on the 16S rRNA sequenc-
ing and the whole genome shotgun (WGS) sequencing to characterize the microbial taxonomy.
WGS sequencing is more expensive compared with the 16S rRNA sequencing, but provides much
higher resolution and can even allow strain level detections [116]. Both sequencing technolo-
gies, especially WGS, produce data with enormous size, making the downstream data mining
and data analysis quite demanding. Many bioinformatics tools have been developed to extract
useful and meaningful information from raw sequences, including Blast [96], MetaPhlAn2 [153],
Centrifuge [76], MEGAN [68], and Kraken2 [165] for taxonomic assignment, metaSPAdes [108],
Megahit [90], and Minia3 [29] for sequence assembly, CONCOCT [4], MetaBAT2 [73], MaxBin2
[166], and COCACOLA [94] for contig binning, and so on. However, many scientic problems in
the eld of human microbiome remain challenging, such as the phenotype prediction based on
microbial proles, the detection of phenotype associated microbes, etc. In this dissertation, we
present our works to tackle these important human microbiome problems using computational
algorithms and statistical modelings.
3
1.1 Researchproblemsinhumanmicrobiomeanalysis
1.1.1 Predictingdiseasestatusbasedonhumanmicrobialproles
Predicting the disease status, or more generally predicting the phenotype, based on microbial
proles is a very important problem in human microbiome analysis. Many studies have demon-
strated the associations between human microbiome and disease [26, 74, 80, 107, 124, 125, 142,
170]. Building disease prediction framework based on microbial proles will expand our un-
derstanding of host-microbe interactions, help the diagnosis and treatment of diseases, and ulti-
mately benet public health [97]. There are usually two steps to tackle this problem: chracter-
izing microbial proles from raw sequenced reads and building a prediction model based on the
extracted microbial proles.
In microbiome studies, a microbial prole is often represented by an abundance matrix, where
each row refers to a sample, each column refers to a microbe, and each(i; j) entry refers to the
abundance of ith microbe in jth sample. Researchers generally use two types of methods to
estimate the abundances from the raw WGS reads: reference based method and de-novo assem-
bly based method. Reference based method employs sequence alignment algorithms to map raw
reads against a microbial reference genome database, such as NCBI RefSeq [122], or a catalogue of
taxon-associated marker sequences, such as the one used in MetaPhlAn2 [153]. Microbial abun-
dances can then be estimated from the mapping results. On the other hand, de-novo assembly
based method estimates the microbial abundances without the help of the reference or marker
sequences. Instead, it characterizes metagenomic assembled groups (MAGs) by assembling the
raw WGS reads into contigs and then binning the contigs into MAGs. De-nove assembly based
method calculates the abundance proles of the MAGs based on the mapping of raw WGS reads
4
against MAGs. It is worth noting that both methods estimate abundances based on the mapping
results. Many algorithms have been proposed to compute the abundances. A widely used one is:
Abundanceof ithmicrobe=MAG =
rc
i
P
N
j=1
rc
j
; (1.1)
whereN is the total number of microbes/MAGs, andrc is the length normalized read counts, de-
ned as the number of raw WGS reads mapped to the microbe/MAG divided by its genome length.
However, both methods have some issues. For reference based method, many reads cannot be
mapped to any references/markers, which leads to a potential loss of valuable information. Take
the commonly used NCBI RefSeq database as an example, our experiments showed that about half
of the reads on average are unmapped [173]. De-nove assembly based method has the potential to
capture microbes that are not in the database, which solves the main issue of the reference based
method. But sequence assembly used in the de-novo assembly based method is very expensive
in terms of computational overhead, posing serious challenges for studies with high sequencing
depths or research groups without powerful computing resources.
Machine learning models, such as LASSO [151], random forests [23], support vector machine
(SVM) [65], articial neural networks [101], are widely used for the human disease status pre-
diction task based on microbial proles. In the eld of machine learning, a supervised predictive
study aims to build models based on sets of features to maximally approximate the response
value (regression) or correctly classify the label (classication) of a sample. In the setting of mi-
crobial prole based disease prediction, abundances of dierent microbes serve as the features
and disease status of a sample serves as the response. Disease status is naturally a binary vari-
able: disease or healthy; thus, a supervised classication model should be trained. Examples of
5
building machine learning classication models for microbiome based disease prediction can be
found in [51, 115, 121, 125, 150, 164, 170].
In Chapter 2, we present MicroPro [173], a software that characterizes microbial abundance
proles from both mapped and unmapped reads. MicroPro is a hybrid of reference based and
de-novo assembly based methods. It rst performs sequence alignment of raw NGS reads against
the NCBI RefSeq microbial reference database and then creates MAGs from the unmapped reads
pooled from all samples. Known and unknown microbial abundances are computed separately
and then combined into a single abundance matrix used for follow-up analyses. Here, “unknown""
microbes means those not present in the NCBI RefSeq database and are analyzed through MAGs.
MicroPro addresses the issues of reference based and de-novo assembly methods. On one hand,
MicroPro is able to capture “unknown” microbes through MAGs, avoiding the information loss in
unmapped reads. On the other hand, MicroPro only performs sequence assembly on unmapped
reads, which greatly reduces the computational costs. We applied MicroPro to four publicly avail-
able WGS metagenomic data sets that are associated with three diseases: colorectal cancer [170],
type 2 diabetes [74, 124], and liver cirrhosis [125]. We built random forests classication models
based on the microbial prole characterized by MicroPro to predict disease status and demon-
strated that including the unknown microbial prole signicantly improved the prediction accu-
racy.
6
1.1.2 Identifyingdiseaseassociatedmicrobesfromhumanmicrobiome
proles
Another important human microbiome problem is to identify disease associated microbes from
microbial proles. Given that there are at least tens of thousand unique microbial species inhab-
iting human body, it is generally believed that only a small proportion of them are truly related
to the development of a certain disease. Successful identication of such small set of disease as-
sociated microbes will greatly benet the diagnosis and treatment of the corresponding disease
and avoid unnecessary eorts put into irrelevant microbes. Many methods have been proposed
to tackle this problem. One group of methods rely on machine learning models with the abil-
ity of feature selection. For example, the linear model withL
1
penalization, LASSO [151], gives
some features zero weights, so that the set of features with nonzero weights can be considered
associated with the response. One example of using LASSO in disease related human microbiome
analysis can be found in [170]. However, we do not know how reliable the feature selection is,
since this type of method does not provide a bound of the feature selection error rate. Another
group of methods perform the marginal association analysis coupled with multiple testing cor-
rection. For example, Karlsson et al. [74] computed the wilcoxon rank sum testp-value for each
microbial species, adjusted thep-values by the Benjamini-Hochberg procedure [13, 14] for false
discovery rate (FDR) control, and identied type 2 diabetes related species with adjustedp-values
smaller than 0.05. Although this type of method provides the bound of feature selection error rate
in terms of FDR, it only investigates the marginal association and ignores the joint dependence
of features and the response.
7
In the eld of feature selection, many methods for FDR control have been proposed, including
Benjamini-Hochberg procedure [13, 14], Benjamini-Yekutieli procedure [16], local FDR [41],q-
value [144], Adaptive BH procedure [15], p-value weighting [58], independent hypothesis weight-
ing (IHW) [69], adaptive p-value thresholding (AdaPT) [87], and structure-adaptive Benjamini-
Hochberg algorithm (SABHA) [89]. Most of these existing methods achieve the FDR control
under the assumption that valid p-values can be calculated. However, having valid p-values can
become a luxury in the high-dimensional big data settings. Taking the generalized linear mod-
els (GLMs) as an example, when the feature dimensionalityp diverges with sample sizen at a
rate ofn
2=3
or faster, the classical asymptotic theory of MLE no longer applies. Consequently,
the resulting p-values calculated using the formula from the classical asymptotic theory becomes
invalid [48]. To overcome this challenge, Barber and Candès [10] introduced the xed-X knock-
os framework bypassing the use ofp-values to achieve the FDR control in the Gaussian linear
model whenp is smaller thann=2. A few years later, Candès et al. [25] proposed the model-X
knockos framework which achieves theoretically guaranteed FDR control in arbitrary dimen-
sions and for arbitrary dependence structure of the response on features. Thanks to the exibility
of model-X knockos, it was recently extended to the setting of deep neural networks in [95] via
proposing a new network architecture, DeepPINK, when the features have joint Gaussian distri-
butions. However, the joint Gaussian assumption limits the practical applicability of DeepPINK.
A new method for deep learning inference using knockos that can be used with more general
distributional assumptions is greatly needed.
Latent factor models, which use lower-dimensional unobservable factors to model the co-
movements of features and discover principal sources of variations, have been well studied and
8
broadly used in statistics [53, 157, 171], sociology [70, 140], bioinformatics [5, 19, 52], and eco-
nomics [38, 43, 134, 135, 156]. The most commonly used factor model assumes a linear relation-
ship between the feature vector and latent factors. However, in real applications such as the hu-
man microbiome analysis, linear factor models are often oversimplied and a more complicated
nonlinear factor model should be used. Recently, the rapid development of deep learning has
motivated the nonlinear factor models described by the architecture of the autoencoder, which
brings new possibilities to the applications of factor models to real world problems.
In Chapter 3, we present DeepLINK [172], a deep learning based feature selection framework
using knockos. DeepLINK combines the exible nonlinear factor modeling power of the au-
toencoder with the feature selection and prediction power of DeepPINK. It considers the joint
dependence of features and the response and enables feature selection with FDR control and
high power. We demonstrate the superior performance of DeepLINK in various simulation de-
signs. DeepLINK has a broad range of real world applications including disease related human
microbiome analysis. We applied DeepLINK to a metagenomic NGS data set related to colorectal
cancer (CRC) and discovered many microbial species that were also reported to have important
associations with CRC in the previous literature.
1.1.3 Investigating the gut virome change of ulcerative colitis patients
withsuccessfulfecalmicrobiotatransplantation
Ulcerative colitis (UC) is characterized by chronic inammation of the colon. It carries a signif-
icant disease burden in children, since 20%–25% of all cases are present during childhood [155].
Although immunomodulator and biologic therapies have positive eects on some patients, many
9
children still develop severe disease and even colon cancer [88, 110, 155]. Fecal microbiota trans-
plantation (FMT), a useful tool for manipulating the gut microbiome, has been considered eec-
tive in some adults with UC [103, 114, 128], since the rst FMT report of UC patients in 1989
[22]. However, relatively few trials have taken place testing FMT in children with UC. To our
knowledge, the trial described by Kunde et al. [82] is the largest study to date investigating FMT
in children with ulcerative colitis. In order to better understand the change of gut viral commu-
nity of four pediatric UC patients with successful FMT, we perform a metagenomic analysis and
demonstrate that the virome proles in UC patients after successful FMT shifted towards those
of the donors. Detailed methods and results are provided in Chapter 4.
1.2 Otherauthorsandcontributorstothedissertation
Dr. Fengzhu Sun, Dr. Liang Chen, Dr. Yingying Fan, Dr. Jinchi Lv, Dr. Sonia Michail, Dr. Jie Ren,
and Dr. Yinfei Kong.
10
Chapter2
MicroPro: Usingmetagenomicunmappedreadstoprovide
insightsintohumanmicrobiotaanddiseaseassociations
In this chapter, we present MicroPro, a metagenomic data analysis pipeline that takes into ac-
count all reads from known and unknown microbial organisms. We utilize MicroPro to ana-
lyze four metagenomic data sets relating to colorectal cancer, type 2 diabetes, and liver cirrho-
sis and show that including reads from unknown organisms signicantly increases the predic-
tion accuracy of the disease status for three of the four data sets. We identify new microbial
organisms associated with these diseases and show viruses play important prediction roles in
colorectal cancer and liver cirrhosis, but not in type 2 diabetes. MicroPro is freely available
at https://github.com/zifanzhu/MicroPro and https://doi.org/10.5281/zenodo.3336360 under the
GNU General Public License (GPL), version 3.
2.1 Introduction
Trillions of microbes populate various sites of the human body and form microbiome communi-
ties [131]. These microorganisms and their interactions between each other and the host play an
11
important role in many physiological processes including metabolism, reproduction, and immune
system activity [17, 100]. In the nineteenth century, culture-based methods demonstrated that
changes in these microbes might lead to disease. Since then, many subsequent studies conrmed
these ndings [37]. However, the cultivation technology only provided a limited view since many
microorganisms could not be cultured in vitro [84]. Over the past twenty years, and thanks to the
rapid development of sequencing technology, sequencing-based methods have gradually replaced
the cultivation technology and have become the most widely used tools for microbial analysis.
The 16S ribosomal RNA sequencing together with the recent whole genome shotgun sequencing
not only discover large amounts of non-cultivable microbes, but also fundamentally change the
way microbial analysis is performed [148, 30]. Researchers are now nding more evidence cor-
relating human microbiota with various diseases such as colorectal cancer [170], type-2 diabetes
[74, 124], liver cirrhosis [125] and many others. In addition, human microbiota has been linked
to the eectiveness of cancer chemo-therapy [174]. In some studies, a single species or strain is
associated with a disease while in other cases, groups of microorganisms interact to aect human
health [78].
Mounting evidence connecting the microbiome with disease description has gradually brought
about the concept of a supervised predictive study of microorganisms for dierent diseases. Al-
though most of the studies are merely observational, which means we cannot simply conclude
the causality between microbes and the disease [30], the existing correlations are sucient to
prove that performing a predictive study about the eect of microbiota on diseases is plausible.
More specically, many advances in this area have made it possible to predict the existence or
states of a certain disease given information of the microorganisms for a specic subject.
12
In the eld of machine learning, a supervised predictive study aims to build models based
on sets of features to maximally approximate the response value or correctly classify the label
of a sample. In the microbiota-disease setting, the response can either be disease/non-disease or
dierent subtypes within a disease; thus a classication version of supervised predictive study is
desired [77]. However, selection of features varies greatly among dierent studies. Our study is
focused on analyzing the microbial abundance in the context of whole genome shotgun sequenc-
ing. Similar analysis can also be applied to other choices of the feature including operational
taxonomic units (OTUs, widely used in 16S rRNA analysis) [133], NCBI non-redundant Clus-
ters of Orthologous Groups (COG) [149] or Kyoto Encyclopedia of Genes and Genomes (KEGG)
groups [72]. With many software packages like MetaPhlAn2 [153] or Centrifuge [76] tackling
the computation of the microorganisms’ abundance, the microbiota-disease predictive study can
be formulated as a machine learning task based on a species-by-sample matrix with qualitative
labels.
Recently, many studies have focused on the predictive analysis between human microbiota
and diseases. For example, Zeller et al. [170] developed a species abundance based LASSO [151]
model to dierentiate between colorectal cancer patients and healthy individuals. Qin et al. [125]
used gene markers to predict liver cirrhosis based on a Support Vector Machine (SVM) [65]. More-
over, Pasolli et al. [115] built a database named curatedMetagenomicData, which stored uniformly
processed microbiome analysis results across 5,716 publicly available shotgun metagenomic sam-
ples. Using this database, Pasolli et al. developed a Random Forest [23] model to analyze the
predictive power of dierent microbial features (such as species abundance, pathway coverage,
etc.) on various diseases.
13
However, the current available approaches face a few challenges. First, in microbiome studies,
there are generally two types of methods for microbial abundance characterization from metage-
nomic data sets: reference based methods and de-novo assembly based methods. Many reference
based methods involve the process of mapping short reads against known microbial reference se-
quences in the NCBI RefSeq database [122] or a catalogue of taxon-associated marker sequences
[153]. Microbial abundances can be estimated from the mapping results. However, a large pro-
portion of the reads cannot be successfully mapped to a particular reference, which results in
potential loss of valuable information. On the other hand, de-novo assembly based methods do
not need any reference genomes or marker sequences. These methods create metagenomic as-
sembled groups (MAGs) by rst assembling the reads into contigs, then binning the metagenomic
contigs into MAGs, and nally estimating the abundance proles of the MAGs. For example, Xing
et al. [168] and Ren et al. [126] both identied microbial species in the metagenomic data sets
through de-novo assembling reads into contigs and then binning contigs into MAGs, and an-
alyzed disease association with the relative abundance of the MAGs. De-novo assembly based
methods have the potential to capture microbes without reference genomes; thus solving the
main problem of the reference based methods. However, de-novo assembly based methods also
have their own issues. Sequence assembly is computationally expensive and takes a lot of time
and memory. For example, Minia 3 [29] took 53 hours and 63 GB memory to perform de-novo
assembly while reference based method, Centrifuge [76], completed the mapping in less than 2
hours and used 4 GB memory on the same machine for the QinJ_T2D data set.
Secondly, the roles of viruses in diseases are often neglected. Within human microbial com-
munity, bacteria reads constitute the majority while virus reads are reported as a small propor-
tion of the total reads (less than 5% in data sets analyzed in our study). Additionally, incomplete
14
database of viral reference genomes and the high mutation rates of viruses make them even more
challenging to characterize and analyze [21]. Therefore, most disease-related microbiome stud-
ies focus only on the connection between bacteria and the disease. However, learning about
viruses is important as the number of viruses is about 10 times that of bacteria [24], and they
can play important roles in multiple diseases. Norman et al. [107] showed that enteric virome
change happened in patients with inammatory bowel disease and bacteriophages might serve
as antigens in human immune system. Ren et al. [126] demonstrated that decreased viral diver-
sity was observed in patients with liver cirrhosis as compared to healthy individuals. Reyes et
al. [127] identied disease-discriminatory viruses associated with childhood malnutrition, which
might help to characterize gut microbiota development. Therefore, the role of viruses on human
diseases should be investigated.
In order to overcome the challenges mentioned above, we developed a metagenomic pre-
dictive pipeline, MicroPro, which analyzes data in three main steps: 1). Reference based known
microbial abundance characterization: perform taxonomic proling based on sequence alignment
against reference genomes. 2). Assembly-binning-based unknown organism feature extraction:
use cross-assembly to assemble the combined unmapped reads from all samples and consider
each assembled contig as originated from an ‘unknown’ organism, which refers to an organism
with no known references available in the database. Since some contigs may originate from the
same organism, we cluster assembled contigs into bins and then treat each bin as an ‘unknown’
organism. 3). Machine learning predictive analysis: apply machine learning tools to predicting
disease/non-disease or disease states based on the species-by-sample matrix. To the best of my
knowledge, this is the rst predictive pipeline based on a combination of both known and un-
known microbial organisms. We tested MicroPro on four public NGS data sets and showed that
15
consideration of unknown organisms signicantly increased the prediction accuracy for three of
the four data sets. Furthermore, we systematically investigated the eect of viruses on multiple
diseases with the virus version of MicroPro. We examined the predictive power of the model
with known and unknown viruses, and showed that unknown viruses played an important role
in disease prediction warranting further attention.
2.2 MaterialsandMethods
2.2.1 Datasets
We downloaded all the data sets using the links provided in the original papers [74, 124, 125, 170].
All the data sets used in this study are publicly available from the European Nucleotide Archive
(ENA) database (https://www.ebi.ac.uk/ena). Accession number for Zeller_CRC is ERP005534
[170], for Karlsson_T2D is ERP002469 [74], for QinN_LC is ERP005860 [125], and for QinJ_T2D
is SRA045646 [124]. The number of cases and controls are given in Table 2.1. For Zeller_CRC,
the “small adenoma” samples were treated as controls while the “large adenoma” samples were
removed.
Data set name Sample size Number of cases Number of controls Data size (Gbp)
Zeller_CRC 184 91 93 915
Karlsson_T2D 96 53 43 296
QinJ_T2D 145 71 74 376
QinN_LC 237 123 114 1200
Table 2.1: Four large-scaled metagenomic data sets spanning three dierent diseases. Detailed information
of the four metagenomic data sets analyzed in this study is provided.
16
2.2.2 MicroPro:apipelineofpredictingphenotypesbasedonmetagenomic
data
2.2.2.1 Step1: Referencebasedknownmicrobialabundancecharacterization
We used Centrifuge [76] to map the reads to the microbial genomes and calculated the abundance
proles of known microbial organisms from the metagenomic data. In terms of Centrifuge com-
mand, we set ag “-q” which indicated the input was in fastq format and the other arguments
were set as default. Centrifuge is an alignment-based taxonomic proling tool. Its microbial
database contains all the available bacterial, viral, and archaeal complete reference genomes in
NCBI (up to Jan 4, 2018). Centrifuge also utilizes an Expectation-Maximization (EM) algorithm to
compute the abundance for each microbial species. This EM based algorithm is similar in spirit
as those used in Cuinks [152], Sailsh [117] and GRAMMy [167]. It takes into account reads
mapped to multiple genomes or multiple locations in the same genome. In our study, we adopted
the species abundance calculated by Centrifuge as the known microbial feature.
2.2.2.2 Step2: Estimatingabundanceprolesofunknownmicrobialorganismsbased
onreadsassemblyfollowedbycontigbinning
Although Centrifuge accurately characterizes known microbial relative abundance proles, a
large fraction of reads cannot be mapped to the known microbial organisms. The average map-
ping rate for each data set is about 35% - 40% in our study (Figure 2.1). The large amount of
unmapped reads can potentially provide extra information on the prediction accuracy of pheno-
types based on the metagenomic data. Therefore, our main objective in this step is to take into
account the unmapped reads for phenotype prediction.
17
Figure 2.1: Histograms of mapping rates of each data set. Dashed lines show the mean mapping rate.
18
After ltering out mapped reads from the metagenomic data, we performed cross-assembly on
the unmapped reads from all samples. We tested two assemblers: Megahit [90] and Minia 3 [29] in
this step. Megahit assembles large and complex metagenomic data de novo based on succinct de
Bruijin graph. Minia 3 utilized a more space-ecient bloom lter to perform sequence assembly.
As shown in Section 2.3.3, Megahit performed better in real data analysis in terms of prediction
but required much more computing time and memory than Minia 3. After cross-assembly, we
used MetaBAT 2.12.1 [73] to perform binning on the assembled contig set. MetaBAT 2.12.1 is a
reference-free metagenomic binner and its binning criterion is based on tetranucleotide frequency
and mean base coverage. This “reference-free” feature is crucial to our study, since the contig set
to be binned contained no reads that could be mapped to a known reference. Recent comparative
studies on contig binning [102] showed that MetaBAT 2.12.1 performs well compared to other
contig binning algorithms.
Reads assembly and contig binning are highly important to recover unknown organisms from
the unmapped reads. Here, ‘unknown organisms’ represents the organisms without a known
reference. Once we nished cross-assembly and metagenomic binning, we treated each contig
bin as an unknown organism and the binned reads as a part of its genome. In terms of dening
the feature of the unknown organisms, we still used the relative abundance, just as what we did
for known species. The formula of the relative abundance (Ab) of unknown organismi was:
Ab(i) =
rc
i
P
N
j=1
rc
j
; (2.1)
where rc was the length normalized read counts, which was dened as the number of reads
mapped to that organism divided by its genome length. Here calculatingrc was a major issue,
19
since we do not know the whole genome of the unknown organism. To overcome this challenge,
we rst mapped all the unmapped reads back to the contig set using BWA-aln [92] with parameter
“-n” set as 0.03 (only alignments with more than 97% accuracy were considered mapped). Then
we calculated the length normalized read counts (rc) for each contig according to the mapping
results. Finally, for each contig bin (i.e. each unknown organism), we took the averagerc of all
the contigs that belonged to it as an approximation of its realrc. We could compute the unknown
feature for all contig bins using the above formula. In terms of combining the known and un-
known abundances, we calculated the mapping rate (dened as the number of mapped reads /
the number of total reads) for each sample and multiplied the known and unknown abundances
by and1 respectively, so that the combined abundance table sums to one for each sample.
2.2.2.3 Step3: PredictingphenotypesusingRandomForests
In the above two steps, we extracted the relative abundance proles of both known and unknown
microbial organisms. We then trained a Random Forests [23] classication model based on the
combined abundance proles to dierentiate between the cases and the controls. Random Forests
is an ensemble of the decision tree algorithm and is highly robust to over-tting when the number
of features is greater than the number of samples. Our analysis was performed with R package
“randomForest”. We randomly separated the data set into training set and test set with ratio 7:3.
During model training, we used 10-fold cross-validation to tune the number of variables selected
at each split, which is the “mtry” argument of the randomForest function in R, for best predictive
performance. In terms of the measure of prediction accuracy, we adopted the Area Under the
receiver operating characteristic Curve (AUC) score, a widely used performance measure of the
classication model. An AUC score close to 1 indicated perfect classication, while a 0.5 AUC
20
score revealed that the model was close to a random guess. The above procedure was repeated
30 times.
2.2.3 Referencebasedandde-novoassemblybasedmethods
Reference based methods use a reference database to characterize microbial abundances. In this
study, the AUC scores for the reference based method were obtained by training a Random Forest
classication model based only on the Centrifuge abundance output (i.e. the known abundance
table in the MicroPro pipeline). De-novo assembly based methods generate metagenomic assem-
bled groups by assembly and binning of raw reads without the help of any reference genomes.
To compare its predictive performance with MicroPro, we implemented de-novo assembly based
method on all the four metagenomic data sets. We rst generated a cross-assembly of all the
metagenomic reads in a data set. Due to insucient computing memory, cross-assembling all
samples using Megahit was computationally infeasible. Thus we only used Minia 3 for cross-
assembly. After obtaining the assembled contigs, we performed metagenomic binning of the
assembled contigs by MetaBAT 2.12.1 and computed the contig bin abundances in the same way
as the MicroPro pipeline. The abundance prole of bins was used as features for the Random
Forests classication studies.
2.2.4 Simulationstudies
We performed simulation studies to compare the predictive performance of MicroPro, reference
based method and de-novo assembly based method. We simulated 50 shotgun metagenomic se-
quenced samples with 25 cases and 25 controls in the following way. To mimic the real human gut
21
microbial community, the abundance proles used in the simulation were modied based on the
known abundance table of the QinN_LC data set. In particular, we calculated the average relative
abundance of the microbes at genus level among all control samples and only kept the top 100 bac-
terial genera by the descending order of abundance. Then we divided this abundance vector by its
sum and treated it as the standard abundance prole of the control samples. For the case samples,
we randomly selected 10 microbes and multiplied their abundances byf
i
;i = 1; ;10, where
eachf
i
was sampled fromUniform(0:1;3). We renormalized the derived abundance vector to
sum to 1 and used it as the standard abundance prole of the case samples. We also introduced
absolute random Gaussian noise with mean zero and standard deviation equal to each component
to the standard abundance proles to further diversify the microbial composition of the simulated
samples. CAMISIM [54] was then used to generate 50 samples with Illumina 2×150 bp paired-end
reads based on the generated abundance proles. Each generated sample had size of 1GB (500
Mbp).
MicroPro with dierent assemblers Megahit and Minia 3 was tested on the simulated data sets.
Reference based method only used the Centrifuge abundance output as the feature of the classi-
cation study. For this simulated data set, we randomly picked 30 microbes out of 100 to generate
the reference genome database used in Centrifuge taxonomic proling. De-novo assembly based
method generated metagenomic assembled groups by assembly and binning of raw reads with-
out any reference genomes. We also tested two assemblers Megahit and Minia 3 for the de-novo
assembly based method. The Random Forest classication analysis was performed in the same
manner as Step 3 in the MicroPro pipeline. Since we used predetermined abundance proles to
simulate metagenomic reads, we obtained the ground truth AUCs with these abundance proles
input as the classication feature.
22
2.2.5 Predictingphenotypesbasedonvirusabundanceproles
Viruses play a very important role in human microbial community by controlling the balance
of dierent bacterial organisms. However, due to its relative low abundance, extraction of all
the viral information, especially those without a known reference, remains a major diculty.
Aimed at making full use of all the viral features within metagenomic samples, the virus version
of MicroPro is similar in spirit to the general Pipeline presented in Section 2.2.2, except for an
additional step for viral contig detection. The full pipeline is shown below.
2.2.5.1 Step1: Knownviralabundanceextraction
For the known viral abundance, we again used the software Centrifuge, but only extracted the
viral abundances from the Centrifuge proling output and treated it as the known viral feature.
2.2.5.2 Step2: Unknownviralfeaturedetection
We performed cross-assembly using Megahit on the unmapped reads ltered out by Centrifuge
results. Before metagenomic binning, we applied VirFinder [126] for viral contigs detection.
VirFinder utilized a logistic regression model to dierentiate between bacterial and viral con-
tigs. We considered a contig as virus if its VirFinder q-value is smaller than 0.2. Q-value [145]
is a p-value correction method targeting exact false discovery rate (FDR) control. We performed
metagenomic binning on the viral contigs and calculated viral bins’ abundance using the same
method as described in Section 2.2.2 Step 2.
23
2.2.5.3 Step3: Predictingphenotypesbasedonviralabundance
With both the known and unknown viral features at hand, the next step was to perform the
prediction analysis. We combined two viral features in the same way as in the general MicroPro
pipeline and trained a Random Forest model based on extracted viral abundance. We used 10-fold
cross-validation to tune the parameters and set AUC score as the measure of prediction accuracy.
2.2.6 Alphadiversityanalysis
Alpha diversity is a widely-used diversity measure in microbiome studies. It is dened based on
both the number of species within a sample and the abundance of each species. We performed
alpha diversity analysis of both microbial and viral abundance proles. Alpha diversity with
Shannon index is calculated by package “vegan” in R.
2.2.7 Signicantlyassociatedmicrobialorganismsforeachdisease
We identied the signicantly associated features by the Boruta feature selection method [83].
Boruta is an iterative algorithm to select all relevant features through statistical tests. The analysis
was carried out with R package “Boruta”.
2.2.8 PredictivestudybetweenthetwoT2Ddatasets
We trained a Random Forest model based on one of the T2D data sets and tested it on the other to
obtain the AUC score. Features included were also the known and unknown microbial abundance.
Obtaining the known feature was essentially the same procedure as MicroPro’s Step 1. We used
the following strategy to calculate the abundance proles of the unknown microbial organisms.
24
For the train set, we used MicroPro’s Step 2 with assembler Megahit to nd out the unknown
microbial feature. For the test set, instead of mapping back to its own contig set, we aligned the
unmapped reads in the test set against the train data contig set. In this way, we could obtain a
consistent feature matrix so that the following prediction analysis could be performed seamlessly.
2.3 Results
2.3.1 MicroPro:ametagenomicdisease-relatedpredictionanalysispipeline
takingunmappedreadsintoconsideration
We developed a new metagenomic analysis pipeline, MicroPro, to take into account both known
and unknown microbial organisms for the prediction of disease status. MicroPro consists of
three main steps: 1) Reference based known microbial abundance characterization; 2) Assembly-
binning-based unknown organism feature extraction; and 3) Machine learning predictive analy-
sis. Figure 2.2 presents the procedures to extract the abundance table of both known and unknown
microbial organisms. Various machine learning tools can then be applied to study the association
between microbial abundances and the disease. Detailed explanations of each step are available
in Sections 2.2.2 and 2.2.5.
25
Figure 2.2: Procedures of microbial abundance characterization in MicroPro.
2.3.2 Comparison between MicroPro, reference based method and de-
novoassemblybasedmethodonsimulateddataset
We simulated 50 metagenomic shotgun sequenced samples (25 cases and 25 controls) consisting
of bacteria from 100 genera. Each sample had size of 1GB (500 Mbp). The details of the simu-
lation setup are described in Section 2.2.4. We then tested MicroPro, and compared it with the
reference based method and the de-novo assembly based method on the simulated data set for
their prediction performance of disease status. The reference based method only used the known
microbial abundances produced in the rst step of MicroPro to perform the classication study.
On the other hand, the de-novo assembly based method skipped the rst step of MicroPro and
performed assembly and binning on the whole data set. The simulation study showed that the
predictive performance of the reference based method was signicantly lower than that of the
de-novo assembly based method and MicroPro, since reference based method only captured mi-
crobes within the reference database which possibly ignored other microbes important for the
26
Figure 2.3: Results of simulation studies. Boxplots of random forest AUC scores obtained using features
from dierent methods are provided. Each random forest classication model was repeatedly trained and
tested 30 times. Student’s t test p values between pairs of methods are given.
classication. De-novo assembly based method and MicroPro had similar performance in terms
of prediction, as they both used all the reads in the sample without the information loss encoun-
tered in the reference based method (Figure 2.3). However, in terms of computational cost, the
reference based method needed the fewest computing resources as sequence alignment was com-
putationally cheaper than assembly. Additionally, de-novo assembly based method required at
least twice the walltime and 1.5 times the memory compared to MicroPro. This result was not
unexpected since sequence assembly was the computational bottleneck for these two methods
and MicroPro only assembled unmapped reads while de-novo assembly based method assembled
all of them (Table 2.2). In summary, MicroPro performed better in prediction than reference based
method and required much fewer computing resources than de-novo assembly based method.
27
Wall time (min) MaxRSS (GB)
Reference-based 14 0.2
MicroPro (Megahit) 81 12
MicroPro (Minia 3) 43 6
De novo assembly-based (Megahit) 305 20
De novo assembly-based (Minia 3) 79 9
Table 2.2: Wall time and memory use for individual methods applied to the simulated data set. Computing
resources required for the individual methods used in the simulation study are provided. MaxRSS refers
to maximum memory used by the corresponding method.
Sczyrba et al. [137] showed that Megahit [90] and Minia 3 [29] were among the top assemblers
and produced contigs of similar quality in the Critical Assessment of Metagenome Interpretation
(CAMI) challenge. To compare these two assemblers, we tested Megahit and Minia 3 in the sim-
ulation study and found that they had similar performance in prediction (Figure 2.3) but Minia 3
was computationally more ecient than Megahit (Table 2.2).
2.3.3 ApplicationofMicroProtofourrealmetagenomicdatasets
We downloaded four publicly available shotgun-sequenced metagenomic data sets related to
three dierent diseases: Colorectal Cancer (CRC) [170], Type 2 Diabetes (T2D) [74, 124], and
Liver Cirrhosis (LC) [125] (Table 2.1).
We then analyzed these four data sets using MicroPro. We found that MicroPro signicantly
improved the prediction accuracy over reference based method in three of the four data sets
(Karlsson_T2D, QinJ_T2D and QinN_LC). This result uncovered the predictive value of the abun-
dance proles of unknown organisms that were commonly ignored by many reference based
metagenomic analysis pipelines (Figure 2.4A). We also compared MicroPro with de-novo assem-
bly based method. Due to insucient computing memory, we only used Minia 3 for de-novo
assembly. The prediction results showed that MicroPro (Minia 3) performed slightly better than
28
Figure 2.4: Prediction results on four real metagenomic data sets. SubplotA shows boxplots of random
forest AUC scores obtained by reference-based method and MicroPro (with assembler Megahit). Each
random forest classication model was repeatedly trained and tested 30 times. Student’s t test p values
are given. SubplotB shows boxplots of random forest AUC scores obtained by MicroPro and de novo
assembly-based method. Results of MicroPro with two dierent assemblers are shown. Each random
forest classication model was repeatedly trained and tested 30 times. Student’s t test p values between
pairs of methods are given.
de-novo assembly based method with the AUC increase being signicant in Zeller_CRC, QinN_-
LC and weakly signicant in Karlsson_T2D (Figure 2.4B). As in the simulation study, the de-novo
assembly based method was computationally more expensive than MicroPro (Table 2.3). More-
over, we compared the performance of MicroPro using two dierent assemblers: Megahit and
Minia 3. Results showed that MicroPro (Megahit) performed signicantly better than MicroPro
(Minia 3) in data sets Karlsson_T2D and QinJ_T2D and both had similar prediction accuracy in
the other two data sets (Figure 2.4B). Again Megahit required much more computing resources
than Minia 3 (Table 2.3). It suggests that for small data sets or with ample computing resources,
Megahit is a better choice over Minia 3 for real data. Unless specied, all the following analyses
are based on Megahit assembled contigs.
29
MicroPro (Megahit) MicroPro (Minia 3) De-novo assembly (Minia 3)
Walltime (h) MaxRSS (GB) Walltime (h) MaxRSS (GB) Walltime (h) MaxRSS (GB)
Zeller_CRC 212 334 36 46 71 92
Karlsson_T2D 161 192 18 19 49.5 39
QinJ_T2D 186 249.5 23 35 53 63
QinN_LC 351 563 59 91.5 85 134
Table 2.3: Computing resources required for each method on four real metagenomic data sets are provided.
MaxRSS refers to the maximum memory used by the corresponding method.
2.3.4 Analysis of the role of unknown viruses in virus-only prediction
study
To test the predictive power of the viral organisms within the microbial community, we applied
the virus version of MicroPro to all the four data sets. Although the prediction accuracy obtained
by the abundance proles of known viruses was much lower than that obtained by known mi-
crobial abundances including bacteria, adding the unknown feature signicantly improved the
prediction accuracy for data sets Zeller_CRC, QinJ_T2D and QinN_LC (Figure 2.5). For Zeller_-
CRC and QinJ_T2D, the role of unknown viruses was remarkable as they increased the average
AUC score from 0.55 to 0.72 and 0.56 to 0.65, respectively. For QinN_LC, the average AUC score
with known viruses was 0.73 which was much better than the other three data sets and the in-
clusion of unknown viral abundances further increased it to 0.80. These results highlight the
advantage of MicroPro to consider both known and unknown microbial organisms in metage-
nomic prediction study and further demonstrate the important association of viruses, especially
unknown viruses with multiple diseases.
30
Figure 2.5: Prediction results on four real metagenomic data sets only using viral abundances. Boxplots
of random forest AUC scores obtained using dierent viral features are provided. “Viral known” refers to
only using known viral abundances to perform the classication while “Viral combined” means using both
known and unknown viral abundances. Each random forest classication model was repeatedly trained
and tested 30 times. Student’s t test p values are given.
Data set name Number of contigs Number of viral contigs
Zeller_CRC 361629 188
Karlsson_T2D 208763 28
QinJ_T2D 184766 48180
QinN_LC 194120 11277
Table 2.4: Number of assembled contigs from unmapped reads and number of detected viral contigs by
VirFinder for each data set are provided.
31
On the other hand, we acknowledge that the increase in prediction accuracy for Karlsson_T2D
is weaker than the other three data sets. Considering the fact that there were only 28 unknown
viral contigs found for this data set (Table 2.4), the number of unknown viruses were too small
to play a major role in the prediction analysis hence the low AUC increment. However, in the
other T2D data set QinJ_T2D, much more viral contigs were discovered (Table 2.4), suggesting
that the detection of viral contigs can be data set-dependent with confounding factors like sample
collection method, shotgun sequencing protocols etc. aecting the generated metagenomic reads.
For prediction performance using both known and unknown viruses, QinN_LC (mean AUC =
0.80) and Zeller_CRC (mean AUC = 0.72) are much higher than Karlsson_T2D (mean AUC =
0.58) and QinJ_T2D (mean AUC = 0.65), which indicates the potential weaker prediction role of
viruses in T2D compared to the other two diseases.
2.3.5 Alphadiversityanalysisoftheabundanceprolesofbothmicrobial
organismsandviruses
We also performed alpha diversity analysis for both microbial and viral abundance proles in the
cases and controls. Figure 2.6 shows the results of using the abundance proles of both known
and unknown microbial organisms. Alpha diversity results based on the abundance proles of
only known or unknown organisms are provided in Chapter 6 Section 6.1 (Figures 6.1 and 6.2).
For microbial alpha diversity (Figure 2.6A), a consistent pattern of the case being less diverse
is observed. This pattern is most remarkable for QinN_LC, which corresponds to its high AUC
score when using microbial abundances to dierentiate between cases and controls (Figure 2.4A).
For the viral alpha diversity, we did not identify statistical signicant dierences between cases
32
Figure 2.6: Cumulative probability of the alpha diversity. Cumulative probability distributions of alpha
diversity with Shannon index are shown. Abundance proles of both known and unknown organisms are
used for the calculation. SubplotA uses the abundance proles of all the microbes while subplotB only
uses the abundance proles of viruses. P values based on the WMW test for the alpha diversity between
the cases and the controls are provided.
and controls for liver cirrhosis (QinN_LC) and type 2 diabetes (Karlsson_T2D, QinJ_T2D) at the
type I error of 0.05. Surprisingly, we discovered that the viral diversity in CRC cases is much
higher than that in the healthy controls, a nding consistent with the result from a recent study
of Nakatsu et al. [106] that analyzed the viromes in CRC cases and controls.
2.3.6 Signicantlyassociatedmicrobialorganismsforeachdisease
We explored the microbial organisms that were signicantly associated with a certain disease in
the metagenomic analysis. In our study, signicantly associated microbial organisms were se-
lected by the Boruta feature selection method [83]. Table 2.5 illustrates that a majority of the
selected microbes are unknown, further highlighting the advantage of our pipeline to charac-
terize unknown microbes from unmapped reads. We also discussed the novel microbe-disease
33
Data set name # Signicant microbes # Known # Unknown
Zeller_CRC 49 (2313) 8 (1287) 41 (1026)
Karlsson_T2D 25 (1379) 4 (785) 21 (594)
QinJ_T2D 21 (1411) 5 (925) 16 (486)
QinN_LC 68 (1442) 21 (936) 47 (506)
Table 2.5: Numbers of signicantly associated microbes for each data set are provided. “# Signicant
microbes”, “# Known” and “# Unknown” represent the number of selected signicant, known and unknown
microbes, respectively. Numbers shown in the parenthesis are the corresponding total count of microbes
associations discovered in this study (see Section 2.4). These discoveries can lay ground work for
future mechanistic understanding of the pathophysiology of the corresponding diseases.
2.3.7 TaxonomicassignmentsoftheMAGsgeneratedinfourdatasets
To further identify the taxonomic assignment of the MAGs derived in each data set, we calculated
the pairwise distance between each MAG and the reference genomes in the Centrifuge database
(up to Dec 10, 2018) with Mash v.2.0 [111], a widely used alignment-free genome comparison tool
based on the overlap of kmers between genomes. In order to determine the species and genus
level threshold for Mash distance, we used Mash v.2.0 with sketch size 1,000,000 and k-mer size of
21 to calculate pairwise Mash distances between all the 11,444 complete bacterial genomes in the
Centrifuge database (up to December 10, 2018). We then obtained the distributions of the pairwise
Mash distances for three dierent groups of genome pairs: A. both are from the same species; B.
the two bacterial genomes are from dierent species but the same genus; and C. the two bacterial
genomes are from dierent genera. We could observe clear separations between the three groups
of genome pairs under cuto of 0.05 and 0.34 (Figure 2.7). We found that 98.02% of group A
distances are below 0.05, 92.37% of group B distances are above 0.05 and below 0.34, and 91.27%
of group C distances are above 0.34. These results demonstrate that Mash distance thresholds
34
Figure 2.7: Histograms of pairwise Mash distances for all the complete bacteria genomes in the Centrifuge
database. “within_species”, “between_species_within_genus”, and “between_genus” represent three dif-
ferent groups of genome pairs depending on whether the two bacteria genomes are: A. from the same
species; B. from dierent species but the same genus; and C. from dierent genera. Mash v.2.0 with sketch
size 1,000,000 and k-mer size of 21 is used to calculate the pairwise distances.
of 0.05 and 0.34 can be reasonably utilized as species and genus level thresholds, respectively.
Using these thresholds, we found that 0.65% and 19.83% of the total MAGs could be assigned to
the species and genus levels, respectively.
2.3.8 PredictionanalysisbetweentwoT2Ddatasets
Although prediction within one study can give good results, prediction accuracy drops sharply
when applied to a dierent data set. Dierent experiment protocols, various sequencing plat-
forms, variable time points of data collection are all possible reasons for the drop in prediction
accuracy. In our study, there were two T2D data sets, which oered an opportunity to analyze the
generalization potential of the predictive model across dierent studies. As shown in Figure 2.8,
the AUC scores dropped markedly for both cases from above 0.75 to around 0.6 when compared
with the prediction within one study (Figure 2.4A). When using Karlsson_T2D to predict QinJ_-
T2D, adding the unknown feature seemed to have no eect on the prediction accuracy. However,
35
Figure 2.8: Prediction analysis between two T2D data sets. Boxplots of random forest AUC scores ob-
tained in the cross-study analysis are provided. “MicroPro known” refers to using only known microbial
abundance prole extracted by MicroPro as the feature while “MicroPro combined” refers to using both
known and unknown abundances. Each random forest classication model was repeatedly trained and
tested 30 times. Student’s t test p values are given.
in the other case, adding the unknown features signicantly increased the AUC scores suggesting
that in cross-study settings, adding unknown organisms can result in higher prediction accuracy.
2.4 Discussion
Many studies have described the development of computational tools to investigate the associa-
tion of microbial organisms with complex traits. However, most of the available reference based
tools focus on the microbial species with a known reference genome and the reads not mapped
to the known genomes are not considered, which can result in loss of potentially useful infor-
mation. Other de-novo assembly based methods demand signicant computing resources with
long computational time and large memory requirement. In order to address these issues, we de-
veloped the MicroPro pipeline that extracts both known and unknown microbial features within
metagenomic data sets. We tested MicroPro in a disease prediction study involving four public
36
metagenomic data sets covering three dierent diseases. We show that the prediction accuracy is
signicantly increased when adding unknown microbial features for three of the four data sets,
which demonstrates the important predictive role of unknown organisms. Additionally, since
MicroPro only assembles the unmapped reads, it is computationally much more ecient than
de-novo assembly based methods.
Many studies have demonstrated the important role of viruses in human diseases like inam-
matory bowel disease [107] and liver cirrhosis [126]. However, due to the limited virus genome
database and high mutation rates, viruses were often neglected in metagenomic association stud-
ies. The virus version of MicroPro aims at extracting both known and unknown viral features
from sequenced reads. We performed prediction analysis with viral abundances extracted by the
virus version of MicroPro on the same public metagenomic data sets. The results indicated that
viruses did play some roles in diseases like colorectal cancer and liver cirrhosis. Thus, the role
of viruses should not be ignored in metagenomic analysis. Also, for some data sets, like Zeller_-
CRC in our study, the power of predicting disease when using known virus only, was close to
random guess. However, the inclusion of unknown viral features remarkably increased the pre-
diction accuracy. This demonstrated that our pipeline was able to distinguish the role of viruses
by investigating unknown features.
We also discovered many novel microbial associations with specic diseases and disease pre-
diction. Some of these associations are consistent with what has been described in the past. We
discovered a number of organisms which were predictive of liver cirrhosis. These organisms in-
clude Veillonella parvula, Veillonella rodentium, Fusobacterium periodonticum, Lactobacillus sali-
varius and Selenomonas sp. oral taxon 136. These organisms frequently inhabit the oral cavity
and many are pathogenic. For example,Veillonellaparvula is a bacterium in the genusVeillonella.
37
Veillonella are Gram-negative bacteria anaerobic cocci. Veillonella parvula is well known for its
lactate fermenting abilities and inhabit the intestines and oral mucosa. In humans,Veillonella can
cause osteomyelitis, endocarditis, periodontitis and dental caries as well as various systemic infec-
tions [98]. Similarly, Fusobacterium is a genus of anaerobic, Gram-negative, non-spore-forming
bacteria, similar to Bacteroides. Although in the past, Fusobacterium was considered part of the
normal oral microbiome, the current consensus is thatFusobacterium should always be treated as
a pathogen [3] and has been linked to periodontal diseases, ulcerative colitis, and colon cancer.
These organisms originate from the mouth, but may also inhabit the intestine [175]. Even though
our model discovered new organism-associations for disease prediction, it has been shown that
the oral microbiota can inuence the gut microbiome and has been detected in the stools of pa-
tients with cirrhosis [125]. Chen et al. [27] described Veillonella and other oral microbiota as
discriminative taxa between patients with cirrhosis compared to controls. The permissive oral
microbial invasion may be related to altered hepatic bile production or the frequent use of proton
pump inhibitors in this population. Both bile and gastric acid are natural gates that can inhibit
the survival of many of the ingested organisms. Furthermore, bacterial populations originating
from the oral microbiota are capable of producing high levels of methyl mercaptan (CH3SH). El-
evated blood levels of CH3SH have been linked to the development of hepatic encephalopathy
[2]. The presence of both Dialister pneumosintes and Parvimonas micra was predictive of de-
velopment of colorectal cancer in our model. Dialister pneumosintes was found in patients with
periodontitis [34] and has been shown to have potential pathogenic roles in various human body
sites including the lung and brain [105]. It has been recently shown to be an important com-
ponent of the dysbiotic microbiome in patients with gastric cancer [31]. Parvimonas micra can
cause infectious endocarditis [62], native joint septic arthritis [6] and spondylodiscitis [158] and
38
has also been associated with gastric cancer [31]. Not only enrichment of specic organism was
predictive of colorectal cancer in our model, but we also report depletion of specic organisms,
such as Cutibacterium acnes, is seen in association with this type of cancer. While this organism
was originally described in subjects with acne, it can still be found throughout the digestive tract
[119] and was originally namedPropionibacteriumacnes for its ability to generate propionic acid
[39]. Propionic acid, among other short chain fatty acids (SCFA), contributes to the health of
colonocytes and has been shown to be depleted in colorectal cancer [109]. The discovery that
subjects with colorectal cancer harbor lessCutibacteriumacnes could potentially explain the pre-
vious reports of depletion of propionic acid in this population and may shed some light on the
pathophysiology of disease development.
We acknowledge that there are limitations of our pipeline. One potential issue of MicroPro is
under the situation that the core genomes of some microbes are present in the reference database
while their corresponding pan-genomes are not, MicroPro will report the core genome in the
known abundance prole and the remaining parts as separate unknown MAGs. This issue may
not be problematic for the prediction of disease using Random Forest as it can use one of the
abundance proles for phenotype prediction. However, caution is needed when the objective is
to identify the microbes signicantly associated with the disease since both the core genome and
the corresponding MAG could be reported as associations although they are actually from the
same genome.
We also acknowledge that although unknown features are extracted through assembly and
binning, more functional analysis is needed to further understand the roles of each bin in dis-
eases. Additionally, the disease prediction study is only observational and does not show the
causality between a certain or a group of microbes and diseases. Furthermore, though we only
39
tested MicroPro in disease related analysis, MicroPro is readily to be applied to any type of phe-
notype prediction metagenomic studies. By fully utilizing both known and unknown organisms
including viruses in microbiota, we expect MicroPro will help to largely improve the prediction
accuracy and facilitate biomarker detections.
In summary, MicroPro provides a highly useful tool to study the associations between micro-
biota and diseases without neglecting key information from unknown organisms. The microbial
prediction of disease can be useful in understanding disease pathogenesis and may become crucial
in laying ground work for future development of specic disease biomarkers.
40
Chapter3
DeepLINK:Deeplearninginferenceusingknockoswith
applicationstogenomics
In this chapter, we present DeepLINK, a deep learning based knockos inference framework that
guarantees the False Discovery Rate (FDR) control in high-dimensional settings. DeepLINK is
applicable to a broad class of covariate distributions described by the possibly nonlinear latent
factor models. It consists of two major parts: an autoencoder network for the knocko variable
construction and a multilayer perceptron network for feature selection with the FDR control. The
empirical performance of DeepLINK is investigated through extensive simulation studies, where
it is shown to achieve FDR control in feature selection with both high selection power and high
prediction accuracy. We also apply DeepLINK to three real data applications to demonstrate its
practical utility. DeepLINK is freely available at https://github.com/zifanzhu/DeepLINK under
the GNU General Public License (GPL), version 3.
41
3.1 Introduction
The era of big data gives us enormous new opportunities but meanwhile also produces unprece-
dented challenges in solving various data related problems. The challenges are not just because
of the large size of the data but also and even more caused by the complexity in, for example, text,
image, video, and audio data. As a result, complicated models such as deep neural networks have
been proposed and popularly used to analyze big data. Despite the appealing high prediction and
classication power of deep neural networks, there is strong push back from scientic commu-
nity because of its “black-box” nature. The complicated structure of many deep neural networks
has made the interpretation and reproducibility of such models incredibly dicult if even possi-
ble at all. To alleviate these issues, dimension reduction methods such as variable selection and
latent factor models have been used in statistics and related applications.
In the past decade, feature (variable) selection has been a central topic in statistics [42, 44].
Feature selection aims at identifying the truly important features that contribute to the eect of
some response of interest. One desirable property of feature selection methods is that the er-
ror rate of selecting incorrect features can be controlled at some preselected target level while
achieving high power. The celebrated procedure of Benjamini and Hochberg [13, 14] for the false
discovery rate (FDR) control has been shown to enjoy such property both theoretically and em-
pirically under some conditions of the p-values calculated for evaluating the feature importance.
Although a vast number of methods have been proposed for feature selection with the goal of con-
trolling error rate, such as the Benjamini–Yekutieli procedure [16], local FDR [41],q-value [144],
Adaptive BH procedure [15], p-value weighting [58], FDR regression [136], independent hypoth-
esis weighting (IHW) [69], adaptive shrinkage [143], adaptive p-value thresholding (AdaPT) [87],
42
and structure-adaptive Benjamini-Hochberg algorithm (SABHA) [89], very few can be used in
complicated models such as deep neural networks. The intrinsic diculties are that most ex-
isting methods were proposed under much simpler model settings that are dicult or not even
possible at all to generalize, or depend heavily on the p-values as the feature importance measure.
Such p-values can be calculated based on some classical or asymptotic theory in simpler models.
When we move away from these simple model settings to more complicated ones such as deep
neural networks, however, we no longer have the luxury of calculating theoretically justied p-
values, making feature selection highly challenging. Recently, Candès et al. [25] proposed a new
framework of model-X knockos for achieving the FDR control in feature selection, bypassing
the use of conventional p-values. Model-X knockos can be used as a wrapper by combining
with any feature selection methods that produce feature importance measures satisfying certain
conditions. We provide a brief review on the model-X knockos in a later section. Thanks to
the exibility of model-X knockos, it was recently extended to the setting of deep neural net-
works in [95] via proposing a new network architecture, DeepPINK, when the features have joint
Gaussian distributions. The distributional assumption of joint Gaussian limits the practical ap-
plicability of the proposed method therein. In this study, we explore more general distributional
assumptions for the feature vector and propose a new method for deep learning inference using
knockos, named as DeepLINK.
Latent factor models, which use lower-dimensional unobservable factors to model the co-
movements of features, have been well studied and broadly used in statistics [53, 157, 171], so-
ciology [70, 140], bioinformatics [5, 19, 52], and economics [38, 43, 134, 135, 156]. The most
commonly used factor model assumes a linear relationship between the feature vector and latent
factors. Since in practice, we can never be certain whether the dependency is truly linear, we
43
are likely to face the problem of model misspecication, making the statistical estimation and
inference results unreliable. Recently, the advent of deep learning has motivated the nonlinear
factor models described by the architecture of the autoencoder.
DeepLINK combines the exible nonlinear factor modeling power of the autoencoder with
the feature selection and prediction power of DeepPINK. The nonlinear factor model for the fea-
ture vector described by the autoencoder enables us to generate the knocko variables eectively
without imposing restrictive joint distribution assumptions (e.g., Gaussian) on features. The fea-
ture selection and prediction power of DeepPINK allows for interpretable and reproducible sta-
tistical inference without sacricing much power. It is worth mentioning that for the special
case when both the factor model and the regression model of response on features are linear,
the problem of model-X knockos inference was investigated in [49] via proposing a parametric
inference framework of IPAD. We demonstrate the superior performance of DeepLINK via sim-
ulations and three real data examples. Compared to IPAD, DeepLINK is more exible and more
robust to model misspecication, and meanwhile achieves comparable feature selection results
with generally higher power.
3.2 Deeplearningbasedknockosinference
3.2.1 Variableselectionwithfalsediscoveryrate(FDR)control
Consider the high-dimensional supervised learning with independent and identically distributed
(i.i.d.) observations(x
i
;y
i
),i = 1; ;n, wherex
i
= (x
i1
; ;x
ip
)
T
2R
p
is the feature vector
andy
i
2R is the scalar response. The number of featuresp can be comparable to or even larger
than the number of observationsn. Letf1;2; ;pg be the full set of all the features. Assume
44
that the conditional distribution of responsey
i
depends only on a small subset of features, and
we aim to nd the Markov blanket [118], i.e., the smallest subsetS
0
such thaty
i
is independent
of all remaining features given those inS
0
. That is,
y
i
? ?fx
ij
:j2S
c
0
gjfx
ik
:k2S
0
g; (3.1)
whereS
c
0
denotes the complement of subsetS
0
in the full setf1;2; ;pg. The existence and
uniqueness of the Markov blanket can be guaranteed under mild conditions on the joint distribu-
tion of(x
i
;y
i
). See the discussions in [25] for more details. For the ease of presentation, we refer
to features inS
0
the “true” features and those inS
c
0
the “null” features in future presentation.
The goal of our study is to identify true features while controlling the error rate under a
predetermined level. Various performance metrics have been proposed to measure the feature
selection error rate, such as the familywise error rate (FWER), k-FWER [66], false discovery pro-
portion (FDP) [86], and false discovery rate (FDR) [14]. Here, we adopt the widely used FDR
dened as
FDR :=E[FDP] withFDP :=
j
^
S\S
c
0
j
maxfj
^
Sj;1g
; (3.2)
where FDP represents the false discovery proportion,
^
S is the set of selected features using some
statistical procedure,jj means the cardinality of a set, and the expectation is taken with respect
to the randomness in
^
S. A modied version of FDR is dened as
mFDR :=E
j
^
S\S
c
0
j
j
^
Sj+q
1
; (3.3)
45
whereq2 (0;1) is the target FDR level. It is seen that FDR is more conservative than mFDR,
since controlling the FDR naturally results in the control of mFDR. We also use another important
performance measure, power, to investigate the capability of a statistical procedure in discovering
the true features. Formally speaking, power is dened as the expectation of the true discovery
proportion (TDP)
Power :=E[TDP] withTDP :=
j
^
S\S
0
j
jS
0
j
: (3.4)
A desirable inference framework should be able to control the FDR at a pre-chosen target level
and meanwhile achieve high power.
3.2.2 Modelsettings
We focus on the setting where the high-dimensional feature vector x
i
depends on some low-
dimensional latent factor vector f
i
2R
r
withr p in a potentially nonlinear fashion. Speci-
cally, assume the following factor structure forx
i
x
i
= g(f
i
)+
i
; i = 1; ;n; (3.5)
where g is a vector-valued function whose coordinates can take some nonlinear functional forms
that are unknown to us, and
i
2R
p
is the factor model error vector with i.i.d. components. We
make the additional assumption that the marginal distribution of the components of
i
is from
some parametric familyf
with unknown parameter 2 R
m
, wherem is some xed positive
integer.
46
When the coordinates of g are all linear functions, model (3.5) becomes the widely used latent
factor model in the literature, which we will refer to as the linear factor model to ease the pre-
sentation. Most existing works have been developed under the linear factor model assumption,
which can be restrictive in some applications. Our proposed method will use a data-adaptive way
to estimate the possibly nonlinear functiong.
We also assume that the responsey
i
depends onx
i
via the following nonparametric regression
model
y
i
=h(x
i
)+"
i
; i = 1; ;n; (3.6)
whereh is some unknown function and can be either linear or nonlinear, and"
i
’s are independent
model errors. For the ease of presentation, we will use matrix and vector notation by denoting
X = (x
1
; ;x
n
)
T
thenp design matrix, F = (f
1
; ;f
n
)
T
thenr matrix of factors, and
y = (y
1
; ;y
n
)
T
then-dimensional response vector. DeneC as annp matrix whose rows
areg(f
i
)
T
,i = 1; ;n. Then model (3.5) can be rewritten as
X = C+E; (3.7)
whereE is a matrix with theith row being
T
i
. Our goal can be specically stated as developing
a feature selection method with the FDR controlled at the target levelq under the exible model
settings (3.5) and (3.6).
47
3.2.3 Themodel-Xknockosframework
We will adopt the recently developed model-X knockos framework introduced in [25] to achieve
our goal of feature selection. For completeness, we give a brief review of the model-X knockos
framework below. We refer the readers to [25] for full details.
As discussed in the Introduction, various FDR control methods have been proposed since the
seminal work of the Benjamini–Hochberg procedure (BH for short hereafter). Most of these exist-
ing methods achieve the FDR control under the assumption that valid p-values can be calculated.
However, having valid p-values can become a luxury in the high-dimensional big data settings.
Taking the generalized linear models (GLMs) as an example, when the feature dimensionalityp
diverges with sample size n at a rate of n
2=3
or faster, the classical asymptotic theory of MLE
no longer applies. Consequently, the resulting p-values calculated using the formula from the
classical asymptotic theory becomes invalid. See [48] for formal results on such a phenomenon.
When more complicated models such as the random forests or deep neural networks are used,
how to calculate valid p-values for evaluating the feature importance is still an open question.
To overcome this diculty, Barber and Candès [10] introduced the xed-X knockos framework
bypassing the use ofp-values to achieve the FDR control in the Gaussian linear model whenp
is smaller than n=2. Recently, Candès et al. [25] proposed the model-X knockos framework
which achieves theoretically guaranteed FDR control in arbitrary dimensions and for arbitrary
dependence structure of response y on features x. These advantages motivate us to adapt the
model-X knockos framework to our model settings.
48
The salient idea of the model-X knockos is to construct the so-called “model-X knocko
variables” which perfectly mimic the dependence structure of the original variables but are condi-
tionally independent of the response. For completeness, we include the denition of the model-X
knocko variables introduced in [25] as follows.
Denition1 Model-X knocko variables for a set of random variables x = (X
1
; ;X
p
)
T
are a
new set of random variables ~ x = (
~
X
1
; ;
~
X
p
)
T
that satises the following properties:
(1). for any subsetSf1; ;pg,(x
T
;~ x
T
)
swap(S)
D
= (x
T
;~ x
T
), where(x
T
;~ x
T
)
swap(S)
is obtained
by swapping the componentsX
j
and
~
X
j
in (x
T
;~ x
T
) for eachj2 S and
D
= denotes equal in
distribution;
(2). ~ x??yjx.
The second property above is satised as long as ~ x is constructed without using the infor-
mation of responsey. To construct knocko variables that satisfy the rst property, we need to
know the joint distribution of x. When such distribution is available, [25] proposed a generic
algorithm SCIP for the knocko variable construction. When such information is unavailable,
there has been some recent work on the practical construction of knocko variables. See, for
example, [8, 11, 47, 49, 67, 138].
Denote by~ x
i
the vector of knocko variables forx
i
,i = 1; ;n, and let
~
X = (~ x
1
; ;~ x
n
)
T
.
For each j = 1; ;p, let W
j
be the knocko statistic dened for measuring the importance
of the jth original feature. Specically, W
j
is a function of the augmented data matrix [X;
~
X]
49
and the response vector y, i.e. W
j
=w
j
([X;
~
X]; y), withw
j
a function satisfying the “sign-ip”
property:
w
j
([X;
~
X]
swap(S)
; y) =
8
>
>
>
<
>
>
>
:
w
j
([X;
~
X]; y); j = 2S;
w
j
([X;
~
X]; y); j2S;
(3.8)
whereS can be any subset off1; ;pg. See the formal characterizations of the desired knocko
statistics as well as examples in [25]. Intuitively, valid knocko statistics measure the importance
of original features, with large positive ones indicating the original features being important, and
for unimportant features inS
c
, the correspondingW
j
’s are expected to have small magnitudes
and be symmetric around 0.
Finally, the set of important features is selected as
^
S =fj :W
j
tg witht =T ort =T
+
,
where T is the knocko threshold and T
+
is the knocko+ threshold as proposed in [25] and
included below for completeness:
T = min
t> 0 :
jfj :W
j
6tgj
maxfjfj :W
j
>tgj;1g
6q
; (3.9)
T
+
= min
t> 0 :
1+jfj :W
j
6tgj
maxfjfj :W
j
>tgj;1g
6q
: (3.10)
Here,
^
S is dened as an empty set ifT =1 orT
+
=1.
It has been formally shown in [25] that the knocko threshold controls the mFDR exactly
and the knocko+ threshold controls the FDR exactly at the nite-sample level, regardless of the
sample sizen, feature dimensionalityp, and dependence structure of responsey on featuresx.
50
3.2.4 DeepLINK
We next introduce our new framework of DeepLINK, a deep learning based statistical inference
framework using knockos. It consists of two parts: (1) an autoencoder network for the knocko
variable construction and (2) a multilayer perceptron network for feature selection with the FDR
control.
As reviewed in the last section, there are two key ingredients in the successful implementation
of the model-X knockos framework: 1) the construction of knocko variables and 2) the con-
struction of knocko statistics. Since the joint distribution of x
i
is unknown to us, the generic
algorithm proposed in [25] is no longer applicable to our settings. A remedy is to exploit the
nonlinear factor model structure in (3.5) to construct approximate knocko variables using the
estimated distribution.
In view of (3.7), ideally if the realization C and the marginal distribution f
of
i
are both
known a priori, then the knocko variables can be constructed as
~
X = C+
~
E; (3.11)
with the entries of
~
E independently drawn from distributionf
. It can be easily checked that such
~
X satises the two properties in Denition 1. SinceC andf
are generally unknown in practice,
we next discuss methods to estimate them. We will also discuss the construction of knocko
statistics.
51
3.2.4.1 Part1: autoencoderforknockosconstruction
The principal component analysis (PCA) has been a predominant method for extracting latent
factors in the existing literature [12, 71]. However, a key assumption for PCA to work well is that
x
i
depends on f
i
linearly. To address the challenge caused by the potentially nonlinear factor
model as specied in (3.5), we propose to use the deep learning model of autoencoder.
Given the design matrix X, we train an autoencoder with X as the input as well as the target
output. An illustrative plot of the autoencoder network is shown in Figure 3.1. Denote by
^
C the
corresponding autoencoder output matrix. We propose to construct the knockos data matrix as
~
X =
^
C+
~
E; (3.12)
where
~
E is a matrix with entries independently sampled from the estimated marginal distribution
f
^
of
i
. For the specic case whenf
is the Gaussian density ofN(0;
2
), we have =
2
which
can be estimated as
^
2
=
1
np
X
1in;1jp
^ e
2
ij
; (3.13)
with ^ e
ij
’s the entries of the residual matrix
^
E = X
^
C. This corresponds to the maximum
likelihood estimate with the pseudo observations
^
E. In general, parameter can be estimated by
the maximum likelihood approach or the method of moments based on
^
E.
52
Figure 3.1: The autoencoder architecture.
3.2.4.2 Part2: MLPforfeatureselection
To construct the knocko statistics, we need to rst construct the feature importance measure.
Since an important goal of our framework is to accommodate the exible nonlinear relationship
between responsey and featuresx, we propose to use the multilayer perceptron (MLP) for such
modeling purpose. The input of MLP is the augmented data matrix [X;
~
X]. Instead of directly
feeding the augmented feature vector into the MLP, we exploit the idea of DeepPINK developed
in [95] and construct a pairwise-connected lter layer with each lter representing a linear com-
bination of one original feature and its knocko counterpart. The lter layer is then fed to the
canonical MLP. The illustrative architecture of DeepPINK is shown in Figure 3.2.
To simplify the notation, we use DeepPINK with one hidden layer after the lter layer to
discuss the construction of the knocko statistics. Let z = (z
1
; ;z
p
)
T
and ~ z = (~ z
1
; ;~ z
p
)
T
be the lter weights (i.e., each lterF
j
=z
j
x
j
+ ~ z
j
~ x
j
), andW
(1)
2R
pm
; W
(2)
2R
m1
be the
53
Figure 3.2: The DeepPINK architecture.
two weight matrices connecting the lter layer with the output layer, wherem is the number of
neurons before the output layer. The knocko statistics are dened as
W
j
=Z
2
j
~
Z
2
j
; j = 1; ;p; (3.14)
whereZ
j
= z
j
w
j
;
~
Z
j
= ~ z
j
w
j
, and w = (w
1
; ;w
p
)
T
= W
(1)
W
(2)
. This can be easily gen-
eralized to cases with more than one hidden layer. Since the weights of neurons are natural
measures of their importance, intuitivelyW
j
’s dened in (3.14) are valid knocko statistics. See
[95] for more detailed discussions on the intuition ofW
j
’s in (3.14). Important features can then
be selected using the knockos inference procedure reviewed previously.
54
3.3 Simulationstudies
We rst evaluate the performance of DeepLINK on the simulated data sets. We consider various
simulation settings when 1) the factor model is linear or nonlinear, 2) the link function between
the response and the features is linear or nonlinear, and 3) the feature dimensionality is low or
high. The computational cost of DeepLINK is presented in Chapter 6 Section 6.2.1.
3.3.1 Simulationdesigns
We explore three dierent factor models: the linear factor model (3.15), additive quadratic factor
model (3.16), and logistic factor model (3.17), where fori = 1; ;n,
x
i
= f
i
+
i
; (3.15)
x
i
= [f
T
i
; (f
2
i
)
T
; f
i1
f
i2
; f
i1
f
i3
; f
i2
f
i3
]
T
+
i
; (3.16)
x
ij
=
c
j
1+exp([1;f
T
i
]
j
)
+
ij
; j = 1; ;p: (3.17)
Here,f
i
= (f
i1
;f
i2
;f
i3
)
T
is the vector of latent factors,f
2
i
is the shorthand notation for(f
2
i1
;f
2
i2
;f
2
i3
)
T
,
and
j
’s are the factor loading parameters of appropriate dimensions, andc
j
’s are some con-
stants. The f
ij
’s, c
j
’s,
ij
’s, and entries of and
i
are all sampled independently from the
standard normal distributionN(0;1).
The response vector y = (y
1
; ;y
n
)
T
is simulated from model (3.6). We investigate two
dierent forms of the link functionh: the linear design (3.18) and nonlinear design (3.19):
h(x) = x
T
; (3.18)
55
h(x) = sin(x
T
)exp(x
T
): (3.19)
To simulate the coecient vector = (
1
; ;
p
)
T
, we rst randomly choose s true signal
locations and then set the
j
at each location to beA orA with equal probability, whereA is
some positive value that varies in our simulation studies. The remainingps components of
are set to zero. It is seen that when the link functionh is linear,A measures the signal strength
with larger value corresponding to stronger signal. When h is nonlinear, however, the signal
strength may no longer be a monotone increasing function ofA. See the discussions in Chapter
6 Section 6.2.2 for an example illustrating this. In fact, to the best of our knowledge, there lacks
a widely adopted measure for the signal strength in the nonlinear model settings. Model errors
"
i
’s are also sampled independently from the standard normal distributionN(0;1).
3.3.2 Parametersettings
For all the simulation studies, the target FDRq is set to 0.2 and sample sizen is set to 1000. For
the linear link function setting, we explore two dierent feature dimensionalitiesp = 500;1500
with true signal sizes set to 10 and 30, respectively. For the nonlinear link function setting,p is
set to 50 and 500, ands is xed at 10. We vary the value ofA to investigate its impact on the
performance of DeepLINK.
3.3.3 Neuralnetworksettings
We next provide the details on the neural network architectures. We train the autoencoder net-
work using the Adam algorithm with the mean squared error (MSE) as the loss function. For the
linear factor model, the number of neurons in autoencoder’s bottleneck layer is estimated by the
56
Autoencoder
Activation Loss Optimizer Regularization
Linearh ELU MSE Adam None
Nonlinearh ELU MSE Adam None
Multilayer perceptron
Activation Loss Optimizer Regularization
Linearh ELU MSE Adam L
1
-regularization
Nonlinearh ELU MSLE Adam L
1
-regularization
Table 3.1: Neural network parameter settings.
PC
p1
algorithm proposed in [7]. It is worth noting thatPC
p1
is designed for linear factor mod-
els. For the nonlinear factor models, we set it to the true number of factorsr = 3. We conduct a
robustness study of DeepLINK to the misspecication ofr in Chapter 6 Section 6.2.3. We remark
thatr can be tuned by the cross-validation in real applications. For DeepPINK used in the feature
selection step, we use MSE as the loss function coupled with theL
1
-regularization when the link
functionh is linear. When the link function is nonlinear, we change the loss function to the mean
squared logarithmic error (MSLE) because MSE may cause explosive gradients for large response
values. In fact, MSLE also works well with other nonlinear link functions (see Chapter 6 Section
6.2.4). For a general guidance, we suggest using MSE rst and switching to MSLE when the gradi-
ents become too large during the model training. For both linear and nonlinear link functions, we
use the Adam optimizer to train the network. For both autoencoder and DeepPINK networks, we
recommend to use the exponential linear unit (ELU) as the activation function according to our
experience gained from empirical studies. Our numerical study also suggests that the learning
rate of Adam and the coecient ofL
1
-regularization need to be tuned for the best performance
of our method. The neural network settings are summarized in Table 3.1.
57
3.3.4 Simulationresults
We investigate the performance of DeepLINK in the simulation study with dierent combinations
of factor models, link functions, and dimensionalities. For each setting, we apply DeepLINK to
100 independently simulated data sets and calculate the average FDP and TDP as the empirical
FDR and power, respectively. The knocko+ threshold is used in our numerical studies because
it controls the exact FDR.
3.3.4.1 Simulationresultswiththelinearfactormodel
We compare DeepLINK with IPAD reviewed in the Introduction using the simulated data in the
linear factor model setting as specied in (3.15). Both methods successfully control the FDR
under the target level 0.2. In terms of power, IPAD slightly outperforms DeepLINK in settings
with the linear link function (Figure 3.3A, B). This is reasonable because IPAD was proposed
under the assumption of linear factor model and linear link function, and makes full use of these
parametric model structures, while DeepLINK makes no use of these model structures at all. For
the nonlinear link function (Figure 3.3C, D), however, the power of IPAD drops signicantly, while
DeepLINK still maintains decently high power. It is also interesting to observe that the power of
DeepLINK rst increases sharply to the peak and then decreases slightly asA increases, which
can be explained by the fact thatA no longer serves as a good measure of the signal strength.
These results demonstrate the versatility of DeepLINK. The capability of DeepLINK to tackle
complicated nonlinear link functions makes it more useful in real applications since it is more
robust to possible model misspecication.
58
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
linear h, p=500, s=10 A
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
linear h, p=1500, s=30 B
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
nonlinear h, p=50, s=10 C
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
nonlinear h, p=500, s=10 D
Algorithm
DeepLINK
IPAD
Measure
FDR+
Power+
Figure 3.3: Comparisons between DeepLINK and IPAD in simulation settings with the linear factor model.
Each subgure represents a combination of the link functionh and the dimensionality design. ‘FDR+’ and
‘Power+’ denote the empirical FDR and power obtained using the knocko+ threshold. The black dashed
line indicates the target FDR level. Each plot shows the change of FDR+ (solid line) and Power+ (long
dashed line) for DeepLINK (blue) and IPAD (orange) against varying signal amplitudeA.
Another interesting observation is that our simulation produces highly correlated features.
To study DeepLINK’s ability to disentangle important features from their highly correlated noise
features, we conducted additional analyses in Chapter 6 Section 6.2.5.
3.3.4.2 Simulationresultswiththenonlinearfactormodel
We now consider nonlinear factor models (3.16) and (3.17). We will drop IPAD from the compari-
son because IPAD was proposed under the assumption of linear factor model and is not expected
to perform well when the model is severely misspecied (the performance of IPAD is already very
poor when the link functionh alone takes a nonlinear form, as shown in Figure 3.3C, D). For the
additive quadratic factor model in (3.16), FDR is perfectly controlled in all settings. Meanwhile,
59
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
10 15 20 25 30 35 40 45 50
signal amplitude A
FDR+
Power+
linear h, p=500, s=10 A
0.2
0.4
0.6
0.8
1.0
0.2
0.4
0.6
0.8
1.0
10 20 30 40 50 60 70 80 90 100
signal amplitude A
FDR+
Power+
linear h, p=1500, s=30 B
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
5 10 15 20 25 30 35 40
signal amplitude A
FDR+
Power+
nonlinear h, p=50, s=10 C
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
5 10 15 20 25 30
signal amplitude A
FDR+
Power+
nonlinear h, p=500, s=10 D
Measure
FDR+
Power+
Figure 3.4: Simulation results of DeepLINK in settings with the additive quadratic factor model. Each
subgure represents a combination of the link function h and the dimensionality design. ‘FDR+’ and
‘Power+’ denote the empirical FDR and power obtained using the knocko+ threshold. The black dashed
line indicates the target FDR level. Each plot shows the change of FDR+ (solid line) and Power+ (long
dashed line) against varying signal amplitudeA.
high power is achieved with reasonably large A in the two settings with linear link function
(Figure 3.4A, B). However, the two settings with nonlinear link functions are very challenging
and the power of DeepLINK is signicantly lower (Figure 3.4C, D). For the logistic factor model
(3.17), DeepLINK controls the FDR and can achieve power close to one with a wide range of values
forA in each setting (Figure 3.5). The success of FDR control by DeepLINK in nonlinear factor
model settings provides evidence that the autoencoder network can well capture the nonlinear
factor structure and thus generates valid knockos data matrices. Similar to the linear factor
model setting, we again observe an inverted U-shape curve of the power when the link function
is nonlinear, which can be explained by the same reason as before.
60
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
linear h, p=500, s=10 A
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
linear h, p=1500, s=30 B
0.2
0.4
0.6
0.8
1.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
nonlinear h, p=50, s=10 C
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
nonlinear h, p=500, s=10 D
Measure
FDR+
Power+
Figure 3.5: Simulation results of DeepLINK in settings with the logistic factor model. Each subgure
represents a combination of the link function h and the dimensionality design. ‘FDR+’ and ‘Power+’
denote the empirical FDR and power obtained using the knocko+ threshold. The black dashed line
indicates the target FDR level. Each plot shows the change of FDR+ (solid line) and Power+ (long dashed
line) against varying signal amplitudeA.
61
3.3.4.3 RobustnessofDeepLINKtodierentactivationfunctions
We explore the eects of dierent activation functions used in the autoencoder and DeepPINK
networks on the performance of DeepLINK (Figures 3.6-3.8). In general, DeepLINK is robust
to dierent combinations of activation functions in terms of both FDR control and power. The
only exception is using the rectied linear unit (ReLU) activation in the autoencoder network.
In the linear factor model setting with linear link functionh and low feature dimensionality, the
autoencoder with ReLU activation fails to control the FDR when the signal amplitude is small
(Figure 3.6A). We also observe that the autoencoder with ReLU activation has slightly inated
FDR in some other settings (Figure 3.7A, D and Figure 3.8A). In the linear and additive quadratic
factor model settings with linear link functionh and large feature dimensionality (Figure 3.6B
and Figure 3.7B), ReLU yields lower power than other activation functions when used in the
autoencoder network. We thus recommend against using the ReLU activation in the autoencoder
network for the DeepLINK applications.
62
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
linear h, p=500, s=10 A
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
linear h, p=1500, s=30 B
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
nonlinear h, p=50, s=10 C
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
nonlinear h, p=500, s=10 D
Activations
ELU−ELU
ELU−ReLU
ELU−Tanh
ReLU−ELU
ReLU−ReLU
ReLU−Tanh
Tanh−ELU
Tanh−ReLU
Tanh−Tanh
Measure
FDR+
Power+
Figure 3.6: Linear factor model simulation results of DeepLINK using dierent activation functions. Each
subgure represents a combination of the link function h and the dimensionality design. ‘FDR+’ and
‘Power+’ denote the empirical FDR and power obtained using the knocko+ threshold. The black dashed
line indicates the target FDR level. Each plot shows the change of FDR+ (solid line) and Power+ (long
dashed line) against varying signal amplitudeA for dierent activation functions used in autoencoder and
DeepPINK (e.g., ‘ReLU-ELU’ represents using ReLU in autoencoder and ELU in DeepPINK).
63
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
10 15 20 25 30 35 40 45 50
signal amplitude A
FDR+
Power+
linear h, p=500, s=10 A
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
10 20 30 40 50 60 70 80 90 100
signal amplitude A
FDR+
Power+
linear h, p=1500, s=30 B
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
5 10 15 20 25 30 35 40
signal amplitude A
FDR+
Power+
nonlinear h, p=50, s=10 C
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
5 10 15 20 25 30
signal amplitude A
FDR+
Power+
nonlinear h, p=500, s=10 D
Activations
ELU−ELU
ELU−ReLU
ELU−Tanh
ReLU−ELU
ReLU−ReLU
ReLU−Tanh
Tanh−ELU
Tanh−ReLU
Tanh−Tanh
Measure
FDR+
Power+
Figure 3.7: Additive quadratic factor model simulation results of DeepLINK using dierent activation
functions. Each subgure represents a combination of the link functionh and the dimensionality design.
‘FDR+’ and ‘Power+’ denote the empirical FDR and power obtained using the knocko+ threshold. The
black dashed line indicates the target FDR level. Each plot shows the change of FDR+ (solid line) and
Power+ (long dashed line) against varying signal amplitudeA for dierent activation functions used in
autoencoder and DeepPINK (e.g., ‘ReLU-ELU’ represents using ReLU in autoencoder and ELU in Deep-
PINK).
64
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
linear h, p=500, s=10 A
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
linear h, p=1500, s=30 B
0.2
0.4
0.6
0.8
1.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
nonlinear h, p=50, s=10 C
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
nonlinear h, p=500, s=10 D
Activations
ELU−ELU
ELU−ReLU
ELU−Tanh
ReLU−ELU
ReLU−ReLU
ReLU−Tanh
Tanh−ELU
Tanh−ReLU
Tanh−Tanh
Measure
FDR+
Power+
Figure 3.8: Logistic factor model simulation results of DeepLINK using dierent activation functions.
Each subgure represents a combination of the link functionh and the dimensionality design. ‘FDR+’ and
‘Power+’ denote the empirical FDR and power obtained using the knocko+ threshold. The black dashed
line indicates the target FDR level. Each plot shows the change of FDR+ (solid line) and Power+ (long
dashed line) against varying signal amplitudeA for dierent activation functions used in autoencoder and
DeepPINK (e.g., ‘ReLU-ELU’ represents using ReLU in autoencoder and ELU in DeepPINK).
65
3.4 Realdataapplications
We further apply DeepLINK to three real data applications. All predictors in the three data sets
below were standardized to unit variance before the analysis. In all real data applications, the
error distribution f
was tted assuming Gaussian distribution. The robustness of DeepLINK
with respect to misspecied error distribution is investigated in Chapter 6 Section 6.2.6. We also
compare the performance of DeepLINK with that of random forests [23, 28] in Chapter 6 Section
6.2.7.
3.4.1 Applicationtoamicrobiomedataset
The microbiome data set is publicly available in a colorectal cancer (CRC) related metagenomic
study in Zeller et al. [170]. The data set contains the whole-genome sequenced (WGS) DNAs from
stool samples of 184 individuals (91 CRC patients and 93 healthy controls). We aligned the DNA
sequences against the NCBI microbial reference genome database and constructed an abundance
matrix according to the alignment results. The matrix consisted of 184 rows and 434 columns
with each entry representing the abundance of a microbial species in the corresponding sample.
We randomly split the data set into the training set (80%) and the test set (20%), and implemented
the DeepLINK on the training part. The trained model was then applied to the test data and the
classication error rate was calculated. The random splitting procedure was repeated100 times.
However, the mean misclassication error on the test data was consistently around 0:5 under
various parameter settings of DeepLINK, suggesting that the simple application of DeepLINK
could fail. We also tried some other popular classication methods such as the Lasso and simple
66
deep neural network without the special architecture as in DeepLINK, all of which gave us error
rates between0:4 and0:5, similar to the random guessing.
Consequently, we performed a variable screening step rst and then applied the DeepLINK
method on the screened data set. Considering the relatively small sample size and to reduce
the chance of including noise confounders, we identied an independent microbiome data set
for screening. This independent data set was also publicly available and was collected for CRC-
microbiome association analysis [169]. It contained 128 WGS DNA samples with 74 CRC patients
and 54 controls. Since these two microbiome data sets had dierent numbers of features, we
constrained ourselves to the 274 common features in our analysis. There are multiple options
for the screening step [42, 45, 46]. We adopted one of state-of-the-art methods which was based
on the distance correlation and ranked these 274 variables by the values of the asymptotic test
statistics [56, 147]. We randomly split the Zeller et al. [170] CRC microbiome data into the training
and test sets at the ratio of 80% to 20%. The justication for the training/testing set split ratio
is given in Chapter 6 Section 6.2.8. Then, we trained the DeepLINK model using the top ranked
variables with the training data. To evaluate the impact of the number of retained variables
after the screening step, denoted asd, we examined multiple values ofd. Finally, we applied the
trained DeepLINK model onto the training and test data sets and calculated the corresponding
classication error rates, respectively. The whole process was repeated 100 times. We set the
number of neurons in the bottleneck layer of the autoencoder to 3. We chose the other model
parameters by cross-validation. The MLP in DeepPINK had only one hidden layer withd neurons.
The dropout rate was 0:4, whileL
1
- andL
2
-regularization weights were both 0:001. The mean
and standard error of the misclassication error on the training and test data are given in Table
3.2. We can see that the mean test error was the lowest when30 variables were retained after the
67
Training Test
d = 20 0.172 (0.003) 0.319 (0.008)
d = 30 0.104 (0.004) 0.306 (0.007)
d = 40 0.019 (0.003) 0.328 (0.009)
d = 50 0.008 (0.002) 0.319 (0.008)
d = 100 0.000 (0.000) 0.385 (0.012)
Table 3.2: Mean and standard error (in parenthesis) of the misclassication error rates for the microbiome
data set.
screening step. However, asd increased to as large as100, the mean test error became relatively
high (0:385), indicating that when DeepLINK lost the help of the screening step in eliminating
noise variables, its performance could be compromised.
The top20 most selected microbial species along with their selection frequencies by DeepLINK
coupled with screening are presented in Table 3.3. We only present the results ford = 30 when
the mean classication error rate was the lowest. Many of these selected species were reported to
have important associations with CRC in the previous literature. For example,Parvimonasmicra
and Akkermansia muciniphila were among the four-bacteria biomarker panel of CRC identied
by Osman et al. [112]. In addition, Parvimonas micra’s enrichment in CRC was demonstrated in
a number of previous studies [35, 40, 93, 169] and Purcell et al. [123] also reported its enrichment
in one of the CRC subtypes. Other important CRC related species that were also reported in
previous studies include Dialister pneumosintes [1] and Bacteroides fragilis [20, 64, 159, 163].
68
Species Frequency
Dialister pneumosintes 99
Eikenella corrodens 96
Staphylococcus haemolyticus 75
Intestinimonas butyriciproducens 75
Bacteroides fragilis 70
Latilactobacillus sakei 58
Clostridium bornimense 58
Parvimonas micra 51
Akkermansia muciniphila 49
Gemella sp. oral taxon 928 48
Clostridium chauvoei 45
Corynebacterium sp. NML98-0116 43
Prevotella intermedia 34
Streptococcus sp. A12 31
Ndongobacter massiliensis 30
Ruminococcus bicirculans 29
Lactococcus garvieae 28
Fusobacterium varium 26
Anaerococcus mediterraneensis 26
Desulfovibrio faireldensis 23
Table 3.3: Top20 most selected microbial species whend=30 for the microbiome data set.
69
3.4.2 Applicationtoamurinesingle-cellRNA-seqdataset
The murine single-cell RNA-seq (scRNA-seq) data set is publicly available from Lane et al. [85]
aiming to investigate the eect of lipopolysaccharides (LPS) stimulated nuclear factorB (NF-
B) on gene expression. We rst preprocessed the data following the suggestions in [79]. We
ltered out cells either with mapping rate below 20% or with nonzero expression proportion
below 5%. We also ltered out genes expressed in less than 5% of total cells. The preprocessed
data matrix contained the expression, in the form of Transcripts Per Million (TPM), of 13,777
genes from 570 cells. We were interested in dierential gene expression between cells with two
conditions: unstimulated (202 cells) and stimulated with LPS after 150 minutes (368 cells). Due
to the high dimensionality, it was computationally infeasible to implement the DeepLINK on this
data set directly, even with powerful servers. The success of screening in the previous microbiome
example motivated us to apply a screening step rst to reduce the dimensionality. Since this
data set had a relatively larger sample size than the microbiome data set, we randomly split the
data set into three parts for screening (50%), training (40%), and test (10%), instead of using an
independent data set for screening. We used the same model architecture and parameters tuned
from the previous microbiome analysis. The mean and standard error of the misclassication
error over100 repetitions on the training and test sets, respectively, for this scRNA-seq data set
are provided in Table 3.4. We observe that the mean misclassication error on the test data can
get as low as0:010 whend = 200.
We further looked at the top 20 most selected genes by DeepLINK equipped with screening
ford = 200 as presented in Table 3.5. Many of the selected genes were also reported as signif-
icant features in the original study [85] including Sqstm1, Sdc4, Abcg1, Rab31, Gmnn, Angpt2,
70
Training Test
d = 20 0.000 (0.000) 0.021 (0.003)
d = 30 0.000 (0.000) 0.018 (0.002)
d = 40 0.000 (0.000) 0.012 (0.002)
d = 50 0.000 (0.000) 0.014 (0.002)
d = 100 0.000 (0.000) 0.016 (0.002)
d = 200 0.000 (0.000) 0.010 (0.001)
d = 300 0.000 (0.000) 0.013 (0.002)
d = 400 0.000 (0.000) 0.012 (0.002)
d = 500 0.000 (0.000) 0.015 (0.002)
Table 3.4: Mean and standard error (in parenthesis) of the misclassication error rates for the murine
single-cell RNA-seq data set.
Hsp90aa1, Tnfaip2, Clec4e, Gpx1, Sod2, and Fas. Gene Ontology (GO) analysis with domain Bio-
logical Process (BP) indicates that the up-regulation of genes in LPS stimulated cells are related
to NF-B signaling (Sqstm1) and LPS response (Sod2). Also, Hsp90aa1 can bind LPS and mediate
LPS-induced inammatory response according to Uniprot [33], which may be related to its up-
regulation in LPS stimulated cells.
Gene Frequency Gene Frequency
Sqstm1 73 Gm26825 61
Cdkn1a 66 Hsp90aa1 60
Sdc4 65 Tnfaip2 60
Abcg1 64 Clec4e 58
Rab31 64 Gpx1 58
Gm28875 64 Sod2 57
Gmnn 63 Srsf5 55
Angpt2 63 Fas 51
Ehd4 61 Get1 50
Dnaja1 61 Hsp90ab1 50
Table 3.5: Top20 most selected genes whend=200 for the murine single-cell RNA-seq data set.
71
3.4.3 Applicationtoahumansingle-cellRNA-seqdataset
Another single-cell RNA-seq (scRNA-seq) data set that we investigated is from a human glioblas-
toma study led by Darmanis et al. [36]. We were interested in dierential gene expression be-
tween Neoplastic cells in the tumor core and the surrounding periphery. We used the same criteria
as in the murine scRNA-seq study to preprocess the data, which resulted in a data set with TPMs
of 23,257 genes from 632 cells (580 in the tumor core and 52 in the periphery). Again, due to the
high dimensionality, we rst conducted dimensionality reduction using the distance correlation
screening an then applied DeepLINK. The model architecture and parameters were the same as
those in the last two real data studies. We repeated the experiment 100 times and present the
mean and standard error of the misclassication error on the training and test data, respectively,
in Table 3.6. We see that the mean misclassication errors on the test data achieve the smallest
value whend = 200 and then become more or less stable.
Training Test
d = 20 0.006 (0.001) 0.072 (0.003)
d = 30 0.002 (0.000) 0.068 (0.003)
d = 40 0.001 (0.000) 0.064 (0.003)
d = 50 0.001 (0.000) 0.060 (0.003)
d = 100 0.001 (0.000) 0.059 (0.003)
d = 200 0.000 (0.000) 0.046 (0.003)
d = 300 0.000 (0.000) 0.056 (0.003)
d = 400 0.000 (0.000) 0.049 (0.003)
d = 500 0.000 (0.000) 0.050 (0.003)
Table 3.6: Mean and standard error (in parenthesis) of the misclassication error for the human single-cell
RNA-seq data set.
The top20 most selected genes by DeepLINK equipped with screening ford = 200 are shown
in Table 3.7. We next examined the biological meaning of these selected genes. As pointed out
72
in the original study [36], down-regulation of genes like ATP1A2 and PRODH in the periphery
might be related to their functions in the interstitial matrix invasion. We also observed that HIF3A
was down-regulated in the tumor core, which was probably associated with the hypoxia in core.
Previous study also demonstrated that HIF3A was a dominant-negative regulator of HIF-1 and
was thus down-regulated in a hypoxic environment [99]. Gene Ontology (GO) analysis with do-
main Biological Process (BP) indicates that some genes up-regulated in periphery have functions
related to cell migration from periphery to core. For instance, HES6 has GO term nervous system
development, which is highly relevant to tumor cell migration. IGSF21 and CNTN1 have GO
term cell-cell adhesion, which is a central part in cell migration. ALDOC has GO term glycolytic
process, which produces small amount of ATP and may help the cell migration as an energy
provider. SERPINE2 has GO term regulation of cell migration. Also, genes such as SPARCL1,
NPL, and ST6GALNAC3 are involved in various metabolic processes.
Gene Frequency Gene Frequency
ATP1A2 73 ANKRD20A9P 45
PRODH 73 HIF3A 44
HES6 62 NPL 44
IGSF21 57 AC131097.1 44
PPM1K 56 FAM240C 41
ALDOC 55 ST6GALNAC3 40
SPARCL1 55 MMP28 40
SERPINE2 52 CNTN1 39
RNPC3 52 MTRNR2L1 39
LOC102724788 47 EFHD1 38
Table 3.7: Top20 most selected genes whend=200 for the human single-cell RNA-seq data set.
73
3.5 Discussion
In this study, we have developed a new high-dimensional inference framework via knockos,
DeepLINK, to enhance the interpretability and reproducibility of deep learning models. DeepLINK
generates the knocko variables under the possibly nonlinear factor model assumption using an
autoencoder network, and then ts the regression/classication model using the DeepPINK net-
work. We have used various simulated data sets to numerically demonstrate that DeepLINK can
achieve successful FDR control with attractive power in selecting features that are truly impor-
tant for the response of interest. We have also showcased the practical utility and performance
of DeepLINK on three real data applications.
When comparing the prediction performance of DeepLINK with random forests in Chapter
6 Section 6.2.7, we noticed that random forests can outperform DeepLINK in terms of prediction
for the microbiome data set. This is likely caused by the distinctive prediction power of MLP and
random forests. We remark that the MLP in the second step of DeepLINK can be replaced with
random forests if one suspects that the latter can outperform in prediction. We also emphasize
that the main purpose of DeepLINK is feature selection with controlled error rate, and to achieve
the goal of FDR control in feature selection, the prediction power may be slightly compromised
in some applications.
There are ve potential directions for future investigations. First, in the real data applica-
tions, we consider binary outcomes. DeepLINK can be easily extended to the case of multiple
classes if we replace the loss function in the second step from binary cross-entropy to multi-
class cross-entropy. Second, the knocko variable generating process of DeepLINK simulates the
idiosyncratic matrix E outside of the autoencoder network with non-deep-learning techniques.
74
Designing a new deep neural network which can automate the knocko variable generating pro-
cess may increase its eciency and accuracy. Third, we currently have two separate networks
in DeepLINK: the knocko variable generating network of autoencoder, and the model tting
and inference network of DeepPINK. We would like to integrate them into one single network
for a joint optimization so that the whole process can be fully automated. Such a feature can
make DeepLINK even more user friendly. Fourth, heterogeneity in the samples is a practically
important issue. It is possible that the samples consist of multiple subpopulations and they have
dierent true features. It is likely that DeepLINK can be extended to accommodate the hetero-
geneity. The key is to construct valid knocko variables reecting the subpopulation information.
One naive method is to construct knocko variables for each subpopulation, and then combine
them appropriately to form valid knocko variables for the overall population. If this can be
achieved, the second step of feature selection using MLP can be applied without modication.
Finally, we would like to provide theoretical justications on DeepLINK in terms of both FDR
control and power. This can in turn guide the training of the underlying networks and further
improve the interpretation of our deep learning inference method.
75
Chapter4
Ametagenomicanalysistoinvestigatethegutvirome
changeofpediatriculcerativecolitispatientswith
successfulfecalmicrobiotatransplantation
In this chapter, we present a metagenomic analysis to investigate the gut virome change of pe-
diatric ulcerative colitis (UC) patients with successful fecal microbiota transplantation (FMT).
Ulcerative colitis is a chronic inammatory disease of the colon that carries a signicant disease
burden in children. Therefore, new therapeutic approaches are being explored to help children
living with this disease. Fecal microbiota transplantation (FMT) has been successful in some chil-
dren with ulcerative colitis. However, the mechanism of its therapeutic eect in this patient pop-
ulation is not well understood. To analyze changes in gut viral proles after successful FMT, we
chracterized both known and novel viral abundances from the shotgun metagenomic sequenced
reads and found that virome proles in responders after FMT shifted towards those of the donors.
76
4.1 Introduction
Ulcerative colitis (UC) is characterized by chronic inammation of the colon. It is an important
pediatric disease, since as many as 20%–25% of all casesa are present during childhood [155],
and the incidence of the disease is increasing worldwide [104, 110, 132]. Immunomodulator and
biologic therapies have resulted in stable management for some patients, but many children still
develop severe disease, continue to require colectomy, and develop colon cancer [88, 110, 155].
Therefore, there is a great need for more eective and safer therapies.
The enteric microbiota is now accepted as an important factor in the pathogenesis of inam-
matory bowel disease (IBD) [130], with data from multiple studies demonstrating a correlation
of microbial community composition with the occurrence of IBD [9, 81]. The gut microbiota has
been suggested as a therapeutic target for IBD, including ulcerative colitis [139]. Fecal microbiota
transplantation (FMT) has been established as a useful tool for manipulating the gut microbiome
and is an eective treatment for recurrent Clostridium dicile infection [161]. FMT therapy has
been considered for ulcerative colitis patients, with the rst reported FMT in the context of UC
described in 1989 [22]. Since then, FMT for UC has been studied in at least three randomized
placebo-controlled trials in adults and has shown ecacy in inducing remission [103, 114, 128].
Systematic reviews and meta-analyses have shown that FMT is a safe treatment option that is
eective in some patients [32, 141].
While there have been several controlled studies investigating the ecacy of FMT for adults
with UC, relatively few trials have taken place testing FMT in children with UC. These small,
uncontrolled studies and case reports [162] have had mixed results: one report found no benet of
FMT in a trial with four pediatric UC patients [146], while another study of three patients showed
77
a signicant benet of FMT [75]. To our knowledge, the trial described by Kunde et al. [82] is
the largest study to date investigating FMT in children with ulcerative colitis. In this open-label,
uncontrolled trial, six out of nine patients with mild-to-moderate ulcerative colitis showed clinical
improvement four weeks after FMT, with three patients achieving clinical remission. In order
to understand the viral changes associated with successful FMT in pediatric ulcerative colitis
patients, we utilized a metagenomic approach to characterize the gut virome proles from four
patients in the Kunde et al. study.
4.2 MaterialsandMethods
4.2.1 Samplecollectionandshotgunmetagenomicsequencing
Patients were recruited as described by Kunde et al. [82]. Briey, subjects between the ages of 7
and 20 with mild-tomoderate ulcerative colitis were enrolled between April and December 2012
in a single-center pilot study conducted at the outpatient gastroenterology facility at Helen De-
Vos Children’s Hospital in Grand Rapids, Michigan. Fecal samples of the patients were collected
before FMT (Baseline) and four weeks after FMT (PostFMT). Fecal samples of the healthy adult
donors (selected by each subject from their family members or close friends) were collected within
6h prior to FMT. Samples were analyzed from seven of the nine patients that completed the Kunde
et al. study; samples were not available from two subjects. DNA was extracted from fecal samples
using the Qiagen DNeasy DNA Extraction Kit (Maryland, USA) and paired-end sequenced using
the Illumina NextSeq500 High Output v2 owcell on an Illumina NextSeq500 System. Reads
from all samples including controls were preprocessed and quality ltered using trim galore
78
(https://www.bioinformatics.babraham.ac.uk/projects/trim galore/). Hostderived reads were re-
moved using KneadData (https://bitbucket.org/biobakery/kneaddata/overview). Subjects were
classied as responders (n = 4) or non-responders (n = 3) based on their pediatric ulcerative col-
itis activity index (PUCAI) score 4 weeks after FMT (in this cohort the responders had a PUCAI
15). Only the gut virome of responders were analyzed in this study.
4.2.2 Knownviralprolecharacterization
To calculate the abundance of known viruses in samples from the responders,we mapped shotgun
metagenomic reads against 8403 virus reference genomes (downloaded from NCBI in June 2017)
using BWA-MEM [91] with the default parameters. The relative abundance of a virus in a sample
was calculated as the fraction of reads mapped to the virus genome normalized by the genome
length.
4.2.3 Novelviralprolecharacterization
It has been estimated that the known viruses only account for 15% of virus diversity in the human
gut [107]. To comprehensively investigate the relationships among the viral community in the
FMT responders, we implemented reference-free de novo virus analysis as follows. We rst de
novo assembled shotgun metagenomic reads using metaSPAdes [108] with default parameters
to obtain sequence contigs of length at least 1kbp. The metaSPAdes cross-assembly resulted
in 587,898 contigs with lengths ranging from 55 to 838,676 bp. After removing those contigs
with length shorter than 1000, there were 63,372 contigs left. Then VirFinder [126] was used to
identify contigs that potentially come from viruses. VirFinder analysis yielded 37 contigs at the
79
0.05 q-value threshold and 52 contigs at the 0.1 threshold level. Finally, we clustered the predicted
viral contigs into 14 bins using the software COCACOLA [94]. Relative abundance proles were
calculated for each sample by mapping the metagenomic reads to the viral contig bins.
4.2.4 Principalcoordinatesanalysisofviralproles
Principal coordinates analysis (PCoA) is aimed at projecting high dimension data points onto
two-dimension space while maintaining the relative distances between data points. PCoA makes
it easy to detect patterns within high dimension data sets. We calculated the pairwise Manhattan
distance between the viral relative abundance proles of the dierent samples and then performed
PCoA using the “cmdscale” function of the “stats” package in R.
4.3 Results
4.3.1 Virome proles shift towards those of the donors in responders
afterFMT
Known and novel viral abundance proles were generated from metagenomic sequencing data of
the responders as well as their corresponding donors. PCoA plots based on Manhattan distances
between known and novel viral abundance proles are shown in Figure 4.1 and 4.2, respectively.
It’s seen in both gures that “PostFMT” points are closer to “Donor” points than “Baseline”. This
result shows that patient samples shifted toward donor proles after FMT, indicating the success-
ful transfer of viromes.
80
Figure 4.1: PCoA of known viral abundances based on metagenomic sequence data from the four respon-
ders. The PCoA is based on Manhattan distances between the samples. The circles, squares, triangles, and
diamonds represent Responder 1, Responder 2, Responder 3, and Responder 4, respectively. Red, blue, and
green represent Baseline, Donor, and PostFMT samples, respectively.
81
Figure 4.2: PCoA of novel virus contig bin abundances based on metagenomic sequence data from the
four responders. The PCoA is based on Manhattan distances between the samples. The circles, squares,
triangles, and diamonds represent Responder 1, Responder 2, Responder 3, and Responder 4, respectively.
Red, blue, and green represent Baseline, Donor, and PostFMT samples, respectively. Viral contig bins were
selected based on 0.1 q-value threshold in VirFinder.
82
4.4 Discussion
Because the Kunde et al. pilot study showed ecacy in some but not all patients, we sought to
understand the viral features underlying successful FMT for pediatric ulcerative colitis. We chrac-
terized both known and novel viral abundance proles from the shotgun metagenomic samples
of patients before and after FMT as well as their corresponding donors. We used PCoA with
Manhattan distances between the viral abundances proles of dierent samples to show that the
post-FMT composition became closer to the donor samples compared to baseline, which is an
indication of successful transplantation of viromes. Larger trials utilizing a similar approach will
be required to validate the role of specic viruses in mediating the therapeutic eect of FMT,
and to identify factors predictive of whether a given patient is likely to respond to FMT. Further
study of FMT in pediatric ulcerative colitis may allow for the eventual development of targeted
therapies involving alteration of specic components of the gut microbiota instead of FMT itself.
83
Chapter5
Conclusionsandfuturework
In this dissertation, we present our works to tackle several important research problems in human
microbiome analysis using computational algorithms and statistical modelings. We have devel-
oped MicroPro that chracterizes human microbial proles using both mapped and unmapped
reads and signicantly increases the microbiome based disease prediction accuracy compared to
reference based methods. We have also developed DeepLINK, a deep learning based feature se-
lection framework using knockos, that has important applications in the detection of disease
associated microbes from next generation sequencing reads. Last but not least, we have per-
formed a metagenomic analysis to investigate the gut virome change of four pediatric ulcerative
colitis patients with successful fecal microbiota transplantation and have found that the virome
proles in UC patients after successful FMT shifted towards those of the donors. These research
eorts expand our understanding of human microbiome’s important roles in diseases and have
great potentials in a broad range of future applications. We next discuss possible future directions
of these existing works.
84
5.1 FutureworkfortheMicroProproject
In Chapter 2, we present MicroPro, a metagenomic data analysis pipeline that extracts microbial
abundance proles using all reads from known and unknown microbial organisms. We apply
MicroPro to analyzing four publicly available metagenomic data sets associated with colorectal
cancer, type 2 diabetes, and liver cirrhosis and demonstrate that including reads from unknown
organisms signicantly increases the prediction accuracy of the disease status for three of the
four data sets compared with using mapped reads alone. We identify new microbial organisms
associated with these diseases and show viruses play important prediction roles in colorectal
cancer and liver cirrhosis, but not in type 2 diabetes.
In the MicroPro framework, the known and unknown abundance proles are combined ac-
cording to the proportions of mapped and unmapped reads respectively. This combination method
is in accordance with the biological meaning, but not necessarily benets the prediction. There-
fore, one possible future direction to extend MicroPro is exploring other ways to incorporate the
two proles in the prediction model, such as assigning weights to each prole based on their
prediction power or using model stacking instead of feature stacking. Gao et al. [57] developed
a Leave-one-sample-out (LOSO) model stacking method that trained two random forests mod-
els using known and unknown abundance proles separately and ensembled two models with
weights corresponding to each model’s prediction power. Although the LOSO method didn’t
signicantly improve the prediction accuracy, further exploration of this path is promising and
worth trying.
Another possible future direction is the prediction across dierent data sets. MicroPro only
characterize the microbial prole within one data set. However, the microbial proles between
85
two dierent data sets are not directly comparable because the data sets are collected by dierent
research groups and there exist many confounding factors such as sampling methods, sequenc-
ing platforms, experiment protocols, and so on. In the MicroPro project [173], we preformed the
prediction between two type 2 diabetes data sets, but the prediction accuracy dropped signi-
cantly compared to that in the within data set setting. Reliable predictions across dierent data
sets remain an open problem and more research eorts should be put into it.
5.2 FutureworkfortheDeepLINKproject
In Chapter 3, we present DeepLINK, a deep learning based knockos inference framework that
guarantees the False Discovery Rate (FDR) control in high-dimensional settings. DeepLINK is
applicable to a broad class of covariate distributions described by the possibly nonlinear latent
factor models and has a wide range of real data applications including sociology, economics, ge-
nomics, and many others. We demonstrate DeepLINK’s superior performance in both simulation
studies and real data analyses.
There are many future directions for this project. First of all, theoretical justications on
DeepLINK in terms of both FDR control and power will be greatly desired. This can in turn
guide the training of the underlying networks and further improve the interpretation of our deep
learning inference method. Second, we currently have two separate networks in DeepLINK: the
autoencoder network to generate knocko variables and the DeepPINK network to t the model
and select features. One potential future direction is to integrate them into one single network
for a joint optimization so that the whole process can be fully automated. Third, heterogeneity
in the samples is a practically important issue. It is possible that the samples consist of multiple
86
subpopulations and they have dierent true features. It is likely that DeepLINK can be extended
to accommodate the heterogeneity. The key is to construct valid knocko variables reecting the
subpopulation information. One naive method is to construct knocko variables for each sub-
population, and then combine them appropriately to form valid knocko variables for the overall
population. If this can be achieved, the second step of feature selection using MLP can be applied
without modication. Finally, we would like to extend the use of DeepLINK to time series data
that is extremely common in real world. In the DeepLINK project [172], the model setting re-
quires the observations to be independent. DeepLINK may fail to control the FDR when applied
to time series data since both the autoencoder and DeepPINK do not consider the dependence
structure between observations. One potential idea is to change both networks to their recurrent
neural network (RNN) variants. We have made some explorations along this path and our pre-
liminary results show that the RNN version of DeepLINK can achieve higher power compared to
normal DeepLINK in time series settings.
5.3 FutureworkfortheFMTproject
In Chapter 4, we present a metagenomic analysis to investigate the gut virome change of pedi-
atric UC patients with successful FMT. We chracterized both known and novel viral abundance
proles from the shotgun metagenomic samples of patients before and after FMT as well as their
corresponding donors. We used PCoA with Manhattan distances between the viral abundances
proles of dierent samples to show that the post-FMT composition became closer to the donor
samples compared to the baseline, indicating successful transplantation of viromes.
87
One limitation of the study is that there are only four samples. Therefore, a future analysis
with a larger sample size will greatly benet the understanding of the eectiveness of FMT in
pediatric UC patients. Also, we only analyzed patients with successful FMT. A comparative study
considering both responders and non-responders will help to validate the role of specic viruses
in mediating the therapeutic eect of FMT, and to identify factors predictive of whether a given
patient is likely to respond to FMT.
88
Chapter6
Supplementarymaterials
6.1 SupplementarymaterialsforChapter2
Figure 6.1: Cumulative probability of alpha diversity of known prole. SubplotA uses all the microbial
abundances while SubplotB only uses viral abundances. For both plots, only known abundances are used
for the calculation. Shannon index is set as the diversity index. WMW test p-values between the cases and
the controls are provided.
89
Figure 6.2: Cumulative probability of alpha diversity of unknown prole. SubplotA uses all the microbial
abundances while SubplotB only uses viral abundances. For both plots, only unknown abundances are
used for the calculation. Shannon index is set as the diversity index. WMW test p-values between the
cases and the controls are provided.
6.2 SupplementarymaterialsforChapter3
6.2.1 ComputationaltimeandmemoryanalysisofDeepLINK
We perform the run time and memory analysis of DeepLINK with varying sample size and feature
dimensionality:n = 1000;2000;3000 andp = 1000;2000;3000. The analysis is conducted under
the linear factor model design with linear or nonlinear functions specied in (3.18) and (3.19). The
signal amplitudeA = 10 and the number of true signalss = 50. The activation functions used in
autoencoder and DeepPINK are both ELU. All the computation was done on a single GPU with
model NVIDIA Tesla K40. Table 6.1 povides thaverage run time and max random-access memory
(RAM) (in the parenthesis) of DeepLINK over 100 repetitions. The computational resources used
by DeepLINK are almost the same for linear and nonlinear link functions. Also, the computational
90
time and max RAM grow linearly with respect to the sample size and feature dimensionality.
linear link p = 1000 p = 2000 p = 3000
n = 1000 15.58s (1.33G) 30.23s (1.52G) 57.88s (1.64G)
n = 2000 29.33s (1.50G) 38.69s (1.73G) 62.09s (1.90G)
n = 3000 39.48s (1.60G) 50.24s (1.94G) 77.56s (2.25G)
nonlinear link p = 1000 p = 2000 p = 3000
n = 1000 14.37s (1.35G) 29.37s (1.51G) 50.69s (1.62G)
n = 2000 31.95s (1.51G) 56.40s (1.75G) 67.48s (1.90G)
n = 3000 37.26s (1.59G) 51.94s (1.88G) 72.95s (2.27G)
Table 6.1: Computational time and memory used by DeepLINK under the linear factor model. p is the
number of features andn is the number of samples.
6.2.2 Signalstrengthmeasureinnonlinearmodels
We provide a simple example to illustrate the point that with the nonlinear link function as spec-
ied in (3.19) of the main text, A no longer serves as a good measure of variable importance.
To understand this, consider the same nonlinear model as we have in the main text with feature
dimensionalityp = 1 andA> 0:
h(x
1
;A) = sin(Ax
1
)exp(Ax
1
): (6.1)
Note that in the linear regression model withh(x;A) = Ax, the importance of variablex can
be measured asjh
T
(x
1
;A)j =A. Motivated by this, we examine the same quantity under model
(6.1), which gives rise to
jh
T
(x
1
;A)j =Ajcos(Ax
1
)+sin(Ax
1
)jexp(Ax
1
):
91
It is seen that the above quantity depends on the specic covariate valuex
1
. In fact, for a large
range ofx
1
values,jh
T
(x
1
;A)j as a function ofA is far from monotone increasing. See Figure 6.3
for an illustrative plot ofjh
T
(x
1
;A)j as a function ofA whenx = 0:1, 0.5, 2, and -1.
Figure 6.3: Plot ofjh
T
(x
1
;A)j as a function ofA whenx=0:1, 0.5, 2, and -1.
6.2.3 RobustnessofDeepLINKtothenumberofneuronsinautoencoder’s
bottlenecklayer
In each simulation setting, we vary ^ r, the number of neurons in autoencoder’s bottleneck layer,
from 1 to 5, while the true number of factorsr = 3. The parameterA is xed as 10 for linear
and logistic factor models and 20 for additive quadratic factor model. In all settings, DeepLINK is
robust to overestimated ^ r’s (Figures 6.4-6.6). Whenr is underestimated, the FDR is controlled in
linear factor and logistic factor settings but can be inated in the additive quadratic factor model
setting (Figure 6.5). This indicates that DeepLINK is more robust to overestimation than underes-
timation. The result is reasonable because whenr is severely underestimated, a large amount of
92
signal (factors) is missing and the corresponding knocko variables cannot be constructed accu-
rately. On the other hand, slightly overestimatedr only increases estimation variance marginally.
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
1 2 3 4 5
number of bottleneck neurons
FDR+
Power+
linear h, p=500, s=10 A
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
1 2 3 4 5
number of bottleneck neurons
FDR+
Power+
linear h, p=1500, s=30 B
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
1 2 3 4 5
number of bottleneck neurons
FDR+
Power+
nonlinear h, p=50, s=10 C
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
1 2 3 4 5
number of bottleneck neurons
FDR+
Power+
nonlinear h, p=500, s=10 D
Measure
FDR+
Power+
Figure 6.4: Linear factor model simulation results of DeepLINK with dierent choice on the number of
bottleneck neurons^ r. Subgures A-D represent dierent combinations of the link functionh, the number
of features p, and the number of true signals s. ‘FDR+’ and ‘Power+’ denote the empirical FDR and
power obtained using the knocko+ threshold. The black dashed line indicates the target FDR level. Each
plot shows the change of FDR+ (solid line) and Power+ (long dashed line) against varying number of
bottleneck neurons, ^ r.
93
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
1 2 3 4 5
number of bottleneck neurons
FDR+
Power+
linear h, p=500, s=10 A
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
1 2 3 4 5
number of bottleneck neurons
FDR+
Power+
linear h, p=1500, s=30 B
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
1 2 3 4 5
number of bottleneck neurons
FDR+
Power+
nonlinear h, p=50, s=10 C
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
1 2 3 4 5
number of bottleneck neurons
FDR+
Power+
nonlinear h, p=500, s=10 D
Measure
FDR+
Power+
Figure 6.5: Additive quadratic factor model simulation results of DeepLINK with dierent choice
on the number of bottleneck neurons ^ r. Subgures A-D represent dierent combinations of the
link functionh, the number of featuresp, and the number of true signalss. ‘FDR+’ and ‘Power+’
denote the empirical FDR and power obtained using the knocko+ threshold. The black dashed
line indicates the target FDR level. Each plot shows the change of FDR+ (solid line) and Power+
(long dashed line) against varying number of bottleneck neurons, ^ r.
94
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
1 2 3 4 5
number of bottleneck neurons
FDR+
Power+
linear h, p=500, s=10 A
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
1 2 3 4 5
number of bottleneck neurons
FDR+
Power+
linear h, p=1500, s=30 B
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
1 2 3 4 5
number of bottleneck neurons
FDR+
Power+
nonlinear h, p=50, s=10 C
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
1 2 3 4 5
number of bottleneck neurons
FDR+
Power+
nonlinear h, p=500, s=10 D
Measure
FDR+
Power+
Figure 6.6: Logistic factor model simulation results of DeepLINK with dierent choice on the
number of bottleneck neurons ^ r. Subgures A-D represent dierent combinations of the link
functionh, the number of featuresp, and the number of true signalss. ‘FDR+’ and ‘Power+’
denote the empirical FDR and power obtained using the knocko+ threshold. The black dashed
line indicates the target FDR level. Each plot shows the change of FDR+ (solid line) and Power+
(long dashed line) against varying number of bottleneck neurons, ^ r.
95
6.2.4 MSLEworkswellwithotherchoiceofnonlinearfunctions
In our simulation studies, we coupled MSLE with the nonlinear factor model (3.19) which has
an exponential link function. We demonstrate now using a dierent simulation example that
MSLE can work well for other nonlinear factor models whose link functions are dierent from
the exponential function. Specically, we simulate data using the following cubic link function
h(x) = (x
T
)
3
(6.2)
and apply DeepLINK with the MSLE loss function. Simulation results are shown in Figure 6.7.
DeepLINK successfully controls the FDR and achieves high power whenA is suciently large
for the linear and logistic factor models. The setting with the additive quadratic factor model is
very challenging and the power of DeepLINK drops dramatically compared to other two factor
models; but the FDR is generally controlled (taking into account the Monte Carlo error). This
demonstrates that the MSLE loss function can be applied to a wide range of nonlinear link func-
tions that have the potential issue of exploding gradients.
96
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
linear, p=50, s=10 A
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
linear, p=500, s=10 B
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
10 20 30 40 50 60 70
signal amplitude A
FDR+
Power+
additive quadratic, p=50, s=10 C
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
10 20 30 40 50 60 70 80
signal amplitude A
FDR+
Power+
additive quadratic, p=500, s=10 D
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
logistic, p=50, s=10 E
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
3 6 9 12 15 18 21 24
signal amplitude A
FDR+
Power+
logistic, p=500, s=10 F
Measure
FDR+
Power+
Figure 6.7: Simulation results of DeepLINK in settings with the cubic link function. Subgures A-F rep-
resent dierent combinations of the factor model design, the number of features p, and the number of
true signalss. ‘FDR+’ and ‘Power+’ denote the empirical FDR and power obtained using the knocko+
threshold. The black dashed line indicates the target FDR level. Each plot shows the change of FDR+ (solid
line) and Power+ (long dashed line) against varying signal amplitudeA. The activation functions used in
autoencoder and DeepPINK are both ELU.
6.2.5 DeepLINKperformswellwithhighlycorrelatedfeatures
An interesting observation was that our simulation models considered in the main text produce
highly correlated covariates. We provide more in-depth investigations on the performance of
DeepLINK using the same linear factor model and linear link function considered in our main
97
paper. To ease the presentation, we dene two featuresi andj to be highly correlated if their
Pearson correlation is at least0:7 in magnitude.
For each repetition in our simulation, in addition to the empirical FDR and power, we also
calculate the following three quantities:
1). The average number of spurious features per true signal,v
1
, where spurious feature is a
null feature that is highly correlated with some true signal.
2). The percentage of highly correlated (true, null) pairs being co-selected,v
2
, calculated as:
v
2
=
number of highly correlated (true positive, false positive) pairs
number of highly correlated (true signal, null feature) pairs
: (6.3)
3). The percentage of vulnerable true signals being selected,v
3
. A true signal is vulnerable if
it is highly correlated with some null feature.
The boxplots of v
1
, v
2
, and v
3
over 100 repetitions are shown in Figure 6.8. It is seen that
on average, each signal feature is highly correlated with 19.01 (mean value ofv
1
) null features.
Among all highly correlated pairs of (true signal, null feature), only a tiny fraction is co-selected
by DeepLINK (Figure 6.8B). In addition, Figure 6.8C shows that vulnerable true signals are almost
always selected by DeepLINK. This simulation example demonstrates that DeepLINK is capable
of selecting true signals in the presence of highly spurious correlations.
98
10
20
30
v 1
Average number of spurious features per true signal A
0.00
0.01
0.02
0.03
v 2
Percentage of highly correlated
(true, null) pairs being co−selected
B
0.88
0.92
0.96
1.00
v 3
Percentage of vulnerable true signals being selected C
Figure 6.8: The boxplots ofv
1
(A),v
2
(B), andv
3
(C) over 100 repetitions under the linear factor model
and linear link function setting. v
1
is the average number of spurious features per true signal. v
2
is the
percentage of highly correlated (true, null) pairs being co-selected.v
3
is the percentage of vulnerable true
signals being selected.
6.2.6 RobustnessofDeepLINKtotheestimationoftheerrormatrix
One potential concern on the real data applications is that the entries of error matrix
^
E can have
non-normal distribution (see Figure 6.9 for the Q-Q plots on the empirical distribution of
^
E from
the three real data sets), but
~
E for the knocko variable construction is sampled from a normal
distribution. We investigate the robustness of DeepLINK to the misspecication of the error dis-
tribution in this section.
Figure 6.9: Q-Q plots of
^
E for three real data sets: microbiome data set (A), murine scRNA-seq data set
(B), and human scRNA-seq data set (C).
99
Motivated from the Q-Q plots in Figure 6.9, we design our simulation study with the entries
of E independently generated from the asymmetric Laplace distribution (ALD). The density of
ALD has three parameters of shape, location, and scale, and takes the following form
f(x;;;) =
8
>
>
>
<
>
>
>
:
1
+
1
e
(
x
)
; x> 0;
1
+
1
e
1
(
x
)
; x< 0:
(6.4)
Table 6.2 summarizes the tted values of,, and from the three real data sets considered in
our main text using the maximum likelihood (MLE) approach. It is seen that the estimated values
of the three parameters are close to each other. The average values of,, and across the three
data sets are 0.632, -0.307, and 0.292, respectively.
Data set
microbiome data set 0.685 -0.222 0.237
murine scRNA-seq data set 0.626 -0.387 0.378
human scRNA-seq data set 0.586 -0.313 0.262
Table 6.2: MLE estimates of,, and for three real data sets.
We perform a simulation study under the setting of the linear factor model and linear link
function withn = 1000;p = 500, ands = 10. The underlying distribution for E is chosen as
ALD with parameters set as the averages of the MLE estimates presented in Table 6.2. When
generating
~
E matrix in knocko variable construction, we experimented with both normal distri-
bution and ALD, where the former is a misspecied distribution and the latter correctly species
the underlying distribution. Simulation results are shown in Figure 6.10. We see that the missp-
cication of the distribution ofE has little impact on both FDR and power, which demonstrates
100
the robustness of DeepLINK. This observation is in line with the one in [25]. In fact, an open
conjecture on the knockos method is that the rst two moments match in distributions may
be sucient for generating valid knocko variables asymptotically (the misspecied Gaussian
achieves exactly the rst two moments match).
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
5 10 15 20 25 30 35 40 45 50
signal amplitude A
FDR+
Power+
Algorithm
ald_fit
normal_fit
Measure
FDR+
Power+
Figure 6.10: Comparison of empirical FDR and power in a simulation study when the error distribution
is misspecied in the knocko variable generation. ‘FDR+’ and ‘Power+’ denote the empirical FDR and
power obtained using the knocko+ threshold. The black dashed line indicates the target FDR level. The
plot shows the change of FDR+ (solid line) and Power+ (long dashed line) for ‘normal_t’ (orange) and
‘ald_t’ (blue) against varying signal amplitudeA, where ‘normal_t’ is the misspecied case.
6.2.7 ComparisonofDeepLINKwithrandomforestsandIPAD
To evaluate the prediction performance of DeepLINK for the real data sets considered in the
main text, we add two methods for comparison in the real data applications. The rst one is
screening+random forests (Column 1 in Table 6.3), where the same distance-correlation-based
101
screening process as for DeepLINK is used. Note that for this method, we need to decide how
many features to retain after the screening step when implementing random forests. To ensure
a fair comparison, we use the same number of selected features by screening+DeepLINK. For
example, if 20 features are selected by screening+DeepLINK, we t random forests using the
top 20 variables from screening by the distance correlation. The second method for comparison
is screening+IPAD (Column 2 in Table 6.3). The number of retained variables in screening is
denoted asd and we considerd up to100 for the microbiome data set (denoted as Zeller) and500
for the two scRNA-seq data sets (denoted as RNA1 and RNA2, respectively). The results for our
method are shown in Column 3 of Table 6.3. The retting step of IPAD on the selected variables
relies on the Lasso estimator for the binary outcome in [151]. All three data sets were randomly
split into screening (50%), training (40%), and test (10%)100 times.
As we can see from Table 6.3, for the two scRNA-seq data sets, DeepLINK (shown in Col-
umn 3) is comparable to the random forests based method (shown in Column 1) in terms of test
error, and signicantly outperforms IPAD (Column 2). Noting that random forests (Column 1)
has no guarantee on the FDR control, it is fair to say that DeepLINK outperforms both compet-
ing methods. For the Zeller data set, the random forests based method (Column 1) outperforms
screening+DeepLINK in terms of test error, while the IPAD based method (Column 2) performs
the worst. To provide further understanding, we ret the training data using random forests with
only features selected by screening+DeepLINK, and calculate the test error of the retted model
(Column 4). Thus, the only dierence between Columns 3 and 4 is that the former uses the MLP
for prediction while the latter uses random forests. We observe that the random forests retting
gives us improved results on the test error. Since the sets of variables used for calculating test
102
errors in Columns 3 and 4 are identical (i.e., the ones selected by screening+DeepLINK), the im-
proved results in Column 4 suggests that the better performance of screening+random forests
(Column 1) than screening+DeepLINK (Column 3) is partly caused by the superior prediction
power of random forests over the MLP for the Zeller application.
Column 1 Column 2 Column 3 Column 4
d = 20 Zeller 0.276 (0.011) 0.479 (0.014) 0.299 (0.012) 0.281 (0.011)
RNA1 0.017 (0.002) 0.087 (0.014) 0.021 (0.003)
RNA2 0.063 (0.003) 0.079 (0.003) 0.072 (0.003)
d = 30 Zeller 0.278 (0.011) 0.502 (0.013) 0.351 (0.011) 0.286 (0.011)
RNA1 0.014 (0.002) 0.059 (0.011) 0.018 (0.002)
RNA2 0.062 (0.003) 0.074 (0.003) 0.068 (0.003)
d = 40 Zeller 0.277 (0.010) 0.517 (0.011) 0.329 (0.012) 0.296 (0.011)
RNA1 0.012 (0.001) 0.024 (0.005) 0.012 (0.002)
RNA2 0.063 (0.003) 0.073 (0.003) 0.064 (0.003)
d = 50 Zeller 0.282 (0.011) 0.514 (0.011) 0.335 (0.012) 0.334 (0.012)
RNA1 0.012 (0.001) 0.032 (0.008) 0.014 (0.002)
RNA2 0.066 (0.004) 0.069 (0.003) 0.060 (0.003)
d = 100 Zeller 0.327 (0.013) 0.520 (0.011) 0.428 (0.014) 0.380 (0.015)
RNA1 0.011 (0.001) 0.014 (0.002) 0.016 (0.002)
RNA2 0.058 (0.003) 0.069 (0.003) 0.059 (0.003)
d = 200 Zeller
RNA1 0.011 (0.001) 0.012 (0.001) 0.010 (0.001)
RNA2 0.055 (0.003) 0.073 (0.003) 0.046 (0.003)
d = 300 Zeller
RNA1 0.009 (0.001) 0.012 (0.001) 0.013 (0.002)
RNA2 0.053 (0.003) 0.076 (0.003) 0.056 (0.003)
d = 400 Zeller
RNA1 0.010 (0.001) 0.013 (0.002) 0.012 (0.002)
RNA2 0.053 (0.003) 0.075 (0.003) 0.049 (0.003)
d = 500 Zeller
RNA1 0.009 (0.001) 0.011 (0.001) 0.015 (0.002)
RNA2 0.052 (0.003) 0.077 (0.003) 0.050 (0.003)
Table 6.3: Comparison of the mean and standard error (in parenthesis) of the misclassication error on the
test data for the three real data sets. Column 1: screening+random forests with matched model size, Col-
umn 2: screening+IPAD-logistic Lasso, Column 3: screening+DeepLINK, Column 4: screening+DeepLINK-
retted random forests. “Zeller" represents the microbiome data set, “RNA1" represents the murine scRNA-
seq data set, and “RNA2" represents the human scRNA-seq data set.d is the number of retained variables
after the screening step.
103
6.2.8 Optimaltrainingandvalidationsplit
Consider the linear regression model
y = X
0
+;
where y = (y
1
; ;y
n
)
T
is the observed response vector, X = (x
1
; ;x
n
)
T
is the observed
design matrix withx
i
i:i:d:
N(0; I
p
) and I
p
thepp identity matrix,
0
= (
0;1
; ;
0;p
)
T
2
R
p
is the vector of regression coecients, and 2 R
n
is the vector of i.i.d. N(0;
2
"
) errors.
Throughout the presentation, we will drop the subscript and use I to denote the identity matrix
of appropriate dimensions to simplify the notation. We use the following ridge regression model
to estimate the coecient vector
0
argmin
2R
p
f
1
n
kyXk
2
+kk
2
g;
where 0 is the regularization parameter andkk stands for the L
2
-norm. Observe that
the ridge regression can be viewed as a special case of neural network (NN) with the linear
activation function and L
2
-regularization on network weights. The goal is to select the regu-
larization parameter from a prechosen grid of length N in the range [
min
;
max
] using the
training/validation splits, where
max
min
p
n
1
p. Our technical analysis uses similar
idea as in [63], and will focus on the moderate-dimensional case ofp=n! 0. But the general
idea may be extended to other cases with some dierences in the technical conditions. We assume
var(x
T
1
0
)
2
"
to avoid the trivial case of negligible model noise. This condition together with
the assumptionx
i
N(0; I) immediately implies thatk
0
k
2
"
.
104
To ease the technical presentation, we introduce some additional notation. Following the
notation in [63], we uset to denote the entire training sample of sizen. We use
^
(t;) to denote
the estimated regression coecient vector based on training samplet and tuning parameter. For
some givenf2 (0;1), suppose that we use100f% of the data as the training sample (denoted as
ft) to estimate the regression coecient, and the remaining100(1f)% of the data (denoted as
(1f)t) as the validation sample to choose the tuning parameter. Here, without loss of generality,
we assume thatnf andn(1f) are both positive integers. Denote byX
f
,y
f
, and
^
f
the design
matrix, the response vector, and the sample covariance matrix of covariates based on sampleft,
respectively. Then we have
^
:=
1
n
X
T
X = f
^
f
+ (1f)
^
1f
. For a given and training
samplet, the ridge regression estimator is
^
(;t) = (X
T
X+nI)
1
X
T
y =n
1
(
^
+I)
1
X
T
y.
We rst review some existing results in the literature (cf. [50]):
d
max
(
^
f
)12
p
pn
1
=O
p
(pn
1
+n
1=2
p
1=6
); (6.5)
d
min
(
^
f
)1+2
p
pn
1
=O
p
(pn
1
+n
1=2
p
1=6
); (6.6)
whered
max
() andd
min
() represent the largest and smallest eigenvalues of a matrix, respectively.
With a given and given training sampleft, the estimated coecient vector is
^
(ft;), and
the empirical loss function on the(1f)t validation set is dened as
`(
^
(ft;);(1f)t) :=
1
(1f)n
y
1f
X
1f
^
(ft;)
2
: (6.7)
It is seen that`(
^
(ft;);(1f)t) is a function off. The tuning parameter that minimizes the
validation loss is denoted as
val
(f). Note that we write such as a function off to emphasize its
105
dependency on the ratio of training/validation splitf=(1f). After obtaining
val
(f), we t the
ridge regression on the entire training sett to obtain the new regression coecient
^
(t;
val
(f)).
We also need a target. Based on the full training sample t, the best we can achieve is to
estimate the regression coecient using the entiret, and then choose as the one that minimizes
the following out of sample population prediction error on an independent test point(x
;y
):
`(
^
(t;);1) :=E[(y
x
^
(t;))
2
jt] =
2
"
+k
0
^
(t;)k
2
; (6.8)
where we use1 to denote the population expectation with respect to test point (x
;y
). It is
seen that`(
^
(t;);1) is a function of. We denote by
opt
the minimizer of`(
^
(t;);1;)
with respect to.
We use
opt
as our target – we want to ndf such that the prediction loss`(
^
(t;
val
(f));1)
can be as close to`(
^
(t;
opt
);1) as possible . To ease the presentation, we will drop1 in the no-
tation of loss function whenever there is no confusion; that is,`(
^
(t;
val
(f))) :=`(
^
(t;
val
(f));1).
We will consider the following decomposition
0`(
^
(t;
val
(f)))`(
^
(t;
opt
))
=
n
`(
^
(t;
val
(f)))`(
^
(ft;
val
(f)))
o
+
n
`(
^
(ft;
val
(f)))`(
^
(ft;
opt
))
o
+
n
`(
^
(ft;
opt
))`(
^
(t;
opt
))
o
:=I
1
+I
2
+I
3
:
We study the three termsI
1
,I
2
, andI
3
on the right-hand side above separately.
106
Let us rst consider termI
1
. Note that for a xed, we have
^
(;t)
0
= (X
T
X+nI)
1
X
T
y
0
=
(
^
+I)
1
^
I
0
+(
^
+I)
1
(n
1
X
T
):
It follows from (6.5) and (6.6) that
k(
^
f
+I)
1
k = (1+)
1
+O
p
(
p
p=n);
k
^
f
+Ik = 1++O
p
(
p
p=n);
wherekk denotes the matrix operator norm. In addition, sinceX and both have i.i.d. Gaussian
components, by Bernstein’s inequality and noting thatX
0
N(0;k
0
k
2
I), we have
n
2
T
XX
T
=
2
"
p
n
(1+o
p
(1));
n
1
T
0
X
T
=O
p
(
1
p
n
):
Then it follows from (6.5) and (6.6) that
T
X(
^
+I)
2
X
T
(1+)
2
T
XX
T
O
p
((p=n)
1=2
)
T
XX
T
:
Thus, it holds that
k(
^
+I)
1
(n
1
X
T
)k
2
=
(1+)
2
+O
p
(
p
p=n)
n
2
T
XX
T
= (1+)
2
2
"
p
n
(1+o
p
(1)):
107
The above result together with (6.6) entails that
k
^
(;t)
0
k
2
=
(
^
+I)
1
^
I
0
2
+
(
^
+I)
1
(n
1
X
T
)
2
=k(
^
+I)
1
0
k
2
+
(
^
+I)
1
(n
1
X
T
)
2
2
T
0
(
^
+I)
2
(n
1
X
T
)
+2
T
0
(
^
+I)
1
^
I
T
(
^
+I)
1
(n
1
X
T
)
=k(
^
+I)
1
0
k
2
+(1+)
2
2
"
p
n
(1+o
p
(1))2
T
0
(
^
+I)
2
(n
1
X
T
)
=
2
(1+)
2
k
0
k
2
(1+o
p
(1))+(1+)
2
2
"
p
n
(1+o
p
(1))
+(1+)
2
r
p
n
(1+o
p
(1)); (6.9)
where in the last step the upper bound for the third term is because of the Cauchy–Schwarz
inequality (i.e., x
T
ykxkkyk for any vectorsx andy). Observe that theo
p
() terms above are
uniformly over all2 [
min
;
max
]. Since
min
max
p
p=n andk
0
k
2
2
"
, we have that
uniformly over all2 [
min
;
max
],
`(
^
(t;))`(
^
(ft;)) =k
^
(;t)
0
k
2
k
^
(;ft)
0
k
2
= (1+)
2
2
"
(
p
n
p
nf
)(1+o
p
(1)):
That is, it is negative for all values of with asymptotic probability one. Hence, we can bound
termI
1
from above by 0.
108
We now bound termI
3
. Using similar arguments as forI
1
, we can deduce that
0I
3
=`(
^
(ft;
opt
))`(
^
(t;
opt
)) =
2
"
(1+
opt
)
2
(
p
fn
p
n
)(1+o
p
(1))
2
"
(1+
min
)
2
(
p
fn
p
n
)(1+o
p
(1)):
It remains to bound termI
2
. Since
val
(f) minimizes`(
^
(ft;);(1f)t), it holds that
I
2
=
`(
^
(ft;
val
(f)))`(
^
(ft;
val
(f));(1f)t)
+
`(
^
(ft;
val
(f));(1f)t)`(
^
(ft;
opt
);(1f)t)
+
`(
^
(ft;
opt
);(1f)t)`(
^
(ft;
opt
))
`(
^
(ft;
val
(f)))`(
^
(ft;
val
(f));(1f)t)
+
`(
^
(ft;
opt
);(1f)t)`(
^
(ft;
opt
))
:
(6.10)
Note that for each given,`(
^
(ft;);(1f)t) is the empirical estimate of`(
^
(ft;);1) based
on (1f)t. But since
opt
(f) is random and depends on t, we need to carefully analyze it. We
start with the following decomposition
`(
^
(ft;);(1f)t)`(
^
(ft;))
= (
^
(;ft)
0
)
T
^
1f
I
(
^
(;ft)
0
)+
1
n(1f)
k
1f
k
2
2
"
+2
1
n(1f)
T
1f
X
1f
(
^
(;ft)
0
):
109
Then in view of (6.10) and the above inequality, we have
I
2
`(
^
(ft;
val
(f)))`(
^
(ft;
val
(f));(1f)t)
+
`(
^
(ft;
opt
);(1f)t)`(
^
(ft;
opt
))
=(
^
(
val
(f);ft)
0
)
T
^
1f
I
(
^
(
val
(f);ft)
0
)2
1
n(1f)
T
1f
X
1f
(
^
(
val
(f);ft)
0
)
+(
^
(
opt
;ft)
0
)
T
^
1f
I
(
^
(
opt
;ft)
0
)+2
1
n(1f)
T
1f
X
1f
(
^
(
opt
;ft)
0
):
(6.11)
Leth() =k
^
(;ft)
0
k. Then conditional onft, for each we havez
:=
1
h()
X
1f
(
^
(;ft)
0
)N(0; I). Observe that
T
1f
X
1f
(
^
(;ft)
0
) = z
T
1f
h(). By the Markov inequality,
we have that for anyr> 0 andt> 0,
P
1
n(1f)
z
T
1f
tjft
E[exp(
r
n(1f)
z
T
1f
)jft]
e
rt
=e
rt
E[exp(rn
1
(1f)
1
z
;1
"
1
)jft]
n(1f)
=e
rt
Eexp(2
1
r
2
n
2
(1f)
2
"
2
1
)
n(1f)
=e
rt
1r
2
n
2
(1f)
2
2
"
n(1f)=2
;
where z
;1
is the rst coordinate of z
and "
1
is the rst corrdinate of
1f
. By symmetry, it
follows that
P
1
n(1f)
jz
T
jt
2e
rt
1r
2
n
2
(1f)
2
2
"
n(1f)=2
:
110
Since there areN choices of tuning parameters in total, if we sett =
q
9
2
n
1
(1f)
1
logN
"
andr =
1
"
p
2n(1f)logN, then it holds that
P
1
n(1f)
sup
jz
T
jt
2Ne
rt
1r
2
n
2
(1f)
2
2
"
n(1f)=2
=O(N
1
):
That is,
1
n(1f)h(
opt
)
T
1f
X
1f
(
^
(
opt
;ft)
0
)
1
n(1f)
sup
jz
T
j
"
r
9
2
n
1
(1f)
1
logN:
Thus, with probability at least1O(N
1
), we have
T
1f
X
1f
(
^
(
opt
;ft)
0
)
h(
opt
)
"
r
9
2
n
1
(1f)
1
logN:
Similarly, we can show that
T
1f
X
1f
(
^
(
val
;ft)
0
)
h(
val
)
"
r
9
2
n
1
(1f)
1
logN:
Using similar arguments as in (6.9), we can deduce that
sup
k
^
(;ft)
0
k (1+
min
)
1
"
r
p
nf
(1+o
p
(1)):
111
The above three results along with (6.11), (6.5), and (6.6) entail that with probability at least
1o(1),
I
2
4sup
k
^
(;ft)
0
k
r
9
2
n
1
(1f)
1
logN +4
r
p
n(1f)
sup
k
^
(;ft)
0
k
2
4(1+
min
)
1
2
"
q
9
2
plogN
n
1
p
f(1f)
+4(1+
min
)
2
2
"
p
n
3=2
1
f
p
1f
:
Combining termsI
1
,I
2
, andI
3
, we can obtain that the optimalf should minimize the follow-
ing upper bound
(1+
min
)
2
2
"
p
n
1f
f
+4(1+
min
)
1
2
"
q
9
2
plogN
n
1
p
f(1f)
+4(1+
min
)
2
2
"
p
n
3=2
1
f
p
1f
;
which is equivalent to
f
opt
= argmin
f
n
p
p
1f
f
+4(1+
min
)
s
9
2
logN
f(1f)
+4
p
p
n
1
f
p
1f
o
: (6.12)
Since the optimization problem in (6.12) does not admit an explicit solution, we numerically
deomstrate how the objective function depends on the split fractionf in Figure 6.11 below when
sample sizen = 1;000, dimensionalityp = 500, the number of grid points for tuning parameter
N = 100, and
min
= 0:01
p
n
1
logp. It is seen that the minimum value is achieved at around
f = 0:7, and the range of 0.6 to 0.8 gives roughly the same upper bound. It is also interesting to
notice that this plot seems to be robust to dierent parameter values in terms of the approximate
location of the minimizer. In our numerical analysis, we chosef = 0:8.
112
Figure 6.11: The objective function in (6.12) as a function off, wheref is the fraction of data used for
training.
113
Bibliography
[1] Luoyan Ai, Haiying Tian, Zhaofei Chen, Huimin Chen, Jie Xu, and Jing-Yuan Fang.
“Systematic evaluation of supervised classiers for fecal microbiota-based prediction of
colorectal cancer”. In: Oncotarget 8.6 (2017), p. 9546.
[2] H Al Mardini, K Bartlett, and CO Record. “Blood and brain concentrations of mercaptans
in hepatic and methanethiol induced coma.” In: Gut 25.3 (1984), pp. 284–290.
[3] SH Aliyu, RK Marriott, MD Curran, S Parmar, N Bentley, NM Brown, JS Brazier, and
H Ludlam. “Real-time PCR investigation into the importance of Fusobacterium
necrophorum as a cause of acute pharyngitis in general practice”. In: Journal of medical
microbiology 53.10 (2004), pp. 1029–1035.
[4] Johannes Alneberg, Brynjar Smári Bjarnason, Ino de Bruijn, Melanie Schirmer,
Joshua Quick, Umer Z Ijaz, Nicholas J Loman, Anders F Andersson, and
Christopher Quince. “CONCOCT: clustering contigs on coverage and composition”. In:
arXiv preprint arXiv:1312.4038 (2013).
[5] Ricard Argelaguet, Britta Velten, Damien Arnol, Sascha Dietrich, Thorsten Zenz,
John C Marioni, Florian Buettner, Wolfgang Huber, and Oliver Stegle. “Multi-Omics
Factor Analysis—a framework for unsupervised integration of multi-omics data sets”. In:
Molecular Systems Biology 14.6 (2018), e8124.
[6] Adam Baghban and Shaili Gupta. “Parvimonas micra: a rare cause of native joint septic
arthritis”. In: Anaerobe 39 (2016), pp. 26–27.
[7] Jushan Bai and Serena Ng. “Determining the number of factors in approximate factor
models”. In: Econometrica 70.1 (2002), pp. 191–221.
[8] Xin Bai, Jie Ren, Yingying Fan, and Fengzhu Sun. “KIMI: Knocko Inference for Motif
Identication from molecular sequences with controlled false discovery rate”. In:
Bioinformatics 37.6 (2021), pp. 759–766.
114
[9] Syeda M Bakhtiar, Jean Guy LeBlanc, Emiliano Salvucci, Amjad Ali, Rebeca Martin,
Philippe Langella, Jean-Marc Chatel, Anderson Miyoshi, Luis G Bermúdez-Humarán,
and Vasco Azevedo. “Implications of the human microbiome in inammatory bowel
diseases”. In: FEMS microbiology letters 342.1 (2013), pp. 10–17.
[10] Rina Foygel Barber and Emmanuel J Candès. “Controlling the false discovery rate via
knockos”. In: The Annals of Statistics 43.5 (2015), pp. 2055–2085.
[11] Stephen Bates, Emmanuel Candès, Lucas Janson, and Wenshuo Wang. “Metropolized
knocko sampling”. In: Journal of the American Statistical Association 116.535 (2021),
pp. 1413–1427.
[12] Yoshua Bengio, Aaron Courville, and Pascal Vincent. “Representation learning: A review
and new perspectives”. In: IEEE transactions on pattern analysis and machine intelligence
35.8 (2013), pp. 1798–1828.
[13] Yoav Benjamini. “Discovering the false discovery rate”. In:JournaloftheRoyalStatistical
Society: series B (statistical methodology) 72.4 (2010), pp. 405–416.
[14] Yoav Benjamini and Yosef Hochberg. “Controlling the false discovery rate: a practical
and powerful approach to multiple testing”. In: Journal of the Royal statistical society:
series B (Methodological) 57.1 (1995), pp. 289–300.
[15] Yoav Benjamini, Abba M Krieger, and Daniel Yekutieli. “Adaptive linear step-up
procedures that control the false discovery rate”. In: Biometrika 93.3 (2006), pp. 491–507.
[16] Yoav Benjamini and Daniel Yekutieli. “The control of the false discovery rate in multiple
testing under dependency”. In: The Annals of Statistics 29.4 (2001), pp. 1165–1188.
[17] Andrew K Benson, Scott A Kelly, Ryan Legge, Fangrui Ma, Soo Jen Low, Jaehyoung Kim,
Min Zhang, Phaik Lyn Oh, Derrick Nehrenberg, Kunjie Hua, et al. “Individuality in gut
microbiota composition is a complex polygenic trait shaped by multiple environmental
and host genetic factors”. In: Proceedings of the National Academy of Sciences 107.44
(2010), pp. 18933–18938.
[18] Hans Bisgaard, Nan Li, Klaus Bonnelykke, Bo Lund Krogsgaard Chawes, Thomas Skov,
Georg Paludan-Müller, Jakob Stokholm, Birgitte Smith, and Karen Angeliki Krogfelt.
“Reduced diversity of the intestinal microbiota during infancy is associated with
increased risk of allergic disease at school age”. In: Journal of Allergy and Clinical
Immunology 128.3 (2011), pp. 646–652.
[19] Yuna Blum, Guillaume Le Mignon, Sandrine Lagarrigue, and David Causeur. “A factor
model to analyze heterogeneity in gene expression”. In: BMC Bioinformatics 11.1 (2010),
pp. 1–12.
115
[20] Annemarie Boleij, Elizabeth M Hechenbleikner, Andrew C Goodwin, Ruchi Badani,
Ellen M Stein, Mark G Lazarev, Brandon Ellis, Karen C Carroll, Emilia Albesiano,
Elizabeth C Wick, et al. “The Bacteroides fragilis toxin gene is prevalent in the colon
mucosa of colorectal cancer patients”. In: Clinical Infectious Diseases 60.2 (2015),
pp. 208–215.
[21] Sebastian Bonhoeer and Paul Sniegowski. “The importance of being erroneous”. In:
Nature 420.6914 (2002), pp. 367–369.
[22] Thomas J Borody, Lawrence J Brandt, and Sudarshan Paramsothy. “Therapeutic faecal
microbiota transplantation: current status and future developments”. In: Current opinion
in gastroenterology 30.1 (2014), p. 97.
[23] Leo Breiman. “Random forests”. In: Machine learning 45.1 (2001), pp. 5–32.
[24] Mya Breitbart and Forest Rohwer. “Here a virus, there a virus, everywhere the same
virus?” In: Trends in microbiology 13.6 (2005), pp. 278–284.
[25] Emmanuel Candès, Yingying Fan, Lucas Janson, and Jinchi Lv. “Panning for gold:
‘model-X’ knockos for high dimensional controlled variable selection”. In: Journal of
the Royal Statistical Society Series B 80.3 (2018), pp. 551–577.
[26] Mauro Castellarin, René L Warren, J Douglas Freeman, Lisa Dreolini,
Martin Krzywinski, Jaclyn Strauss, Rebecca Barnes, Peter Watson, Emma Allen-Vercoe,
Richard A Moore, et al. “Fusobacterium nucleatum infection is prevalent in human
colorectal carcinoma”. In: Genome research 22.2 (2012), pp. 299–306.
[27] Yanfei Chen, Feng Ji, Jing Guo, Ding Shi, Daiqiong Fang, and Lanjuan Li. “Dysbiosis of
small intestinal microbiota in liver cirrhosis and its association with etiology”. In:
Scientic reports 6.1 (2016), pp. 1–9.
[28] Chien-Ming Chi, Patrick Vossler, Yingying Fan, and Jinchi Lv. “Asymptotic Properties of
High-Dimensional Random Forests”. In: arXiv preprint arXiv:2004.13953 (2020).
[29] Rayan Chikhi and Guillaume Rizk. “Space-ecient and exact de Bruijn graph
representation based on a Bloom lter”. In: Algorithms for Molecular Biology 8.1 (2013),
pp. 1–9.
[30] Ilseung Cho and Martin J Blaser. “The human microbiome: at the interface of health and
disease”. In: Nature Reviews Genetics 13.4 (2012), pp. 260–270.
[31] Olabisi Oluwabukola Coker, Zhenwei Dai, Yongzhan Nie, Guijun Zhao, Lei Cao,
Geicho Nakatsu, William KK Wu, Sunny Hei Wong, Zigui Chen, Joseph JY Sung, et al.
“Mucosal microbiome dysbiosis in gastric carcinogenesis”. In: Gut 67.6 (2018),
pp. 1024–1032.
116
[32] Ruben J Colman and David T Rubin. “Fecal microbiota transplantation as therapy for
inammatory bowel disease: a systematic review and meta-analysis”. In: Journal of
Crohn’s and Colitis 8.12 (2014), pp. 1569–1581.
[33] UniProt Consortium. “UniProt: a hub for protein information”. In: Nucleic Acids Research
43.D1 (2015), pp. D204–D212.
[34] A Contreras, N Doan, C Chen, T Rusitanonta, MJ Flynn, and J Slots. “Importance of
Dialister pneumosintes in human periodontitis”. In: Oral microbiology and immunology
15.4 (2000), pp. 269–272.
[35] Zhenwei Dai, Olabisi Oluwabukola Coker, Geicho Nakatsu, William KK Wu,
Liuyang Zhao, Zigui Chen, Francis KL Chan, Karsten Kristiansen, Joseph JY Sung,
Sunny Hei Wong, et al. “Multi-cohort analysis of colorectal cancer metagenome
identied altered bacteria across populations and universal bacterial markers”. In:
Microbiome 6.1 (2018), p. 70.
[36] Spyros Darmanis, Steven A Sloan, Derek Croote, Marco Mignardi, Sophia Chernikova,
Peyman Samghababi, Ye Zhang, Norma Ne, Mark Kowarsky, Christine Caneda, et al.
“Single-cell RNA-seq analysis of inltrating neoplastic cells at the migrating front of
human glioblastoma”. In: Cell Reports 21.5 (2017), pp. 1399–1410.
[37] Les Dethlefsen, Margaret McFall-Ngai, and David A Relman. “An ecological and
evolutionary perspective on human–microbe mutualism and disease”. In: Nature
449.7164 (2007), pp. 811–818.
[38] Francis X Diebold, Glenn D Rudebusch, and S Boragan Aruoba. “The macroeconomy
and the yield curve: a dynamic latent factor approach”. In: Journal of Econometrics
131.1-2 (2006), pp. 309–338.
[39] HC Douglas and Shirley E Gunter. “The taxonomic position of Corynebacterium acnes”.
In: Journal of bacteriology 52.1 (1946), pp. 15–23.
[40] Julia L Drewes, James R White, Christine M Dejea, Payam Fathi, Thevambiga Iyadorai,
Jamuna Vadivelu, April C Roslani, Elizabeth C Wick, Emmanuel F Mongodin,
Mun Fai Loke, et al. “High-resolution bacterial 16S rRNA gene prole meta-analysis and
biolm status reveal common colorectal cancer consortia”. In: NPJ biolms and
microbiomes 3.1 (2017), pp. 1–12.
[41] Bradley Efron, Robert Tibshirani, John D Storey, and Virginia Tusher. “Empirical Bayes
analysis of a microarray experiment”. In: Journal of the American statistical association
96.456 (2001), pp. 1151–1160.
[42] Jianqing Fan and Yingying Fan. “High dimensional classication using features annealed
independence rules”. In: Annals of statistics 36.6 (2008), p. 2605.
117
[43] Jianqing Fan, Yingying Fan, and Jinchi Lv. “High dimensional covariance matrix
estimation using a factor model”. In: Journal of Econometrics 147.1 (2008), pp. 186–197.
[44] Jianqing Fan and Jinchi Lv. “A selective overview of variable selection in high
dimensional feature space”. In: Statistica Sinica 20.1 (2010), p. 101.
[45] Jianqing Fan and Jinchi Lv. “Sure independence screening”. In: Wiley StatsRef: Statistics
Reference Online (2018).
[46] Jianqing Fan and Jinchi Lv. “Sure independence screening for ultrahigh dimensional
feature space”. In: Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 70.5 (2008), pp. 849–911.
[47] Yingying Fan, Emre Demirkaya, Gaorong Li, and Jinchi Lv. “RANK: large-scale inference
with graphical nonlinear knockos”. In: Journal of the American Statistical Association
(2019).
[48] Yingying Fan, Emre Demirkaya, and Jinchi Lv. “Nonuniformity of p-values can occur
early in diverging dimensions”. In: The Journal of Machine Learning Research 20.1 (2019),
pp. 2849–2881.
[49] Yingying Fan, Jinchi Lv, Mahrad Sharifvaghe, and Yoshimasa Uematsu. “IPAD: stable
interpretable forecasting with knockos inference”. In: Journal of the American
Statistical Association 115.532 (2020), pp. 1822–1834.
[50] Ohad N Feldheim and Sasha Sodin. “A universality result for the smallest eigenvalues of
certain sample covariance matrices”. In: Geometric And Functional Analysis 20.1 (2010),
pp. 88–123.
[51] Qiang Feng, Suisha Liang, Huijue Jia, Andreas Stadlmayr, Longqing Tang, Zhou Lan,
Dongya Zhang, Huihua Xia, Xiaoying Xu, Zhuye Jie, et al. “Gut microbiome
development along the colorectal adenoma–carcinoma sequence”. In: Nature
communications 6.1 (2015), pp. 1–13.
[52] Eric Frichot, Sean D Schoville, Guillaume Bouchard, and Olivier François. “Testing for
associations between loci and environmental gradients using latent factor mixed
models”. In: Molecular Biology and Evolution 30.7 (2013), pp. 1687–1699.
[53] Chloé Friguet, Maela Kloareg, and David Causeur. “A factor model approach to multiple
testing under dependence”. In: Journal of the American Statistical Association 104.488
(2009), pp. 1406–1415.
[54] Adrian Fritz, Peter Hofmann, Stephan Majda, Eik Dahms, Johannes Dröge,
Jessika Fiedler, Till R Lesker, Peter Belmann, Matthew Z DeMaere, Aaron E Darling,
et al. “CAMISIM: simulating metagenomes and microbial communities”. In: Microbiome
7.1 (2019), pp. 1–12.
118
[55] Jed A Fuhrman. “Microbial community structure and its functional implications”. In:
Nature 459.7244 (2009), pp. 193–199.
[56] Lan Gao, Yingying Fan, Jinchi Lv, and Qi-Man Shao. “Asymptotic distributions of
high-dimensional distance correlation inference”. In: The Annals of Statistics 49.4 (2021),
pp. 1999–2020.
[57] Yilin Gao, Zifan Zhu, and Fengzhu Sun. “Increasing prediction performance of colorectal
cancer disease status using random forests classication based on metagenomic shotgun
sequencing data”. In: Synthetic and Systems Biotechnology 7.1 (2022), pp. 574–585.
[58] Christopher R Genovese, Kathryn Roeder, and Larry Wasserman. “False discovery
control with p-value weighting”. In: Biometrika 93.3 (2006), pp. 509–524.
[59] Jack A Gilbert, Martin J Blaser, J Gregory Caporaso, Janet K Jansson, Susan V Lynch,
and Rob Knight. “Current understanding of the human microbiome”. In:Naturemedicine
24.4 (2018), pp. 392–400.
[60] Jack A Gilbert and Christopher L Dupont. “Microbial metagenomics: beyond the
genome”. In: Annual review of marine science 3 (2011), pp. 347–371.
[61] Steven R Gill, Mihai Pop, Robert T DeBoy, Paul B Eckburg, Peter J Turnbaugh,
Buck S Samuel, Jerey I Gordon, David A Relman, Claire M Fraser-Liggett, and
Karen E Nelson. “Metagenomic analysis of the human distal gut microbiome”. In: science
312.5778 (2006), pp. 1355–1359.
[62] Carlos A Gomez, Daniel A Gerber, Eduardo Zambrano, Niaz Banaei, Stan Deresinski,
and Brian G Blackburn. “First case of infectious endocarditis caused by Parvimonas
micra”. In: Anaerobe 36 (2015), pp. 53–55.
[63] Isabelle Guyon et al. “A scaling law for the validation-set training-set size ratio”. In:
AT&T Bell Laboratories 1.11 (1997).
[64] Fakhri Haghi, Elshan Goli, Bahman Mirzaei, and Habib Zeighami. “The association
between fecal enterotoxigenic B. fragilis with colorectal cancer”. In: BMC Cancer 19.1
(2019), p. 879.
[65] Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf.
“Support vector machines”. In: IEEE Intelligent Systems and their applications 13.4 (1998),
pp. 18–28.
[66] Gerhard Hommel and T Homann. “Controlled uncertainty”. In: Multiple
Hypothesenprüfung/Multiple Hypotheses Testing. Springer, 1988, pp. 154–161.
[67] Dongming Huang and Lucas Janson. “Relaxing the assumptions of knockos by
conditioning”. In: The Annals of Statistics 48.5 (2020), pp. 3021–3042.
119
[68] Daniel H Huson, Sina Beier, Isabell Flade, Anna Górska, Mohamed El-Hadidi,
Suparna Mitra, Hans-Joachim Ruscheweyh, and Rewati Tappu. “MEGAN community
edition-interactive exploration and analysis of large-scale microbiome sequencing data”.
In: PLoS computational biology 12.6 (2016), e1004957.
[69] Nikolaos Ignatiadis, Bernd Klaus, Judith B Zaugg, and Wolfgang Huber. “Data-driven
hypothesis weighting increases detection power in genome-scale multiple testing”. In:
Nature Methods 13.7 (2016), pp. 577–580.
[70] Rodolphe Jenatton, Nicolas Le Roux, Antoine Bordes, and Guillaume Obozinski. “A
latent factor model for highly multi-relational data”. In: Advances in Neural Information
Processing Systems 25 (NIPS 2012). 2012, pp. 3176–3184.
[71] Ian Jollie. “Principal component analysis”. In: Encyclopedia of statistics in behavioral
science (2005).
[72] Minoru Kanehisa, Susumu Goto, Shuichi Kawashima, Yasushi Okuno, and
Masahiro Hattori. “The KEGG resource for deciphering the genome”. In: Nucleic acids
research 32.suppl_1 (2004), pp. D277–D280.
[73] Dongwan D Kang, Je Froula, Rob Egan, and Zhong Wang. “MetaBAT, an ecient tool
for accurately reconstructing single genomes from complex microbial communities”. In:
PeerJ 3 (2015), e1165.
[74] Fredrik H Karlsson, Valentina Tremaroli, Intawat Nookaew, Göran Bergström,
Carl Johan Behre, Björn Fagerberg, Jens Nielsen, and Fredrik Bäckhed. “Gut
metagenome in European women with normal, impaired and diabetic glucose control”.
In: Nature 498.7452 (2013), pp. 99–103.
[75] Richard Kellermayer, Dorottya Nagy-Szakal, R Alan Harris, Ruth Ann Luna,
Milena Pitashny, Deborah Schady, Sabina AV Mir, Monica E Lopez, Mark A Gilger,
John Belmont, et al. “Serial fecal microbiota transplantation alters mucosal gene
expression in pediatric ulcerative colitis”. In: The American journal of gastroenterology
110.4 (2015), p. 604.
[76] Daehwan Kim, Li Song, Florian P Breitwieser, and Steven L Salzberg. “Centrifuge: rapid
and sensitive classication of metagenomic sequences”. In:Genomeresearch 26.12 (2016),
pp. 1721–1729.
[77] Dan Knights, Elizabeth K Costello, and Rob Knight. “Supervised classication of human
microbiota”. In: FEMS microbiology reviews 35.2 (2011), pp. 343–359.
[78] Dan Knights, Laura Wegener Parfrey, Jesse Zaneveld, Catherine Lozupone, and
Rob Knight. “Human-associated microbial signatures: examining their predictive value”.
In: Cell host & microbe 10.4 (2011), pp. 292–296.
120
[79] Keegan Korthauer, Patrick K Kimes, Claire Duvallet, Alejandro Reyes,
Ayshwarya Subramanian, Mingxiang Teng, Chinmay Shukla, Eric J Alm, and
Stephanie C Hicks. “A practical guide to methods controlling false discoveries in
computational biology”. In: Genome Biology 20.1 (2019), pp. 1–21.
[80] Aleksandar D Kostic, Eunyoung Chun, Lauren Robertson, Jonathan N Glickman,
Carey Ann Gallini, Monia Michaud, Thomas E Clancy, Daniel C Chung, Paul Lochhead,
Georgina L Hold, et al. “Fusobacterium nucleatum potentiates intestinal tumorigenesis
and modulates the tumor-immune microenvironment”. In: Cell host & microbe 14.2
(2013), pp. 207–215.
[81] Aleksandar D Kostic, Ramnik J Xavier, and Dirk Gevers. “The microbiome in
inammatory bowel disease: current status and the future ahead”. In: Gastroenterology
146.6 (2014), pp. 1489–1499.
[82] Sachin Kunde, Angela Pham, Sarah Bonczyk, Teri Crumb, Meg Duba, Harold Conrad Jr,
Deborah Cloney, and Subra Kugathasan. “Safety, tolerability, and clinical response after
fecal transplantation in children and young adults with ulcerative colitis”. In: Journal of
pediatric gastroenterology and nutrition 56.6 (2013), pp. 597–601.
[83] Miron B Kursa, Witold R Rudnicki, et al. “Feature selection with the Boruta package”. In:
J Stat Softw 36.11 (2010), pp. 1–13.
[84] Jean-Christophe Lagier, Grégory Dubourg, Matthieu Million, Frédéric Cadoret,
Melhem Bilen, Florence Fenollar, Anthony Levasseur, Jean-Marc Rolain,
Pierre-Edouard Fournier, and Didier Raoult. “Culturing the human microbiota and
culturomics”. In: Nature Reviews Microbiology 16.9 (2018), pp. 540–550.
[85] Keara Lane, David Van Valen, Mialy M DeFelice, Derek N Macklin, Takamasa Kudo,
Ariel Jaimovich, Ambrose Carr, Tobias Meyer, Dana Pe’er, Stéphane C Boutet, et al.
“Measuring signaling and RNA-seq in the same cell links gene expression to dynamic
patterns of NF-B activation”. In: Cell Systems 4.4 (2017), pp. 458–469.
[86] Erich Leo Lehmann and Joseph P Romano. “Generalizations of the familywise error
rate”. In: Selected Works of EL Lehmann. Springer, 2012, pp. 719–735.
[87] Lihua Lei and William Fithian. “AdaPT: an interactive procedure for multiple testing
with side information”. In: Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 80.4 (2018), pp. 649–679.
[88] Arie Levine, Charlotte I de Bie, Dan Turner, Salvatore Cucchiara, Malgorzata Sladek,
M Stephen Murphy, Johanna C Escher, and
EUROKIDS Porto IBD Working Group of ESPGHAN. “Atypical disease phenotypes in
pediatric ulcerative colitis: 5-year analyses of the EUROKIDS Registry”. In:
Inammatory bowel diseases 19.2 (2013), pp. 370–377.
121
[89] Ang Li and Rina Foygel Barber. “Multiple testing with the structure-adaptive
Benjamini-Hochberg algorithm”. In: Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 81.1 (2019), pp. 45–74.
[90] Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam.
“MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics
assembly via succinct de Bruijn graph”. In: Bioinformatics 31.10 (2015), pp. 1674–1676.
[91] Heng Li. “Aligning sequence reads, clone sequences and assembly contigs with
BWA-MEM”. In: arXiv preprint arXiv:1303.3997 (2013).
[92] Heng Li and Richard Durbin. “Fast and accurate short read alignment with
Burrows–Wheeler transform”. In: bioinformatics 25.14 (2009), pp. 1754–1760.
[93] Thyra Löwenmark, Anna Löfgren-Burström, Carl Zingmark, Vincy Eklöf,
Michael Dahlberg, Sun Nyunt Wai, Pär Larsson, Ingrid Ljuslinder, Soa Edin, and
Richard Palmqvist. “Parvimonas micra as a putative non-invasive faecal biomarker for
colorectal cancer”. In: Scientic reports 10.1 (2020), pp. 1–10.
[94] Yang Young Lu, Ting Chen, Jed A Fuhrman, and Fengzhu Sun. “COCACOLA: binning
metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and
paired-end read LinkAge”. In: Bioinformatics 33.6 (2017), pp. 791–798.
[95] Yang Young Lu, Yingying Fan, Jinchi Lv, and William Staord Noble. “DeepPINK:
reproducible feature selection in deep neural networks”. In: arXiv preprint
arXiv:1809.01185 (2018).
[96] Tom Madden. “The BLAST sequence analysis tool”. In: The NCBI handbook (2003).
[97] Muneer Ahmad Malla, Anamika Dubey, Ashwani Kumar, Shweta Yadav, Abeer Hashem,
and Elsayed Fathi Abd_Allah. “Exploring the human microbiome: the potential future
role of next-generation sequencing in disease diagnosis and treatment”. In: Frontiers in
Immunology 9 (2019), p. 2868.
[98] Izumi Mashima, Citra F Theodorea, Boonyanit Thaweboon, Sroisiri Thaweboon,
Frank A Scannapieco, and Futoshi Nakazawa. “Exploring the salivary microbiome of
children stratied by the oral hygiene index”. In: PloS one 12.9 (2017), e0185274.
[99] Mindy A Maynard, Andrew J Evans, Tomoko Hosomi, Shuntaro Hara,
Michael AS Jewett, and Michael Ohh. “Human HIF-34 is a dominant-negative regulator
of HIF-1 and is down-regulated in renal cell carcinoma”. In: The FASEB Journal 19.11
(2005), pp. 1396–1406.
[100] Sarkis K Mazmanian, Cui Hua Liu, Arthur O Tzianabos, and Dennis L Kasper. “An
immunomodulatory molecule of symbiotic bacteria directs maturation of the host
immune system”. In: Cell 122.1 (2005), pp. 107–118.
122
[101] Warren S McCulloch and Walter Pitts. “A logical calculus of the ideas immanent in
nervous activity”. In: The bulletin of mathematical biophysics 5.4 (1943), pp. 115–133.
[102] Fernando Meyer, Peter Hofmann, Peter Belmann, Ruben Garrido-Oter, Adrian Fritz,
Alexander Sczyrba, and Alice C McHardy. “AMBER: assessment of metagenome
binners”. In: Gigascience 7.6 (2018), giy069.
[103] Paul Moayyedi, Michael G Surette, Peter T Kim, Josie Libertucci, Melanie Wolfe,
Catherine Onischi, David Armstrong, John K Marshall, Zain Kassam, Walter Reinisch,
et al. “Fecal microbiota transplantation induces remission in patients with active
ulcerative colitis in a randomized controlled trial”. In: Gastroenterology 149.1 (2015),
pp. 102–109.
[104] Natalie A Molodecky, Shian Soon, Doreen M Rabi, William A Ghali, Mollie Ferris,
Greg Cherno, Eric I Benchimol, Remo Panaccione, Subrata Ghosh,
Herman W Barkema, et al. “Increasing incidence and prevalence of the inammatory
bowel diseases with time, based on systematic review”. In: Gastroenterology 142.1 (2012),
pp. 46–54.
[105] WEC Moore and Lillian VH Moore. “The bacteria of periodontal diseases”. In:
Periodontology 2000 5.1 (1994), pp. 66–77.
[106] Geicho Nakatsu, Haokui Zhou, William Ka Kei Wu, Sunny Hei Wong,
Olabisi Oluwabukola Coker, Zhenwei Dai, Xiangchun Li, Chun-Ho Szeto,
Naoki Sugimura, Thomas Yuen-Tung Lam, et al. “Alterations in enteric virome are
associated with colorectal cancer and survival outcomes”. In: Gastroenterology 155.2
(2018), pp. 529–541.
[107] Jason M Norman, Scott A Handley, Megan T Baldridge, Lindsay Droit, Catherine Y Liu,
Brian C Keller, Amal Kambal, Cynthia L Monaco, Guoyan Zhao, Phillip Fleshner, et al.
“Disease-specic alterations in the enteric virome in inammatory bowel disease”. In:
Cell 160.3 (2015), pp. 447–460.
[108] Sergey Nurk, Dmitry Meleshko, Anton Korobeynikov, and Pavel A Pevzner.
“metaSPAdes: a new versatile metagenomic assembler”. In: Genome research 27.5 (2017),
pp. 824–834.
[109] Seiji Ohigashi, Kazuki Sudo, Daiki Kobayashi, Osamu Takahashi, Takuya Takahashi,
Takashi Asahara, Koji Nomoto, and Hisashi Onodera. “Changes of the intestinal
microbiota, short chain fatty acids, and fecal pH in patients with colorectal cancer”. In:
Digestive diseases and sciences 58.6 (2013), pp. 1717–1726.
[110] Ola Olén, Johan Askling, Michael C Sachs, Paolo Frumento, M Neovius,
Karin Ekström Smedby, A Ekbom, Petter Malmborg, and Jonas F Ludvigsson.
“Childhood onset inammatory bowel disease and risk of cancer: a Swedish nationwide
cohort study 1964-2014”. In: Bmj 358 (2017).
123
[111] Brian D Ondov, Todd J Treangen, Páll Melsted, Adam B Mallonee, Nicholas H Bergman,
Sergey Koren, and Adam M Phillippy. “Mash: fast genome and metagenome distance
estimation using MinHash”. In: Genome biology 17.1 (2016), pp. 1–14.
[112] Muhammad Aq Osman, Hui-min Neoh, Nurul-Syakima Ab Mutalib, Siok-Fong Chin,
Luqman Mazlan, Raja Aendi Raja Ali, Andee Dzulkarnaen Zakaria, Chai Soon Ngiu,
Mia Yang Ang, and Rahman Jamal. “Parvimonas micra, Peptostreptococcus stomatis,
Fusobacterium nucleatum and Akkermansia muciniphila as a four-bacteria biomarker
panel of colorectal cancer”. In: Scientic reports 11.1 (2021), pp. 1–12.
[113] Norman R Pace. “A molecular view of microbial diversity and the biosphere”. In: Science
276.5313 (1997), pp. 734–740.
[114] Sudarshan Paramsothy, Michael A Kamm, Nadeem O Kaakoush, Alissa J Walsh,
Johan van den Bogaerde, Douglas Samuel, Rupert WL Leong, Susan Connor, Watson Ng,
Ramesh Paramsothy, et al. “Multidonor intensive faecal microbiota transplantation for
active ulcerative colitis: a randomised placebo-controlled trial”. In: The Lancet 389.10075
(2017), pp. 1218–1228.
[115] Edoardo Pasolli, Lucas Schier, Paolo Manghi, Audrey Renson, Valerie Obenchain,
Duy Tin Truong, Francesco Beghini, Faizan Malik, Marcel Ramos, Jennifer B Dowd, et al.
“Accessible, curated metagenomic data through ExperimentHub”. In: Nature methods
14.11 (2017), pp. 1023–1024.
[116] Edoardo Pasolli, Duy Tin Truong, Faizan Malik, Levi Waldron, and Nicola Segata.
“Machine learning meta-analysis of large metagenomic datasets: tools and biological
insights”. In: PLoS computational biology 12.7 (2016), e1004977.
[117] Rob Patro, Stephen M Mount, and Carl Kingsford. “Sailsh enables alignment-free
isoform quantication from RNA-seq reads using lightweight algorithms”. In: Nature
biotechnology 32.5 (2014), pp. 462–464.
[118] Judea Pearl. “Probabilistic reasoning in intelligent systems. 1988”. In: San Mateo, CA:
Kaufmann 23 (1988), pp. 33–34.
[119] Alexandra Perry and Peter Lambert. “Propionibacterium acnes: infection beyond the
skin”. In: Expert review of anti-infective therapy 9.12 (2011), pp. 1149–1156.
[120] Jane Peterson, Susan Garges, Maria Giovanni, Pamela McInnes, Lu Wang,
Jeery A Schloss, Vivien Bonazzi, Jean E McEwen, Kris A Wetterstrand, Carolyn Deal,
et al. “The NIH human microbiome project”. In: Genome research 19.12 (2009),
pp. 2317–2323.
124
[121] Gregory D Poore, Evguenia Kopylova, Qiyun Zhu, Carolina Carpenter, Serena Fraraccio,
Stephen Wandro, Tomasz Kosciolek, Stefan Janssen, Jessica Metcalf, Se Jin Song, et al.
“Microbiome analyses of blood and tissues suggest cancer diagnostic approach”. In:
Nature 579.7800 (2020), pp. 567–574.
[122] Kim D Pruitt, Garth R Brown, Susan M Hiatt, Françoise Thibaud-Nissen,
Alexander Astashyn, Olga Ermolaeva, Catherine M Farrell, Jennifer Hart,
Melissa J Landrum, Kelly M McGarvey, et al. “RefSeq: an update on mammalian
reference sequences”. In: Nucleic acids research 42.D1 (2014), pp. D756–D763.
[123] Rachel V Purcell, Martina Visnovska, Patrick J Biggs, Sebastian Schmeier, and
Frank A Frizelle. “Distinct gut microbiome patterns associate with consensus molecular
subtypes of colorectal cancer”. In: Scientic reports 7.1 (2017), pp. 1–12.
[124] Junjie Qin, Yingrui Li, Zhiming Cai, Shenghui Li, Jianfeng Zhu, Fan Zhang, Suisha Liang,
Wenwei Zhang, Yuanlin Guan, Dongqian Shen, et al. “A metagenome-wide association
study of gut microbiota in type 2 diabetes”. In: Nature 490.7418 (2012), pp. 55–60.
[125] Nan Qin, Fengling Yang, Ang Li, Edi Prifti, Yanfei Chen, Li Shao, Jing Guo,
Emmanuelle Le Chatelier, Jian Yao, Lingjiao Wu, et al. “Alterations of the human gut
microbiome in liver cirrhosis”. In: Nature 513.7516 (2014), pp. 59–64.
[126] Jie Ren, Nathan A Ahlgren, Yang Young Lu, Jed A Fuhrman, and Fengzhu Sun.
“VirFinder: a novel k-mer based tool for identifying viral sequences from assembled
metagenomic data”. In: Microbiome 5.1 (2017), pp. 1–20.
[127] Alejandro Reyes, Laura V Blanton, Song Cao, Guoyan Zhao, Mark Manary, Indi Trehan,
Michelle I Smith, David Wang, Herbert W Virgin, Forest Rohwer, et al. “Gut DNA
viromes of Malawian twins discordant for severe acute malnutrition”. In: Proceedings of
the National Academy of Sciences 112.38 (2015), pp. 11941–11946.
[128] Noortje G Rossen, Susana Fuentes, Mirjam J van der Spek, Jan G Tijssen,
Jorn HA Hartman, Ann Duou, Mark Löwenberg, Gijs R van den Brink,
Elisabeth MH Mathus-Vliegen, Willem M de Vos, et al. “Findings from a randomized
controlled trial of fecal transplantation for patients with ulcerative colitis”. In:
Gastroenterology 149.1 (2015), pp. 110–118.
[129] Fred Sanger and Alan R Coulson. “A rapid method for determining sequences in DNA
by primed synthesis with DNA polymerase”. In: Journal of molecular biology 94.3 (1975),
pp. 441–448.
[130] R Balfour Sartor. “Microbial inuences in inammatory bowel diseases”. In:
Gastroenterology 134.2 (2008), pp. 577–594.
[131] Dwayne C Savage. “Microbial ecology of the gastrointestinal tract”. In: Annual review of
microbiology 31.1 (1977), pp. 107–133.
125
[132] Vered Schildkraut, George Alex, Donald JS Cameron, Winita Hardikar, Barry Lipschitz,
Mark R Oliver, Dianne M Simpson, and Anthony G Catto-Smith. “Sixty-year study of
incidence of childhood ulcerative colitis nds eleven-fold increase beginning in 1990s”.
In: Inammatory Bowel Diseases 19.1 (2013), pp. 1–6.
[133] Patrick D Schloss and Jo Handelsman. “Introducing DOTUR, a computer program for
dening operational taxonomic units and estimating species richness”. In: Applied and
environmental microbiology 71.3 (2005), pp. 1501–1506.
[134] John T Scott Jr. “Factor analysis and regression”. In: Econometrica: Journal of the
Econometric Society (1966), pp. 552–562.
[135] John T Scott Jr et al. “Factor analysis regression revisited”. In: Econometrica 37.4 (1969),
pp. 719–719.
[136] James G Scott, Ryan C Kelly, Matthew A Smith, Pengcheng Zhou, and Robert E Kass.
“False discovery rate regression: an application to neural synchrony detection in
primary visual cortex”. In: Journal of the American Statistical Association 110.510 (2015),
pp. 459–471.
[137] Alexander Sczyrba, Peter Hofmann, Peter Belmann, David Koslicki, Stefan Janssen,
Johannes Dröge, Ivan Gregor, Stephan Majda, Jessika Fiedler, Eik Dahms, et al. “Critical
assessment of metagenome interpretation—a benchmark of metagenomics software”. In:
Nature methods 14.11 (2017), pp. 1063–1071.
[138] Matteo Sesia, Chiara Sabatti, and Emmanuel J Candès. “Gene hunting with hidden
Markov model knockos”. In: Biometrika 106.1 (2019), pp. 1–18.
[139] Fergus Shanahan and Eamonn MM Quigley. “Manipulation of the microbiota for
treatment of IBS and IBD—challenges and controversies”. In: Gastroenterology 146.6
(2014), pp. 1554–1563.
[140] Yelong Shen and Ruoming Jin. “Learning personal+ social latent factor model for social
recommendation”. In: Proceedings of the 18th ACM SIGKDD international conference on
Knowledge discovery and data mining. 2012, pp. 1303–1311.
[141] Yanqiang Shi, Yiwei Dong, Wenhui Huang, Decong Zhu, Hua Mao, and Peizhu Su. “Fecal
microbiota transplantation for ulcerative colitis: a systematic review and meta-analysis”.
In: PloS one 11.6 (2016), e0157259.
[142] Iradj Sobhani, Julien Tap, Françoise Roudot-Thoraval, Jean P Roperch, Sophie Letulle,
Philippe Langella, Gerard Corthier, Jeanne Tran Van Nhieu, and Jean P Furet. “Microbial
dysbiosis in colorectal cancer (CRC) patients”. In: PloS one 6.1 (2011), e16393.
[143] Matthew Stephens. “False discovery rates: a new deal”. In: Biostatistics 18.2 (2017),
pp. 275–294.
126
[144] John D Storey. “A direct approach to false discovery rates”. In: Journal of the Royal
Statistical Society: Series B (Statistical Methodology) 64.3 (2002), pp. 479–498.
[145] John D Storey. “The positive false discovery rate: a Bayesian interpretation and the
q-value”. In: The Annals of Statistics 31.6 (2003), pp. 2013–2035.
[146] David L Suskind, Namita Singh, Heather Nielson, and Ghassan Wahbeh. “Fecal
microbial transplant via nasogastric tube for active pediatric ulcerative colitis”. In:
Journal of pediatric gastroenterology and nutrition 60.1 (2015), pp. 27–29.
[147] Gábor J Székely, Maria L Rizzo, and Nail K Bakirov. “Measuring and testing dependence
by correlation of distances”. In: The annals of statistics 35.6 (2007), pp. 2769–2794.
[148] Gerald W Tannock. “What immunologists should know about bacterial communities of
the human bowel”. In: Seminars in immunology. Vol. 19. 2. Elsevier. 2007, pp. 94–105.
[149] Roman L Tatusov, Natalie D Fedorova, John D Jackson, Aviva R Jacobs, Boris Kiryutin,
Eugene V Koonin, Dmitri M Krylov, Raja Mazumder, Sergei L Mekhedov,
Anastasia N Nikolskaya, et al. “The COG database: an updated version includes
eukaryotes”. In: BMC bioinformatics 4.1 (2003), pp. 1–14.
[150] Andrew Maltez Thomas, Paolo Manghi, Francesco Asnicar, Edoardo Pasolli,
Federica Armanini, Moreno Zolfo, Francesco Beghini, Serena Manara, Nicolai Karcher,
Chiara Pozzi, et al. “Metagenomic analysis of colorectal cancer datasets identies
cross-cohort microbial diagnostic signatures and a link with choline degradation”. In:
Nature medicine 25.4 (2019), pp. 667–678.
[151] Robert Tibshirani. “Regression shrinkage and selection via the lasso”. In: Journal of the
Royal Statistical Society: Series B (Methodological) 58.1 (1996), pp. 267–288.
[152] Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan,
Marijke J Van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. “Transcript
assembly and quantication by RNA-Seq reveals unannotated transcripts and isoform
switching during cell dierentiation”. In: Nature biotechnology 28.5 (2010), pp. 511–515.
[153] Duy Tin Truong, Eric A Franzosa, Timothy L Tickle, Matthias Scholz, George Weingart,
Edoardo Pasolli, Adrian Tett, Curtis Huttenhower, and Nicola Segata. “MetaPhlAn2 for
enhanced metagenomic taxonomic proling”. In: Nature methods 12.10 (2015),
pp. 902–903.
[154] Peter J Turnbaugh, Ruth E Ley, Michael A Mahowald, Vincent Magrini, Elaine R Mardis,
and Jerey I Gordon. “An obesity-associated gut microbiome with increased capacity for
energy harvest”. In: nature 444.7122 (2006), pp. 1027–1031.
[155] Dan Turner and Anne M Griths. “Acute severe ulcerative colitis in children: a
systematic review”. In: Inammatory bowel diseases 17.1 (2011), pp. 440–449.
127
[156] Ajim Uddin and Dantong Yu. “Latent factor model for asset pricing”. In: Journal of
Behavioral and Experimental Finance 27 (2020), p. 100353.
[157] Yoshimasa Uematsu, Yingying Fan, Kun Chen, Jinchi Lv, and Wei Lin. “SOFAR:
large-scale association network learning”. In: IEEE transactions on information theory
65.8 (2019), pp. 4924–4939.
[158] Haruka Uemura, Kayoko Hayakawa, Kayo Shimada, Masayoshi Tojo, Maki Nagamatsu,
Tohru Miyoshi-Akiyama, Saeko Tamura, Kazuhisa Mesaki, Kei Yamamoto,
Yasuaki Yanagawa, et al. “Parvimonas micra as a causative organism of spondylodiscitis:
a report of two cases and a literature review”. In: International Journal of Infectious
Diseases 23 (2014), pp. 53–55.
[159] N Ulger Toprak, AYŞEGÜL Yagci, BM Gulluoglu, ML Akin, P Demirkalem, T Celenk, and
G Soyletir. “A possible role of Bacteroides fragilis enterotoxin in the aetiology of
colorectal cancer”. In: Clinical microbiology and infection 12.8 (2006), pp. 782–786.
[160] Marcel GA Van Der Heijden, Richard D Bardgett, and Nico M Van Straalen. “The unseen
majority: soil microbes as drivers of plant diversity and productivity in terrestrial
ecosystems”. In: Ecology letters 11.3 (2008), pp. 296–310.
[161] Els Van Nood, Anne Vrieze, Max Nieuwdorp, Susana Fuentes, Erwin G Zoetendal,
Willem M de Vos, Caroline E Visser, Ed J Kuijper, Joep FWM Bartelsman,
Jan GP Tijssen, et al. “Duodenal infusion of donor feces for recurrent Clostridium
dicile”. In: New England Journal of Medicine 368.5 (2013), pp. 407–415.
[162] Yvan Vandenplas, Geneviève Veereman, J van der Wer Ten Bosch, Annieta Goossens,
Denis Pierard, JN Samsom, and JC Escher. “Fecal microbial transplantation in
early-onset colitis: caution advised”. In: Journal of Pediatric Gastroenterology and
Nutrition 61.3 (2015), e12–e14.
[163] Katie S Viljoen, Amirtha Dakshinamurthy, Paul Goldberg, and Jonathan M Blackburn.
“Quantitative proling of colorectal cancer-associated bacteria reveals associations
between fusobacterium spp., enterotoxigenic Bacteroides fragilis (ETBF) and
clinicopathological features of colorectal cancer”. In: PloS one 10.3 (2015), e0119462.
[164] Jakob Wirbel, Paul Theodor Pyl, Ece Kartal, Konrad Zych, Alireza Kashani,
Alessio Milanese, Jonas S Fleck, Anita Y Voigt, Albert Palleja, Ruby Ponnudurai, et al.
“Meta-analysis of fecal metagenomes reveals global microbial signatures that are
specic for colorectal cancer”. In: Nature medicine 25.4 (2019), pp. 679–689.
[165] Derrick E Wood, Jennifer Lu, and Ben Langmead. “Improved metagenomic analysis with
Kraken 2”. In: Genome biology 20.1 (2019), pp. 1–13.
128
[166] Yu-Wei Wu, Blake A Simmons, and Steven W Singer. “MaxBin 2.0: an automated
binning algorithm to recover genomes from multiple metagenomic datasets”. In:
Bioinformatics 32.4 (2016), pp. 605–607.
[167] Li C Xia, Jacob A Cram, Ting Chen, Jed A Fuhrman, and Fengzhu Sun. “Accurate
genome relative abundance estimation based on shotgun metagenomic reads”. In: PloS
one 6.12 (2011), e27992.
[168] Xin Xing, Jun S Liu, and Wenxuan Zhong. “MetaGen: reference-free learning with
multiple metagenomic samples”. In: Genome biology 18.1 (2017), pp. 1–15.
[169] Jun Yu, Qiang Feng, Sunny Hei Wong, Dongya Zhang, Qiao yi Liang, Youwen Qin,
Longqing Tang, Hui Zhao, Jan Stenvang, Yanli Li, et al. “Metagenomic analysis of faecal
microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer”.
In: Gut 66.1 (2017), pp. 70–78.
[170] Georg Zeller, Julien Tap, Anita Y Voigt, Shinichi Sunagawa, Jens Roat Kultima,
Paul I Costea, Aurélien Amiot, Jürgen Böhm, Francesco Brunetti, Nina Habermann, et al.
“Potential of fecal microbiota for early-stage detection of colorectal cancer”. In:
Molecular systems biology 10.11 (2014), p. 766.
[171] Zemin Zheng, Jinchi Lv, and Wei Lin. “Nonsparse learning with latent variables”. In:
Operations Research 69.1 (2021), pp. 346–359.
[172] Zifan Zhu, Yingying Fan, Yinfei Kong, Jinchi Lv, and Fengzhu Sun. “DeepLINK: Deep
learning inference using knockos with applications to genomics”. In: Proceedings of the
National Academy of Sciences 118.36 (2021).
[173] Zifan Zhu, Jie Ren, Sonia Michail, and Fengzhu Sun. “MicroPro: using metagenomic
unmapped reads to provide insights into human microbiota and disease associations”.
In: Genome biology 20.1 (2019), pp. 1–13.
[174] Laurence Zitvogel, Yuting Ma, Didier Raoult, Guido Kroemer, and Thomas F Gajewski.
“The microbiome in cancer immunotherapy: Diagnostic tools and therapeutic
strategies”. In: Science 359.6382 (2018), pp. 1366–1370.
[175] Erwin G Zoetendal, Jeroen Raes, Bartholomeus Van Den Bogert,
Manimozhiyan Arumugam, Carien CGM Booijink, Freddy J Troost, Peer Bork,
Michiel Wels, Willem M De Vos, and Michiel Kleerebezem. “The human small intestinal
microbiota is driven by rapid uptake and conversion of simple carbohydrates”. In: The
ISME journal 6.7 (2012), pp. 1415–1426.
129
Abstract (if available)
Abstract
Trillions of microbes inhabit human body and play an important role in metabolism, reproduction, immune system activity, and development of disorders and diseases. Starting from the 19th century, human microbiome analysis mainly relies on culture based methods. However, our view of the microbial world was greatly limited since a large fraction of microbes were not cultivable. With the rapid development and reduced cost of the sequencing technology, metagenomic studies coupled with the 16S rRNA sequencing and the whole genome shotgun (WGS) sequencing markedly expand our understanding of the microbial community and bring the human microbiome analysis to a new era. Great technological revolutions come with great challenges. Many important scientific questions in the field of human microbiome await new ways of thinking and new types of analysis methodology. In this dissertation, we present our efforts to address several research problems in human microbiome anlayses using computational algorithms and statistical modelings.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Exploration of human microbiome through metagenomic analysis and computational algorithms
PDF
Enhancing phenotype prediction through integrative analysis of heterogeneous microbiome studies
PDF
Statistical and computational approaches for analyzing metagenomic sequences with reproducibility and reliability
PDF
Deep learning in metagenomics: from metagenomic contigs sorting to phage-bacterial association prediction
PDF
Predicting virus-host interactions using genomic data and applications in metagenomics
PDF
Constructing metagenome-assembled genomes and mobile genetic element host interactions using metagenomic Hi-C
PDF
Developing statistical and algorithmic methods for shotgun metagenomics and time series analysis
PDF
Feature engineering and supervised learning on metagenomic sequence data
PDF
Mapping genetic variants for nonsense-mediated mRNA decay regulation across human tissues
PDF
Application of machine learning methods in genomic data analysis
PDF
Sharpening the edge of tools for microbial diversity analysis
PDF
Validating structural variations: from traditional algorithms to deep learning approaches
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Data-driven learning for dynamical systems in biology
PDF
Computational algorithms for studying human genetic variations -- structural variations and variable number tandem repeats
PDF
Machine learning of DNA shape and spatial geometry
PDF
Statistical methods and analyses in the Multiethnic Cohort (MEC) human gut microbiome data
PDF
danbing-tk: a computational genetics framework for studying variable number tandem repeats
PDF
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
Asset Metadata
Creator
Zhu, Zifan
(author)
Core Title
Computational algorithms and statistical modelings in human microbiome analyses
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Degree Conferral Date
2022-05
Publication Date
03/01/2022
Defense Date
02/28/2022
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
deep learning,feature selection,machine learning,metagenomics,microbiome,OAI-PMH Harvest
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Sun, Fengzhu (
committee chair
), Chen, Liang (
committee member
), Fan, Yingying (
committee member
), Michail, Sonia (
committee member
)
Creator Email
zifanzhu@alumni.usc.edu,zifanzhu@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC110765150
Unique identifier
UC110765150
Legacy Identifier
etd-ZhuZifan-10411
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Zhu, Zifan
Type
texts
Source
20220301-usctheses-batch-914
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
deep learning
feature selection
machine learning
metagenomics
microbiome