Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Exploration of human microbiome through metagenomic analysis and computational algorithms
(USC Thesis Other)
Exploration of human microbiome through metagenomic analysis and computational algorithms
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
EXPLORATION OF HUMAN MICROBIOME THROUGH METAGENOMIC ANALYSES AND
COMPUTATIONAL ALGORITHMS
by
Wenxuan Zuo
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
May 2024
Copyright 2024 Wenxuan Zuo
Dedication
I dedicate this dissertation to my beloved parents, Runqing Fu and Changbo Zuo, for their unwavering
support and encouragement throughout this challenging yet rewarding journey.
ii
Acknowledgements
First of all, I would like to extend my most sincere gratitude to my advisor, Dr. Fengzhu Sun, for his unwavering support, invaluable guidance and mentorship that have been instrumental throughout my Ph.D.
journey. His commitment to academic excellence, passion for research, and insightful feedback have significantly shaped this dissertation and enriched my academic experience. I truly appreciate the opportunity
to work under Dr. Sun’s mentorship, and I am deeply grateful for his inspiration and encouragement that
have been devoted to my academic and personal growth.
I would like to thank my Ph.D. qualification and dissertation committee members, Dr. Liang Chen,
Dr. Michael D. Edge, Dr. Jinchi Lv, and Dr. Sonia Michail for their invaluable feedback and constructive
suggestions on my research.
I would like to thank Dr. Yingying Fan and Dr. Jinchi Lv for their unreserved guidance on the
DeepLINK-T project. Their insightful suggestions and dedicated commitment to the project significantly
expands my understanding of statistical applications, enriching my research journey.
I would like to thank Dr. Sonia Michail for inspiring me to delve into the intricate relationship between
gut diseases and human microbiome. Her profound expertise in the field of gut diseases and invaluable
biological insights substantially elevated the quality of my research.
I would like to thank Dr. Shanfeng Zhu for introducing me to the amazing world of computational
biology and bioinformatics. I truly appreciate the opportunity to conduct research under his guidance
during my undergraduate studies.
iii
I would also like to thank all my collaborators, Dr. Xin Bai, Yuxuan Du, Dr. Yingying Fan, Dr. Jed A.
Fuhrman, Dr. Yihui Luan, Dr. Jinchi Lv, Dr. Sonia Michail, Beibei Wang, Dr. Yi-Chun Yeh and Dr. Zifan
Zhu for their tremendous support and commitment to joint publications.
I would like to thank professors at USC, Dr. Peter Calabrese, Dr. Mark Chaisson, Dr. Liang Chen, Dr.
Vsevolod Katritch, Dr. Adam Maclean, Dr. Andrew Smith and Dr. Rory Spence for teaching me invaluable
knowledge about computational biology.
I would like to thank all my labmates at Sun lab: Dr. Xin Bai, Yuxuan Du, Dallace Francis, Dr. Yilin
Gao, Jiawei Huang, Yue Huang, Dr. Kujin Tang, Dr. Tianqi Tang, Beibei Wang, Dr. Weili Wang, Yuqiu
Wang, Dr. Ziye Wang and Dr. Zifan Zhu. I would also like to thank other friends at USC: Dr. Brendon
Cooper, Bryan Dinh, Bida Gu, Xinyu Guo, Wei Jiang, Yibei Jiang, Jordy Lam, Junjian Liu, Yingtong Liu,
Dr. Tsung-Yu Lu, Meilu McDermott, Raktim Mitra, Dandan Peng, Dr. Jingwen Ren, Dr. Bo Sun, Vardges
Tserunyan, Yingfei Wang, Xiaojun Wu, Quentin Yang, Qingyang Yin, Yuxiang Zhan and many others.
Last but most importantly, I would like to express my deepest gratitude to my parents Runqing Fu and
Changbo Zuo for their unwavering support and boundless love. I am thankful for their teachings, guidance
and constant companionship at every step of this journey.
iv
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Identifying the phage signature in colorectal cancer through metagenomic analyses of
heterogeneous microbiome studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Feature selection for longitudinal data using deep learning and model-X knockoffs
framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2: Metagenomic analyses of multiple gut datasets revealed the association of phage
signatures in colorectal cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Cohort description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Quantification of viral abundance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.3 Taxonomic annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.4 Viral functional profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.5 Viral diversity, multivariate analysis and meta-analysis . . . . . . . . . . . . . . . . 15
2.2.6 Principal coordinate analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.7 Differential abundance analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.8 Quantification of bacterial abundance and virus-bacterium association . . . . . . . 16
2.2.9 Random forest classifier for within-study and cross-study prediction . . . . . . . . 16
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Case-control comparison showed higher viral diversity in CRC samples compared
to healthy controls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Principal coordinates analysis among different studies . . . . . . . . . . . . . . . . 19
2.3.3 Differential abundance analysis revealed important viral taxon and metabolic
pathways associated with CRC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.4 Interkingdom association between viral families and bacterial species . . . . . . . 24
v
2.3.5 Random forests classifiers accurately predict CRC status based on human gut
virome profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Chapter 3: DeepLINK-T: deep learning inference for time series data using knockoffs and LSTM . . 32
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Model setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 The model-X knockoffs framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.3 DeepLINK-T: a new deep learning inference method for time series data . . . . . . 38
3.3 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1 The impacts of hyperparameters and model misspecification on DeepLINK-T . . . 47
3.3.2 Comparisons of DeepLINK-T and DeepLINK . . . . . . . . . . . . . . . . . . . . . . 49
3.3.3 The impacts of number of subjects on DeepLINK-T . . . . . . . . . . . . . . . . . . 52
3.4 Real data applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.1 Application to longitudinal gut microbiome data of early infants . . . . . . . . . . 53
3.4.2 Application to marine metagenomic time series data . . . . . . . . . . . . . . . . . 55
3.4.2.1 Identifying primary chlorophyll-a producer . . . . . . . . . . . . . . . . 55
3.4.2.2 Identifying taxa significantly associated with prokaryotic production . . 56
3.4.3 Application to dietary glycans treatment time series data . . . . . . . . . . . . . . . 57
3.5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 4: Conclusions and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.1 Future work for gut virome analysis in CRC . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2 Future work for DeepLINK-T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Chapter 5: Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1 Appendix for chapter1: Metagenomic analyses of multiple gut datasets revealed the
association of phage signatures in colorectal cancer . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Appendix for Chapter 2: DeepLINK-T: deep learning inference for time series data using
knockoffs and LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2.1 The impacts of hyperparameters and model misspecification on DeepLINK-T . . . 78
5.2.2 Application to longitudinal gut microbiome data set of early infants . . . . . . . . 79
5.2.3 Application to marine metagenomic time series data set . . . . . . . . . . . . . . . 80
5.2.4 Application to dietary glycans treatment time series data set . . . . . . . . . . . . . 88
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
vi
List of Tables
2.1 Characteristics of metagenomic datasets used in this study. . . . . . . . . . . . . . . . . . . 13
3.1 Selection frequencies of top 20 selected genera for Bacteroides and Bifidobacterium over
200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Selection frequencies of top 10 selected taxa for chl-a concentration at the genus, family,
and order levels over 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3 Selection frequencies of top 10 selected taxa for differentiating the FOS and PDX groups
at the genus, family, and order levels over 200 runs. . . . . . . . . . . . . . . . . . . . . . . 60
5.1 Selection frequencies of top 10 selected taxa for leucine incorporation at the genus, family,
and order levels over 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
vii
List of Figures
2.1 Analysis of viral species Shannon diversity within each dataset. (A) Boxplots of viral
species-level Shannon index for gut samples of CRC subjects and healthy controls
stratified by disease status in each dataset. BH adjusted p-values were calculated using the
two-tailed Wilcoxon rank-sum test. ns:p> 0.05, *p< 0.05, **p< 0.01, ***p< 0.001. (B)
Multivariate analysis of the adjusted impact of age, gender and BMI on Shannon diversity.
(C) Forest plot showing effect sizes from a meta-analysis on species-level diversity. RE
Model: Random effect model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Principal coordinates analysis of all samples based on Bray–Curtis distance. (A) PCoA
plot of gut samples of CRC subjects and healthy controls in each dataset. R2 values and
p-values were calculated by PERMANOVA. (B) Boxplots of the first principal coordinates
(PCo1) in each dataset. (C) Boxplots of the second principal coordinates (PCo2) in each
dataset. BH adjusted p-values were calculated using the two-tailed Wilcoxon rank-sum
test. ns:p> 0.05, *p< 0.05, **p< 0.01, ***p< 0.001, ****p< 0.0001. (D) Forest plot
showing effect sizes from a meta-analysis on PCo1. (E) Forest plot showing effect sizes
from a meta-analysis on PCo2. RE Model: Random effect model. . . . . . . . . . . . . . . . 21
2.3 Differential abundance analysis on taxonomic and functional viral profiles. (A) UpSet
plot showing the number of shared differentially abundant viral species determined by
species-level TMM normalized abundance and DESeq2. Only viral species differentiated
in at least 5 datasets were displayed. (B) UpSet plot showing the number of shared
differentially abundant viral pathways determined by HUMAnN3 pathway abundance
and DESeq2. (C) Heatmap showing the log transformed TMM normalized abundance of
viral species differentiated in all 7 datasets. (D) Heatmap showing the log transformed
HUMAnN3 pathway abundance of pathways differentiated in at least 3 datasets. . . . . . . 24
2.4 Correlations between viral families and bacterial species. (A) Random effect size of
Spearman’s correlation coefficients between the diversity and richness of bacteria and
viruses in healthy controls and CRC subjects. Correlations with BH adjusted p-values
< 0.05 are displayed. (B) Random effect size of Spearman’s correlation coefficients
between the abundance of all 24 viral families and that of 27 differentially abundant
bacterial species. Correlations with BH adjusted p-values < 0.05 are displayed. The size
and color of circles indicate the extent of correlation. . . . . . . . . . . . . . . . . . . . . . 26
viii
2.5 Prediction performances of random forest classifiers based on gut viral abundance. (A)
Within and cross study AUROC matrix obtained by using GPD genome-level abundance.
The diagonal refers to results of cross validation within each dataset. Off-diagonal values
refer to prediction results trained on the study of each row and tested on the study of
each column. (B) Within and cross study AUROC matrix obtained by using species-level
abundance. See Supplementary Figure 5.12(A) and (B) for genus-level and family-level
AUROC. (C) Within and cross study AUROC matrix obtained by using gene-family
abundance. See Supplementary Figure 5.12(C) for pathway AUROC. (D) LODO results
with the x axis indicating the study left out as the validation set and other studies
combined as the training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 The structure of the LSTM autoencoder. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 The structure of the LSTM prediction network. . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 The structure of a LSTM cell. x is the input of the LSTM cell. f, i and o represent the
forget, input and output gates, respectively. c˜t denotes the input activation. h and c are
hidden state and cell state. Subscript t indicates the time step. . . . . . . . . . . . . . . . . 43
3.4 The impacts of bottleneck dimensionality and training epochs on DeepLINK-T using the
linear factor model. A and B are FDR and power under the setting of linear link function.
C and D are FDR and power under the setting of nonlinear link function. The number of
training epochs of both LSTM autoencoder and LSTM prediction network is specified on
the x-axis. The bottleneck dimensionality of the LSTM autoencoder is specified on the
y-axis. The pre-specified FDR level is q = 0.2. . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 The impacts of bottleneck dimensionality and training epochs on DeepLINK-T using the
logistic factor model. A and B are FDR and power under the setting of linear link function.
C and D are FDR and power under the setting of nonlinear link function. The number of
training epochs of both LSTM autoencoder and LSTM prediction network is specified on
the x-axis. The bottleneck dimensionality of the LSTM autoencoder is specified on the
y-axis. The pre-specified FDR level is q = 0.2. . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 Comparisons of DeepLINK-T and DeepLINK based on simulated data with the linear
factor model. The number of subjects m is 1, the number of time points n is 1000, and the
number of true signals is 10. Each subtitle suggests the link function l, and the number of
features p. The solid lines and dashed lines stand for Power+ and FDR+, respectively. The
green lines and red lines represent the results of DeepLINK-T and DeepLINK, respectively. 49
3.7 Comparisons of DeepLINK-T and DeepLINK based on simulated data with the logistic
factor model. The number of subjects m is 1, the number of time points is 1000, and the
number of true signals is 10. Each subtitle suggests the link function l, and the number of
features p. The solid lines and dashed lines stand for Power+ and FDR+, respectively. The
green lines and red lines represent the results of DeepLINK-T and DeepLINK, respectively. 50
ix
3.8 The impacts of number of subjects m on DeepLINK-T. In this simulation, the number of
time points is 1000, the number of features is 500, and the number of true signals is 10.
Each subtitle suggests the link function l, and the factor model. The solid lines and dashed
lines stand for Power+ and FDR+, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1 The distribution of sequence depths within each dataset used in this study. . . . . . . . . . 66
5.2 The distribution of mapping rates for each dataset used in this study. . . . . . . . . . . . . 67
5.3 Boxplots of (A) genus-level and (B) family-level Shannon diversity within each dataset.
P-values were calculated using the two-tailed Wilcoxon rank-sum test. ns:p> 0.05,
*p< 0.05, **p< 0.01, ***p< 0.001, ****p< 0.0001. (C) Forest plot showing effect sizes
from a meta-analysis on genus-level diversity (I
2 = 44.10%, Q test p-value= 0.094).
(D) Forest plot showing effect sizes from a meta-analysis on family-level diversity
(I
2 = 66.51%, Q test p-value= 0.012). RE Model: Random effect model. . . . . . . . . . . 68
5.4 Analysis of viral species Heip evenness within each dataset. (A) Boxplots of viral specieslevel Heip evenness for gut samples of CRC subjects and healthy controls stratified by
disease status in each dataset. BH adjusted p-values were calculated using the two-tailed
Wilcoxon rank-sum test. ns:p> 0.05, *p< 0.05, **p< 0.01, ***p< 0.001. (B) Multivariate
analysis of the adjusted impact of age, gender and BMI on Heip evenness. (C) Forest plot
showing effect sizes from a meta-analysis on species-level evenness (I
2 = 0.00%, Q test
p-value= 0.89). RE Model: Random effect model. . . . . . . . . . . . . . . . . . . . . . . . 69
5.5 Analysis of viral species Chao1 richness within each dataset. (A) Boxplots of viral specieslevelChao1 richness for gut samples of CRC subjects and healthy controls stratified by
disease status in each dataset. BH adjusted p-values were calculated using the two-tailed
Wilcoxon rank-sum test. ns:p> 0.05, *p< 0.05, **p< 0.01, ***p< 0.001. (B) Multivariate
analysis of the adjusted impact of age, gender and BMI on Chao1 richness. (C) Forest plot
showing effect sizes from a meta-analysis on species-level richness (I
2 = 69.93%, Q test
p-value= 0.005). RE Model: Random effect model. . . . . . . . . . . . . . . . . . . . . . . . 70
5.6 PCoA plot of gut samples of CRC subjects and healthy controls for each separate dataset.
R2 values and p-values on subtitles were calculated by PERMANOVA to quantify the
separation of the samples into two groups. . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.7 Boxplots showing the log transformed TMM normalized abundance of significant CRCassociated viral species within each dataset. P-values were calculated using the two-tailed
Wilcoxon rank-sum test. ns:p> 0.05, *p< 0.05, **p< 0.01, ***p< 0.001, ****p< 0.0001. . . 72
5.8 Boxplots showing the log transformed TMM normalized abundance of significant CRCassociated viral genera within each dataset. P-values were calculated using the two-tailed
Wilcoxon rank-sum test. ns:p> 0.05, *p< 0.05, **p< 0.01, ***p< 0.001, ****p< 0.0001. . . 73
x
5.9 Boxplots showing the log transformed TMM normalized abundance of significant CRCassociated viral families ((A) Drexlerviridae, (B) Inoviridae, (C) Herelleviridae)) and
families to which significant CRC-associated viral species belong ((D) Myoviridae, (E)
Podoviridae, (F) Siphoviridae)) within each dataset. P-values were calculated using the
two-tailed Wilcoxon rank-sum test. ns:p> 0.05, *p< 0.05, **p< 0.01, ***p< 0.001,
****p< 0.0001. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.10 Boxplots showing the log transformed HUMAnN3 pathway abundance of significant
pathways within each dataset. P-values were calculated using the two-tailed Wilcoxon
rank-sum test. ns:p> 0.05, *p< 0.05, **p< 0.01, ***p< 0.001, ****p< 0.0001. . . . . . . . . 75
5.11 Alpha diversity of bacterial species. (A) Boxplots showing bacterial Shannon index.
(B) Boxplots showing bacterial richness. P-values were calculated using the two-tailed
Wilcoxon rank-sum test. ns:p> 0.05, *p< 0.05, **p< 0.01, ***p< 0.001, ****p< 0.0001.
(C) Forest plot showing effect sizes from a meta-analysis on species-level bacterial
diversity (meta-analysis I
2 = 73.31%, Q test p-value = 0.0029). (D) Forest plot showing
effect sizes from a meta-analysis on species-level bacterial richness (meta-analysis
I
2 = 81.82%, Q test p-value < 0.0001) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.12 Prediction performances of random forest classifiers based on gut viral abundance. (A)
Within and cross study AUROC matrix obtained by using genus-level abundance. The
diagonal refers to results of cross validation within each dataset. Off-diagonal values
refer to prediction results trained on the study on each row and tested on the study on
each column. (B) Within and cross study AUROC matrix obtained by using family-level
abundance. (C) Within and cross study AUROC matrix obtained by using pathway
abundance. (D) LODO results with the x axis indicating the study left out as the validation
set and other studies combined as the training set. . . . . . . . . . . . . . . . . . . . . . . . 77
5.13 Prediction performances of random forest classifiers based on gut microbial abundance.
(A) Within and cross study AUROC matrix obtained by using bacterial species-level
abundance. The diagonal refers to results of cross validation within each dataset. Offdiagonal values refer to prediction results trained on the study on each row and tested
on the study on each column. (B) Within and cross study AUROC matrix obtained by
using both bacterial and viral species-level abundance profiles. (C) Within and cross study
AUROC matrix obtained by using both bacterial species-level and viral genome-level
abundance profiles. (D) LODO results with the x axis indicating the study left out as the
validation set and other studies combined as the training set. . . . . . . . . . . . . . . . . . 78
5.14 The impacts of bottleneck dimensionality and training epochs on DeepLINK-T using
the linear factor model. Latent confouders are added to make the regression model
misspecified. A and B are FDR and power under the setting of linear link function. C
and D are FDR and power under the setting of nonlinear link function. The number of
training epochs of both LSTM autoencoder and LSTM prediction network is specified on
the x-axis. The bottleneck dimensionality of the LSTM autoencoder is specified on the
y-axis. The pre-specified FDR level is q = 0.2. . . . . . . . . . . . . . . . . . . . . . . . . . 79
xi
5.15 The impacts of bottleneck dimensionality and training epochs on DeepLINK-T using
the logistic factor model. Latent confouders are added to make the regression model
misspecified. A and B are FDR and power under the setting of linear link function. C
and D are FDR and power under the setting of nonlinear link function. The number of
training epochs of both LSTM autoencoder and LSTM prediction network is specified on
the x-axis. The bottleneck dimensionality of the LSTM autoencoder is specified on the
y-axis. The pre-specified FDR level is q = 0.2. . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.16 The distribution of selection frequencies on the gut microbiome data set of early infants
over 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.17 The abundance variation of Bacteroides and its top 10 selected genera. . . . . . . . . . . . . 82
5.18 The abundance variation of Bifidobacterium and its top 10 selected genera. . . . . . . . . . 82
5.19 The distribution of selection frequencies for response chlorophyll-a concentration on the
SPOT data set over 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.20 The variation of chlorophyll-a concentration and the abundance of its top 3 selected taxa
at the genus, family, and order levels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.21 The distribution of selection frequencies for response leucine incorporation on the SPOT
data set over 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.22 The variation of leucine incorporation and the abundance of its top 3 selected taxa at the
genus, family, and order levels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.23 The distribution of selection frequencies on the dietary glycans treatment time series data
set over 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.24 The abundance variation of the top 10 selected genera between the FOS and PDX groups. . 90
5.25 The abundance variation of the top 10 selected families between the FOS and PDX groups. 90
5.26 The abundance variation of the top 10 selected orders between the FOS and PDX groups. . 91
xii
Abstract
Human microbes, including bacteria, eukaryotes, viruses, and many other microorganisms, inhabit diverse
regions throughout the human body. These microbes together with their genes constitute the human microbiome. And this complicated ecosystem is significant in maintaining human health and is associated
with various diseases. Recent developments in metagenomic analysis have provided unprecedented insight
into the human microbiome. However, these breakthroughs also bring up with novel difficulties, including
the identification of microbial signatures in specific diseases and the detection of disease-associated microbes. This dissertation addresses these challenges through a comprehensive exploration of metagenomic
analysis. In Chapter 2, we present a multifaceted investigation into the gut phageome in colorectal cancer,
uncovering distinct phage signatures and their potential in disease prediction. In Chapter 3, we introduce
DeepLINK-T to achieve FDR control in high-dimensional statistical inference for time-series data using the
integration of long short-term memory and model-X knockoffs framework. The efficacy of this algorithm
is demonstrated through rigorous simulation studies and real-world applications. The outcomes of this
research deepen our understanding of the human microbiome, offering valuable insights for future efforts
in studying the pathogenesis and developing therapeutics of microbiome-associated diseases.
xiii
Chapter 1
Introduction
Microorganisms, also referred to as microbes, are microscopic organisms that exist in large numbers and
diverse forms across every habitat on our planet [58, 11]. Though microorganisms can not be seen by naked
eyes, they play significant roles in shaping ecosystems, driving primary production, and influencing the
overall balance of life on our planet [6, 117]. Microorganisms encompass diverse communities such as
bacteria, archaea, fungi, eukaryotes, and viruses. Each of these communities have unique characteristics
and adaptations that allow them to flourish in extreme habitats such as deep ocean, waste water and
human gut [109, 37]. Specifically, microorganisms residing human tissues collectively make up human
microbiome, which comprises 10-100 trillion symbiotic microbial cells within an individual [139].
The exploration of the human microbiome has undergone a complex evolution driven by advancements
in biotechnology. The origins of microbiome study can be traced back to Antonie van Leeuwenhoek, whose
invention of the microscope unveiled the microbial world [116]. In the late 19th century, with the advent
of microbial colonization, scientists started to realize the potential associations between the microbiome
and human health through culturing diverse microbes in laboratories. However, with more than 98% of
bacteria could not be cultured in laboratory environments [145], there were still heavy darkness covering
human microbiome to be enlighted.
1
In the late 20th century, the advancement of molecular biology techniques, including DNA sequencing
and polymerase chain reaction (PCR), revolutionized the research of human microbiome. These techniques
provided tools to enable researchers to amplify and sequence specific genes, making it possible to study
the functions and essence of genomes, thus providing a more comprehensive view of the microbiome.
Afterwards, metagenomics began to thrive [78], allowing researchers to derive massive sequencing of
DNA from some environmental samples such as ocean water and stool samples [109, 37]. This technique
allowed researchers to study microbiome as a whole at certain ecosystems without any cultivation [143],
providing advanced insights of microbial interactions and their symbiotic networks within human body.
Next-generation sequencing (NGS), also known as high-throughput sequencing or second generation
sequencing [2], was then invented. This technique greatly facilitated the development of metagenomics
[82, 77, 91]. Unlike traditional Sanger sequencing, NGS enables the low-cost and efficient sequencing
of large amounts of DNA or RNA [20]. Representative platforms of NGS include Illumina [67], Pacific
Biosciences (PacBio) and Oxford Nanopore [112, 87]. These tools have enabled scientists discover novel
biomarkers of diseases and contributed to our comprehension of disease pathogenesis [2], further uncovering the vast complexity of different microbial communities.
Motivated by the technology revolution, Human Microbiome Project (HMP) [138, 133] was initially
launched in 2007 by the National Institutes of Health (NIH). The goal of the Human Microbiome Project
was to delineate the microbial communities residing the human body and understand how they are related
to the pathogenesis of certain microbial diseases. Massive microbial DNA and RNA have been collected
during the implementation of this project. These genetic materials were used for the functional analysis
and the construction of reference genome database [108], serving as a road map for future and deeper
human microbial research.
2
Whole Genome Shotgun (WGS) sequencing is a high-throughput sequencing approach used to decode
genomes of an organism by randomly breaking up the genomes into small DNA fragments and individually sequence them [80]. WGS was widely applied in the analysis of human microbiome. Over the past
decades, many microbiome studies have shown that human microbiome plays an important role in human
health by impacting key body systems such as metabolism system and immune system [55, 162]. Extensive
studies have demonstrated the associations between human microbiome and diverse diseases such as inflammatory bowel disease (IBD) [56, 95, 104, 105], Colorectal cancer (CRC) [129, 1, 156], metabolic diseases
[111, 151, 70] and autoimmune diseases [137, 41, 30]. Many downstream tools and analyses are available
for WGS data encompassing genome assembly [9, 81], taxonomic classification [92, 127], taxonomic profiling [18, 28] and functional pathway analysis [72, 73]. Most of these analyses aim to derive a better
human microbiome profiling and locate the groups of microbes associated with certain diseases. While
the majority of studies focus on profiling prokaryotes due to its high prevalence in human microbiome,
the importance of bacteriophage, with much less genetic materials but also strong impact on body systems,
is often neglected. Thus, achieving a more accurate phage profiling and conducting a comprehensive analysis of phage signatures are crucial for revealing their roles in human health. Additionally, the detection
of disease-associated microbes is vital in uncovering disease pathogens. Despite various differential abundance analysis methods, most of them are designed for panel data and ignored the temporal continuity of
diseases and the variation of microbes. Consequently, in this dissertation, we target to unveil the phage
signature in the human microbiome through a comprehensive analysis. We also strive to develop robust
methods for detecting phenotype-associated taxa in longitudinal data, contributing an indispensable piece
to the intricate puzzle of metagenomics.
3
1.1 Identifying the phage signature in colorectal cancer through metagenomic
analyses of heterogeneous microbiome studies
Gut disease, encompassing gastrointestinal disorders and colorectal cancer, is of increasing prevalence
globally. For example, Ulcerative Colitis (UC), a form of inflammatory bowel disease (IBD), is more common in developed countries with a prevalence reaching 0.3% and the prevalence of IBD increases as countries become more industrialized [123]. Colorectal cancer (CRC) has become the third most common cancer
worldwide causing at least 500,000 deaths annually [21]. Most gut diseases stem from a complicated interaction of genetic and environmental factors. Specifically, many of them are due to environmental and
lifestyle factors [50]. Potential risk factors include older age, lack of physical activity, diet rich in red meat,
smoking and alcohol use [147, 51]. Understanding the microbial signatures associated with these diseases
is vital to developing effective diagnostic and therapeutic strategies.
All the above risk factors may affect the gastrointestinal tract by inducing alterations in the microbial
structure within the human gut. The human gut microbiome, referring to the microbial communities inhabiting our gastrointestinal tract, can greatly influence human health through the immune and metabolic
systems [64]. Numerous studies have underscored the intricate associations between gut microbiome
and various human diseases, including inflammatory bowel disease (IBD) [100], metabolic diseases [111]
and autoimmune diseases [137]. The human gut microbiome, which could be influenced by some of the
mentioned risk factors, has also been recognized as one of the most important environmental factors contributing to the development of gut diseases. Indeed, several existing studies have already illuminated the
structural changes in the gut microbiome among individuals with CRC [134, 150].
4
Recent advancements in biological techniques have propelled the study of the human gut microbiome,
allowing for more in-depth analyses and insights. In several studies, the bacterial composition of gut microbial populations was identified by sequencing of the 16S rRNA gene followed by comparison to reference databases. This approach provides a snapshot of the taxonomic diversity within a certain ecosystem.
Moreover, metagenomic analysis by sequencing all microbial DNA (shotgun sequencing) in a complex
community has the additional advantage of measuring the genetic potential of the microbial population.
This comprehensive method enables the exploration of functional capabilities and metabolic attributes
encoded in microbial genomes. Both 16S rRNA gene sequencing and metagenomic analysis have been
widely used in the analysis of bacterial signatures associated with various gut diseases. Their combined
use provides a deeper understanding of the taxonomic and functional aspects of the gut microbiome in
health and disease.
While the majority of recent microbiome studies concentrate on prokaryotes as they constitute most
of the genetic materials in the gut microbial community, the significant role of gut virome is increasingly
recognized as significant. Gut viruses that infect bacteria referred to as bacteriophages outnumber the
bacteria by tenfold in the gut environment. These bacteriophages could greatly impact the microbial community by infecting their hosts. Despite their numerical abundance, the significance of gut virome was
substantially understudied due to the relatively short length of viral genetic materials in gut microbiomes.
Additionally, the shortage of well-curated viral reference genomes has also inhibited efforts to accurately
estimate the viral abundance profiling. To conquer these challenges, experimental techniques to enrich
virus-like particles (VLP) in microbiome studies have been developed [89]. Through the enrichment of
VLPs, it has been shown that several viral taxa were associated with some diseases such as ulcerative
colitis (UC) [164]. However, studies of VLPs in relation to other gut diseases remains limited, and more
research in this area is warranted to unveil the intricate interactions within the gut virome and its potential
implications for human health.
5
To overcome the challenges associated with the shortage of VLPs studies regarding CRC, an effective
solution is to utilize a more sophisticated viral reference database. Recently, with the increasing collection
of metagenome assembled genomes (MAG) and the advancement of computational virus identification
algorithms such as VirFinder [115] and/or VirSorter [119], curating a larger virus databases became possible. One of the well-known databases is Gut Phage Database (GPD) curated by Camarillo-Guerrero et
al. [25]. These large virome databases provided new resources for researchers, allowing them to identify,
characterize, and understand the impact of specific viral taxa on the gut microbiome, facilitating deeper
understandings of relationships between viruses and complex gut diseases.
In Chapter 2, we conducted comprehensive analyses concerning the relationship between CRC and
the gut virome. we conducted extensive metagenomic analyses using shotgun sequencing data from seven
independent CRC studies, including a total of 462 CRC subjects and 449 healthy controls. Our investigation aimed to assess alterations in the gut virome of CRC subjects and examine the potential heterogeneity
among distinct studies by obtaining the virus abundance profiles and mapping the reads against the Gut
Phage Database [25]. Differential abundance analysis was also performed to identify CRC-associated viral
species, bacterial species and metabolic pathways. Subsequently, phage-bacterium associations were identified to illustrate the symbiotic network in CRC. Finally, the performance of classification models based
on the viral abundance profiling surpassed the prediction performance of all previous viral studies. In
essence, our analyses contribute comprehensive insight into the association between gut diseases and the
distinct signature of gut microbiome.
6
1.2 Feature selection for longitudinal data using deep learning and modelX knockoffs framework
Feature selection has become one of the most popular topics within metagenomics as increasing number
of microbial taxa are identified by sophisticated reference database. For example, to predict the phenotype
of a certain disease, abundance of thousands of bacterial or viral species can be utilized in the machine
learning model. Despite the vast number of available features, only a small fraction of them may be informative or contributory to the modeled response with others independent of the response. The presence
of redundant or noisy information among features could pose challenges in terms of statistical accuracy
and algorithmic robustness [84, 90]. In the context of predicting disease phenotypes, the identification of
a subset of features that significantly contribute to the response becomes crucial. Successfully identifying
these true signals from the multitude of features not only enhances the robustness of machine learning
algorithms, but also plays an important role in advancing research on disease pathogens. This process is
essential for refining models, improving predictive accuracy, and gaining deeper insights into the complicated relationship between human microbiome and disease states.
In traditional machine learning, various methods have been devised to quantify the feature importance during the training process of the model. For instance, the random forest algorithm [23] employs
an accuracy variation-based importance score to rank the importance of input features. LASSO [135] detects true signals by regularizing other features to be zero weight. However, a common limitation of these
methods is their inability to effectively control false discoveries in the selection results, which potentially
mislead the downstream analysis of selected features [128]. To address the issue of false discovery rate
(FDR) control, several p-value based methods have been proposed. One of the most famous methods is
Benjamini–Hochberg (BH) procedure for FDR control [15]. Subsequent development includes BenjaminiYekutieli (BY) Procedure [17], Storey’s q-value [131] and adaptive Benjamini–Hochberg procedure [16].
7
Depite the effectiveness of these methods in FDR control, their performance heavily relies on assumptions
of independence and the availability of valid p-values. Unfortunately, these assumptions are challenging to satisfy in high-dimention and non-linear data settings. Therefore, the application of these methods may encounter limitations in scenarios characterized by complex data structures and dependencies
among features. In light of these challenges, innovative approaches that can accommodate the intricacies
of high-dimensional and non-linear datasets are demanded to ensure reliable and robust feature selection
in metagenomic analyses.
To address these difficulties, in 2015, Barber and Candès proposed knockoff filter [10] to achieve FDR
control, operating under the assumption of Gaussian linear models and eliminating the need for p-values.
Later in 2018, Candès et al. introduced model-X knockoffs framework [26] to extend FDR control to the scenario with arbitrary dimensionality and dependency structures within the data. The fundamental concept
behind the model-X knockoff is the creation of knockoff variables that mimic the dependency structure of
explanatory variables on the response and is independent of the response given the explanatory variables.
Consequently, these knockoff variables can be treated as references to measure the importance of each
explanatory variable. The knockoff threshold is then derived from the explanatory variables and their
knockoff copies for feature selection with FDR control. Recently, several methods have been developed
based on model-X knockoffs framework. Fan et al. [46] proposed IPAD, an approach based on model-X
knockoffs, to achieve FDR control for feature selection in high-dimensional linear models. Lu et al. [88]
introduced DeepPINK, integrating a fully connected neural network with model-X knockoffs to enhance
the interpretability of deep neural network. Zhu et al. [163] further developed DeepLINK with the combination of an autoencoder and model-X knockoffs, demonstrating applications in scenarios with both linear
and non-linear dependency structures.
However, these existing knockoffs framework-based method predominantly concentrate on independent observations, neglecting the widespread serial dependency observed in microbial communities across
8
different ecosystems, such as human gut [19, 141] and marine environments [154]. Integrating temporal
information in the feature selection process can potentially enhance the accuracy and interpretability of
outcomes by utilizing known serial dependencies, rather than relying solely on the correlations between
features and responses. For example, in metagenomics, incorporating temporal information acknowledges
the inherent relationships and interactions among microbial taxa over time. By considering the temporal
dynamics, we can better capture the evolution of microbial communities and their associations with specific environmental factors or health status, which enables us to identify the association between microbial
taxa and a particular condition as well as the variation of such associations, providing deeper insights into
the implicit biological mechanisms.
Consequently, in Chapter 3, we introduce an innovative method tailored for time series data by using
the long short-term memory (LSTM) and model-X knockoffs, named DeepLINK-T. This method is designed
for efficient feature selection while ensuring FDR control in the context of longitudinal data. DeepLINKT leverages the LSTM autoencoder to capture the joint and potentially non-linear dependency structure
of features on the response. Another LSTM prediction network is trained on a combination of original
variables and knockoff variables. This dual-network architecture enables the quantification of the feature importance, which is eventually used for FDR control and feature selection. In Chapter 3, we have
demonstrated the robustness of DeepLINK-T under various scenarios through comprehensive simulation
studies. Furthermore, real world applications regarding human gut metagenomics and marine metagenomics underscored the practical utility of DeepLINK-T in extracting meaningful insights from complex,
time-varying metagenomic data. The findings obtained from these applications are also demonstrated by
other studies, showcasing the potential of our algorithm to enhance our understanding of the temporal
variation of microbial signatures in different ecosystems.
9
Chapter 2
Metagenomic analyses of multiple gut datasets revealed the association
of phage signatures in colorectal cancer
2.1 Introduction
Colorectal cancer (CRC) is the third most common cancer worldwide causing at least 500,000 deaths annually [21]. Most colorectal cancers are caused by complex genetic and environmental factors. While only
a small proportion of cases are explained by genetic mutations [136], over 70% of them are due to environmental and lifestyle factors [50]. Potential CRC risk factors include older age, lack of physical activity,
diet rich in red meat, smoking and alcohol use [147, 51].
The human gut microbiome, the microbial communities inhabiting our gastrointestinal tract, can greatly
influence human health through the immune and metabolic systems [64]. Many studies have demonstrated
the complicated associations between gut microbiome and human diseases such as inflammatory bowel
disease (IBD) [100, 165], metabolic diseases [111] and autoimmune diseases [137]. The human gut microbiome, which could be altered by some of the mentioned risk factors, has also been considered as one of
the most important environmental factors in the development of CRC. Indeed, several studies have already
shown the structural alterations of gut microbiome among CRC patients [134, 150].
10
Most of the current microbiome studies concentrate on prokaryotes as they constitute most of the
genetic materials in the gut microbial community. On the other hand, the number of viruses that infect
bacteria referred to as bacteriophages or simply phages for short outnumber the bacteria by tenfold. These
bacteriophages impact the microbial community directly through their own genes or indirectly by infecting
their hosts. However, the importance of the gut virome was vastly understudied due to the relatively
low fraction of viral genetic materials in microbiomes, despite their larger numbers compared to that of
prokaryotes. The lack of well-curated viral reference genomes has also hampered efforts to accurately
study the virome. To overcome these issues, experimental techniques to enrich virus-like particles (VLP)
in microbiome studies have been developed [89]. Through the enrichment of VLPs, it has been shown that
several viral taxa were associated with some diseases such as ulcerative colitis (UC) [164], a subtype of
IBD. However, only a very limited number of VLP studies related to CRC have been conducted thus far.
Until recently, most studies used NCBI virus database as references for virome studies. However, the
fraction of viruses in NCBI only represented a tiny fraction of all the viruses. To overcome this issue,
metagenome assembled genomes (MAG) were constructed from a large number of metagenomes and computational virus identification algorithms such as VirFinder [115] and/or VirSorter [119]. This allowed the
identification of more viruses resulting in larger virus databases. Gregory et al. [57] constructed the Gut
Virome Database (GVD) from 2,697 public metagenomic samples, Paez-Espino et al. [103] built the IMG/VR
database based on stool samples from the Human Microbiome Project (HMP) and Camarillo-Guerrero et
al. [25] formed the Gut Phage Database (GPD) from 28,060 metagenomes. These large virome databases
provided new resources for investigating the relationship between viruses and complex diseases. However, to the best of our knowledge, no studies have been carried out to investigate the relationship between
human gut virome and CRC using such newly developed databases.
In this study, we analyzed metagenomic shotgun sequencing data from 7 CRC studies, including 462
CRC subjects and 449 healthy controls. We evaluated changes in the gut virome of CRC subjects and the
11
heterogeneity among different studies by obtaining the virus abundance profiles and mapping the reads
against the Gut Phage Database [25], the largest phage database available to date. Secondly, we performed
differential abundance analysis to identify CRC-associated viral species, bacterial species and metabolic
pathways. Next, phage-bacterium associations were identified to illustrate the symbiotic network in CRC.
In addition, we estimated the diagnostic ability of the viral profiling by the performance of classification
models trained on either within-study and cross-study settings. Finally, we measured the generalizability
and robustness of the classifiers by pooling these datasets together. Our results provide comprehensive
insight into the links between gut virome and colorectal cancer.
2.2 Materials and Methods
2.2.1 Cohort description
We collected 911 healthy control and CRC subjects from 7 publicly available datasets from 7 countries
and 3 continents (Table 2.1. All 7 studies used fecal shotgun sequencing to compare the gut microbiome of
CRC patients to that of healthy controls. Fecal samples from all participants of these studies were collected
before treatment, thus excluding the cancer treatment as a potential confounding factor. Characteristics
of these datasets are shown in Table 2.1. Sequencing depth distributions within these datasets are shown
in Supplementary Figure 5.1.
12
Table 2.1: Characteristics of metagenomic datasets used in this study.
Study No. of controls No. of CRC Country Reference
Zeller 93 91 France/Germany [156]
Yu 54 74 China [155]
Feng 63 46 Austria [48]
Vogtmann 52 52 USA [144]
Thomas 52 61 Italy [134]
Yachida 40 40 Japan [152]
Yang 95 98 China [153]
Total 449 462
2.2.2 Quantification of viral abundance
Centrifuge v1.0.4 [74] with default parameters was used to map reads from each sample against the Gut
Phage Database (GPD) [25], since Centrifuge yielded more accurate estimation of relative abundance at
species and genus rank [93]. GPD is a recently published database containing 142,809 non-redundant
gut phage genomes. On average, 31.84% reads of each sample were mapped to GPD. Although a small
proportion (< 10%) of viral reads could still be mapped to NCBI bacterial reference genomes, it does not
substantially affect the statistical results. Such reads are potentially caused by the prophage in the bactieral
reference genomes or the non-viral genome in GPD (the allowed false positive rate of GPD was 0.25%).
The distribution of viral mapping rates whithin each dataset is shown in Supplementary Figure 5.2. The
number of unique mappings given by Centrifuge was further normalized with trimmed mean of M values
(TMM) [118] using the edgeR package [118] to obtain the TMM normalized abundance profiling. Although
other viruses such as eukaryotic viruses and endogenous retroviruses are also important components of
13
gut virome, their mapping rate was low (< 0.02%). Therefore, we focused our analyses on gut phages in
this study.
2.2.3 Taxonomic annotation
A protein-level comparison was used for the species-level annotation. First, open reading frames (ORFs) in
viral genomes from GPD were predicted using Prodigal meta (v2.6.3) [65]. The predicted ORFs in the viral
genomes were searched against the RefSeq protein database (downloaded in December 2021, containing
577,484 proteins) using DIAMOND blastp (v2.0.13) [24] with e-value less than 10−5
. Each ORF was assigned
to the protein with the highest bit score. Each viral genome was assigned taxonomy based on the majority
of taxa within that genome using a voting system for virus taxonomic assignment at different taxon levels
[164, 60, 94]. Viral genomes with less than two ORFs were considered unclassified viral species [137].
In summary, 134,871 viral genomes in GPD were annotated with species or higher-level taxon, respectively. Camarillo-Guerrero et al. [25] also used HMMER [44] to query each protein sequence within the
viral genome against the ViPhOG database for taxonomic annotation. Only 16,636 viral genomes in GPD
were annotated with family-level taxon. Out of these 16,636 viral genomes, we assigned family-level taxon
to 15,603 genomes with 70.7%(11,033) having the same predicted taxon as obtained by Camarillo-Guerrero
et al. [25].
2.2.4 Viral functional profiling
In order to study the viral gene expression in CRC patients, we used viral reads (reads that were mapped
to GPD) to obtain the gene family and pathway profiling. HUMAnN3 [14] along with its ChocoPhlAn
pangenome database and UniRef90 EC filtered database was used to predict the Pfam protein domains and
Gene Ontology terms, with the reported abundance value shown as RPKs (reads per kilobases). In total,
462 pathways and 497,967 (±132,909) were identified.
14
2.2.5 Viral diversity, multivariate analysis and meta-analysis
Shannon index [130], Heip evenness [62] and Chao1 richness [29] were used to measure viral diversity
for each dataset at species, genus and family levels, respectively. To study the association between disease
status and alpha diversity as well as the impact of age, gender and body mass index (BMI in kg/m2
) on
this association [134], we performed a naive linear model (Y ∼ Disease) and an age-, gender- and BMIadjusted linear model (Y ∼ Disease+Age+Gender+BMI), where Y is the log-transformed alpha diversity.
Coefficients and p-values of the disease variable were obtained to perform the comparisons. Meta-analysis
was implemented using the R package metafor [142]. Standardized mean difference of alpha diversity was
calculated for each taxonomic level to obtain the random effect model. Heterogeneity among studies was
quantified by I
2
(percentage of total heterogeneity on total variability) and the p-value was obtained by
Cochran’s Q test [36].
2.2.6 Principal coordinate analysis
The dissimilarity between CRC cases and healthy controls in all 7 datasets were measured by principal
coordinate analysis (PCoA) [43] based on Bray-Curtis distance [22]. PCoA was performed on either combined datasets or each separate dataset. Permutational multivariate analysis of variance (PERMANOVA)
[4] was used to quantify the heterogeneity among different datasets and the separation between healthy
controls and CRC subjects.
2.2.7 Differential abundance analysis
To identify viral species, genus, families, as well as the viral functions that are differentially abundant in
the CRC group, we used DESeq2 [86] to perform differential analysis based on their abundance profiling.
Taxa with low variance (less than half of the median) or not found in ≥ 90% samples were removed [85].
15
The exact test in DESeq2 was used to calculate the p-value for each variable. Multiple hypothesis tests
were adjusted using the Benjamini-Hochberg false discovery rate (BH-FDR) procedure [15]. Associations
with FDR less than 0.05 were considered significant.
2.2.8 Quantification of bacterial abundance and virus-bacterium association
Centrifuge v1.0.4 [74] with default parameters was used to map non-viral reads from each sample against
the UHGG database [3] to obtain the bacterial abundance profile. The TMM normalized abundance profiles
were obtained in the same way as the viral abundance profile.
We chose the 27 bacterial species that are differentially abundant between CRC cases and healthy
controls in all 7 datasets for further analysis. Spearman’s correlation coefficients based on the bacterial
species abundance and viral family abundance were calculated. Fisher’s z-transformation [49] and metaanalysis were used to derive a random effect model on Spearman’s correlation coefficients. The p-values
were adjusted by the BH procedure. Only associations with adjusted p-value < 0.05 were considered
significant.
2.2.9 Random forest classifier for within-study and cross-study prediction
We used six types of microbiome quantitative profiles: GPD genome-level, taxonomic family-level, genuslevel, species-level TMM normalized abundance estimated by Centrifuge [74] and gene-family and pathway abundance (in RPKs) estimated by HUMAnN3 [14], to predict CRC status using random forests.
Since the random forests algorithm [23] has been proven to have better performance than other machine learning models, especially on microbial abundance data [106, 53], all experiments were carried out
using the random forests classifier implemented by the python package scikit-learn v1.0.2 [107]. We set
the number of estimators as 1000 [134, 53] and all other parameters as their default values [110] in all
16
prediction experiments. The area under the ROC curve (AUROC) was used as a criterion measuring the
performance of every prediction model.
To accurately measure the performance, generalizability and robustness of predictions models, we
performed within-dataset, cross-dataset and leave-one-dataset-out (LODO) predictions [134]. The withindataset analysis was performed by repeating 10-fold cross validation 20 times [134]. The average AUROC
based on 200 runs was calculated as the final measurement.
The cross-dataset analysis was performed by pairwise datasets prediction. For each pair of all 7
datasets, one was used as the training set, and the other one as the validation set. This step was also
repeated 20 times to reduce randomness within the algorithm.
The LODO analysis consisted of the validation set as one of the 7 datasets and the training set as the
pooled samples from the other 6 datasets. This approach substantially increased the size of the training
set, which could potentially improve the performance and generalization of the random forests classifier.
It also helped reduce profiling differences between different data batches.
2.3 Results
2.3.1 Case-control comparison showed higher viral diversity in CRC samples compared
to healthy controls
By taxonomic annotation, GPD genomes were assigned to 2,056 viral species, 673 viral genera, 24 viral
families and 7 viral orders, respectively. To study the gut viral structural alteration within each dataset,
we performed case-control comparison for species-level, genus-level and family-level alpha diversity (measured by Shannon index [130], Heip evenness [62] and Chao1 richness [29]) for each dataset. Figure 2.1(A)
and Supplementary Figures 5.3-5.4 showed that the virome alpha diversity and evenness are higher in CRC
subjects than healthy controls at species, genus and family levels, which are consistent with the findings
17
in Nakatsu et al. [98]. The increments are significant in the Feng, Yachida and Yang datasets (adjusted pvalues of two-tailed Wilcoxon rank-sum test are less than 0.01). This positive association between species
viral diversity and CRC can be further observed in the multivariate analysis (Figure 2.1(B)), since all coefficients given by linear models are positive. Potential confounding factors such as age, gender and BMI do
not meaningfully impact the contribution of disease status to the alpha diversity. It should be noted that
although the viral alpha diversity in CRC cases is not statistically different from that in healthy controls,
they are in the same direction as the other three datasets. Results of meta-analysis (Figure 2.1) also showed
this significant positive association (µ = 0.26, p-value< 0.0001) with no heterogeneity observed in alpha
diversity (I
2 = 0.0%, Q test p-value = 0.49). Similarly, the evenness also exhibited a positive association
with CRC (µ = 0.28, p-value< 0.0001) without significant heterogeneity (I
2 = 0.0%, Q test p-value
= 0.89, Supplementary Figure 5.4). Whereas, the Chao1 richness (Supplementary Figure 5.5) indicated
either negative associations or positive associations with CRC in different studies. However, the random
effect model obtained by the meta-analysis suggested a positive effect size (µ = 0.12, p-value= 0.318) of
the richness, although this association was not significant. Altogether, these findings highlighted associations between dysbiosis of gut virome and CRC.
18
RE Model
−0.5 0 0.5 1
Standardized Mean Difference
Yang
Yachida
Thomas
Vogtmann
Feng
Yu
Zeller
0.27 [−0.01, 0.56]
0.39 [−0.05, 0.83]
0.17 [−0.20, 0.54]
0.05 [−0.33, 0.43]
0.58 [ 0.19, 0.97]
0.37 [ 0.02, 0.73]
0.13 [−0.15, 0.42]
0.26 [ 0.13, 0.40]
(B) (C)
Figure 2.1: Analysis of viral species Shannon diversity within each dataset. (A) Boxplots of viral specieslevel Shannon index for gut samples of CRC subjects and healthy controls stratified by disease status in each
dataset. BH adjusted p-values were calculated using the two-tailed Wilcoxon rank-sum test. ns:p> 0.05,
*p< 0.05, **p< 0.01, ***p< 0.001. (B) Multivariate analysis of the adjusted impact of age, gender and BMI
on Shannon diversity. (C) Forest plot showing effect sizes from a meta-analysis on species-level diversity.
RE Model: Random effect model.
2.3.2 Principal coordinates analysis among different studies
By using the principal coordinates analysis based on Bray-Curtis distance, we assessed both the heterogeneity among different datasets from various studies and the dissimilarity in the gut viral communities
between healthy controls and CRC subjects. Figure 2.2(A) revealed that the heterogeneity among 7 datasets
19
had significant effect (PERMANOVA R2=0.164, p-value=0.001) on the gut viral composition, which is consistent with the results from Thomas et al. [134] and Wirbel et al. [150].
Samples from Feng dataset (light green), Zeller dataset (red) and Yang dataset (fuchsia) tend to cluster to the left, middle and right among all subjects, respectively. Boxplots (Figure 2.2(B)) also show that
their first principal coordinates (explained 28.3% of the variance) are significantly greater in the CRC group
compared to the control group, suggesting the existence of within-study clusters. The significance of these
separations is validated by the PEMANOVA between the viral dissimilarity and disease status (R2=0.011,
p-value=0.001). Meta-analysis also indicates substantial heterogeneity on the second principal coordinates
(Figure 2.2(E), µ = 0.13, p-value = 0.29, I
2 = 71.88%, Q test p-value = 0.0027). In addition, Supplementary Figure 5.6 further demonstrates significant separations between healthy controls and CRC subjects
within each dataset. All PERMANOVA p-values are less than 0.05 except the p-value of the Vogtmann
dataset.
20
RE Model
−0.5 0 0.5 1 1.5
Standardized Mean Difference
Yang
Yachida
Thomas
Vogtmann
Feng
Yu
Zeller
0.44 [ 0.15, 0.72]
0.12 [−0.32, 0.56]
0.04 [−0.33, 0.41]
0.29 [−0.10, 0.68]
0.71 [ 0.32, 1.10]
0.45 [ 0.09, 0.80]
0.37 [ 0.08, 0.66]
0.36 [ 0.22, 0.50] RE Model
−1 −0.5 0 0.5 1 1.5
Standardized Mean Difference
Yang
Yachida
Thomas
Vogtmann
Feng
Yu
Zeller
0.06 [−0.22, 0.35]
0.60 [ 0.16, 1.05]
0.21 [−0.16, 0.59]
−0.12 [−0.50, 0.27]
−0.26 [−0.64, 0.12]
0.62 [ 0.26, 0.98]
−0.11 [−0.40, 0.18]
0.13 [−0.12, 0.39]
(D) (E)
Figure 2.2: Principal coordinates analysis of all samples based on Bray–Curtis distance. (A) PCoA plot of
gut samples of CRC subjects and healthy controls in each dataset. R2 values and p-values were calculated
by PERMANOVA. (B) Boxplots of the first principal coordinates (PCo1) in each dataset. (C) Boxplots of
the second principal coordinates (PCo2) in each dataset. BH adjusted p-values were calculated using the
two-tailed Wilcoxon rank-sum test. ns:p> 0.05, *p< 0.05, **p< 0.01, ***p< 0.001, ****p< 0.0001. (D)
Forest plot showing effect sizes from a meta-analysis on PCo1. (E) Forest plot showing effect sizes from a
meta-analysis on PCo2. RE Model: Random effect model.
2.3.3 Differential abundance analysis revealed important viral taxon and metabolic
pathways associated with CRC
We next used DESeq2 to perform differential abundance analysis on species-level, genus-level, family-level
and pathway abundance. In general, most taxa identified in this analysis were from the Caudovirales order.
Most of them were enriched in CRC cohorts compared to healthy controls. We found 11 CRC-enriched viral
species (p-value < 10−5
, Figure 2.3(A), (C) and Supplementary Figure 5.7) from 3 phage families that were
significantly enriched in CRC cohorts in all 7 datasets, including Erwinia phage phiEt88, Klebsiella virus
21
ST16OXA48phi5-4, Vibrio phage martha 12B12, Mannheimia phage vB_MhM_3927AP2, Salmonella phage
118970_sal3 from Myoviridae, Salmonella virus Epsilon15 from Podoviridae and Pseudomonas virus B3, Escherichia phage HK639, Enterobacteria phage phi80, Enterobacteria phage ES18, Cronobacter phage phiES15
from Siphoviridae. Among these species, Klebsiella virus [27] Enterobacteria phage phi80 and Salmonella
phage [52] were found to increase in the CRC group. Erwinia phage and Vibrio phage were also reported
to contribute to CRC progression [98, 99]. All viral genera that were significant in at least 6 datasets
were found to be increased in the CRC group (Supplementary Figure 5.8). In regard to family-level taxon,
besides Myoviridae, Podoviridae and Siphoviridae that were obtained from the species-level analysis, we
additionally identified Drexlerviridae, Inoviridae and Herelleviridae that were significantly associated with
CRC (Supplementary Figure 5.9). While all other 5 families were found to be more abundant in the CRC
group, Herelleviridae, a recently established phage family in the Caudovirales order [12, 13], was observed
to be significantly depleted in the CRC group for most datasets (Supplementary Figure 5.9(C)). Phages in
the Herelleviridae family typically infect members of the Firmicutes phylum [13] and serve as a potential
treatment to the infection of intestinal epithelium-like environment [101].
Viral functional signatures were described by the differential abundance analysis of KEGG pathways.
Even though most gut virome functions remain to be uncurated with only a small portion of viral reads that
can be functionally characterized, we detected 7 metabolic pathways (p-value < 10−5
) that were notably
associated with CRC. Similar to taxonomic taxa, most pathways were enriched in the CRC groups in most
datasets (Supplementary Figure 5.10), including stearate biosynthesis II, oleate biosynthesis IV, fatty acid
elongation – saturated, palmitoleate biosynthesis I and (5Z)-dodecenoate biosynthesis I. These pathways
were substantially associated with fatty acid biosynthesis such as stearic acid, oleic acid and palmitoleic
acid, all of which were demonstrated to modulate the metabolic profiles and increase the risk of CRC
[125, 31, 75]. On the contrary, we discovered two pathways that were significantly decreased in the CRC
groups. One is L-methionine biosynthesis iii, which is an intracellular regulator that functions to inhibit
22
the proliferation of colorectal cancer cells [96]. The other depleted pathway is pyruvate fermentation to
acetate and lactate ii. This pathway ferments fiber into acetate, which may play an important role in the
turnover of the colonic epithelium to maintain the normal homeostasis [45]. Therefore, the inactivity of
these two pathways may serve as future targeted therapies of CRC.
Taken together, the differential abundance analysis based on the taxonomic and functional profiling
indicated that the structural alteration of gut viruses, mainly bacteriophages, was substantial in CRC cohorts. Although most bacteriophages were enriched in CRC groups and consequently caused more active
expression of fatty acid biosynthesis, some were observed to decrease in the CRC groups potentially inactivating the protective inhibition process of immune regulation. These results can further expand our
understanding of the potential contribution of the gut virome in CRC.
23
Feng
Zeller
Yang
Thomas
Yu
Yachida
Vogtmann
500 0
572
537
478
349
304
266
244
(A)
0
5
10
15
20
Intersection size
22
11
9 9
7
6
4 4
3 3
2 2 2
Yang
Feng
Zeller
Thomas
Yu
Vogtmann
Yachida
25 0
43
33
26
11
0
0
0
(B)
0
5
10
15
20
Intersection size
23
18
12
8
5 5
3 2 2 2 1
Control
Mannheimia phage vB_MhM_3927AP2
Salmonella phage 118970_sal3
Escherichia phage HK639
Vibrio phage martha 12B12
Enterobacteria phage phi80
Pseudomonas virus B3
Cronobacter phage phiES15
Enterobacteria phage ES18
Klebsiella virus ST16OXA48phi5-4
Salmonella virus Epsilon15
Erwinia phage phiEt88
(C) CRC
L-methionine biosynthesis III
pyruvate fermentation to acetate and lactate II
fatty acid elongation -- saturated
oleate biosynthesis IV
(5Z)-dodecenoate biosynthesis I
palmitoleate biosynthesis I
stearate biosynthesis II
(D)
1
0
1
2
3
4
log(abundance)
Figure 2.3: Differential abundance analysis on taxonomic and functional viral profiles. (A) UpSet plot
showing the number of shared differentially abundant viral species determined by species-level TMM normalized abundance and DESeq2. Only viral species differentiated in at least 5 datasets were displayed.
(B) UpSet plot showing the number of shared differentially abundant viral pathways determined by HUMAnN3 pathway abundance and DESeq2. (C) Heatmap showing the log transformed TMM normalized
abundance of viral species differentiated in all 7 datasets. (D) Heatmap showing the log transformed HUMAnN3 pathway abundance of pathways differentiated in at least 3 datasets.
2.3.4 Interkingdom association between viral families and bacterial species
Since most viruses in human gut are bacteriophages that either lyse their hosts or alter their functions,
we then characterized the relationship between bacteriophages and their hosts by assessing the correlation between their abundance and alpha diversity [100, 164]. To identify differentially abundant bacterial
24
species, we performed differential analysis on each dataset and found 27 bacterial species significant in all
7 datasets. Although the bacterial richness (Supplementary Figure 5.11(B) and (D)) did not show consistent
differences among datasets (meta-analysis I
2 = 81.82%, Q test p-value < 0.0001), we did find that the
bacterial alpha diversity substantially decreased in CRC in most datasets (Supplementary Figure 5.11(A)
and (C), meta-analysis I
2 = 73.31%, Q test p-value = 0.0029), which may have been the result of the
expansion of the viral community. Figure 2.4(A) showed a positive bacterium-virus correlation in terms
of diversity and richness in both control and CRC groups. While the direction of the correlation between
diversity and richness within a kingdom remains the same in the control and CRC groups, the positive
interkingdom association was weaker in the CRC group, especially for the association between viral alpha
diversity and bacterial richness.
The decrements of the association between bacteria and viruses in CRC could be further quantified by
the correlation between viral families and bacterial species. Figure 2.4(B) showed similar patterns of virusbacterium associations in both control and CRC groups. Positive correlations included Bicaudaviridae
and Methanobrevibacter smithii, Tectiviridae and Clostridium_M clostridioforme, and negative correlations
with Bicaudaviridae and Clostridium_M clostridioforme, Myoviridae and Gemella morbillorum. In the CRC
group, however, the positive correlations between Podoviridae and several bacterial species as well as the
negative correlations between Myoviridae and most bacterial species were markedly decreased. Altogether,
these results revealed a complex alteration of virus-bacterium relationship in CRC. The reduction in these
correlations implies a shrinkage of symbiotic network and highlights the importance of virus-bacterium
equilibrium in the maintenance of intestinal stability.
25
Viral richness
Viral diversity
Bacterial richness
Viral diversity
Bacterial richness
Bacterial diversity
Viral richness
Viral diversity
Bacterial richness
Viral diversity
Bacterial richness
Bacterial diversity
−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
Bicaudaviridae
Plasmaviridae
Phycodnaviridae
Lavidaviridae
Autolykiviridae
Tectiviridae
Ackermannviridae
Autographiviridae
Chaseviridae
Demerecviridae
Drexlerviridae
Guelinviridae
Herelleviridae
Myoviridae
Podoviridae
Rountreeviridae
Salasmaviridae
Schitoviridae
Siphoviridae
Zobellviridae
Inoviridae
Plectroviridae
Microviridae
Rudiviridae
Gemella morbillorum
Dialister_A pneumosintes
Prevotella nigrescens
Parvimonas micra
Peptostreptococcus anaerobius
Succinivibrio sp000431835
Alloprevotella tannerae
Fusobacterium animalis
Fusobacterium vincentii
Clostridium_M clostridioforme
Kallipyga massiliensis
Selenomonas_C bovis
Porphyromonas somerae
Peptoniphilus_A harei_A
Peptostreptococcus sp000758885
Anaerococcus obesiensis
Prevotella intermedia
Solobacterium moorei
Fusobacterium polymorphum
Pseudescherichia sp002298805
Porphyromonas uenonis
Fusobacterium_C necrophorum
Enterococcus_B hirae
Methanobrevibacter_A smithii_A
Fusobacterium nucleatum
Fusobacterium sp000235465
Lactobacillus_H mucosae −1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
Bicaudaviridae
Plasmaviridae
Phycodnaviridae
Lavidaviridae
Autolykiviridae
Tectiviridae
Ackermannviridae
Autographiviridae
Chaseviridae
Demerecviridae
Drexlerviridae
Guelinviridae
Herelleviridae
Myoviridae
Podoviridae
Rountreeviridae
Salasmaviridae
Schitoviridae
Siphoviridae
Zobellviridae
Inoviridae
Plectroviridae
Microviridae
Rudiviridae
Gemella morbillorum
Dialister_A pneumosintes
Prevotella nigrescens
Parvimonas micra
Peptostreptococcus anaerobius
Succinivibrio sp000431835
Alloprevotella tannerae
Fusobacterium animalis
Fusobacterium vincentii
Clostridium_M clostridioforme
Kallipyga massiliensis
Selenomonas_C bovis
Porphyromonas somerae
Peptoniphilus_A harei_A
Peptostreptococcus sp000758885
Anaerococcus obesiensis
Prevotella intermedia
Solobacterium moorei
Fusobacterium polymorphum
Pseudescherichia sp002298805
Porphyromonas uenonis
Fusobacterium_C necrophorum
Enterococcus_B hirae
Methanobrevibacter_A smithii_A
Fusobacterium nucleatum
Fusobacterium sp000235465
Lactobacillus_H mucosae
(B)
(A)
Control CRC
Control CRC Viral richness Viral diversity Bacterial richness
Viral diversity
Bacterial richness
Bacterial diversity
Viral richness
Viral diversity
Bacterial richness
Viral diversity
Bacterial richness
Bacterial diversity
−1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 Viral richness Viral diversity Bacterial richness
Viral diversity
Bacterial richness
Bacterial diversity
Random effect size
Figure 2.4: Correlations between viral families and bacterial species. (A) Random effect size of Spearman’s
correlation coefficients between the diversity and richness of bacteria and viruses in healthy controls and
CRC subjects. Correlations with BH adjusted p-values < 0.05 are displayed. (B) Random effect size of
Spearman’s correlation coefficients between the abundance of all 24 viral families and that of 27 differentially abundant bacterial species. Correlations with BH adjusted p-values < 0.05 are displayed. The size
and color of circles indicate the extent of correlation.
2.3.5 Random forests classifiers accurately predict CRC status based on human gut
virome profiles
We next built random forest models using either gut viral taxonomic profiling or viral functional profiling to distinguish between healthy controls and CRC subjects. Despite the ethnic difference and the
26
heterogeneity of sequencing techniques, classifiers achieved high AUROC with gut viral profiles in both
within-study and cross-study predictions. Figure 2.5 showed that AUROC scores of within-study cross validation range between 0.65 and 0.92 on GPD genome-level abundance. Classification models had weaker
performance in the Vogtmann and Thomas datasets. Compared with the AUROC scores of within-study
prediction, the AUROC scores of cross-study slightly dropped. The highest decrease came from the models trained on the Feng and Yang datasets, which indicates weaker generalization of these two datasets.
The overfitting within the Feng dataset was also observed in a whole gut microbiome study by Wirbel
et al [150]. The performances of random forest classifiers using other taxonomic levels such as species,
genus and family abundance (Figure 2.5(B), (C) and Supplementary Figure 5.12) were lower than that based
on genome-level abundance, suggesting the loss of viral signature when tracing upward the taxonomic
tree. However, gene-family abundance with more functional units (more than 400,000) did not necessarily enhance the overall performance of the random forest model, reflecting the redundant nature of the
information provided by viral functional profiles.
Increasing the size of training set generally improves the prediction ability of machine learning models. Therefore, we further estimated the diagnostic ability of the random forest classifier by the leaveone-dataset-out validation (LODO) [134, 150]. The LODO results of GPD genome-level models again outperformed all other models trained on other type of abundance (Figure 2.5(D) and Supplementary Figure
5.12(D)). The LODO results of GPD genome-level models had a stable AUROC range from 0.75 to 0.85,
with the Vogtmann and the Thomas datasets regarded as outliers. The lower AUROC in the Vogtmann
dataset suggested that the long-time (> 25 years) freezing before sequencing altered the viral community
structure in fecal samples [144, 150]. In addition, the weak prediction result of the Thomas dataset was
potentially due to the relatively shallow sequencing depth compared to other datasets (Supplementary
Figure 5.1).
27
Bacterial signatures of the human gut have been shown to be predictive of CRC status [134, 150, 53].
To study whether the viral signatures can further enhance the prediction performance of this disease, we
combined both bacterial and viral abundance profiles together and re-run the random forest model. The
results are shown in Supplementary Figure 5.13, which does not show a better performance when this
combination is used. The bacterial abundance profile itself shows a high AUROC (> 0.8). The combination of bacterial and viral species abundance profiles did not increase the AUC scores compared to the
bacterial abundance profile alone. Moreover, adding the viral genome abundance can even reduce the prediction performance of the random forest model using bacterial abundance profile. This is due to the much
higher dimension of viral genome abundance, which mitigates the effect of bacterial signatures. One potential explanation for this observation is that the human gut virome does not independently contribute to
CRC development, but rather interacts with the prokaryotes to impact CRC, resulting in high correlation
between the gut viral abundance profile and bacterial abundance profile.
On the whole, the LODO analysis revealed the random forest models trained on these heterogeneous
datasets have solid generalization and robustness to make accurate predictions on other metagenomic CRC
studies. The prediction ability (AUROC > 0.80) achieved based on gut viral signatures was competitive
with that of whole gut microbial signatures (AUROC > 0.83) [134, 150, 53]. Although it can not further
enhance the performance of the bacterial signatures, the prediction reuslts still show the important role of
viruses in the homeostasis of gut microbiota.
28
Zeller
Yu
Feng
Vogtmann
Thomas
Yachida
Yang
Genome-level test set
Zeller
Yu
Feng
Vogtmann
Thomas
Yachida
Yang
Genome-level training set
0.829 0.753 0.734 0.678 0.596 0.679 0.846
0.745 0.780 0.675 0.632 0.573 0.702 0.834
0.613 0.566 0.927 0.574 0.623 0.700 0.601
0.727 0.737 0.676 0.659 0.628 0.573 0.693
0.615 0.620 0.675 0.607 0.701 0.610 0.714
0.692 0.700 0.727 0.543 0.640 0.688 0.669
0.771 0.752 0.647 0.568 0.601 0.578 0.921
(A)
Zeller
Yu
Feng
Vogtmann
Thomas
Yachida
Yang
Species-level test set
Species-level training set
0.785 0.657 0.692 0.600 0.608 0.620 0.771
0.681 0.720 0.663 0.618 0.599 0.689 0.761
0.666 0.558 0.803 0.573 0.650 0.665 0.532
0.610 0.616 0.603 0.605 0.690 0.557 0.647
0.592 0.533 0.671 0.624 0.720 0.585 0.643
0.645 0.642 0.689 0.571 0.607 0.697 0.650
0.746 0.698 0.633 0.547 0.591 0.618 0.874
(B)
Zeller
Yu
Feng
Vogtmann
Thomas
Yachida
Yang
Gene-family test set
Gene-family training set
0.802 0.678 0.727 0.587 0.563 0.589 0.776
0.615 0.757 0.543 0.539 0.568 0.658 0.713
0.640 0.534 0.775 0.547 0.564 0.618 0.512
0.565 0.554 0.580 0.619 0.617 0.540 0.585
0.525 0.636 0.510 0.609 0.609 0.634 0.640
0.637 0.731 0.587 0.551 0.629 0.694 0.684
0.697 0.670 0.662 0.563 0.549 0.604 0.845
(C)
Zeller Yu Feng Vogtmann Thomas Yachida Yang
0.0
0.2
0.4
0.6
0.8
AUROC
(D)
Genome level Species level Gene family
0.5
0.6
0.7
0.8
0.9
1.0
Figure 2.5: Prediction performances of random forest classifiers based on gut viral abundance. (A) Within
and cross study AUROC matrix obtained by using GPD genome-level abundance. The diagonal refers to
results of cross validation within each dataset. Off-diagonal values refer to prediction results trained on
the study of each row and tested on the study of each column. (B) Within and cross study AUROC matrix
obtained by using species-level abundance. See Supplementary Figure 5.12(A) and (B) for genus-level and
family-level AUROC. (C) Within and cross study AUROC matrix obtained by using gene-family abundance.
See Supplementary Figure 5.12(C) for pathway AUROC. (D) LODO results with the x axis indicating the
study left out as the validation set and other studies combined as the training set.
2.4 Discussion
Analysis of the composition of the gut microbiome provides new insight in the understanding of the etiology and pathophysiology of many gastrointestinal diseases. The development of colorectal cancer is
complex and involves genetic and environment factors such as the gut microbiome [50]. Despite the fact
that many studies have demonstrated specific microbial signatures in CRC, much remains to be explored
29
in the structure of the gut virome. To our knowledge, this study is the most comprehensive analysis of the
association between gut virome and CRC using the largest collection of datasets to date. Although there is
technical heterogeneity among different datasets, we found some consistent patterns and prediction abilities among these datasets, including the viral diversity, CRC-associated viral species, metabolic pathways
as well as robust and accurate diagnostic models.
The alpha diversity of gut viruses was found to be much higher in the CRC cohorts at the species, genus
and family levels. Combined with results of previous studies [38, 32], we demonstrate that the dysbiosis of
the intestinal microbiota is highly associated with CRC, perhaps the enrichment of viral species results in
more lytic infections in the host, thus significantly depleting the gut bacterial organisms and prompting
the development of CRC. In addition to the alpha diversity, the principal coordinate analyses with BrayCurtis distance and the PERMANOVA test further unraveled the separation between healthy controls and
CRC subjects.
The presence of 11 viral species and 10 viral genera was commonly associated with CRC in the majority of the 7 datasets employed in this study. At the family level, Myoviridae, Podoviridae, Siphoviridae,
Drexlerviridae from Caudovirales order and Inoviridae from Tubulavirales order were increased in the CRC
groups. Among these enriched families, Myoviridae, Podoviridae and Siphoviridae were frequently reported
to be associated with CRC [59, 122] and other human diseases such as IBD [35] and autoimmune diseases
[137], which substantially validate our results. We also discovered that the viral family Herelleviridae significantly was significantly depleted in CRC groups. As a relatively new viral family [13], Herelleviridae
contains phage species that may have therapeutic potential for gastrointestinal infections [101]. Phages
in the Herelleviridae family typically infect members of the Firmicutes phylum [13] and have been shown
to display Ig-like domains on their virions that contribute to the integrity and heath of intestinal barrier
function serving as a potential treatment targeting the intestinal epithelium [101]. Therefore, the depletion of this family may lead to intestinal epithelial dysregulation permissive to the development of tumors.
30
Moreover, several metabolic pathways were identified in subjects with CRC in this study. Five metabolic
pathways related to fatty acid biosynthesis were found to be more active in CRC. These pathways have
been shown to increase risk of CRC in prior studies [125, 31, 75]. Furthermore, two other pathways namely
L-methionine biosynthesis iii and pyruvate fermentation to acetate and lactate ii were inactive in CRC.
pathways Both have been evinced to be firmly linked to inhibiting the proliferation of tumors [96, 45].
Such pathways may serve as potential therapy targets in the future.
Our correlation analysis further reveals the interkingdom association in CRC. Although the alpha diversity demonstrates a positive correlation between viral families and bacterial species in both healthy
controls and CRC, the reciprocity between them considerably weakened in CRC, especially in the network
between Podoviridae, Myoviridae and most displayed bacterial species. These relationships are important
in the virus-bacterium interaction and their effect on the intestinal health.
Finally, the diagnostic models we built based on the viral abundance and random forest algorithm
outperformed all other prior studies of the gut virome [98, 52]. Despite the high performance of distinguishing CRC being achieved with the whole gut microbiome [150, 134], our virome-based classifiers had
competitive results in both within-study and cross-study validations. Remarkably, the LODO experiment
showed that the diagnostic models are quite robust, which suggests that the combination of heterogeneous datasets can substantially improve the sensitivity and accuracy for detecting CRC cases in other
independent datasets.
In conclusion, we performed a comprehensive gut virome case-control study, revealing the significant
contribution of the gut virome in CRC. The detected gut virobiota, which links the virome and bacteriome
as combined diagnostic models, unveil a new perspective of the gut virome in the pathogenesis of CRC.
31
Chapter 3
DeepLINK-T: deep learning inference for time series data using
knockoffs and LSTM
3.1 Introduction
In the past decades, feature selection has become one of predominant techniques in various fields including
bioinformatics [120, 146, 86, 83], economics [157, 161] and healthcare [114, 97, 113]. The primary objective
of feature selection is to identify and choose pertinent features (covariates) that significantly contribute
to specific response variables of interest from a pool of candidates, often characterized by a plethora of
noisy variables. Feature selection serves various purposes, including model interpretation, dimensionality
reduction, and efficiency enhancement [71]. These applications are integral to a spectrum of machine
learning tasks including classification, regression, and clustering.
Given the widespread popularity and broad application of feature selection, numerous methods have
been developed to approach this complex problem. For example, the random forests (RF) approach [23]
utilizes information gain for feature selection, but it may be susceptible to a high false discovery rate (FDR)
in the setting of high-dimension inference; see, e.g., [34] for some recent theoretical developments on highdimensional random forests. To surmount this difficulty, seminal works based on p-values [15, 17, 66, 124]
have been introduced to evaluate the feature importance. However, the efficacy of p-value based methods
32
is restricted to simpler model settings where obtaining p-values is straightforward. With the evolution
of deep neural networks, models have become increasingly intricate, which makes it luxury to calculate
p-values, thereby posing a formidable challenge for feature selection in such contexts.
Various approaches have been suggested to achieve FDR control [17, 132, 54, 66]. However, most of
these methods depend on p-values, thus not adaptable to the deep neural network setting. Moreover, they
mainly focused on independent data setting and are not able to capture the serial dependency in time series
data.
In contrast, Barber and Candès [10] introduced a novel approach known as knockoff filter, designed for
FDR control in Gaussian linear models without relying on p-values. Subsequently, the model-X knockoffs
framework [26] was proposed to generalize knockoff filter to general high-dimensional nonlinear models
with arbitrary dependency structure of the predictors. The framework essentially acts as a wrapper, seamlessly integrating with any feature selection methods that yield feature importance measures satisfying
specific conditions. Recently, Lu et al. [88] proposed DeepPINK to combine model-X knockoffs and deep
neural networks for features characterized by joint Gaussian distributions, and Zhu et al. [163] introduced
DeepLINK to generalize the distribution of input features. More recent developments on the extensions of
model-X knockoffs inference include [46, 8, 33]. In particular, our current work was motivated by the idea
of utilizing the latent factor structure in the recent work of IPAD knockoffs in [46].
However, these existing knockoffs framework based method predominantly focus on independent observations, failing to account for the prevalent serial dependence observed in data across diverse fields,
such as ecology, medical care and finance. Incorporating temporal information into the feature selection
process has the potential to produce more accurate and interpretable results by leveraging known serial
dependencies rather than solely relying on the correlations between features and responses. Consequently,
33
we introduce a novel method for deep learning inference using knockoffs specifically tailored for time series data, named DeepLINK-T. This method is designed to efficiently select features while ensuring FDR
control in the context of serial data.
3.2 Method
3.2.1 Model setting
Consider the problem of high-dimensional supervised learning, which involves analyzing longitudinal time
series observations on m subjects. The time series data for each subject is represented as (Xi
, yi), i =
1, . . . , m, where Xi = (xi1, . . . , xin)
T ∈ R
n×p
is the feature matrix containing n time points and
p features for the ith subject, and the response vector over the n time points is represented by yi =
(yi1, . . . , yin)
T ∈ R
n×1
. To simplify the technical presentation, let us assume that (Xi
, yi)’s are independent and identically distributed (i.i.d.) across i ∈ {1, . . . , m}. The full set of features is represented
as {1, 2, . . . , p}. We assume that given each row of Xi
, the corresponding conditional distribution of response yik, k = 1, . . . , n, depends only on a common small subset of features, and our objective is to
identify such Markov blanket, which is the smallest subset S0 of {1, . . . , p} that makes yik independent
of all other features conditional on those in S0, while controlling the feature selection error rate under a
predefined level.
Precisely, we aim to select important features in S0 while keeping the false discovery rate (FDR) [15]
controlled, where the FDR is defined as
FDR = E[FDP] with FDP =
|Sˆ ∩ S
c
0
|
max{|Sˆ|, 1}
, (3.1)
34
where | · | represents the cardinality of a set, FDP is the false discovery proportion, and Sˆ stands for the
set of features selected by a learning procedure. A slightly modified version of FDR (mFDR) is defined as
mFDR = E
"
|Sˆ ∩ S
c
0
|
|Sˆ| + q−1
#
. (3.2)
Here, q ∈ (0, 1) is the targeted FDR level. It is evident that FDR is more conservative than mFDR since
the mFDR is always under control if the FDR is under control. Additionally, we utilize another important
metric, the power, to measure the ability of the learning procedure in discovering the true features. Power
is defined as the expectation of the true discovery proportion (TDP)
Power = E[TDP] with TDP =
|Sˆ ∩ S0|
|S0|
. (3.3)
As mentioned in the Introduction, the p-value based FDR approaches are not directly applicable to
deep learning models due to the lack of valid p-value calculation methods caused by the complicated
model structure. Existing knockoffs methods have focused mainly on independent data setting. The recent
development of DeepLINK [163] has addressed deep learning inference without serial dependency. In this
paper, we propose to combine the model-X knockoffs framework [26] and the long short-term memory
(LSTM) network [63] to achieve the FDR control in terms of feature selection scenario for longitudinal
time series applications.
To make our idea concrete, let us assume that response yit for the ith subject at the tth time point
depends on p-dimensional feature vector xit through the following regression model
yit = l(xit) + εit, i = 1, · · · , m, t = 1, · · · , n, (3.4)
35
where l stands for some unknown true regression function that can be either linear or nonlinear, and
εit’s are i.i.d. model errors. To accommodate the potentially strong feature dependency, we focus on the
scenario when the high-dimensional feature vector xit, i = 1, . . . , m, at time point t depends on some
low-dimensional latent factor vector fit ∈ R
r
in a potentially nonlinear manner. Here, we assume that
r ≪ p with p the dimensionality of feature vector xit. For clear presentation, let us first consider the case
of one time series (i.e., m = 1) and use xt and ft to denote xit and fit, respectively. Hence, we consider a
factor structure for feature vector xt described by equation
xt = g(ft) + ϵt
, t = 1, . . . , n. (3.5)
Here, g represents some unknown link function whose coordinates can take either linear or nonlinear
functional forms, and ϵt
is the error vector for the factor model. Considering the potential time dependency
among the latent factors, we further assume the following dependency structure for the latent factors
f1 = f
raw
1
,
ft = w0f
raw
t−1 + w1f
raw
t
, t = 2, · · · , n,
(3.6)
where f
raw
t
’s are vectors of the raw latent factors, and w0, w1 are the weights for the time dependency
satisfying w0 + w1 = 1. Here, we assume that f
raw
t
’s are drawn from a certain distribution and details are
specified in the simulation designs. To simplify the notation, let X = (x1, . . . , xn)
T be the n × p design
matrix for a given subject, C = (g(f1), . . . , g(fn))T
the n × p factor matrix, and E = (ϵ1, . . . , ϵn)
T
the
error matrix for the factor model. Then the factor model in Equation (3.5) becomes
X = C + E. (3.7)
36
Hence, our objective is to develop a new deep learning feature selection method that keeps FDR controlled at the target level q under the time series model setting outlined in Equations (3.4)–(3.7).
3.2.2 The model-X knockoffs framework
The idea of knockoffs inference was initially proposed by Barber and Candès [10] for the case of lowdimensional Gaussian linear models. Then the model-X knockoffs framework [26] generalizes this approach to general high-dimensional nonlinear models and achieves theoretically guaranteed FDR control
in arbitrary dimensions and for arbitrary dependence structure of response on the features. Concisely, the
model-X knockoffs framework constructs the knockoff variables which imitate the dependence structure
of the original variables, but are conditionally independent of the response given the original features.
These knockoff variables function as control variables, aiding in the assessment of the original variables’
importance. This is achieved through the construction of what are referred to as knockoff statistics – a concept that will be explained subsequently – which serve as measures of the original variable’s importance.
The knockoff statistics are used for achieving FDR control in feature selection.
Consider covariate vector x = (x1, . . . , xp)
T
and response y. The model-X knockoff variables for
variables in x are random variables in x˜ = (˜x1, . . . , x˜p)
T
that satisfy the following two properties:
1) For any subset S ⊂ {1, 2, . . . , p},(x, x˜)swap(S)
d= (x, x˜), where swap(S) represents the operator
of swapping the original variable xj and its knockoff counterpart x˜j for each j ∈ S, and d= stands
for equal in distribution.
2) The knockoff variables x˜ is independent of response y given the original variables x.
Let X = (x1, . . . , xn)
T be the observed n × p design matrix and X˜ = (x˜1, . . . , x˜n)
T
the constructed
knockoffs data matrix. The model-X knockoffs framework constructs knockoff statistics W1, . . . , Wp to
37
evaluate the importance of the original variables. For each j ∈ {1, 2, . . . , p}, Wj = wj ([X, X˜ ], y) is a
function of the augmented data matrix [X, X˜ ] and response y satisfying the sign-flip property
wj ([X, X˜ ]swap(S)
, y) =
wj ([X, X˜ ], y), j /∈ S,
−wj ([X, X˜ ], y), j ∈ S,
(3.8)
where S ⊂ {1, 2, . . . , p} is any given subset. Intuitively, ideal knockoff statistics quantify the significance of original variables, with high positive values indicating their importance. On the other hand, for
unimportant features not in S0, the corresponding Wj ’s are expected to have small magnitude and be
symmetrically distributed around zero.
Finally, the model-X knockoffs framework selects important features as Sˆ = {j : Wj ≥ t} with t
being either the knockoff or knockoff+ threshold [26] defined respectively as
T = min
t > 0 :
|{j : Wj ≤ −t}|
max{|{j : Wj ≥ t}|, 1}
≤ q
, (3.9)
T+ = min
t > 0 :
1 + |{j : Wj ≤ −t}|
max{|{j : Wj ≥ t}|, 1}
≤ q
. (3.10)
Here, no features will be selected if T = ∞ or T+ = ∞. It has been proved in [26] that under
some regularity conditions, the model-X knockoffs framework controls the FDR in finite samples with the
knockoff+ threshold, and controls the mFDR with the knockoff threshold.
3.2.3 DeepLINK-T: a new deep learning inference method for time series data
The preceding sections have highlighted the crucial elements of the model-X knockoffs framework: the
construction of knockoff variables and the construction of knockoff statistics. In this subsection, we introduce DeepLINK-T, a deep learning-based statistical inference framework using knockoffs and LSTM for
longitudinal time series data, to achieve the construction tasks and feature selection with controlled FDR.
38
Briefly, the DeepLINK-T framework contains two components: 1) an LSTM autoencoder network for generating knockoff variables and 2) another LSTM prediction network for the knockoff statistic construction
and feature selection with controlled FDR.
Our motivation comes from the recent works of IPAD [46] and DeepLINK [163] on knockoffs inference. Equation (3.7) clearly shows that the knockoff variables can be directly constructed if we have the
realization C and the distribution fθ of the error matrix E without violating the two properties of knockoff
variables, since we can generate an independent copy of the error matrix from fθ. However, in practice,
it is challenging to obtain the realization C, and we need to estimate C via a particular procedure. It is
crucial to note that even though C is a random matrix, the construction of valid knockoff variables requires us to know the specific realization C rather than the distribution of realization C. This is because
if Cˆ is independently drawn from the known distribution, the first property in the definition of model-X
knockoff variables could be violated due to the independence of Cˆ and C.
To overcome this challenge, we propose to use an LSTM autoencoder [121] for generating the knockoff
variables. The structure of the LSTM autoencoder is shown in Figure 3.1. Take one subject (time series) as
the example. We train an LSTM autoencoder with design matrix X as both the input and the target of the
neural network. Let Cˆ be the output matrix of the LSTM autoencoder for reconstruction. The estimated
knockoff variables Xˆ can be constructed by
X˜ = Cˆ + E˜ . (3.11)
39
LSTM LSTM . . . LSTM
LSTM LSTM . . . LSTM
Encoded features
Input original variables
Output knockoff variables
LSTM encoder
LSTM decoder
Figure 3.1: The structure of the LSTM autoencoder.
Here, E˜ is the regenerated error matrix with entries independently drawn from the estimated marginal
distribution fθˆ of ϵit. For instance, if the entries of the error matrix follow the normal distribution with
mean 0 and variance σ
2
, we can estimate θ = σ
2
as
ˆθ =
1
np
Xn
t=1
X
p
k=1
eˆ
2
tk, (3.12)
where eˆtk’s are the entries of the residual matrix Eˆ = X − Cˆ .
In the existence of multiple subjects (i.e., m > 1), each subject is treated as an input batch for the
LSTM autoencoder. Stateless LSTM layers are applied to ensure that the initial state of the current batch
will not be related to the last state of the previous batch. The variance estimator ˆθi
is calculated separately
for each batch i = 1, · · · , m, and then the error matrix E˜
i for each batch is independently drawn from its
estimated marginal distribution fθˆ
i
.
As discussed in the previous section, ideal knockoff statistics are able to accurately measure the importance of original features. Additionally, the relationship between response variable yit and features xit
40
in Equation (3.4) could be either linear or nonlinear. To achieve these objectives, we propose to utilize the
weights from the learned LSTM prediction network to construct the knockoff statistics. The structure of
the LSTM prediction network is shown in Figure 3.2.
Specifically, after generating knockoff variable matrix X˜ from the LSTM autoencoder, we form the
augmented data matrix [X, X˜ ]. To integrate the information from the original variables X and the knockoff
variables X˜ , we adopt the idea of DeepPINK [88] to form an LSTM prediction network (Figure 3.2). In
Figure 3.2, each of the original features and their knockoff counterparts were paired together using a
plugin pairwise-coupling layer to ensure fair competition in their importance for predicting the response
variable, and this layer is named as the filter layer for the ease of reference. The output from the filter layer
is then fed to a dense layer followed by an LSTM layer with the output being response vector yi
.
The LSTM prediction network (Figure 3.2) can be easily trained with the original variables and their
knockoff copies by backpropagation through time (BPTT) [149]. Similar to the training process of the
LSTM autoencoder, in the case of multiple subjects (i.e., m > 1), each subject will be treated as a batch and
BPTT will be repeatedly applied to every single batch. Since a stateless LSTM layer is applied, the initial
state of a new batch will not be affected by the last state of the previous batch. After training this LSTM
prediction network, denote by z = (z1, z2, . . . , zp)
T
and z˜ = (˜z1, z˜2, . . . , z˜p)
T
the filter weights for the
original features and their knockoff counterparts in the filter layer, respectively. Let W0 ∈ R
p×k be the
weight matrix for the dense layer, where k is the number of neurons in the dense layer. Given the output
xt
in state t from the dense layer, the LSTM layer (Figure 3.3) first determines what old information will
be discarded by the forget gate
ft = σ(Vfxt + Ufht−1 + bf ), (3.13)
where ht−1 is the hidden state from the LSTM cell at the state t − 1, σ(·) denotes the sigmoid function, bf
is the bias parameter, and Vf and Uf are network weight matrices of appropriate sizes. An input gate and
41
�# �$ �!"# �!
LSTM Layer
�
Original variables Knockoff variables
Input layer
Filter layer
Dense layer
Output layer
. . .
. . .
Figure 3.2: The structure of the LSTM prediction network.
a cell state decide the updated values it and propose a new candidate values c˜t
, respectively. These values
together determine the new cell state ct as
it = σ(Vixt + Uiht−1 + bi),
c˜t = tanh(Vcxt + Ucht−1 + bc),
ct = ft ⊙ ct−1 + it ⊙ c˜t
,
(3.14)
42
ṓ ṓ WDQK ṓ
WDQK
�!"# �!
ℎ!"# ℎ!
�!
ℎ!
�! �! � �! ̃
!
Figure 3.3: The structure of a LSTM cell. x is the input of the LSTM cell. f, i and o represent the forget,
input and output gates, respectively. c˜t denotes the input activation. h and c are hidden state and cell state.
Subscript t indicates the time step.
where tanh(·) denotes hyperbolic tangent function, ⊙ is element-wise product, bi and bc are the biases,
and Vi
, Ui
, Vc, Uc are network weights of appropriate sizes. Finally, the output ot and the hidden state ht
at current state t can be obtained by
ot = σ(Voxt + Uoht−1 + bo),
ht = ot ⊙ tanh(ct).
(3.15)
Here, similarly, b0 is the bias and V0 and U0 are the weight matrices.
In Equations (3.13)–(3.15) above, if we denote u as the number of LSTM units in the LSTM layer, then
the weigh matrices V·
’s are of u × k, the weight matrices U·
’s are of size u × u, and biases b· are all vectors
of length u, where the · in the subscript can be f, i, c or o.
43
To measure the relative importance of original features and their knockoff counterparts, we track their
network weights propagate along the network through filter layer, dense layer, LSTM layer, and output
layer (for predicting y). Due to the complicated structure of the LSTM layer, each of the original and
knockoff features make multi-path contributions to the network by going through various gates including
the forget gate, input gate, cell state, and output gate. Since their contributions along these different
paths (each corresponds to a gate) are likely different, we measure the relative importance of original and
knockoff features when going through each of the paths separately as
v
(f) = (V0 ⊙ Γ)VfV1,
v
(i) = (V0 ⊙ Γ)ViV1,
v
(c) = (V0 ⊙ Γ)VcV1,
v
(o) = (V0 ⊙ Γ)VoV1,
(3.16)
respectively corresponding to forget gate, input gate, cell state, and output gate. Here, V1 ∈ R
h
is the
weight vector for the output layer following the LSTM layer and Γ = [γ, γ, . . . , γ]
T ∈ R
p×k with γ ∈ R
k
being the weight vector for the batch normalization [68] between the dense layer and the LSTM layer. If
the batch normalization is not used, we simply remove Γ in the above equation. We then augment the contributions of the original feature j along the four different paths into a vector (zjv
(f)
j
, zjv
(i)
j
, zjv
(c)
j
, zjv
(o)
j
),
and define the variable importance measure for feature j as
Zj = ||(zjv
(f)
j
, zjv
(i)
j
, zjv
(c)
j
, zjv
(o)
j
)||2,
(3.17)
where ||·||2 denotes the L2-norm. Correspondingly, the variable importance measure for the jth knockoff
variable is defined as
Z˜
j = ||(˜zjv
(f)
j
, z˜jv
(i)
j
, z˜jv
(c)
j
, z˜jv
(o)
j
)||2,
44
Finally, the knockoff statistics are defined as
Wj = Z
2
j − Z˜2
j
, j = 1, · · · , p. (3.18)
With the use of our new knockoff statistics introduced in Equation (3.18), the feature selection procedure described in Equations (3.9) and (3.10) can be directly applied. It is important to note that the filter
weights zj and z˜j need to be initialized identically when training the neural networks to ensure good
feature selection performance.
Here, we have introduced the way of constructing knockoff statistics based on the LSTM network
structure depicted in Figure 3.2. Our definition of knockoff statistics can be easily adapted to other neural
network architectures, including those with more dense layers and multiple stacked LSTM layers.
3.3 Simulation studies
To investigate the finite-sample capability of DeepLINK-T for FDR control in feature selection, we design
various simulation models including the linear or nonlinear factor models, and different forms of the link
function between the response and features for a comprehensive analysis. In all simulation designs, we
set m = 1 and the vector of raw latent factors as f
raw
t = (f
raw
t1
, fraw
t2
, fraw
t3
)
T
. Each f
raw
t
is drawn
independently from the multivariate normal distribution N(0, Σ), where Σ has the AR(1) structure with
(i, j)th entry being 0.9
|i−j|
. Then the factor vector ft = (ft1, ft2, ft3)
T
is obtained based on f
raw
t and
Equation (3.6) with w0 = 0.3 and w1 = 0.7. Two factor models, including the linear factor model (Equation
(3.19)) and logistic factor model (Equation (3.20)), are used in the simulation studies to generate covariates
xt = Λft + ϵt
, (3.19)
45
xtk =
ck
1 + exp([1,f
T
t
]λk)
+ ϵtk, k = 1, · · · , p. (3.20)
Here, ck, λk, ϵtk, and entries of Λ ∈ R
p×3
are all independently sampled from the standard normal distribution N(0, 1).
The response vector y = (y1, . . . , yn)
T
is simulated from Equation (3.4), and both the link function
(Equation (3.21)) and nonlinear link function (Equation (3.22)) are considered in our simulation study:
l(xt) = xtβ, (3.21)
l(xt) = sin(xtβ) exp(xtβ), (3.22)
where β = (β1, . . . , βp)
T
is the coefficient vector with each βk, 1 ≤ k ≤ p being either A, −A, or 0. Here,
A is some positive value standing for the signal amplitude. We randomly pick s components from β and
set them to be A or −A with the same probability; the features corresponding to these s components are
important features and expected to be selected out by a good feature selection method. The remaining
p − s components of β are then set to be 0 indicating the corresponding features are unimportant. With
the linear link function, it is evident that a large A indicates a strong signal strength of the corresponding
feature. However, for nonlinear link functions, it is possible that A no longer has a monotonic increasing
relationship with signal strength. Indeed, there does not exist a commonly accepted measure for the signal
strength in nonlinear model settings; See [163] for more detailed discussions on this.
Furthermore, to verify the capability of DeepLINK-T to tackle potential model misspecification in realworld applications, we tested DeepLINK-T on the response relationship with latent confounders using the
following model:
yt = l(xt) + 1
3
(ft1 + ft2 + ft3) + εit, t = 1, · · · , n, (3.23)
46
linear link function nonlinear link function
A B C D
Figure 3.4: The impacts of bottleneck dimensionality and training epochs on DeepLINK-T using the linear
factor model. A and B are FDR and power under the setting of linear link function. C and D are FDR
and power under the setting of nonlinear link function. The number of training epochs of both LSTM
autoencoder and LSTM prediction network is specified on the x-axis. The bottleneck dimensionality of
the LSTM autoencoder is specified on the y-axis. The pre-specified FDR level is q = 0.2.
Note that although ft can be viewed as a function of both xt and the idiosyncratic error ϵt
, the dependence
of yt on xt does not admit the functional form defined in (3.4). In this sense, (3.4) is a misspecified model
when data are generated from (3.23).
3.3.1 The impacts of hyperparameters and model misspecification on DeepLINK-T
We fixed the number of subjects m = 1, the number of time points n = 1000, the number of features
p = 500, the signal amplitude A = 10, and the pre-specified FDR level q = 0.2 to investigate the effects of
hyperparameters including bottleneck dimensionality in the LSTM autoencoder and the training epochs
for both networks (i.e., LSTM autoencoder and LSTM prediction network) in DeepLINK-T on the knockoffs
inference results. Figures 3.4 and 3.5 demonstrate the performance of DeepLINK-T in the linear factor
model and logistic factor model, respectively. Sub-figures A and B present the FDR+ and Power+ for the
linear link function setting when varying the values for bottleneck dimension and training epochs. Subfigures C and D present the FDR+ and Power+ in the setting of nonlinear link function. These results reveal
lower power when the number of training epochs is less than 300, indicating the potential underfitting
47
linear link function nonlinear link function
A B C D
Figure 3.5: The impacts of bottleneck dimensionality and training epochs on DeepLINK-T using the logistic factor model. A and B are FDR and power under the setting of linear link function. C and D are
FDR and power under the setting of nonlinear link function. The number of training epochs of both LSTM
autoencoder and LSTM prediction network is specified on the x-axis. The bottleneck dimensionality of
the LSTM autoencoder is specified on the y-axis. The pre-specified FDR level is q = 0.2.
status of the networks. Another observation is that DeepLINK-T is robust to the choice of bottleneck
dimension. The FDR stays under control and the power is stable as the bottleneck dimension varies from
1 to 64 when the number of training epochs exceeds 500. The underestimation (bottleneck dimension < 3)
and overestimation (bottleneck dimension > 3) of the number of latent factors do not substantially impact
the FDR and power, justifying the robustness of DeepLINK-T under potential model misspecification.
Next, we introduced latent confounders as described in Equation (3.23) into data generating models,
aiming to assess the ability of DeepLINK-T in addressing model misspecification more comprehensively.
Supplementary Figures 5.14 and 5.15 present the FDR+ and power+ metrics based on the linear factor
model and logistic factor model, respectively. The results reveal that DeepLINK-T, when trained with
fewer than 500 epochs, exhibits reduced power with inflated FDR, particularly in the case of logistic factor
model. With larger training epochs exceeding 1000, the performance of DeepLINK-T becomes comparable
to simulations without latent confounders. Additionally, the number of bottleneck dimensions does not
meaningfully influence DeepLINK-T’s performance. All together, these results suggest that DeepLINK-T
demonstrates robustness under the scenario of model misspecification.
48
0 5 10 15 20
Signal amplitude
0.0
0.2
0.4
0.6
0.8
1.0
Metric
A linear link function, p=500
Model
DeepLINK-T
DeepLINK
Metric
Power+
FDR+
0 5 10 15 20
Signal amplitude
0.0
0.2
0.4
0.6
0.8
1.0
Metric
B linear link function, p=1500
Model
DeepLINK-T
DeepLINK
Metric
Power+
FDR+
0 5 10 15 20
Signal amplitude
0.0
0.2
0.4
0.6
0.8
1.0
Metric
C nonlinear link function, p=50
Model
DeepLINK-T
DeepLINK
Metric
Power+
FDR+
0 5 10 15 20
Signal amplitude
0.0
0.2
0.4
0.6
0.8
1.0
Metric
D nonlinear link function, p=500
Model
DeepLINK-T
DeepLINK
Metric
Power+
FDR+
Figure 3.6: Comparisons of DeepLINK-T and DeepLINK based on simulated data with the linear factor
model. The number of subjects m is 1, the number of time points n is 1000, and the number of true
signals is 10. Each subtitle suggests the link function l, and the number of features p. The solid lines and
dashed lines stand for Power+ and FDR+, respectively. The green lines and red lines represent the results
of DeepLINK-T and DeepLINK, respectively.
3.3.2 Comparisons of DeepLINK-T and DeepLINK
To further investigate the ability of DeepLINK-T to handle longitudinal data, we applied both DeepLINK-T
and DeepLINK [163] to various simulated data sets for a comprehensive comparison. Since DeepLINK-T is
robust to different values of bottleneck dimensionality and 1000 training epochs are enough for the learned
model to converge, we fixed the bottleneck dimensionality to be 15. The number of training epochs was
set to 1000 for both DeepLINK-T and DeepLINK in this subsection. The pre-specified FDR level was again
set to 0.2. We tested both algorithms on the data simulated with the combination of different link functions
49
0 5 10 15 20
Signal amplitude
0.0
0.2
0.4
0.6
0.8
1.0
Metric
A linear link function, p=500
Model
DeepLINK-T
DeepLINK
Metric
Power+
FDR+
0 5 10 15 20
Signal amplitude
0.0
0.2
0.4
0.6
0.8
1.0
Metric
B linear link function, p=1500
Model
DeepLINK-T
DeepLINK
Metric
Power+
FDR+
0 5 10 15 20
Signal amplitude
0.0
0.2
0.4
0.6
0.8
1.0
Metric
C nonlinear link function, p=50
Model
DeepLINK-T
DeepLINK
Metric
Power+
FDR+
0 5 10 15 20
Signal amplitude
0.0
0.2
0.4
0.6
0.8
1.0
Metric
D nonlinear link function, p=500
Model
DeepLINK-T
DeepLINK
Metric
Power+
FDR+
Figure 3.7: Comparisons of DeepLINK-T and DeepLINK based on simulated data with the logistic factor
model. The number of subjects m is 1, the number of time points is 1000, and the number of true signals is
10. Each subtitle suggests the link function l, and the number of features p. The solid lines and dashed lines
stand for Power+ and FDR+, respectively. The green lines and red lines represent the results of DeepLINKT and DeepLINK, respectively.
and different factor models listed in the simulation designs. For each simulation setting, the two algorithms
were tested on 50 independently simulated data sets. Then the average FDP and TDP were calculated as
the empirical FDR and power, respectively. The comparison results are shown in Figures 3.6 and 3.7.
Figure 3.6 shows the comparison based on the linear factor model. Both DeepLINK and DeepLINK-T
effectively control the FDR under the target level 0.2. Yet, DeepLINK-T successfully maintains the FDR
at a lower level while achieving a higher power compared to DeepLINK in all but the setting in panel B,
where the power of DeepLINK-T is only slightly lower than DeepLINK when signal amplitude is high.
50
0 5 10 15 20 25 30 35 40
Number of subjects
0.0
0.2
0.4
0.6
0.8
1.0
Metric
A linear link function, linear factor model
Power+
FDR+
0 5 10 15 20 25 30 35 40
Number of subjects
0.0
0.2
0.4
0.6
0.8
1.0
Metric
B nonlinear link function, linear factor model
Power+
FDR+
0 5 10 15 20 25 30 35 40
Number of subjects
0.0
0.2
0.4
0.6
0.8
1.0
Metric
C linear link function, logistic factor model
Power+
FDR+
0 5 10 15 20 25 30 35 40
Number of subjects
0.0
0.2
0.4
0.6
0.8
1.0
Metric
D nonlinear link function, logistic factor model
Power+
FDR+
Figure 3.8: The impacts of number of subjects m on DeepLINK-T. In this simulation, the number of time
points is 1000, the number of features is 500, and the number of true signals is 10. Each subtitle suggests
the link function l, and the factor model. The solid lines and dashed lines stand for Power+ and FDR+,
respectively.
The simulation results based on the logistic factor model (Figure 3.7) provide additional evidence of
DeepLINK-T’s superior capability of FDR control in feature selection. DeepLINK can no longer control
the FDR in the setting of the nonlinear factor model, while DeepLINK-T can still control the FDR under
the target level and simultaneously outperform DeepLINK in terms of power. An interesting observation
is that the power curve is inverted-U shaped as the signal amplitude A increases when the nonlinear link
function l is applied, which suggests that the signal amplitude may not be an appropriate measure for the
signal strength in certain nonlinear models. We remark that similar phenomenon on the power curve was
observed in [163]. Overall, these simulation results highlight the power of DeepLINK-T in dealing with
51
longitudinal data. The simulation settings involving different factor models and link functions demonstrate
the ability of DeepLINK-T in coping with complex models, indicating its great potential in real applications.
3.3.3 The impacts of number of subjects on DeepLINK-T
Another advanced feature of DeepLINK-T lies in its ability to handle multiple subjects (time series), which
is prevalent in real-world applications. An illustrative example is the gut microbiome metagenomic time
series, where each time series may independently correspond to a different individual. Hence, in this
simulation case, we investigate the influence of the number of subjects on the performance of DeepLINK-T.
We maintained a fixed number of time points n = 1000, number of features p = 500, signal amplitude A =
10, and pre-specified FDR level q = 0.2. Multiple subjects were independently and identically simulated
using the procedure described above. Figure 3.8 displays the performance variation as the number of
subjects changes across different factor models and link functions. The results consistently demonstrate
that FDR+ decreases as the number of subjects increases. Moreover, in simulations involving the nonlinear
link function and the linear factor model, power+ increases with the growing number of subjects. In
summary, these findings underscore that the generalization of variance estimation outlined in Equation
(3.12) from a single subject to multiple subjects is reasonable. This further substantiates the prospect of
DeepLINK-T in handling complicated real-world datasets.
3.4 Real data applications
We now apply DeepLINK-T to three real-world metagenomic time series data sets to further validate its
practical performance. All metagenomic data was processed by a centered log ratio (CLR) transformation.
For all three applications, the numbers of training epochs and the bottleneck dimensionality were set
to 1000 and 15, respectively. The pre-specified FDR level was chosen as 0.2. In order to alleviate the
52
randomness during the weight initialization and the training process of the neural networks, we repeatedly
implemented DeepLINK-T on each data set for 200 times.
3.4.1 Application to longitudinal gut microbiome data of early infants
First, we applied DeepLINK-T to microbial longitudinal samples from early infants [19, 141]. Our focus
was on identifying important bacterial genera associated with the abundance levels of Bacteroides and
Bifidobacterium, which are predominant genera in the stool of early life from the facultative aerobic Enterobacteriaceae family. These two genera were found to be gradually displaced by a diverse mixture of
Clostridiales in 2 years [19]. Since we used the abundance of one genus among all the genera as the response, a direct application of the CLR transformation to all genera could cause collinearity between the
explanatory and response variables. Instead, a modified CLR transformation was adopted to address the
issue. For the explanatory genera, we directly applied CLR
CLR(xi) = ln xi −
1
D
X
D
j=1
ln xj , (3.24)
where D is the number of explanatory genera and xi denotes the composition part of genus i. Then we
transformed the count of response genus y using
yˆ = ln y −
1
D
X
D
j=1
ln xj . (3.25)
Considering that the data is rather sparse, we filtered out samples with over 50% missing time points.
At the feature domain, genera that are absent in over 90% samples were also filtered out. Consequently,
the processed data set is comprised of 36 subjects, each sequenced monthly over a 24-month period. Fifty
common genera were identified among all samples. Hence, the data array is of size m×n×p = 36×24×50.
53
Genus selected for Bacteroides Frequency Genus selected for Bifidobacterium Frequency
f__Porphyromonadaceae; g__Parabacteroides 200 f__Clostridiaceae; g__SMB53 196
f__Lachnospiraceae; g__Epulopiscium 198 f__; g__ 186
f__Veillonellaceae; g__Megasphaera 192 f__Ruminococcaceae; g__ 185
f__Veillonellaceae; g__Phascolarctobacterium 182 f__Micrococcaceae; g__Rothia 185
f__Fusobacteriaceae; g__Fusobacterium 182 f__Lactobacillaceae; g__Lactobacillus 174
f__Micrococcaceae; g__Rothia 170 f__Ruminococcaceae; g__Anaerotruncus 163
f__Erysipelotrichaceae; g__ 168 f__Erysipelotrichaceae; g__[Eubacterium] 155
f__Prevotellaceae; g__Prevotella 162 f__Lachnospiraceae; g__[Ruminococcus] 153
f__Lachnospiraceae; g__Coprococcus 161 f__Erysipelotrichaceae; g__ 152
f__Clostridiaceae; g__SMB53 151 f__Clostridiaceae; g__Clostridium 147
f__Bifidobacteriaceae; g__Bifidobacterium 146 f__Turicibacteraceae; g__Turicibacter 146
f__Ruminococcaceae; g__Ruminococcus 137 f__Clostridiaceae; g__ 144
f__Actinomycetaceae; g__Actinomyces 129 f__Coriobacteriaceae; g__Eggerthella 144
f__Ruminococcaceae; g__Oscillospira 124 f__Peptostreptococcaceae; g__[Clostridium] 131
f__Peptostreptococcaceae; g__[Clostridium] 120 f__Veillonellaceae; g__Veillonella 130
f__Erysipelotrichaceae; g__Coprobacillus 120 f__Erysipelotrichaceae; g__Coprobacillus 120
f__Pasteurellaceae; g__Haemophilus 106 f__Moraxellaceae; g__Acinetobacter 115
f__Coriobacteriaceae; g__ 105 f__Staphylococcaceae; g__Staphylococcus 109
f__Clostridiaceae; g__Clostridium 98 f__Fusobacteriaceae; g__Fusobacterium 108
f__Staphylococcaceae; g__Staphylococcus 98 f__Bacteroidaceae; g__Bacteroides 102
Table 3.1: Selection frequencies of top 20 selected genera for Bacteroides and Bifidobacterium over 200 runs.
Supplementary Figure 5.16 illustrates the histogram of selection frequencies by DeepLINK-T over 200
runs. The top 20 microbial genera, along with their selected frequencies, are detailed in Table 3.1. The
abundance variations of Bacteroides and Bifidobacterium with their top 10 selection genera are shown
in Supplementary Figures 5.17 and 5.18, respectively. These findings showcase the consistent ability of
DeepLINK-T to reliably identify the most relevant genera associated with Bacteroides and Bifidobacterium
across the 200 runs.
Specifically, Parabacteroides, identified as the top significant genus associated with Bacteroides in the
gut of early infants (Table 3.1 and Supplementary Figure 5.17), constitutes one of the major groups of
bacteria that inhabit the human gastrointestinal tract, substantially influencing the microbial diversity
and composition [19] there. For Bifidobacterium, Rothia (Table 3.1 and Supplementary Figure 5.18) was
found to be positively correlated with Bifidobacterium [47]. Both Rothia and Bifidobacterium are integral
components of breast milk, playing crucial roles in maintaining the immune homeostasis of infant gut [47,
5].
54
3.4.2 Application to marine metagenomic time series data
3.4.2.1 Identifying primary chlorophyll-a producer
Next, we applied DeepLINK-T to analyze the San Pedro Ocean time series (SPOT) data set [39, 154]. The
ocean’s surface harbors microscopic organisms known as phytoplankton, which are primary producers
utilizing pigments like chlorophyll-a (chl-a) for photosynthesis. The quantification of chlorophyll in the
surface water serves as a reliable indicator of primary production levels in the ocean ’s upper layer [126].
Consequently, our objective is to explore the effect of the abundance of chloroplast on the chlorophyll-a
concentration.
The SPOT data set consists of both 16S rRNA data and 18S rRNA data collected from San Pedro Channel,
situated off the coast of Los Angeles, spanning the period from August 2000 to July 2018. Samples were
taken from different depths including 5m, 150m, 500m, and etc. Specifically, we used samples from 5m to
ensure the maximum number of time points and capture the most abundant microbiome. After aligning
the common time point between 16S and the metadata, we identified 161 time points in total. The analysis
was conducted at the order, family, and genus levels. A taxon was filtered out if it is missing in over
90% of time points to ensure the robustness. Finally, the data arrays (matrices) are of sizes 1 × 161 × 88,
1 × 161 × 70, and 1 × 161 × 56 at the genus, family, and order levels, respectively. The CLR normalization
was applied to compute the relative abundance of the explanatory variables.
DeepLINK-T again demonstrated its efficacy in discovering taxa closely associated with the concentration of chl-a. The histogram of selection frequencies by DeepLINK-T on the SPOT data set is depicted
in Supplementary Figure 5.19. Supplementary Figure 5.20 further delineates the relationship between
chl-a concentration and the abundance of its top three selected taxa. Notably, peak chl-a concentration
aligns with the peak abundance of Heterosigma, Pyramimonas, and Corethrales. Conversely, the peak chl-a
concentration coincides with the troughs in the abundance of Chlorarachniaceae family with its parent
Chlorarachniophyceae order.
55
Table 3.2 presents the top selected taxa along with their corresponding selection frequencies. Among
the top selected genera, Jackson et al. [69] observed that the deep chlorophyll maximum (DCM) is primarily influenced by a species from the Rhizosolenia genus. Additionally, Zhao et al. [160] identified a
significant correlation between the concentration of chl-a and the top two selected genera Pyramimonas
and Heterosigma.
3.4.2.2 Identifying taxa significantly associated with prokaryotic production
Similar to chlorophyll-a concentration, the leucine incorporation rate serves as a crucial metric for measuring prokaryotic production [76]. Employing a similar approach, we investigated the relationship between
prokaryotic abundance levels and the leucine incorporation rate. Following data filtration and CLR normalization, the prokaryotic abundance data matrices have dimensions of 1 × 161 × 270, 1 × 161 × 173,
and 1 × 161 × 110 at the genus, family, and order levels, respectively.
Supplementary Figures 5.21 and 5.22 display the selection frequencies and the relationship between
leucine incorporation rate and the abundance levels of its top three selected prokaryotes. Notably, the peak
leucine incorporation rate is in line with the peak abundance of Tenacibaculum, Planktomarina, Flavobacteriaceae, Halieaceae and Cellvibrionales. Conversely, the peak leucine incorporation rate coincides with the
troughs in the abundance of Defluviicoccales order. Table 5.1 presents the top selected taxa with their selection frequencies. Among the selection taxa, Rhodobacteraceae family and orders from Alphaproteobacteria
class, previously identified by Lamy et al. [79] and Wemheuer et al. [148], were confirmed to significantly
contribute to the leucine incorporation rate during the peak of an iron-fertilized phytoplankton bloom in
the ocean. Moreover, the Polaribacter genus was also found as a predominant contributor to both leucine
incorporation and prokaryotic abundance in regions characterized by high-nutrient, low-chlorophyll conditions in the upper 100 meters of the ocean [102].
56
3.4.3 Application to dietary glycans treatment time series data
In our final exploration, we applied DeepLINK-T to a time series data set involving dietary glycans treatment [40], which encompasses shotgun data collected from the stool samples of 61 healthy volunteers at
dense temporal resolution (about 4 times per week over 10 weeks). The data set was used to study two
different glycans (fructooligosaccharides (FOS) and polydextrose (PDX)). To discern the important bacterial groups associated with the two treatments, the data set was aggregated at various taxonomic levels,
including genus, family, and order. The CLR was used to transform the count data into relative abundance.
After filtering out samples with over 50% missing data and taxa absent in over 90% subjects, 25 subjects
for the FOS group and 27 subjects for the PDX group were retained. Each sample accounted for 40 time
points, revealing a data set comprising 241 genera, 77 families, and 38 orders, respectively. Therefore, the
input data arrays are of sizes 52×40×241, 52×40×77, and 52×40×38 at the three levels, respectively.
The distribution of selection frequencies, depicted in Supplementary Figure 5.23, underscores the robustness of DeepLINK-T to stably identify taxa importantly associated with the two treatments across
different taxonomic levels. The pertinent taxa identified by DeepLINK-T are shown in Table 3.3. Consistent with the finding from Creswell et al. [40], Ruminiclostridum and Odoribacter are significantly correlated with the PDX treatment, often resulting in an increase in a more diverse set of bacterial genera.
Furthermore, FOS treatment is associated with a reduction in the proportion of the Rothia genus [7]. Supplementary Figures 5.24-5.26 additionally elucidate that DeepLINK-T effectively identifies taxa such as
Odoribacter, Angelakisella, and Ruminiclostridium exhibiting decreased abundant in both groups. Concurrently, Campylobacteraceae displays an increase in abundance in the FOS group and a decrease in the PDX
group throughout the treatment period.
57
3.5 Discussions
In this paper, we have introduced a new method, DeepLINK-T, designed for stable feature selection in
deep learning applications with longitudinal time series data. DeepLINK-T leverages the LSTM autoencoder based on the latent factor assumption to generate time series knockoff variables under the potentially
strong dependency structure. Then another LSTM prediction network is trained on the combination of the
original variables and knockoff variables to construct the knockoff statistics for knockoffs inference. Our
method extends the application of the original model-X knockoffs framework to recurrent neural networks (RNNs) and temporal data. Through comprehensive simulation studies, we have demonstrated that
DeepLINK-T attains high power and successfully controls the FDR at a desired level in terms of feature
selection. Notably, DeepLINK-T outperforms its predecessor, DeepLINK, by exploiting the serial dependency information. Moreover, our empirical analyses highlight the utility and effectiveness of DeepLINK-T
in practice.
Future advancements in this methodology could explore the integration of the Transformer [140] instead of the LSTM in both the autoencoder and the prediction networks for generating the knockoff variables and constructing the knockoff statistics. The attention mechanism in the Transformer could be
helpful to learn long-term serial dependency within the longitudinal data. Additionally, Transformer is
faster than LSTM in both training and inference phases due to its enhanced parallelizability. Furthermore,
the model-X knockoffs framework can be conceptualized as a general inference machine, where knockoff
statistics play a crucial role. In our current work, we have utilized the method proposed by [26] to calculate these statistics. Future endeavors could explore innovative approaches to constructing new knockoff
statistics that can yield higher power with effective FDR control. Finally, it is also interesting to integrate
the two underlying deep learning networks into one unified deep learning model for end-to-end training
and inference with flexibility, interpretability, and stability. This consolidation would enhance the tool’s
user-friendliness and overall efficiency.
58
Genus Frequency
f__Chattonellaceae;g__Heterosigma 51
f__Pyramimonadaceae;g__Pyramimonas 51
f__Rhizosoleniaceae;g__Rhizosolenia 46
f__Chlorarachniaceae;g__Partenskyella 34
f__Corethraceae;g__Corethron 25
f__Prasino-clade-9_XX;g__Prasino-clade-9_XXX 19
f__Sarcinochrysidaceae;g__Sarcinochrysidaceae_X 19
f__Prasino-clade-7_X-B;g__Prasino-clade-7_X-B2 18
f__Thalassiosiraceae;g__Thalassiosira 16
f__Chrysochromulinaceae;g__Chrysochromulina 14
Family Frequency
o__Chlorarachniophyceae;f__Chlorarachniaceae 44
o__Corethrales;f__Corethraceae 41
o__Chattonellales;f__Chattonellaceae 38
o__Prasino-clade-9;f__Prasino-clade-9_XX 22
o__Coscinodiscophyceae_X;f__Coscinodiscophyceae_XXX 21
o__Prasino-clade-7;f__Prasino-clade-7_X-B 17
o__Bolidomonadales;f__Bolidomonadaceae 14
o__Rhizosoleniales;f__Rhizosoleniaceae 14
o__Prymnesiophyceae_XX;f__Prymnesiophyceae_XXXX 11
o__Eutreptiales;f__Eutreptiaceae 10
Order Frequency
c__Filosa-Chlorarachnea;o__Chlorarachniophyceae 48
c__Prasinophyceae;o__Prasino-clade-9 34
c__Bacillariophyta;o__Corethrales 29
c__Raphidophyceae;o__Chattonellales 29
c__Bacillariophyta;o__Coscinodiscophyceae_X 27
c__Prasinophyceae;o__Prasino-clade-7 26
c__Bolidophyceae;o__Bolidomonadales 20
c__Bacillariophyta;o__Rhizosoleniales 15
c__Pavlovophyceae;o__Pavlovales 10
c__Prasinophyceae;o__Pyramimonadales 9
Table 3.2: Selection frequencies of top 10 selected taxa for chl-a concentration at the genus, family, and
order levels over 200 runs.
59
Genus Frequency
f__Oscillospiraceae;g__Oscillibacter 172
f__Micrococcaceae;g__Rothia 159
f__Campylobacteraceae;g__Campylobacter 133
f__Staphylococcaceae;g__Staphylococcus 99
f__Spirochaetaceae;g__Treponema 54
f__Odoribacteraceae;g__Odoribacter 31
f__Ruminococcaceae;g__Angelakisella 30
f__;g__Intestinimonas 28
f__Carnobacteriaceae;g__Granulicatella 21
f__Ruminococcaceae;g__Ruminiclostridium 18
Family Frequency
o__Clostridiales;f__Oscillospiraceae 160
o__Campylobacterales;f__Campylobacteraceae 124
o__Bacillales;f__Staphylococcaceae 91
o__Spirochaetales;f__Spirochaetaceae 79
o__Micrococcales;f__Micrococcaceae 76
o__Bacteroidales;f__Odoribacteraceae 63
o__Synergistales;f__Synergistaceae 58
o__Brachyspirales;f__Brachyspiraceae 58
o__Mycoplasmatales;f__Mycoplasmataceae 42
o__Enterobacterales;f__Enterobacteriaceae 28
Order Frequency
c__Epsilonproteobacteria;o__Campylobacterales 154
c__Spirochaetia;o__Brachyspirales 116
c__Synergistia;o__Synergistales 98
c__Chlamydiia;o__Chlamydiales 93
c__Mollicutes;o__Mycoplasmatales 89
c__Gammaproteobacteria;o__Aeromonadales 59
c__Gammaproteobacteria;o__Oceanospirillales 57
c__Actinobacteria;o__Streptomycetales 39
c__Gammaproteobacteria;o__Vibrionales 33
c__Actinobacteria;o__Propionibacteriales 31
Table 3.3: Selection frequencies of top 10 selected taxa for differentiating the FOS and PDX groups at the
genus, family, and order levels over 200 runs.
60
Chapter 4
Conclusions and future work
In this dissertation, we have conducted comprehensive statistical analyses and introduced a novel computational method to address important research challenges in the metagenomic analysis of the human
microbiome.
Initially, we analyzed a comprehensive set of CRC patients, utilizing a sophisticated reference phage
database. Our findings reveal substantial heterogeneity in gut virome patterns between healthy controls
and individuals with CRC. The differentially abundant viral species and genus identified by us were consistently observed in other research, affirming the robustness of our results. Our correlation analysis further
indicates the potential symbiotic network between bacteria and phages and its effect on the intestinal
health. Additionally, the diagnostic model based on the phage abundance showcases the potential of virome in the CRC prediction. Altogether, our work uncovers significant contributions of the gut virome to
CRC and offers a fresh perspective of the gut virome in the pathogenesis of this disease.
Furthermore, we proposed a novel algorithm, named DeepLINK-T, based on the model-X knockoffs
framework. It achieves FDR control for high-dimensional statistical inference in longitudinal data. Our
method incorporates a dual-network structure, featuring an LSTM autoencoder for knockoff variable construction and an LSTM prediction network for quantifying feature importance. We have demonstrated the
effectiveness of DeepLINK-T under different scenarios through comprehensive simulation studies. The
61
real-world applications underscore the practice utility of our algorithm in various microbial communities.
We hope that DeepLINK-T will serve as a valuable tool for researchers delving into the pathogenesis of
various diseases, contributing to unveiling more mysteries surrounding microorganisms.
These endeavors have significantly broadened our comprehension of the intricate ways in which the
human microbiome influences disease states, shedding lights on the direction of potential future exploration and applications, which will be discussed in the subsequent sections.
4.1 Future work for gut virome analysis in CRC
In Chapter 2, we presented comprehensive metagenomic analyses of the gut virome in CRC. Though these
analyses reveal important viral signatures in the human gut, there are still several avenues that can be
explored to further enhance our understanding of the gut virome.
First of all, all seven datasets we collected for the analyses are based on whole genome shotgun data.
However, some existing studies have demonstrated the association of gut microbiome and CRC using 16S
rRNA genes [156, 42]. In comparison to shotgun sequencing, 16S rRNA sequencing is less expensive and
focuses more on marker genes. By integrating both 16S data and shotgun data in the study of the gut
virome, researchers may gain a deeper insight into the composition and dynamics of the viral community.
This combined approach might provide a more comprehensive understanding of the interactions between
bacterial and viral components in the gut microbiome, revealing the essence of their roles in CRC and
potentially uncovering novel associations.
Additionally, the diagnostic model discussed in Chapter 2 can be further enhanced. On the one hand,
when training the model based on the combination of several datasets, we treated each sample from different studies as coming from the same dataset. However, significant heterogeneity exists among different
datasets, which could be caused by distinct sequencing depth, different sequencing techniques and diverse dietary habits. These factors potentially introduce noise to the diagnostic model and weaken the
62
performance of the model. To overcome this challenge, several sophisticated methods and tools including
ensemble learning [159] and ComBat normalization [158] can be explored to remove batch effect and reduce bias across different datasets. It is worth trying these methods for further enhancing the power and
robustness of the diagnostic model.
Moreover, we only used the abundance of phages to train the diagnostic model. Though the existing
features provide much information about gut microbiome, additional classification criteria can be obtained
from the abundance levels of other microorganisms, such as prokaryotes and eukaryotes. Both prokaryotes
and eukaryotes are crucial component of the gut microbiome and play pivotal roles in the pathogenesis of
CRC. By incorporating viral, prokaryotic and eukaryotic abundances, the diagnostic model may achieve
better accuracy and assists researchers in better understanding the associations between different microbial
kingdoms and CRC. This additional information could provide a more comprehensive view of the multikingdom interactions within the gut microbiome and potentially unveil novel patterns or relationships
relevant to CRC.
4.2 Future work for DeepLINK-T
In Chapter 3, we introduced DeepLINK-T, a cutting-edge algorithm designed to achieve FDR control in
high-dimensional statistical inference for time-series data using LSTM and model-X knockoffs framework.
Despite its effectiveness and robustness in both simulation studies and real-world applications, several
future directions could be explored.
Firstly, with the dynamic evolution of network structure in deep learning, several advanced network
structures have been proposed to model serial data in recent years. One of the most famous structures is
Transformer [140], which utilizes the attention mechanism to effectively compute a representation of the
serial data. It has been demonstrated that Transformer performs better than LSTM in terms of capturing
long-term serial dependency and computational parallelizability. Moreover, the weights of query, key and
63
value inside the attention mechanism are natural substitution for quantifying importance of features in
model-X knockoffs framework. Therefore, a compelling avenue for further exploration is replacing the existing LSTM networks in DeepLINK-T with Transformer-based counterparts. This refinement aligns with
the forefront of deep learning advancements and potentially enhance the performance of the algorithm,
particularly in scenarios involving longer serial data sequences.
Furthermore, the current implementation of DeepLINK-T incorporates a dual-network structure, consisting of both an LSTM autoencoder and an LSTM prediction network. While this configuration results
in robust performance, its complexity could pose challenges for new users. Moreover, due to the dualnetwork structure, the construction of knockoff variables and knockoff statistics are separated, making it
difficult to explain the association between these two important factors in model-X knockoffs framework.
To address these challenges, future efforts could explore integrating these two neural networks into one
deep learning model for end-to-end training and inference. This modification not only simplifies the usability of the tool, making it more accessible for a broader user group, but also augments the interpretability
of the entire framework.
Additionally, the computation of knockoff statistics is a nuanced aspect of the methodology. In our
implementation, we adopted the squared difference metric as proposed in the original work of model-X
knockoffs [26] to calculate the knockoff statistics. The intuition is that higher knockoff statistics indicate
greater importance of the corresponding features. While this metric has been demonstrated to be effective,
several alternatives have been proposed for such purpose. For example, He et al. [61] introduced a method
defined the knockoff statistics as the difference between the importance score and the median of all importance scores. This approach has been applied to detect common and rare risk variants in the analysis
of whole genome sequencing [61]. Therefore, an avenue for further exploration lies in the comparison of
different approaches for constructing knockoff statistics. By evaluating alternative metrics, we can systematically measure their performance and potential advantages over the existing squared difference metric.
64
This comparative analysis could shed light on areas for refinement or enhancement in the interpretability
and the performance of DeepLINK-T framework.
65
Chapter 5
Appendix
5.1 Appendix for chapter1: Metagenomic analyses of multiple gut datasets
revealed the association of phage signatures in colorectal cancer
0 20 40 60 80
Number of reads (×10
6
)
0.00
0.02
0.04
0.06
0.08
0.10
0.12
Proportion(%)
(A)Zeller
Disease
Control
CRC
10 20 30 40
Number of reads (×10
6
)
0.00
0.02
0.04
0.06
0.08
0.10
0.12
0.14
Proportion(%)
(B)Yu
Disease
Control
CRC
20 25 30 35
Number of reads (×10
6
)
0.00
0.01
0.02
0.03
0.04
0.05
0.06
0.07
Proportion(%)
(C)Feng
Disease
Control
CRC
0 20 40 60
Number of reads (×10
6
)
0.000
0.025
0.050
0.075
0.100
0.125
0.150
0.175
0.200
Proportion(%)
(D)Vogtmann
Disease
Control
CRC
20 40 60 80
Number of reads (×10
6
)
0.000
0.025
0.050
0.075
0.100
0.125
0.150
0.175
Proportion(%)
(E)Thomas
Disease
Control
CRC
10 20 30 40 50
Number of reads (×10
6
)
0.00
0.02
0.04
0.06
0.08
0.10
0.12
Proportion(%)
(F)Yachida
Disease
Control
CRC
15.0 17.5 20.0 22.5 25.0
Number of reads (×10
6
)
0.00
0.02
0.04
0.06
0.08
0.10
Proportion(%)
(G)Yang
Disease
Control
CRC
Figure 5.1: The distribution of sequence depths within each dataset used in this study.
66
0.15 0.20 0.25 0.30 0.35
Mapping rate
0.00
0.02
0.04
0.06
0.08
0.10
0.12
Proportion(%)
(A)Zeller
Disease
Control
CRC
0.2 0.3 0.4
Mapping rate
0.00
0.02
0.04
0.06
0.08
Proportion(%)
(B)Yu
Disease
Control
CRC
0.20 0.25 0.30 0.35 0.40
Mapping rate
0.00
0.02
0.04
0.06
0.08
0.10
0.12
0.14
Proportion(%)
(C)Feng
Disease
Control
CRC
0.20 0.25 0.30 0.35
Mapping rate
0.00
0.02
0.04
0.06
0.08
0.10
Proportion(%)
(D)Vogtmann
Disease
Control
CRC
0.1 0.2 0.3 0.4
Mapping rate
0.000
0.025
0.050
0.075
0.100
0.125
0.150
0.175
Proportion(%)
(E)Thomas
Disease
Control
CRC
0.25 0.30 0.35 0.40
Mapping rate
0.00
0.02
0.04
0.06
0.08
0.10
Proportion(%)
(F)Yachida
Disease
Control
CRC
0.25 0.30 0.35 0.40 0.45
Mapping rate
0.00
0.02
0.04
0.06
0.08
0.10
0.12
Proportion(%)
(G)Yang
Disease
Control
CRC
Figure 5.2: The distribution of mapping rates for each dataset used in this study.
67
RE Model
−1 −0.5 0 0.5 1 1.5
Standardized Mean Difference
Yang
Yachida
Thomas
Vogtmann
Feng
Yu
Zeller
0.21 [−0.07, 0.49]
−0.14 [−0.58, 0.30]
0.06 [−0.31, 0.43]
0.10 [−0.28, 0.49]
0.95 [ 0.55, 1.35]
0.29 [−0.06, 0.65]
0.23 [−0.06, 0.52]
RE Model 0.24 [ 0.01, 0.47]
−0.5 0 0.5 1 1.5
Standardized Mean Difference
Yang
Yachida
Thomas
Vogtmann
Feng
Yu
Zeller
0.42 [ 0.14, 0.71]
0.10 [−0.34, 0.53]
−0.06 [−0.43, 0.31]
0.12 [−0.27, 0.50]
0.70 [ 0.31, 1.10]
0.42 [ 0.07, 0.78]
0.22 [−0.07, 0.51]
0.28 [ 0.10, 0.46]
(C) (D)
Figure 5.3: Boxplots of (A) genus-level and (B) family-level Shannon diversity within each dataset. Pvalues were calculated using the two-tailed Wilcoxon rank-sum test. ns:p> 0.05, *p< 0.05, **p< 0.01,
***p< 0.001, ****p< 0.0001. (C) Forest plot showing effect sizes from a meta-analysis on genus-level
diversity (I
2 = 44.10%, Q test p-value= 0.094). (D) Forest plot showing effect sizes from a meta-analysis
on family-level diversity (I
2 = 66.51%, Q test p-value= 0.012). RE Model: Random effect model.
68
(B) (C)
RE Model
−0.4 −0.2 0 0.2 0.4 0.6 0.8 1
Standardized Mean Difference
Yachida
Yang
Yu
Thomas
Vogtmann
Feng
Zeller
0.23 [−0.21, 0.66]
0.34 [ 0.05, 0.62]
0.45 [ 0.10, 0.81]
0.17 [−0.20, 0.54]
0.21 [−0.18, 0.59]
0.38 [−0.01, 0.76]
0.18 [−0.11, 0.47]
0.28 [ 0.15, 0.41]
(A)
Figure 5.4: Analysis of viral species Heip evenness within each dataset. (A) Boxplots of viral species-level
Heip evenness for gut samples of CRC subjects and healthy controls stratified by disease status in each
dataset. BH adjusted p-values were calculated using the two-tailed Wilcoxon rank-sum test. ns:p> 0.05,
*p< 0.05, **p< 0.01, ***p< 0.001. (B) Multivariate analysis of the adjusted impact of age, gender and
BMI on Heip evenness. (C) Forest plot showing effect sizes from a meta-analysis on species-level evenness
(I
2 = 0.00%, Q test p-value= 0.89). RE Model: Random effect model.
69
(B) (C)
(A)
RE Model
−1 −0.5 0 0.5 1 1.5
Standardized Mean Difference
Yachida
Yang
Yu
Thomas
Vogtmann
Feng
Zeller
0.54 [ 0.09, 0.98]
0.15 [−0.13, 0.43]
−0.12 [−0.48, 0.23]
0.02 [−0.35, 0.39]
−0.24 [−0.63, 0.14]
0.68 [ 0.29, 1.07]
−0.07 [−0.36, 0.22]
0.12 [−0.12, 0.37]
Figure 5.5: Analysis of viral species Chao1 richness within each dataset. (A) Boxplots of viral specieslevelChao1 richness for gut samples of CRC subjects and healthy controls stratified by disease status in each
dataset. BH adjusted p-values were calculated using the two-tailed Wilcoxon rank-sum test. ns:p> 0.05,
*p< 0.05, **p< 0.01, ***p< 0.001. (B) Multivariate analysis of the adjusted impact of age, gender and
BMI on Chao1 richness. (C) Forest plot showing effect sizes from a meta-analysis on species-level richness
(I
2 = 69.93%, Q test p-value= 0.005). RE Model: Random effect model.
70
0.25 0.00 0.25 0.50
PCo1(37.3%)
0.4
0.3
0.2
0.1
0.0
0.1
0.2
PCo2(12.5%)
(A)Zeller(R
2=0.014,p=0.026)
Control
CRC
0.2 0.0 0.2 0.4
PCo1(30.6%)
0.0
0.2
0.4
0.6
PCo2(16.1%)
(B)Yu(R
2=0.032,p=0.003)
Control
CRC
0.00 0.25 0.50
PCo1(25.6%)
0.2
0.1
0.0
0.1
0.2
0.3
0.4
PCo2(18.3%)
(C)Feng(R
2=0.041,p=0.001)
Control
CRC
0.4 0.2 0.0 0.2
PCo1(28.0%)
0.3
0.2
0.1
0.0
0.1
0.2
0.3
PCo2(17.0%)
(D)Vogtmann(R
2=0.012,p=0.225)
Control
CRC
0.25 0.00 0.25
PCo1(25.0%)
0.4
0.2
0.0
0.2
PCo2(22.7%)
(E)Thomas(R
2=0.010,p=0.314)
Control
CRC
0.2 0.0 0.2
PCo1(25.8%)
0.3
0.2
0.1
0.0
0.1
0.2
0.3
PCo2(19.4%)
(F)Yachida(R
2=0.028,p=0.034)
Control
CRC
0.25 0.00 0.25
PCo1(28.3%)
0.2
0.1
0.0
0.1
0.2
0.3
PCo2(21.6%)
(G)Yang(R
2=0.025,p=0.001)
Control
CRC
Figure 5.6: PCoA plot of gut samples of CRC subjects and healthy controls for each separate dataset. R2
values and p-values on subtitles were calculated by PERMANOVA to quantify the separation of the samples
into two groups.
71
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns
*
**
ns
*
ns
ns
(A)Erwinia phage phiet88
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns
**
***
ns
ns
ns
*
(B)Klebsiella virus st16oxa48phi5-4
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns
ns
** * ns ns
ns
(C)Vibrio phage martha 12b12
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns ns
*
ns ns ns
ns
(D)Salmonella virus epsilon15
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
log(abundance)
*
ns
** ns ns ns
**
(E)Mannheimia phage vb_mhm_3927ap2
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns
*
**
ns
**
ns
ns
(F)Pseudomonas virus b3
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
log(abundance)
*
ns
* ns ns ns
**
(G)Salmonella phage 118970_sal3
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns
ns
*** * ns ns
ns
(H)Escherichia phage hk639
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns
ns
*** ns ns
ns
ns
(I)Enterobacteria phage phi80
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
*
ns
***
ns
*
ns
ns
(J)Enterobacteria phage es18
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns
*
*** ns
ns
ns
**
(K)Cronobacter phage phies15
Control
CRC
Figure 5.7: Boxplots showing the log transformed TMM normalized abundance of significant CRCassociated viral species within each dataset. P-values were calculated using the two-tailed Wilcoxon ranksum test. ns:p> 0.05, *p< 0.05, **p< 0.01, ***p< 0.001, ****p< 0.0001.
72
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns
ns
** ns * ns
ns
(A)Reginaelenavirus
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns
ns
***
ns
*
ns
*
(B)Seongnamvirus
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
*
ns
* ** ns
ns
ns
(C)Tequatrovirus
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
**
ns
*** ** ns
ns
ns
(D)Nonagvirus
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
*
ns
***
ns
ns ns
*
(E)Senquatrovirus
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
2.5
3.0
3.5
4.0
4.5
5.0
log(abundance)
ns
ns
* * * ns
*
(F)Punavirus
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
**
ns
***
ns
ns
ns
ns
(G)Irrigatiovirus
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns *
**
ns
ns
ns
ns
(H)Cuauhtlivirus
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns
**
*
ns
**
ns
***
(I)Aguilavirus
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns *
**
ns
*
ns
ns
(J)Beetrevirus
Control
CRC
Figure 5.8: Boxplots showing the log transformed TMM normalized abundance of significant CRCassociated viral genera within each dataset. P-values were calculated using the two-tailed Wilcoxon ranksum test. ns:p> 0.05, *p< 0.05, **p< 0.01, ***p< 0.001, ****p< 0.0001.
73
Yang
Vogtmann
Yu
Yachida
1
2
3
4
5
log(abundance)
ns
**
ns
ns
(A)Drexlerviridae
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1
2
3
4
5
log(abundance)
ns ns
*** ns
*
ns
ns
(B)Inoviridae
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
3.0
3.5
4.0
4.5
5.0
log(abundance)
ns
****
*
ns
****
ns
**
(C)Herelleviridae
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
5.4
5.6
5.8
6.0
6.2
log(abundance)
ns
ns
ns
* *
ns
ns
(D)Myoviridae
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
4.5
5.0
5.5
6.0
log(abundance)
ns
* ns
ns ***
ns
ns
(E)Podoviridae
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
5.5
5.6
5.7
5.8
5.9
6.0
6.1
6.2
log(abundance)
ns
ns
ns
*
**
ns
*
(F)Siphoviridae
Control
CRC
Figure 5.9: Boxplots showing the log transformed TMM normalized abundance of significant CRCassociated viral families ((A) Drexlerviridae, (B) Inoviridae, (C) Herelleviridae)) and families to which
significant CRC-associated viral species belong ((D) Myoviridae, (E) Podoviridae, (F) Siphoviridae)) within
each dataset. P-values were calculated using the two-tailed Wilcoxon rank-sum test. ns:p> 0.05, *p< 0.05,
**p< 0.01, ***p< 0.001, ****p< 0.0001.
74
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
log(abundance)
ns
ns
ns
ns
ns
ns
ns
(A)Stearate biosynthesis ii
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
log(abundance)
ns ns
ns
ns
ns
ns
*
(B)Oleate biosynthesis iv
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
log(abundance)
ns ns
ns
ns
ns
ns
*
(C)Fatty acid elongation -- saturated
Control
CRC
Zeller
Vogtmann
Thomas
Feng
Yu
Yang
Yachida
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
log(abundance)
ns ns
ns
ns
ns
ns
*
(D)Palmitoleate biosynthesis i
Control
CRC
Zeller
Vogtmann
Thomas
Yu
Feng
Yang
Yachida
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
log(abundance)
ns ns
ns
ns
ns
ns
*
(E)(5z)-dodecenoate biosynthesis i
Control
CRC
Yu
Vogtmann
Feng
Yachida
Yang
Thomas
Zeller
1.5
2.0
2.5
3.0
3.5
4.0
4.5
log(abundance)
ns
**
*
***
*
ns
****
(F)L-methionine biosynthesis iii
Control
CRC
Yang
Yachida
Feng
Vogtmann
Yu
Thomas
Zeller
1.5
2.0
2.5
3.0
3.5
4.0
4.5
log(abundance)
ns
****
**
**
*
ns
ns
(G)Pyruvate fermentation to acetate and lactate ii
Control
CRC
Figure 5.10: Boxplots showing the log transformed HUMAnN3 pathway abundance of significant pathways
within each dataset. P-values were calculated using the two-tailed Wilcoxon rank-sum test. ns:p> 0.05,
*p< 0.05, **p< 0.01, ***p< 0.001, ****p< 0.0001.
75
RE Model
−1 −0.5 0 0.5 1 1.5
Standardized Mean Difference
Yang
Yachida
Thomas
Vogtmann
Feng
Yu
Zeller
−0.15 [−0.44, 0.13]
0.69 [ 0.24, 1.14]
−0.25 [−0.62, 0.12]
−0.29 [−0.67, 0.10]
0.23 [−0.15, 0.61]
−0.25 [−0.60, 0.10]
−0.36 [−0.65, −0.07]
−0.07 [−0.33, 0.18] RE Model
−1 −0.5 0 0.5 1 1.5
Standardized Mean Difference
Yang
Yachida
Thomas
Vogtmann
Feng
Yu
Zeller
−0.41 [−0.69, −0.12]
0.44 [−0.00, 0.88]
0.09 [−0.28, 0.46]
−0.25 [−0.64, 0.13]
0.83 [ 0.43, 1.22]
−0.00 [−0.35, 0.35]
−0.14 [−0.43, 0.15]
0.06 [−0.25, 0.38]
(B)
(A)
(C) (D)
Figure 5.11: Alpha diversity of bacterial species. (A) Boxplots showing bacterial Shannon index. (B) Boxplots showing bacterial richness. P-values were calculated using the two-tailed Wilcoxon rank-sum test.
ns:p> 0.05, *p< 0.05, **p< 0.01, ***p< 0.001, ****p< 0.0001. (C) Forest plot showing effect sizes from a
meta-analysis on species-level bacterial diversity (meta-analysis I
2 = 73.31%, Q test p-value = 0.0029).
(D) Forest plot showing effect sizes from a meta-analysis on species-level bacterial richness (meta-analysis
I
2 = 81.82%, Q test p-value < 0.0001)
. RE Model: Random effect model.
76
Zeller
Yu
Feng
Vogtmann
Thomas
Yachida
Yang
Genus-level test set
Zeller
Yu
Feng
Vogtmann
Thomas
Yachida
Yang
Genus-level training set
0.755 0.621 0.691 0.601 0.601 0.621 0.798
0.650 0.704 0.636 0.542 0.598 0.675 0.725
0.670 0.583 0.773 0.567 0.648 0.647 0.573
0.591 0.586 0.612 0.614 0.682 0.565 0.647
0.596 0.547 0.670 0.614 0.707 0.573 0.643
0.622 0.623 0.640 0.550 0.583 0.641 0.647
0.733 0.667 0.647 0.535 0.567 0.606 0.861
(A)
Zeller
Yu
Feng
Vogtmann
Thomas
Yachida
Yang
Family-level test set
Family-level training set
0.636 0.590 0.636 0.554 0.551 0.600 0.655
0.623 0.588 0.590 0.534 0.568 0.565 0.628
0.663 0.510 0.711 0.601 0.575 0.655 0.550
0.559 0.501 0.597 0.622 0.631 0.541 0.669
0.561 0.540 0.552 0.627 0.551 0.508 0.622
0.622 0.568 0.734 0.517 0.513 0.560 0.634
0.655 0.665 0.742 0.640 0.597 0.598 0.683
(B)
Zeller
Yu
Feng
Vogtmann
Thomas
Yachida
Yang
Pathway test set
Pathway training set
0.788 0.602 0.707 0.562 0.593 0.551 0.619
0.585 0.721 0.509 0.504 0.582 0.628 0.555
0.715 0.537 0.638 0.501 0.552 0.537 0.550
0.511 0.528 0.564 0.579 0.554 0.534 0.603
0.584 0.622 0.576 0.556 0.604 0.586 0.580
0.560 0.677 0.516 0.551 0.636 0.673 0.648
0.549 0.564 0.632 0.639 0.544 0.589 0.790
(C)
Zeller Yu Feng Vogtmann Thomas Yachida Yang
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
AUROC
(D)
Genus level Family level Pathway
0.5
0.6
0.7
0.8
0.9
1.0
Figure 5.12: Prediction performances of random forest classifiers based on gut viral abundance. (A) Within
and cross study AUROC matrix obtained by using genus-level abundance. The diagonal refers to results
of cross validation within each dataset. Off-diagonal values refer to prediction results trained on the study
on each row and tested on the study on each column. (B) Within and cross study AUROC matrix obtained
by using family-level abundance. (C) Within and cross study AUROC matrix obtained by using pathway
abundance. (D) LODO results with the x axis indicating the study left out as the validation set and other
studies combined as the training set.
77
Zeller
Yu
Feng
Vogtmann
Thomas
Yachida
Yang
Bacterial species test set
Zeller
Yu
Feng
Vogtmann
Thomas
Yachida
Yang
Bacterial species training set
0.891 0.839 0.859 0.700 0.620 0.801 0.947
0.805 0.897 0.837 0.749 0.717 0.826 0.909
0.737 0.767 0.946 0.697 0.740 0.795 0.782
0.726 0.831 0.781 0.698 0.747 0.738 0.860
0.687 0.795 0.805 0.691 0.759 0.701 0.845
0.738 0.797 0.830 0.705 0.663 0.809 0.817
0.849 0.873 0.877 0.755 0.700 0.740 0.978
(A)
Zeller
Yu
Feng
Vogtmann
Thomas
Yachida
Yang
Bacterial species and viral species test set
Bacterial species and viral species training set0.884 0.788 0.826 0.635 0.659 0.717 0.947
0.754 0.839 0.759 0.626 0.625 0.765 0.871
0.687 0.706 0.908 0.641 0.668 0.696 0.754
0.607 0.693 0.710 0.647 0.642 0.628 0.718
0.620 0.569 0.702 0.603 0.718 0.633 0.718
0.691 0.777 0.671 0.654 0.662 0.749 0.746
0.813 0.813 0.850 0.661 0.702 0.784 0.958
(B)
Zeller
Yu
Feng
Vogtmann
Thomas
Yachida
Yang
Bacterial species and viral genome test set
Bacterial species and viral genome training set0.852 0.763 0.759 0.631 0.605 0.664 0.906
0.668 0.790 0.605 0.594 0.545 0.628 0.740
0.615 0.584 0.917 0.579 0.595 0.659 0.503
0.643 0.720 0.668 0.666 0.588 0.605 0.680
0.616 0.628 0.625 0.610 0.657 0.594 0.714
0.695 0.705 0.616 0.565 0.639 0.677 0.667
0.748 0.765 0.726 0.596 0.625 0.649 0.928
(C)
Zeller Feng Vogtmann Thomas Yu Yang Yachida 0.0
0.2
0.4
0.6
0.8
AUROC
(D)
Bacterial species Bacterial species and viral species Bacterial species and viral genome
0.5
0.6
0.7
0.8
0.9
1.0
Figure 5.13: Prediction performances of random forest classifiers based on gut microbial abundance. (A)
Within and cross study AUROC matrix obtained by using bacterial species-level abundance. The diagonal refers to results of cross validation within each dataset. Off-diagonal values refer to prediction results
trained on the study on each row and tested on the study on each column. (B) Within and cross study AUROC matrix obtained by using both bacterial and viral species-level abundance profiles. (C) Within and
cross study AUROC matrix obtained by using both bacterial species-level and viral genome-level abundance profiles. (D) LODO results with the x axis indicating the study left out as the validation set and
other studies combined as the training set.
5.2 Appendix for Chapter 2: DeepLINK-T: deep learning inference for
time series data using knockoffs and LSTM
5.2.1 The impacts of hyperparameters and model misspecification on DeepLINK-T
To evaluate the ability of DeepLINK-T to deal with model misspecification, we introduced latent confounders into the linear or nonlinear regression model utilized in our simulation studies. Figures 5.14 and
78
linear link function nonlinear link function
A B C D
Figure 5.14: The impacts of bottleneck dimensionality and training epochs on DeepLINK-T using the linear
factor model. Latent confouders are added to make the regression model misspecified. A and B are FDR
and power under the setting of linear link function. C and D are FDR and power under the setting of
nonlinear link function. The number of training epochs of both LSTM autoencoder and LSTM prediction
network is specified on the x-axis. The bottleneck dimensionality of the LSTM autoencoder is specified on
the y-axis. The pre-specified FDR level is q = 0.2.
5.15 show the results using linear and logistic factor model, respectively. Subfigures A and B present FDR+
and power+ with the linear link function, while subfigures C and D show FDR+ and power+ using the
nonlinear link function. In each subfigure, the x-axis denotes the number of training epochs for both networks in DeepLINK-T and the y-axis indicates the number of bottleneck layer dimensions in the LSTM
autoencoder. Each cell in the heatmap reflects the mean value of FDR+ or power+ based on 50 runs of
DeepLINK-T. After introducing the latent confounders, an increased number of epochs is required for the
model to converge during the training process. However, after training the networks for 1000 epochs,
DeepLINK-T demonstrates the ability to achieve high power while effectively controlling the FDR, even in
the presence of model misspecification.
5.2.2 Application to longitudinal gut microbiome data set of early infants
To evaluate the performance of DeepLINK-T on real-world application, we first applied it to a longitudinal
gut microbiome data set of early infants. We aimed to identify important bacterial genera associated with
the abundance levels of Bacteroides and Bifidobacterium. For each response genus, DeepLINK-T was fit on
79
linear link function nonlinear link function
A B C D
Figure 5.15: The impacts of bottleneck dimensionality and training epochs on DeepLINK-T using the logistic factor model. Latent confouders are added to make the regression model misspecified. A and B are
FDR and power under the setting of linear link function. C and D are FDR and power under the setting of
nonlinear link function. The number of training epochs of both LSTM autoencoder and LSTM prediction
network is specified on the x-axis. The bottleneck dimensionality of the LSTM autoencoder is specified on
the y-axis. The pre-specified FDR level is q = 0.2.
the data set for 200 times. The histograms of selection frequencies for associated genera are shown in Figure
5.16, which showcase that DeepLINK-T is able to consistently select significant bacterial genera associated
with the response. The abundance variation of the top 10 selection genera along with the response is
shown in Figures 5.17 and 5.18. Most selected bacterial genera share similar or opposite variation patterns
with their response genus.
5.2.3 Application to marine metagenomic time series data set
Next, we applied DeepLINK-T to a marine metagenomic time series data set from the San Pedro Ocean,
where exploring the association between the abundance of chloroplast and chlorophyll-a concentration is
our primary interest. Similar to the first application, Figure 5.19 shows the selection frequencies of important taxa associated with the chlorophyll-a concentration at the genus, family and order levels. Although
the top selection frequencies are not comparable to those in the first application, Figure 5.20 delineates
that the variation in the abundance of top-selected taxa did align with the variation in chlorophyll-a concentration.
80
Figure 5.16: The distribution of selection frequencies on the gut microbiome data set of early infants over
200 runs.
We also leveraged the same dataset to investigate the relationship between the abundance of prokaryotes and leucine incorporation rate, a metric used to quantify the prokaryotic production. Similarly, Figure
5.21 illustrates the selection frequencies of taxa associated with the leucine incorporation rate at the genus,
family and order levels. Notably, the top selection frequencies surpass those observed for chlorophyll-a
concentration. Further insights are provided in Figure 5.20, where the variation trend in the abundance of
most top-selected taxa closely aligned with fluctuations in the leucine incorporation rate.
81
0 5 10 15 20 25
Month
1
0
1
2
3
4
5
value
Genus
g__Parabacteroides
g__Bacteroides
0 5 10 15 20 25
Month
2
1
0
1
2
3
4
5
value
Genus
g__Epulopiscium
g__Bacteroides
0 5 10 15 20 25
Month
2
1
0
1
2
3
4
5
value
Genus
g__Megasphaera
g__Bacteroides
0 5 10 15 20 25
Month
2
1
0
1
2
3
4
5
value
Genus
g__Phascolarctobacterium
g__Bacteroides
0 5 10 15 20 25
Month
2
1
0
1
2
3
4
5
value
Genus
g__Fusobacterium
g__Bacteroides
0 5 10 15 20 25
Month
2
1
0
1
2
3
4
5
value
Genus
g__Rothia
g__Bacteroides
0 5 10 15 20 25
Month
2
1
0
1
2
3
4
5
value
Genus
g__
g__Bacteroides
0 5 10 15 20 25
Month
2
1
0
1
2
3
4
5
value
Genus
g__Prevotella
g__Bacteroides
0 5 10 15 20 25
Month
1
0
1
2
3
4
5
value
Genus
g__Coprococcus
g__Bacteroides
0 5 10 15 20 25
Month
2
1
0
1
2
3
4
5
value
Genus
g__SMB53
g__Bacteroides
Figure 5.17: The abundance variation of Bacteroides and its top 10 selected genera.
0 5 10 15 20 25
Month
2
1
0
1
2
3
4
5
6
value
Genus
g__SMB53
g__Bifidobacterium
0 5 10 15 20 25
Month
1
0
1
2
3
4
5
6
value
Genus
g__
g__Bifidobacterium
0 5 10 15 20 25
Month
2
1
0
1
2
3
4
5
6
value
Genus
g__
g__Bifidobacterium
0 5 10 15 20 25
Month
2
0
2
4
6
value
Genus
g__Rothia
g__Bifidobacterium
0 5 10 15 20 25
Month
2
0
2
4
6
value
Genus
g__Lactobacillus
g__Bifidobacterium
0 5 10 15 20 25
Month
2
1
0
1
2
3
4
5
6
value
Genus
g__Anaerotruncus
g__Bifidobacterium
0 5 10 15 20 25
Month
1
0
1
2
3
4
5
6
value
Genus
g__[Eubacterium]
g__Bifidobacterium
0 5 10 15 20 25
Month
0
1
2
3
4
5
6
value
Genus
g__[Ruminococcus]
g__Bifidobacterium
0 5 10 15 20 25
Month
2
1
0
1
2
3
4
5
6
value
Genus
g__
g__Bifidobacterium
0 5 10 15 20 25
Month
0
1
2
3
4
5
6
value
Genus
g__Clostridium
g__Bifidobacterium
Figure 5.18: The abundance variation of Bifidobacterium and its top 10 selected genera.
82
0
10
20
30
40
50
Frequency
0
5
10
15
20
25
Count of genus
A) Genus level
0
10
20
30
40
Frequency
0
5
10
15
20
25
30
Count of family
B) Family level
0
10
20
30
40
50
Frequency
0
2
4
6
8
10
12
14
16
Count of order
C) Order level
Figure 5.19: The distribution of selection frequencies for response chlorophyll-a concentration on the SPOT
data set over 200 runs.
83
4
2
0
2
4
6
8
f__Chattonellaceae;g__Heterosigma
Chlorophyll-a concentration
4
2
0
2
4
6
8
f__Pyramimonadaceae;g__Pyramimonas
Chlorophyll-a concentration
4
2
0
2
4
6
8
f__Rhizosoleniaceae;g__Rhizosolenia
Chlorophyll-a concentration
4
2
0
2
4
6
8
a
b
u
n
d
a
n
c
e
or c
o
n
c
e
ntratio
n (m
g/m3
)
o__Chlorarachniophyceae;f__Chlorarachniaceae
Chlorophyll-a concentration
6
4
2
0
2
4
6
8
o__Corethrales;f__Corethraceae
Chlorophyll-a concentration
4
2
0
2
4
6
8
o__Chattonellales;f__Chattonellaceae
Chlorophyll-a concentration
4
2
0
2
4
6
8
c__Filosa-Chlorarachnea;o__Chlorarachniophyceae
Chlorophyll-a concentration
Time
4
2
0
2
4
6
8
c__Prasinophyceae;o__Prasino-clade-9
Chlorophyll-a concentration
6
4
2
0
2
4
6
8
c__Bacillariophyta;o__Corethrales
Chlorophyll-a concentration
Figure 5.20: The variation of chlorophyll-a concentration and the abundance of its top 3 selected taxa at
the genus, family, and order levels.
84
Genus Frequency
f__Flavobacteriaceae;g__Tenacibaculum 74
f__uncultured;g__uncultured 41
f__Rhodobacteraceae;g__Planktomarina 36
f__Flavobacteriaceae;g__Polaribacter 24
f__Rhodobacteraceae;g__Planktotalea 22
f__Rhodobacteraceae;g__Amylibacter 17
f__Halieaceae;g__Luminiphilus 16
f__Clade_I;g__Clade_Ib 13
f__Flavobacteriaceae;g__Aquibacter 12
f__Flavobacteriaceae;g__Psychroserpens 11
Family Frequency
o__Defluviicoccales;f__uncultured 93
o__Flavobacteriales;f__Flavobacteriaceae 55
o__Cellvibrionales;f__Halieaceae 54
o__Flavobacteriales;f__Cryomorphaceae 44
o__Caulobacterales;f__Hyphomonadaceae 37
o__Rhodobacterales;f__Rhodobacteraceae 34
o__SAR11_clade;f__Clade_III 29
o__Balneolales;f__Balneolaceae 26
o__SAR11_clade;f__Clade_IV 23
o__Cellvibrionales;f__Porticoccaceae 20
Order Frequency
c__Alphaproteobacteria;o__Defluviicoccales 89
c__Alphaproteobacteria;o__Caulobacterales 51
c__Gammaproteobacteria;o__Cellvibrionales 47
c__Alphaproteobacteria;o__Rhodobacterales 46
c__Rhodothermia;o__Balneolales 43
c__Bacteroidia;o__Flavobacteriales 30
c__Gammaproteobacteria;o__Thiomicrospirales 27
c__BD7-11;o__BD7-11 25
c__Oligoflexia;o__Oligoflexales 25
c__Marinimicrobia_(SAR406_clade);o__Marinimicrobia_(SAR406_clade) 25
Table 5.1: Selection frequencies of top 10 selected taxa for leucine incorporation at the genus, family, and
order levels over 200 runs.
85
0
10
20
30
40
50
60
70
Frequency
0
10
20
30
40
50
60
Count of genus
A) Genus level
0
20
40
60
80
Frequency
0
10
20
30
40
50
Count of family
B) Family level
0
20
40
60
80
Frequency
0
5
10
15
20
25
30
35
40
Count of order
C) Order level
Figure 5.21: The distribution of selection frequencies for response leucine incorporation on the SPOT data
set over 200 runs.
86
2
0
2
4
6
f__Flavobacteriaceae;g__Tenacibaculum
Leucine incorporation
2
0
2
4
6
f__uncultured;g__uncultured
Leucine incorporation
2
0
2
4
6
f__Rhodobacteraceae;g__Planktomarina
Leucine incorporation
2
0
2
4
6
CLR(Count) or leucine incorporation
o__Defluviicoccales;f__uncultured
Leucine incorporation
0
2
4
6
8
o__Flavobacteriales;f__Flavobacteriaceae
Leucine incorporation
0
1
2
3
4
5
6
7
o__Cellvibrionales;f__Halieaceae
Leucine incorporation
2
0
2
4
6
c__Alphaproteobacteria;o__Defluviicoccales
Leucine incorporation
Time
2
0
2
4
6
c__Alphaproteobacteria;o__Caulobacterales
Leucine incorporation
0
2
4
6
8 c__Gammaproteobacteria;o__Cellvibrionales
Leucine incorporation
Figure 5.22: The variation of leucine incorporation and the abundance of its top 3 selected taxa at the
genus, family, and order levels.
87
5.2.4 Application to dietary glycans treatment time series data set
Finally, DeepLINK-T was applied to detect important bacterial taxa associated with two different dietary
glycans treatments in a time series human gut metagenomic data set. Again, the application was implemented at the genus, family, and order levels. Figure 5.23 shows the selection frequencies of important
taxa associated with fructooligosaccharides (FOS) and polydextrose (PDX) treatments at the three taxa
levels. The results further demonstrate that DeepLINK-T can reliably select top features associated with
the two treatments. The variation for the abundance level of top selected taxa categorized by FOS and
PDX is shown in Figures 5.24-5.26, which additionally display the important bacterial taxa targeted by the
two different treatments.
88
0
25
50
75
100
125
150
175
Frequency
0
20
40
60
80
100
Count of genus
A) Genus level
0
20
40
60
80
100
120
140
160
Frequency
0
10
20
30
40
50
Count of family
B) Family level
0
20
40
60
80
100
120
140
160
Frequency
0.0
2.5
5.0
7.5
10.0
12.5
15.0
17.5
Count of order
C) Order level
Figure 5.23: The distribution of selection frequencies on the dietary glycans treatment time series data set
over 200 runs.
89
0 10 20 30 40
Time
0
1
2
3
4
5
value
A) f__Oscillospiraceae;g__Oscillibacter
Treatment
FOS
PDX
0 10 20 30 40
Time
0.5
0.0
0.5
1.0
1.5
value
B) f__Micrococcaceae;g__Rothia
Treatment
FOS
PDX
0 10 20 30 40
Time
0.0
0.5
1.0
1.5
2.0
2.5
3.0
value
C) f__Campylobacteraceae;g__Campylobacter
Treatment
FOS
PDX
0 10 20 30 40
Time
0.75
0.50
0.25
0.00
0.25
0.50
0.75
1.00
1.25
value
D) f__Staphylococcaceae;g__Staphylococcus
Treatment
FOS
PDX
0 10 20 30 40
Time
0.4
0.2
0.0
0.2
0.4
0.6
0.8
value
E) f__Spirochaetaceae;g__Treponema
Treatment
FOS
PDX
0 10 20 30 40
Time
0
1
2
3
4
5
value
F) f__Odoribacteraceae;g__Odoribacter
Treatment
FOS
PDX
0 10 20 30 40
Time
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
value
G) f__Ruminococcaceae;g__Angelakisella
Treatment
FOS
PDX
0 10 20 30 40
Time
0
1
2
3
4
value
H) f__;g__Intestinimonas
Treatment
FOS
PDX
0 10 20 30 40
Time
0.50
0.25
0.00
0.25
0.50
0.75
1.00
1.25
1.50
value
I) f__Carnobacteriaceae;g__Granulicatella
Treatment
FOS
PDX
0 10 20 30 40
Time
0
1
2
3
4
5
6
value
J) f__Ruminococcaceae;g__Ruminiclostridium
Treatment
FOS
PDX
Figure 5.24: The abundance variation of the top 10 selected genera between the FOS and PDX groups.
0 10 20 30 40
Time
0
1
2
3
4
5
6
value
A) o__Clostridiales;f__Oscillospiraceae
Treatment
FOS
PDX
0 10 20 30 40
Time
0.0
0.5
1.0
1.5
2.0
2.5
3.0
value
B) o__Campylobacterales;f__Campylobacteraceae
Treatment
FOS
PDX
0 10 20 30 40
Time
1.0
0.5
0.0
0.5
1.0
1.5
value
C) o__Bacillales;f__Staphylococcaceae
Treatment
FOS
PDX
0 10 20 30 40
Time
1.0
0.5
0.0
0.5
1.0
value
D) o__Spirochaetales;f__Spirochaetaceae
Treatment
FOS
PDX
0 10 20 30 40
Time
0.5
0.0
0.5
1.0
1.5
2.0
2.5
value
E) o__Micrococcales;f__Micrococcaceae
Treatment
FOS
PDX
0 10 20 30 40
Time
1
0
1
2
3
4
5
value
F) o__Bacteroidales;f__Odoribacteraceae
Treatment
FOS
PDX
0 10 20 30 40
Time
1.75
1.50
1.25
1.00
0.75
0.50
0.25
0.00
value
G) o__Synergistales;f__Synergistaceae
Treatment
FOS
PDX
0 10 20 30 40
Time
1.6
1.4
1.2
1.0
0.8
0.6
0.4
value
H) o__Brachyspirales;f__Brachyspiraceae
Treatment
FOS
PDX
0 10 20 30 40
Time
1.6
1.4
1.2
1.0
0.8
0.6
value
I) o__Mycoplasmatales;f__Mycoplasmataceae
Treatment
FOS
PDX
0 10 20 30 40
Time
0
1
2
3
4
5
6
7
value
J) o__Enterobacterales;f__Enterobacteriaceae
Treatment
FOS
PDX
Figure 5.25: The abundance variation of the top 10 selected families between the FOS and PDX groups.
90
0 10 20 30 40
Time
0.5
0.0
0.5
1.0
1.5
2.0
2.5
3.0
value
A) c__Epsilonproteobacteria;o__Campylobacterales
Treatment
FOS
PDX
0 10 20 30 40
Time
2.0
1.8
1.6
1.4
1.2
1.0
0.8
0.6
value
B) c__Spirochaetia;o__Brachyspirales
Treatment
FOS
PDX
0 10 20 30 40
Time
2.00
1.75
1.50
1.25
1.00
0.75
0.50
0.25
value
C) c__Synergistia;o__Synergistales
Treatment
FOS
PDX
0 10 20 30 40
Time
1.8
1.6
1.4
1.2
1.0
0.8
0.6
0.4
0.2
value
D) c__Chlamydiia;o__Chlamydiales
Treatment
FOS
PDX
0 10 20 30 40
Time
2.0
1.8
1.6
1.4
1.2
1.0
0.8
0.6
value
E) c__Mollicutes;o__Mycoplasmatales
Treatment
FOS
PDX
0 10 20 30 40
Time
2.0
1.5
1.0
0.5
0.0
0.5
1.0
value
F) c__Gammaproteobacteria;o__Aeromonadales
Treatment
FOS
PDX
0 10 20 30 40
Time
1.75
1.50
1.25
1.00
0.75
0.50
0.25
value
G) c__Gammaproteobacteria;o__Oceanospirillales
Treatment
FOS
PDX
0 10 20 30 40
Time
1.75
1.50
1.25
1.00
0.75
0.50
0.25
0.00
value
H) c__Actinobacteria;o__Streptomycetales
Treatment
FOS
PDX
0 10 20 30 40
Time
1.5
1.0
0.5
0.0
0.5
value
I) c__Gammaproteobacteria;o__Vibrionales
Treatment
FOS
PDX
0 10 20 30 40
Time
1.0
0.5
0.0
0.5
1.0
1.5
2.0
value
J) c__Actinobacteria;o__Propionibacteriales
Treatment
FOS
PDX
Figure 5.26: The abundance variation of the top 10 selected orders between the FOS and PDX groups.
91
Bibliography
[1] Jiyoung Ahn, Rashmi Sinha, Zhiheng Pei, Christine Dominianni, Jing Wu, Jianxin Shi,
James J Goedert, Richard B Hayes, and Liying Yang. “Human gut microbiome and risk for
colorectal cancer”. In: Journal of the National Cancer Institute 105.24 (2013), pp. 1907–1911.
[2] Olaitan Akintunde, Trichina Tucker, and Valerie J Carabetta. “The evolution of next-generation
sequencing technologies”. In: arXiv preprint arXiv:2305.08724 (2023).
[3] Alexandre Almeida, Stephen Nayfach, Miguel Boland, Francesco Strozzi, Martin Beracochea,
Zhou Jason Shi, Katherine S Pollard, Ekaterina Sakharova, Donovan H Parks, Philip Hugenholtz,
et al. “A unified catalog of 204,938 reference genomes from the human gut microbiome”. In:
Nature Biotechnology 39.1 (2021), pp. 105–114.
[4] Marti J Anderson. “A new method for non-parametric multivariate analysis of variance”. In:
Austral Ecology 26.1 (2001), pp. 32–46.
[5] Marie-Claire Arrieta, Leah T Stiemsma, Pedro A Dimitriu, Lisa Thorson, Shannon Russell,
Sophie Yurist-Doutsch, Boris Kuzeljevic, Matthew J Gold, Heidi M Britton, Diana L Lefebvre, et al.
“Early infancy microbial and metabolic alterations affect risk of childhood asthma”. In: Science
Translational Medicine 7.307 (2015), 307ra152.
[6] Kevin R Arrigo. “Marine microorganisms and global nutrient cycles”. In: Nature 437.7057 (2005),
pp. 349–355.
[7] Ignasi Azagra-Boronat, Malén Massot-Cladera, Karen Knipping, Belinda van ‘t Land,
Sebastian Tims, Bernd Stahl, Jan Knol, Johan Garssen, Àngels Franch, Margarida Castell, et al.
“Oligosaccharides modulate rotavirus-associated dysbiosis and TLR gene expression in neonatal
rats”. In: Cells 8.8 (2019), p. 876.
[8] Xin Bai, Jie Ren, Yingying Fan, and Fengzhu Sun. “KIMI: Knockoff Inference for Motif
Identification from molecular sequences with controlled false discovery rate”. In: Bioinformatics
37.6 (2021), pp. 759–766.
[9] Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin,
Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D Prjibelski, et al.
“SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing”. In:
Journal of computational biology 19.5 (2012), pp. 455–477.
92
[10] Rina Foygel Barber and Emmanuel J Candès. “Controlling the false discovery rate via knockoffs”.
In: The Annals of Statistics 43 (2015), pp. 2055–2085.
[11] Larry L Barton and Diana E Northup. Microbial ecology. John Wiley & Sons, 2011.
[12] Jakub Barylski, François Enault, Bas E Dutilh, Margo BP Schuller, Robert A Edwards,
Annika Gillis, Jochen Klumpp, Petar Knezevic, Mart Krupovic, Jens H Kuhn, et al. “Taxonomy
proposal: To create one (1) new family, Herelleviridae, in the order Caudovirales”. In: ICTV
Online: International Committee on Taxonomy of Viruses (ICTV) (2018).
[13] Jakub Barylski, Andrew M Kropinski, Nabil-Fareed Alikhan, Evelien M Adriaenssens, and
ICTV Report Consortium. “ICTV virus taxonomy profile: herelleviridae”. In: The Journal of
general virology 101.4 (2020), p. 362.
[14] Francesco Beghini, Lauren J McIver, Aitor Blanco-Míguez, Leonard Dubois, Francesco Asnicar,
Sagun Maharjan, Ana Mailyan, Paolo Manghi, Matthias Scholz, Andrew Maltez Thomas, et al.
“Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities
with bioBakery 3”. In: Elife 10 (2021), e65088.
[15] Yoav Benjamini and Yosef Hochberg. “Controlling the false discovery rate: a practical and
powerful approach to multiple testing”. In: Journal of the Royal statistical society: series B
(Methodological) 57.1 (1995), pp. 289–300.
[16] Yoav Benjamini, Abba M Krieger, and Daniel Yekutieli. “Adaptive linear step-up procedures that
control the false discovery rate”. In: Biometrika 93.3 (2006), pp. 491–507.
[17] Yoav Benjamini and Daniel Yekutieli. “The control of the false discovery rate in multiple testing
under dependency”. In: Annals of statistics (2001), pp. 1165–1188.
[18] Aitor Blanco-Míguez, Francesco Beghini, Fabio Cumbo, Lauren J McIver, Kelsey N Thompson,
Moreno Zolfo, Paolo Manghi, Leonard Dubois, Kun D Huang, Andrew Maltez Thomas, et al.
“Extending and improving metagenomic taxonomic profiling with uncharacterized species using
MetaPhlAn 4”. In: Nature Biotechnology (2023), pp. 1–12.
[19] Nicholas A Bokulich, Jennifer Chung, Thomas Battaglia, Nora Henderson, Melanie Jay, Huilin Li,
Arnon D. Lieber, Fen Wu, Guillermo I Perez-Perez, Yu Chen, et al. “Antibiotics, birth mode, and
diet shape microbiome maturation during early life”. In: Science Translational Medicine 8.343
(2016), 343ra82.
[20] John R ten Bosch and Wayne W Grody. “Keeping up with the next generation: massively parallel
sequencing in clinical diagnostics”. In: The Journal of Molecular Diagnostics 10.6 (2008),
pp. 484–492.
[21] Freddie Bray, Jacques Ferlay, Isabelle Soerjomataram, Rebecca L Siegel, Lindsey A Torre, and
Ahmedin Jemal. “Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality
worldwide for 36 cancers in 185 countries”. In: CA: a cancer journal for clinicians 68.6 (2018),
pp. 394–424.
93
[22] J Roger Bray and John T Curtis. “An ordination of the upland forest communities of southern
Wisconsin”. In: Ecological monographs 27.4 (1957), pp. 325–349.
[23] Leo Breiman. “Random forests”. In: Machine learning 45 (2001), pp. 5–32.
[24] Benjamin Buchfink, Klaus Reuter, and Hajk-Georg Drost. “Sensitive protein alignments at
tree-of-life scale using DIAMOND”. In: Nature Methods 18.4 (2021), pp. 366–368.
[25] Luis F Camarillo-Guerrero, Alexandre Almeida, Guillermo Rangel-Pineros, Robert D Finn, and
Trevor D Lawley. “Massive expansion of human gut bacteriophage diversity”. In: Cell 184.4
(2021), pp. 1098–1109.
[26] Emmanuel Candes, Yingying Fan, Lucas Janson, and Jinchi Lv. “Panning for
gold:‘model-X’knockoffs for high dimensional controlled variable selection”. In: Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 80.3 (2018), pp. 551–577.
[27] Adrian Canizalez-Roman, Juan E Reina-Reyes, Uriel A Angulo-Zamudio,
Eloy E Geminiano-Martínez, Antonio F Flores-Carrillo, Rolando R García-Matus,
Norma M Valencia-Mijares, Nidia Leon-Sicairos, Jorge Velazquez-Roman,
Francisco A Martínez-Villa, et al. “Prevalence of Cyclomodulin-Positive E. coli and Klebsiella spp.
Strains in Mexican Patients with Colon Diseases and Antimicrobial Resistance”. In: Pathogens
11.1 (2022), p. 14.
[28] J Gregory Caporaso, Justin Kuczynski, Jesse Stombaugh, Kyle Bittinger, Frederic D Bushman,
Elizabeth K Costello, Noah Fierer, Antonio Gonzalez Peña, Julia K Goodrich, Jeffrey I Gordon,
et al. “QIIME allows analysis of high-throughput community sequencing data”. In: Nature
Methods 7.5 (2010), pp. 335–336.
[29] Anne Chao. “Nonparametric estimation of the number of classes in a population”. In:
Scandinavian Journal of statistics (1984), pp. 265–270.
[30] Beidi Chen, Luxi Sun, and Xuan Zhang. “Integration of microbiome and epigenome to decipher
the pathogenesis of autoimmune diseases”. In: Journal of autoimmunity 83 (2017), pp. 31–42.
[31] Ling Chen, Jie Ren, Longhe Yang, Yanting Li, Jin Fu, Yuhang Li, Yifeng Tian, Funan Qiu,
Zuguo Liu, and Yan Qiu. “Stearoyl-CoA desaturase-1 mediated cell apoptosis in colorectal cancer
by promoting ceramide synthesis”. In: Scientific reports 6.1 (2016), pp. 1–11.
[32] Yiwen Cheng, Zongxin Ling, and Lanjuan Li. “The intestinal microbiota and colorectal cancer”.
In: Frontiers in immunology (2020), p. 3100.
[33] Chien-Ming Chi, Yingying Fan, Jinchi Lv, et al. “High-Dimensional Knockoffs Inference for Time
Series Data”. In: arXiv preprint arXiv:2112.09851 (2021).
[34] Chien-Ming Chi, Patrick Vossler, Yingying Fan, and Jinchi Lv. “Asymptotic properties of
high-dimensional random forests”. In: The Annals of Statistics 50.6 (2022), pp. 3415–3438.
94
[35] Adam G Clooney, Thomas DS Sutton, Andrey N Shkoporov, Ross K Holohan, Karen M Daly,
Orla O’Regan, Feargal J Ryan, Lorraine A Draper, Scott E Plevy, R Paul Ross, et al. “Whole-virome
analysis sheds light on viral dark matter in inflammatory bowel disease”. In: Cell host & microbe
26.6 (2019), pp. 764–778.
[36] William G Cochran. “The comparison of percentages in matched samples”. In: Biometrika 37.3/4
(1950), pp. 256–266.
[37] Luciene M Coelho, Helen C Rezende, Luciana M Coelho, PA De Sousa, DF Melo, and NM Coelho.
“Bioremediation of polluted waters using microorganisms”. In: Advances in bioremediation of
wastewater and polluted soil 10 (2015), p. 60770.
[38] Olabisi Oluwabukola Coker, Geicho Nakatsu, Rudin Zhenwei Dai, William Ka Kei Wu,
Sunny Hei Wong, Siew Chien Ng, Francis Ka Leung Chan, Joseph Jao Yiu Sung, and Jun Yu.
“Enteric fungal microbiota dysbiosis and ecological alterations in colorectal cancer”. In: Gut 68.4
(2019), pp. 654–662.
[39] Jacob A Cram, Cheryl-Emiliane T Chow, Rohan Sachdeva, David M Needham, Alma E Parada,
Joshua A Steele, and Jed A Fuhrman. “Seasonal and interannual variability of the marine
bacterioplankton community throughout the water column over ten years”. In: The ISME Journal
9.3 (2015), pp. 563–580.
[40] Richard Creswell, Jie Tan, Jonathan W Leff, Brandon Brooks, Michael A Mahowald,
Ruth Thieroff-Ekerdt, and Georg K Gerber. “High-resolution temporal profiling of the human gut
microbiome reveals consistent and cascading alterations in response to dietary glycans”. In:
Genome Medicine 12 (2020), pp. 1–16.
[41] F De Luca and Y Shoenfeld. “The microbiome in autoimmune diseases”. In: Clinical &
Experimental Immunology 195.1 (2019), pp. 74–85.
[42] Xingming Deng, Zhuofei Li, Guan Li, Bei Li, Xinhan Jin, and Guoqing Lyu. “Comparison of
microbiota in patients treated by surgery or chemotherapy by 16S rRNA sequencing reveals
potential biomarkers for colorectal cancer therapy”. In: Frontiers in Microbiology 9 (2018), p. 1607.
[43] Stéphane Dray, Pierre Legendre, and Pedro R Peres-Neto. “Spatial modelling: a comprehensive
framework for principal coordinate analysis of neighbour matrices (PCNM)”. In: ecological
modelling 196.3-4 (2006), pp. 483–493.
[44] Sean R. Eddy. “Profile hidden Markov models.” In: Bioinformatics (Oxford, England) 14.9 (1998),
pp. 755–763.
[45] Majid Eslami, Sina Sadrifar, Mohsen Karbalaei, Masoud Keikha, Nazarii M Kobyliak, and
Bahman Yousefi. “Importance of the microbiota inhibitory mechanism on the Warburg effect in
colorectal cancer cells”. In: Journal of Gastrointestinal Cancer 51.3 (2020), pp. 738–747.
[46] Yingying Fan, Jinchi Lv, Mahrad Sharifvaghefi, and Yoshimasa Uematsu. “IPAD: stable
interpretable forecasting with knockoffs inference”. In: Journal of the American Statistical
Association 115.532 (2020), pp. 1822–1834.
95
[47] Kelsey Fehr, Shirin Moossavi, Hind Sbihi, Rozlyn CT Boutin, Lars Bode, Bianca Robertson,
Chloe Yonemitsu, Catherine J Field, Allan B Becker, Piushkumar J Mandhane, et al. “Breastmilk
feeding practices are associated with the co-occurrence of bacteria in mothers’ milk and the
infant gut: the CHILD cohort study”. In: Cell Host & Microbe 28.2 (2020), pp. 285–297.
[48] Qiang Feng, Suisha Liang, Huijue Jia, Andreas Stadlmayr, Longqing Tang, Zhou Lan,
Dongya Zhang, Huihua Xia, Xiaoying Xu, Zhuye Jie, et al. “Gut microbiome development along
the colorectal adenoma–carcinoma sequence”. In: Nature Communications 6.1 (2015), pp. 1–13.
[49] Ronald A Fisher. “Frequency distribution of the values of the correlation coefficient in samples
from an indefinitely large population”. In: Biometrika 10.4 (1915), pp. 507–521.
[50] Christoph Frank, Jan Sundquist, Hongyao Yu, Akseli Hemminki, and Kari Hemminki.
“Concordant and discordant familial cancer: familial risks, proportions and population impact”.
In: International journal of cancer 140.7 (2017), pp. 1510–1516.
[51] World Cancer Research Fund and American Institute for Cancer Research. Food, nutrition,
physical activity, and the prevention of cancer: a global perspective. Vol. 1. Amer Inst for Cancer
Research, 2007.
[52] Renyuan Gao, Yefei Zhu, Cheng Kong, Kai Xia, Hao Li, Yin Zhu, Xiaohui Zhang, Yongqiang Liu,
Hui Zhong, Rong Yang, et al. “Alterations, Interactions, and Diagnostic Potential of Gut Bacteria
and Viruses in Colorectal Cancer”. In: Frontiers in cellular and infection microbiology 11 (2021).
[53] Yilin Gao, Zifan Zhu, and Fengzhu Sun. “Increasing prediction performance of colorectal cancer
disease status using random forests classification based on metagenomic shotgun sequencing
data”. In: Synthetic and Systems Biotechnology 7.1 (2022), pp. 574–585.
[54] Christopher R Genovese, Kathryn Roeder, and Larry Wasserman. “False discovery control with
p-value weighting”. In: Biometrika 93.3 (2006), pp. 509–524.
[55] Christopher L Gentile and Tiffany L Weir. “The gut microbiota at the intersection of diet and
human health”. In: Science 362.6416 (2018), pp. 776–780.
[56] Dirk Gevers, Subra Kugathasan, Lee A Denson, Yoshiki Vázquez-Baeza, Will Van Treuren,
Boyu Ren, Emma Schwager, Dan Knights, Se Jin Song, Moran Yassour, et al. “The treatment-naive
microbiome in new-onset Crohn’s disease”. In: Cell host & microbe 15.3 (2014), pp. 382–392.
[57] Ann C Gregory, Olivier Zablocki, Ahmed A Zayed, Allison Howell, Benjamin Bolduc, and
Matthew B Sullivan. “The gut virome database reveals age-dependent patterns of virome
diversity in the human gut”. In: Cell host & microbe 28.5 (2020), pp. 724–740.
[58] Ankit Gupta, Rasna Gupta, and Ram Lakhan Singh. “Microbes and environment”. In: Principles
and applications of environmental biotechnology for a sustainable future (2017), pp. 43–84.
[59] Geoffrey D Hannigan, Melissa B Duhaime, Mack T Ruffin IV, Charlie C Koumpouras, and
Patrick D Schloss. “Diagnostic potential and interactive dynamics of the colorectal cancer
virome”. In: MBio 9.6 (2018), e02248–18.
96
[60] Geoffrey D Hannigan, Jacquelyn S Meisel, Amanda S Tyldsley, Qi Zheng, Brendan P Hodkinson,
Adam J SanMiguel, Samuel Minot, Frederic D Bushman, and Elizabeth A Grice. “The human skin
double-stranded DNA virome: topographical and temporal diversity, genetic enrichment, and
dynamic associations with the host microbiome”. In: MBio 6.5 (2015), e01578–15.
[61] Zihuai He, Linxi Liu, Chen Wang, Yann Le Guen, Justin Lee, Stephanie Gogarten, Fred Lu,
Stephen Montgomery, Hua Tang, Edwin K Silverman, et al. “Identification of putative causal loci
in whole-genome sequencing data via knockoff statistics”. In: Nature Communications 12.1 (2021),
p. 3152.
[62] Carlo Heip. “A new index measuring evenness”. In: Journal of the Marine Biological Association of
the United Kingdom 54.3 (1974), pp. 555–557.
[63] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural computation 9.8
(1997), pp. 1735–1780.
[64] Elaine Holmes, Jia V Li, Julian R Marchesi, and Jeremy K Nicholson. “Gut microbiota composition
and activity in relation to host metabolic phenotype and disease risk”. In: Cell metabolism 16.5
(2012), pp. 559–564.
[65] Doug Hyatt, Gwo-Liang Chen, Philip F LoCascio, Miriam L Land, Frank W Larimer, and
Loren J Hauser. “Prodigal: prokaryotic gene recognition and translation initiation site
identification”. In: BMC bioinformatics 11.1 (2010), pp. 1–11.
[66] Nikolaos Ignatiadis, Bernd Klaus, Judith B Zaugg, and Wolfgang Huber. “Data-driven hypothesis
weighting increases detection power in genome-scale multiple testing”. In: Nature Methods 13.7
(2016), pp. 577–580.
[67] I Illumina. “An introduction to next-generation sequencing technology”. In: Illumina, Inc (2015).
[68] Sergey Ioffe and Christian Szegedy. “Batch normalization: accelerating deep network training by
reducing internal covariate shift”. In: International conference on machine learning. 2015,
pp. 448–456.
[69] Leland J Jackson, John G Stockner, and Paul J Harrison. “Contribution of Rhizosolenia eriensis
and Cyclotella spp. to the deep chlorophyll maximum of Sproat Lake, British Columbia, Canada”.
In: Canadian Journal of Fisheries and Aquatic Sciences 47.1 (1990), pp. 128–135.
[70] Elizabeth L Johnson, Stacey L Heaver, William A Walters, and Ruth E Ley. “Microbiome and
metabolic disease: revisiting the bacterial phylum Bacteroidetes”. In: Journal of Molecular
Medicine 95 (2017), pp. 1–8.
[71] Alan Jović, Karla Brkić, and Nikola Bogunović. “A review of feature selection methods with
applications”. In: 2015 38th international convention on information and communication technology,
electronics and microelectronics (MIPRO). 2015, pp. 1200–1205.
[72] Minoru Kanehisa and Susumu Goto. “KEGG: kyoto encyclopedia of genes and genomes”. In:
Nucleic acids research 28.1 (2000), pp. 27–30.
97
[73] Purvesh Khatri, Marina Sirota, and Atul J Butte. “Ten years of pathway analysis: current
approaches and outstanding challenges”. In: PLoS computational biology 8.2 (2012), e1002375.
[74] Daehwan Kim, Li Song, Florian P Breitwieser, and Steven L Salzberg. “Centrifuge: rapid and
sensitive classification of metagenomic sequences”. In: Genome research 26.12 (2016),
pp. 1721–1729.
[75] Woo-Young Kim. “Therapeutic targeting of lipid synthesis metabolism for selective elimination of
cancer stem cells”. In: Archives of pharmacal research 42.1 (2019), pp. 25–39.
[76] David Kirchman, ELIZABETH K’nees, and Robert Hodson. “Leucine incorporation and its
potential as a measure of protein synthesis by bacteria in natural aquatic systems”. In: Applied
and environmental microbiology 49.3 (1985), pp. 599–607.
[77] Jerzy K Kulski. “Next-generation sequencing—an overview of the history, tools, and “Omic”
applications”. In: Next generation sequencing-advances, applications and challenges 10 (2016),
p. 61964.
[78] Satish Kumar, Kishore Kumar Krishnani, Bharat Bhushan, and Manoj Pandit Brahmane.
“Metagenomics: retrospect and prospects in high throughput age”. In: Biotechnology research
international 2015 (2015).
[79] D Lamy, I Obernosterer, M Laghdass, LF Artigas, E Breton, JD Grattepanche, E Lecuyer,
N Degros, P Lebaron, and U Christaki. “Temporal changes of major bacterial groups and bacterial
heterotrophic activity during a Phaeocystis globosa bloom in the eastern English Channel”. In:
Aquatic microbial ecology 58.1 (2009), pp. 95–107.
[80] Miriam Land, Loren Hauser, Se-Ran Jun, Intawat Nookaew, Michael R Leuze, Tae-Hyuk Ahn,
Tatiana Karpinets, Ole Lund, Guruprased Kora, Trudy Wassenaar, et al. “Insights from 20 years of
bacterial genome sequencing”. In: Functional & integrative genomics 15 (2015), pp. 141–161.
[81] Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam. “MEGAHIT: an
ultra-fast single-node solution for large and complex metagenomics assembly via succinct de
Bruijn graph”. In: Bioinformatics 31.10 (2015), pp. 1674–1676.
[82] Na Li, Qingqing Cai, Qing Miao, Zeshi Song, Yuan Fang, and Bijie Hu. “High-throughput
metagenomics for identification of pathogens in the clinical settings”. In: Small methods 5.1
(2021), p. 2000792.
[83] Huang Lin and Shyamal Das Peddada. “Analysis of compositions of microbiomes with bias
correction”. In: Nature Communications 11.1 (2020), pp. 1–11.
[84] Wanjun Liu, Yuan Ke, Jingyuan Liu, and Runze Li. “Model-free feature screening and FDR control
with knockoff features”. In: Journal of the American Statistical Association 117.537 (2022),
pp. 428–443.
98
[85] Jason Lloyd-Price, Cesar Arze, Ashwin N Ananthakrishnan, Melanie Schirmer,
Julian Avila-Pacheco, Tiffany W Poon, Elizabeth Andrews, Nadim J Ajami, Kevin S Bonham,
Colin J Brislawn, et al. “Multi-omics of the gut microbial ecosystem in inflammatory bowel
diseases”. In: Nature 569.7758 (2019), pp. 655–662.
[86] Michael I Love, Wolfgang Huber, and Simon Anders. “Moderated estimation of fold change and
dispersion for RNA-seq data with DESeq2”. In: Genome biology 15.12 (2014), pp. 1–21.
[87] Hengyun Lu, Francesca Giordano, and Zemin Ning. “Oxford Nanopore MinION sequencing and
genome assembly”. In: Genomics, proteomics & bioinformatics 14.5 (2016), pp. 265–279.
[88] Yang Lu, Yingying Fan, Jinchi Lv, and William Stafford Noble. “DeepPINK: reproducible feature
selection in deep neural networks”. In: Advances in neural information processing systems 31
(2018).
[89] Christine Ludwig and Ralf Wagner. “Virus-like particles—universal molecular toolboxes”. In:
Current opinion in biotechnology 18.6 (2007), pp. 537–545.
[90] Jinchi Lv and Jun S Liu. “Model selection principles in misspecified models”. In: Journal of the
Royal Statistical Society Series B: Statistical Methodology 76.1 (2014), pp. 141–167.
[91] Elaine R Mardis. “Next-generation DNA sequencing methods”. In: Annu. Rev. Genomics Hum.
Genet. 9 (2008), pp. 387–402.
[92] Peter Menzel, Kim Lee Ng, and Anders Krogh. “Fast and sensitive taxonomic classification for
metagenomics with Kaiju”. In: Nature Communications 7.1 (2016), p. 11257.
[93] Fernando Meyer, Adrian Fritz, Zhi-Luo Deng, David Koslicki, Till Robin Lesker, Alexey Gurevich,
Gary Robertson, Mohammed Alser, Dmitry Antipov, Francesco Beghini, et al. “Critical
Assessment of Metagenome Interpretation: the second round of challenges”. In: Nature Methods
(2022), pp. 1–12.
[94] Samuel Minot, Alexandra Bryson, Christel Chehoud, Gary D Wu, James D Lewis, and
Frederic D Bushman. “Rapid evolution of the human gut virome”. In: Proceedings of the National
Academy of Sciences 110.30 (2013), pp. 12450–12455.
[95] Hengameh Chloé Mirsepasi-Lauridsen, Katleen Vrankx, Jørgen Engberg, Alice Friis-Møller,
Jørn Brynskov, Inge Nordgaard-Lassen, Andreas Munk Petersen, and Karen Angeliki Krogfelt.
“Disease-specific enteric microbiome dysbiosis in inflammatory bowel disease”. In: Frontiers in
medicine 5 (2018), p. 304.
[96] Katalin Módis, Ciro Coletta, Antonia Asimakopoulou, Bartosz Szczesny, Celia Chao,
Andreas Papapetropoulos, Mark R Hellmich, and Csaba Szabo. “Effect of S-adenosyl-L-methionine
(SAM), an allosteric activator of cystathionine-β-synthase (CBS) on colorectal cancer cell
proliferation and bioenergetics in vitro”. In: Nitric oxide 41 (2014), pp. 146–156.
99
[97] Senthil Murugan Nagarajan, V Muthukumaran, R Murugesan, Rose Bindu Joseph, and
Meram Munirathanam. “Feature selection model for healthcare analysis and classification using
classifier ensemble technique”. In: International Journal of System Assurance Engineering and
Management (2021), pp. 1–12.
[98] Geicho Nakatsu, Haokui Zhou, William Ka Kei Wu, Sunny Hei Wong,
Olabisi Oluwabukola Coker, Zhenwei Dai, Xiangchun Li, Chun-Ho Szeto, Naoki Sugimura,
Thomas Yuen-Tung Lam, et al. “Alterations in enteric virome are associated with colorectal
cancer and survival outcomes”. In: Gastroenterology 155.2 (2018), pp. 529–541.
[99] Charmaine Ng, Haojun Li, William KK Wu, Sunny H Wong, and Jun Yu. “Genomics and
metagenomics of colorectal cancer”. In: Journal of Gastrointestinal Oncology 10.6 (2019), p. 1164.
[100] Jason M Norman, Scott A Handley, Megan T Baldridge, Lindsay Droit, Catherine Y Liu,
Brian C Keller, Amal Kambal, Cynthia L Monaco, Guoyan Zhao, Phillip Fleshner, et al.
“Disease-specific alterations in the enteric virome in inflammatory bowel disease”. In: Cell 160.3
(2015), pp. 447–460.
[101] María A Núñez-Sánchez, Joan Colom, Lauren Walsh, Colin Buttimer, Andrei Sorin Bolocan,
Rory Pang, Cormac GM Gahan, and Colin Hill. “Characterizing phage-host interactions in a
simplified human intestinal barrier model”. In: Microorganisms 8.9 (2020), p. 1374.
[102] Ingrid Obernosterer, Philippe Catala, Philippe Lebaron, and Nyree J West. “Distinct bacterial
groups contribute to carbon cycling during a naturally iron fertilized phytoplankton bloom in the
Southern Ocean”. In: Limnology and oceanography 56.6 (2011), pp. 2391–2401.
[103] David Paez-Espino, Simon Roux, I-Min A Chen, Krishna Palaniappan, Anna Ratner, Ken Chu,
Marcel Huntemann, T B K Reddy, Joan Carles Pons, Mercè Llabrés, et al. “IMG/VR v. 2.0: an
integrated data management and analysis system for cultivated and environmental viral
genomes”. In: Nucleic acids research 47.D1 (2019), pp. D678–D686.
[104] Eliseo Papa, Michael Docktor, Christopher Smillie, Sarah Weber, Sarah P Preheim, Dirk Gevers,
Georgia Giannoukos, Dawn Ciulla, Diana Tabbaa, Jay Ingram, et al. “Non-invasive mapping of
the gastrointestinal microbiota identifies children with inflammatory bowel disease”. In: PLoS one
7.6 (2012).
[105] Victoria Pascal, Marta Pozuelo, Natalia Borruel, Francesc Casellas, David Campos, Alba Santiago,
Xavier Martinez, Encarna Varela, Guillaume Sarrabayrouse, Kathleen Machiels, et al. “A microbial
signature for Crohn’s disease”. In: Gut 66.5 (2017), pp. 813–822.
[106] Edoardo Pasolli, Duy Tin Truong, Faizan Malik, Levi Waldron, and Nicola Segata. “Machine
learning meta-analysis of large metagenomic datasets: tools and biological insights”. In: PLoS
computational biology 12.7 (2016), e1004977.
[107] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
M. Perrot, and E. Duchesnay. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine
Learning Research 12 (2011), pp. 2825–2830.
100
[108] Jane Peterson, Susan Garges, Maria Giovanni, Pamela McInnes, Lu Wang, Jeffery A Schloss,
Vivien Bonazzi, Jean E McEwen, Kris A Wetterstrand, Carolyn Deal, et al. “The NIH human
microbiome project”. In: Genome research 19.12 (2009), pp. 2317–2323.
[109] Elena V Pikuta, Richard B Hoover, and Jane Tang. “Microbial extremophiles at the limits of life”.
In: Critical reviews in microbiology 33.3 (2007), pp. 183–209.
[110] Philipp Probst, Marvin N Wright, and Anne-Laure Boulesteix. “Hyperparameters and tuning
strategies for random forest”. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery 9.3 (2019), e1301.
[111] Junjie Qin, Yingrui Li, Zhiming Cai, Shenghui Li, Jianfeng Zhu, Fan Zhang, Suisha Liang,
Wenwei Zhang, Yuanlin Guan, Dongqian Shen, et al. “A metagenome-wide association study of
gut microbiota in type 2 diabetes”. In: Nature 490.7418 (2012), pp. 55–60.
[112] Michael A Quail, Miriam Smith, Paul Coupland, Thomas D Otto, Simon R Harris,
Thomas R Connor, Anna Bertoni, Harold P Swerdlow, and Yong Gu. “A tale of three next
generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina
MiSeq sequencers”. In: BMC genomics 13.1 (2012), pp. 1–13.
[113] Omesaad Rado, Najat Ali, Habiba Muhammad Sani, Ahmad Idris, and Daniel Neagu.
“Performance analysis of feature selection methods for classification of healthcare datasets”. In:
Intelligent Computing: Proceedings of the 2019 Computing Conference, Volume 1. 2019, pp. 929–938.
[114] Beatriz Remeseiro and Veronica Bolon-Canedo. “A review of feature selection methods in medical
applications”. In: Computers in Biology and Medicine 112 (2019), p. 103375.
[115] Jie Ren, Nathan A Ahlgren, Yang Young Lu, Jed A Fuhrman, and Fengzhu Sun. “VirFinder: a novel
k-mer based tool for identifying viral sequences from assembled metagenomic data”. In:
Microbiome 5 (2017), p. 69.
[116] Lesley Robertson, Jantien Backer, Claud Biemans, Joop van Doorn, Klaas Krab, Willem Reijnders,
Henk Smit, and Peter Willemsen. Antoni van Leeuwenhoek: Master of the Minuscule. Brill, 2016.
[117] Courtney J Robinson, Brendan JM Bohannan, and Vincent B Young. “From structure to function:
the ecology of host-associated microbial communities”. In: Microbiology and Molecular Biology
Reviews 74.3 (2010), pp. 453–476.
[118] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. “edgeR: a Bioconductor package for
differential expression analysis of digital gene expression data”. In: Bioinformatics 26.1 (2010),
pp. 139–140.
[119] Simon Roux, Francois Enault, Bonnie L Hurwitz, and Matthew B Sullivan. “VirSorter: mining
viral signal from microbial genomic data”. In: PeerJ 3 (2015), e985.
[120] Yvan Saeys, Inaki Inza, and Pedro Larranaga. “A review of feature selection techniques in
bioinformatics”. In: Bioinformatics 23.19 (2007), pp. 2507–2517.
101
[121] Alaa Sagheer and Mostafa Kotb. “Unsupervised pre-training of a deep LSTM-based stacked
autoencoder for multivariate time series forecasting problems”. In: Scientific Reports 9.1 (2019),
p. 19038.
[122] Lidia Sánchez-Alcoholado, Bruno Ramos-Molina, Ana Otero, Aurora Laborda-Illanes,
Rafael Ordóñez, José Antonio Medina, Jaime Gómez-Millán, and María Isabel Queipo-Ortuño.
“The role of the gut microbiome in colorectal cancer development and therapy response”. In:
Cancers 12.6 (2020), p. 1406.
[123] Melanie Schirmer, Ashley Garner, Hera Vlamakis, and Ramnik J Xavier. “Microbial genes and
pathways in inflammatory bowel disease”. In: Nature reviews microbiology 17.8 (2019),
pp. 497–511.
[124] James G Scott, Ryan C Kelly, Matthew A Smith, Pengcheng Zhou, and Robert E Kass. “False
discovery rate regression: an application to neural synchrony detection in primary visual cortex”.
In: Journal of the American Statistical Association 110.510 (2015), pp. 459–471.
[125] Simona Serini, Roberta Cassano, Paola Antonia Corsetto, Angela Maria Rizzo, Gabriella Calviello,
and Sonia Trombino. “Omega-3 PUFA loaded in resveratrol-based solid lipid nanoparticles:
physicochemical properties and antineoplastic activities in human colorectal cancer cells in
vitro”. In: International journal of molecular sciences 19.2 (2018), p. 586.
[126] Daniel M Sigman and Mathis P Hain. “The biological productivity of the ocean”. In: Nature
Education Knowledge 3.10 (2012), p. 21.
[127] H Ye Simon, Katherine J Siddle, Daniel J Park, and Pardis C Sabeti. “Benchmarking metagenomics
tools for taxonomic classification”. In: Cell 178.4 (2019), pp. 779–794.
[128] Charlotte Soneson, Michael I Love, and Mark D Robinson. “Differential analyses for RNA-seq:
transcript-level estimates improve gene-level inferences”. In: F1000Research 4 (2015).
[129] Mingyang Song, Andrew T Chan, and Jun Sun. “Influence of the gut microbiome, diet, and
environment on risk of colorectal cancer”. In: Gastroenterology 158.2 (2020), pp. 322–340.
[130] Ian F Spellerberg and Peter J Fedor. “A tribute to Claude Shannon (1916–2001) and a plea for
more rigorous use of species richness, species diversity and the ‘Shannon–Wiener’Index”. In:
Global ecology and biogeography 12.3 (2003), pp. 177–179.
[131] John D Storey. “A direct approach to false discovery rates”. In: Journal of the Royal Statistical
Society Series B: Statistical Methodology 64.3 (2002), pp. 479–498.
[132] John D Storey, Jonathan E Taylor, and David Siegmund. “Strong control, conservative point
estimation and simultaneous conservative consistency of false discovery rates: a unified
approach”. In: Journal of the Royal Statistical Society Series B 66.1 (2004), pp. 187–205.
[133] “The integrative human microbiome project”. In: Nature 569.7758 (2019), pp. 641–648.
102
[134] Andrew Maltez Thomas, Paolo Manghi, Francesco Asnicar, Edoardo Pasolli, Federica Armanini,
Moreno Zolfo, Francesco Beghini, Serena Manara, Nicolai Karcher, Chiara Pozzi, et al.
“Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic
signatures and a link with choline degradation”. In: Nature medicine 25.4 (2019), pp. 667–678.
[135] Robert Tibshirani. “Regression shrinkage and selection via the lasso”. In: Journal of the Royal
Statistical Society Series B: Statistical Methodology 58.1 (1996), pp. 267–288.
[136] M Toma, L Beluşică, M Stavarachi, P Apostol, S Spandole, I Radu, and D Cimponeriu. “Rating the
environmental and genetic risk factors for colorectal cancer”. In: Journal of medicine and life
5.Spec Issue (2012), p. 152.
[137] Yoshihiko Tomofuji, Toshihiro Kishikawa, Yuichi Maeda, Kotaro Ogawa, Takuro Nii,
Tatsusada Okuno, Eri Oguro-Igashira, Makoto Kinoshita, Kenichi Yamamoto, Kyuto Sonehara,
et al. “Whole gut virome analysis of 476 Japanese revealed a link between phage and autoimmune
disease”. In: Annals of the Rheumatic Diseases 81.2 (2022), pp. 278–288.
[138] Peter J Turnbaugh, Ruth E Ley, Micah Hamady, Claire M Fraser-Liggett, Rob Knight, and
Jeffrey I Gordon. “The human microbiome project”. In: Nature 449.7164 (2007), pp. 804–810.
[139] Luke K Ursell, Jessica L Metcalf, Laura Wegener Parfrey, and Rob Knight. “Defining the human
microbiome”. In: Nutrition reviews 70.suppl_1 (2012), S38–S44.
[140] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances in Neural
Information Processing Systems 30 (2017).
[141] Britta Velten, Jana M Braunger, Ricard Argelaguet, Damien Arnol, Jakob Wirbel,
Danila Bredikhin, Georg Zeller, and Oliver Stegle. “Identifying temporal and spatial patterns of
variation from multimodal data using MEFISTO”. In: Nature Methods 19.2 (2022), pp. 179–186.
[142] Wolfgang Viechtbauer. “Conducting meta-analyses in R with the metafor package”. In: Journal of
statistical software 36.3 (2010), pp. 1–48.
[143] José M Vieites, María-Eugenia Guazzaroni, Ana Beloqui, Peter N Golyshin, and Manuel Ferrer.
“Metagenomics approaches in systems microbiology”. In: FEMS microbiology reviews 33.1 (2008),
pp. 236–255.
[144] Emily Vogtmann, Xing Hua, Georg Zeller, Shinichi Sunagawa, Anita Y Voigt, Rajna Hercog,
James J Goedert, Jianxin Shi, Peer Bork, and Rashmi Sinha. “Colorectal cancer and the human gut
microbiome: reproducibility with whole-genome shotgun sequencing”. In: PloS one 11.5 (2016),
e0155362.
[145] William Wade. “Unculturable bacteria—the uncharacterized organisms that cause oral infections”.
In: Journal of the Royal Society of Medicine 95.2 (2002), pp. 81–83.
[146] Lipo Wang, Yaoli Wang, and Qing Chang. “Feature selection methods for big data bioinformatics:
A survey from the search perspective”. In: Methods 111 (2016), pp. 21–31.
103
[147] Alastair JM Watson and Paul D Collins. “Colon cancer: a civilization disorder”. In: Digestive
diseases 29.2 (2011), pp. 222–228.
[148] Bernd Wemheuer, Franziska Wemheuer, Jacqueline Hollensteiner, Frauke-Dorothee Meyer,
Sonja Voget, and Rolf Daniel. “The green impact: bacterioplankton response toward a
phytoplankton spring bloom in the southern North Sea assessed by comparative metagenomic
and metatranscriptomic approaches”. In: Frontiers in Microbiology 6 (2015), p. 805.
[149] Paul J Werbos. “Backpropagation through time: what it does and how to do it”. In: Proceedings of
the IEEE 78.10 (1990), pp. 1550–1560.
[150] Jakob Wirbel, Paul Theodor Pyl, Ece Kartal, Konrad Zych, Alireza Kashani, Alessio Milanese,
Jonas S Fleck, Anita Y Voigt, Albert Palleja, Ruby Ponnudurai, et al. “Meta-analysis of fecal
metagenomes reveals global microbial signatures that are specific for colorectal cancer”. In:
Nature medicine 25.4 (2019), pp. 679–689.
[151] Jiayu Wu, Kai Wang, Xuemei Wang, Yanli Pang, and Changtao Jiang. “The role of the gut
microbiome and its metabolites in metabolic diseases”. In: Protein & cell 12.5 (2021), pp. 360–373.
[152] Shinichi Yachida, Sayaka Mizutani, Hirotsugu Shiroma, Satoshi Shiba, Takeshi Nakajima,
Taku Sakamoto, Hikaru Watanabe, Keigo Masuda, Yuichiro Nishimoto, Masaru Kubo, et al.
“Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut
microbiota in colorectal cancer”. In: Nature medicine 25.6 (2019), pp. 968–976.
[153] Jian Yang, Dongfang Li, Zhenyu Yang, Wenkui Dai, Xin Feng, Yanhong Liu, Yiqi Jiang, Pingang Li,
Yinhu Li, Bo Tang, et al. “Establishing high-accuracy biomarkers for colorectal cancer by
comparing fecal microbiomes in patients with healthy families”. In: Gut Microbes 11.4 (2020),
pp. 918–929.
[154] Yi-Chun Yeh and Jed A Fuhrman. “Contrasting diversity patterns of prokaryotes and protists over
time and depth at the San-Pedro Ocean Time series”. In: ISME Communications 2.1 (2022), p. 36.
[155] Jun Yu, Qiang Feng, Sunny Hei Wong, Dongya Zhang, Qiao yi Liang, Youwen Qin,
Longqing Tang, Hui Zhao, Jan Stenvang, Yanli Li, et al. “Metagenomic analysis of faecal
microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer”. In: Gut
66.1 (2017), pp. 70–78.
[156] Georg Zeller, Julien Tap, Anita Y Voigt, Shinichi Sunagawa, Jens Roat Kultima, Paul I Costea,
Aurélien Amiot, Jürgen Böhm, Francesco Brunetti, Nina Habermann, et al. “Potential of fecal
microbiota for early-stage detection of colorectal cancer”. In: Molecular systems biology 10.11
(2014), p. 766.
[157] Xiangzhou Zhang, Yong Hu, Kang Xie, Shouyang Wang, EWT Ngai, and Mei Liu. “A causal
feature selection algorithm for stock prediction modeling”. In: Neurocomputing 142 (2014),
pp. 48–59.
[158] Yuqing Zhang, Giovanni Parmigiani, and W Evan Johnson. “ComBat-seq: batch effect adjustment
for RNA-seq count data”. In: NAR genomics and bioinformatics 2.3 (2020), lqaa078.
104
[159] Yuqing Zhang, Prasad Patil, W Evan Johnson, and Giovanni Parmigiani. “Robustifying genomic
classifiers to batch effects via ensemble learning”. In: Bioinformatics 37.11 (2021), pp. 1521–1527.
[160] Dongzhi Zhao, Xiaogang Xing, Yuguang Liu, Jianhong Yang, and Lin Wang. “The relation of
chlorophyll-a concentration with the reflectance peak near 700 nm in algae-dominated waters
and sensitivity of fluorescence algorithms for detecting algal bloom”. In: International Journal of
Remote Sensing 31.1 (2010), pp. 39–48.
[161] Liang Zhao, Zhikui Chen, Yueming Hu, Geyong Min, and Zhaohua Jiang. “Distributed feature
selection for efficient economic big data analysis”. In: IEEE Transactions on Big Data 4.2 (2016),
pp. 164–176.
[162] Danping Zheng, Timur Liwinski, and Eran Elinav. “Interaction between microbiota and immunity
in health and disease”. In: Cell research 30.6 (2020), pp. 492–506.
[163] Zifan Zhu, Yingying Fan, Yinfei Kong, Jinchi Lv, and Fengzhu Sun. “DeepLINK: Deep learning
inference using knockoffs with applications to genomics”. In: Proceedings of the National Academy
of Sciences 118.36 (2021), e2104683118.
[164] Tao Zuo, Xiao-Juan Lu, Yu Zhang, Chun Pan Cheung, Siu Lam, Fen Zhang, Whitney Tang,
Jessica YL Ching, Risheng Zhao, Paul KS Chan, et al. “Gut mucosal virome alterations in
ulcerative colitis”. In: Gut 68.7 (2019), pp. 1169–1179.
[165] Wenxuan Zuo, Beibei Wang, Xin Bai, Yihui Luan, Yingying Fan, Sonia Michail, and Fengzhu Sun.
“16S rRNA and metagenomic shotgun sequencing data revealed consistent patterns of gut
microbiome signature in pediatric ulcerative colitis”. In: Scientific reports 12.1 (2022), pp. 1–13.
105
Abstract (if available)
Abstract
Human microbes, including bacteria, eukaryotes, viruses, and many other microorganisms, inhabit diverse regions throughout the human body. These microbes together with their genes constitute the human microbiome. And this complicated ecosystem is significant in maintaining human health and is associated with various diseases. Recent developments in metagenomic analysis have provided unprecedented insight into the human microbiome. However, these breakthroughs also bring up with novel difficulties, including the identification of microbial signatures in specific diseases and the detection of disease-associated microbes. This dissertation addresses these challenges through a comprehensive exploration of metagenomic analysis. In Chapter 2, we present a multifaceted investigation into the gut phageome in colorectal cancer, uncovering distinct phage signatures and their potential in disease prediction. In Chapter 3, we introduce DeepLINK-T to achieve FDR control in high-dimensional statistical inference for time-series data using the integration of long short-term memory and model-X knockoffs framework. The efficacy of this algorithm is demonstrated through rigorous simulation studies and real-world applications. The outcomes of this research deepen our understanding of the human microbiome, offering valuable insights for future efforts in studying the pathogenesis and developing therapeutics of microbiome-associated diseases.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Computational algorithms and statistical modelings in human microbiome analyses
PDF
Enhancing phenotype prediction through integrative analysis of heterogeneous microbiome studies
PDF
Statistical and computational approaches for analyzing metagenomic sequences with reproducibility and reliability
PDF
Feature engineering and supervised learning on metagenomic sequence data
PDF
Constructing metagenome-assembled genomes and mobile genetic element host interactions using metagenomic Hi-C
PDF
Deep learning in metagenomics: from metagenomic contigs sorting to phage-bacterial association prediction
PDF
Sharpening the edge of tools for microbial diversity analysis
PDF
Developing statistical and algorithmic methods for shotgun metagenomics and time series analysis
PDF
Big data analytics in metagenomics: integration, representation, management, and visualization
PDF
Validating structural variations: from traditional algorithms to deep learning approaches
PDF
Mapping genetic variants for nonsense-mediated mRNA decay regulation across human tissues
PDF
Exploring the genetic basis of complex traits
PDF
Comparative transcriptomics: connecting the genome to evolution
PDF
Deciphering protein-nucleic acid interactions with artificial intelligence
PDF
Computational tools for large-scale analysis of brain function and structure
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Characterizing brain aging with neuroimaging, health, and genetic data
PDF
Computational algorithms for studying human genetic variations -- structural variations and variable number tandem repeats
PDF
Data-driven learning for dynamical systems in biology
PDF
Predicting virus-host interactions using genomic data and applications in metagenomics
Asset Metadata
Creator
Zuo, Wenxuan
(author)
Core Title
Exploration of human microbiome through metagenomic analysis and computational algorithms
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Degree Conferral Date
2024-05
Publication Date
03/29/2024
Defense Date
03/21/2024
Publisher
Los Angeles, California
(original),
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
deep learning,feature selection,human microbiome,metagenomics,OAI-PMH Harvest
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Sun, Fengzhu (
committee chair
), Chen, Liang (
committee member
), Edge, Michael D. (
committee member
), Lv, Jinchi (
committee member
), Michail, Sonia (
committee member
)
Creator Email
wzuo@usc.edu,zuowx1996@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113862174
Unique identifier
UC113862174
Identifier
etd-ZuoWenxuan-12735.pdf (filename)
Legacy Identifier
etd-ZuoWenxuan-12735
Document Type
Dissertation
Format
theses (aat)
Rights
Zuo, Wenxuan
Internet Media Type
application/pdf
Type
texts
Source
20240401-usctheses-batch-1133
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
deep learning
feature selection
human microbiome
metagenomics