Enhancing Phenotype Prediction through Integrative Analysis of Heterogeneous Microbiome
Studies
by
Yilin Gao
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
December 2023
Copyright 2023 Yilin Gao
Dedication
To my beloved parents, Zhiyuan Gao and Xiaohong Lin, for granting me invaluable educational opportunities and showering me with endless support.
In loving memory of my grandfather, Xun Lin, the cherished beacon of my life, who nurtured and cared
for me throughout my childhood.
Acknowledgements
First and foremost, I would like to express my deepest gratitude to my advisor, Dr. Fengzhu Sun, for his unwavering support, guidance, and mentorship throughout the course of my doctoral studies. His expertise and insights have been invaluable, and his encouragement has been a driving force behind my academic journey.
I am also immensely grateful to my qualification and committee members, Dr. Liang Chen, Dr. Michael Edge, Dr. Sonia Michail, and Dr. Jinchi Lv, for their invaluable feedback, constructive criticism, and encouragement. Their diverse perspectives have enriched my research and broadened my academic horizons.
I would like to thank Dr. Alex Tsoi from the Department of Computational Medicine and Bioinformatics and Dr. Matthew Patrick from the Department of Dermatology at the University of Michigan, for
introducing me to the magnificent world of bioinformatics and computational biology. I am really grateful
for the opportunity to do research under their guidance when I was still an undergraduate student. I would
not have achieved what I have today without their tremendous help and support.
I would like to thank my collaborators, Dr. Zifan Zhu from my lab, and Kexin Wang from the Department of Applied Mathematics and Statistics at Johns Hopkins University, for their great support in joint
publications. I would also like to thank Beibei Wang from my lab for the stimulating discussions about my projects.
Special thanks to my dearest friend, Dr. Lijing Wang, whose unwavering friendship and invaluable words of encouragement throughout this journey have been a beacon of support.
Last but most importantly, I reserve my most heartfelt gratitude for my family. To my parents, Zhiyuan
Gao and Xiaohong Lin, and my grandparents, Xun Lin and Hanling Xu, thank you for your endless love,
sacrifices, and belief in my dreams. To my husband, Pei Zhang, thank you for your love, patience, and steadfast support in every step of this journey.
Other authors and contributors to the thesis
Dr. Fengzhu Sun, Dr. Zifan Zhu and Kexin Wang.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
1.1 Research problems in colorectal cancer status prediction using microbiome abundance profiles
1.2 Research problems in phenotype predictions integrating multiple omics studies
1.2.1 Disease status predictions
1.2.2 Quantitative phenotype predictions

Chapter 2: Increasing prediction performance of colorectal cancer disease status using random forests classification based on metagenomic shotgun sequencing data
2.1 Introduction
2.2 Materials and Methods
2.2.1 Workflow of the CRC-microbiome prediction analysis
2.2.2 Colorectal cancer metagenomic datasets
2.2.3 Generating known and novel microbial relative abundance profiles by MicroPro
2.2.4 Generating known microbial relative abundance profiles by other metagenomic taxonomic profiling tools
2.2.5 Random forests CRC predictive models
2.2.6 LASSO and SVM CRC predictive models
2.2.7 Leave-one-sample-out (LOSO) model stacking method
2.2.8 Prediction performance measured by the area under the precision-recall curve (AUPRC)
2.3 Results
2.3.1 Known microbial relative abundance profiles extracted from Centrifuge yield higher prediction performance than other metagenomic taxonomic profiling tools
2.3.2 Random forests classifier outperforms LASSO and SVM in terms of AUC scores
2.3.3 Increasing the number of decision trees in a random forests model significantly improves the prediction performance
2.3.4 Including novel microbial organisms slightly improves the performance of CRC prediction
2.3.5 Removing cross-dataset batch effects by ComBat before training did not improve the performance
2.4 Discussion

Chapter 3: Batch Normalization Followed by Merging is Powerful for Phenotype Prediction Integrating Multiple Heterogeneous Studies
3.1 Introduction
3.2 Materials and Methods
3.2.1 Outline of workflow for integrating multiple heterogeneous metagenomic datasets
3.2.2 Simulation Stage
3.2.2.1 Scenario 1: Divergent training and test populations with varied genomic feature distribution
3.2.2.2 Scenario 2: Different batch effects on training data with consistent underlying population genomic feature distribution
3.2.2.3 Scenario 3: Varying degrees of overlap in disease-associated OTUs between training and test datasets
3.2.3 Naive and ComBat Normalization Stage
3.2.4 Classification Stage
3.2.4.1 Machine learning classifiers
3.2.4.2 Ensemble weighted learning methods
3.2.4.3 Rank aggregation methods
3.2.5 Applications on real CRC metagenomic datasets
3.2.6 Applications on TB gene expression datasets
3.3 Results
3.3.1 ComBat normalization is essential for heterogeneous populations
3.3.2 ComBat normalization effectively corrects batch effects within the same population
3.3.3 Prediction accuracy can be markedly decreased as the number of overlapping disease-associated OTUs decreases
3.3.4 Applications to metagenomic datasets related to colorectal cancer
3.3.5 Applications to gene expression datasets related to tuberculosis
3.3.6 Consistency across LASSO, SVM, and XGBoost classifiers mirroring the RF classifier in simulation studies and real data applications
3.3.7 Rank aggregation is an alternative approach for integrating heterogeneous studies
3.4 Discussion

Chapter 4: Mitigating Heterogeneity Effects in Omics-based Quantitative Phenotype Prediction: A Comprehensive Workflow for Integrating Multiple Studies with Batch Normalization
4.1 Introduction
4.2 Materials and Methods
4.2.1 A comprehensive workflow for predicting quantitative phenotypes from simulated heterogeneous metagenomic datasets
4.2.2 Three simulation scenarios of heterogeneity
4.2.2.1 Scenario 1: Different background distributions of genomic features in training populations
4.2.2.2 Scenario 2: Different batch effects in studies with the same background distribution of genomic features in a population
4.2.2.3 Scenario 3: Different phenotype-associated feature models in different studies
4.2.3 Simulation of linear and non-linear relationships of quantitative phenotypes
4.2.4 Naive, ComBat normalization, and ConQuR normalization
4.2.5 Integration methods for combining multiple training models
4.2.6 Applications on body mass index (BMI) prediction from Type-2 diabetes (T2D) metagenomic datasets
4.2.7 Applications on survival prediction from ovarian cancer gene expression datasets
4.2.8 Machine learning models used in phenotype predictions
4.3 Results
4.3.1 Batch normalization boosts prediction performance for homogeneous populations with batch effects but falls short for diverse datasets and different phenotype-associated feature models in quantitative phenotype predictions
4.3.2 BMI predictions on T2D metagenomic datasets confirm the importance of batch normalization combined with integration methods when training and test data are similar and have sizeable samples
4.3.3 Survival prediction of ovarian cancer gene expression data can be boosted by batch normalization with substantial sample size
4.3.4 ComBat and ConQuR normalization methods show similar performance in removing the effects of heterogeneity on machine learning model prediction performance
4.3.5 The merging-based method does not show any advantage over weighting-based integration methods in the context of quantitative phenotype predictions
4.4 Discussion

Chapter 5: Conclusions and future work
5.1 Conclusions
5.1.1 Increasing prediction performance of CRC disease status
5.1.2 Phenotype prediction by integrating multiple heterogeneous studies
5.2 Future work

Bibliography

Appendices
A Supplementary Materials for Chapter 2
A.1 Prediction performance comparison of the model stacking method and the traditional feature stacking method
A.2 Leave-one-dataset-out predictive analysis
A.3 Comparison of prediction accuracy based on microbial species abundance profiles using the UHGG database and the NCBI RefSeq database
B Supplementary Materials for Chapter 3
List of Tables

2.1 Six metagenomic datasets related to colorectal cancer.
3.1 Overview of six colorectal cancer-related metagenomic datasets.
3.2 Overview of six tuberculosis-related gene expression datasets.
4.1 Three colorectal cancer (CRC) metagenomic datasets used for simulations.
4.2 Integration methods used for combining multiple training models.
4.3 Four metagenomic datasets related to T2D.
4.4 Three gene expression datasets related to ovarian cancer.
List of Figures

2.1 Workflow of the CRC-microbiome prediction analysis. MicroPro is applied to the input metagenomic dataset to characterize known and novel abundance tables. Two random forests classifiers are trained using known and novel abundances, respectively. LOSO model stacking is used to incorporate predictive probabilities from both random forests classifiers and applied to the independent test metagenomic dataset to derive the prediction performance in terms of the AUC score. LOSO: leave-one-sample-out; AUC: area under the receiver operating characteristic curve.

2.2 Barplots of AUC scores for predicting disease status (case/control) using random forests with 1000 decision trees on known species abundance profiles characterized by Centrifuge (orange), MetaPhlAn2 (blue), and Bracken (grey), in three experimental designs: within-dataset (subfigure A), cross-dataset (subfigure B), and leave-one-dataset-out (LODO) (subfigure C). AUC scores of within-dataset and cross-dataset analyses are averages from 30 independent repetitions, while AUC scores of LODO analysis are averages from 10 independent repetitions. MetaPhlAn2 AUC scores are taken directly from the study by Thomas et al. [40].

2.3 Barplots of AUC scores for predicting CRC status (case/control) using known microbial species abundance profiles trained by RF (blue), LASSO (orange), and SVM (grey) in three experimental designs: within-dataset (subfigure A), cross-dataset (subfigure B), and leave-one-dataset-out (LODO) (subfigure C). AUC scores of within-dataset and cross-dataset analyses are averages from 30 independent repetitions, while AUC scores of LODO analysis are averages from 10 independent repetitions. Random forests classifiers use 1000 decision trees. The regularization coefficients in LASSO and SVM are those that maximize the AUCs of 10-fold cross-validation. The microbial species abundance profiles were log10-transformed (a small constant, 1e−10, was added to zero abundances) before model training.

2.4 Scatter plots of AUC scores by three different classifiers, RF, LASSO, and SVM, in three experimental settings. The x-axis shows the RF AUC scores, and the y-axis shows the AUC scores obtained by LASSO (subfigure A) and SVM (subfigure B). The microbial species abundance profiles were log10-transformed (a small constant, 1e−10, was added to zero abundances) before model training. The Spearman correlation for RF vs. LASSO (subfigure A) is 0.79, and for RF vs. SVM (subfigure B) it is 0.75.
2.5 The AUC scores generally increase with a larger number of decision trees in RF in cross-dataset analysis. The Hannigan dataset is used as the training set. Each line refers to the testing set used in the analysis. All AUC scores are averages from 30 independent repetitions.

2.6 Barplots of random forests AUC scores using known microbial abundances (orange), LOSO model stacking (blue), and results from Thomas et al. [40] in three experimental designs: within-dataset (subfigure A), cross-dataset (subfigure B), and leave-one-dataset-out (LODO) (subfigure C). AUC scores of within-dataset and cross-dataset analyses are averages from 30 independent repetitions, while AUC scores of LODO analysis are averages from 10 independent repetitions. Random forests models in ‘known’ and ‘LOSO model stacking’ use 5000 decision trees, while those in the ‘Thomas et al. paper’ use 1000 decision trees.

2.7 Barplots of random forests AUPRC scores using known microbial abundances (orange) and LOSO model stacking (blue) in two experimental designs: within-dataset (subfigure A) and cross-dataset (subfigure B). All AUPRC scores are averages from 30 independent repetitions. Random forests models in ‘known’ and ‘LOSO model stacking’ use 5000 decision trees.

2.8 Barplots of cross-dataset AUC scores for predicting disease status (case/control) using known species abundance profiles trained by random forests directly (blue) and with batch effects removed by ComBat before training random forests models (orange). AUC scores are averages from 30 independent repetitions. Random forests classifiers use 1000 decision trees.

3.1 Workflow for integrating multiple simulated heterogeneous metagenomic datasets. A: Simulation stage of three different heterogeneity scenarios. Scenario 1: Divergent training and test populations with varied genomic feature distribution. Scenario 2: Different batch effects on training data with consistent underlying population genomic feature distribution. Scenario 3: Varying degrees of overlap in disease-associated OTUs between training and test datasets. The output of this step includes two simulated training datasets and one test dataset. B: Naive and ComBat normalization settings. In the naive setting, the output datasets from Step A are used directly without any additional normalization. In the ComBat normalization setting, the two training datasets from Step A undergo normalization using the ComBat method to address potential batch effects, using the test dataset as a reference. C: Classification stage of applying different ensemble weighted learning and rank aggregation methods. The two training datasets from Step B are used to train two machine learning classifiers. These classifiers generate two lists of prediction probabilities when applied to the test dataset. For ensemble weighted learning, the two lists of probabilities are directly integrated using specific weights (w1, w2) determined by the integration method, resulting in a final list of prediction probabilities. For rank aggregation, the two lists of probabilities are ranked, and the resulting lists of ranks are integrated using the respective rank aggregation methods. Different weights (a1, a2) determined by the method are used in this integration process, producing one final list of ranks. Note that the integration methods utilize distinct weights (w1, w2, a1, a2) based on their specific approach.
3.2 Distributions of genomic features across multiple colorectal cancer and tuberculosis studies. A: Principal Coordinate Analysis (PCoA) of Bray–Curtis distances, calculated from six colorectal cancer metagenomic count datasets. B: Principal Component Analysis (PCA) performed on six tuberculosis gene expression datasets. The PCA was carried out on the logarithm of fragments per kilobase of transcript per million mapped reads (log FPKM). In this figure, the Zak, Leong, and Walter datasets overlap with each other and are far away from the other three datasets. C: PCA performed on the Zak, Leong, and Walter tuberculosis gene expression datasets. The ellipses represent a 95% confidence level under the assumption of a multivariate t-distribution. A round dot represents a case sample, while a triangle dot signifies a control sample.

3.3 ComBat normalization markedly enhances cross-study prediction when the training and test data have divergent feature distributions. The figures display the AUCs of RF predictions employing various integration methods with three distinct disease effect factors. Columns represent different values of α. Method names without a "ComBat" suffix refer to those implemented in the naive setting, while those with a "ComBat" suffix were conducted in the ComBat normalization setting. All the experiments were conducted 100 times, and the AUC scores presented in the figure represent the averages from these 100 trials. Abbreviations: Ensem.learn.-Ensemble Learning; Rank aggr.-Rank aggregation; norm.-normalization.

3.4 ComBat normalization improves cross-study prediction in the presence of batch effects. The figures show the AUC scores of RF prediction using different integration methods with varying severity levels of batch effects. A: AUC score comparisons with different severity levels of additive batch effects on the mean of OTU abundances, with no multiplicative batch effect on the variance. B: AUC score comparisons with different severity levels of multiplicative batch effects on the variance of OTU abundances, with no additive batch effect on the mean. The disease effect factor was set to 1.025 for both scenarios. The integration methods without a "ComBat" suffix were applied in the naive setting, while those with a "ComBat" suffix were applied after ComBat normalization. The experiments were repeated 100 times, and the shown AUC scores are the averages across the 100 trials. Abbreviations are the same as in Figure 3.3.

3.5 ComBat normalization markedly increases cross-study prediction when the training and test data have varying degrees of overlap in disease-associated OTUs. The figures show the AUCs of RF prediction using different integration methods with various numbers of overlapping disease-associated OTUs. The disease effect factor was set to 1.075. Columns represent different numbers of overlapping disease-associated OTUs in the training and test data; the larger the number, the more similar the two disease models are. When the number reaches 10, the two models are the same in the training and test data. All the experiments were repeated 100 times, and the AUC scores shown in the figure are the averages from the 100 trials. Abbreviations are the same as in Figure 3.3.
3.6 Realistic applications of merging and integration methods on multiple CRC metagenomic datasets and TB gene expression datasets using RF classifiers, with average AUC scores from six individual leave-one-dataset-out (LODO) experiments. A: Leave-one-dataset-out average AUC score comparisons among different methods in colorectal cancer metagenomic datasets. B: Leave-one-dataset-out average AUC score comparisons among different methods in tuberculosis gene expression datasets. The results by different methods are grouped into six groups. "Single learner": each of the five training datasets was trained independently with an RF classifier and predicted on the test dataset, and the average AUC score was then taken among the five predictions. "Merged": merging method pooling all five training datasets into one training dataset. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the five training predictors were integrated by ensemble weighted learning methods under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the five training predictors were integrated by rank aggregation methods under the ComBat normalization setting. The red dots and associated values in the figure are the mean AUC scores for each method, the vertical bars are the median AUC scores for each method, and the black dots represent the outliers. The same method under different settings is shown with the same boxplot color. All the experiments were repeated 30 times for each test dataset, and the results presented in the figure were based on the average AUC scores of the 180 total replications over the six test datasets for the metagenomic and gene expression studies, respectively.

4.1 Workflow for quantitative phenotype prediction by integrating simulated heterogeneous metagenomic datasets. A: Simulation stage of three different heterogeneity scenarios. Scenario 1: Different background distributions of genomic features in training populations. Scenario 2: Different batch effects in studies with the same background distribution of genomic features in a population. Scenario 3: Different phenotype-associated feature models in training data and test data. The output from this step consists of two simulated training datasets and one test dataset. B: Naive and normalization stage. The outputs from A are unchanged in the naive setting, while in the normalization setting the two simulated training datasets are normalized. C: Integration stage. The outputs from the previous stage are used to train machine learning models, followed by applying different integration methods.

4.2 Metagenomic feature distributions from multiple T2D studies. Principal Component Analysis (PCA) is conducted on four T2D metagenomic log relative abundance datasets, including T2D samples, IGT samples, and healthy control samples. The analysis is performed on the intersection of metagenomic features from the four datasets, which are later used for training machine learning models. The ellipses in the PCA plot represent the 95% confidence level assuming a multivariate t-distribution.
4.3 Genomic feature distributions from multiple ovarian cancer studies. A: Principal component analysis (PCA) of three ovarian cancer gene expression datasets. The analysis is performed on the intersection of gene features from the three datasets, which are later used for training machine learning models. The ellipses in the PCA plot represent the 95% confidence level assuming a multivariate t-distribution. B: Gene expression level comparisons for each individual in the three ovarian cancer datasets. The gene expression levels of all genes for each individual are summarized using a boxplot. This comparison is performed to investigate the expression profile similarity and heterogeneity among individuals across datasets.

4.4 Cross-study quantitative phenotype prediction comparisons by different methods under the naive and batch normalization settings when the training and test populations have different feature distributions. The figures display the median RMSE values of RF regression predictions using various integration methods with different simulated quantitative phenotypes. The columns represent different values of α. The methods without a suffix are the ones carried out in the naive setting, while the methods with a "ComBat" or "ConQuR" suffix were carried out in the ComBat or ConQuR normalization setting, respectively. The experiments were repeated 100 times, and the figures show the median RMSE values from the 100 trials.

4.5 Batch normalization increases cross-study quantitative phenotype prediction when the studies have batch effects. The figures display the median RMSE values of RF regression predictions using different integration methods with various batch severity levels. Columns represent different batch effect factors in the mean and variance of microbial abundance. "m": batch effect factor adjusting the mean. "v": batch effect factor adjusting the variance. The methods without a suffix are the ones carried out in the naive setting, while the methods with a "ComBat" or "ConQuR" suffix were carried out in the ComBat or ConQuR normalization setting, respectively. The experiments were repeated 100 times, and the figures show the median RMSE values from the 100 trials.

4.6 Batch normalization does not improve cross-study prediction when the training and test data have different quantitative-phenotype-associated feature models. The figures display the median RMSE values of RF regression predictions using different integration methods with various numbers of overlapping phenotype-associated OTUs. Columns represent different numbers of overlapping disease-associated OTUs in the training and test data; the larger the number, the more similar the two disease models are. When the number reaches 10, the two models are the same in the training and test data. The methods without a suffix are the ones carried out in the naive setting, while the methods with a "ComBat" or "ConQuR" suffix were carried out in the ComBat or ConQuR normalization setting, respectively. The experiments were repeated 100 times, and the figures show the median RMSE values from the 100 trials.
4.7 Realistic applications of merging and integration methods on BMI predictions using four T2D metagenomic datasets. The real data analysis is conducted in a leave-one-dataset-out setting. Each sub-figure shows the prediction results on the corresponding test dataset, while the other three datasets are used as training data. The bars in each sub-figure represent different integration methods, with each color representing a different method group and the three same-colored bars representing the three different settings: naive, ComBat normalization, and ConQuR normalization. The median RMSE values are shown for each test dataset, and all experiments are repeated 30 times to account for the randomness in RF regression models and the different validation sets used in the "val_rmse" method. Asterisks show the significance level of the RMSE comparisons between the normalization settings and the naive setting by Mann–Whitney U tests.

4.8 Realistic applications of merging and integration methods on survival predictions using three ovarian cancer gene expression datasets. Each sub-figure shows the prediction results on the corresponding test dataset, while the other two datasets are used as training data. The bars in each sub-figure represent different integration methods, with each color representing a different method group and the two same-colored bars representing the two different settings: naive and ComBat normalization. The median C-index values are shown for each test dataset, and all experiments are repeated 10 times to account for the different validation sets used in the "val_cindex" method.

S1 Barplots of AUC scores using feature stacking (orange) and LOSO model stacking (blue) in two experimental designs: within-dataset (subfigure A) and cross-dataset (subfigure B). All AUC scores are averages from 30 independent repetitions. ‘Feature stacking’ refers to the random forests model trained directly on abundances from both known and novel organisms. ‘LOSO model stacking’ refers to incorporating predictive probabilities from known and novel organisms to predict by the LOSO model stacking method. 5000 decision trees are used in all random forests models.

S2 Barplots of AUPRC scores using feature stacking (orange) and LOSO model stacking (blue) in two experimental designs: within-dataset (subfigure A) and cross-dataset (subfigure B). All AUPRC scores are averages from 30 independent repetitions. ‘Feature stacking’ refers to the random forests model trained directly on abundances from both known and novel organisms. ‘LOSO model stacking’ refers to incorporating predictive probabilities from known and novel organisms to predict by the LOSO model stacking method. 5000 decision trees are used in all random forests models.

S3 Barplots of AUC (subfigure A) and AUPRC (subfigure B) scores using known microbial abundances (orange) and LOSO model stacking (blue) in leave-one-dataset-out (LODO) analysis. All AUC and AUPRC scores are averages from 10 independent repetitions. ‘Known’ refers to the random forests model using only abundances of known organisms, and ‘LOSO model stacking’ refers to incorporating predictive probabilities from known and novel organisms to predict by the LOSO model stacking method. Random forests models use 5000 decision trees.
S4 Barplots of average mapping rates against the NCBI RefSeq (orange) and UHGG (blue) databases for each dataset. The average mapping rate is the mean mapping rate of all samples in the same dataset. The sample mapping rate is calculated as the number of mapped reads divided by the total number of reads in the sample.

S5 Barplots of AUC scores for predicting disease status (case/control) trained by random forests on known species abundance profiles classified by the NCBI (orange) and UHGG (blue) reference databases in three experimental designs: within-dataset (subfigure A), cross-dataset (subfigure B), and leave-one-dataset-out (LODO) (subfigure C). AUC scores of within-dataset and cross-dataset analyses are averages from 30 independent repetitions, while AUC scores of LODO analysis are averages from 10 independent repetitions.

S6 Realistic applications of merging and integration methods on multiple colorectal cancer metagenomic datasets using RF classifiers from six individual leave-one-dataset-out (LODO) experiments. The results by different methods are grouped into six groups. "Single learner": each of the five training datasets was trained independently with an RF classifier and predicted on the test dataset, and the average AUC score was then taken among the five predictions. "Merged": merging method pooling all five training datasets into one training dataset. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the five training predictors were integrated by ensemble weighted learning methods under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the five training predictors were integrated by rank aggregation methods under the ComBat normalization setting. The red dots and associated values in the figure are the mean AUC scores for each method, the vertical bars are the median AUC scores for each method, and the black dots represent the outliers. The same method under different settings is shown with the same boxplot color. All the experiments were repeated 30 times for each test dataset, and the results presented in the figure were based on the average AUC scores of the 180 total replications for the six test datasets.
S7 Realistic applications of merging and integration methods on multiple tuberculosis genomic datasets using RF classifiers from six individual leave-one-dataset-out (LODO) experiments. The results by different methods are grouped into six groups. "Single learner": each of the five training datasets was trained independently with an RF classifier and predicted on the test dataset, and the average AUC score was then taken among the five predictions. "Merged": merging method pooling all five training datasets into one training dataset. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the five training predictors were integrated by ensemble weighted learning methods under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the five training predictors were integrated by rank aggregation methods under the ComBat normalization setting. The red dots and associated values in the figure are the mean AUC scores for each method, the vertical bars are the median AUC scores for each method, and the black dots represent the outliers. The same method under different settings is shown with the same boxplot color. All the experiments were repeated 30 times for each test dataset, and the results presented in the figure were based on the average AUC scores of the 180 total replications for the six test datasets.

S8 ComBat normalization increases cross-study prediction when the training and test data have different feature distributions with LASSO classifiers. The figures show the AUC scores of LASSO predictions using different integration methods with three different disease effect factors. Columns represent different values of α. All the method names without a "ComBat" suffix are the methods carried out in the naive setting, while the names with a "ComBat" suffix were carried out in the ComBat normalization setting. All the experiments were repeated 100 times, and the AUC scores shown in the figure are the averages from the 100 trials. However, the differences between the results with and without normalization are not as dramatic as for the other machine learning classifiers, indicating the robustness of LASSO with respect to population differences.

S9 ComBat normalization showed similarly low cross-study prediction accuracy compared to no normalization when the training and test data have different feature distributions with SVM classifiers. The figures show the AUC scores of SVM predictions with a polynomial kernel using different integration methods with three different disease effect factors. Columns represent different values of α. All the method names without a "ComBat" suffix are the methods carried out in the naive setting, while the names with a "ComBat" suffix were carried out in the ComBat normalization setting. All the experiments were repeated 100 times, and the AUC scores shown in the figure are the averages from the 100 trials.
S10 ComBat normalization markedly increases cross-study prediction when the training and test data have different feature distributions with XGBoost classifiers. The figures show the AUC scores of XGBoost predictions using different integration methods with three different disease effect factors. Columns represent different values of α. All the method names without a "ComBat" suffix are the methods carried out in the naive setting, while the names with a "ComBat" suffix were carried out in the ComBat normalization setting. All the experiments were repeated 100 times, and the AUC scores shown in the figure are the averages from the 100 trials.

S11 ComBat normalization markedly increases cross-study prediction with LASSO classifiers when the studies have batch effects. The figures show the AUC scores of LASSO predictions using different integration methods with various batch severity levels. A: AUC score comparisons with different severity levels of additive batch effects on the mean of OTU abundances, with no multiplicative batch effect on the variance. B: AUC score comparisons with different severity levels of multiplicative batch effects on the variance of OTU abundances, with no additive batch effect on the mean. The disease effect factor was set to 1.025 for both situations. All the method names without a "ComBat" suffix are the methods carried out in the naive setting, while the names with a "ComBat" suffix were carried out in the ComBat normalization setting. All the experiments were repeated 100 times, and the AUC scores shown in the figure are the averages from the 100 trials.

S12 ComBat normalization slightly increases cross-study prediction accuracy with SVM classifiers when the studies have batch effects. The figures show the AUC scores of SVM predictions with a polynomial kernel using different integration methods with various batch severity levels. A: AUC score comparisons with different severity levels of additive batch effects on the mean of OTU abundances, with no multiplicative batch effect on the variance. B: AUC score comparisons with different severity levels of multiplicative batch effects on the variance of OTU abundances, with no additive batch effect on the mean. The disease effect factor was set to 1.025 for both situations. All the method names without a "ComBat" suffix are the methods carried out in the naive setting, while the names with a "ComBat" suffix were carried out in the ComBat normalization setting. All the experiments were repeated 100 times, and the AUC scores shown in the figure are the averages from the 100 trials.

S13 ComBat normalization markedly increases cross-study prediction with XGBoost classifiers when the studies have batch effects. The figures show the AUC scores of XGBoost predictions using different integration methods with various batch severity levels. A: AUC score comparisons with different severity levels of additive batch effects on the mean of OTU abundances, with no multiplicative batch effect on the variance. B: AUC score comparisons with different severity levels of multiplicative batch effects on the variance of OTU abundances, with no additive batch effect on the mean. The disease effect factor was set to 1.025 for both situations. All the method names without a "ComBat" suffix are the methods carried out in the naive setting, while the names with a "ComBat" suffix were carried out in the ComBat normalization setting. All the experiments were repeated 100 times, and the AUC scores shown in the figure are the averages from the 100 trials.
S14 ComBat normalization markedly increases cross-study prediction with LASSO classifiers when the training and test data have different disease models. The figures show the AUC scores of LASSO predictions using different integration methods with various numbers of overlapping disease-associated OTUs. The disease effect factor was set to 1.075. Columns represent different numbers of overlapping disease-associated OTUs in the training and test data; the larger the number, the more similar the two disease models are. When the number reaches 10, the two models are the same in the training and test data. All the experiments were repeated 100 times, and the AUC scores shown in the figure are the averages from the 100 trials.

S15 ComBat normalization markedly increases cross-study prediction with SVM classifiers when the training and test data have different disease models. The figures show the AUC scores of SVM predictions with a polynomial kernel using different integration methods with various numbers of overlapping disease-associated OTUs. The disease effect factor was set to 1.075. Columns represent different numbers of overlapping disease-associated OTUs in the training and test data; the larger the number, the more similar the two disease models are. When the number reaches 10, the two models are the same in the training and test data. All the experiments were repeated 100 times, and the AUC scores shown in the figure are the averages from the 100 trials.

S16 ComBat normalization markedly increases cross-study prediction with XGBoost classifiers when the training and test data have different disease models. The figures show the AUC scores of XGBoost predictions using different integration methods with various numbers of overlapping disease-associated OTUs. The disease effect factor was set to 1.075. Columns represent different numbers of overlapping disease-associated OTUs in the training and test data; the larger the number, the more similar the two disease models are. When the number reaches 10, the two models are the same in the training and test data. All the experiments were repeated 100 times, and the AUC scores shown in the figure are the averages from the 100 trials.
S17 Realistic applications of merging and integration methods on multiple CRC metagenomic datasets and TB gene expression datasets using LASSO classifiers, with average AUC scores from six individual leave-one-dataset-out (LODO) experiments. A: Leave-one-dataset-out average AUC score comparisons among different methods in colorectal cancer metagenomic datasets. B: Leave-one-dataset-out average AUC score comparisons among different methods in tuberculosis gene expression datasets. The results by different methods are grouped into six groups. "Single learner": each of the five training datasets was trained independently with a LASSO classifier and predicted on the test dataset, and the average AUC score was then taken among the five predictions. "Merged": merging method pooling all five training datasets into one training dataset. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the five training predictors were integrated by ensemble weighted learning methods under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the five training predictors were integrated by rank aggregation methods under the ComBat normalization setting. The red dots and associated values in the figure are the mean AUC scores for each method, the vertical bars are the median AUC scores for each method, and the black dots represent the outliers. The same method under different settings is shown with the same boxplot color. All the experiments were repeated 30 times for each test dataset, and the results presented in the figure were based on the average AUC scores of the 180 total replications over the six test datasets for the metagenomic and gene expression studies, respectively.
S18 Realistic applications of merging and integration methods on multiple CRC metagenomic datasets and TB gene expression datasets using SVM classifiers, with average AUC scores from six individual leave-one-dataset-out (LODO) experiments. A: Leave-one-dataset-out average AUC score comparisons among different methods in colorectal cancer metagenomic datasets. B: Leave-one-dataset-out average AUC score comparisons among different methods in tuberculosis gene expression datasets. The results by different methods are grouped into six groups. "Single learner": each of the five training datasets was trained independently with an SVM classifier and predicted on the test dataset, and the average AUC score was then taken among the five predictions. "Merged": merging method pooling all five training datasets into one training dataset. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the five training predictors were integrated by ensemble weighted learning methods under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the five training predictors were integrated by rank aggregation methods under the ComBat normalization setting. The red dots and associated values in the figure are the mean AUC scores for each method, the vertical bars are the median AUC scores for each method, and the black dots represent the outliers. The same method under different settings is shown with the same boxplot color. All the experiments were repeated 30 times for each test dataset, and the results presented in the figure were based on the average AUC scores of the 180 total replications over the six test datasets for the metagenomic and gene expression studies, respectively.
S19 Realistic applications of merging and integration methods on multiple CRC metagenomic datasets and TB gene expression datasets using XGBoost classifiers, with average AUC scores from six individual leave-one-dataset-out (LODO) experiments. A: Leave-one-dataset-out average AUC score comparisons among different methods in colorectal cancer metagenomic datasets. B: Leave-one-dataset-out average AUC score comparisons among different methods in tuberculosis gene expression datasets. The results by different methods are grouped into six groups. "Single learner": each of the five training datasets was trained independently with an XGBoost classifier and predicted on the test dataset, and the average AUC score was then taken among the five predictions. "Merged": merging method pooling all five training datasets into one training dataset. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the five training predictors were integrated by ensemble weighted learning methods under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the five training predictors were integrated by rank aggregation methods under the ComBat normalization setting. The red dots and associated values in the figure are the mean AUC scores for each method, the vertical bars are the median AUC scores for each method, and the black dots represent the outliers. The same method under different settings is shown with the same boxplot color. All the experiments were repeated 30 times for each test dataset, and the results presented in the figure were based on the average AUC scores of the 180 total replications over the six test datasets for the metagenomic and gene expression studies, respectively.

S20 Realistic applications of merging and integration methods on multiple colorectal cancer metagenomic datasets using LASSO classifiers from six individual leave-one-dataset-out (LODO) experiments. The results by different methods are grouped into six groups. "Single learner": each of the five training datasets was trained independently with a LASSO classifier and predicted on the test dataset, and the average AUC score was then taken among the five predictions. "Merged": merging method pooling all five training datasets into one training dataset. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the five training predictors were integrated by ensemble weighted learning methods under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the five training predictors were integrated by rank aggregation methods under the ComBat normalization setting. The red dots and associated values in the figure are the mean AUC scores for each method, the vertical bars are the median AUC scores for each method, and the black dots represent the outliers. The same method under different settings is shown with the same boxplot color. All the experiments were repeated 30 times for each test dataset, and the results presented in the figure were based on the average AUC scores of the 180 total replications for the six test datasets.
S21 Realistic applications of merging and integration methods on multiple tuberculosis genomic datasets using LASSO classifiers from six individual Leave-onedataset-out(LODO) experiments. The results by different methods are grouped into six
groups. "Single learner": Each of the five training datasets were trained independently
with RF classifier and predicted on the test dataset, then the average AUC score was taken
among the five predictions. "Merged": Merging method with pooling all five training
datasets into one training data. The "Single learner" and "Merged" experiments were
conducted under both naive and ComBat normalization settings. "Ensemble learning":
The five training predictors were integrated by ensemble weighted learning methods
under naive setting. "Ensemble learning (normalized)": The five training predictors were
integrated by ensemble weighted learning methods under ComBat normalization setting.
"Rank aggregation": The five training predictors were integrated by rank aggregation
methods under naive setting. "Rank aggregation (normalized)": The five training predictors were integrated by rank aggregation methods under ComBat normalization setting.
The red dots and associated values on the figure are the mean AUC scores for each
method, the vertical bars are the median AUC scores for each method, while the black
dots represent the outliers. Same method under different settings are represented in the
same color of boxplots. All the experiments were repeated 30 times for each test dataset,
and the results presented in the figure were based on the average AUC scores of the total
180 replications for the six test datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
S22 Realistic applications of merging and integration methods on multiple colorectal cancer metagenomic datasets using SVM classifiers from six individual Leave-one-dataset-out (LODO) experiments. The results by different methods are grouped into six groups. "Single learner": each of the five training datasets was trained independently with an SVM classifier and used to predict on the test dataset; the average AUC score was then taken over the five predictions. "Merged": the merging method, pooling all five training datasets into one training dataset. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the five training predictors were integrated by ensemble weighted learning methods under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the five training predictors were integrated by rank aggregation methods under the ComBat normalization setting. The red dots and associated values in the figure are the mean AUC scores for each method, the vertical bars are the median AUC scores, and the black dots represent outliers. The same method under different settings is shown in boxplots of the same color. All the experiments were repeated 30 times for each test dataset, and the results presented in the figure are based on the average AUC scores over the 180 total replications for the six test datasets.
S23 Realistic applications of merging and integration methods on multiple tuberculosis genomic datasets using SVM classifiers from six individual Leave-one-dataset-out (LODO) experiments. The results by different methods are grouped into six groups. "Single learner": each of the five training datasets was trained independently with an SVM classifier and used to predict on the test dataset; the average AUC score was then taken over the five predictions. "Merged": the merging method, pooling all five training datasets into one training dataset. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the five training predictors were integrated by ensemble weighted learning methods under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the five training predictors were integrated by rank aggregation methods under the ComBat normalization setting. The red dots and associated values in the figure are the mean AUC scores for each method, the vertical bars are the median AUC scores, and the black dots represent outliers. The same method under different settings is shown in boxplots of the same color. All the experiments were repeated 30 times for each test dataset, and the results presented in the figure are based on the average AUC scores over the 180 total replications for the six test datasets.
S24 Realistic applications of merging and integration methods on multiple colorectal cancer metagenomic datasets using XGBoost classifiers from six individual Leave-one-dataset-out (LODO) experiments. The results by different methods are grouped into six groups. "Single learner": each of the five training datasets was trained independently with an XGBoost classifier and used to predict on the test dataset; the average AUC score was then taken over the five predictions. "Merged": the merging method, pooling all five training datasets into one training dataset. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the five training predictors were integrated by ensemble weighted learning methods under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the five training predictors were integrated by rank aggregation methods under the ComBat normalization setting. The red dots and associated values in the figure are the mean AUC scores for each method, the vertical bars are the median AUC scores, and the black dots represent outliers. The same method under different settings is shown in boxplots of the same color. All the experiments were repeated 30 times for each test dataset, and the results presented in the figure are based on the average AUC scores over the 180 total replications for the six test datasets.
S25 Realistic applications of merging and integration methods on multiple tuberculosis genomic datasets using XGBoost classifiers from six individual Leave-one-dataset-out (LODO) experiments. The results by different methods are grouped into six groups. "Single learner": each of the five training datasets was trained independently with an XGBoost classifier and used to predict on the test dataset; the average AUC score was then taken over the five predictions. "Merged": the merging method, pooling all five training datasets into one training dataset. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the five training predictors were integrated by ensemble weighted learning methods under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the five training predictors were integrated by rank aggregation methods under the ComBat normalization setting. The red dots and associated values in the figure are the mean AUC scores for each method, the vertical bars are the median AUC scores, and the black dots represent outliers. The same method under different settings is shown in boxplots of the same color. All the experiments were repeated 30 times for each test dataset, and the results presented in the figure are based on the average AUC scores over the 180 total replications for the six test datasets.
S26 AUC results of Scenario 1: Divergent training and test populations with varied genomic feature distribution, with constant parameter C = 10^5 during Dirichlet simulations. The figures display the AUCs of RF predictions employing various integration methods with three distinct disease effect factors. Columns represent different values of α. Method names without a "ComBat" suffix refer to those implemented in the naive setting, while those with a "ComBat" suffix were conducted in the ComBat normalization setting. All the experiments were conducted 100 times, and the AUC scores presented in the figure represent the averages from these 100 trials. Abbreviations: Ensem.learn. - Ensemble learning; Rank aggr. - Rank aggregation; norm. - normalization.
S27 AUC results of Scenario 2: Different batch effects on training data with consistent underlying population genomic feature distribution, with constant parameter C = 10^5 during Dirichlet simulations. The figures show the AUC scores of RF prediction using different integration methods with varying severity levels of batch effects. A: AUC score comparisons with different severity levels of additive batch effects on the mean of OTU abundances, with no multiplicative batch effect on the variance. B: AUC score comparisons with different severity levels of multiplicative batch effects on the variance of OTU abundances, with no additive batch effect on the mean. The disease effect factor was set to 1.025 for both scenarios. The integration methods without a "ComBat" suffix were applied in the naive setting, while those with a "ComBat" suffix were applied after ComBat normalization. The experiments were repeated 100 times, and the AUC scores shown are the averages across the 100 trials. Abbreviations are the same as in Figure S26.
S28 AUC results of Scenario 3: Different phenotype-associated feature models in different studies, with constant parameter C = 10^5 during Dirichlet simulations. The figures show the AUCs of RF prediction using different integration methods with various numbers of overlapping disease-associated OTUs. The disease effect factor was set to 1.075. Columns represent different numbers of overlapping disease-associated OTUs in the training and test data; the larger the number, the more similar the two disease models are. When the number reaches 10, the two models are identical in the training and test data. All the experiments were repeated 100 times, and the AUC scores shown in the figure are the averages from the 100 trials. Abbreviations are the same as in Figure S26.
Abstract
The human microbiome, an intricate ecosystem of microorganisms within and on us, significantly influences our health and disease states. The emergence of next-generation sequencing (NGS) technologies
has broadened our understanding, providing a comprehensive perspective on the complex interplay between microbes and humans. This understanding is further enriched by the evolution of bioinformatics
tools, deepening our grasp on the functional roles and relationships within the microbiome. However,
challenges, especially in predicting sample phenotypes across varied studies due to dataset heterogeneity,
remain.
In the first part of this dissertation, we embarked on a colorectal cancer (CRC) - microbiome predictive
analysis using extensive metagenomic datasets. Our exploration spanned various microbial taxonomic
profiling tools to extract relative abundances of recognized microbial genomes. We also integrated mapping and assembly techniques to capture both known and novel genomes’ relative abundance profiles.
Utilizing the random forests (RF) classification algorithm, we aimed to predict colorectal cancer status
based on these microbial profiles. Our findings indicated superior prediction performance using profiles
estimated by Centrifuge compared to those by MetaPhlAn2 and Bracken. Additionally, we introduced
a novel method integrating the relative abundance profiles of both known and novel microbial entities,
enhancing the CRC prediction from metagenomes.
In the second part, our focus shifted to disease case/control phenotype prediction by integrating multiple heterogeneous studies. We implemented a comprehensive workflow to simulate diverse heterogeneity
types and assess various integration methods, emphasizing batch normalization using ComBat. Applying
our methods to six CRC metagenomic studies and six tuberculosis (TB) gene expression studies, we demonstrated the critical role of batch normalization in enhancing phenotype prediction performance by machine
learning classifiers across multiple heterogeneous datasets. Alongside normalization, both the merging strategy and ensemble weighted learning methods emerged as potent tools to bolster prediction performance.
We also posited rank aggregation methods as a robust alternative, showcasing resilience akin to ensemble
weighted learning approaches.
In the final part, we extended our investigation from the second part to quantitative phenotype prediction. We evaluated a range of normalization and integration strategies to counteract heterogeneity effects
on cross-study quantitative phenotype predictions. Through simulations and real-world data applications,
we established that combining normalization with integration effectively neutralizes heterogeneity stemming from batch effects in similar populations, thereby refining prediction accuracy.
Our research underscores the pivotal role of addressing study heterogeneity to ensure robust predictions. These insights pave the way for future endeavors in microbiome-based disease prediction and
treatment.
Chapter 1
Introduction
The human body, a marvel of biological complexity, is not just a singular entity but a vast ecosystem
teeming with trillions of microorganisms. This intricate community, known as the “human microbiome”,
comprises bacteria, viruses, fungi, archaea, and eukaryotes that reside both within and on our bodies [1,
2]. These microorganisms, found in diverse niches like the gut, skin, mouth, and respiratory tract, play a
fundamental role in shaping our health and disease states [1, 3, 4].
Historically, the relationship between humans and microbes was primarily viewed through the lens
of pathogenicity. However, with the development of advanced sequencing technologies and large-scale
projects like the Human Microbiome Project [5], our understanding has been revolutionized. We now
recognize that while certain microbes can cause diseases, a significant proportion are either benign or
beneficial, actively participating in vital physiological processes such as digestion, metabolism, immune
modulation, and even behavioral influences [1, 3].
The microbiome’s composition and diversity are dynamic, influenced by a myriad of factors like genetics, diet, age, environment, and antibiotic usage [6, 7]. Disruptions in this microbial equilibrium, termed
“dysbiosis", have been implicated in a range of health issues. These span from gastrointestinal disorders
like inflammatory bowel disease to systemic conditions like obesity, diabetes, and even certain cancers [1,
2, 8]. For instance, specific bacterial species in the gut have been shown to either elevate or reduce the risk
of certain cancers, potentially influencing tumor initiation and progression [8].
The potential therapeutic implications of understanding the human microbiome are vast. Modulating
the microbiome’s composition, either by introducing beneficial microbes or altering existing populations,
offers promising strategies for disease prevention and treatment [2, 8]. As research continues to delve
deeper into the myriad ways the microbiome influences our health, it becomes evident that this microbial
community is not just a passive resident but an active participant in our well-being.
The exploration of the human microbiome has been profoundly influenced by the technological advancements of recent decades, particularly in the realm of sequencing technology. Historically, our grasp
of the microbiome was hindered by the constraints of 19th-century culture-based methods [9], which could
only discern a subset of microbial entities. This left researchers with a fragmented view, able to delve into
specific microbes but lacking a holistic understanding of the entire microbiome and the intricate dynamics
between microbes and their human hosts.
The mid-20th century marked a transformative period with the advent of sequencing technologies. The
pioneering Sanger sequencing [10], emblematic of the first generation of sequencing techniques, played
a pivotal role in the iconic Human Genome Project [5]. However, its inherent limitations in throughput,
sensitivity, and scalability soon became apparent. The subsequent emergence of next-generation sequencing (NGS) technologies provided a much-needed solution to these challenges, offering a cost-effective,
high-throughput approach that democratized microbiome research. These advancements in NGS have not
only facilitated a broader detection of microbial entities but have also deepened our understanding of their
functional roles and intricate relationships within the host.
The advent of NGS technologies has not only enabled the generation of vast amounts of sequencing
data but has also necessitated the development of a plethora of bioinformatics tools to decipher this data.
These tools are designed to extract meaningful insights from the raw sequencing outputs, transforming
them into actionable biological knowledge. For instance, Centrifuge [11] is a versatile tool designed for
rapid and accurate classification of DNA sequences from microbial samples, leveraging a compressed index of a large reference genome database. MetaPhlAn2 [12], on the other hand, specializes in profiling the
composition of microbial communities from metagenomic shotgun sequencing data using unique clade-specific marker genes. Kraken2 [13] is another powerful tool that assigns taxonomic labels to short DNA
reads with high precision by checking them against a reference database, and its companion tool Bracken
[14] estimates the abundance of species in a sample, providing a more detailed view of its microbial composition. The emergence of such tools underscores the importance of bioinformatics in harnessing the full
potential of NGS technologies and offers researchers a suite of options for diverse analytical needs.
With the advent of NGS technology, coupled with a plethora of bioinformatics tools, predicting sample
phenotypes has emerged as a prominent downstream scientific task. Yet, this task is not without its challenges. A notable issue arises during cross-study phenotype predictions, especially when training a machine learning model on one dataset and applying it to another. The heterogeneity between these datasets,
whether stemming from population differences or batch effects due to varying sequencing platforms, can
significantly impact the reproducibility of the machine learning model, thereby hindering effective cross-study phenotype predictions. Another challenge lies in the optimal utilization of newly identified microbial organisms in phenotype predictions. A large proportion of metagenomic shotgun reads remains unmapped,
primarily due to the absence of a database of all genomes [15, 16, 17]. However, meta-analyses of clinical
gut microbiomes have revealed associations between novel microbial organisms and a myriad of diseases,
suggesting that incorporating these novel organisms could enhance prediction accuracy [18]. In this dissertation, we delve into our efforts to pinpoint and address these challenges in human microbiome-phenotype
prediction analysis, harnessing the power of computational and statistical methodologies.
1.1 Research problems in colorectal cancer status prediction using microbiome
abundance profiles
Predicting disease status based on human microbiome analysis has become an increasingly popular area of
research, given the profound influence of the microbiome on human health. The intricate composition of
the gut microbiome has been associated with a myriad of diseases, and understanding these associations
can provide valuable insights into disease etiology and progression [19]. With the surge in microbiome-related studies, there is a growing emphasis on leveraging machine learning techniques to predict diseases
based on microbiome profiles. Previous studies found strong associations between the abundance levels
of some microbial organisms with various diseases, such as rheumatoid arthritis [20], diabetes [21, 22],
inflammatory bowel disease [23, 24], and colorectal cancer [25]. These findings provide critical information
towards understanding the potential roles of the human microbiome in disease developments.
Chapter 2 delves into our research on predicting colorectal cancer (CRC) phenotypes through microbiome analysis. CRC is a leading cause of cancer-related morbidity and mortality globally. The intricate relationship between the human gut microbiome and CRC has garnered significant attention in recent years. Dysbiosis, or imbalances in the gut microbiota, has been implicated in the
pathogenesis of CRC, with certain microbial species and their metabolites playing pivotal roles in disease
onset and progression [26, 27]. For instance, alterations in microbial taxa and their associated metabolites,
such as butyrate-related colonic metabolites, have been observed in CRC patients, suggesting a potential mechanistic link between the microbiome and CRC carcinogenesis [26]. Furthermore, studies have
shown significant differences in microbial assemblages between tumor and adjacent normal tissues, as
well as between distal and proximal CRC segments [28]. The gut microbiome’s influence extends beyond
just the presence or absence of specific microbes; their metabolites, many of which are products of sulfur
metabolism, have been associated with CRC, indicating a complex interplay between microbial metabolism
and cancer pathogenesis [27]. Recent advancements in the field have also emphasized the potential of the
gut microbiome as a diagnostic and therapeutic target, with the hope of paving the way for more personalized medicine approaches for CRC patients [29].
A pivotal aspect of CRC-microbiome studies is predicting a sample’s disease status from its microbiome
profile. A microbiome profile is typically represented by a microbial abundance matrix, where each row
denotes a sample and each column an operational taxonomic unit (OTU). OTUs, foundational in microbial
ecology, categorize and analyze microbial sequences based on similarity. They are commonly defined
by clustering sequences with a set similarity level, often a 97% threshold for 16S rRNA gene sequences
[30]. This methodology enables the grouping and analysis of microbial entities without specific taxonomic
assignments. With the rise of next-generation sequencing (NGS) technologies, numerous metagenomic profiling
tools have been developed for microbiome studies. Chapter 2 evaluates various metagenomic profiling
tools in generating microbiome OTU abundances and their implications for CRC status prediction.
OTU abundances are typically estimated using bioinformatics tools, either through reference-based methods like Centrifuge [11] or de-novo assembly methods like MEGAHIT [31]. De-novo assembly holds an advantage over reference-based methods in that it can recover OTUs absent from reference genome databases, enhancing the potential for downstream prediction. Chapter 2 also investigates the benefits of
incorporating these newly identified microbiome OTUs using a tool named MicroPro [32] in CRC status
classification.
Supervised machine learning models are standard for sample classification. In our CRC disease status
prediction, the goal is to categorize a sample as either healthy or CRC-afflicted. Conventional supervised
models, including random forests (RF) [33], support vector machine (SVM) [34], and LASSO [35], are
frequently employed for such classification tasks. In Chapter 2, we juxtapose the performance of RF,
LASSO, and SVM concerning CRC disease status prediction accuracy.
1.2 Research problems in phenotype predictions integrating multiple
omics studies
The prediction of phenotypes from genotypes is an important area of research, with one of the central
challenges being the enhancement of prediction generalizability. A prevalent strategy to address this is
the integration of multiple studies, aiming to bolster the machine learning model’s generalizability.
The integration of multiple studies in phenotype predictions is of paramount importance in the realm
of biomedical research. As biological systems are inherently complex, understanding the intricate interplay
between various factors necessitates a comprehensive approach that draws from diverse datasets. By
integrating multiple studies, researchers can harness a broader spectrum of data, enhancing the robustness
and generalizability of their findings [36]. This is particularly crucial in the context of complex traits
and diseases, where individual studies might capture only a fraction of the underlying variability and
complexity [37].
Moreover, the integration of multiple studies can help overcome the limitations associated with individual datasets, such as sample size constraints, population-specific biases, and methodological differences
[38]. For example, in genetics, individual genome-wide association studies (GWASs) have pinpointed a
plethora of genetic markers linked to diverse phenotypes. However, these often account for only a fraction of the heritability. By pooling results from several GWASs, there is an increased capacity to identify
nuanced genetic effects, offering a more holistic perspective on the genetic framework of intricate traits
[39]. Similarly, in metagenomics, numerous studies have demonstrated enhanced CRC status prediction
performance when machine learning models are trained on datasets amalgamated from multiple studies
and then applied to a separate study. This performance boost is in contrast to models trained solely on a
single study and then applied to another [40, 41].
However, the fusion of multiple studies in phenotype predictions, while promising, is fraught with
challenges. A primary obstacle is the variability across studies, stemming from differences in study design,
population demographics, experimental protocols, and data acquisition methods [38]. This heterogeneity
can introduce biases and confounders, making it challenging to draw consistent and reliable conclusions
from integrated datasets.
Another challenge is the high dimensionality and sparsity of omics data, notably within the microbiome field, which can lead to issues like overfitting, especially when integrating data from multiple omics
layers [36]. The curse of dimensionality can make it difficult to identify true associations amidst the noise,
necessitating the application of robust statistical and computational methods to extract meaningful insights.
Furthermore, the integration of multiple studies requires careful consideration of data normalization
and transformation methods to ensure that data from different sources are comparable. This is particularly
crucial when integrating data from different omics platforms, as each platform might have its own set of
biases and technical variations.
Lastly, while integrating multiple studies can increase the sample size and statistical power, it also
amplifies the risk of batch effects, which can obscure true biological signals [42]. Addressing these batch
effects requires sophisticated computational methods and a deep understanding of the underlying biology.
In conclusion, while the integration of multiple studies holds immense potential for advancing our
understanding of complex phenotypes, it also presents challenges that require careful consideration and
methodological precision.
To tackle the challenges associated with integrating multiple omics studies for phenotype prediction,
we delineated several approaches in Chapter 3 and Chapter 4. Initially, we employed the batch normalization method to counteract the heterogeneity across different studies. Specifically designed to adjust data,
this method minimizes disparities between batches, ensuring that the amalgamated dataset reflects the
genuine biology rather than mere technical artifacts [43]. In our research, we extended the application of
batch normalization not just to address batch effects but also to mitigate other sources of heterogeneity,
thereby broadening its potential utility. Beyond batch normalization, we also ventured into various integration methods. These methods amalgamate phenotype prediction results from different studies using
advanced statistical and computational techniques. Furthermore, we delved into the potential benefits of
concurrently applying both batch normalization and integration methods to enhance phenotype prediction
accuracy.
Numerous integration methods have been proposed to combine multiple studies for phenotype predictions. One prevalent approach is the merging method, which directly consolidates datasets from various
studies into a singular dataset. While straightforward, this method requires careful preprocessing and normalization to ensure that the merged data is consistent and comparable across studies [44]. Owing to the
augmented sample size and diversity in the study population, the merging method has often outperformed
predictions based solely on individual studies [40, 41, 45].
Another widely adopted integration technique is ensemble weighted learning. Ensemble methods train
a multitude of models and subsequently aggregate their predictions. When integrating multiple studies,
ensemble weighted learning proves especially potent. Each study can be perceived as a distinct data source,
allowing for individual models to be trained on each dataset. The ensuing predictions from these models
can be integrated using weighted aggregation, with the weights being ascertained based on the reliability
or pertinence of each study [46, 47].
Beyond these conventional integration approaches, we introduced rank aggregation methods tailored
for phenotype predictions. Rank aggregation amalgamates ranked lists from various studies to formulate a
consensus ranking. This is especially beneficial when the primary outcome is a ranking rather than a scalar
value. For example, in gene function prediction, rank aggregation can merge ranked gene lists associated
with a specific function from diverse studies [48]. However, when it comes to phenotype predictions, the
desired outcome is typically a predicted probability of disease status or a continuous phenotype value
generated by a trained machine learning model. Our proposition hinges on the assumption that while the
predicted probability for test data samples might be imprecise, the relative order could still be informative.
In such scenarios, aggregating ranks, rather than predicted probabilities, might be more judicious. To
our knowledge, no existing studies have delved into rank aggregation methods specifically for omics data
phenotype prediction.
1.2.1 Disease status predictions
As detailed in Chapter 2, our prior research delved into CRC status prediction using microbiome abundance profiles. We demonstrated that the prediction performance concerning CRC disease status could
be enhanced through various methods. This enhancement could be achieved either by employing distinct
microbial taxonomic profiling tools or by incorporating novel microbiomes into the predictions. During
this study, we observed varying prediction accuracies when training machine learning models on specific studies and subsequently applying them to others in a cross-cohort analysis. We recognized that this
discrepancy stemmed from the inherent heterogeneity across different studies.
In Chapter 3, we presented our exploration into the effects of study heterogeneity on downstream
disease status prediction accuracy. We investigated three different sources of heterogeneity: different populations, different sequencing batches, and different disease models. For this endeavor, we
undertook both simulation studies and applied real-world data analyses on CRC metagenomic datasets
as well as tuberculosis (TB) gene expression datasets. Our objective was to discern the optimal strategies
for integrating various studies of the same data type under diverse heterogeneities. We gauged the efficacy
of several ensemble weighted learning algorithms and introduced the potential of rank aggregation algorithms in phenotype predictions. Additionally, we experimented with batch normalization techniques on
the data to explore its viability in managing multiple studies marked by heterogeneities.
1.2.2 Quantitative phenotype predictions
Predicting quantitative phenotypes, such as continuous measures like body mass index (BMI) or blood
sugar concentrations, remains a challenging frontier in microbiome research. One of the primary reasons for this challenge is the inherent variability in the microbiome across individuals, coupled with the
constraints of current microbial quantification techniques. For instance, 16S rRNA sequencing, a widely
used method, has been criticized for its limitations in both precision and repeatability [49, 50]. Despite
these challenges, understanding the relationship between the microbiome and quantitative phenotypes
is pivotal for comprehensive insights into human health. A notable study from Japan by Nishijima et al.
[51] observed distinct variations in gut microbiota compositions across varying BMI categories. Similarly,
Zeevi et al. [52] highlighted the intricate role of the gut microbiome in modulating post-meal blood sugar
levels, emphasizing the microbiome’s influence on metabolic responses. These studies underscore the importance of delving deeper into the associations between the microbiome and quantitative health metrics,
paving the way for a more holistic understanding of the symbiotic relationship between humans and their
resident microbes.
Building upon our exploration of disease status predictions, we delved deeper into the realm of integrating multiple omics studies for quantitative phenotype predictions. Drawing parallels with the methodologies employed in Chapter 3, we meticulously examined the same three sources of heterogeneity and
assessed their implications for quantitative phenotypes. To achieve this, we embarked on a comprehensive simulation study, generating four distinct types of quantitative phenotypes in relation to simulated
microbial abundance profiles. These encompassed linear, inverse, logarithmic, and quadratic associations.
Subsequently, we applied our methodologies to real-world data, aiming to predict BMI using microbial
abundance profiles. This endeavor provided invaluable insights into the intricate relationships between the
microbiome and host phenotypic traits. Furthermore, we expanded our scope to encompass cross-study
survival predictions, utilizing gene expression data from ovarian cancer studies. To the best of our knowledge, our research stands as the first effort centered on integrating heterogeneous studies in the context of quantitative phenotype predictions.
Chapter 2
Increasing prediction performance of colorectal cancer disease status
using random forests classification based on metagenomic shotgun
sequencing data
2.1 Introduction
The human microbial community is an aggregate of bacteria, archaea, viruses, plasmids, eukaryotes, and other entities that reside on or within human body sites. According to the Human Microbiome Project, these microorganisms outnumber human cells by roughly ten to one, but make up only 1-3 percent of human body mass due to their tiny sizes [5]. The human microbiome has been demonstrated to play important roles in metabolic
functions, immune system processes and other physiological activities [20]. Previous studies found strong
associations between the abundance levels of some microbial organisms with various diseases, such as
rheumatoid arthritis [20], diabetes [21, 22], inflammatory bowel disease [23, 24], and colorectal cancer
[25]. These findings provide critical information towards understanding the potential roles of the human
microbiome in disease developments.
Colorectal cancer (CRC) is the third most common cancer worldwide. In the USA alone, close to 137,000 people are diagnosed with CRC and 50,000 people die from it annually [53]. Several studies have shown that heredity, family medical history [54, 55], diseases like inflammatory bowel disease [56] and diabetes [57], and behavioral factors including alcohol consumption [58], smoking [59], and obesity [60] are associated with CRC development. In addition to these risk factors, other studies have established associations of the
human gut microbiome with CRC [25, 40, 41, 61, 62, 63, 64]. Compositional alterations of bacteria genera
like Fusobacterium [65] and species like Escherichia coli [66] and Bacteroides fragilis [67] were shown to
be associated with the development of CRC. In addition to independent investigations, several cross-study
analyses were conducted and many reproducible microbiome biomarkers were found through comparisons
of various CRC datasets collected by different research groups [40, 41].
An important scientific task in CRC-microbiome association studies is prediction, namely predicting a
sample’s disease status based on its microbiome profile. A general workflow is to first separate the data set
into training set and testing set, then build a machine learning predictive model based on input features
using the training set, and finally evaluate the model performance on the testing set. In the context of CRC-microbiome association studies, the input features are usually microbial species abundances obtained from
sequence alignments against a microbial reference database and the outcome to be predicted is disease
or normal. In previous investigations, several machine learning models were successfully applied to the
CRC-microbiome association study, including LASSO [25, 41], random forests [40, 62, 63], neural network
[68], etc.
However, these studies were mostly focused on revealing the link between CRC and known microbial
organisms with reference genomes in databases such as NCBI RefSeq [69]. The role of unknown (novel) microbial organisms is mostly neglected in these studies due to the lack of reference genomes. Studies
have shown that about 40–50% of metagenomic shotgun reads cannot be mapped to known genomes, and
thus current reference-based analyses usually do not take these unmapped reads into consideration [15, 16].
Even though this number has been greatly improved by large-scale metagenomic assembly efforts in recent
years with the possibility of the mapping rate to be over 80% [17], a sizable fraction of reads still cannot
be mapped to these databases. Meta-analyses of clinical gut microbiomes have shown associations of novel microbial organisms with numerous diseases, indicating the potential of improving the prediction accuracy by including novel microbial organisms [18].
Zhu et al. [32] developed a metagenomic predictive pipeline, MicroPro, that used information from
both the known and novel microbial organisms. After the characterization of known microbial organisms
by sequence alignment to known genomes, MicroPro assembled the unmapped reads pooled from all the
samples into contigs. Then, it clustered all the assembled contigs into bins so that each bin was considered
as a novel organism. In this way, the bins could serve as the reference database for the novel organisms and
contribute to further predictive analysis. However, the weight assignment of known and novel organisms
in the classification model remains a challenge, since they may not contribute equally to the prediction for
different datasets.
To deal with the challenges mentioned above, we developed a new colorectal cancer predictive pipeline
that incorporates both known and novel microbial organisms by a leave-one-sample-out (LOSO) model
stacking method. We demonstrated that our pipeline was able to achieve significantly higher prediction
performance when compared with an existing study [40].
2.2 Materials and Methods
2.2.1 Workflow of the CRC-microbiome prediction analysis
To predict the CRC disease status, we built a workflow as shown in Figure 2.1. First, we ran MicroPro
[32] on each of the input datasets and generated abundance tables for both known and novel microbial
organisms. We then renormalized known and novel microbial species abundance tables so that the relative
abundance levels of each sample summed up to 1 for both known and novel species. After the renormalization, we separated the dataset into training and testing sets, trained two random forests models on the
training set using the relative abundance levels of known and novel species, respectively, and applied the
trained models to the testing set to derive predictive probabilities. We then used a leave-one-sample-out
(LOSO) model stacking method to incorporate both predictive probabilities from known and novel models
to obtain the overall AUC score. We applied the workflow to six publicly available metagenomic CRC
datasets [25, 40, 61, 62, 63, 64] from different populations as described in Table 2.1.
Figure 2.1: Workflow of the CRC-microbiome prediction analysis. MicroPro is applied to the input
metagenomic dataset to characterize known and novel abundance tables. Two random forests classifiers
are trained using known and novel abundances, respectively. LOSO model stacking is used to incorporate
predictive probabilities from both random forests classifiers and applied to the independent test metagenomic dataset to derive the prediction performance in terms of the AUC score. LOSO: leave-one-sample-out; AUC: area under the receiver operating characteristic curve.
Table 2.1: Six metagenomic datasets related to colorectal cancer.
Dataset     Country       No. of cases   No. of controls   Reference
Zeller      France        91             93                [25]
Thomas      Italy         61             52                [40]
Yu          China         74             54                [61]
Hannigan    USA/Canada    27             28                [62]
Feng        Austria       46             63                [63]
Vogtmann    USA/Canada    52             52                [64]
2.2.2 Colorectal cancer metagenomic datasets
We analyzed six publicly available and geographically diverse colorectal cancer metagenomic datasets with
download links from their original papers [25, 40, 61, 62, 63, 64]. We excluded the samples from patients
with adenoma so that only samples from patients diagnosed with CRC and healthy controls were used.
The numbers of cases and controls for each dataset are shown in Table 2.1. In total, there are 351 CRC samples and 342 healthy controls. The proportion of CRC samples in each dataset is close to 0.5 according to Table 2.1, which indicates that the datasets used in our study are balanced.
2.2.3 Generating known and novel microbial relative abundance profiles by MicroPro
We used MicroPro [32] to generate known and novel microbial abundance profiles for the six CRC datasets.
There are three main steps in the MicroPro pipeline: (1) characterization of the known microbial species
abundance through sequence alignments against reference genomes, (2) extraction of novel microbial
abundances by an assembly-binning-based algorithm, and (3) machine learning predictive analysis.
In our study, we needed to generate microbial abundances for both within-dataset and cross-dataset analyses. For the within-dataset analysis, we ran the MicroPro pipeline directly and generated both known and novel abundance tables. For the cross-dataset analysis, the known microbial abundances were generated in the same way as in the within-dataset analysis. To derive abundances of novel organisms for the testing dataset
in the cross-dataset setting, we treated the novel metagenomic bins of the training dataset as the reference
database, mapped the unmapped reads of the testing dataset back to this reference, and calculated the novel
abundances based on the mapping results. We then renormalized the known and novel species abundance tables so that the relative abundance levels of each sample summed up to 1 for both known and novel species.
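As a minimal illustration, this renormalization step can be written in a couple of lines of R; `known_tbl` and `novel_tbl` are hypothetical samples-by-species abundance matrices standing in for the MicroPro outputs.

    # Renormalize so that each sample (row) sums to 1; rows that are all zero
    # would yield NaN and should be handled beforehand.
    renormalize <- function(tbl) sweep(tbl, 1, rowSums(tbl), "/")

    known_rel <- renormalize(known_tbl)  # known species, rows now sum to 1
    novel_rel <- renormalize(novel_tbl)  # novel species, rows now sum to 1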
2.2.4 Generating known microbial relative abundance profiles by other metagenomic
taxonomic profiling tools
In this study, we compared the performance of three different metagenomic taxonomic profiling tools
in terms of the prediction performance of colorectal cancer based on the generated known microbial profiles. The known microbial abundance profiles generated by MicroPro, described in Generating known and novel microbial relative abundance profiles by MicroPro, were treated as the known abundances generated by Centrifuge [11], since MicroPro directly uses Centrifuge to characterize known abundance profiles.
MetaPhlAn2 results were directly taken from Thomas et al. [40]. For generating known abundance profiles
from Bracken [14], we ran the Bracken pipeline on all FASTQ sequence files for the six datasets. The number of microbial species identified by Bracken was two to three times the number generated by MicroPro, so in order to decrease the running time, we filtered out microbial species that had nonzero abundances in fewer than 10% of all samples for each dataset.
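A sketch of this prevalence filter in R, assuming a hypothetical samples-by-species matrix `bracken_tbl`:

    # Keep only species with nonzero abundance in at least 10% of samples
    keep <- colMeans(bracken_tbl > 0) >= 0.10
    bracken_filtered <- bracken_tbl[, keep, drop = FALSE]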
2.2.5 Random forests CRC predictive models
In our study, we built random forests classification models to test the predictive power of the generated
known and novel microbial abundances on the CRC disease status. All the random forests analyses were
carried out with the ‘caret’ package in R [70]. We used two parameter settings for different purposes
of analysis. For comparison with the study conducted by Thomas et al. [40] as well as comparing the
performance of three different metagenomic taxonomic profiling tools, we used 1000 decision trees and
the ‘mtry’ parameter was tuned by 10-fold cross-validation. We then performed the predictive analysis in three settings: within-dataset, cross-dataset, and leave-one-dataset-out (LODO). For the rest of the analysis, we used 5000 decision trees, again with the ‘mtry’ parameter tuned by 10-fold cross-validation, and performed the predictive analysis in two settings: within-dataset and cross-dataset. We did not conduct the LODO analysis with 5000 decision trees due to the extremely long computing time.
For the within-dataset setting, we randomly split the samples in each dataset into 80% training set and
20% testing set. Then we trained a random forests model on the training data and applied it to the testing
data to derive the predictive probabilities for each testing sample. The process was repeated 30 times to obtain the average AUCs (area under the receiver operating characteristic curve).
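The within-dataset procedure can be sketched with standard 'caret' and 'pROC' calls. Here `abundance` (a samples-by-species matrix) and `status` (a factor with levels "healthy" and "CRC") are hypothetical objects, and the cross-validation control mirrors the description above; this is a sketch, not the exact script used.

    library(caret)
    library(pROC)

    idx  <- createDataPartition(status, p = 0.8, list = FALSE)   # 80/20 split
    ctrl <- trainControl(method = "cv", number = 10,
                         classProbs = TRUE, summaryFunction = twoClassSummary)
    rf_fit <- train(x = abundance[idx, ], y = status[idx],
                    method = "rf", metric = "ROC",
                    trControl = ctrl, ntree = 1000)  # 'mtry' tuned by 10-fold CV
    probs <- predict(rf_fit, abundance[-idx, ], type = "prob")[, "CRC"]
    auc(roc(status[-idx], probs, quiet = TRUE))      # AUC on the held-out 20%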
For the cross-dataset setting, one of the six datasets was treated as the training set and each of the remaining five was treated as the testing set in turn. A random forests model was trained on the training set and then applied to the testing set for prediction. When applying the trained model to the testing set, if a feature was missing from the testing matrix, we added this feature with 0 abundance in all samples, so that all features in the training data were present in the testing data as well.
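One possible implementation of this feature alignment, assuming the abundance tables are data frames:

    # Add zero-abundance columns for species present in training but absent
    # from testing, then reorder the testing columns to match training.
    align_features <- function(train_x, test_x) {
      for (sp in setdiff(colnames(train_x), colnames(test_x))) {
        test_x[[sp]] <- 0
      }
      test_x[, colnames(train_x), drop = FALSE]
    }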
For the LODO setting, one of the six datasets was selected as the testing set while the other five datasets together were treated as the training set. All the microbial species in the five training sets were used as features in the random forests model, and we added a 0-abundance column to the testing set if any feature in the training sets was missing from the testing set.
In terms of model evaluation, for the random forests model using only known abundances, AUC was used
as the performance measurement. An AUC score of 1 indicated a perfect prediction and 0.5 represented
a random guess. For the random forests model using both known and novel microbial abundances, we
first used a LOSO model stacking method to obtain weighted predictive probabilities and then derived the
final AUC scores. The details of the LOSO model stacking method are described in Leave-one-sample-out
(LOSO) model stacking method.
2.2.6 LASSO and SVM CRC predictive models
Microbial species abundance profiles were first log10-transformed, and a small constant (1e−10) was added to zero abundances to avoid the undefined value of log10(0), before training LASSO and SVM. In terms of hyperparameters, for LASSO we chose the regularization parameter that maximized AUC under 10-fold cross-validation on the training set, searched over the range 0 to 0.5 in steps of 0.001. For SVM, we used the Gaussian radial basis kernel and chose the regularization coefficient C that maximized AUC by 10-fold cross-validation from the default values of 0.25, 0.5, and 1. We used the ‘caret’ package [70] in R to implement both the LASSO and SVM algorithms. The training process was repeated 30 times for the within-dataset and cross-dataset settings, and 10 times for the LODO setting. The AUC scores reported are the averages over the 30 or 10 repetitions.
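Continuing the hypothetical `abundance`, `status`, and `ctrl` objects from the random forests sketch above, the preprocessing and the two tuning grids might look as follows; the fixed `sigma` for the radial kernel is our assumption, since the text only specifies the grid for C.

    x_log <- log10(abundance + 1e-10)   # small constant avoids log10(0)

    # LASSO: glmnet with alpha = 1; lambda searched over 0..0.5 in steps of 0.001
    lasso_fit <- train(x = x_log, y = status, method = "glmnet",
                       metric = "ROC", trControl = ctrl,
                       tuneGrid = expand.grid(alpha = 1,
                                              lambda = seq(0, 0.5, by = 0.001)))

    # SVM with a Gaussian radial basis kernel; C chosen from 0.25, 0.5, and 1
    svm_fit <- train(x = x_log, y = status, method = "svmRadial",
                     metric = "ROC", trControl = ctrl,
                     tuneGrid = expand.grid(sigma = 0.01, C = c(0.25, 0.5, 1)))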
2.2.7 Leave-one-sample-out (LOSO) model stacking method
In machine learning, ensemble methods that integrate predictions from multiple models are commonly
used to boost the prediction performance. According to previous studies [46, 71], ensemble methods are
often more accurate than the component methods that the ensemble methods contain. Among various
ensemble methods, model stacking is an efficient method that uses the predictions generated by other
machine learning algorithms as the inputs of a second layer algorithm, which then combines the input
predictions to form a new set of predictions. If we choose weighted linear combinations of predictors as the
second-layer algorithm, then we only need to determine the weight of each input predictor and combine
them to form an optimal predictor. Consider two studies S1 and S2. The disease statuses of all individuals
in study S1 and part of the individuals in study S2 are known. We have several predictors, $p_1, p_2, \ldots, p_K$,
based on study S1 and we are interested in linear combinations of these predictors to maximize the AUC
for the integrated predictor in study S2. We can use the information of the individuals with known disease
statuses in study S2 to decide the weight for each predictor. Since we want to maximize the AUC of the
integrated predictor, we should give higher weights to predictors with high AUCs and lower weights to
predictors with low AUCs according to the prediction results for individuals with known disease statuses.
Therefore, if the AUC for the k-th predictor is $AUC_k$, the weight for the k-th predictor is proportional to $\max(AUC_k - 0.5, 0)$, since the AUC of a random guess is 0.5.
Based on this idea, we developed a leave-one-sample-out (LOSO) model stacking method. The high-level idea of LOSO model stacking is to weight the predictive probabilities based on the known and novel microbial abundances by their respective prediction accuracies, without introducing information from the test sample being predicted.
For each sample i in the testing set, we obtained two predictive probabilities, $p_i^{known}$ and $p_i^{novel}$, by applying the random forests models trained using only the known or only the novel microbial abundances to the testing set, respectively. We then excluded sample i from the testing set and computed the AUC scores $AUC_i^{known}$ and $AUC_i^{novel}$ by comparing the remaining predictive probabilities with their true disease statuses. The LOSO model stacking method weights $p_i^{known}$ and $p_i^{novel}$ by $AUC_i^{known}$ and $AUC_i^{novel}$ after subtracting the background AUC score of 0.5. The detailed formula of LOSO model stacking is given in equation (2.1):

$$p_i = \frac{\max(AUC_i^{known} - 0.5,\ 0) \times p_i^{known} + \max(AUC_i^{novel} - 0.5,\ 0) \times p_i^{novel}}{\max(AUC_i^{known} - 0.5,\ 0) + \max(AUC_i^{novel} - 0.5,\ 0)} \tag{2.1}$$
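As a concrete illustration, the following R sketch implements equation (2.1). The AUC helper uses the rank-based (Mann-Whitney) formulation so no extra packages are needed; the fallback to a plain average when both weights are zero is our own assumption, since the formula leaves that case undefined.

    auc_score <- function(p, y) {  # y: 0/1 labels, p: predicted probabilities
      r <- rank(p)
      n_pos <- sum(y == 1); n_neg <- sum(y == 0)
      (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    }

    loso_stack <- function(p_known, p_novel, y) {
      n <- length(y)
      p_stacked <- numeric(n)
      for (i in seq_len(n)) {
        # Leave sample i out when estimating the weights of the two models
        w_known <- max(auc_score(p_known[-i], y[-i]) - 0.5, 0)
        w_novel <- max(auc_score(p_novel[-i], y[-i]) - 0.5, 0)
        if (w_known + w_novel == 0) {
          # Both models at chance level: plain average (our assumption)
          p_stacked[i] <- (p_known[i] + p_novel[i]) / 2
        } else {
          p_stacked[i] <- (w_known * p_known[i] + w_novel * p_novel[i]) /
                          (w_known + w_novel)
        }
      }
      p_stacked
    }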
2.2.8 Prediction performance measured by the area under the precision-recall curve
(AUPRC)
When comparing the prediction performance of using only known species with that of incorporating both known and novel species, we used an additional measure, AUPRC (area under the precision-recall curve). To calculate AUPRC scores with the R package PRROC [72], we took the predictive probabilities from the random forests classifiers with 5000 decision trees using known microbial species and novel microbial species, respectively.
In order to calculate the LOSO model stacking results, we first followed the same procedure as described in
Leave-one-sample-out (LOSO) model stacking method to generate the LOSO predictive probabilities using
equation (2.1), and then used these probabilities to calculate AUPRC scores.
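With the PRROC package, the AUPRC of a vector of predictive probabilities can be computed as sketched below; `probs` and `y` (0/1 labels) are the hypothetical objects from the sketches above.

    library(PRROC)

    # PRROC expects the scores of the positive (CRC) and negative (healthy)
    # classes separately
    pr <- pr.curve(scores.class0 = probs[y == 1],
                   scores.class1 = probs[y == 0])
    pr$auc.integral   # AUPRC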
2.3 Results
2.3.1 Known microbial relative abundance profiles extracted from Centrifuge yields
higher prediction performance than other metagenomic taxonomic profiling tools
Thomas et al. [40] conducted a CRC-microbiome predictive analysis on the same set of six public CRC
datasets described in Six metagenomic datasets related to colorectal cancer. In their study, they used
MetaPhlAn2 [12] to generate known microbial abundances, and then trained a random forests model using these abundances to predict the CRC disease status. In MicroPro, however, Centrifuge [11] was used
for known microbial abundance extraction. Both MetaPhlAn2 and Centrifuge have been commonly used
in metagenomic data analyses for metagenomic taxonomic profiling. Centrifuge is an alignment-based method that counts k-mer frequencies of raw reads and compares them with a compressed composite of reference genomes. MetaPhlAn2 relies on the identification of clade-specific marker genes in raw reads to estimate the abundance of each taxon. Centrifuge uses the Full-text index in Minute space (FM-index) to compress the reference genomes and removes redundancies to save storage space. Consequently, it allows
users to use NCBI RefSeq, which contains over 36.5 million sequences with a total of 109 billion base
pairs [11], as the reference database. On the other hand, MetaPhlAn2 profiles species-level composition of
microbial communities by using a reference genome database that contains over one million unique clade-specific marker genes identified from around 17,000 reference genomes [12]. Besides these two profiling
tools, according to a study conducted by Ye et al. [73], Bracken [14], an add-on tool based on Kraken [13]
with more accurate relative abundance quantification, is the top-performing profiling tool compared with other tools including MetaPhlAn2 and Centrifuge, in terms of more accurate species-level abundance estimates as well as time complexity. Both Kraken and Centrifuge are k-mer matching algorithms; while Kraken relies on a hash table for k-mers, Centrifuge uses an FM-index with within-species compression.
Also, Kraken assigns each read to exactly one taxonomy, while Centrifuge can provide multiple taxonomic
assignments per read. As an add-on tool for Kraken, Bracken further improves abundance estimation by probabilistically redistributing reads that Kraken assigns to higher taxonomic levels, yielding an estimate of the true abundance
profile. Tamames et al. [74] showed that different metagenomic analysis methods usually generate different taxonomic annotation results, which can impact the follow-up predictive analysis. Therefore, we
investigated the difference of Centrifuge, MetaPhlAn2 and Bracken in terms of the prediction performance
of colorectal cancer based on the generated known microbial abundance profiles.
We compared the performance of these three profiling tools in within-dataset, cross-dataset, and leaveone-dataset-out (LODO) analyses. The detailed steps for conducting these analyses can be found in Random forests CRC predictive models. We first used the same parameters in RF classification as in Thomas
et al. [40] with 1000 decision trees. The AUC scores for the three tools in within-dataset, cross-dataset
and LODO settings are shown in Figure 2.2. Among all the comparisons, AUCs based on Centrifuge are
higher than that based on MetaPhlAn2 for 25/42 cases, the same in 1/42 case and lower in 16/42 cases.
AUCs based on Centrifuge are higher than that based on Bracken for 31/42 cases, the same in 3/42 cases
and lower in 8/42 cases. Overall, Centrifuge has the best AUC scores among the three tools in the three
settings.
We particularly focus on the comparison between Centrifuge and MetaPhlAn2. For within-dataset,
Centrifuge yields higher AUCs than MetaPhlAn2 in 4 out of 6 cases. For cross-dataset, Centrifuge outperforms MetaPhlAn2 in 19 out of 30 cases, performs similarly in 1 case and underperforms in 10 cases. For
LODO, Centrifuge outperforms MetaPhlAn2 in 2 out of 6 cases. A two-sided paired Wilcoxon signed-rank
test [75] on these 42 comparisons gives a p-value of 0.040. The average AUCs across the 42 comparisons
using Centrifuge and MetaPhlAn2 are 0.735 and 0.710, respectively. These results demonstrate that the
known microbial abundances extracted by Centrifuge have better predictive power for colorectal cancer disease status than those extracted by MetaPhlAn2. Similar results hold for the comparison between
Centrifuge and Bracken. Therefore, we used Centrifuge in the follow-up analysis.
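The paired comparison above corresponds to a single call in R, assuming hypothetical vectors holding the 42 per-experiment AUCs of the two tools:

    # Two-sided paired Wilcoxon signed-rank test over the 42 comparisons
    wilcox.test(auc_centrifuge, auc_metaphlan2, paired = TRUE,
                alternative = "two.sided")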
2.3.2 Random forests classifier outperforms LASSO and SVM in terms of AUC scores
In order to show the differences in AUC scores among different machine learning methods, we trained two
additional machine learning classifiers: support vector machine (SVM) [34] and least absolute shrinkage
and selection operator (LASSO) [35], under the three experimental settings. The details about training
these two classifiers can be found in LASSO and SVM CRC predictive models. The results are shown in
Figure 2.3. Among all the three experimental settings, RF outperforms SVM and LASSO for 25 out of 42
cases. The average AUC scores for the 42 experiments by RF, LASSO and SVM are 0.735, 0.679 and 0.716,
respectively. The two-sided paired Wilcoxon signed-rank test [75] gives a p-value of 2.626e−6 for the 42 comparisons of RF and LASSO, indicating significantly better performance of RF over LASSO. The
corresponding p-value for the comparison of RF and SVM is 0.13, indicating that RF and SVM perform
similarly. These results show that the linear model in LASSO cannot fully capture the relationship between
microbial abundance and the probability of having CRC.
We further explored the trend of AUC scores of different experiments by three different classifiers. As
shown in Figure 2.4, both LASSO and SVM show very similar trends of AUC scores to RF among all 42
experiments (Spearman correlation of 0.79 and 0.75, respectively). This indicates the marked differences
in AUC scores by training and testing on different datasets are not classifier-specific. In addition, we
demonstrate that when dealing with classification tasks with high feature dimensions, RF can yield higher
prediction accuracy than SVM and LASSO. Therefore, in the remaining part of the paper, we will only
concentrate on the results based on RF.
2.3.3 Increasing the number of decision trees in a random forests model significantly
improves the prediction performance
Choice of the number of decision trees in a random forests model affects its prediction performance. When
using a small number of decision trees for a dataset with a huge number of features, the random forests
model usually has relatively poor prediction performance. On the other hand, increasing the number of
decision trees can potentially improve the prediction performance, but would need more computational
resources. Friedman et al. [76] cautioned about the potential risk of overfitting when increasing the number of
decision trees in a random forests model. On the other hand, Oshiro et al. [77] showed that the mean and
median AUCs tend to converge when increasing the number of decision trees in a random forests model.
In this section, we investigated whether increasing the number of decision trees could boost the prediction
performance.
We used the cross-dataset random forests setting for the experiment. We took the Hannigan dataset as the training data and used each of the other five datasets as the testing data in turn. We investigated the prediction performance for 500 trees and for 1000 to 10,000 trees in steps of 1000, while keeping all other parameters unchanged. Each experiment was repeated 30 times and the mean AUCs are reported in Figure 2.5. The figure shows that the mean AUCs increase for all datasets as the number of decision trees grows from 500 to 3000, and then stabilize after 3000. Considering the computational cost of a large number of decision trees, choosing a number of decision trees between 3000 and 6000 would be reasonable. In our following analyses we used 5000 decision trees in all random forests models.
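The sweep described above can be sketched in R as follows; the toy data and the use of the 'ranger' and 'pROC' packages are illustrative assumptions rather than the exact pipeline used in our experiments:

library(ranger)
library(pROC)
set.seed(42)
# Toy training/testing data standing in for the real abundance profiles
train <- data.frame(matrix(rnorm(100 * 200), 100, 200),
                    y = factor(rep(c("case", "control"), each = 50)))
test  <- data.frame(matrix(rnorm(100 * 200), 100, 200),
                    y = factor(rep(c("case", "control"), each = 50)))
for (nt in c(500, seq(1000, 10000, by = 1000))) {
  fit  <- ranger(y ~ ., data = train, num.trees = nt, probability = TRUE)
  prob <- predict(fit, data = test)$predictions[, "case"]
  a <- auc(test$y, prob, levels = c("control", "case"), direction = "<", quiet = TRUE)
  cat(nt, "trees: AUC =", round(as.numeric(a), 3), "\n")
}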
We compared the prediction performance using random forests model with 5000 decision trees trained
on abundance profiles extracted from Centrifuge in our analysis with that in Thomas et al. [40] where they
trained random forests with 1000 decision trees on abundance profiles extracted from MetaPhlAn2. Both
models used only known microbial abundances. Our model outperforms that of Thomas et al. [40] in 5 out
of 6 cases in within-dataset analysis (Figure 2.6 A), 20 out of 30 cases in cross-dataset analysis (Figure 2.6
B), and 2 out of 5 cases in LODO analysis (Figure 2.6 C). A paired Wilcoxon signed-rank test of the two models' AUC results gives a p-value of 0.004. We also compared the results of using 1000 trees with those of using 5000 trees based on the microbial abundances calculated using Centrifuge. Although the mean AUC of the 5000-tree model improves only slightly over that of the 1000-tree model (0.741 vs. 0.735), the Wilcoxon signed-rank test p-value of 0.0008 shows that the difference is statistically significant. This indicates that increasing the number of decision trees from
1000 to 5000 significantly improves the prediction performance.
2.3.4 Including novel microbial organisms slightly improves the performance of CRC
prediction
We next investigated whether including novel microbial organisms in the model could further improve
the prediction performance. We compared random forests models with and without abundance information of the novel microbial organisms in within-dataset and cross-dataset settings. For all the analyses,
novel microbial abundances were incorporated using the leave-one-sample-out (LOSO) model stacking method described in the section Leave-one-sample-out (LOSO) model stacking method.
In the setting of within-dataset (Figure 2.6 A), the AUCs are increased for 3/6 datasets when incorporating novel microbial organisms by the LOSO model stacking, while the AUCs of the other 3 datasets do
not change. For the cross-dataset setting (Figure 2.6 B), the AUCs of 15/30 are increased, 4/30 are the same,
and 11/30 are decreased when novel microbial organisms are incorporated using LOSO. For both settings,
LOSO AUCs are significantly higher than those reported in Thomas et al. [40], with a Wilcoxon signed-rank test p-value of 0.0015. Compared to the p-value of 0.002 in the previous subsection, incorporating
the information from novel microbial organisms decreases the p-value slightly in CRC prediction. The
average AUCs across the 36 comparisons integrating both known and novel microbial organisms, with
only the known microbial organisms and that reported in Thomas et al. [40] are 0.735, 0.731, and 0.695,
respectively.
We also measured the performance of the predictive models using the area under the precision-recall curve (AUPRC). The precision-recall curve can capture more information than the AUC when the input data used in the predictive model are highly imbalanced, especially when disease prevalence is
very low among the study population. In our study, as shown in Table 2.1, most of the CRC datasets we
used were balanced. However, we still investigated the AUPRC results to see if it was consistent with
the previous AUC results. Similar to the comparison of AUC results, we compared the AUPRC results
using only known microbial species with LOSO model stacking results in within-dataset and cross-dataset
settings, and the predictive probabilities were taken from random forests models with 5000 decision trees.
As shown in Figure 2.7, the LOSO model stacking AUPRC scores were higher than known species AUPRC
scores in 14/36 cases, the same in 8 cases and lower in 14 cases. We did not see significant differences in
their performance. The AUPRC results did not show as much improvement as the AUC results, probably because the datasets we used were relatively well balanced.
2.3.5 Removing cross-dataset batch effect by ComBat before training did not improve
the performance
In cross-dataset analysis, heterogeneity among datasets is a major challenge for machine learning prediction analyses. More specifically, the batch effects of different datasets may affect the predictions and
result in low AUC scores. We investigated if normalization across the datasets can improve prediction
performance in cross-dataset analyses. Therefore, we first removed the cross-dataset batch effects from
the abundance profiles by the ComBat [43] function in R’s ‘sva’ package [78]. We then trained RF models
to see if the AUC scores could be improved, and the results are shown in Figure 2.8. The AUC scores were improved in only 6/30 cases after removing batch effects using ComBat. The average AUC scores with and without ComBat normalization were 0.686 and 0.710, respectively, with a p-value of 0.003 by the two-sided paired Wilcoxon signed-rank test. These results demonstrate that ComBat is not helpful
in increasing the AUC score for our setting. In the future we may consider other methods to deal with
heterogeneity, or investigate other underlying reasons that may cause low AUC scores when dealing with
different datasets.
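For reference, a minimal R sketch of this normalization step is shown below; the matrix is a toy placeholder and ComBat's defaults (parametric empirical Bayes) are assumed:

library(sva)
set.seed(7)
# Toy features-x-samples matrix of log abundances pooled from two datasets
abund <- matrix(rnorm(200 * 60), nrow = 200)
batch <- rep(c("datasetA", "datasetB"), each = 30)  # dataset-of-origin labels
abund_adj <- ComBat(dat = abund, batch = batch)     # batch-adjusted abundances fed to RF training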
2.4 Discussion
In this study, we performed comprehensive CRC-microbiome predictive analyses on six publicly available
metagenomic datasets. We showed that the microbial relative abundance profiles extracted by Centrifuge give higher prediction AUCs than those from other metagenomic taxonomic profiling tools, based on
random forests classification results. We also showed that increasing the number of decision trees to 5000
from 1000 as used in Thomas et al. [40] and including novel microbial abundances can further slightly
improve the prediction performance compared with the results from Thomas et al. [40].
The random forests algorithm is a widely used machine learning algorithm for classification tasks in
metagenomic analyses. It is based on ensemble learning and bagging algorithms, and can reduce overfitting
compared to the decision tree algorithm. In our study, we showed that, instead of sticking to a relatively small number of decision trees, increasing the number of decision trees benefits prediction for large-scale input data. In addition, we trained two other machine learning classifiers, LASSO
and SVM, to compare the prediction performance in the three experimental settings. Random forests
outperforms LASSO and performs similarly as SVM in all three settings (Figure 2.3). LASSO and SVM
present similar trends as random forests in terms of AUC scores (Figure 2.4). This further indicates the
differences in AUC scores by training and testing on different datasets are not classifier-specific.
Our study also showed that different metagenomic taxonomic profiling tools demonstrated different
predictive abilities of the disease status. We compared the AUC results generated by Centrifuge [11]
to the results by MetaPhlAn2 [12] used in Thomas et al. [40], as well as Bracken [14], and found that
Centrifuge has better metagenomic taxonomic profiling ability than the other tools in terms of CRC disease status prediction. Centrifuge computes abundances by sequence alignment against all the reference
genomes including those with low abundance. This could be a possible reason for Centrifuge to outperform
MetaPhlAn2 in terms of disease status prediction.
It is not surprising that the cross-dataset analyses have much lower prediction accuracies than within-dataset analyses, considering that heterogeneity introduces variability in outcomes when dealing
with cross-study analysis. In particular, the Hannigan dataset always has a poor prediction performance
whether it is treated as a training or a testing dataset. Thomas et al. [40] showed that the Hannigan dataset
differs markedly from other datasets when performing principal coordinate analysis with the Bray–Curtis
distance using MetaPhlAn2 microbial species abundances. The heterogeneity between different datasets
results in non-biological experimental variations, which is commonly known as ‘batch effect’, and this can
lead to difficulty in cross-dataset predictions. Many methods have been developed for data normalization
and batch effect reduction. For example, the ComBat function in R’s ‘sva’ package is an algorithm to
remove batch effects using either parametric or non-parametric empirical Bayes frameworks [43]. In our
study, we also explored the potential of removing batch effects before training machine learning models.
However, the cross-dataset AUC results did not improve (Figure 2.8). How to normalize the metagenomic
data across different datasets to improve prediction accuracy is a topic for future studies.
We also explored different ways to combine known and novel microbial abundances in constructing
disease predictive models. A straightforward method is to pool the known and novel features together and
train a single random forests model using the combined abundance table. This method is often referred
to as feature stacking. We compared the LOSO model stacking with the feature stacking in terms of AUC
scores and the results are shown in supplementary figure S1. For within-dataset results (Figure S1 A), 5
out of 6 datasets have better mean AUC scores when using the LOSO model stacking method. For cross-dataset results (Figure S1 B), the advantage of the LOSO model stacking method is even more pronounced; it outperforms feature stacking in 24 out of 30 cases. For the leave-one-dataset-out setting (Figure S3 A), the AUC for
3 out of 6 datasets increased when incorporating the novel microbial abundances, while the AUC for one
dataset did not change. Consistent with the AUC results, the AUPRC results showed similar trend: LOSO
model stacking outperforms feature stacking in 6/6 cases in within-dataset settings (Figure S2 A), 22/30
cases in cross-dataset settings (Figure S2 B), and 3/6 cases in LODO settings (Figure S3 B). All these results
indicate that the LOSO model stacking method outperforms the feature stacking in terms of the prediction
performance. Also, the LOSO model stacking method is computationally more efficient than the feature
stacking method since it does not need to train a random forests model using the combined abundance
profiles with much larger feature dimensions. For future analyses, researchers can apply the LOSO model
stacking method to other machine learning prediction tasks.
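To make the contrast concrete, the following R sketch compares the two strategies on toy data; the variable names are ours, and a plain average of the two models' probabilities stands in for the LOSO weighting described in the methods:

library(ranger)
set.seed(3)
known <- matrix(rnorm(80 * 50), 80)   # toy known-species abundances
novel <- matrix(rnorm(80 * 30), 80)   # toy novel-organism abundances
y <- factor(rep(c("case", "control"), each = 40))
# Feature stacking: a single model on the concatenated feature table
fs <- ranger(x = cbind(known, novel), y = y, probability = TRUE)
# Model stacking: one model per feature set; combine predicted probabilities downstream
m1 <- ranger(x = known, y = y, probability = TRUE)
m2 <- ranger(x = novel, y = y, probability = TRUE)
stacked <- (m1$predictions[, "case"] + m2$predictions[, "case"]) / 2  # out-of-bag probabilities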
The NCBI RefSeq database is one of the most widely used databases for microbial research. However, it is vastly incomplete: usually only about half of the reads from the human gut can be mapped to the NCBI RefSeq
database, leaving a large fraction of the reads unused in most studies. To overcome this issue, several groups
constructed more complete human gut microbial genome databases using cross assembly and binning, for
example, the Unified Human Gastrointestinal Genome (UHGG) collection [17]. The fraction of reads that
can be mapped to UHGG from the human gut was markedly increased to over 80% [17]. However, it is
not known whether the microbial abundance profiles using UHGG can increase the prediction accuracy of
complex diseases such as CRC. In order to answer this question, we mapped the reads to UHGG, derived
the microbial abundance profiles for each sample, and predicted the disease status using RF. The results
are provided in the supplementary material. The mapping rates for the datasets were markedly increased from around 50% to close to 90%, as shown in Figure S4, consistent with previous studies [17]. Despite the
marked increase in mapping rates, the prediction accuracy for CRC measured by AUC based on UHGG did
not improve as shown in Figure S5. One potential explanation is that the microbial organisms associated
with CRC belong to or are strongly associated with the NCBI known microbial species. Therefore, the
inclusion of more microbial genomes in UHGG compared to the NCBI RefSeq database did not increase
the prediction accuracy of CRC. This result is consistent with the finding in the section Including novel microbial organisms slightly improves the performance of CRC prediction that including novel microbial organisms did not
markedly increase the prediction accuracy. The usefulness of UHGG for the prediction of other disease
status such as inflammatory diseases, diabetes, obesity, etc. compared to NCBI RefSeq needs to be further
studied.
Figure 2.2: Barplots of AUC scores for predicting disease status (case/control) using random forests with 1000 decision trees on known species abundance profiles characterized by Centrifuge (orange), MetaPhlAn2 (blue), and Bracken (grey), in three experimental designs: within-dataset (subfigure A), cross-dataset (subfigure B), and leave-one-dataset-out (LODO) (subfigure C). AUC scores of within-dataset and cross-dataset analyses are averages from 30 independent repetitions, while AUC scores of LODO analysis are averages from 10 independent repetitions. MetaPhlAn2 AUC scores are directly taken from the study by Thomas et al. [40].
Figure 2.3: Barplots of AUC scores for predicting CRC status (case/control) using known microbial species abundance profiles trained by RF (blue), LASSO (orange) and SVM (grey) in three experimental designs: within-dataset (subfigure A), cross-dataset (subfigure B), and leave-one-dataset-out (LODO) (subfigure C). AUC scores of within-dataset and cross-dataset analyses are averages from 30 independent repetitions, while AUC scores of LODO analysis are averages from 10 independent repetitions. Random forests classifiers use 1000 decision trees. The regularization coefficients in LASSO and SVM are the ones that maximize AUCs in 10-fold cross-validation. The microbial species abundance profiles were log10-transformed (a small constant, 1e−10, was added to 0 abundances) before model training.
Figure 2.4: Scatter plots of AUC scores by three different classifiers: RF, LASSO and SVM in three experimental settings. The x-axis shows the RF AUC scores, and the y-axis shows AUC scores obtained by LASSO (subfigure A) and SVM (subfigure B). The microbial species abundance profiles were log10-transformed (a small constant, 1e−10, was added to 0 abundances) before model training. The Spearman correlation for RF vs. LASSO (subfigure A) is 0.79, and for RF vs. SVM (subfigure B) is 0.75.
Figure 2.5: The AUC scores generally increase with a larger number of decision trees in RF in cross-dataset analysis. The Hannigan dataset is used as the training set. Each line refers to the testing set used in the analysis. All AUC scores are averages from 30 independent repetitions.
Figure 2.6: Barplots of random forests AUC scores using known microbial abundances (orange), LOSO model stacking (blue), and results from Thomas et al. [40] in three experimental designs: within-dataset (subfigure A), cross-dataset (subfigure B) and leave-one-dataset-out (LODO) (subfigure C). AUC scores of within-dataset and cross-dataset analyses are averages from 30 independent
repetitions, while AUC scores of LODO analysis are averages from 10 independent repetitions. Random
forests models in ‘known’ and ‘LOSO model stacking’ use 5000 decision trees while those in ‘Thomas et
al. paper’ use 1000 decision trees.
Figure 2.7: Barplots of random forests AUPRC scores using known microbial abundances (orange)
and LOSO model stacking (blue) in two experimental designs: within-dataset (subfigure A) and
cross-dataset (subfigure B). All AUPRC scores are averages from 30 independent repetitions. Random
forests models in ‘known’ and ‘LOSO model stacking’ use 5000 decision trees.
Figure 2.8: Barplots of cross-dataset AUC scores for predicting disease status (case/control) using
known species abundance profiles trained by random forests directly (blue) and removing batch
effect by ComBat before training random forests models (orange). AUC scores are averages from
30 independent repetitions. Random forests classifiers use 1000 decision trees.
Chapter 3
Batch Normalization Followed by Merging is Powerful for Phenotype
Prediction Integrating Multiple Heterogeneous Studies
3.1 Introduction
Genotype to phenotype mapping is an essential problem in the current genomic era. With the development
of advanced biotechnologies, many types of genomic data such as single nucleotide polymorphisms, gene
expression profiles, proteomics, metagenomics, etc. have been generated in many different studies. These
omics data provide essential resources to understand the relationships between omics data and phenotypes.
Despite these fundamental developments, due to the heterogeneity of data, it is challenging to integrate
the omics data to understand genotype to phenotype mapping. For a single type of data such as gene
expression or metagenomic data, many sources of heterogeneity can occur. For example, the samples
can come from different ethnic groups with varying underlying distributions of the features. Even if the
samples come from the same population, the genomic data can be generated from different laboratories
and/or derived from different experimental technologies resulting in different distributions of the data.
Another type of heterogeneity can be caused by different causal mechanisms of the same phenotype
in the populations under study [79]. The objective of this study is to investigate the best approaches to
integrate different studies of the same type of data under a variety of different heterogeneities. In this
work, we concentrate on gene expression profiles or microbial abundance in metagenomic studies.
Many machine learning algorithms, including linear regression, logistic regression, penalized regression, support vector machines (SVM), random forests (RF), neural networks (NN) and deep neural networks (DNN), have been used to predict phenotypes from omics data [40, 41, 80, 81]. Most previous studies validated the prediction methods using within-dataset cross-validation, usually with relatively high prediction accuracy. However, the prediction accuracy is markedly decreased when the learned algorithms are applied to independent datasets [82, 83]. Many sources of study heterogeneity, for example, different experimental platforms or procedures and differences in patient cohorts [79], compromise the prediction performance of machine learning models in cross-study settings. Thus, overcoming heterogeneity
in cross-study phenotype prediction is a critical step for developing machine learning algorithms with
reproducible prediction performance on independent datasets.
Many studies have been carried out to mitigate the heterogeneity in cross-study phenotype predictions.
Zhang et al. [84] focused on the batch effects of data when developing genomic classifiers. Patil et al. [47]
simulated genomic samples and perturbed the coefficients of linear relations between outcomes and predictors to evaluate model reproducibility with different degrees of heterogeneity. In this study, we address
three types of heterogeneity: different background distributions of genomic features in populations, batch
effects across different studies from the same population, and different disease models in various studies.
We aim to evaluate how different statistical methods can mitigate these three types of heterogeneity.
Merging all datasets into one and treating all samples as if they come from the same study is a commonly used method for cross-study predictions. With the increased sample size and diversity of the study population, the merging method has been shown to lead to better prediction performance than using only individual studies [40, 41, 45]. Another approach is to integrate the trained predictors from different machine learning models derived from various training datasets. Ensemble weighted learning is a commonly
used integration method to deal with the impact of heterogeneity on cross-study prediction performance.
Ensemble learning methods that integrate predictions from multiple machine learning models have been shown to boost the prediction performance over that of the component methods the ensemble contains [46, 47]. Besides ensemble weighted learning methods, aggregating the ranks derived from the samples' predicted probabilities, instead of the probabilities themselves, offers a promising alternative for integration. In some situations, the predicted probabilities for the samples in the test data may not be well calibrated, but the relative order could still provide useful information. In such situations, aggregating the ranks instead of the predicted probabilities might be more reasonable. To the best of our knowledge, no studies have investigated rank aggregation methods for phenotype prediction based on omics data.
ComBat [43] normalization is a commonly used method for removing batch effects between different
datasets. In our previous study in Chapter 2 [85], we showed that when dealing with heterogeneity in cross-study predictions, applying ComBat alone before training machine learning models did not improve the
prediction performance. Zhang et al. [84] showed that ensemble weighted learning methods outperform
batch correction by ComBat at high levels of batch differences. Nevertheless, in this study, we aim to
explore the potential of combining the normalization of ComBat together with merging and integration
methods (ensemble weighted learning and rank aggregation) in the presence of three different types of
heterogeneity mentioned above. We provide both simulations and real data applications on metagenomic
and gene expression data to show the comparisons of performance from different statistical methods when
dealing with cross-study heterogeneity.
Our study offers innovations through the development of a comprehensive workflow that effectively
addresses heterogeneity in various types of omics data, an issue that has persistently compromised the
performance of machine learning models in cross-study phenotype predictions. By investigating the optimal approaches to integrate different studies and conducting realistic applications on colorectal cancer
and tuberculosis studies, we have created a robust framework that elevates the reproducibility of machine
learning classifiers. Furthermore, the innovative utilization of ComBat normalization with integration
methods markedly enhances prediction performance by removing heterogeneity effects. Our study also ventures into mostly uncharted territory by exploring the potential of boosting phenotype prediction performance through rank aggregation methods when handling heterogeneous omics data, a significant
innovation in its own right.
3.2 Materials and Methods
3.2.1 Outline of workflow for integrating multiple heterogeneous metagenomic datasets
To investigate the prediction performance of different merging and integration methods when applied
to multiple heterogeneous datasets, we developed a comprehensive workflow with three main steps to
conduct the experiments.
The first step in our workflow involves simulating diverse metagenomic datasets under three distinct
scenarios, as illustrated in Figure 3.1A. In the first scenario, we examine how divergent training and test
populations with varied genomic feature distribution can impact the predictive performance of machine
learning models.
Population differences, such as variances in ethnicity, diet, and other factors, can lead to variations
in the genomic features within a population. These features encompass single nucleotide polymorphisms
(SNPs), expression levels, and microbial abundance in the microbiome. Consequently, when training machine learning classifiers on one population and applying them to a different population, it is crucial to
account for the heterogeneity in the background distributions. Neglecting this heterogeneity can adversely
affect the prediction performance.
To assess how heterogeneity arising from population differences can influence the performance of machine learning classifiers, and to identify the optimal approaches for integrating prediction methods from
various heterogeneous studies, we simulated three distinct populations. Each population was characterized by different background genomic distributions, and we manipulated these differences. For a detailed
description of the implementation process for simulating the two training datasets and one test dataset,
please refer to Scenario 1: Different background distributions of genomic features in training populations.
The second scenario of heterogeneity pertains to batch effects. In simple terms, batch effects represent
non-biological disparities that emerge across separate batches of data. These discrepancies are usually a
byproduct of technical variances in experimental conditions or different labs. Some contend that batch
effects can compromise the replicability of genomic studies [86].
It is commonplace to correct these batch effects in the pre-processing phase when dealing with genomic
data. Recently, several methods for batch effect correction have been proposed, including ComBat [43],
edgeR [87], and DESeq2 [88]. Additionally, some studies have suggested the use of ensemble learning
techniques to potentially minimize batch effects [47, 84].
In our study, we sought to evaluate these methods’ efficacy in reducing batch effects on the predictive
performance of binary classifiers. This was achieved through simulation studies. In this particular setup,
the training and test datasets originate from the same genomic feature distribution. However, upon simulating the training and test datasets, we then introduced batch effects into the training datasets, creating
two distinct batches. These two batches were then used as the training datasets for the subsequent experiments. We provide a detailed description of the method in Scenario 2: Different batch effects in studies
with the same background distribution of genomic features in a population.
In the aforementioned two scenarios, we examined how both the variability in the distribution of
genomic features among populations and batch effects influence a classifier’s performance. Furthermore,
we operated under the assumption that the disease models are consistent across either the same or different
populations. However, a range of studies have suggested that the microbes linked with certain diseases can
depend on the population in question. For instance, research indicates that colorectal cancer development
can vary among populations with differing rates of diabetes [57], smoking [59], and obesity [60].
Consequently, we have further assessed the effectiveness of merging and various integration methods
when the underlying disease models differ. More specifically, the degree of overlap in disease-associated
OTUs between training and test datasets are varied. This constitutes the third scenario in our research
workflow. We adjusted the number of disease-related microbes that overlap in the training and test disease
models, and then simulated different datasets in line with these adjustments. The methodology for this
process is outlined in Scenario 3: Different phenotype-associated feature models in different studies.
Following the creation of the training and test datasets from the three scenarios mentioned above,
depicted in Figure 3.1A, our goal was to assess the efficacy of the merging and integration methods in two distinct settings, as demonstrated in Figure 3.1B. These settings are: the naive setting, which involves the
direct utilization of the simulated training and test datasets from the initial step into the third classification
phase; and the ComBat normalization setting, which involves normalizing the two training datasets prior
to the third classification phase. The process by which we conducted ComBat normalization on the training
datasets is discussed in detail in Naive and ComBat Normalization Stage.
Subsequent to the previous steps, the two simulated training datasets and one test dataset are used as input for the final classification step of this workflow. At this stage, we independently train a machine
learning classifier on each training dataset and subsequently apply each to the test dataset. This generates
two distinct lists of prediction probabilities. Then, we apply various integration methods to these two
probability lists to yield the final results.
The two lists of prediction probabilities from the trained machine learning classifiers are directly employed in ensemble weighted learning methods. However, when we apply rank aggregation methods, the
prediction probabilities are converted into ranks. A comprehensive explanation of the ensemble weighted
learning and rank aggregation methods we used can be found in Classification Stage.
In addition to these integration methods, we also adopted a merging strategy. As illustrated in Figure 3.1C, we combined the two training datasets and trained a machine learning classifier on this merged
dataset. This trained classifier then made predictions directly on the test dataset to yield final results. The
performances of these merging and integration methods are compared in the following sections.
In the subsequent three sections, we delve into the specifics of the three stages within our workflow:
the Simulation Stage, the Naive and ComBat Normalization Stage, and the Classification Stage. Each of
these stages plays a crucial role in our analyses, and understanding them is fundamental to comprehending
the entire process.
3.2.2 Simulation Stage
3.2.2.1 Scenario 1: Divergent training and test populations with varied genomic feature distribution
In the first scenario, we contemplated a situation where the two training populations are distinct from each
other and both are different from the test population. We adjusted our simulations to reflect the extent
of these differences. To emulate this scenario, we first established the baseline abundance levels for the
operational taxonomic unit (OTU) across different populations.
Three probability vectors were created, each representing the underlying OTU abundance levels in
three distinct populations. These were adapted from three real colorectal cancer (CRC) metagenomic
datasets. We curated a total of six publicly accessible, geographically diverse CRC metagenomic datasets
with download links provided in their respective original papers [25, 40, 61, 62, 63, 64]. We excluded samples from adenoma patients, only using samples from CRC patients and healthy controls. The case-control
numbers and countries of origin for each dataset are displayed in Table 3.1.
We created a PCoA plot to illustrate the population differences among these six CRC datasets, as shown
in Figure 3.2. Based on this plot, we selected the three least overlapping populations—those represented
by the Hannigan, Feng, and Yu datasets—as the foundation for generating the background OTU relative
abundance vectors. The Hannigan and Yu datasets were utilized to generate training data, while the Feng
dataset was used for test data generation.
The two training datasets were pre-processed to preserve the top 1000 OTUs with the largest variance
in each dataset. We then amalgamated the OTUs from the two datasets to create a comprehensive set of
OTUs for the subsequent analysis. This simulation study incorporated a total of 1,267 OTUs. For the Feng dataset, which forms the background distribution for simulating the test dataset, we kept the counts of these 1,267 OTUs and removed the remaining OTUs. Subsequently, the count data were converted
into a relative abundance vector, calculated by dividing each OTU’s total count from all samples by the
total counts of all OTUs.
Table 3.1: Overview of six colorectal cancer-related metagenomic datasets.
Dataset Country No. of cases No. of controls Reference
Zeller France 91 93 [25]
Thomas Italy 61 52 [40]
Yu China 74 54 [61]
Hannigan USA/Canada 27 28 [62]
Feng Austria 46 63 [63]
Vogtmann USA 52 52 [64]
Let us denote v1, v2, and v3 as the three background relative abundance vectors derived from the microbial abundance profiles of the healthy control samples in the pre-processed Hannigan, Yu, and Feng
datasets, respectively. These three vectors share the same dimensionality of 1,267, with each dimension representing the relative abundance level of one OTU. In order to scrutinize the influence of population
differences on cross-study prediction, we devised a pseudo-population with a relative abundance vector
defined as follows:
v1(α) = αv1 + (1 − α)v2 (3.1)
Note that v1(α) − v2 = α(v1 − v2). Consequently, the difference between the two populations simulated
based on v1(α) and v2 escalates with α. When α = 0, both simulated populations share the same underlying distribution, hence eliminating population differences between the two training sets. Conversely, at
α = 1, the two simulated populations exhibit the maximum difference. We employed diverse α values,
ranging from 0 to 1 in increments of 0.2, to reflect varying degrees of training population differences in subsequent analyses. The relative abundance profiles v1(α) and v2 served as background relative abundance
vectors for training populations, while v3 was applied to the test population.
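A minimal R sketch of equation (3.1) follows; the short toy vectors stand in for the real 1,267-dimensional background profiles:

v1 <- c(0.5, 0.3, 0.2)   # toy background vector for training population 1
v2 <- c(0.2, 0.4, 0.4)   # toy background vector for training population 2
for (alpha in seq(0, 1, by = 0.2)) {
  v1_alpha <- alpha * v1 + (1 - alpha) * v2   # equation (3.1)
  cat("alpha =", alpha, ":", round(v1_alpha, 3), "\n")
}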
From these 1,267 OTUs, we randomly selected 10 OTUs, assuming these were associated with a specific
disease of interest. Given that disease-associated OTUs can either be enriched or depleted, we presumed
the first 5 OTUs to be enriched and the remaining 5 to be depleted. These 10 OTUs were consistent across
all subsequent experiments. To quantify the disease effect on these associated OTUs, we defined a disease
effect factor, ed, and hypothesized that the relative abundance of these OTUs could be represented as:
{relative abundance}_enriched = {relative abundance} × ed    (3.2)
{relative abundance}_depleted = {relative abundance} / ed    (3.3)
First, we adjusted v1(α), v2, and v3 in this manner and normalized them into probability vectors, denoted v1(α)′, v2′, and v3′, respectively. These were used to simulate case microbiome profiles. We set ed to 1.05, 1.075, and 1.1 in our simulation studies to assess the impact of the disease effect: the larger the ed value, the more pronounced the difference between case and control samples. These pseudo-population relative abundance vectors are then used to simulate individual relative abundance profiles.
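The adjustment in equations (3.2) and (3.3) can be sketched in R as follows; the toy background vector, the index sets, and the renormalization step reflect our reading of the procedure above:

set.seed(5)
v <- runif(1267); v <- v / sum(v)            # toy background probability vector
assoc <- sample(1267, 10)                    # the 10 disease-associated OTUs
enriched <- assoc[1:5]; depleted <- assoc[6:10]
ed <- 1.075                                  # disease effect factor
v_case <- v
v_case[enriched] <- v_case[enriched] * ed    # equation (3.2)
v_case[depleted] <- v_case[depleted] / ed    # equation (3.3)
v_case <- v_case / sum(v_case)               # renormalize into the case probability vector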
Next, we aimed to determine the relative abundance profile for each individual in a population. Given a pseudo-population relative abundance vector v (v1(α), v2, and v3 for control samples; v1(α)′, v2′, and v3′ for case samples from the above procedure), we assume the relative abundance of each individual follows a Dirichlet distribution with parameters a_k = C × v_k, k = 1, 2, ..., K, where K = 1,267 is the number of OTUs of interest. Under this assumption, the relative abundance profile (X_1, X_2, ..., X_K) of an individual becomes a random vector with expectation E(X_k) = v_k and variance Var(X_k) = v_k(1 − v_k)/(C + 1). When C is very large, the variance of X_k is close to 0 and X_k is very close to v_k. To introduce some variability, we set C to 10^6 for all results in the main text. In the Supplementary Materials, we present results with C set to 10^5 to demonstrate the influence of different values of C. Using the above Dirichlet distribution with parameters a_k, we then sample X as the relative abundance vector for each individual.
The OTU read counts for each individual are simulated by generating one million reads as the library size, using the multinomial distribution MN(library size, X), where X is the individual relative abundance vector. Fifty controls and fifty cases were generated, and the resultant datasets were denoted as training1, training2, and test, respectively. Subsequently, count data were transformed into log-transformed relative abundance data. This was achieved by first dividing the counts by the sample total, followed by applying a zero-replacement strategy proposed by Martín-Fernández et al. [89]. In this strategy, we identified the minimum non-zero abundance in the dataset and replaced all zero abundances with 0.65 times this minimum non-zero abundance. Finally, the non-zero relative abundance data were log-transformed and used in all subsequent analyses.
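The individual-level simulation described above can be sketched in R as follows; the 'gtools' package supplies rdirichlet(), and the toy background vector is a placeholder:

library(gtools)                               # for rdirichlet()
set.seed(9)
v <- runif(1267); v <- v / sum(v)             # toy pseudo-population vector
C <- 1e6
X <- as.numeric(rdirichlet(1, C * v))         # one individual's relative abundances
counts <- rmultinom(1, size = 1e6, prob = X)  # one million reads
rel <- counts / sum(counts)
min_nz <- min(rel[rel > 0])
rel[rel == 0] <- 0.65 * min_nz                # zero replacement (Martín-Fernández et al.)
log_rel <- log(rel)                           # log relative abundances used downstream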
3.2.2.2 Scenario 2: Different batch effects on training data with consistent underlying population
genomic feature distribution
In this simulation, we kept the underlying populations identical for both the training and test data. The
background OTU abundance profiles were selected from Yu et al. [61] as displayed in Table 3.1. This ensures
no population variations as described in Scenario 1 between the training and test samples. We designated
10 OTUs as disease-associated with a disease effect factor ed fixed at 1.025. For each of the two training
datasets and the test dataset, we generated 50 case and 50 control samples using a procedure similar to that in
Scenario 1.
To simulate batch effects on the training data, we followed procedures similar to those in Zhang et
al.[84]. We treated the two training datasets as two distinct batches and simulated different batch effects within them. The model for generating batch effects was based on the linear model proposed in
the ComBat batch correction method [43], which postulates an additive effect on the mean of the normalized OTU abundances and a multiplicative effect on the variance. We chose three severity levels for the effect on the mean (sev_mean ∈ {0, 3, 5}) and three levels for the effect on the variance (sev_var ∈ {1, 2, 4}). Consequently, the model shifted the means of the two batches to mean − sev_mean and mean + sev_mean, and scaled the variances to var / sev_var and var × sev_var. We only applied the
batch effects to the training data, leaving the test dataset unaltered. The two training batches and the test
dataset were named as batch1, batch2, and test, respectively. Here, batch1 and batch2 serve the same
roles as training1 and training2 from the prior scenario.
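A rough R sketch of this batch-effect model is given below; the per-feature centering and rescaling is our assumption about how the additive and multiplicative effects are applied:

set.seed(11)
b1 <- matrix(rnorm(100 * 50), 100)   # batch 1: features x samples
b2 <- matrix(rnorm(100 * 50), 100)   # batch 2
sev_mean <- 3; sev_var <- 2
m1 <- rowMeans(b1); m2 <- rowMeans(b2)
b1 <- (b1 - m1) / sqrt(sev_var) + m1 - sev_mean   # mean - sev_mean, variance var / sev_var
b2 <- (b2 - m2) * sqrt(sev_var) + m2 + sev_mean   # mean + sev_mean, variance var * sev_var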
3.2.2.3 Scenario 3: Varying degrees of overlap in disease-associated OTUs between training and
test datasets
During the simulation of different disease models, unlike the previous two scenarios where we fixed 10
disease-associated OTUs in both the training and test datasets, we introduced a variable parameter termed
’overlapping OTUs’ in the test dataset. While the 10 disease-associated OTUs remained constant in the
training data, the number of such OTUs in the test data varied over 2, 4, 6, 8, and 10, drawn from the pool of
10 disease-associated OTUs in the training data. As the number of overlapping disease-associated OTUs
increases, the disease models in the training and test data converge. When the number of overlapping
OTUs reaches 10 in the test data, the disease models in the training and test data are identical.
After determining the overlapping OTUs between the training and test disease models, we proceeded
as per the previous two scenarios to simulate two training datasets, using the background OTU distribution
from the CRC dataset by Yu et al.[61], and one test dataset, using the background OTU distribution from the
dataset by Feng et al. [63]. The details about the two datasets are presented in Table 3.1. The three datasets were simulated with the following parameters: 100 samples comprising 50 cases and 50 controls, one
million reads, and a disease effect factor ed of 1.075. The two training datasets and the test dataset were
named training1, training2, and test, respectively. No batch effects were introduced in this simulation.
3.2.3 Naive and ComBat Normalization Stage
As depicted in Figure 3.1B, two distinct settings were established after the simulation of datasets (Figure 3.1A):
a naive setup and a ComBat normalization process. In the naive setup, machine learning classifiers were
trained directly on the unprocessed training datasets. These classifiers were then used to make predictions
on the test dataset, and the resulting predictions were synthesized using various integration techniques.
In contrast, the ComBat normalization setting involved using the test dataset as a reference to normalize the two training datasets independently via the ComBat method. This process resulted in two new
datasets, "Training1_ComBat" and "Training2_ComBat", while leaving the test dataset unaltered. Additionally, a combined dataset, "Merged_ComBat", was created by pooling together "Training1_ComBat" and
"Training2_ComBat", serving as a counterpart for comparison with the "Merged" dataset. Machine learning classifiers were also trained on these normalized datasets and make predictions on the test dataset
respectively.
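A minimal sketch of this reference-based normalization is shown below; it assumes a recent 'sva' release in which ComBat exposes the ref.batch argument, and all matrices are toy placeholders:

library(sva)
set.seed(13)
test_m <- matrix(rnorm(200 * 40), 200)       # features x samples, test data
train1 <- matrix(rnorm(200 * 40, 1), 200)    # one training dataset with a shifted mean
combined <- cbind(test_m, train1)
batch <- rep(c("test", "train1"), each = 40)
adj <- ComBat(dat = combined, batch = batch, ref.batch = "test")
train1_combat <- adj[, batch == "train1"]    # "Training1_ComBat"; reference columns stay unchanged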
In Scenario 2, we adjusted the terminology slightly due to the introduction of two batches, each subject
to different batch effects. Here, "Training1", "Training2", "Training1_ComBat" and "Training2_ComBat"
were renamed as "Batch1", "Batch2", "Batch1_ComBat" and "Batch2_ComBat", respectively. Additionally,
we developed three distinct merging methods: "NoBatchEffect", where machine learning classifiers were
trained on the original merged dataset without any simulated batch effect; "Merged", where classifiers
were trained on a dataset that was an amalgamation of "Batch1" and "Batch2" where the batch effects have
been simulated on the two batches; and "Merged_ComBat", which involved training on a combined dataset of "Batch1_ComBat" and "Batch2_ComBat".
The machine learning classifiers were trained independently on these different datasets, and subsequently used to generate predictions on the test data. Following this step, these predictions were amalgamated by using ensemble weighted learning or rank aggregation methods, resulting in the final predictor.
3.2.4 Classification Stage
3.2.4.1 Machine learning classifiers
Our research incorporated four machine learning classifiers: random forests (RF), logistic regression with
L1 regularization (LASSO), support vector machine (SVM), and extreme gradient boosting (XGBoost). In
the results section, we primarily present findings from the RF classifiers in the main texts, while results
from the remaining three classifiers are included in the Supplementary Materials. In conjunction with the
two normalization settings, each of these classifiers was trained on the respective training datasets. All four classifiers were implemented using the ‘caret’ package in R [70].
The RF classifiers incorporated 1000 decision trees, and the ‘mtry’ parameter was fine-tuned using 10-
fold cross-validation. We opted to use the ‘ranger’ method for the train function to reduce computational
time compared to the ‘rf’ method. For the LASSO classifier, the regularization parameter was selected from
a range of 0 to 1 in increments of 0.001. This parameter selection was made to maximize the area under the receiver operating characteristic curve (AUC) as determined by 10-fold cross-validation. Regarding the SVM,
we employed a polynomial kernel with a default cost parameter of 1, a default degree of 3, and a default
gamma of 0.1. Lastly, for the XGBoost classifier, we set the maximum number of trees to be created at 1000
and tested the maximum depth of the tree over the values 2, 4, 6, 8, and 10.
After training these classifiers, they were applied to the test dataset to produce prediction probabilities.
For certain integration methods that necessitate a validation dataset, we divided the test dataset randomly
into two halves: 50% served as validation data (val), and the other 50% remained as test data (test). We
ensured an even split between case and control samples to prevent any potential bias.
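For illustration, a 'caret' training call matching the stated RF configuration (ranger backend, 1000 trees, 10-fold cross-validation over the tuning parameters) might look as follows; the toy data frame is a placeholder:

library(caret)
set.seed(15)
d <- data.frame(matrix(rnorm(100 * 50), 100, 50),
                y = factor(rep(c("case", "control"), each = 50)))
ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)
fit <- train(y ~ ., data = d, method = "ranger", num.trees = 1000,
             metric = "ROC", trControl = ctrl, tuneLength = 3)
probs <- predict(fit, newdata = d, type = "prob")[, "case"]  # prediction probabilities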
3.2.4.2 Ensemble weighted learning methods
Patil et al. [47] assessed the efficacy of cross-study learning by utilizing five alternative weighting approaches: a straightforward average of predictions from each single-study learner (referred to as "Avg"), an
average weighted by the study’s sample size ("n-Avg"), an average weighted by cross-study performance
("CS-Avg"), stacked regression ("Reg-s"), and the average of study-specific regression weights ("Reg-a").
Importantly, the latter three ensemble methodologies prioritized reproducibility.
In our study, we compared these five ensemble weighted learning methods. We also introduced two
additional methods combined with machine learning classifiers, as suggested in Chapter 2. The specifics
of these ensemble weighted learning methodologies are detailed below:
1. Avg: This approach simply takes the mean of the prediction probabilities for the test data, which
were produced by classifiers trained on "Training1" and "Training2". The prediction probability for sample i in the test data is calculated as:
p_i = (1/2) × (p_i^training1 + p_i^training2)    (3.4)
2. n_Avg: This method computes a sample-size-weighted average of the test data prediction probabilities, with the weights given by the sample sizes of "Training1" and "Training2". As the two training datasets in the simulations share the same sample sizes, "n_Avg" yields the same results as "Avg". However, in real data applications where the dataset sample sizes can vary, these two methods may deliver differing outcomes.
p_i = (n_1 × p_i^training1 + n_2 × p_i^training2) / (n_1 + n_2)    (3.5)
3. CS-Avg: This method involves taking an average weighted by cross-study performance. First, a
classifier is trained on "Training1" and then used on the samples from "Training2" to generate prediction
probabilities. From these probabilities, we calculate the cross-entropy loss, defined as follows:
cel = −(1/N) Σ_{i=1}^N [y_i log(p_i) + (1 − y_i) log(1 − p_i)],    (3.6)
where N represents the sample size of the test dataset, y_i is the actual case/control status of sample i (y_i equals 0 for a control sample, or 1 for a case sample), and p_i denotes the prediction probability of sample i being a case sample, as determined by the machine learning classifier.
Subsequently, we computed the cross-entropy loss of the classifier trained on "Training1" and predicting on "Training2", denoted cel_1. Similarly, the classifier trained on "Training2" predicting on "Training1" yielded a cross-entropy loss termed cel_2. We then calculated the weights as weight_1 = |cel_1 − max(cel_1, cel_2)| and weight_2 = |cel_2 − max(cel_1, cel_2)|, normalized these two weights to sum to 1, and named them weight_1^norm and weight_2^norm. The prediction probability for sample i in the test data is then calculated as:
p_i = weight_1^norm × p_i^training1 + weight_2^norm × p_i^training2    (3.7)
Notably, in the context of two training datasets, the "CS-Avg" method always assigns a zero weight to
the classifier exhibiting the worst cross-study performance (i.e., higher cross-entropy loss). As a result,
only one of the prediction probabilities from the classifiers trained on "Training1" and "Training2" is utilized in this method under the simulation conditions. In a broader scenario where multiple training datasets are present, the cross-entropy loss for each training dataset predicting on all other datasets is calculated. For instance, cel_ij is calculated using prediction probabilities for samples in dataset j from the classifier trained on dataset i. The total cross-entropy loss for dataset i is then calculated as cel_i = Σ_{j≠i} cel_ij. Similar to the two-training-dataset scenario, weights for each dataset are calculated as weight_i = |cel_i − max(cel_1, cel_2, ..., cel_i, ..., cel_n)|, where n represents the total number of training datasets used. The final
weights are normalized to sum to 1. This method assigns zero weight to the model that exhibits the worst
average performance across all other training datasets.
4. Reg-a: This method involves the averages of study-specific regression weights, which are computed
using non-negative least squares. The machine learning classifiers were trained separately on "Training1"
and "Training2". Predictions were made on "Training1" and "Training2", yielding four lists of probabilities.
When testing on "Training1", we fitted non-negative least squares to the two associated probability vectors,
with the actual case/control status in "Training1" serving as the response, and we obtained two coefficients.
The same procedure was repeated for testing on "Training2". This resulted in a 2 × 2 coefficient matrix,
with each row representing test data and each column representing training data. Next, we multiplied the
coefficients in each row by the sample size of the test data, and the weights were finally computed as the
column average of the adjusted coefficients.
5. Reg-s: This method involves stacked regression weights, which are computed using non-negative
least squares. The two prediction probability vectors obtained from the "Reg-a" method were stacked into
a single vector for each test dataset. We then fitted non-negative least squares to the stacked vectors, with
the actual case/control status serving as the response. The coefficients were then used as weights.
6. val-auc: Machine learning classifiers were trained independently on "Training1" and "Training2".
We then applied these trained classifiers on validation data for prediction. The AUC scores were calculated
by comparing the prediction probabilities to the actual disease statuses of the samples in the validation data; we name them AUC_1^val and AUC_2^val. We then use the two AUCs as weights to combine the test data prediction probabilities from the two trained classifiers. The prediction probability for sample i in the test data is calculated as:
p_i = (AUC_1^val × p_i^training1 + AUC_2^val × p_i^training2) / (AUC_1^val + AUC_2^val)    (3.8)
7. LOSO-auc: This method was proposed in our previous study in Chapter 2 [85] and involves leave-one-sample-out (LOSO) AUC calculations. The high-level idea is to weight the prediction probabilities from the classifiers trained on "Training1" and "Training2" by their respective prediction accuracies, without using any information from the sample i being predicted. For each sample i in the test data, we obtained two prediction probabilities, p_i^training1 and p_i^training2, by applying the two classifiers trained on the two training datasets to sample i. We then excluded sample i from the test data and computed AUC scores AUC_i^training1 and AUC_i^training2 by comparing the prediction probabilities of the remaining test samples to their ground-truth disease statuses. The LOSO method then assigns max(AUC_i^training1 − 0.5, 0) and max(AUC_i^training2 − 0.5, 0) as weights to combine p_i^training1 and p_i^training2 into a final prediction probability for sample i (a code sketch of this weighting appears at the end of this subsection):
p_i = [max(AUC_i^training1 − 0.5, 0) × p_i^training1 + max(AUC_i^training2 − 0.5, 0) × p_i^training2] / [max(AUC_i^training1 − 0.5, 0) + max(AUC_i^training2 − 0.5, 0)]    (3.9)
In the ComBat normalization scenarios, these seven integration techniques were similarly applied to
prediction probabilities acquired from classifiers trained on the "Training1_ComBat", "Training2_ComBat"
and "Merged_ComBat" datasets. The performance of each method was assessed by calculating the AUC
scores, which were derived by comparing the final prediction probabilities with the actual case/control
status.
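To make the weighting schemes above concrete, the following R sketch computes the CS-Avg weights (equations 3.6 and 3.7) and the LOSO-auc combination (equation 3.9) on toy data; all variable names are placeholders, and the 'pROC' package supplies the AUC computation:

library(pROC)
set.seed(17)
cross_entropy <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))
y1 <- rbinom(50, 1, 0.5); y2 <- rbinom(50, 1, 0.5)   # labels of the two training sets
y_test <- rep(c(1, 0), 25)                           # test labels
p12 <- runif(50, 0.01, 0.99)   # classifier trained on Training1, predicting Training2
p21 <- runif(50, 0.01, 0.99)   # classifier trained on Training2, predicting Training1
p1 <- runif(50); p2 <- runif(50)   # the two classifiers' probabilities on the test data
# CS-Avg: weights from cross-study cross-entropy losses
cel1 <- cross_entropy(y2, p12); cel2 <- cross_entropy(y1, p21)
w <- abs(c(cel1, cel2) - max(cel1, cel2)); w <- w / sum(w)
p_csavg <- w[1] * p1 + w[2] * p2
# LOSO-auc: per-sample weights from leave-one-sample-out AUCs
p_loso <- numeric(50)
for (i in 1:50) {
  a1 <- max(auc(y_test[-i], p1[-i], levels = c(0, 1), direction = "<", quiet = TRUE) - 0.5, 0)
  a2 <- max(auc(y_test[-i], p2[-i], levels = c(0, 1), direction = "<", quiet = TRUE) - 0.5, 0)
  p_loso[i] <- (a1 * p1[i] + a2 * p2[i]) / (a1 + a2)  # undefined if both weights are zero
}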
3.2.4.3 Rank aggregation methods
Rank aggregation methods have not traditionally been applied for phenotype prediction. In this study,
we explored the application of rank aggregation for integrating different predictors, and evaluated their
performance. For each independent prediction method, we initially sorted the samples in the test data
based on their prediction probabilities of being classified as cases, with the order being in descending
probability. This led to the generation of a ranked list, with the lowest rank assigned to items with ties.
This process was repeated to create two ranked lists based on the predictions generated by two training
classifiers. Following this, we employed various rank aggregation methods to create a unified, aggregated
rank list for the samples.
We explored the use of five rank aggregation methods, comparing their performance both within this
group, and against the previously discussed ensemble learning methods.
1. mean: This approach takes the average of the two ranked lists.
r_i = (1/2) × (r_i^training1 + r_i^training2)    (3.10)
2. geometric mean: This method calculates the geometric mean of the two ranked lists.
r_i = sqrt(r_i^training1 × r_i^training2)    (3.11)
3. Stuart rank aggregation: The first step in this method is to normalize the ranks into rank ratios.
The order statistics proposed by Stuart et al. [90] are then used to create an aggregated rank list. For
computation, we used the ‘RobustRankAggreg’ package in R, specifying ‘stuart’ as the method [91].
4. Robust rank aggregation (RRA): Proposed by Kolde et al. [91], this method is also based on order
statistics but has improved computational efficiency and statistical stability. For each item in the rank list,
the algorithm looks at its position and compares this with a baseline case in which all preference lists are
randomly shuffled. Each item is then assigned a P-value, indicating how much better its position in the
ranked lists is than would be expected by chance. These P-values are then used to re-rank the list. We
used the ‘RobustRankAggreg’ package in R to compute the results [91].
5. Bayesian analysis of rank data with covariates (BARC): This method, developed by Li et al.
[92], is a Bayesian-based rank aggregation approach that incorporates information from covariates. Although covariates are not of concern in our study, we used their rank aggregation method without involving any covariates. The method is available from https://github.com/li-xinran/BayesRankAnalysis.
After obtaining the aggregated rank lists using the aforementioned methods, we calculated the AUC
scores to assess the performance of these rank aggregation methods. Similar to the ensemble weighted
learning methods, these rank aggregation methods were applied separately to both the naive and ComBat
normalization settings. The final AUC score is the fraction of pairs where the case sample is ranked higher
than the control sample. A perfect ranking, where all case samples are ranked above all control samples,
would give an AUC of 1. A completely random ranking would on average give an AUC of 0.5, as case and
control samples would be equally likely to be ranked higher.
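As an illustration, the sketch below builds two rankings from hypothetical prediction probabilities and aggregates them with the 'RobustRankAggreg' package; the sample identifiers and probabilities are toy placeholders:

library(RobustRankAggreg)
set.seed(23)
ids <- paste0("s", 1:50)
p1 <- runif(50); p2 <- runif(50)   # two classifiers' predicted case probabilities
# Order samples by decreasing probability of being a case
glist <- list(ids[order(p1, decreasing = TRUE)],
              ids[order(p2, decreasing = TRUE)])
agg <- aggregateRanks(glist = glist, N = 50, method = "RRA")  # or method = "stuart"
head(agg)   # smaller scores indicate samples consistently ranked as case-like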
3.2.5 Applications on real CRC metagenomic datasets
We applied the developed methods for merging and integration to six real-world metagenomic datasets
from CRC studies. We selectively used samples from patients who had been diagnosed with CRC, excluding those with adenoma. Thus, the analysis was carried out on samples from CRC patients and healthy
controls. Each dataset varied in terms of the number of case and control samples, country of origin, and
associated references, all of which are detailed in Table 3.1.
The microbial count profiles for the six CRC datasets were generated using MicroPro [32]. Following
generation, the count data was log-transformed into relative abundance data, using the same procedures
as those described in the data pre-processing section of the simulation studies (Scenario 1: Different background distributions of genomic features in training populations). We further pre-processed each dataset
by retaining the top 1000 Operational Taxonomic Units (OTUs) with the largest variance.
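A sketch of this variance filtering step, assuming otu is a samples-by-OTUs matrix of log relative abundances:

# Keep the k OTUs with the largest variance across samples.
keep_top_variance <- function(otu, k = 1000) {
  v <- apply(otu, 2, var)
  keep <- order(v, decreasing = TRUE)[seq_len(min(k, ncol(otu)))]
  otu[, keep, drop = FALSE]
}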
We implemented a Leave-One-Dataset-Out (LODO) approach for these realistic data applications. In
this setting, each of the six datasets was successively treated as test data, with the remaining five datasets
serving as training data. For the training data, we created a comprehensive set of all OTUs, filling in zero
abundance where a specific OTU was missing from a dataset. As in the simulation studies, we applied
ComBat normalization to the five training datasets. However, the training data in the naive setting was
left unchanged.
The next steps involved training the machine learning classifiers independently on each of the five
training datasets, and subsequently predicting on the test dataset, followed by applying the integration
methods to the resulting five lists of prediction probabilities. As for the merging method, the five training
datasets were amalgamated into a single dataset, on which a single machine learning classifier was trained.
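The overall procedure can be sketched as the following LODO skeleton in R; datasets (a named list of samples-by-OTUs matrices), labels (the matching case/control vectors keyed by the same names), and the fit and predict_prob helpers are all hypothetical placeholders for the classifiers used above:

# Schematic LODO loop: each dataset serves once as test data.
lodo <- function(datasets, labels, fit, predict_prob) {
  lapply(names(datasets), function(test_name) {
    test  <- datasets[[test_name]]
    train <- datasets[names(datasets) != test_name]

    # Union of OTUs across the training sets; missing OTUs are zero-filled,
    # and the test data are restricted to this union.
    all_otus <- Reduce(union, lapply(train, colnames))
    pad <- function(x) {
      miss <- setdiff(all_otus, colnames(x))
      filled <- cbind(x, matrix(0, nrow(x), length(miss),
                                dimnames = list(rownames(x), miss)))
      filled[, all_otus, drop = FALSE]
    }
    train <- lapply(train, pad)
    test  <- pad(test)

    # One classifier per training dataset yields one probability list each;
    # these lists feed the ensemble weighting or rank aggregation step.
    lapply(names(train), function(tr)
      predict_prob(fit(train[[tr]], labels[[tr]]), test))
  })
}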
3.2.6 Applications on TB gene expression datasets
In the application to TB gene expression studies, we utilized six annotated and pre-processed datasets,
originally collected by Zhang et al. [84]. These datasets represent different geographical regions and vary
in the numbers of cases and controls. Detailed information regarding the origin, case-control breakdown,
and associated references is provided in Table 3.2.
These TB datasets had already been transformed into log FPKM values. As such, we directly
selected the top 1000 gene features with the largest variance from each dataset. By taking the union of
these genes, we compiled a comprehensive set of features for subsequent analyses.
Similar to the approach for CRC datasets, we implemented the Leave-One-Dataset-Out strategy for
these TB gene expression datasets. This involved treating each dataset in turn as test data, with the remaining datasets serving as training data. Detailed procedures for this strategy are outlined in the section Applications on real CRC metagenomic datasets.
Table 3.2: Overview of six tuberculosis-related gene expression datasets.
Dataset Country No. of cases No. of controls Reference
Leong India 25 19 [80]
Zak South Africa/Gambia 16 104 [93]
Anderson South Africa/Malawi/Kenya 20 50 [94]
Walter USA 35 35 [95]
Kaforou1 South Africa 46 48 [96]
Kaforou2 Malawi 51 35 [96]
3.3 Results
3.3.1 ComBat normalization is essential for heterogeneous populations
In the first scenario, we assessed the influence of divergent background operational taxonomic unit (OTU)
distributions in the training samples on the predictive performance. The results are depicted in Figure 3.3.
When the training and test data possess disparate background OTU distributions, the direct application
of the predictive model trained on the training data to the test data yields substantially low predictive
accuracy with an AUC approximating 0.56, as illustrated in the "Training1" and "Training2" rows. Neither
merging the raw training data nor directly integrating the trained models from multiple training samples enhanced the predictive accuracy. These findings underscore the critical role of normalization in the
development of predictive models.
Many normalization methods have been developed for metagenomic data, but most of them primarily
address experimental artifacts rather than population differences across studies [43, 87, 88]. In our study,
we applied ComBat [43], a widely used normalization method, to normalize the metagenomic data from
different populations. We wanted to investigate if this normalization could improve prediction accuracy.
We observed that using ComBat to normalize the metagenomic data markedly enhanced the prediction
accuracy in the test data. For example, the AUC score for "Training2_ComBat" increased from an average of
0.56 to approximately 0.96, 0.88, and 0.66 when the disease effect was set at 1.1, 1.075, and 1.05, respectively.
We used the formula v1(α) = αv1 + (1 − α)v2 to calculate the background OTU relative abundance for
"Training1", where α represents the difference between "Training1" and "Training2". We observed that the
prediction accuracy decreased as α increased for all values of the disease effect ed. For instance, when
ed = 1.075, the AUC for "Training1_ComBat" decreased from 0.87 to 0.78 as α increased from 0 to 1. This
observation suggests that v1 is further away from the test data compared to v2, which could contribute to
the decrease in prediction accuracy.
After observing the marked increase in prediction accuracy for "Training1_ComBat" and "Training2_ComBat", we explored whether integrating the two predictors through ensemble weighted learning or rank aggregation could further enhance the prediction accuracy. The results, shown in the top
rows marked in red, indicate that most integration methods performed similarly and outperformed both
"Training1_ComBat" and "Training2_ComBat". Notably, the merging method after ComBat normalization, "Merged_ComBat", yielded the best performance. When the disease effect ed was relatively high
(e.g., ed ≥ 1.075), the increase in AUC over other integration methods was minimal. However, when
ed = 1.05, the AUC for "Merged_ComBat" was 0.77, compared to approximately 0.72 for other integration
methods when α was small. Similarly, the AUC for "Merged_ComBat" was 0.75, compared to about 0.69
for other integration methods when α was large.
These results indicate that normalizing for population differences before training machine learning
models can play a crucial role in improving prediction performance. It is surprising considering that
ComBat was originally designed to correct for experimental artifacts such as batch effects, not specifically
for adjusting population differences. Our findings clearly demonstrate that ComBat can be used to adjust
for population differences and enhance cross-study prediction accuracy.
3.3.2 ComBat normalization effectively corrects batch effects within the same population
In the second scenario, we considered studies within the same population but conducted in different laboratories or using different sequencing technologies. In such cases, experimental batch effects can occur
and it is crucial to correct these batch effects to ensure accurate and reliable predictions. As described in
Scenario 2: Different batch effects in studies with the same background distribution of genomic features
in a population, we simulated batch effects that affected the mean and variance of OTU abundance levels,
respectively. We evaluated the prediction accuracy of the RF classifier on the test data for each type of
batch effect.
In this scenario, we set the disease effect ed = 1.025 to clearly compare the performance of different
methods. Figure 3.4A shows the results when there are additive effects on the mean (sevmean ∈ {0, 3, 5})
while the variance remains unchanged (sevvar = 1). Without data normalization, the AUC score on the
test data is slightly higher than 0.5 when sevmean = 0 and exactly 0.5 when sevmean ≠ 0, as expected.
After applying ComBat normalization, the AUC scores on the test data increased to around 0.75 for all
parameter values. Most of the ensemble weighted learning and rank aggregation methods applied further
improved the AUC scores, approaching 0.8. However, it is worth noting that "rank_RRA_ComBat" and
"CS_Avg_ComBat" slightly underperformed compared to other integration methods.
When simply merging the two training datasets after ComBat normalization, the highest AUC score of
approximately 0.85 was achieved. This result demonstrates that the merging method, "Merged_ComBat",
can effectively improve prediction accuracy even in the presence of batch effects within the same population. Furthermore, it is interesting to note that the prediction accuracy only slightly decreased from 0.88 to
0.85 when batch effects were absent, indicating the robustness and effectiveness of the "Merged_ComBat"
method.
Figure 3.4B presents the results when there are multiplicative effects on the variance (sevvar ∈ {1, 2, 4})
without any effect on the mean of OTU abundance levels (sevmean = 0). In this case, without normalization, there is still some predictive power for the test data, although the AUC score is generally low
at around 0.6 when considering the two batches separately. Integrating the two prediction probabilities
without ComBat normalization does not improve the prediction accuracy.
However, when sevvar = 1 or 2, applying ComBat normalization improves the AUC scores based on
the two training datasets, increasing them to approximately 0.75. Integrating the prediction probabilities
after ComBat normalization further enhances the AUC to around 0.80. On the other hand, when sevvar =
4, ComBat normalization only increases the AUC to 0.67 when using ensemble weighted learning and
rank aggregation methods. However, when merging the predictors after ComBat normalization, denoted
as "Merged_ComBat", the highest AUC score of 0.81 is achieved.
Overall, these findings highlight the effectiveness of ComBat normalization in correcting batch effects
within the same population and demonstrate the potential of the "Merged_ComBat" method to achieve
high prediction accuracy.
3.3.3 Prediction accuracy can be markedly decreased as the number of overlapping
disease-associated OTUs decreases
In Scenario 3, we examined the impact of varying disease models in the training and test data on the
predictive accuracy of the Random Forest (RF) classifier. The results, presented in Figure 3.5, demonstrated
a general consistency in the relative performance of diverse prediction methods with the findings from
Scenarios 1 and 2, given that the number of overlapping disease-associated Operational Taxonomic Units
(OTUs) exceeded four. However, in this context, the merging method did not clearly outperform other
ensemble weighted learning and rank aggregation methods under the normalization setting. Despite this, it
retained a commendable performance compared to alternative methods.
As expected, we observed a noticeable dip in prediction accuracy with the reduction in the count of
overlapping disease-associated OTUs between the training and test data. For instance, when ed = 1.075
and the overlap in disease-associated OTUs was fewer than 6, the AUC value was below 0.57. This value
experienced an upswing to roughly 0.7 as the count of overlapping disease-associated OTUs escalated to
between 6 and 8. Further, with an overlap of 10 disease-associated OTUs, the optimal AUC value reached
approximately 0.91. These findings vividly illustrate the impact of disparities in disease models between
training and test data on predictive performance. Additionally, when faced with significant variations
between the training and test disease models, neither merging nor integration techniques were found
beneficial, even when utilized in conjunction with normalization methods.
3.3.4 Applications to metagenomic datasets related to colorectal cancer
We applied our methods to analyze six metagenomic datasets related to colorectal cancer (CRC) as summarized in Table 3.1. In this analysis, we employed a Leave-One-Dataset-Out (LODO) setting, where one
dataset was treated as the test data and the other five datasets were used as training data. The methodology for this analysis is described in detail in Applications on real CRC metagenomic datasets. Figure 3.6A
presents the average AUC scores from the six LODO experiments for the "Merged", "Ensemble weighting",
"Ensemble weighting (normalized)", "Rank aggregation" and "Rank aggregation (normalized)" methods.
The "Single learner" results in each LODO experiment represent the average performance of the five single
learners (trained on the five training datasets), followed by the average across the six LODO experiments.
The individual results for each LODO experiment can be found in Figure S6.
As shown in Figure 3.6A, using a single training dataset and predicting on the test data resulted in an
average AUC increase from 0.63 to 0.65 when ComBat normalization was applied to the training data. The
AUC scores for ensemble weighted learning and rank aggregation methods both showed slight improvements after ComBat normalization, but none of the integration methods outperformed merging the five
training datasets. Comparing the prediction performance of the merging and integration methods with
that of the single learner, it is evident that the AUC scores increased by approximately 0.1 on average, further supporting the notion that cross-study prediction using multiple training datasets is more accurate
than relying on a single training dataset.
Examining the individual training results as shown in Figure S6, we observed similar trends to those
in Figure 3.6A in most cases. Among the six test datasets, the Hannigan dataset consistently exhibited
the lowest AUC scores, and neither merging with ComBat normalization nor ensemble weighted learning
methods improved the prediction performance. Additionally, the AUC scores when using the Hannigan
dataset as the test data were the lowest among the six results, with an average of 0.61 for the single learner
and 0.63 for the integration methods. This observation aligns with the data distribution depicted in the
PCoA plot (Figure 3.2A), where the Hannigan dataset appears to be the most distinct and least overlapping with the other five datasets. This highlights the substantial differences in the count data distribution
of the Hannigan dataset compared to the other five datasets. As demonstrated in our simulation studies, differences in the background distributions of genomic features among populations can impact the
reproducibility of a machine learning classifier's prediction performance.
3.3.5 Applications to gene expression datasets related to tuberculosis
To further examine the prediction performance of merging and integration methods on real datasets, we
utilized six TB gene expression datasets as summarized in Table 3.2 and followed the procedures described in the section Applications on TB gene expression datasets.
In the TB studies application (Figure 3.6B), the overall AUC results were substantially higher than
those observed in the CRC studies, and ComBat normalization led to significant improvements across all
analyzed methods on average. Unlike the CRC studies, the ensemble weighted learning methods slightly
outperformed the merging method in both the naive and ComBat normalization settings.
From the individual plots Figure S7, we observed that when the Zak and Anderson datasets were used
as test data, the prediction results were lower compared to the other four datasets, and the improvements
resulting from ComBat normalization were smaller as well. This observation is consistent with the study
by Zhang et al. [84], where these two datasets exhibited the highest cross-entropy losses when used as test
data. One possible explanation for these differences is that the Zak and Anderson datasets only included
children or adolescents, while the other four datasets also included adults. According to Alcaïs et al. [97],
children and adults exhibit different tuberculosis clinical features and pathogenesis, which can influence
the reproducibility of machine learning models when populations have distinct disease characteristics.
We also observed remarkable variation among the different single learners. For instance, in Figure S7A, when the RF classifier was trained on the Walter and Leong datasets and used to predict
on the Zak dataset, the AUC results were much higher compared to training on Anderson, Kaforou1, and
Kaforou2. Similarly, when training the RF classifier on Kaforou1 and Kaforou2 and predicting on Anderson,
the AUCs reached 0.83 and 0.88, respectively, while training on the other three datasets and predicting on
Anderson resulted in an AUC of 0.5, resembling a random guess. These observations align with the data
distribution in the PCA plots (Figure 3.2B), where Anderson, Kaforou1, and Kaforou2 are closer to each
other and further from Zak, Walter, and Leong, while the latter three datasets are closer to each other.
This further underscores the significant impact of dataset heterogeneity on the reproducibility of machine
learning classifiers.
Finally, based on the results from the real applications in CRC and TB studies, consistent with the simulation study results mentioned earlier, we demonstrate that the ComBat normalization method is crucial for
handling heterogeneous populations. When dealing with heterogeneous populations, it is recommended
to employ both merging and integration methods to obtain the best prediction results.
3.3.6 Consistency across LASSO, SVM, and XGBoost classifiers mirroring the RF classifier
in simulation studies and real data applications
In addition to the RF classifier, we employed the LASSO, SVM, and XGBoost classifiers in experiments
involving both simulations and real data applications. The trends observed for the case/control status
prediction AUC values in both simulations and real data applications were similar across all classifiers.
In the simulations for Scenario 1, we noticed enhancements in AUC values using the ComBat normalization combined with merging and integration methods across all three classifiers. The improvements,
however, were not as notable for the LASSO (Figure S8) and SVM classifiers (Figure S9) as they were for
the RF (Figure 3.3) and XGBoost classifiers (Figure S10).
During Scenario 2, we recorded noticeable improvements in AUC scores for the LASSO (Figure S11)
and XGBoost (Figure S13) classifiers when we applied ComBat normalization along with merging and
integration methods, particularly in the presence of batch effects that impacted the mean of OTU abundances. The SVM classifier (Figure S12) also demonstrated slight improvements under these conditions.
Conversely, when batch effects perturbed the variance of OTU abundances, the XGBoost classifier showed
a clear improvement in AUCs, while neither the LASSO nor the SVM classifier demonstrated any evident enhancement.
In Scenario 3, the AUCs of the SVM (Figure S15) and XGBoost (Figure S16) classifiers improved when
applying merging and integration methods along with normalization, given that the number of overlapping
disease-associated OTUs exceeded four. However, we did not observe this advantage with the LASSO
classifier (Figure S14). Similarly, when there was a pronounced discrepancy in the disease models of the
training and test data, none of the employed methods showed improved results across any of the classifiers.
In the CRC and TB applications, we observed a consistent performance across all methods and settings
among the three additional classifiers (Figure S17, Figure S18, Figure S19), mirroring the behavior of the
RF classifier. While the enhancements in AUCs brought about by using ComBat normalization in combination with merging and integration methods were less pronounced compared to the simulations, they
still indicated the potential for these methods to enhance the prediction performance of machine learning
classifiers.
In comparing the four different classifiers, we observed that the RF and XGBoost classifiers delivered
remarkably similar AUC results in both simulations and real data applications, with a much larger enhancement in the performance compared to LASSO and SVM when using normalization with merging
and integration methods. In the simulation studies, the LASSO classifier delivered superior prediction performance compared to the other three classifiers. In contrast, the SVM classifier generally produced the
least effective prediction performance. Overall, the outcomes from the three additional classifiers align
closely with those from the RF classifier, reinforcing the importance of applying normalization coupled
with merging and integration methods. These strategies are key in mitigating the effects of heterogeneity
and enhancing the reproducibility of machine learning classifiers.
3.3.7 Rank aggregation is an alternative approach for integrating heterogeneous studies
Aggregating rank lists from multiple studies is a common strategy in genomic studies to gain a comprehensive understanding of biological phenomena. This approach provides a way to integrate heterogeneous
data without the need for data normalization across studies. Previous studies have shown that aggregated
rank lists yield more meaningful results than individual rank lists [90, 91, 92]. In our study, we extended
this concept to integrate prediction probabilities from multiple machine learning classifiers by transforming them into rank lists and applying various rank aggregation methods. We focused on five established
rank aggregation methods: mean of ranks, geometric mean of ranks, Stuart rank aggregation, Robust Rank
Aggregation (RRA), and Bayesian Analysis of Rank data with Covariates (BARC). Detailed descriptions of
these methods can be found in the Rank aggregation methods section.
In our simulation studies, we observed that the performance of the five rank aggregation methods was
comparable to that of ensemble weighted learning methods in all three scenarios (Figure 4.4, Figure 4.5,
and Figure 4.6). Notably, all five rank aggregation methods performed well when combined with ComBat
normalization.
In the real applications of CRC and TB, the five rank aggregation methods showed slightly lower prediction performance compared to ensemble weighted learning methods in the naive settings. However,
when combined with ComBat normalization, all five methods achieved similar and promising prediction
results compared to ensemble weighted learning methods. Interestingly, Figure S6 and Figure S7 demonstrate that for individual studies where ensemble weighted learning did not improve AUC scores (Figure S6
B and D), the rank aggregation methods actually enhanced prediction performance compared to the ensemble weighted learning methods.
The simulations and real applications consistently highlight the value of aggregating rank lists of prediction probabilities when integrating multiple heterogeneous datasets for phenotype prediction. We have
demonstrated that rank aggregation methods are as robust as ensemble weighted learning methods and
can even improve prediction performance in cases where ensemble weighted learning yields little improvement.
3.4 Discussion
With the increasing availability of large collections of omic data, the reproducibility of machine learning prediction models under study heterogeneity has raised great concern in cross-study predictions. Previous studies have addressed this issue and developed many statistical methods
to overcome study heterogeneity, including merging with batch effect removal [98] and ensemble learning
methods [47]. In this study, we performed a comprehensive analysis of different methods for phenotype prediction by integrating heterogeneous omic studies. We considered three different sources of heterogeneity between datasets: population differences, batch effects, and different disease models. We developed a workflow for simulating these three sources of heterogeneity and generating simulated samples based on real datasets. We also evaluated the prediction performance of many different statistical methods, including merging, ensemble weighted learning, and rank aggregation. Besides comparing the different methods, we also explored the potential of first normalizing the data with ComBat and then applying the statistical methods mentioned above. We provided both simulation studies and real data applications on CRC metagenomic datasets and TB gene expression datasets to compare the different approaches.
In our simulation studies, we observed a decreasing trend in prediction accuracy for all the statistical methods we investigated as the population heterogeneity grew, as the batch effects on the training data increased, and as the differences between the training and test disease models enlarged. These observations indicate that heterogeneity needs to be addressed before applying machine learning prediction models in cross-study settings. Merging and integration methods that combine different studies for phenotype prediction without batch correction did not improve the prediction accuracy much compared to a single training model, but when combined with ComBat normalization, we observed a remarkable improvement in the prediction accuracy in all simulations. These observations indicate that normalizing heterogeneous datasets before training machine learning models is essential for improving phenotype prediction performance.
It is noteworthy that our simulations yielded conclusions different from those of Zhang et al. [84] on the prediction performance of merging with ComBat normalization. In their
study, they showed that merging with ComBat normalization was not as robust as ensemble learning
methods at high severity of batch effects. However, in our second scenario of simulating different severity
of batch effects on training and test data, we observed that merging combined with ComBat normalization
always achieved the highest prediction performance regardless of the severity of the batch effects (Scenario 2: Different
batch effects in studies with the same background distribution of genomic features in a population). We
investigated the contradictory observations between our study and that of Zhang et al. [84], and we noticed that the ComBat normalization process in our study differed from theirs. In their study, when
multiple training batches were provided, the batches were pooled into one training data, then applied
ComBat normalization on the pooled dataset. They then did a second round of ComBat normalization on
the test data using the pooled training data as reference. In our study, we normalized the training batches
using the test data as reference when conducting ComBat normalization independently. The normalized
training batches were then pooled into one dataset for training the machine learning model. Since the test data is always
the target of prediction, instead of adjusting the test data, we used it as the baseline to adjust different
training batches so that the differences between training and test data were mitigated more effectively.
Our study showed that our normalization approach yielded higher prediction accuracy.
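A sketch of this reference-based adjustment with sva::ComBat, assuming train and test are features-by-samples matrices; the ref.batch argument keeps the reference (test) batch unchanged:

# Normalize one training batch against the test data with ComBat,
# using the test batch as the reference so the test data stay unchanged.
library(sva)

normalize_to_test <- function(train, test) {
  dat   <- cbind(train, test)                 # features x samples
  batch <- c(rep("train", ncol(train)), rep("test", ncol(test)))
  adj   <- ComBat(dat = dat, batch = batch, ref.batch = "test")
  adj[, seq_len(ncol(train)), drop = FALSE]   # return the adjusted training data
}

# train1_adj <- normalize_to_test(train1, test)
# train2_adj <- normalize_to_test(train2, test)
# merged     <- cbind(train1_adj, train2_adj)  # input for "Merged_ComBat"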
Consistent with simulations, the applications on the CRC metagenomic and TB gene expression datasets
with Leave-One-Dataset-Out experiments showed similar trends in terms of performance of merging and
integration methods combined with ComBat normalization. However, the improvement in prediction accuracy is less remarkable than in the simulation studies. In our simulation studies, we intentionally manipulated the training and testing datasets as well as the disease effect factors in the case and control samples
to effectively highlight the improvements achieved by combining normalization and integration methods.
In our simulation scenarios 1 and 3, the training and testing sets were chosen to be highly different. The
disease effect in all the simulations was set to be relatively high. We selected a relatively large constant parameter C = 10^6 to ensure that individual samples exhibit minimal variance in terms of simulated
OTU abundance. However, in real-world scenarios, the disease effect factor might not be as pronounced as
in our simulations, the feature distribution between training and testing data might be more aligned than
in our simulation settings, and different individuals could exhibit greater variance in OTU abundances. In
such contexts, the disparities in results with or without batch normalization might not be as significant
as observed in our simulations. In the Supplementary Materials, we explored the impact of individual
sample variance. When C was set to 10^5, leading to individuals having more variance in simulated
OTU abundances, the improvements in AUCs after normalization and integration were less pronounced,
as shown in Figure S26, Figure S27, and Figure S28. The less pronounced AUC improvements are further
corroborated by the comparisons under Scenario 1, where, as the disease effect factor decreases toward 1, the improvements in AUC scores achieved through the combination of normalization and
integration methods become less noticeable. In contrast, when five training datasets were applied in real
data applications as opposed to two in the simulations, all merging and integration methods exhibited
improved prediction performance, even in the absence of ComBat normalization. These findings underscore the value of integrating multiple studies, rather than relying solely on a single study, to bolster the
reproducibility of machine learning models.
In comparing the statistical methods used in our study, we saw similar trends for all the ensemble weighted learning methods except for the slightly lower performance of the "CS-Avg" method. The "CS-Avg" method penalizes the training dataset with the worst average performance on the rest of the training datasets during cross-training-data validation and excludes this dataset from predicting on the test dataset. The worse performance of "CS-Avg" demonstrates that excluding the worst-performing training dataset may not be beneficial to phenotype prediction, as it discards useful information from that training data at the same time. Therefore, we suggest using other ensemble weighted learning methods that also penalize the worst-performing training data but retain its useful information in some way.
We also incorporated rank aggregation methods into our study and showed that they achieved similar prediction performance, boosting the prediction accuracy remarkably. Rank aggregation methods should be considered an alternative way of integrating heterogeneous studies in the future. We also noticed the excellent performance of the merging method; consistent with the findings of Guan et al. [99], merging and integration methods can outperform each other in different scenarios, so when training on multiple studies both methods should be considered to find the optimal one.
Our findings from employing various machine learning classifiers parallel the trends observed with the RF classifier in the main text for both the simulation studies and the real-world data applications. Intriguingly, we observed a large enhancement in the performance of the tree-based classifiers, namely RF and XGBoost, when ComBat normalization coupled with integration methods was applied, while LASSO and SVM displayed only marginal improvements with these methodologies.
A limitation of the strategies in our study is that they require test data information for adjusting the
heterogeneity effects from the training datasets. Consequently, when using a different test dataset, the
training data needs to be readjusted and the machine learning classifiers need to be retrained. In comparison, methods like Cross-Platform Omics Prediction (CPOP) [100] are designed to make predictions on a
single sample without normalization or the need for additional data integration. However, it should be
noted that the CPOP method is built on a penalized regression classifier and currently may not be compatible with other machine learning methods, whereas our approach can be adapted to many different machine learning methods. In addition, CPOP is only applicable to situations with a limited number of targeted features and cannot be applied to whole genomes or microbiomes.
Figure 3.1: Workflow for integrating multiple simulated heterogeneous metagenomic datasets. A:
Simulation stage of three different heterogeneity scenarios. Scenario 1: Divergent training and test populations with varied genomic feature distribution. Scenario 2: Different batch effects on training data with
consistent underlying population genomic feature distribution. Scenario 3: Varying degrees of overlap in
disease-associated OTUs between training and test datasets. The output of this step includes two simulated
training datasets and one test dataset. B: Naive and ComBat normalization settings. In the Naive setting,
the output datasets from Step A are directly used without any additional normalization. In the ComBat
normalization setting, the two training datasets from Step A undergo normalization using the ComBat
method to address potential batch effects by using the test dataset as a reference. C: Classification stage of
applying different ensemble weighted learning and rank aggregation methods. The two training datasets
from Step B are used to train two machine learning classifiers. These classifiers generate two lists of prediction probabilities when applied to the test dataset. For ensemble weighted learning, the two lists of
probabilities are directly integrated using specific weights (w1, w2) determined by the integration method,
resulting in a final list of prediction probabilities. For rank aggregation, the two lists of probabilities are
ranked, and the resulting lists of ranks are integrated using the respective rank aggregation methods. Different weights (a1, a2) determined by the method are used in this integration process, producing one final
list of ranks. Note that the integration methods utilize distinct weights (w1, w2, a1, a2) based on their
specific approach.
Figure 3.2: Distributions of genomic features across multiple colorectal cancer and tuberculosis
studies. A: Principal Coordinate Analysis (PCoA) of Bray–Curtis distances, calculated from six colorectal
cancer metagenomic count datasets. B: Principal Component Analysis (PCA) performed on six tuberculosis
gene expression datasets. The PCA was carried out on the logarithm of fragments per kilobase of transcript
per million mapped reads (log FPKM). In this figure, the Zak, Leong, and Walter datasets overlap with each
other and are far away from the other three datasets. C: PCA performed on the Zak, Leong, and Walter
tuberculosis gene expression datasets. The ellipses represent a 95% confidence level under the assumption
of a multivariate t-distribution. A round dot represents a case sample, while a triangle dot signifies a
control sample.
Figure 3.3: ComBat normalization markedly enhances cross-study prediction when the training
and test data have divergent feature distributions. The figures display the AUCs of RF predictions
employing various integration methods with three distinct disease effect factors. Columns represent different values of α. Method names without a "ComBat" suffix refer to those implemented in the naive
setting, while those with a "ComBat" suffix were conducted in the ComBat normalization setting. All the
experiments were conducted 100 times, and the AUC scores presented in the figure represent the averages from these 100 trials. Abbreviations: Ensem.learn.-Ensemble Learning; Rank aggr.-Rank aggregation;
norm.-normalization.
Figure 3.4: ComBat normalization improves cross-study prediction in the presence of batch effects. The figures show the AUC scores of RF prediction using different integration methods with varying
severity levels of batch effects. A: AUC score comparisons with different severity levels of additive batch
effects on the mean of OTU abundances, with no multiplicative batch effect on the variance. B: AUC
score comparisons with different severity levels of multiplicative batch effects on the variance of OTU
abundances, with no additive batch effect on the mean. The disease effect factor was set to 1.025 for both
scenarios. The integration methods without a suffix of "ComBat" were applied in the naive setting, while
those with a suffix of "ComBat" were applied after ComBat normalization. The experiments were repeated
100 times, and the shown AUC scores are the averages across the 100 trials. Abbreviations are the same as
in Figure 3.3.
Figure 3.5: ComBat normalization markedly increases cross-study prediction when the training
and test data have varying degrees of overlap in disease-associated OTUs. The figures show the
AUCs of RF prediction using different integration methods with various numbers of overlapping disease-associated OTUs. The disease effect factor was set to 1.075. Columns represent different numbers of overlapping disease-associated OTUs in the training and test data; the larger the number, the more similar the two disease models are. When the number reaches 10, the two models are the same in the training and test data. All the experiments were repeated 100 times, and the AUC scores shown in the figure are the averages from the 100 trials. Abbreviations are the same as in Figure 3.3.
Figure 3.6: Realistic applications of merging and integration methods on multiple CRC metagenomic datasets and TB gene expression datasets using RF classifiers, with average AUC scores from six individual Leave-One-Dataset-Out (LODO) experiments. A: Leave-one-dataset-out average AUC score comparisons among different methods in colorectal cancer metagenomic datasets. B: Leave-one-dataset-out average AUC score comparisons among different methods in tuberculosis gene expression
datasets. The results by different methods are grouped into six groups. "Single learner": Each of the five
training datasets were trained independently with RF classifier and predicted on the test dataset, then the
average AUC score was taken among the five predictions. "Merged": Merging method with pooling all five
training datasets into one training data. The "Single learner" and "Merged" experiments were conducted
under both naive and ComBat normalization settings. "Ensemble learning": The five training predictors
were integrated by ensemble weighted learning methods under naive setting. "Ensemble learning (normalized)": The five training predictors were integrated by ensemble weighted learning methods under
ComBat normalization setting. "Rank aggregation": The five training predictors were integrated by rank
aggregation methods under naive setting. "Rank aggregation (normalized)": The five training predictors
were integrated by rank aggregation methods under ComBat normalization setting. The red dots and associated values on the figure are the mean AUC scores for each method, the vertical bars are the median
AUC scores for each method, and the black dots represent outliers. The same method under different settings is represented by the same boxplot color. All the experiments were repeated 30 times for
each test dataset, and the results presented in the figure were based on the average AUC scores of the total
180 replications for the six test datasets for metagenomic and gene expression studies, respectively.
Chapter 4
Mitigating Heterogeneity Effects in Omics-based Quantitative
Phenotype Prediction: A Comprehensive Workflow for Integrating
Multiple Studies with Batch Normalization
4.1 Introduction
The human microbiome, which refers to the collective genetic material of microorganisms that inhabit our
bodies, has been shown to play an important role in human health and disease. The microbiome can influence various physiological processes such as digestion, immune function, and metabolism, and alterations
in its composition have been linked to a variety of disorders including obesity [81], inflammatory bowel
disease [101], and cancer [40, 41].
Recent advances in high-throughput sequencing technologies have enabled the study of the human
microbiome at unprecedented levels of resolution, leading to the development of computational methods
for predicting the functional potential and activity of microbial communities. These methods, often based
on machine learning algorithms, have shown promise in predicting various phenotypes such as disease risk
based on microbial composition. Microbiome-based phenotype predictions have the potential to improve
our understanding of the complex interplay between the microbiome and human health and may pave the
way for personalized approaches to disease prevention and treatment.
Over the past decade, there has been extensive research into microbiome-based phenotype predictions.
However, the focus of the majority of studies in the field of microbiome research has been on qualitative
phenotype prediction, specifically the classification of individuals into groups based on their microbiome
composition. The prediction of quantitative phenotypes, such as continuous variables like body mass index
(BMI) or blood glucose levels, has received less attention. This is partly due to the difficulty of accurately predicting quantitative phenotypes given the high variability of microbiome composition among individuals, as well as limitations in the methods used to quantify microbial composition, such as 16S rRNA sequencing, which has inherent limitations in accuracy and reproducibility [49, 50]. Nevertheless, the
prediction of quantitative phenotypes by microbial composition is critical for investigating the impact of
the microbiome on human health. For example, a large Korean cohort study by Yun et al. [102] found
significant statistical differences in gut microbiota composition between individuals with different BMIs.
Another study conducted by Cohen et al. [103] suggested that the gut microbiome plays a role in controlling host blood glucose levels through the liver. Therefore, investigating the associations between the
microbiome and health-related quantitative phenotypes is essential to understanding the complex interplay between the microbiome and human health.
Cross-study phenotype predictions are a common approach in omics studies, as they can help address the issue of sample size limitations in individual studies, provide valuable insights into
the relationship between genotype and phenotype in general populations, as well as improve the reproducibility and generalizability of research findings. Similar to other omics data, microbiome-based phenotype predictions also face challenges when conducting cross-study predictions. One major challenge is
the heterogeneity of the data, as omics data can come from a variety of sources, such as different populations or sequencing platforms, and may have varying degrees of quality and completeness. Additionally,
differences in study design, sample size, and phenotype definition can also contribute to data heterogeneity, which can affect the reproducibility of prediction models. Many studies have been conducted to gain
insights into handling heterogeneity in cross-study predictions [47, 79, 84, 99, 104]. However, the majority
of studies have focused on mitigating the effects of heterogeneity on qualitative phenotype cross-study
predictions, particularly, heterogeneity on disease status predictions. The impacts of heterogeneity on
quantitative phenotype predictions have not been well addressed.
There are several distinctions between case-control and quantitative phenotype studies. Firstly, case-control studies typically involve selecting a group of cases with a specific disease and a group of controls
without the disease. This conditional sampling leads to different microbial abundance distributions between cases and controls. In contrast, quantitative phenotype studies usually involve random sampling
of individuals from the general population, resulting in a sample with microbial abundance distribution
similar to that of the overall population. Secondly, the evaluation of performance differs between the
two study types. Case-control studies often employ the area under the receiver operating characteristic
curve (AUROC), which assesses the ability to discriminate between cases and controls without considering the severity or magnitude of the disease. In contrast, quantitative phenotype studies typically evaluate
performance using metrics such as the root mean square error (RMSE) between predicted and true trait
values. These metrics take into account the magnitude of the trait, which impacts the performance measure. Thirdly, quantitative phenotype prediction is more sensitive to feature heterogeneity compared to
qualitative phenotype prediction, particularly in cases of non-linear phenotypes. For instance, consider y = exp(x), where x is the relative abundance of a particular microbe; even a small change in the relative abundance of the microbe can cause a large alteration in the phenotype, as the phenotype value changes exponentially in x. This sensitivity to small changes in genomic features can markedly impact the phenotype value.
Therefore, differences in feature heterogeneity may affect the performance of various integration methods
in quantitative phenotype studies. These differences in study design, evaluation metrics, and sensitivity to
feature heterogeneity highlight the distinctive aspects of case-control and quantitative trait studies, which
may influence the performance of integration methods used in each context.
In our previous study in Chapter 3, we developed a comprehensive workflow to simulate a variety of
different types of heterogeneity and evaluate the performances of merging and integration methods together with batch normalization [104]. As a continuation, in this study, we investigate the heterogeneity
effects on quantitative phenotype prediction incorporating multiple studies and aim to evaluate the performance of combining batch normalization with integration methods on mitigating heterogeneity effects
in cross-study quantitative phenotype predictions. As shown by our previous findings in Chapter 3, ComBat [43] normalization showed potential when combined with merging or integration methods in terms
of improving cross-study case-control prediction performance. We continue to investigate the potential of
ComBat in predictions of quantitative phenotypes. Additionally, we assess the performance of a recently
published batch effect correction method, ConQuR [105], which is specifically designed for microbiome
data. Combined with these two normalization methods respectively, we further evaluate the performances
of seven integration methods in quantitative phenotype predictions. We provide both realistic simulations
and real data examples to compare the performance of combining batch normalization with integration
methods in terms of mitigating heterogeneity effects on cross-study phenotype predictions. Our simulations address three types of heterogeneity: different background distributions of genomic features in
populations, batch effects across different studies from the same population, and different phenotype-associated models in various studies. We simulate microbial feature abundances first, followed by simulating four types of quantitative feature-associated phenotypes. In the realistic applications, we first conduct an analysis of BMI predictions using microbial abundance data from four publicly available datasets [li, 106, 107, 108]. We then conduct cross-study survival predictions on three ovarian cancer gene expression
datasets [109, 110, 111].
4.2 Materials and Methods
4.2.1 A comprehensive workflow for predicting quantitative phenotype from simulated
heterogeneous metagenomic datasets
We have developed a comprehensive workflow for predicting quantitative phenotype by integrating multiple simulated datasets. This workflow builds upon the methodology for case-control phenotype prediction
from our previous study in Chapter 3. The workflow consists of three main stages: simulations, normalization, and integration.
In the simulation stage, we conduct three different scenarios. Firstly, we investigate how different
background genomic feature distributions can affect quantitative phenotype predictions by downstream
machine learning models. We evaluate the performance of different integration approaches by simulating
three populations with different genomic distributions and focusing on the differences between the two
training datasets. More information on simulation of this scenario can be found in section Scenario 1:
Different background distributions of genomic features in training populations.
Secondly, we simulate batch effects with varying severity levels on the two training batches. Previous
studies have shown that batch effects can weaken the reproducibility of genomic findings [86]. Therefore,
it is crucial to remove batch effects before performing any downstream analysis. We explore the impact
of batch effects on quantitative phenotype predictions and describe the method in detail in section Scenario 2: Different batch effects in studies with the same background distribution of genomic features in a
population.
Lastly, we investigate how different underlying phenotype-associated feature models in the training
and test populations can affect phenotype predictions. We assume that the phenotype-associated features
are not exactly the same in the training and test data. We tune the number of overlapping genomic features
in the training and test models and simulate different datasets accordingly. More details on this scenario
can be found in section Scenario 3: Different phenotype-associated feature models in different studies.
In the second stage of the workflow, we evaluate the performance of different integration methods in
two settings. In the naive setting, the simulated training and test data from the simulation stage remain
unchanged and serve directly as input to the third integration stage. In the normalization setting, we apply
a normalization method (either ComBat [43] or ConQuR [105]) to the two simulated training datasets using
the test data as a reference. This adjusts the abundance of the two training datasets while keeping the test
data unchanged. We describe the ComBat or ConQuR normalization method in detail in section Naive,
ComBat normalization, and ConQuR normalization.
In the integration stage, we train a machine learning model on each training dataset independently and
then apply different integration methods on the two predictors to generate final results. We describe seven
different integration methods used in this stage in section Integration methods for combining multiple
training models. In addition to the integration methods, we also apply a merging strategy. We pool the
two training datasets together and train a machine learning model on the merged dataset.
We apply the same normalization stage and integration stage to the training and test data in the simulation study and on real data applications.
4.2.2 Three simulation scenarios of heterogeneity
Similar to our previous study on case-control studies in Chapter 3, we construct three different scenarios
to represent the heterogeneity in the training and test data.
4.2.2.1 Scenario 1: Different background distributions of genomic features in training populations
In this scenario, we aim to investigate how genomic feature distribution differences in different training populations could affect downstream machine learning model predictions. To simulate this scenario,
Figure 4.1: Workflow for quantitative phenotype prediction by integrating simulated heterogeneous metagenomic datasets. A: Simulation stage of three different heterogeneity scenarios. Scenario
1: Different background distributions of genomic features in training populations. Scenario 2: Different
batch effects in studies with the same background distribution of genomic features in a population. Scenario 3: Different phenotype-associated feature models in training data and test data. The output from this
step consists of two simulated training datasets and one test dataset. B: Naive and normalization stage.
The output from A is unchanged in the naive setting, while in the normalization stage the two simulated training datasets are normalized. C: Integration stage. The output from the previous stage is used to train
machine learning models, followed by applying different integration methods.
we collect three publicly available and geographically diverse colorectal cancer metagenomic read count
datasets (Table 4.1) and use the healthy control samples from these datasets as the baseline for simulations.
We use the Hannigan [62] and Yu [61] datasets to generate the training data and the Feng [63] dataset to
generate the test data. The two training datasets are pre-processed to retain the top 1,000 operational taxonomic units (OTUs) with the largest variance in the control samples in each dataset, and we take the union
of the OTUs from the two training datasets as the complete set of OTUs to be used in the analysis. A total
of 1,267 OTUs are included in the simulation study. These 1,267 OTUs are all present in the 7,464 OTUs
from the test dataset, while 857 out of 1,267 OTUs (approximately 2/3) overlap with the top 1,000 OTUs with
the largest variance in the test dataset. Finally, the read count data is transformed to relative abundance
vectors by dividing each OTU’s total counts by the sum of all OTUs’ total counts in each sample.
Table 4.1: Three colorectal cancer (CRC) metagenomic datasets used for simulations.
Dataset Country No. of controls Reference
Yu China 54 [61]
Hannigan USA/Canada 28 [62]
Feng Austria 63 [63]
To create a pseudo-population with a relative abundance vector, the healthy control relative abundance
profiles from the Hannigan and Yu datasets, v1 and v2, respectively, are combined using the equation
v1(α) = αv1 + (1 − α)v2, where α varies from 0 to 1 in increments of 0.2. The difference between the
two simulated populations based on v1(α) and v2 increases with α. The relative abundance profiles v1(α)
and v2 are used as background relative abundance vectors for training populations, while v3 adapted from
Feng dataset is used for the test population. The pseudo-population relative abundance vectors are then
used to simulate individual relative abundance profiles.
Next, we determine the relative abundance for each individual in a population. Given a pseudo-population relative abundance vector v (v1(α), v2, or v3 from the above procedure), we assume the relative
abundance of each individual follows a Dirichlet distribution with parameters ak = Cvk, k = 1, 2, · · · , K, where K = 1,267 is the number of OTUs of interest. Under this assumption, the relative abundance profile (X1, X2, · · · , XK) of an individual becomes a random vector with expectation E(Xk) = vk and variance Var(Xk) = vk(1 − vk)/(C + 1). When C is very large, the variance of Xk is close to 0 and Xk is very close to vk. To introduce some variability, we let C be 10^5. Using the above Dirichlet distribution with parameters ak, we then sample X as the relative abundance vector for each individual.
The OTU read counts for each individual are simulated by generating one million reads as the library
size, using the multinomial distribution MN(library size, X), where X is the individual relative abundance vector. The read count data are transformed to log relative abundance data using a zero-replacement strategy with 0.65 times the minimum non-zero abundance, as suggested by Martín-Fernández et al. [89]. However, the simulated read count data are stored for use in the ConQuR normalization setting, which should
be applied directly to the metagenomic read count data.
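A sketch of this per-individual simulation, assuming v is one of the pseudo-population vectors above; the Dirichlet draw is implemented via normalized independent gamma variables:

simulate_individual <- function(v, C = 1e5, library_size = 1e6) {
  # Dirichlet(C * v) sample via normalized gammas; OTUs with v = 0 get zero mass.
  g <- rgamma(length(v), shape = C * v, rate = 1)
  x <- g / sum(g)                                   # individual relative abundances
  counts <- rmultinom(1, size = library_size, prob = x)[, 1]

  # Log relative abundance, zeros replaced by 0.65 x the minimum non-zero value.
  rel <- counts / sum(counts)
  rel[rel == 0] <- 0.65 * min(rel[rel > 0])
  list(counts = counts, log_rel = log(rel))
}

# e.g., one training individual from the mixed background v1(alpha):
# x_ind <- simulate_individual(alpha * v1 + (1 - alpha) * v2)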
Ten OTUs are randomly chosen from the 1,267 OTUs and these 10 OTUs are assumed to be associated
with a particular quantitative phenotype of interest. The first five OTUs are assumed to be positively
associated, while the other five are negatively associated with the phenotype. A vector of coefficients
is generated from the uniform distribution on the interval [3, 5] for positive associations and from the uniform distribution on the interval [-5, -3] for negative associations. The phenotype is simulated as the
linear combination of the abundance from the 10 associated OTUs and the corresponding coefficients. The
simulated phenotype is then transformed by non-linear functions described in section Simulation of linear
and non-linear relationship of quantitative phenotypes to account for non-linear relationships. These 10
OTUs and their associated phenotype models are fixed for all the experiments in the following analysis.
4.2.2.2 Scenario 2: Different batch effects in studies with the same background distribution of
genomic features in a population
In the second scenario, we aim to investigate the impact of batch effects in different studies on downstream machine learning model predictions. To achieve this goal, we simulate the OTU abundances and phenotypes for each individual in the two training datasets and one test dataset using a procedure similar to that in Scenario 1. We keep the baseline population the same by using healthy control samples from the Yu [61] dataset, thereby eliminating population variation between the training and test samples.
To simulate batch effects on the training data, we follow procedures similar to those in Zhang et al. [84]. Following the ComBat batch correction method [43], we implement the batch effect model as a linear model that assumes an additive effect on the mean of the normalized OTU abundances and a multiplicative effect on their variance. We generate three severity levels for the effect on the mean, denoted sev_mean ∈ {0, 3, 5}, and three severity levels for the effect on the variance, denoted sev_var ∈ {1, 2, 4}. The model generates batch effects on the two batches, with adjusted mean values of {mean − sev_mean, mean + sev_mean} and adjusted variance values of {var/sev_var, var × sev_var}.
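The following is a minimal R sketch of this batch effect model under the stated assumptions (additive shift of the mean, multiplicative scaling of the variance); the function name and inputs are hypothetical.

```r
# A minimal sketch of the simulated batch effect model described above.
# `abund` is a samples-x-OTUs matrix of normalized abundances; `batch` labels
# each sample as 1 or 2.
add_batch_effect <- function(abund, batch, sev_mean = 3, sev_var = 2) {
  out <- abund
  for (j in seq_len(ncol(abund))) {
    m  <- mean(abund[, j])
    b1 <- batch == 1
    b2 <- batch == 2
    # Batch 1: mean shifted to m - sev_mean, variance divided by sev_var
    out[b1, j] <- (abund[b1, j] - m) / sqrt(sev_var) + m - sev_mean
    # Batch 2: mean shifted to m + sev_mean, variance multiplied by sev_var
    out[b2, j] <- (abund[b2, j] - m) * sqrt(sev_var) + m + sev_mean
  }
  out
}
```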
4.2.2.3 Scenario 3: Different phenotype-associated feature models in different studies
In the above two scenarios, we assume that the same phenotype-associated feature model applies to both
the training and test data. However, in real-world scenarios, the phenotype-associated feature models can
vary across different populations. To simulate this scenario, we use the healthy control samples in CRC
datasets from Hannigan et al. [62] and Yu et al. [61] as baselines for training data simulations, and the
CRC dataset by Feng et al. [63] for the test data. We follow the same simulation procedure to generate the
pseudo read count data for each individual as in the previous two scenarios.
To select the phenotype-associated OTUs, we fix 10 selected OTUs in the two training datasets. In
the test dataset, we randomly choose a subset of OTUs from the 10 in the training feature model and add
additional OTUs to ensure that the number of associated OTUs remains 10 in the test model. The subset
of OTUs contains 2, 4, 6, 8, or 10 OTUs. As the number of overlapping OTUs increases, the feature models
in the training and test data become more similar. When the number of overlapping OTUs reaches 10 in
the test data, the feature models in the training and test data are the same. Additional pseudo coefficients
are generated for the newly added associated OTUs in the test model as well.
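A minimal R sketch of this test-model construction is given below; the function and variable names are hypothetical.

```r
# A minimal sketch of constructing the test feature model with a given number of
# OTUs overlapping the 10 training-associated OTUs.
make_test_model <- function(train_otus, all_otus, n_overlap) {
  kept  <- sample(train_otus, n_overlap)                          # shared OTUs
  added <- sample(setdiff(all_otus, train_otus), 10 - n_overlap)  # newly added OTUs
  c(kept, added)                                                  # always 10 OTUs in total
}
```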
4.2.3 Simulation of linear and non-linear relationship of quantitative phenotypes
We assume that the quantitative phenotype is related to the abundance of the associated OTUs. We focus on two types of quantitative phenotypes: linear and non-linear. For the linear type, we assume that the phenotype is a linear combination of the associated microbial abundances. For the non-linear type, we simulate three different non-linear relationships between the phenotype and the associated microbial abundances. The equations used to simulate the phenotypes are presented below.
1. Linear: y = c · β^T x + noise
2. Quadratic: y = c · β^T x^2 + noise
3. Inverse: y = c / (β^T x) + noise
4. Logistic: y = c / (1 + exp(β^T x)) + noise
where β is the vector of pseudo coefficients, x is the vector of log relative abundances of the selected phenotype-associated microbial OTUs, x^2 in the quadratic equation represents the component-wise square of the log relative abundances of the associated OTUs, and c is a constant that controls the range of the simulated phenotype y and differs among the four relationships. The noise vector is generated from the standard Gaussian distribution.
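The following is a minimal R sketch of the four phenotype models; the range-controlling constants and the placeholder matrix x (samples × 10 log relative abundances of the associated OTUs) are illustrative assumptions, not the values used in the dissertation.

```r
# A minimal sketch of the four quantitative phenotype models described above.
# `x` is a hypothetical samples-x-10 matrix of log relative abundances of the
# 10 associated OTUs; c1..c4 are illustrative range-controlling constants.
beta  <- c(runif(5, 3, 5), runif(5, -5, -3))  # positive / negative associations
noise <- rnorm(nrow(x))
c1 <- 1; c2 <- 0.5; c3 <- 10; c4 <- 100

lin         <- drop(x %*% beta)                 # beta^T x for each sample
y_linear    <- c1 * lin + noise
y_quadratic <- c2 * drop(x^2 %*% beta) + noise  # component-wise squares of x
y_inverse   <- c3 / lin + noise
y_logistic  <- c4 / (1 + exp(lin)) + noise
```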
4.2.4 Naive, ComBat normalization, and ConQuR normalization
In our previous case-control study, we demonstrated that ComBat normalization [43] plays a crucial role
in enhancing the performance of case-control status prediction when integrating heterogeneous datasets
before training machine learning models. In our current study, we seek to explore the potential of ComBat
normalization when integrating multiple training models for predicting quantitative phenotypes. Moreover, we aim to assess another batch effect correction method, ConQuR [105], which is specifically designed
for microbiome data using the conditional quantile regression method.
While both ComBat and ConQuR are normalization methods used to address batch effects in high-throughput omics data analysis, they have several major differences. First, ComBat utilizes an empirical Bayes framework to estimate and adjust for batch effects, while ConQuR employs a non-parametric quantile regression model to correct the conditional distribution of microbial read counts. This difference in modeling
approaches affects how the batch effects are estimated and corrected. Second, ComBat assumes a normal
distribution for the genomic data and requires data preprocessing, such as log-transformation, before applying the normalization. On the other hand, ConQuR does not make specific assumptions about the data
distribution, so it can work directly on the microbial taxonomic count data. Furthermore, ComBat assumes
that factors resulting in batch effects often affect many genomic features in similar ways. In other words,
the batch effects are assumed to be uniform across the range of genomic features. In contrast, ConQuR
does not assume uniformity of batch effects and allows for variations in the impact of batches across the
range of genomic features.
To investigate the efficacy of these methods in mitigating batch effects on quantitative phenotype
predictions, we design three different settings (Naive, ComBat normalization and ConQuR normalization)
in the second stage of our workflow (Figure 4.1B), employing simulation studies.
In the Naive setting, we train the machine learning models directly on the original training datasets
and apply them to the test dataset, followed by applying the integration methods. In the ComBat and
ConQuR normalization settings, we use the test dataset as a reference to normalize the abundance of two
training datasets separately, using ComBat or ConQuR. Consequently, the test dataset remains unaltered,
while the genomic feature abundances of the two training datasets are adjusted accordingly. Throughout
the Results section, we use various notations to present the results associated with these three settings:
Training1 and Training2: Naive setting; predictions from ML models trained on the Training 1 and Training 2 datasets independently.
Training1_ComBat and Training2_ComBat: ComBat setting; predictions from ML models trained on the ComBat-adjusted Training 1 and Training 2 datasets independently.
Training1_ConQuR and Training2_ConQuR: ConQuR setting; predictions from ML models trained on the ConQuR-adjusted Training 1 and Training 2 datasets independently.
Merged: Naive setting; predictions from an ML model trained on the pooled Training 1 and Training 2 datasets.
Merged_ComBat: ComBat setting; the Training 1 and Training 2 datasets are adjusted by ComBat and then pooled into one training dataset, and predictions come from an ML model trained on the adjusted pooled dataset.
Merged_ConQuR: ConQuR setting; the Training 1 and Training 2 datasets are adjusted by ConQuR and then pooled into one training dataset, and predictions come from an ML model trained on the adjusted pooled dataset.
To ensure clear and consistent notations for the different integration methods and settings used in our
study, we apply the seven integration methods from Table 4.2 to all three settings. Specifically, we use the
integration method names directly for the Naive setting, and add a suffix of "_ComBat" or "_ConQuR" for
integration methods applied in the ComBat or ConQuR settings, respectively.
In Scenario 2 of our simulations, we slightly modify the notations. We refer to the two training datasets
as "Batch1" and "Batch2" to reflect their distinct batch effects. We replace the "Merged" method with two
alternatives: "Merged_NoBatch" and "Merged_Batch". The former involves training a random forests (RF) regression model on the original merged dataset without any simulated batch effects, while the latter involves pooling the simulated "Batch1" and "Batch2" datasets into one and then training an RF model on the combined data. These modifications allow for more precise and consistent reporting of our results.
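To make the normalization settings concrete, below is a minimal R sketch assuming the sva and ConQuR packages; the argument names follow their documented interfaces but should be verified against the installed package versions, and all data objects are hypothetical placeholders.

```r
# A minimal sketch of normalizing one training dataset against the test dataset.
library(sva)     # provides ComBat()
library(ConQuR)  # provides ConQuR()

# ComBat on log relative abundances (features x samples), test batch as reference
combined   <- cbind(train_logrel, test_logrel)
batch      <- c(rep("train", ncol(train_logrel)), rep("test", ncol(test_logrel)))
adj_combat <- ComBat(dat = combined, batch = batch, ref.batch = "test")

# ConQuR on raw read counts (samples x taxa), test batch as reference
tax_tab <- rbind(train_counts, test_counts)
batchid <- factor(c(rep(1, nrow(train_counts)), rep(2, nrow(test_counts))))
covar   <- data.frame(intercept = rep(1, length(batchid)))  # no covariates here
adj_conqur <- ConQuR(tax_tab = tax_tab, batchid = batchid,
                     covariates = covar, batch_ref = "2")
```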
4.2.5 Integration methods for combining multiple training models
We apply seven different integration methods that assign different weights to the models trained on different training datasets and integrate the models for prediction, as in our previous study in Chapter 3. The original five integration methods are adapted from the study by Patil et al. [47], which evaluates the performance of cross-study learners using five alternative choices of weights. Two additional integration methods were proposed in our previous studies in Chapter 2 and Chapter 3. In this study, we aim to explore the potential of these integration methods for quantitative phenotype predictions. The detailed integration methods are described in Chapter 3.
Table 4.2: Integration methods used for combining multiple training models.
Method Abbreviation Reference
Average weighted Avg [47]
Average weighted by study sample size n_Avg [47]
Average weighted by cross-study performance CS_Avg [47]
Averages of study-specific regression weights Reg_a [47]
Stacked regression weights Reg_s [47]
Validation set RMSE weighted val-rmse [104]
Leave-one-sample-out RMSE weighted LOSO-rmse [85]
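As an illustration of how such weights combine study-specific models, the following is a minimal R sketch of an inverse-RMSE weighting in the spirit of the val-rmse method; the exact weight definitions of each method are given in Chapter 3, and the object names here are hypothetical.

```r
# A minimal sketch of inverse-RMSE weighting of study-specific models.
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

# `models` is a list of models trained on the individual training datasets
val_errs <- sapply(models, function(m) rmse(val_y, predict(m, val_x)))
weights  <- (1 / val_errs) / sum(1 / val_errs)  # smaller RMSE -> larger weight

# Weighted combination of the study-specific predictions on the test set
pred_mat   <- sapply(models, function(m) predict(m, test_x))  # samples x models
final_pred <- drop(pred_mat %*% weights)
```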
4.2.6 Applications on body mass index (BMI) prediction from Type-2 diabetes (T2D)
metagenomic datasets
To demonstrate our findings with a realistic application, we conduct a BMI prediction analysis based on
four T2D metagenomic datasets. Previous studies have established a relationship between gut microbial
compositions and obesity [112], with overweight and obesity accounting for 44% of the cases of T2D [113].
In this study, we aim to predict BMI values using multiple gut microbiome abundance data from different
studies while incorporating the T2D condition as a covariate in the normalization process.
From the principal component analysis (PCA) plot (Figure 4.2) of the metagenomic microbial compositions of the four datasets, we observe marked differences in the compositions, particularly in the Sankaranarayanan dataset. To address these differences and normalize the abundance data, we employ either the
ComBat or ConQuR normalization method, while including the T2D condition as a covariate. The samples
are divided into three T2D conditions: T2D, IGT (impaired glucose tolerance), and healthy. We obtain
the microbial read count data from the curatedMetagenomicData R package [114] and transform it into log relative abundance data using the same procedures outlined in the data pre-processing for simulation studies (see section Scenario 1: Different background distributions
of genomic features in training populations). We then identify the common microbial OTUs from the four
datasets and utilize them as the features.
We implement a Leave-One-Dataset-Out (LODO) setting. For each dataset among the four datasets, we
treat it as a test dataset, and the other three datasets are used as training data. In the naive setting, the three
training datasets remain unchanged, while in the normalization setting, the test data is used as a reference
to normalize the three training datasets, including the sample T2D condition (T2D, IGT or healthy) as
a covariate. Both ComBat and ConQuR methods have the option to adjust the genomic features while
considering the effects of the covariate, along with the batch variable. This adjustment helps to remove
the influence of the covariate on genomic features and to isolate the batch-specific effects or other sources
of variation. Then, an RF regression model is trained on each of the three training datasets independently, and the integration methods are applied respectively. For the merging method, the three training datasets are pooled into one training dataset, and a single RF regression model is trained on it.
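A schematic R sketch of this LODO loop is shown below; train_rf and integrate_predictions are hypothetical helpers standing in for the model-training and integration steps described above.

```r
# A minimal sketch of the Leave-One-Dataset-Out (LODO) evaluation loop.
# `datasets` is a named list, each element holding a feature matrix x and a
# phenotype vector y; train_rf() and integrate_predictions() are hypothetical.
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

for (test_name in names(datasets)) {
  test  <- datasets[[test_name]]
  train <- datasets[names(datasets) != test_name]
  # Optionally normalize each training dataset against `test` (ComBat/ConQuR)
  models <- lapply(train, function(d) train_rf(d$x, d$y))  # one RF per study
  preds  <- integrate_predictions(models, test$x)          # weighted combination
  cat(test_name, "RMSE:", rmse(test$y, preds), "\n")
}
```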
Table 4.3: Four metagenomic datasets related to T2D.
Dataset Country No. of T2D No. of IGT No. of control Reference
Qin China 170 0 174 [106]
Karlsson Sweden 53 49 43 [107]
Li China 62 0 3 [29]
Sankaranarayanan USA (American Indian) 19 0 18 [108]
Figure 4.2: Metagenomic feature distributions from multiple T2D studies. Principal Component
Analysis (PCA) is conducted on four T2D metagenomic log relative abundance datasets, including T2D
samples, IGT samples, and healthy control samples. The analysis is performed on the intersection of
metagenomic features from the four datasets, which are later used for training machine learning models.
The ellipses in the PCA plot represent the 95% confidence level assuming a multivariate t-distribution.
4.2.7 Applications on survival prediction from ovarian cancer gene expression datasets
We further conduct a real data analysis on three ovarian cancer gene expression studies, as ovarian cancer
is the fifth leading cause of cancer deaths in women in the United States. Previous studies have identified
candidate gene markers of ovarian cancer by comparing the gene expression profiles in normal and ovarian cancer samples [115]. Additionally, patient survival time varies and depends on tumor stage, as well as
surgery and/or chemotherapy. Early-stage (stage I–II) ovarian cancer is highly curable, with a 5-year survival rate of 60–80%, while the 5-year survival rate for stage III/IV ovarian cancer patients is only 19–41%
[116, 117]. Debulking surgery and platinum-based chemotherapy are the main treatments for ovarian cancer, and both can influence survival, especially debulking surgery, which is known to improve the survival
of ovarian cancer patients [118].
In our study, we conduct survival prediction on gene expression samples of ovarian cancer, incorporating tumor stage and debulking information as covariates in the normalization process. We obtain three
ovarian cancer gene expression datasets from the curatedOvarianData R package [119], and detailed information on these datasets is provided in Table 4.4. To ensure accurate prediction, we use the intersection
of the genes in the three datasets as the common genes.
We observe substantial differences in gene expression levels among the three datasets, as shown in both
the PCA plot (Figure 4.3) and the boxplot of gene expression levels for each sample. This emphasizes the
importance of normalization in integrating these datasets. During normalization, we include tumor stage
and debulking information as covariates but exclude them when fitting the CoxBoost model for survival
prediction. For this survival analysis, we only use the ComBat normalization method, because ConQuR normalization is specifically designed for metagenomic read count data; ComBat, in contrast, effectively normalizes the gene expression data.
To evaluate the performance of our approach, we employ the Leave-One-Dataset-Out strategy on the
three ovarian cancer gene expression datasets, following the same approach described for the T2D datasets
(see section Applications on body mass index (BMI) prediction from Type-2 diabetes (T2D) metagenomic
datasets).
Table 4.4: Three gene expression datasets related to ovarian cancer.
Dataset Country No. of samples Source Reference
Bentink USA 126 E.MTAB.386 [109]
Ferriss USA 56 GSE30161 [110]
Yoshihara Japan 110 GSE17260 [111]
Figure 4.3: Genomic feature distributions from multiple ovarian cancer studies. A: Principal component analysis (PCA) of three ovarian cancer gene expression datasets. The analysis is performed on the intersection of gene features from the three datasets, which are later used for training machine learning models. The ellipses in the PCA plot represent the 95% confidence level assuming a multivariate t-distribution. B: Gene expression level comparisons for each individual in the three ovarian cancer datasets. The gene expression levels of all genes for each individual are summarized using a boxplot. This comparison is performed to investigate the expression profile similarity and heterogeneity among individuals across datasets.
4.2.8 Machine learning models used in phenotype predictions
RF [33] is a supervised learning algorithm capable of solving both regression and classification problems.
In our previous case-control study in Chapter 3, we employed RF classification to determine disease status, while in this study, we use RF regression to predict a quantitative phenotype for the majority of our
analysis. The RF algorithm is well known for reducing the overfitting problem with large numbers of predictors and has demonstrated good performance in predictions of both discrete and continuous data. In
our simulation study and real data application on BMI predictions, we train RF regression models. We use
the RF model from the R caret package [70] with 1,000 decision trees, as this number of trees was shown in our previous study in Chapter 3 [85] to be a reasonable choice with thousands of features in terms of both prediction accuracy and computational time; the 'mtry' parameter is tuned by 10-fold cross-validation. We
choose to use 'ranger' as the method for the train function, as it reduces the running time compared to the 'rf' method. We use the RMSE value between the predicted and the real quantitative
trait values as the performance evaluation metric for the RF regression model.
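A minimal R sketch of this training setup, assuming hypothetical placeholder data objects, is shown below.

```r
# A minimal sketch of the RF regression setup described above, using caret with
# the ranger backend; train_x/train_y and test_x/test_y are hypothetical.
library(caret)

ctrl <- trainControl(method = "cv", number = 10)  # 10-fold CV to tune mtry
rf_fit <- train(x = train_x, y = train_y,
                method = "ranger",
                num.trees = 1000,                 # passed through to ranger
                trControl = ctrl)

test_pred <- predict(rf_fit, newdata = test_x)
rmse_val  <- sqrt(mean((test_y - test_pred)^2))   # evaluation metric
```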
For the survival predictions on ovarian cancer data, we fit the Cox model by likelihood-based boosting (CoxBoost) [120] on the training data and predict on the test data. CoxBoost is a machine learning model that fits a Cox proportional hazards model [121] by component-wise likelihood-based boosting. It is suitable for fitting models with a large number of predictors, overcoming the limitations of the original Cox proportional hazards model. Previous studies on survival prediction have shown that CoxBoost outperforms other commonly used survival analysis models [122, 123], including the Cox proportional hazards model and Random Survival Forests [124]. It is a preferred machine learning algorithm for analyzing high-dimensional, censored data [125]. Therefore, in our application on survival predictions, we choose to use
the CoxBoost model with the Concordance Index (C-index) [126] as the performance evaluation metric.
The C-index is a statistical measurement widely used in survival analysis to evaluate the predictive power
of a model. In survival predictions, a pair of patients is considered concordant if the predicted risk of death
by a model is lower for the patient who dies at a later time point. The C-index represents the proportion
of concordant pairs among all pairs of subjects. The C-index ranges from 0.5 to 1, with values closer to 1
indicating better prediction accuracy.
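Below is a minimal R sketch of this survival-prediction step, assuming the CoxBoost and survival packages; data objects are hypothetical placeholders, and the calls should be verified against the installed package versions.

```r
# A minimal sketch of CoxBoost training and C-index evaluation.
library(CoxBoost)
library(survival)

fit <- CoxBoost(time = train_time, status = train_status, x = train_x)

# Linear predictor (risk score) for the test samples
lp <- as.numeric(predict(fit, newdata = test_x, type = "lp"))

# C-index: proportion of concordant pairs (higher risk should mean earlier death)
cidx <- concordance(Surv(test_time, test_status) ~ lp, reverse = TRUE)$concordance
```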
4.3 Results
4.3.1 Batch normalization boosts prediction performance for homogeneous populations with batch effects but falls short for diverse datasets and different phenotype-associated feature models in quantitative phenotype predictions
In our simulation studies, we examine the impact of heterogeneity on prediction performance using three
different scenarios and four different types of quantitative phenotype. We compare the prediction performances of the naive and normalization settings. The results from these three simulation scenarios are
presented in Figure 4.4, Figure 4.5 and Figure 4.6.
In Scenario 1, we assess the effect of different background distributions of genomic features in the
training samples on prediction performance. The results are presented in Figure 4.4. Neither ComBat nor ConQuR improves prediction performance consistently. With the Hannigan dataset as the training background and the Feng dataset as the test background, these methods increase the RMSE, resulting in inferior prediction performance. On the other hand, with the Yu dataset as the training background and the Feng dataset as the test background, both normalization methods decrease the RMSE, resulting in superior prediction performance. When we mix the two training datasets, with an α fraction coming from the Hannigan dataset and a 1 − α fraction coming from the Yu dataset, and use the mixture as training data, these normalization methods change from beneficial to detrimental as α increases. In terms of the relative performance of the different integration methods, we observe that val_rmse and LOSO_rmse perform slightly better than the other integration methods, especially at higher values of α.
Figure 4.5 shows the results from Scenario 2, which analyzes how batch effects can impact model prediction performance when integrating different training datasets. As expected, we observe an increasing trend in RMSE values for all methods as the batch effects on the two training batches increase. Interestingly, we find that batch effects on the mean OTU abundance have a greater impact on model prediction performance than those on the variance of OTU abundance. This finding is consistent with our previous case-control study in Chapter 3. Moreover, we observe improvements in prediction performance with all methods applied in the two normalization settings, which is expected since both ComBat and ConQuR are designed to remove batch effects.
Figure 4.6 displays results from our third simulation scenario, where we examine the effects of divergent phenotype-associated feature models in the training and test data. A decreased overlap in features, especially with as few as 2 OTUs in common, often leads to degraded predictions. We also observe that batch normalization may not enhance machine learning prediction performance when the phenotype-associated feature models of the training and test data diverge substantially.
In summary, our simulation studies indicate that batch normalization enhances machine learning
model predictions when training datasets come from the same population but differ due to batch effects.
Yet, when these datasets are sourced from distinct populations, with either varying genomic feature distributions or differing phenotype-associated feature models, batch normalization offers no discernible advantage.
4.3.2 BMI predictions on T2D metagenomic datasets confirm the importance of batch normalization combined with integration methods when training and test data are similar and have sizeable samples
We analyze four metagenomic datasets related to T2D as summarized in Table 4.3, using various methods.
From the principal component analysis (PCA) plot in Figure 4.2, it is evident that there are differences
in metagenomic feature distributions among the four datasets. Notably, the Sankaranarayanan dataset is
markedly different from the other three datasets, a difference further accentuated by its smaller sample size (Table 4.3), resulting in much larger RMSE values when conducting cross-dataset BMI prediction (Figure 4.7).
Figure 4.4: Cross-study quantitative phenotype prediction comparisons by different methods under naive and batch normalization settings when the training and test populations have different feature distributions. The figures display the median RMSE values of RF regression predictions using various integration methods with different simulated quantitative phenotypes. The columns represent different values of α. The methods without a suffix are carried out in the naive setting, while the methods with a suffix of "ComBat" or "ConQuR" are carried out in the ComBat or ConQuR normalization setting, respectively. The experiments are repeated 100 times, and the figures show the median RMSE values from the 100 trials.
The results in Figure 4.7 illustrate the average RMSE values using one dataset as test data while integrating the other three datasets as training data. For instance, when using the Qin dataset as test data, we observe lower RMSE values for all seven integration methods in the ComBat normalization setting than in the naive setting, with six methods showing significant improvements by the Mann–Whitney U test. For the Karlsson test data, we observe lower RMSE values for five out of seven integration methods in both the ComBat and ConQuR settings. Additionally, the RMSE values for the merging method decrease substantially as well. Using the Li dataset as test data, we observe lower RMSE values in six out of seven integration methods with the ComBat
Figure 4.5: Batch normalization increases cross-study quantitative phenotype prediction performance when the studies have batch effects. The figures display the median RMSE values of RF regression predictions using different integration methods with various batch severity levels. Columns represent different batch effect factors on the mean and variance of microbial abundance. "m": batch effect factor adjusting the mean. "v": batch effect factor adjusting the variance. The methods without a suffix are carried out in the naive setting, while the methods with a suffix of "ComBat" or "ConQuR" are carried out in the ComBat or ConQuR normalization setting, respectively. The experiments are repeated 100 times, and the figures show the median RMSE values from the 100 trials.
setting and in all seven methods with the ConQuR setting. Moreover, the merging method also shows much lower RMSE values in the two normalization settings. When using the Sankaranarayanan test data, we notice that three out of seven integration methods perform better in the ConQuR setting but show no improvement in the ComBat setting, probably due to the large differences between the metagenomic feature distribution of the Sankaranarayanan dataset and those of the other datasets, as shown in Figure 4.2, as well as its having the smallest number of individuals.
Figure 4.6: Batch normalization does not improve cross-study prediction when the training and test data have different quantitative-phenotype-associated feature models. The figures display the median RMSE values of RF regression predictions using different integration methods with various numbers of overlapping phenotype-associated OTUs. Columns represent different numbers of overlapping disease-associated OTUs in the training and test data; the larger the number, the more similar the two disease models are. When the number reaches 10, the models in the training and test data are the same. The methods without a suffix are carried out in the naive setting, while the methods with a suffix of "ComBat" or "ConQuR" are carried out in the ComBat or ConQuR normalization setting, respectively. The experiments are repeated 100 times, and the figures show the median RMSE values from the 100 trials.
There is also a noticeable trend in prediction accuracy correlating with the overlap between training and test datasets. For instance, when testing on the Karlsson dataset and training on the other three, RMSE values are lowest for the Li dataset, intermediate for the Qin dataset, and highest for the Sankaranarayanan dataset, indicating a pattern consistent with the overlap observed in Figure 4.2. Moreover, our results suggest that normalization methods are particularly effective when datasets exhibit a high overlap
and both have sizeable samples (Figure 4.7A, B and C). However, normalization can compromise prediction
accuracy when datasets are considerably distinct or when sample sizes are relatively small (Figure 4.7D).
When comparing the different integration methods, we observe no significant differences in prediction performance among them. Nevertheless, the "LOSO_rmse" and "val_rmse" methods yield marginally superior outcomes.
In conclusion, the two batch normalization techniques demonstrate efficacy in predicting BMI based on T2D metagenomic data. These findings corroborate our simulation studies, underscoring the crucial role of batch normalization when integrating machine learning models trained across diverse studies.
4.3.3 Survival prediction of ovarian cancer gene expression data can be boosted by
batch normalization with substantial sample size
To assess the prediction performance of merging and integration methods on real datasets, we conduct a
survival analysis on three ovarian cancer gene expression datasets, as summarized in Table 4.4.
In the initial analysis of ovarian cancer survival under the naive setting (Figure 4.8), most of the applied methods result in a C-index around or below 0.5, indicating performance close to a random-guess model. However, after applying ComBat normalization and training CoxBoost models on the Bentink and Ferriss datasets, we observe substantial improvements in survival predictions, with all seven integration methods producing a C-index close to 0.6 when using Yoshihara as the test data. With the Bentink dataset as test data, normalization leads to slight C-index enhancements for the merging method and five other integration methods. In contrast, only the merging method demonstrates a minor improvement after normalization when the Ferriss dataset serves as the test data. A key takeaway from these observations is that normalization proves beneficial for datasets with a substantial sample size, enabling integration methods to produce meaningful predictions (Figure 4.8A and B). However, caution is needed with smaller datasets.
Figure 4.7: Realistic applications of merging and integration methods on BMI predictions using
four T2D metagenomic datasets. The real data analysis is conducted using a Leave-One-Dataset-Out
setting. Each sub-figure shows the prediction results on the corresponding test dataset, while the other
three datasets are used as training data. The bars in each sub-figure represent different integration methods, with each color representing a different method group, and the three same-colored bars representing
the three different settings: Naive, ComBat normalization, and ConQuR normalization. The median RMSE
values are shown for each test dataset, and all experiments are repeated 30 times to account for randomness in the RF regression models and the different validation sets used in the "val_rmse" method. Asterisks show the significance level of the RMSE comparisons between the normalization settings and the naive setting by Mann–Whitney U tests.
For instance, the Ferriss dataset, which has a notably smaller sample size than the other two (Table 4.4), exhibits reduced prediction accuracy post-normalization (Figure 4.8C).
These results underscore the importance of batch normalization, and we recommend applying normalization before conducting any complex survival analysis. Our findings also highlight the potential benefits
of incorporating batch correction methods into the analysis pipeline to improve survival prediction performance on real datasets.
Figure 4.8: Realistic applications of merging and integration methods on survival predictions
using three ovarian cancer gene expression datasets. Each sub-figure shows the prediction results
on the corresponding test dataset, while the other two datasets are used as training data. The bars in each
sub-figure represent different integration methods, with each color representing a different method group,
and the two same-colored bars representing the two different settings: Naive and ComBat normalization.
The median C-index values are shown for each test dataset, and all experiments are repeated 10 times with different validation sets used in the "val_cindex" method.
4.3.4 ComBat and ConQuR normalization methods show similar performance in removing heterogeneity effects on machine learning model prediction performance
In our simulation studies and real data analysis on BMI predictions, we evaluate the performances of
ComBat and ConQuR normalization methods combined with merging or integration methods. Across all
three simulation scenarios (Figure 4.4, Figure 4.5, and Figure 4.6) and all types of quantitative phenotype,
we observe similar performances by ComBat and ConQuR in terms of removing heterogeneity effects
when combined with different merging and integration methods.
In the analysis of BMI prediction using T2D metagenomic abundance data, we observe that ComBat and ConQuR perform similarly when combined with the seven integration methods, and both are better than no normalization in most cases. We further conduct Mann–Whitney U tests on the RMSE values from the ComBat and ConQuR settings for all methods. Among those comparisons, we do not find a significant difference between ComBat and ConQuR in terms of predictions, whether using a single training dataset or the seven integration methods.
Overall, our results from both simulation studies and real data analysis suggest that both the ComBat and ConQuR normalization methods are useful for removing heterogeneity effects when integrating multiple studies, and the prediction performance of machine learning models can be improved by applying either normalization method.
4.3.5 The merging-based method shows no advantage over weighting-based integration methods in the context of quantitative phenotype predictions
The merging method, which involves combining all datasets into one and treating all samples as if they
are from the same study, is a commonly used method for cross-study predictions. Previous studies have
shown that the merging method can lead to better prediction performance than using individual studies alone, particularly with increased sample size and diversity in the study populations [40, 41, 45, 99]. In
our previous case-control study on heterogeneity in Chapter 3, we found that the merging method combined with batch normalization was very powerful for qualitative phenotype prediction when integrating
multiple heterogeneous studies, compared with other integration methods. However, in this study, we
do not observe any advantage of the merging method in any of the settings compared to the integration
methods in both simulations and realistic applications.
In simulations, we observe that the merging method performs similarly to or slightly worse than the integration methods in Scenarios 1 and 3 (Figure 4.4 and Figure 4.6). In Scenario 2, the merging method performs slightly better than the other integration methods (Figure 4.5). In the two applications on metagenomic and gene expression datasets, the merging method performs worse than the other integration methods, as evidenced by the higher RMSE values in Figure 4.7 and lower C-index values in Figure 4.8. These findings are not consistent with our previous findings in qualitative phenotype predictions, suggesting that the merging method may not be suitable for quantitative phenotype prediction.
Therefore, other integration methods should be taken into consideration when dealing with quantitative
phenotypes.
4.4 Discussion
Integrating multiple studies from diverse sources can be challenging due to the heterogeneity in cross-study phenotype predictions. While previous research has proposed several statistical methods to mitigate
heterogeneity effects for qualitative phenotype predictions [47, 79, 84, 99, 104], the effects of heterogeneity on quantitative phenotype predictions have not been well addressed. Therefore, this study aims to
comprehensively analyze different methods for integrating heterogeneous datasets and their impact on
quantitative phenotype prediction. We focus on three sources of heterogeneity, including population differences, batch effects, and different phenotype-associated feature models, and develop a workflow for
simulating datasets with these three sources of heterogeneity. In addition, we evaluate the prediction performance of different integration methods, with particular interest in quantitative phenotypes. Apart from
comparing different integration methods, we explore the potential of combining normalization methods
with integration methods in the context of quantitative phenotype predictions. Our findings are demonstrated through both simulation studies and realistic applications.
In the first part of our study, we simulate pseudo microbial abundance and four different types of
pseudo quantitative phenotypes. Our results, as illustrated by the RMSE values in Figure 4.4, Figure 4.5,
and Figure 4.6, show that normalization settings can decrease RMSEs for the second scenario of batch effects with the same background feature distributions for the training and test data. However, for the other
two types of heterogeneity from the first and third scenarios, normalization seems less effective. This
emphasizes the imperative of addressing such heterogeneity in cross-study phenotype predictions, given
its impact on machine learning model reproducibility. Employing normalization in tandem with integration techniques is essential for enhancing cross-study predictive accuracy for quantitative phenotypes,
especially in the face of batch effect-induced heterogeneity.
In the latter segment of our study, we delve into real data applications using four T2D metagenomic
datasets. Our analysis indicates that normalization methods shine when datasets have a large overlap and
boast ample samples, as evidenced in Figure 4.7A, B, and C. Conversely, normalization could impede prediction precision when datasets diverge largely or have limited sample sizes (Figure 4.7D). Subsequently,
we analyze survival prediction for ovarian cancer gene expression data, employing various machine learning models. As Figure 4.8 illustrates, batch normalization, paired with integration techniques, effectively
counteracts heterogeneity impacts on model reproducibility, especially when datasets are sizeable. This
holds true irrespective of the omics data type, chosen machine learning model, or evaluation metric. While
the enhancements in real-world applications are not as pronounced as in our simulations, we believe this
stems from the heightened heterogeneity encountered during real dataset integration. This is evident in
the PCA plots (Figure 4.2 and Figure 4.3). Still, our findings underscore the significance of normalization
in elevating the predictive prowess of quantitative phenotypes, even in real-world scenarios.
Our study evaluates the effectiveness of ComBat [43] and ConQuR [105] normalization methods in
mitigating heterogeneity effects on quantitative phenotype predictions when integrating multiple datasets.
We previously demonstrated the potential of ComBat in mitigating different sources of heterogeneity in
a case-control study in Chapter 3, and we now show that ComBat works well in mitigating batch effects
from similar populations in the context of quantitative phenotypes as well. Additionally, we observe that
ConQuR, a recently developed method, is also effective in removing batch effects in similar populations,
from the results of our simulations. Our findings suggest that ConQuR is a useful tool in metagenomic
analysis for removing heterogeneity.
The merging method is a popular approach for cross-study predictions and has been demonstrated
to improve prediction performance in disease status predictions by increasing sample size and diversity
in the study population [40, 41, 45, 99]. In the previous study from Chapter 3, we showed the outstanding case-control prediction performance of the merging method compared to other integration methods.
However, our current study reveals a different outcome for quantitative phenotype prediction, where we
observe similar or even worse performance in the merging method compared to integration methods in
both simulation studies and real data applications. This observation can potentially be explained as follows. Ensemble-based integration methods are generally more robust to noise and outliers caused by
heterogeneities than merging methods, since they can average out the effects of individual noisy or outlier
models, resulting in more stable and reliable predictions. Consequently, in quantitative phenotype predictions, we observe slightly inferior performance of the merging method compared to the integration methods. These results suggest that the merging method may not be optimal for quantitative phenotype prediction, and alternative integration methods should be considered. Furthermore, in comparing all seven integration methods in both simulations and real data applications, we do not observe any significantly better method based on Kruskal–Wallis tests. Therefore, future work may focus on exploring
alternative integration methods that demonstrate advantages over existing methods.
To the best of our knowledge, this study represents the first investigation into the effects of heterogeneity on cross-study predictions with quantitative phenotypes. Our findings emphasize the importance
of addressing heterogeneity in quantitative phenotype predictions when incorporating multiple studies,
similar to what has been demonstrated in disease status predictions. We hope that our findings will inspire
further research in the field of quantitative phenotype prediction and encourage the development of more
effective strategies for mitigating heterogeneity effects on machine learning model reproducibility. This
will provide valuable insights into the complex interplay between the microbiome and human health, and
ultimately lead to better understanding and treatment of diseases.
Chapter 5
Conclusions and future work
5.1 Conclusions
In this dissertation, we explore methods to enhance phenotype predictions through the integrative analysis of multiple heterogeneous microbiome studies. Initially, we undertook cross-cohort CRC status predictions using six public metagenomic datasets, aiming to boost prediction performance through various
techniques. Drawing inspiration from these initial findings, we delved deeper into the effects of cross-cohort heterogeneity on case/control prediction performance. This led us to explore diverse normalization
and integration methods, which we tested in both simulated and real-world scenarios. Furthermore, we
shifted our focus to quantitative phenotype predictions, conducting integrative analyses using methodologies akin to our earlier case/control studies. This yielded novel insights that are both valuable and
actionable. Our research sheds light on how study heterogeneity influences phenotype predictions, offering promising avenues for a wide array of future applications. We next discuss the implications of each
project in detail.
5.1.1 Increasing prediction performance of CRC disease status
In our comprehensive analysis of CRC-microbiome predictive capabilities across six metagenomic datasets
from Chapter 2, we made several significant observations. One of the standout findings was the consistent
superiority of Centrifuge in deriving microbial relative abundance profiles. This tool outperformed other
metagenomic taxonomic profiling tools in prediction AUCs. A possible reason for Centrifuge’s enhanced
performance in predicting CRC disease status might be its capability to align sequences against all reference
genomes, even those with low abundance.
Our experiments with machine learning algorithms also yielded noteworthy results. The random
forests algorithm, particularly when augmented with a larger number of decision trees, demonstrated robustness in classification tasks. It surpassed the LASSO classifier in performance and matched the capabilities of SVM across all experimental settings. This suggests that the observed variations in AUC
scores across different datasets aren’t solely dependent on the classifier in use.
However, our study wasn't without challenges. We observed that predictions derived from cross-dataset analyses inherently exhibited lower accuracy compared to those from within-dataset analyses.
This discrepancy can be attributed to the inherent heterogeneity across datasets, often referred to as the
batch effect. While we did explore potential solutions like the ComBat function to counteract this effect,
the results didn’t show a marked improvement in cross-dataset AUC scores. This challenge of dataset
heterogeneity and the quest for improved normalization across datasets is a promising avenue for future
research.
Lastly, our exploration into the realm of feature combination for predictive modeling revealed the
potential of the LOSO model stacking method. This approach consistently outperformed the traditional
feature stacking method in both AUC and AUPRC scores across various settings. Moreover, the LOSO
model stacking method offers computational advantages, making it a more efficient choice for handling
larger feature dimensions. This insight could be pivotal for future research endeavors aiming to optimize
microbiome-based disease prediction.
5.1.2 Phenotype prediction by integrating multiple heterogeneous studies
Addressing heterogeneity in case/control phenotype predictions
In Chapter 3, building on the challenges identified in our initial project from Chapter 2 regarding reduced
prediction performances in cross-cohort studies, we delved deeper into the integration of heterogeneous
omic studies, specifically for case/control phenotype prediction. Recognizing the imperative of addressing
heterogeneity for robust phenotype predictions, we crafted a comprehensive workflow. This workflow
not only simulates various heterogeneity types but also evaluates them, emphasizing the importance of
normalization, particularly using the ComBat method, in enhancing prediction accuracy.
Applying our methodologies to six colorectal cancer metagenomic studies and six tuberculosis gene
expression studies, we highlighted the pronounced improvement in prediction performance when merging strategies are combined with normalization and ensemble weighted learning methods. Furthermore,
our research unveiled the promise of rank aggregation methods. These methods, showcasing robustness
akin to ensemble weighted learning approaches, present an alternative avenue for enhancing prediction
outcomes.
In summary, our findings from Chapter 3 provide a roadmap for navigating the complexities of omics
data heterogeneity. By offering tangible strategies and methods, our research paves the way for more
consistent and precise case/control phenotype predictions.
Navigating quantitative phenotype predictions
In Chapter 4, we expanded our simulation workflow to encompass quantitative phenotype predictions.
Our findings highlighted the selective effectiveness of normalization, particularly against batch effects, emphasizing the importance of addressing heterogeneity for consistent machine learning predictions across
studies.
Real-world data reinforced these insights. Normalization was most effective in datasets with substantial overlap and samples, but less so in more diverse or smaller datasets. The benefits of normalization,
while clear in simulations, were subtler in real-world scenarios, likely due to the complexities of integrating
actual datasets.
We introduced the ConQuR normalization method alongside the established ComBat. Both methods
effectively mitigated heterogeneity, with ComBat proving its worth in quantitative phenotypes and ConQuR emerging as a promising tool for metagenomic analyses.
Our study also revealed a shift in the performance of the merging method. While previously effective
for case/control predictions, it was outperformed by ensemble-based integration methods in quantitative
phenotype predictions. This suggests the need for a diversified approach, considering alternative integration methods for optimal results.
Our work in Chapter 3 and Chapter 4 represents a comprehensive journey through the intricate landscape of omics data integration, spanning both case/control and quantitative phenotype predictions. Through
methodical exploration and rigorous application, we’ve underscored the paramount importance of addressing heterogeneity to ensure robust and reproducible predictions. While Chapter 3 provided a strategic
roadmap for navigating the challenges of case/control phenotype predictions, Chapter 4 expanded the
horizon, delving into the nuances of quantitative phenotypes and revealing the selective effectiveness of
various methods. Together, these chapters not only highlight the complexities inherent in cross-study predictions but also offer tangible strategies to navigate them. As we conclude this exploration, it’s evident
that the future of omics data integration lies in a flexible, nuanced approach, continually adapting to the
evolving challenges and intricacies of the field.
5.2 Future work
In Chapter 3 and Chapter 4, we delved into various normalization and integration techniques aimed at
counteracting the effects of study heterogeneity on phenotype prediction reproducibility. Our comprehensive simulations and real-world applications revealed enhancements when these methods were applied.
However, no single method consistently outperformed the others in either the simulations or real-world
scenarios. Moreover, the improvements were less pronounced when integration methods were applied
without prior normalization.
A promising avenue for future research is the adaptation of methods from the polygenic risk score
(PRS) domain in genome-wide association studies (GWAS). PRS has showcased its immense potential in biomedical research, especially in identifying individuals at high risk for various diseases based
on their genotypes. Yet, the broader application of PRS across the general populace is constrained due
to the limited transferability of PRS models developed for Europeans to non-European populations [127].
Consequently, several strategies have been proposed to enhance the universality of PRS. For instance, one
approach involves estimating effect sizes individually for each population and subsequently deriving a linear combination of these sizes from a validation dataset of the target population [128]. Other studies have
jointly modeled GWAS summary statistics from multiple populations, operating under the assumption that
causal variants are predominantly shared across populations [129, 130, 131, 132].
A recent innovation in this field is SDPRX [127], a method designed to amalgamate GWAS summary
statistics and linkage disequilibrium (LD) matrices from two distinct populations, utilizing a hierarchical
Bayesian model. This model is bifurcated into two primary components: the likelihood and the prior.
The likelihood bridges the gap between the marginal effect sizes in GWAS summary statistics from the
two populations and the actual effect sizes via a multivariate normal distribution, factoring in LD. The
prior, on the other hand, delineates the genetic architecture of a trait across two populations using a blend
of four distinct components: effect sizes of one single nucleotide polymorphism (SNP) are zero in both
populations, effect sizes of one SNP are non-zero in population 1 only, effect sizes of one SNP are non-zero
in population 2 only, and effect sizes of one SNP are non-zero and correlated in both populations. SDPRX
seamlessly integrates GWAS summary statistics and LD matrices from the two populations, leveraging the
likelihood function and characterizing the joint distribution of the SNP effect sizes using the prior function.
Post model fitting via Markov chain Monte Carlo (MCMC), SDPRX yields adjusted effect sizes apt for PRS
computation.
Drawing inspiration from SDPRX’s approach to addressing generalizability across populations, we
envision proposing a method akin to SDPRX in the future. In our context, GWAS statistics could be substituted with microbial relative abundance profiles derived from our metagenomic datasets, and LD matrices
could be replaced with microbial OTU correlation matrices. Subsequently, a logistic regression model could
be fitted to each microbial OTU under the aforementioned four scenarios to ascertain the effect size for
each OTU. Utilizing MCMC, we can then produce adjusted effect sizes for each microbial OTU, which can
then be fed into machine learning classifiers to predict the phenotype.
Bibliography
[1] Grace A. Ogunrinola, John O. Oyewale, Oyewumi O. Oshamika, and Grace I. Olasehinde. “The
Human Microbiome and Its Impacts on Health”. In: International Journal of Microbiology 2020
(June 2020), pp. 1–7. doi: 10.1155/2020/8045646.
[2] Krzysztof Skowron, Justyna Bauza-Kaszewska, Zuzanna Kraszewska,
Natalia Wiktorczyk-Kapischke, Katarzyna Grudlewska-Buda, Joanna Kwiecińska-Piróg,
Ewa Wałecka-Zacharska, Laura Radtke, and Eugenia Gospodarek-Komkowska. “Human Skin
Microbiome: Impact of Intrinsic and Extrinsic Factors on Skin Microbiota”. In: Microorganisms 9.3
(Mar. 2021), p. 543. doi: 10.3390/microorganisms9030543.
[3] Sharon M. Donovan. “Introduction to the special focus issue on the impact of diet on gut
microbiota composition and function and future opportunities for nutritional modulation of the
gut microbiome to improve human health”. In: Gut Microbes 8.2 (Feb. 2017), pp. 75–81. doi:
10.1080/19490976.2017.1299309.
[4] Hourieh Sadrekarimi, Zhanna R. Gardanova, Morteza Bakhshesh, Farnoosh Ebrahimzadeh,
Amirhossein Fakhre Yaseri, Lakshmi Thangavelu, Zahra Hasanpoor, Firoozeh Abolhasani Zadeh,
and Mohammad Saeed Kahrizi. “Emerging role of human microbiome in cancer development and
response to therapy: special focus on intestinal microflora”. In: Journal of Translational Medicine
20.1 (July 2022). doi: 10.1186/s12967-022-03492-7.
[5] “A framework for human microbiome research”. In: Nature 486.7402 (June 2012), pp. 215–221.
doi: 10.1038/nature11209.
[6] Dibyendu Dutta and Seah H. Lim. “Bidirectional interaction between intestinal microbiome and
cancer: opportunities for therapeutic interventions”. In: Biomarker Research 8.1 (Aug. 2020). doi:
10.1186/s40364-020-00211-6.
[7] Ozioma S. Chioma, Laura E. Hesse, Austin Chapman, and Wonder P. Drake. “Role of the
Microbiome in Interstitial Lung Diseases”. In: Frontiers in Medicine 8 (Jan. 2021). doi:
10.3389/fmed.2021.595522.
[8] Ansel Hsiao and Jun Zhu. “Pathogenicity and virulence regulation of Vibrio cholerae at the
interface of host-gut microbiome interactions”. In: Virulence 11.1 (Nov. 2020), pp. 1582–1599. doi:
10.1080/21505594.2020.1845039.
[9] Andreas Hiergeist, Joachim Gläsner, Udo Reischl, and André Gessner. “Analyses of Intestinal
Microbiota: Culture versus Sequencing”. In: ILAR Journal 56.2 (Aug. 2015), pp. 228–240. doi:
10.1093/ilar/ilv017.
[10] F. Sanger, A.R. Coulson, B.G. Barrell, A.J.H. Smith, and B.A. Roe. “Cloning in single-stranded
bacteriophage as an aid to rapid DNA sequencing”. In: Journal of Molecular Biology 143.2 (Oct.
1980), pp. 161–178. doi: 10.1016/0022-2836(80)90196-5.
[11] Daehwan Kim, Li Song, Florian P. Breitwieser, and Steven L. Salzberg. “Centrifuge: rapid and
sensitive classification of metagenomic sequences”. In: Genome Research 26.12 (Oct. 2016),
pp. 1721–1729. doi: 10.1101/gr.210641.116.
[12] Nicola Segata, Levi Waldron, Annalisa Ballarini, Vagheesh Narasimhan, Olivier Jousson, and
Curtis Huttenhower. “Metagenomic microbial community profiling using unique clade-specific
marker genes”. In: Nature Methods 9.8 (June 2012), pp. 811–814. doi: 10.1038/nmeth.2066.
[13] Derrick E Wood and Steven L Salzberg. “Kraken: ultrafast metagenomic sequence classification
using exact alignments”. In: Genome Biology 15.3 (Mar. 2014). doi: 10.1186/gb-2014-15-3-r46.
[14] Jennifer Lu, Florian P. Breitwieser, Peter Thielen, and Steven L. Salzberg. “Bracken: estimating
species abundance in metagenomics data”. In: PeerJ Computer Science 3 (Jan. 2017), e104. doi:
10.7717/peerj-cs.104.
[15] Stephen Nayfach, Beltran Rodriguez-Mueller, Nandita Garud, and Katherine S. Pollard. “An
integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial
transmission and biogeography”. In: Genome Research 26.11 (Oct. 2016), pp. 1612–1625. doi:
10.1101/gr.201863.115.
[16] Shinichi Sunagawa, Daniel R Mende, Georg Zeller, Fernando Izquierdo-Carrasco,
Simon A Berger, Jens Roat Kultima, Luis Pedro Coelho, Manimozhiyan Arumugam, Julien Tap,
Henrik Bjørn Nielsen, Simon Rasmussen, Søren Brunak, Oluf Pedersen, Francisco Guarner,
Willem M de Vos, Jun Wang, Junhua Li, Joël Doré, S Dusko Ehrlich, Alexandros Stamatakis, and
Peer Bork. “Metagenomic species profiling using universal phylogenetic marker genes”. In:
Nature Methods 10.12 (Oct. 2013), pp. 1196–1199. doi: 10.1038/nmeth.2693.
[17] Alexandre Almeida, Stephen Nayfach, Miguel Boland, Francesco Strozzi, Martin Beracochea,
Zhou Jason Shi, Katherine S. Pollard, Ekaterina Sakharova, Donovan H. Parks, Philip Hugenholtz,
Nicola Segata, Nikos C. Kyrpides, and Robert D. Finn. “A unified catalog of 204,938 reference
genomes from the human gut microbiome”. In: Nature Biotechnology 39.1 (July 2020), pp. 105–114.
doi: 10.1038/s41587-020-0603-3.
[18] Stephen Nayfach, Zhou Jason Shi, Rekha Seshadri, Katherine S. Pollard, and Nikos C. Kyrpides.
“New insights from uncultivated genomes of the global human gut microbiome”. In: Nature
568.7753 (Mar. 2019), pp. 505–510. doi: 10.1038/s41586-019-1058-x.
[19] Courtney R. Armour, Stephen Nayfach, Katherine S. Pollard, and Thomas J. Sharpton. “A
Metagenomic Meta-analysis Reveals Functional Signatures of Health and Disease in the Human
Gut Microbiome”. In: mSystems 4.4 (Aug. 2019). Ed. by Nicola Segata. doi:
10.1128/msystems.00332-18.
[20] Andrew B. Shreiner, John Y. Kao, and Vincent B. Young. “The gut microbiome in health and in
disease”. In: Current Opinion in Gastroenterology 31.1 (Jan. 2015), pp. 69–75. doi:
10.1097/mog.0000000000000139.
[21] Fredrik H. Karlsson, Valentina Tremaroli, Intawat Nookaew, Göran Bergström, Carl Johan Behre,
Björn Fagerberg, Jens Nielsen, and Fredrik Bäckhed. “Gut metagenome in European women with
normal, impaired and diabetic glucose control”. In: Nature 498.7452 (May 2013), pp. 99–103. doi:
10.1038/nature12198.
[22] Junjie Qin, Yingrui Li, Zhiming Cai, Shenghui Li, Jianfeng Zhu, Fan Zhang, Suisha Liang,
Wenwei Zhang, Yuanlin Guan, Dongqian Shen, Yangqing Peng, Dongya Zhang, Zhuye Jie,
Wenxian Wu, Youwen Qin, Wenbin Xue, Junhua Li, Lingchuan Han, Donghui Lu, Peixian Wu,
Yali Dai, Xiaojuan Sun, Zesong Li, Aifa Tang, Shilong Zhong, Xiaoping Li, Weineng Chen,
Ran Xu, Mingbang Wang, Qiang Feng, Meihua Gong, Jing Yu, Yanyan Zhang, Ming Zhang,
Torben Hansen, Gaston Sanchez, Jeroen Raes, Gwen Falony, Shujiro Okuda, Mathieu Almeida,
Emmanuelle LeChatelier, Pierre Renault, Nicolas Pons, Jean-Michel Batto, Zhaoxi Zhang,
Hua Chen, Ruifu Yang, Weimou Zheng, Songgang Li, Huanming Yang, Jian Wang,
S. Dusko Ehrlich, Rasmus Nielsen, Oluf Pedersen, Karsten Kristiansen, and Jun Wang. “A
metagenome-wide association study of gut microbiota in type 2 diabetes”. In: Nature 490.7418
(Sept. 2012), pp. 55–60. doi: 10.1038/nature11450.
[23] Dirk Gevers, Subra Kugathasan, Lee A. Denson, Yoshiki Vázquez-Baeza, Will Van Treuren,
Boyu Ren, Emma Schwager, Dan Knights, Se Jin Song, Moran Yassour, Xochitl C. Morgan,
Aleksandar D. Kostic, Chengwei Luo, Antonio González, Daniel McDonald, Yael Haberman,
Thomas Walters, Susan Baker, Joel Rosh, Michael Stephens, Melvin Heyman, James Markowitz,
Robert Baldassano, Anne Griffiths, Francisco Sylvester, David Mack, Sandra Kim,
Wallace Crandall, Jeffrey Hyams, Curtis Huttenhower, Rob Knight, and Ramnik J. Xavier. “The
Treatment-Naive Microbiome in New-Onset Crohn’s Disease”. In: Cell Host & Microbe 15.3 (Mar.
2014), pp. 382–392. doi: 10.1016/j.chom.2014.02.005.
[24] Yael Haberman, Timothy L. Tickle, Phillip J. Dexheimer, Mi-Ok Kim, Dora Tang, Rebekah Karns,
Robert N. Baldassano, Joshua D. Noe, Joel Rosh, James Markowitz, Melvin B. Heyman,
Anne M. Griffiths, Wallace V. Crandall, David R. Mack, Susan S. Baker, Curtis Huttenhower,
David J. Keljo, Jeffrey S. Hyams, Subra Kugathasan, Thomas D. Walters, Bruce Aronow,
Ramnik J. Xavier, Dirk Gevers, and Lee A. Denson. “Pediatric Crohn disease patients exhibit
specific ileal transcriptome and microbiome signature”. In: Journal of Clinical Investigation 125.3
(Mar. 2015), pp. 1363–1363. doi: 10.1172/jci79657.
[25] Georg Zeller, Julien Tap, Anita Y Voigt, Shinichi Sunagawa, Jens Roat Kultima, Paul I Costea,
Aurélien Amiot, Jürgen Böhm, Francesco Brunetti, Nina Habermann, Rajna Hercog, Moritz Koch,
Alain Luciani, Daniel R Mende, Martin A Schneider, Petra Schrotz-King, Christophe Tournigand,
Jeanne Tran Van Nhieu, Takuji Yamada, Jürgen Zimmermann, Vladimir Benes, Matthias Kloor,
Cornelia M Ulrich, Magnus von Knebel Doeberitz, Iradj Sobhani, and Peer Bork. “Potential of
fecal microbiota for early-stage detection of colorectal cancer”. In: Molecular Systems Biology
10.11 (Nov. 2014). doi: 10.15252/msb.20145645.
[26] Qing Wang, Jianzhong Ye, Daiqiong Fang, Longxian Lv, Wenrui Wu, Ding Shi, Yating Li,
Liya Yang, Xiaoyuan Bian, Jingjing Wu, Xianwan Jiang, Kaicen Wang, Qiangqiang Wang,
Mark P. Hodson, Loïc M. Thibaut, Joshua W. K. Ho, Eleni Giannoulatou, and Lanjuan Li.
“Multi-omic profiling reveals associations between the gut mucosal microbiome, the metabolome,
and host DNA methylation associated gene expression in patients with colorectal cancer”. In:
BMC Microbiology 20.S1 (Apr. 2020). doi: 10.1186/s12866-020-01762-2.
[27] Patricia G. Wolf, Elise S. Cowley, Adam Breister, Sarah Matatov, Luke Lucio, Paige Polak,
Jason M. Ridlon, H. Rex Gaskins, and Karthik Anantharaman. “Diversity and distribution of sulfur
metabolic genes in the human gut microbiome and their association with colorectal cancer”. In:
Microbiome 10.1 (Apr. 2022). doi: 10.1186/s40168-022-01242-x.
[28] Qin-Song Sheng, Kang-Xin He, Jian-Jiong Li, Zi-Feng Zhong, Fei-Xia Wang, Le-Lin Pan, and
Jian-Jiang Lin. “Comparison of Gut Microbiome in Human Colorectal Cancer in Paired
Tumor and Adjacent Normal Tissues”. In: OncoTargets and Therapy 13 (Jan.
2020), pp. 635–646. doi: 10.2147/ott.s218004.
[29] Jing Li, Ai-hua Zhang, Fang-fang Wu, and Xi-jun Wang. “Alterations in the Gut Microbiota and
Their Metabolites in Colorectal Cancer: Recent Progress and Future Prospects”. In: Frontiers in
Oncology 12 (Feb. 2022). doi: 10.3389/fonc.2022.841552.
[30] Patrick D. Schloss and Sarah L. Westcott. “Assessing and Improving Methods Used in Operational
Taxonomic Unit-Based Approaches for 16S rRNA Gene Sequence Analysis”. In: Applied and
Environmental Microbiology 77.10 (May 2011), pp. 3219–3226. doi: 10.1128/aem.02810-10.
[31] Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam. “MEGAHIT: an
ultra-fast single-node solution for large and complex metagenomics assembly via succinct de
Bruijn graph”. In: Bioinformatics 31.10 (Jan. 2015), pp. 1674–1676. doi:
10.1093/bioinformatics/btv033.
[32] Zifan Zhu, Jie Ren, Sonia Michail, and Fengzhu Sun. “MicroPro: using metagenomic unmapped
reads to provide insights into human microbiota and disease associations”. In: Genome Biology
20.1 (Aug. 2019). doi: 10.1186/s13059-019-1773-5.
[33] Andy Liaw and Matthew Wiener. “Classification and Regression by randomForest”. In: R News 2.3
(Dec. 2002), pp. 18–22.
[34] Corinna Cortes and Vladimir Vapnik. “Support-vector networks”. In: Machine Learning 20.3 (Sept.
1995), pp. 273–297. doi: 10.1007/bf00994018.
[35] Robert Tibshirani. “Regression Shrinkage and Selection Via the Lasso”. In: Journal of the Royal
Statistical Society: Series B (Methodological) 58.1 (Jan. 1996), pp. 267–288. doi:
10.1111/j.2517-6161.1996.tb02080.x.
[36] Indhupriya Subramanian, Srikant Verma, Shiva Kumar, Abhay Jere, and Krishanpal Anamika.
“Multi-omics Data Integration, Interpretation, and Its Application”. In: Bioinformatics and Biology
Insights 14 (Jan. 2020), p. 117793221989905. doi: 10.1177/1177932219899051.
[37] Farhana R. Pinu, David J. Beale, Amy M. Paten, Konstantinos Kouremenos, Sanjay Swarup,
Horst J. Schirra, and David Wishart. “Systems Biology and Multi-Omics Integration: Viewpoints
from the Metabolomics Research Community”. In: Metabolites 9.4 (Apr. 2019), p. 76. doi:
10.3390/metabo9040076.
[38] Sebastian Canzler, Jana Schor, Wibke Busch, Kristin Schubert, Ulrike E. Rolle-Kampczyk,
Hervé Seitz, Hennicke Kamp, Martin von Bergen, Roland Buesen, and Jörg Hackermüller.
“Prospects and challenges of multi-omics data integration in toxicology”. In: Archives of
Toxicology 94.2 (Feb. 2020), pp. 371–388. doi: 10.1007/s00204-020-02656-y.
[39] Brooke L Fridley and Joanna M Biernacka. “Gene set analysis of SNP data: benefits, challenges,
and future directions”. In: European Journal of Human Genetics 19.8 (Apr. 2011), pp. 837–843. doi:
10.1038/ejhg.2011.57.
[40] Andrew Maltez Thomas, Paolo Manghi, Francesco Asnicar, Edoardo Pasolli, Federica Armanini,
Moreno Zolfo, Francesco Beghini, Serena Manara, Nicolai Karcher, Chiara Pozzi, Sara Gandini,
Davide Serrano, Sonia Tarallo, Antonio Francavilla, Gaetano Gallo, Mario Trompetto,
Giulio Ferrero, Sayaka Mizutani, Hirotsugu Shiroma, Satoshi Shiba, Tatsuhiro Shibata,
Shinichi Yachida, Takuji Yamada, Jakob Wirbel, Petra Schrotz-King, Cornelia M. Ulrich,
Hermann Brenner, Manimozhiyan Arumugam, Peer Bork, Georg Zeller, Francesca Cordero,
Emmanuel Dias-Neto, João Carlos Setubal, Adrian Tett, Barbara Pardini, Maria Rescigno,
Levi Waldron, Alessio Naccarati, and Nicola Segata. “Metagenomic analysis of colorectal cancer
datasets identifies cross-cohort microbial diagnostic signatures and a link with choline
degradation”. In: Nature Medicine 25.4 (Apr. 2019), pp. 667–678. doi: 10.1038/s41591-019-0405-7.
[41] Jakob Wirbel, Paul Theodor Pyl, Ece Kartal, Konrad Zych, Alireza Kashani, Alessio Milanese,
Jonas S. Fleck, Anita Y. Voigt, Albert Palleja, Ruby Ponnudurai, Shinichi Sunagawa,
Luis Pedro Coelho, Petra Schrotz-King, Emily Vogtmann, Nina Habermann, Emma Niméus,
Andrew M. Thomas, Paolo Manghi, Sara Gandini, Davide Serrano, Sayaka Mizutani,
Hirotsugu Shiroma, Satoshi Shiba, Tatsuhiro Shibata, Shinichi Yachida, Takuji Yamada,
Levi Waldron, Alessio Naccarati, Nicola Segata, Rashmi Sinha, Cornelia M. Ulrich,
Hermann Brenner, Manimozhiyan Arumugam, Peer Bork, and Georg Zeller. “Meta-analysis of
fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer”. In:
Nature Medicine 25.4 (Apr. 2019), pp. 679–689. doi: 10.1038/s41591-019-0406-6.
[42] Michael Wainberg, Nasa Sinnott-Armstrong, Nicholas Mancuso, Alvaro N. Barbeira,
David A. Knowles, David Golan, Raili Ermel, Arno Ruusalepp, Thomas Quertermous, Ke Hao,
Johan L. M. Björkegren, Hae Kyung Im, Bogdan Pasaniuc, Manuel A. Rivas, and Anshul Kundaje.
“Opportunities and challenges for transcriptome-wide association studies”. In: Nature Genetics
51.4 (Mar. 2019), pp. 592–599. doi: 10.1038/s41588-019-0385-z.
[43] W. Evan Johnson, Cheng Li, and Ariel Rabinovic. “Adjusting batch effects in microarray
expression data using empirical Bayes methods”. In: Biostatistics 8.1 (Apr. 2006), pp. 118–127. doi:
10.1093/biostatistics/kxj037.
[44] Matteo Re and Giorgio Valentini. “Prediction of Gene Function Using Ensembles of SVMs and
Heterogeneous Data Sources”. In: Studies in Computational Intelligence. Springer Berlin
Heidelberg, 2009, pp. 79–91. doi: 10.1007/978-3-642-03999-7_5.
[45] Lei Xu, Aik Choon Tan, Raimond L Winslow, and Donald Geman. “Merging microarray data from
separate breast cancer studies provides a robust prognostic test”. In: BMC Bioinformatics 9.1 (Feb.
2008). doi: 10.1186/1471-2105-9-125.
[46] Saso Džeroski and Bernard Ženko. “Is Combining Classifiers with Stacking Better than Selecting
the Best One?” In: Machine Learning 54.3 (Mar. 2004), pp. 255–273. doi:
10.1023/b:mach.0000015881.36452.6e.
[47] Prasad Patil and Giovanni Parmigiani. “Training replicable predictors in multiple studies”. In:
Proceedings of the National Academy of Sciences 115.11 (Mar. 2018), pp. 2578–2583. doi:
10.1073/pnas.1708283115.
[48] Shengyu Hao, Pan Jiang, Liang Xie, Guiling Xiang, Zilong Liu, Weiping Hu, Qinhan Wu,
Liyan Jiang, Yi Xiao, and Shanqun Li. “Essential Genes and MiRNA–mRNA Network
Contributing to the Pathogenesis of Idiopathic Pulmonary Arterial Hypertension”. In: Frontiers in
Cardiovascular Medicine 8 (May 2021). doi: 10.3389/fcvm.2021.627873.
[49] Silvia G. Acinas, Ramahi Sarma-Rupavtarm, Vanja Klepac-Ceraj, and Martin F. Polz.
“PCR-Induced Sequence Artifacts and Bias: Insights from Comparison of Two 16S rRNA Clone
Libraries Constructed from the Same Sample”. In: Applied and Environmental Microbiology 71.12
(Dec. 2005), pp. 8966–8969. doi: 10.1128/aem.71.12.8966-8969.2005.
[50] Francesco Durazzi, Claudia Sala, Gastone Castellani, Gerardo Manfreda, Daniel Remondini, and
Alessandra De Cesare. “Comparison between 16S rRNA and shotgun sequencing data for the
taxonomic characterization of the gut microbiota”. In: Scientific Reports 11.1 (Feb. 2021). doi:
10.1038/s41598-021-82726-y.
[51] Suguru Nishijima, Wataru Suda, Kenshiro Oshima, Seok-Won Kim, Yuu Hirose, Hidetoshi Morita,
and Masahira Hattori. “The gut microbiome of healthy Japanese and its microbial and functional
uniqueness”. In: DNA Research 23.2 (Mar. 2016), pp. 125–133. doi: 10.1093/dnares/dsw002.
[52] David Zeevi, Tal Korem, Niv Zmora, David Israeli, Daphna Rothschild, Adina Weinberger,
Orly Ben-Yacov, Dar Lador, Tali Avnit-Sagi, Maya Lotan-Pompan, Jotham Suez, Jemal Ali Mahdi,
Elad Matot, Gal Malka, Noa Kosower, Michal Rein, Gili Zilberman-Schapira, Lenka Dohnalová,
Meirav Pevsner-Fischer, Rony Bikovsky, Zamir Halpern, Eran Elinav, and Eran Segal.
“Personalized Nutrition by Prediction of Glycemic Responses”. In: Cell 163.5 (Nov. 2015),
pp. 1079–1094. doi: 10.1016/j.cell.2015.11.001.
[53] Rebecca L. Siegel, Kimberly D. Miller, and Ahmedin Jemal. “Cancer statistics, 2019”. In: CA: A
Cancer Journal for Clinicians 69.1 (Jan. 2019), pp. 7–34. doi: 10.3322/caac.21551.
[54] Adam S. Butterworth, Julian P.T. Higgins, and Paul Pharoah. “Relative and absolute risk of
colorectal cancer for individuals with a family history: A meta-analysis”. In: European Journal of
Cancer 42.2 (Jan. 2006), pp. 216–227. doi: 10.1016/j.ejca.2005.09.023.
[55] Louise E. Johns and Richard S. Houlston. “A systematic review and meta-analysis of familial
colorectal cancer risk”. In: The American Journal of Gastroenterology 96.10 (Oct. 2001),
pp. 2992–3003. doi: 10.1111/j.1572-0241.2001.04677.x.
[56] Maurice W. M. D. Lutgens, Martijn G. H. van Oijen, Geert J. M. G. van der Heijden,
Frank P. Vleggaar, Peter D. Siersema, and Bas Oldenburg. “Declining Risk of Colorectal Cancer in
Inflammatory Bowel Disease”. In: Inflammatory Bowel Diseases 19.4 (Mar. 2013), pp. 789–799. doi:
10.1097/mib.0b013e31828029c0.
[57] K. K. Tsilidis, J. C. Kasimis, D. S. Lopez, E. E. Ntzani, and J. P. A. Ioannidis. “Type 2 diabetes and
cancer: umbrella review of meta-analyses of observational studies”. In: BMJ 350 (Jan. 2015), g7607. doi: 10.1136/bmj.g7607.
[58] V Bagnardi, M Rota, E Botteri, I Tramacere, F Islami, V Fedirko, L Scotti, M Jenab, F Turati,
E Pasquali, C Pelucchi, C Galeone, R Bellocco, E Negri, G Corrao, P Boffetta, and C La Vecchia.
“Alcohol consumption and site-specific cancer risk: a comprehensive dose–response
meta-analysis”. In: British Journal of Cancer 112.3 (Nov. 2014), pp. 580–593. doi:
10.1038/bjc.2014.579.
[59] Edoardo Botteri, Simona Iodice, Vincenzo Bagnardi, Sara Raimondi, Albert B. Lowenfels, and
Patrick Maisonneuve. “Smoking and Colorectal Cancer”. In: JAMA 300.23 (Dec. 2008), p. 2765.
doi: 10.1001/jama.2008.839.
[60] Yanlei Ma, Yongzhi Yang, Feng Wang, Peng Zhang, Chenzhang Shi, Yang Zou, and Huanlong Qin.
“Obesity and Risk of Colorectal Cancer: A Systematic Review of Prospective Studies”. In: PLoS
ONE 8.1 (Jan. 2013). Ed. by Olga Y. Gorlova, e53916. doi: 10.1371/journal.pone.0053916.
[61] Jun Yu, Qiang Feng, Sunny Hei Wong, Dongya Zhang, Qiao yi Liang, Youwen Qin,
Longqing Tang, Hui Zhao, Jan Stenvang, Yanli Li, Xiaokai Wang, Xiaoqiang Xu, Ning Chen,
William Ka Kei Wu, Jumana Al-Aama, Hans Jørgen Nielsen, Pia Kiilerich,
Benjamin Anderschou Holbech Jensen, Tung On Yau, Zhou Lan, Huijue Jia, Junhua Li,
Liang Xiao, Thomas Yuen Tung Lam, Siew Chien Ng, Alfred Sze-Lok Cheng,
Vincent Wai-Sun Wong, Francis Ka Leung Chan, Xun Xu, Huanming Yang, Lise Madsen,
Christian Datz, Herbert Tilg, Jian Wang, Nils Brünner, Karsten Kristiansen,
Manimozhiyan Arumugam, Joseph Jao-Yiu Sung, and Jun Wang. “Metagenomic analysis of faecal
microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer”. In: Gut
66.1 (Sept. 2015), pp. 70–78. doi: 10.1136/gutjnl-2015-309800.
[62] Geoffrey D. Hannigan, Melissa B. Duhaime, Mack T. Ruffin, Charlie C. Koumpouras, and
Patrick D. Schloss. “Diagnostic Potential and Interactive Dynamics of the Colorectal Cancer
Virome”. In: mBio 9.6 (Dec. 2018). Ed. by Jeff F. Miller. doi: 10.1128/mbio.02248-18.
[63] Qiang Feng, Suisha Liang, Huijue Jia, Andreas Stadlmayr, Longqing Tang, Zhou Lan,
Dongya Zhang, Huihua Xia, Xiaoying Xu, Zhuye Jie, Lili Su, Xiaoping Li, Xin Li, Junhua Li,
Liang Xiao, Ursula Huber-Schönauer, David Niederseer, Xun Xu, Jumana Yousuf Al-Aama,
Huanming Yang, Jian Wang, Karsten Kristiansen, Manimozhiyan Arumugam, Herbert Tilg,
Christian Datz, and Jun Wang. “Gut microbiome development along the colorectal
adenoma-carcinoma sequence”. In: Nature Communications 6.1 (Mar. 2015). doi:
10.1038/ncomms7528.
[64] Emily Vogtmann, Xing Hua, Georg Zeller, Shinichi Sunagawa, Anita Y. Voigt, Rajna Hercog,
James J. Goedert, Jianxin Shi, Peer Bork, and Rashmi Sinha. “Colorectal Cancer and the Human
Gut Microbiome: Reproducibility with Whole-Genome Shotgun Sequencing”. In: PLOS ONE 11.5
(May 2016). Ed. by John Parkinson, e0155362. doi: 10.1371/journal.pone.0155362.
[65] Ziwei Zhou, Jiewen Chen, Herui Yao, and Hai Hu. “Fusobacterium and Colorectal Cancer”. In:
Frontiers in Oncology 8 (Oct. 2018). doi: 10.3389/fonc.2018.00371.
[66] Antony Cougnoux, Guillaume Dalmasso, Ruben Martinez, Emmanuel Buc, Julien Delmas,
Lucie Gibold, Pierre Sauvanet, Claude Darcha, Pierre Déchelotte, Mathilde Bonnet, Denis Pezet,
Harald Wodrich, Arlette Darfeuille-Michaud, and Richard Bonnet. “Bacterial genotoxin colibactin
promotes colon tumour growth by inducing a senescence-associated secretory phenotype”. In:
Gut 63.12 (Mar. 2014), pp. 1932–1942. doi: 10.1136/gutjnl-2013-305257.
[67] Fakhri Haghi, Elshan Goli, Bahman Mirzaei, and Habib Zeighami. “The association between fecal
enterotoxigenic B. fragilis with colorectal cancer”. In: BMC Cancer 19.1 (Sept. 2019). doi:
10.1186/s12885-019-6115-1.
[68] Derek Reiman, Ahmed Metwally, and Yang Dai. “Using convolutional neural networks to explore
the microbiome”. In: 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (July 2017). doi: 10.1109/embc.2017.8037799.
[69] K. D. Pruitt, T. Tatusova, and D. R. Maglott. “NCBI reference sequences (RefSeq): a curated
non-redundant sequence database of genomes, transcripts and proteins”. In: Nucleic Acids
Research 35.Database (Jan. 2007), pp. D61–D65. doi: 10.1093/nar/gkl842.
[70] Max Kuhn. “Building Predictive Models in R Using the caret Package”. In: Journal of Statistical
Software 28.5 (2008). doi: 10.18637/jss.v028.i05.
[71] Thomas G Dietterich. “Machine-Learning Research: Four Current Directions”. In: AI Magazine
18.4 (1997), pp. 97–136.
[72] Jan Grau, Ivo Grosse, and Jens Keilwagen. “PRROC: computing and visualizing precision-recall
and receiver operating characteristic curves in R”. In: Bioinformatics 31.15 (Mar. 2015),
pp. 2595–2597. doi: 10.1093/bioinformatics/btv153.
[73] Simon H. Ye, Katherine J. Siddle, Daniel J. Park, and Pardis C. Sabeti. “Benchmarking
Metagenomics Tools for Taxonomic Classification”. In: Cell 178.4 (Aug. 2019), pp. 779–794. doi:
10.1016/j.cell.2019.07.010.
[74] J. Tamames, M. Cobo-Simón, and F. Puente-Sánchez. “Assessing the performance of different
approaches for functional and taxonomic annotation of metagenomes”. In: BMC Genomics 20.1
(Dec. 2019). doi: 10.1186/s12864-019-6289-6.
[75] Frank Wilcoxon. “Individual Comparisons by Ranking Methods”. In: Biometrics Bulletin 1.6 (Dec.
1945), p. 80. doi: 10.2307/3001968.
[76] Jerome H. Friedman. “Greedy function approximation: A gradient boosting machine.” In: The
Annals of Statistics 29.5 (Oct. 2001). doi: 10.1214/aos/1013203451.
[77] Thais Mayumi Oshiro, Pedro Santoro Perez, and José Augusto Baranauskas. “How Many Trees in
a Random Forest?” In: Machine Learning and Data Mining in Pattern Recognition. Springer Berlin Heidelberg, 2012, pp. 154–168. doi: 10.1007/978-3-642-31537-4_13.
[78] Jeffrey T. Leek, W. Evan Johnson, Hilary S. Parker, Andrew E. Jaffe, and John D. Storey. “The sva
package for removing batch effects and other unwanted variation in high-throughput
experiments”. In: Bioinformatics 28.6 (Jan. 2012), pp. 882–883. doi: 10.1093/bioinformatics/bts034.
[79] Yuqing Zhang, Christoph Bernau, Giovanni Parmigiani, and Levi Waldron. “The impact of
different sources of heterogeneity on loss of accuracy from genomic prediction models”. In:
Biostatistics 21.2 (Sept. 2018), pp. 253–268. doi: 10.1093/biostatistics/kxy044.
[80] Samantha Leong, Yue Zhao, Noyal M. Joseph, Natasha S. Hochberg, Sonali Sarkar,
Jane Pleskunas, David Hom, Subitha Lakshminarayanan, C. Robert Horsburgh, Gautam Roy,
Jerrold J. Ellner, W. Evan Johnson, and Padmini Salgame. “Existing blood transcriptional
classifiers accurately discriminate active tuberculosis from latent infection in individuals from
south India”. In: Tuberculosis 109 (Mar. 2018), pp. 41–51. doi: 10.1016/j.tube.2018.01.002.
[81] Rina Kagawa, Yoshimasa Kawazoe, Yusuke Ida, Emiko Shinohara, Katsuya Tanaka, Takeshi Imai,
and Kazuhiko Ohe. “Development of Type 2 Diabetes Mellitus Phenotyping Framework Using
Expert Knowledge and Machine Learning Approach”. In: Journal of Diabetes Science and
Technology 11.4 (Dec. 2016), pp. 791–799. doi: 10.1177/1932296816681584.
[82] Christoph Bernau, Markus Riester, Anne-Laure Boulesteix, Giovanni Parmigiani,
Curtis Huttenhower, Levi Waldron, and Lorenzo Trippa. “Cross-study validation for the
assessment of prediction algorithms”. In: Bioinformatics 30.12 (June 2014), pp. i105–i112. doi:
10.1093/bioinformatics/btu279.
[83] Lo-Bin Chang and Donald Geman. “Tracking Cross-Validated Estimates of Prediction Error as
Studies Accumulate”. In: Journal of the American Statistical Association 110.511 (July 2015),
pp. 1239–1247. doi: 10.1080/01621459.2014.1002926.
[84] Yuqing Zhang, Prasad Patil, W. Evan Johnson, and Giovanni Parmigiani. “Robustifying genomic
classifiers to batch effects via ensemble learning”. In: Bioinformatics 37.11 (Nov. 2020),
pp. 1521–1527. doi: 10.1093/bioinformatics/btaa986.
[85] Yilin Gao, Zifan Zhu, and Fengzhu Sun. “Increasing prediction performance of colorectal cancer
disease status using random forests classification based on metagenomic shotgun sequencing
data”. In: Synthetic and Systems Biotechnology 7.1 (Mar. 2022), pp. 574–585. doi:
10.1016/j.synbio.2022.01.005.
[86] Peter Kupfer, Reinhard Guthke, Dirk Pohlers, Rene Huber, Dirk Koczan, and Raimund W Kinne.
“Batch correction of microarray data substantially improves the identification of genes
differentially expressed in Rheumatoid Arthritis and Osteoarthritis”. In: BMC Medical Genomics
5.1 (June 2012). doi: 10.1186/1755-8794-5-23.
[87] Mark D. Robinson, Davis J. McCarthy, and Gordon K. Smyth. “edgeR: a Bioconductor package for
differential expression analysis of digital gene expression data”. In: Bioinformatics 26.1 (Nov.
2009), pp. 139–140. doi: 10.1093/bioinformatics/btp616.
[88] Michael I Love, Wolfgang Huber, and Simon Anders. “Moderated estimation of fold change and
dispersion for RNA-seq data with DESeq2”. In: Genome Biology 15.12 (Dec. 2014). doi:
10.1186/s13059-014-0550-8.
[89] Josep Martín-Fernández, Carles Vidal, and Vera Pawlowsky-Glahn. “Dealing with Zeros and
Missing Values in Compositional Data Sets Using Nonparametric Imputation”. In: Mathematical
Geology 35 (Jan. 2003), pp. 253–278. doi: 10.1023/A:1023866030544.
[90] Joshua M. Stuart, Eran Segal, Daphne Koller, and Stuart K. Kim. “A Gene-Coexpression Network
for Global Discovery of Conserved Genetic Modules”. In: Science 302.5643 (Oct. 2003),
pp. 249–255. doi: 10.1126/science.1087447.
[91] Raivo Kolde, Sven Laur, Priit Adler, and Jaak Vilo. “Robust rank aggregation for gene list
integration and meta-analysis”. In: Bioinformatics 28.4 (Jan. 2012), pp. 573–580. doi:
10.1093/bioinformatics/btr709.
[92] Xinran Li, Dingdong Yi, and Jun S. Liu. “Bayesian Analysis of Rank Data with Covariates and
Heterogeneous Rankers”. In: Statistical Science 37.1 (Feb. 2022). doi: 10.1214/20-sts818.
[93] Daniel E Zak, Adam Penn-Nicholson, Thomas J Scriba, Ethan Thompson, Sara Suliman,
Lynn M Amon, Hassan Mahomed, Mzwandile Erasmus, Wendy Whatney, Gregory D Hussey,
Deborah Abrahams, Fazlin Kafaar, Tony Hawkridge, Suzanne Verver, E Jane Hughes, Martin Ota,
Jayne Sutherland, Rawleigh Howe, Hazel M Dockrell, W Henry Boom, Bonnie Thiel,
Tom H M Ottenhoff, Harriet Mayanja-Kizza, Amelia C Crampin, Katrina Downing,
Mark Hatherill, Joe Valvo, Smitha Shankar, Shreemanta K Parida, Stefan H E Kaufmann,
Gerhard Walzl, Alan Aderem, and Willem A Hanekom. “A blood RNA signature for tuberculosis
disease risk: a prospective cohort study”. In: The Lancet 387.10035 (June 2016), pp. 2312–2322. doi:
10.1016/s0140-6736(15)01316-1.
[94] Suzanne T. Anderson, Myrsini Kaforou, Andrew J. Brent, Victoria J. Wright, Claire M. Banwell,
George Chagaluka, Amelia C. Crampin, Hazel M. Dockrell, Neil French, Melissa S. Hamilton,
Martin L. Hibberd, Florian Kern, Paul R. Langford, Ling Ling, Rachel Mlotha, Tom H.M. Ottenhoff,
Sandy Pienaar, Vashini Pillay, J. Anthony G. Scott, Hemed Twahir, Robert J. Wilkinson,
Lachlan J. Coin, Robert S. Heyderman, Michael Levin, and Brian Eley. “Diagnosis of Childhood
Tuberculosis and Host RNA Expression in Africa”. In: New England Journal of Medicine 370.18
(May 2014), pp. 1712–1723. doi: 10.1056/nejmoa1303657.
[95] Nicholas D. Walter, Mikaela A. Miller, Joshua Vasquez, Marc Weiner, Adam Chapman,
Melissa Engle, Michael Higgins, Amy M. Quinones, Vanessa Rosselli, Elizabeth Canono,
Christina Yoon, Adithya Cattamanchi, J. Lucian Davis, Tzu Phang, Robert S. Stearman,
Gargi Datta, Benjamin J. Garcia, Charles L. Daley, Michael Strong, Katerina Kechris,
Tasha E. Fingerlin, Randall Reves, and Mark W. Geraci. “Blood Transcriptional Biomarkers for
Active Tuberculosis among Patients in the United States: a Case-Control Study with Systematic
Cross-Classifier Evaluation”. In: Journal of Clinical Microbiology 54.2 (Feb. 2016). Ed. by
G. A. Land, pp. 274–282. doi: 10.1128/jcm.01990-15.
[96] Myrsini Kaforou, Victoria J. Wright, Tolu Oni, Neil French, Suzanne T. Anderson,
Nonzwakazi Bangani, Claire M. Banwell, Andrew J. Brent, Amelia C. Crampin, Hazel M. Dockrell,
Brian Eley, Robert S. Heyderman, Martin L. Hibberd, Florian Kern, Paul R. Langford, Ling Ling,
Marc Mendelson, Tom H. Ottenhoff, Femia Zgambo, Robert J. Wilkinson, Lachlan J. Coin, and
Michael Levin. “Detection of Tuberculosis in HIV-Infected and -Uninfected African Adults Using
Whole Blood RNA Expression Signatures: A Case-Control Study”. In: PLoS Medicine 10.10 (Oct.
2013). Ed. by Adithya Cattamanchi, e1001538. doi: 10.1371/journal.pmed.1001538.
[97] Alexandre Alcaïs, Claire Fieschi, Laurent Abel, and Jean-Laurent Casanova. “Tuberculosis in
children and adults”. In: The Journal of Experimental Medicine 202.12 (Dec. 2005), pp. 1617–1621.
doi: 10.1084/jem.20052302.
[98] Robin Kosch and Klaus Jung. “Conducting gene set tests in meta-analyses of transcriptome
expression data”. In: Research Synthesis Methods 10.1 (Feb. 2019), pp. 99–112. doi:
10.1002/jrsm.1337.
[99] Zoe Guan, Giovanni Parmigiani, and Prasad Patil. Merging versus Ensembling in Multi-Study
Prediction: Theoretical Insight from Random Effects. 2021. arXiv: 1905.07382v3 [stat.ML]. doi: 10.48550/ARXIV.1905.07382.
[100] Kevin Y. X. Wang, Gulietta M. Pupo, Varsha Tembe, Ellis Patrick, Dario Strbenac,
Sarah-Jane Schramm, John F. Thompson, Richard A. Scolyer, Samuel Muller, Garth Tarr,
Graham J. Mann, and Jean Y. H. Yang. “Cross-Platform Omics Prediction procedure: a statistical
machine learning framework for wider implementation of precision medicine”. In: npj Digital
Medicine 5.1 (July 2022). doi: 10.1038/s41746-022-00618-5.
[101] Kerri L. Glassner, Bincy P. Abraham, and Eamonn M.M. Quigley. “The microbiome and
inflammatory bowel disease”. In: Journal of Allergy and Clinical Immunology 145.1 (Jan. 2020),
pp. 16–27. doi: 10.1016/j.jaci.2019.11.003.
[102] Yeojun Yun, Han-Na Kim, Song E. Kim, Seong Gu Heo, Yoosoo Chang, Seungho Ryu,
Hocheol Shin, and Hyung-Lae Kim. “Comparative analysis of gut microbiota associated with
body mass index in a large Korean cohort”. In: BMC Microbiology 17.1 (July 2017). doi:
10.1186/s12866-017-1052-0.
[103] Tibor I. Krisko, Hayley T. Nicholls, Curtis J. Bare, Corey D. Holman, Gregory G. Putzel,
Robert S. Jansen, Natalie Sun, Kyu Y. Rhee, Alexander S. Banks, and David E. Cohen.
“Dissociation of Adaptive Thermogenesis from Glucose Homeostasis in Microbiome-Deficient
Mice”. In: Cell Metabolism 31.3 (Mar. 2020), 592–604.e9. doi: 10.1016/j.cmet.2020.01.012.
[104] Yilin Gao and Fengzhu Sun. “Batch Normalization Followed by Merging Is Powerful for
Phenotype Prediction Integrating Multiple Heterogeneous Studies”. In: bioRxiv (2022). doi:
10.1101/2022.09.28.509843. eprint:
https://www.biorxiv.org/content/early/2022/09/28/2022.09.28.509843.full.pdf.
[105] Wodan Ling, Jiuyao Lu, Ni Zhao, Anju Lulla, Anna M. Plantinga, Weijia Fu, Angela Zhang,
Hongjiao Liu, Hoseung Song, Zhigang Li, Jun Chen, Timothy W. Randolph, Wei Li A. Koay,
James R. White, Lenore J. Launer, Anthony A. Fodor, Katie A. Meyer, and Michael C. Wu. “Batch
effects removal for microbiome data via conditional quantile regression”. In: Nature
Communications 13.1 (Sept. 2022). doi: 10.1038/s41467-022-33071-9.
[106] Junjie Qin, Yingrui Li, Zhiming Cai, Shenghui Li, Jianfeng Zhu, Fan Zhang, Suisha Liang,
Wenwei Zhang, Yuanlin Guan, Dongqian Shen, Yangqing Peng, Dongya Zhang, Zhuye Jie,
Wenxian Wu, Youwen Qin, Wenbin Xue, Junhua Li, Lingchuan Han, Donghui Lu, Peixian Wu,
Yali Dai, Xiaojuan Sun, Zesong Li, Aifa Tang, Shilong Zhong, Xiaoping Li, Weineng Chen,
Ran Xu, Mingbang Wang, Qiang Feng, Meihua Gong, Jing Yu, Yanyan Zhang, Ming Zhang,
Torben Hansen, Gaston Sanchez, Jeroen Raes, Gwen Falony, Shujiro Okuda, Mathieu Almeida,
Emmanuelle LeChatelier, Pierre Renault, Nicolas Pons, Jean-Michel Batto, Zhaoxi Zhang,
Hua Chen, Ruifu Yang, Weimou Zheng, Songgang Li, Huanming Yang, Jian Wang,
S. Dusko Ehrlich, Rasmus Nielsen, Oluf Pedersen, Karsten Kristiansen, and Jun Wang. “A
metagenome-wide association study of gut microbiota in type 2 diabetes”. In: Nature 490.7418
(Sept. 2012), pp. 55–60. doi: 10.1038/nature11450.
[107] Fredrik H. Karlsson, Valentina Tremaroli, Intawat Nookaew, Göran Bergström, Carl Johan Behre,
Björn Fagerberg, Jens Nielsen, and Fredrik Bäckhed. “Gut metagenome in European women with
normal, impaired and diabetic glucose control”. In: Nature 498.7452 (May 2013), pp. 99–103. doi:
10.1038/nature12198.
[108] Krithivasan Sankaranarayanan, Andrew T. Ozga, Christina Warinner, Raul Y. Tito,
Alexandra J. Obregon-Tito, Jiawu Xu, Patrick M. Gaffney, Lori L. Jervis, Derrell Cox,
Lancer Stephens, Morris Foster, Gloria Tallbull, Paul Spicer, and Cecil M. Lewis. “Gut Microbiome
Diversity among Cheyenne and Arapaho Individuals from Western Oklahoma”. In: Current
Biology 25.24 (Dec. 2015), pp. 3161–3169. doi: 10.1016/j.cub.2015.10.060.
[109] Stefan Bentink, Benjamin Haibe-Kains, Thomas Risch, Jian-Bing Fan, Michelle S. Hirsch,
Kristina Holton, Renee Rubio, Craig April, Jing Chen, Eliza Wickham-Garcia, Joyce Liu,
Aedin Culhane, Ronny Drapkin, John Quackenbush, and Ursula A. Matulonis. “Angiogenic mRNA
and microRNA Gene Expression Signature Predicts a Novel Subtype of Serous Ovarian Cancer”.
In: PLoS ONE 7.2 (Feb. 2012). Ed. by Chad Creighton, e30269. doi: 10.1371/journal.pone.0030269.
[110] J. Stuart Ferriss, Youngchul Kim, Linda Duska, Michael Birrer, Douglas A. Levine,
Christopher Moskaluk, Dan Theodorescu, and Jae K. Lee. “Multi-Gene Expression Predictors of
Single Drug Responses to Adjuvant Chemotherapy in Ovarian Carcinoma: Predicting Platinum
Resistance”. In: PLoS ONE 7.2 (Feb. 2012). Ed. by Wael El-Rifai, e30550. doi:
10.1371/journal.pone.0030550.
[111] Kosuke Yoshihara, Atsushi Tajima, Tetsuro Yahata, Shoji Kodama, Hiroyuki Fujiwara,
Mitsuaki Suzuki, Yoshitaka Onishi, Masayuki Hatae, Kazunobu Sueyoshi, Hisaya Fujiwara,
Yoshiki Kudo, Kohei Kotera, Hideaki Masuzaki, Hironori Tashiro, Hidetaka Katabuchi,
Ituro Inoue, and Kenichi Tanaka. “Gene Expression Profile for Predicting Survival in
Advanced-Stage Serous Ovarian Cancer Across Two Independent Datasets”. In: PLoS ONE 5.3
(Mar. 2010), e9615. doi: 10.1371/journal.pone.0009615.
[112] Antoine Aoun, Fatima Darwish, and Natacha Hamod. “The Influence of the Gut Microbiome on
Obesity in Adults and the Role of Probiotics, Prebiotics, and Synbiotics for Weight Loss”. In:
Preventive Nutrition and Food Science 25.2 (June 2020), pp. 113–123. doi:
10.3746/pnf.2020.25.2.113.
[113] Deborah R. Leitner, Gema Frühbeck, Volkan Yumuk, Karin Schindler, Dragan Micic,
Euan Woodward, and Hermann Toplak. “Obesity and Type 2 Diabetes: Two Diseases with a Need
for Combined Treatment Strategies - EASO Can Lead the Way”. In: Obesity Facts 10.5 (2017),
pp. 483–492. doi: 10.1159/000480525.
[114] Edoardo Pasolli, Lucas Schiffer, Paolo Manghi, Audrey Renson, Valerie Obenchain,
Duy Tin Truong, Francesco Beghini, Faizan Malik, Marcel Ramos, Jennifer B Dowd,
Curtis Huttenhower, Martin Morgan, Nicola Segata, and Levi Waldron. “Accessible, curated
metagenomic data through ExperimentHub”. In: Nature Methods 14.11 (Nov. 2017), pp. 1023–1024.
doi: 10.1038/nmeth.4468.
[115] John B. Welsh, Patrick P. Zarrinkar, Lisa M. Sapinoso, Suzanne G. Kern, Cynthia A. Behling,
Bradley J. Monk, David J. Lockhart, Robert A. Burger, and Garret M. Hampton. “Analysis of gene
expression profiles in normal and neoplastic ovarian tissue samples identifies candidate
molecular markers of epithelial ovarian cancer”. In: Proceedings of the National Academy of
Sciences 98.3 (Jan. 2001), pp. 1176–1181. doi: 10.1073/pnas.98.3.1176.
[116] Stephanie Lheureux, Marc Braunstein, and Amit M. Oza. “Epithelial ovarian cancer: Evolution of
management in the era of precision medicine”. In: CA: A Cancer Journal for Clinicians 69.4 (2019),
pp. 280–304. doi: 10.3322/caac.21559.
[117] Stephanie Lheureux, Marsela Braunstein, and Amit M. Oza. “Epithelial ovarian cancer: Evolution
of management in the era of precision medicine”. In: CA: A Cancer Journal for Clinicians 69.4
(May 2019), pp. 280–304. doi: 10.3322/caac.21559.
[118] Lushi Yu, Hongyun Gong, Qian Li, Honggang Ren, Yi Wang, Haihua He, Tian Li, and Qibin Song.
“Survival Analysis of Radiation Therapy in Ovarian Cancer: A SEER Database Analysis”. In:
Journal of Oncology 2021 (Feb. 2021). Ed. by Nihal Ahmad, pp. 1–11. doi: 10.1155/2021/8849039.
[119] Benjamin Frederick Ganzfried, Markus Riester, Benjamin Haibe-Kains, Thomas Risch,
Svitlana Tyekucheva, Ina Jazic, Xin Victoria Wang, Mahnaz Ahmadifar, Michael J. Birrer,
Giovanni Parmigiani, Curtis Huttenhower, and Levi Waldron. “curatedOvarianData: clinically
annotated data for the ovarian cancer transcriptome”. In: Database 2013 (Jan. 2013). doi:
10.1093/database/bat013.
[120] Harald Binder, Arthur Allignol, Martin Schumacher, and Jan Beyersmann. “Boosting for
high-dimensional time-to-event data with competing risks”. In: Bioinformatics 25.7 (Feb. 2009),
pp. 890–896. doi: 10.1093/bioinformatics/btp088.
[121] David R. Cox. “Regression Models and Life-Tables”. In: Journal of the Royal Statistical Society.
Series B (Methodological) 34.2 (1972), pp. 187–220. url: http://www.jstor.org/stable/2985181.
[122] Wenzheng Sun, Mingyan Jiang, Jun Dang, Panchun Chang, and Fang-Fang Yin. “Effect of
machine learning methods on predicting NSCLC overall survival time based on Radiomics
analysis”. In: Radiation Oncology 13.1 (Oct. 2018). doi: 10.1186/s13014-018-1140-9.
[123] Moritz Herrmann, Philipp Probst, Roman Hornung, Vindi Jurinovic, and Anne-Laure Boulesteix.
“Large-scale benchmark study of survival prediction methods using multi-omics data”. In:
Briefings in Bioinformatics 22.3 (Aug. 2020). doi: 10.1093/bib/bbaa167.
[124] Hemant Ishwaran, Udaya B. Kogalur, Eugene H. Blackstone, and Michael S. Lauer. “Random
survival forests”. In: The Annals of Applied Statistics 2.3 (Sept. 2008). doi: 10.1214/08-aoas169.
[125] Annette Spooner, Emily Chen, Arcot Sowmya, Perminder Sachdev, Nicole A. Kochan,
Julian Trollor, and Henry Brodaty. “A comparison of machine learning methods for survival
analysis of high-dimensional clinical data for dementia prediction”. In: Scientific Reports 10.1
(Nov. 2020). doi: 10.1038/s41598-020-77220-w.
[126] Frank E. Harrell. “Evaluating the Yield of Medical Tests”. In: JAMA: The Journal of the American
Medical Association 247.18 (May 1982), p. 2543. doi: 10.1001/jama.1982.03320430047030.
[127] Geyu Zhou, Tianqi Chen, and Hongyu Zhao. “SDPRX: A statistical method for cross-population
prediction of complex traits”. In: The American Journal of Human Genetics 110.1 (2023), pp. 13–22.
doi: 10.1016/j.ajhg.2022.11.007.
[128] Omer Weissbrod, Masahiro Kanai, Huwenbo Shi, Steven Gazal, Wouter J. Peyrot, Amit V. Khera,
Yukinori Okada, The Biobank Japan Project, Alicia R. Martin, Hilary Finucane, and Alkes L. Price.
“Leveraging fine-mapping and non-European training data to improve cross-population
polygenic risk scores”. In: medRxiv (2021). doi: 10.1101/2021.01.19.21249483. eprint:
https://www.medrxiv.org/content/early/2021/08/20/2021.01.19.21249483.full.pdf.
[129] Mingxuan Cai, Jiashun Xiao, Shunkang Zhang, Xiang Wan, Hongyu Zhao, Gang Chen, and
Can Yang. “A unified framework for cross-population trait prediction by leveraging the genetic
correlation of polygenic traits”. In: The American Journal of Human Genetics 108.4 (2021),
pp. 632–655. doi: 10.1016/j.ajhg.2021.03.002.
[130] Yunfeng Ruan, Yen-Feng Lin, Yen-Chen Anne Feng, Chia-Yen Chen, Max Lam, Zhenglin Guo,
Stanley Global Asia Initiatives, Lin He, Akira Sawa, Alicia R. Martin, Shengying Qin,
Hailiang Huang, and Tian Ge. “Improving Polygenic Prediction in Ancestrally Diverse
Populations”. In: medRxiv (2021). doi: 10.1101/2020.12.27.20248738. eprint:
https://www.medrxiv.org/content/early/2021/08/24/2020.12.27.20248738.full.pdf.
[131] Haoyu Zhang, Jianan Zhan, Jin Jin, Jingning Zhang, Wenxuan Lu, Ruzhang Zhao,
Thomas U. Ahearn, Zhi Yu, Jared O’Connell, Yunxuan Jiang, Tony Chen, Dayne Okuhara,
23andMe Research Team, Montserrat Garcia-Closas, Xihong Lin, Bertram L. Koelsch, and
Nilanjan Chatterjee. “A new method for multi-ancestry polygenic prediction improves
performance across diverse populations”. In: bioRxiv (2023). doi: 10.1101/2022.03.24.485519.
eprint: https://www.biorxiv.org/content/early/2023/08/17/2022.03.24.485519.full.pdf.
[132] Jeffrey P. Spence, Nasa Sinnott-Armstrong, Themistocles L. Assimes, and Jonathan K. Pritchard.
“A flexible modeling and inference framework for estimating variant effect sizes from GWAS
summary statistics”. In: bioRxiv (2022). doi: 10.1101/2022.04.18.488696. eprint:
https://www.biorxiv.org/content/early/2022/04/19/2022.04.18.488696.full.pdf.
A Supplementary Materials for Chapter 2
A.1 Prediction performance comparison of model stacking method and traditional
feature stacking method
In our study, we compared the leave-one-sample-out (LOSO) model stacking method and the feature stacking method in terms of prediction performance. The feature stacking method combines the known and novel microbial abundance tables into one data matrix and then uses it to build predictive models. In contrast, the LOSO model stacking method assigns weights to the known and novel predictive probabilities, and an overall AUC or AUPRC is then calculated from the weighted probabilities.
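To make the contrast concrete, a minimal R sketch of the two strategies is given below, using the randomForest package [33]. The object names (known_abund, novel_abund, labels, and the probability and AUC arguments) are hypothetical placeholders, and the weighting follows the scheme of equations (A.1)-(A.3) in Section A.2; this is an illustration, not the exact code used in our analyses.

library(randomForest)

# Feature stacking: concatenate the known and novel abundance tables
# into one feature matrix and train a single random forests model.
feature_stacking <- function(known_abund, novel_abund, labels) {
  randomForest(x = cbind(known_abund, novel_abund),
               y = as.factor(labels), ntree = 5000)
}

# LOSO model stacking: combine the predictive probabilities of the
# known-organism and novel-organism models, weighting each model by
# max(AUC - 0.5, 0) so that uninformative models receive zero weight.
model_stacking <- function(p_known, p_novel, auc_known, auc_novel) {
  w_known <- max(auc_known - 0.5, 0)
  w_novel <- max(auc_novel - 0.5, 0)
  (w_known * p_known + w_novel * p_novel) / (w_known + w_novel)
}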
We compared the prediction performance in terms of both AUC and AUPRC. For the AUC comparison (Figure S1), in the within-dataset setting, 5 out of 6 datasets have better mean AUC scores with the LOSO model stacking method than with the feature stacking method. In the cross-dataset setting, 20 out of 30 LOSO model stacking AUCs are higher than the corresponding feature stacking AUCs and 4 out of 30 are the same. For the AUPRC comparison (Figure S2), all 6 datasets have higher AUPRC scores with the LOSO model stacking method in the within-dataset setting, while in the cross-dataset setting, 19 out of 30 comparisons favor the LOSO model stacking method and 3 out of 30 are ties. Therefore, our results suggest that the LOSO model stacking method is a better choice than the feature stacking method for microbiome-based CRC predictive analysis.
A.2 Leave-one-dataset-out predictive analysis
We also conducted predictive analyses in the leave-one-dataset-out (LODO) setting, where one dataset was selected as the testing set and the remaining five datasets formed the combined training set. Due to the large computational overhead of running MicroPro on five combined datasets, we reused the predictive probabilities from the cross-dataset setting. For each sample i in the testing set, the predictive probabilities from the known and novel microbial abundances were calculated using equations (A.1) and (A.2), respectively, and the LOSO model stacking probability using equation (A.3).
Figure S1: Barplots of AUC scores using feature stacking (orange) and LOSO model stacking (blue)
in two experimental designs: within-dataset (subfigure A) and cross-dataset (subfigure B). All
AUC scores are averages from 30 independent repetitions. ‘feature stacking’ refers to the random forests
model trained directly on abundances from both known and novel organisms. ‘LOSO model stacking’ refers to combining the predictive probabilities from the known and novel organisms via the LOSO model stacking method. 5,000 decision trees are used in all random forests models.
Here, p_ij is the predictive probability of sample i and AUC_ij is the AUC score on the testing set excluding sample i, both obtained in the cross-dataset setting with dataset j as the training set.
Figure S2: Barplots of AUPRC scores using feature stacking (orange) and LOSO model stacking
(blue) in two experimental designs: within-dataset (subfigure A) and cross-dataset (subfigure B).
All AUPRC scores are averages from 30 independent repetitions. ‘feature stacking’ refers to the random
forests model trained directly on abundances from both known and novel organisms. ‘LOSO model stacking’ refers to combining the predictive probabilities from the known and novel organisms via the LOSO model stacking method. 5,000 decision trees are used in all random forests models.
\[
p_i^{\text{known}} = \frac{\sum_{\text{training dataset } j} \max\left(\text{AUC}_{ij}^{\text{known}} - 0.5,\ 0\right) \times p_{ij}^{\text{known}}}{\sum_{\text{training dataset } j} \max\left(\text{AUC}_{ij}^{\text{known}} - 0.5,\ 0\right)} \tag{A.1}
\]

\[
p_i^{\text{novel}} = \frac{\sum_{\text{training dataset } j} \max\left(\text{AUC}_{ij}^{\text{novel}} - 0.5,\ 0\right) \times p_{ij}^{\text{novel}}}{\sum_{\text{training dataset } j} \max\left(\text{AUC}_{ij}^{\text{novel}} - 0.5,\ 0\right)} \tag{A.2}
\]

\[
p_i = \frac{\sum_{\text{training dataset } j} \left[\max\left(\text{AUC}_{ij}^{\text{known}} - 0.5,\ 0\right) \times p_{ij}^{\text{known}} + \max\left(\text{AUC}_{ij}^{\text{novel}} - 0.5,\ 0\right) \times p_{ij}^{\text{novel}}\right]}{\sum_{\text{training dataset } j} \left[\max\left(\text{AUC}_{ij}^{\text{known}} - 0.5,\ 0\right) + \max\left(\text{AUC}_{ij}^{\text{novel}} - 0.5,\ 0\right)\right]} \tag{A.3}
\]
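A direct R transcription of equations (A.1)-(A.3) is sketched below. Here p_known, p_novel, auc_known, and auc_novel are hypothetical matrices whose rows index the test samples i and whose columns index the training datasets j; the names are placeholders rather than the exact code used in our experiments.

# Compute the LODO predictive probabilities of equations (A.1)-(A.3).
lodo_probabilities <- function(p_known, p_novel, auc_known, auc_novel) {
  w_known <- pmax(auc_known - 0.5, 0)  # weight of dataset j for sample i (known)
  w_novel <- pmax(auc_novel - 0.5, 0)  # weight of dataset j for sample i (novel)
  list(
    known = rowSums(w_known * p_known) / rowSums(w_known),    # equation (A.1)
    novel = rowSums(w_novel * p_novel) / rowSums(w_novel),    # equation (A.2)
    stacked = rowSums(w_known * p_known + w_novel * p_novel) /
      rowSums(w_known + w_novel)                              # equation (A.3)
  )
}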
Leave-one-dataset-out prediction results are shown in Figure S3. We observed that 3 out of 6 AUCs increased when incorporating the novel microbial abundances, while the AUC for one dataset did not change. For the AUPRC results, 3 out of 6 datasets improved when incorporating the novel microbial abundances. These results suggest that the LOSO model stacking method improves or maintains prediction performance in most leave-one-dataset-out comparisons.
Figure S3: Barplots of AUC (subfigure A) and AUPRC (subfigure B) scores using known microbial abundances (orange) and LOSO model stacking (blue) in the leave-one-dataset-out (LODO) analysis. All AUC and AUPRC scores are averages from 10 independent repetitions. ‘known’ refers to the random forests model using only abundances of known organisms, and ‘LOSO model stacking’ refers to combining the predictive probabilities from the known and novel organisms via the LOSO model stacking method. Random forests models use 5,000 decision trees.
A.3 Comparison of prediction accuracy based on microbial species abundance profiles using the UHGG and NCBI RefSeq databases
MicroPro [32] used Centrifuge [11] to map reads to the NCBI RefSeq database [69]. However, the average proportion of mapped reads is only between 40% and 60%, leaving a large proportion of reads that cannot be mapped to reference genomes in the NCBI RefSeq database (Figure S4). In our previous analysis, we explored how these unmapped reads can contribute to CRC disease predictive analysis. We next investigated the effect of using a more comprehensive collection of microbial reference genomes in the predictive analysis. The database we used was the Unified Human Gastrointestinal Genome (UHGG) collection, comprising 204,938 non-redundant genomes from 4,644 gut prokaryotes [17]; it has been described as the most comprehensive sequence resource for the human gut microbiome established so far [17]. We used Centrifuge to profile the six CRC datasets again, obtaining new microbial species abundance profiles by mapping the reads to the UHGG database. According to the mapping rate results in Figure S4, the average mapping rates increased markedly compared to the NCBI RefSeq database, exceeding 85% for all six datasets. We then trained random forests with 1,000 decision trees on the new microbial species abundance profiles to see whether a higher mapping rate could help improve the AUC scores. As shown in Figure S5, the three experimental settings show similar AUC scores for the NCBI RefSeq and UHGG databases, with averages of 0.735 and 0.708, respectively, and a p-value of 0.011 by a two-sided paired Mann-Whitney U test. The AUC scores with the UHGG database are higher in only 11 of 42 cases. These results imply that the new species added to the UHGG database may not be useful for CRC disease status prediction, which is consistent with our previous finding that including novel microbial species only slightly improved the prediction performance.
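For completeness, the paired comparison reported above can be reproduced in R as sketched below, where auc_refseq and auc_uhgg are hypothetical vectors holding the 42 matched AUC scores from the two databases; wilcox.test with paired = TRUE carries out the paired two-sided rank test.

# Two-sided paired test on the 42 matched AUC scores obtained with
# the NCBI RefSeq and UHGG reference databases.
wilcox.test(auc_refseq, auc_uhgg, paired = TRUE, alternative = "two.sided")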
Figure S4: Barplots of average mapping rates against NCBI RefSeq (orange) and UHGG (blue)
databases for each dataset. Average mapping rate is the mean mapping rate of all samples in the same
dataset. The mapping rate of a sample is calculated as the number of mapped reads divided by the total number of reads in the sample.
Figure S5: Barplots of AUC scores for predicting disease status (case/control) trained by random forests on known species abundance profiles classified by the NCBI RefSeq (orange) and UHGG (blue) reference databases in three experimental designs: within-dataset (subfigure A), cross-dataset (subfigure B), and leave-one-dataset-out (LODO) (subfigure C). AUC scores of the within-dataset and cross-dataset analyses are averages from 30 independent repetitions, while AUC scores of the LODO analysis are averages from 10 independent repetitions.

B Supplementary Materials for Chapter 3
Figure S6: Realistic applications of merging and integration methods on multiple colorectal cancer metagenomic datasets using RF classifiers from six individual leave-one-dataset-out (LODO) experiments. The results by different methods are grouped into six groups. "Single learner": Each of the five training datasets was trained independently with an RF classifier and predicted on the test dataset, then the average AUC score was taken over the five predictions. "Merged": Merging method pooling all five training datasets into one training set. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": The five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": The five training predictors were integrated by ensemble weighted learning methods under the ComBat normalization setting. "Rank aggregation": The five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": The five training predictors were integrated by rank aggregation methods under the ComBat normalization setting. The red dots and associated values on the figure are the mean AUC scores for each method, the vertical bars are the median AUC scores for each method, and the black dots represent outliers. The same method under different settings is shown in the same boxplot color. All the experiments were repeated 30 times for each test dataset, and the results presented in the figure are based on the average AUC scores of the total 180 replications over the six test datasets.
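As a point of reference, a minimal R sketch of the "Merged" strategy under the ComBat normalization setting is given below, using the sva [78] and randomForest [33] packages. The objects merged_abund (features in rows, samples in columns), study (the study label of each sample), and labels (case/control status) are hypothetical placeholders; this is an illustration under simplified assumptions, not the exact pipeline used in our experiments.

library(sva)
library(randomForest)

# Remove study-specific batch effects from the pooled training data;
# ComBat expects features in rows and samples in columns.
normalized <- ComBat(dat = merged_abund, batch = study)

# Train a single random forests classifier on the merged, normalized data.
rf <- randomForest(x = t(normalized), y = as.factor(labels))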
Figure S7: Realistic applications of merging and integration methods on multiple tuberculosis genomic datasets using RF classifiers from six individual leave-one-dataset-out (LODO) experiments. The results by different methods are grouped into six groups. "Single learner": Each of the five training datasets was trained independently with an RF classifier and predicted on the test dataset, then the average AUC score was taken over the five predictions. "Merged": Merging method pooling all five training datasets into one training set. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": The five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": The five training predictors were integrated by ensemble weighted learning methods under the ComBat normalization setting. "Rank aggregation": The five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": The five training predictors were integrated by rank aggregation methods under the ComBat normalization setting. The red dots and associated values on the figure are the mean AUC scores for each method, the vertical bars are the median AUC scores for each method, and the black dots represent outliers. The same method under different settings is shown in the same boxplot color. All the experiments were repeated 30 times for each test dataset, and the results presented in the figure are based on the average AUC scores of the total 180 replications over the six test datasets.
Figure S8: ComBat normalization increases cross-study prediction when the training and test data
have different feature distributions with LASSO classifiers. The figures show the AUC predictions of LASSO using different integration methods with three different disease effect factors. Columns represent different values of α. All the method names without a suffix of "ComBat" are the methods carried out in the naive setting, while the names with a suffix of "ComBat" were carried out in the ComBat normalization setting. All the experiments were repeated 100 times and the AUC scores shown in the figure are the averages over the 100 trials. However, the differences between the results with and without normalization are not as dramatic as for other machine learning classifiers, indicating the robustness of LASSO with respect to population differences.
Figure S9: ComBat normalization showed similar low cross-study prediction accuracy compared
to no normalization when the training and test data have different feature distributions with SVM classifiers. The figures show the AUC predictions of SVM with a polynomial kernel using different integration methods with three different disease effect factors. Columns represent different values of α. All the method names without a suffix of "ComBat" are the methods carried out in the naive setting, while the names with a suffix of "ComBat" were carried out in the ComBat normalization setting. All the experiments were repeated 100 times and the AUC scores shown in the figure are the averages over the 100 trials.
Figure S10: ComBat normalization markedly increases cross-study prediction when the training
and test data have different feature distributions with XGBoost classifiers. The figures show the AUC predictions of XGBoost using different integration methods with three different disease effect factors. Columns represent different values of α. All the method names without a suffix of "ComBat" are the methods carried out in the naive setting, while the names with a suffix of "ComBat" were carried out in the ComBat normalization setting. All the experiments were repeated 100 times and the AUC scores shown in the figure are the averages over the 100 trials.
Figure S11: ComBat normalization markedly increases cross-study prediction when the studies
have batch effects when using LASSO classifiers. The figures show the AUC predictions of LASSO using different integration methods with various batch severity levels. A: AUC score comparisons with different severity levels of additive batch effects on the mean of OTU abundances, with no multiplicative batch effect on the variance. B: AUC score comparisons with different severity levels of multiplicative batch effects on the variance of OTU abundances, with no additive batch effect on the mean. The disease effect factor was set to 1.025 for both situations. All the method names without a suffix of "ComBat" are the methods carried out in the naive setting, while the names with a suffix of "ComBat" were carried out in the ComBat normalization setting. All the experiments were repeated 100 times and the AUC scores shown in the figure are the averages over the 100 trials.
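For intuition, the sketch below illustrates one way additive and multiplicative batch effects of a chosen severity could be injected into a simulated OTU abundance matrix. It is a simplified illustration under our own assumptions, with hypothetical object names, and is not the exact simulation procedure used for these figures.

# otu: abundance matrix for the samples of one batch
# (samples in rows, OTUs in columns); severity controls the effect size.
add_batch_effect <- function(otu, severity, additive = TRUE) {
  if (additive) {
    # OTU-specific shifts of the mean abundance.
    shift <- rnorm(ncol(otu), mean = 0, sd = severity)
    sweep(otu, 2, shift, "+")
  } else {
    # OTU-specific scale factors centered at 1, inflating the variance.
    fac <- exp(rnorm(ncol(otu), mean = 0, sd = severity))
    sweep(otu, 2, fac, "*")
  }
}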
Figure S12: ComBat normalization slightly increases cross-study prediction accuracy when the
studies have batch effects when using SVM classifiers. The figures show the AUC predictions of SVM with a polynomial kernel using different integration methods with various batch severity levels. A: AUC score comparisons with different severity levels of additive batch effects on the mean of OTU abundances, with no multiplicative batch effect on the variance. B: AUC score comparisons with different severity levels of multiplicative batch effects on the variance of OTU abundances, with no additive batch effect on the mean. The disease effect factor was set to 1.025 for both situations. All the method names without a suffix of "ComBat" are the methods carried out in the naive setting, while the names with a suffix of "ComBat" were carried out in the ComBat normalization setting. All the experiments were repeated 100 times and the AUC scores shown in the figure are the averages over the 100 trials.
Figure S13: ComBat normalization markedly increases cross-study prediction when the studies
have batch effects when using XGBoost classifiers. The figures show the AUC predictions of XGBoost using different integration methods with various batch severity levels. A: AUC score comparisons with different severity levels of additive batch effects on the mean of OTU abundances, with no multiplicative batch effect on the variance. B: AUC score comparisons with different severity levels of multiplicative batch effects on the variance of OTU abundances, with no additive batch effect on the mean. The disease effect factor was set to 1.025 for both situations. All the method names without a suffix of "ComBat" are the methods carried out in the naive setting, while the names with a suffix of "ComBat" were carried out in the ComBat normalization setting. All the experiments were repeated 100 times and the AUC scores shown in the figure are the averages over the 100 trials.
Figure S14: ComBat normalization markedly increases cross-study prediction when the training
and test data have different disease models when using LASSO classifiers. The figures show the AUC predictions of LASSO using different integration methods with various numbers of overlapping disease-associated OTUs. The disease effect factor was set to 1.075. Columns represent different numbers of overlapping disease-associated OTUs in the training and test data; the larger the number, the more similar the two disease models are. When the number reaches 10, the two models are the same in the training and test data. All the experiments were repeated 100 times and the AUC scores shown in the figure are the averages over the 100 trials.
Figure S15: ComBat normalization markedly increases cross-study prediction when the training
and test data have different disease models when using SVM classifiers. The figures show the AUC predictions of SVM with a polynomial kernel using different integration methods with various numbers of overlapping disease-associated OTUs. The disease effect factor was set to 1.075. Columns represent different numbers of overlapping disease-associated OTUs in the training and test data; the larger the number, the more similar the two disease models are. When the number reaches 10, the two models are the same in the training and test data. All the experiments were repeated 100 times and the AUC scores shown in the figure are the averages over the 100 trials.
Figure S16: ComBat normalization markedly increases cross-study prediction accuracy when the training and test data follow different disease models, using XGBoost classifiers. The figures show the AUC scores of XGBoost predictions using different integration methods with varying numbers of overlapping disease-associated OTUs. The disease effect factor was set to 1.075. Columns represent different numbers of overlapping disease-associated OTUs between the training and test data; the larger the number, the more similar the two disease models. When the number reaches 10, the training and test models are identical. All experiments were repeated 100 times, and the AUC scores shown in the figure are the averages over the 100 trials.
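The overlap sweep in Figures S14-S16 can be pictured with a small Python sketch: spike a disease effect factor into a chosen set of disease-associated OTUs for case samples, then vary how many of the 10 training-model OTUs reappear in the test model. The function name, the pool of 200 OTUs, and the model sizes below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)

    def spike_disease_effect(X, disease_otus, effect=1.075):
        # Multiply the disease-associated OTUs of case samples by the disease
        # effect factor, then renormalize rows back to relative abundances.
        X_case = X.astype(float).copy()
        X_case[:, disease_otus] *= effect
        return X_case / X_case.sum(axis=1, keepdims=True)

    # Sweep of the overlap between the training and test disease models
    # (hypothetical pool of 200 OTUs; 10 disease-associated OTUs per model):
    pool = rng.choice(200, size=20, replace=False)
    train_otus = pool[:10]
    for n_overlap in range(11):                 # n_overlap = 10: identical models
        test_otus = np.concatenate([train_otus[:n_overlap],
                                    pool[10:20 - n_overlap]])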
Figure S17: Realistic applications of merging and integration methods on multiple CRC metagenomic datasets and TB gene expression datasets using LASSO classifiers, with average AUC scores from six individual leave-one-dataset-out (LODO) experiments. A: LODO average AUC score comparisons among different methods on the colorectal cancer metagenomic datasets. B: LODO average AUC score comparisons among different methods on the tuberculosis gene expression datasets. The results are organized into six groups. "Single learner": each of the five training datasets was trained independently with a LASSO classifier and used to predict on the test dataset, and the AUC scores were averaged over the five predictions. "Merged": the merging method, pooling all five training datasets into one training set. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the same, under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the same, under the ComBat normalization setting. The red dots and associated values are the mean AUC scores for each method; the vertical bars mark the median AUC scores; the black dots represent outliers. The same method under different settings is shown in the same boxplot color. All experiments were repeated 30 times per test dataset, and the results are based on the average AUC scores over the 180 total replications across the six test datasets, for the metagenomic and gene expression studies respectively.
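A minimal Python sketch of the LODO protocol behind the "Single learner" and "Merged" groups, using scikit-learn with an L1-penalized logistic regression as a LASSO-style stand-in; the helper name and the classifier settings are assumptions, not the dissertation's exact pipeline.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    def lodo_auc(datasets):
        # datasets: list of (X, y) pairs, one per study. For each held-out
        # study, return the 'Merged' AUC (all other studies pooled) and the
        # mean 'Single learner' AUC across the remaining studies.
        make_clf = lambda: LogisticRegression(penalty="l1", solver="liblinear")
        results = []
        for i, (X_test, y_test) in enumerate(datasets):
            train = [d for j, d in enumerate(datasets) if j != i]
            X_pool = np.vstack([X for X, _ in train])
            y_pool = np.concatenate([y for _, y in train])
            merged = roc_auc_score(
                y_test, make_clf().fit(X_pool, y_pool).predict_proba(X_test)[:, 1])
            singles = [roc_auc_score(
                y_test, make_clf().fit(X, y).predict_proba(X_test)[:, 1])
                for X, y in train]
            results.append((merged, float(np.mean(singles))))
        return results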
Figure S18: Realistic applications of merging and integration methods on multiple CRC metagenomic datasets and TB gene expression datasets using SVM classifiers, with average AUC scores from six individual leave-one-dataset-out (LODO) experiments. A: LODO average AUC score comparisons among different methods on the colorectal cancer metagenomic datasets. B: LODO average AUC score comparisons among different methods on the tuberculosis gene expression datasets. The results are organized into six groups. "Single learner": each of the five training datasets was trained independently with an SVM classifier and used to predict on the test dataset, and the AUC scores were averaged over the five predictions. "Merged": the merging method, pooling all five training datasets into one training set. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the same, under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the same, under the ComBat normalization setting. The red dots and associated values are the mean AUC scores for each method; the vertical bars mark the median AUC scores; the black dots represent outliers. The same method under different settings is shown in the same boxplot color. All experiments were repeated 30 times per test dataset, and the results are based on the average AUC scores over the 180 total replications across the six test datasets, for the metagenomic and gene expression studies respectively.
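The "Ensemble learning" groups weight the five study-specific predictors before averaging their test predictions. One plausible weighting, sketched below in Python, scores each model by its AUC on the other training studies and zeroes out worse-than-chance learners; this cross-study-validation scheme is an assumption about the weighting, not a confirmed reproduction of the dissertation's method.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def ensemble_weighted_predict(models, train_sets, X_test):
        # Weight each study-specific model by how well it transfers to the
        # OTHER training studies, then take the weighted average of the
        # predicted case probabilities on the test set.
        weights = []
        for i, model in enumerate(models):
            aucs = [roc_auc_score(y, model.predict_proba(X)[:, 1])
                    for j, (X, y) in enumerate(train_sets) if j != i]
            weights.append(max(np.mean(aucs) - 0.5, 0.0))   # drop worse-than-chance
        w = np.asarray(weights)
        w = w / w.sum() if w.sum() > 0 else np.full(len(models), 1.0 / len(models))
        preds = np.column_stack([m.predict_proba(X_test)[:, 1] for m in models])
        return preds @ w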
Figure S19: Realistic applications of merging and integration methods on multiple CRC metagenomic datasets and TB gene expression datasets using XGBoost classifiers, with average AUC scores from six individual leave-one-dataset-out (LODO) experiments. A: LODO average AUC score comparisons among different methods on the colorectal cancer metagenomic datasets. B: LODO average AUC score comparisons among different methods on the tuberculosis gene expression datasets. The results are organized into six groups. "Single learner": each of the five training datasets was trained independently with an XGBoost classifier and used to predict on the test dataset, and the AUC scores were averaged over the five predictions. "Merged": the merging method, pooling all five training datasets into one training set. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the same, under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the same, under the ComBat normalization setting. The red dots and associated values are the mean AUC scores for each method; the vertical bars mark the median AUC scores; the black dots represent outliers. The same method under different settings is shown in the same boxplot color. All experiments were repeated 30 times per test dataset, and the results are based on the average AUC scores over the 180 total replications across the six test datasets, for the metagenomic and gene expression studies respectively.
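The "Rank aggregation" groups combine predictors on the rank scale rather than the probability scale. A mean-rank version is sketched below in Python; the dissertation evaluates several rank-aggregation variants, so treat this simplest member of the family as illustrative. Because AUC depends only on the ordering of scores, aggregating ranks sidesteps calibration differences between study-specific models.

    import numpy as np
    from scipy.stats import rankdata

    def rank_aggregate(pred_matrix):
        # pred_matrix: (n_samples, n_predictors) predicted probabilities.
        # Rank each predictor's scores, then average ranks across predictors;
        # a higher aggregated rank means a more case-like sample.
        ranks = np.column_stack([rankdata(pred_matrix[:, k])
                                 for k in range(pred_matrix.shape[1])])
        return ranks.mean(axis=1)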
Figure S20: Realistic applications of merging and integration methods on multiple colorectal cancer metagenomic datasets using LASSO classifiers, from six individual leave-one-dataset-out (LODO) experiments. The results are organized into six groups. "Single learner": each of the five training datasets was trained independently with a LASSO classifier and used to predict on the test dataset, and the AUC scores were averaged over the five predictions. "Merged": the merging method, pooling all five training datasets into one training set. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the same, under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the same, under the ComBat normalization setting. The red dots and associated values are the mean AUC scores for each method; the vertical bars mark the median AUC scores; the black dots represent outliers. The same method under different settings is shown in the same boxplot color. All experiments were repeated 30 times per test dataset, and the results are based on the average AUC scores over the 180 total replications across the six test datasets.
Figure S21: Realistic applications of merging and integration methods on multiple tuberculosis gene expression datasets using LASSO classifiers, from six individual leave-one-dataset-out (LODO) experiments. The results are organized into six groups. "Single learner": each of the five training datasets was trained independently with a LASSO classifier and used to predict on the test dataset, and the AUC scores were averaged over the five predictions. "Merged": the merging method, pooling all five training datasets into one training set. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the same, under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the same, under the ComBat normalization setting. The red dots and associated values are the mean AUC scores for each method; the vertical bars mark the median AUC scores; the black dots represent outliers. The same method under different settings is shown in the same boxplot color. All experiments were repeated 30 times per test dataset, and the results are based on the average AUC scores over the 180 total replications across the six test datasets.
Figure S22: Realistic applications of merging and integration methods on multiple colorectal cancer metagenomic datasets using SVM classifiers, from six individual leave-one-dataset-out (LODO) experiments. The results are organized into six groups. "Single learner": each of the five training datasets was trained independently with an SVM classifier and used to predict on the test dataset, and the AUC scores were averaged over the five predictions. "Merged": the merging method, pooling all five training datasets into one training set. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the same, under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the same, under the ComBat normalization setting. The red dots and associated values are the mean AUC scores for each method; the vertical bars mark the median AUC scores; the black dots represent outliers. The same method under different settings is shown in the same boxplot color. All experiments were repeated 30 times per test dataset, and the results are based on the average AUC scores over the 180 total replications across the six test datasets.
Figure S23: Realistic applications of merging and integration methods on multiple tuberculosis gene expression datasets using SVM classifiers, from six individual leave-one-dataset-out (LODO) experiments. The results are organized into six groups. "Single learner": each of the five training datasets was trained independently with an SVM classifier and used to predict on the test dataset, and the AUC scores were averaged over the five predictions. "Merged": the merging method, pooling all five training datasets into one training set. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the same, under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the same, under the ComBat normalization setting. The red dots and associated values are the mean AUC scores for each method; the vertical bars mark the median AUC scores; the black dots represent outliers. The same method under different settings is shown in the same boxplot color. All experiments were repeated 30 times per test dataset, and the results are based on the average AUC scores over the 180 total replications across the six test datasets.
Figure S24: Realistic applications of merging and integration methods on multiple colorectal cancer metagenomic datasets using XGBoost classifiers, from six individual leave-one-dataset-out (LODO) experiments. The results are organized into six groups. "Single learner": each of the five training datasets was trained independently with an XGBoost classifier and used to predict on the test dataset, and the AUC scores were averaged over the five predictions. "Merged": the merging method, pooling all five training datasets into one training set. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the same, under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the same, under the ComBat normalization setting. The red dots and associated values are the mean AUC scores for each method; the vertical bars mark the median AUC scores; the black dots represent outliers. The same method under different settings is shown in the same boxplot color. All experiments were repeated 30 times per test dataset, and the results are based on the average AUC scores over the 180 total replications across the six test datasets.
Figure S25: Realistic applications of merging and integration methods on multiple tuberculosis gene expression datasets using XGBoost classifiers, from six individual leave-one-dataset-out (LODO) experiments. The results are organized into six groups. "Single learner": each of the five training datasets was trained independently with an XGBoost classifier and used to predict on the test dataset, and the AUC scores were averaged over the five predictions. "Merged": the merging method, pooling all five training datasets into one training set. The "Single learner" and "Merged" experiments were conducted under both the naive and ComBat normalization settings. "Ensemble learning": the five training predictors were integrated by ensemble weighted learning methods under the naive setting. "Ensemble learning (normalized)": the same, under the ComBat normalization setting. "Rank aggregation": the five training predictors were integrated by rank aggregation methods under the naive setting. "Rank aggregation (normalized)": the same, under the ComBat normalization setting. The red dots and associated values are the mean AUC scores for each method; the vertical bars mark the median AUC scores; the black dots represent outliers. The same method under different settings is shown in the same boxplot color. All experiments were repeated 30 times per test dataset, and the results are based on the average AUC scores over the 180 total replications across the six test datasets.
Figure S26: AUC results of Scenario 1: divergent training and test populations with varied genomic feature distributions, with constant parameter C = 10^5 in the Dirichlet simulations. The figures display the AUC scores of RF predictions employing various integration methods under three distinct disease effect factors. Columns represent different values of α. Method names without a "ComBat" suffix refer to those implemented in the naive setting, while those with a "ComBat" suffix were run in the ComBat normalization setting. All experiments were conducted 100 times, and the AUC scores presented in the figure are the averages from these 100 trials. Abbreviations: Ensem. learn., ensemble learning; Rank aggr., rank aggregation; norm., normalization.
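For readers unfamiliar with the concentration parameter C in Figures S26-S28: in a Dirichlet-multinomial simulation, larger C pulls each sample's composition toward the population proportions, so C = 10^5 yields tightly clustered samples. A minimal Python sketch follows; the function name, sequencing depth, and sample count are illustrative assumptions.

    import numpy as np

    def simulate_counts(base_props, C=1e5, depth=100_000, n_samples=50, seed=7):
        # Dirichlet-multinomial draws: the concentration C * base_props
        # controls sample-to-sample variability, so a large C keeps every
        # sample's composition close to the population proportions.
        rng = np.random.default_rng(seed)
        alpha = C * np.asarray(base_props, dtype=float)
        probs = rng.dirichlet(alpha, size=n_samples)   # one composition per sample
        return np.array([rng.multinomial(depth, p) for p in probs])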
Figure S27: AUC results of Scenario 2: different batch effects on the training data with a consistent underlying population genomic feature distribution, with constant parameter C = 10^5 in the Dirichlet simulations. The figures show the AUC scores of RF predictions using different integration methods under varying severity levels of batch effects. A: AUC score comparisons under different severity levels of additive batch effects on the mean of OTU abundances, with no multiplicative batch effect on the variance. B: AUC score comparisons under different severity levels of multiplicative batch effects on the variance of OTU abundances, with no additive batch effect on the mean. The disease effect factor was set to 1.025 in both scenarios. Integration methods without a "ComBat" suffix were applied in the naive setting, while those with a "ComBat" suffix were applied after ComBat normalization. The experiments were repeated 100 times, and the AUC scores shown are the averages across the 100 trials. Abbreviations are the same as in Figure S26.
Figure S28: AUC results of Scenario 3: different phenotype-associated feature models across studies, with constant parameter C = 10^5 in the Dirichlet simulations. The figures show the AUC scores of RF predictions using different integration methods with varying numbers of overlapping disease-associated OTUs. The disease effect factor was set to 1.075. Columns represent different numbers of overlapping disease-associated OTUs between the training and test data; the larger the number, the more similar the two disease models. When the number reaches 10, the training and test models are identical. All experiments were repeated 100 times, and the AUC scores shown in the figure are the averages over the 100 trials. Abbreviations are the same as in Figure S26.
Abstract
The human microbiome, an intricate ecosystem of microorganisms within and on us, significantly influences our health and disease states. The emergence of next-generation sequencing (NGS) technologies has broadened our understanding, providing a comprehensive perspective on the complex interplay between microbes and humans. This understanding is further enriched by the evolution of bioinformatics tools, deepening our grasp of the functional roles and relationships within the microbiome. However, challenges remain, especially in predicting sample phenotypes across studies, owing to dataset heterogeneity.
In the first part of this dissertation, we carried out a colorectal cancer (CRC)-microbiome predictive analysis using extensive metagenomic datasets. Our exploration spanned various microbial taxonomic profiling tools to extract relative abundances of recognized microbial genomes. We also integrated mapping and assembly techniques to capture the relative abundance profiles of both known and novel genomes. Utilizing the random forests (RF) classification algorithm, we aimed to predict colorectal cancer status based on these microbial profiles. Our findings indicated superior prediction performance using profiles estimated by Centrifuge compared to those by MetaPhlAn2 and Bracken. Additionally, we introduced a novel method integrating the relative abundance profiles of both known and novel microbial entities, enhancing CRC prediction from metagenomes.
In the second part, our focus shifted to disease case/control phenotype prediction by integrating multiple heterogeneous studies. We implemented a comprehensive workflow to simulate diverse heterogeneity types and assess various integration methods, emphasizing batch normalization using ComBat. Applying our methods to six CRC metagenomic studies and six tuberculosis (TB) gene expression studies, we demonstrated the critical role of batch normalization in enhancing phenotype prediction performance by machine learning classifiers across multiple heterogeneous datasets. Alongside normalization, both merging strategy and ensemble weighted learning methods emerged as potent tools to bolster prediction performance. We also posited rank aggregation methods as a robust alternative, showcasing resilience akin to ensemble weighted learning approaches.
In the final part, we extended our investigation from the second part to quantitative phenotype prediction. We evaluated a range of normalization and integration strategies to counteract heterogeneity effects on cross-study quantitative phenotype predictions. Through simulations and real-world data applications, we established that combining normalization with integration effectively neutralizes heterogeneity stemming from batch effects in similar populations, thereby refining prediction accuracy.
Our research underscores the pivotal role of addressing study heterogeneity to ensure robust predictions. These insights pave the way for future endeavors in microbiome-based disease prediction and treatment.