FEATURE ENGINEERING AND SUPERVISED LEARNING ON
METAGENOMIC SEQUENCE DATA
by
Mengge Zhang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
December 2019
Copyright 2019 Mengge Zhang
Dedication
To my family.
Acknowledgments
First of all, I would like to express my most sincere appreciation to Professor
Fengzhu Sun, my advisor and committee chair, who guided me tremendously and
supported me unreservedly throughout my PhD study. His devotion to science,
enthusiasm for life, and persistent encouragement of his students will benefit me
for my whole life. Without his guidance and effort this thesis would not have been
completed.
My heartiest gratitude also goes to all my qualification and dissertation committee
members, Professor Michael S. Waterman, Professor Jed Fuhrman, Professor Liang
Chen and Professor Remo Rohs, for their wise suggestions and generous help.
In addition to my appreciation for Professor Fengzhu Sun and Professor Jed
Fuhrman, who led the collaborative research of our virus-host association study, I am
also grateful to my collaborators Dr. Jie Ren, Dr. Nathan Ahlgren and Dr. Lianping
Yang for their efforts and insights in our project.
Furthermore, I must thank all my friends and colleagues during my PhD study
at USC, from whom I gained so much help and encouragement.
I want to express my deepest gratitude to my mother Jinhua Cai and my father
Dejia Zhang, who raised me with unconditional love and supported me no matter
what. I want to thank my brother Xinghua Zhang, who set an example for me and
guided me all along my way. I am also grateful for the love and attachment of
my dearest daughter Hannah, who gives me courage as a mother. Finally, I want
to dedicate my special appreciation to my beloved husband Yichao Dong, who
accompanied me through good times as well as adversity, and supported me with
his love at all times.
Contents
Dedication ii
Acknowledgments iii
List of Figures viii
List of Tables xliii
Abstract xlix
1 Introduction 1
2 Study of virus-host infectious relationship with supervised learning methods 3
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Feature definitions . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Supervised learning methods . . . . . . . . . . . . . . . . . . 9
Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . 9
Support vector machine with RBF kernel . . . . . . . . . . . 10
Random forest . . . . . . . . . . . . . . . . . . . . . . . . . 11
Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.4 Evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.5 Estimating the fraction of viruses infecting a host in viral
tagging experiments . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Comparison of different supervised learning methods . . . . 13
Potential explanations for the performance variation across
different host genera . . . . . . . . . . . . . . . . . 15
2.3.2 Comparison between RF, Manhattan and d2* dissimilarity measures . . . . . . . . . . 18
Identification of hosts of viral contigs in metagenomic studies 19
2.3.3 Estimation of the reliability of observed virus-host infectious
associations from viral-tagging experiments . . . . . . . . . . 22
3 Prediction of colorectal cancer by alignment-free supervised learning methods 26
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 Clinical indicators as potential confounders for CRC prediction 30
3.2.3 Feature definition . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.4 Removing human DNA from metagenomic samples . . . . . 35
3.2.5 Supervised learning methods . . . . . . . . . . . . . . . . . . 36
LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Deep Learning (Multilayer Perceptron) . . . . . . . . . . . . 37
3.2.6 Bootstrapping and subsampling . . . . . . . . . . . . . . . . 38
3.2.7 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.1 Comparison of different supervised learning methods and
features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Intra-dataset CRC prediction . . . . . . . . . . . . . . . . . 46
Cross-datasets CRC prediction . . . . . . . . . . . . . . . . 51
Combined-datasets CRC prediction . . . . . . . . . . . . . . 59
3.3.2 Subsampling of metagenomic sequencing data and CRC pre-
diction with subsampled data . . . . . . . . . . . . . . . . . 64
Evaluation of subsampled metagenomic sequencing data . . 64
CRC prediction with subsampled data . . . . . . . . . . . . 67
3.3.3 Comparison of k-mer based method and OTU-based methods 73
4 Conclusions 76
4.1 Virus-host infectious association . . . . . . . . . . . . . . . . . . . . 76
4.2 CRC prediction by alignment-free supervised learning method . . . 78
Reference List 80
A Supplementary Materials for Chapter 2 87
A.1 Supplementary Methods . . . . . . . . . . . . . . . . . . . . . . . . 87
A.1.1 AUC score . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
A.1.2 Gradient boost . . . . . . . . . . . . . . . . . . . . . . . . . 87
A.1.3 Random forest . . . . . . . . . . . . . . . . . . . . . . . . . 88
A.1.4 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
A.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.3 Supplementary Tables . . . . . . . . . . . . . . . . . . . . . . . . . 92
B Supplementary Materials for Chapter 3 114
B.1 Supplementary Methods . . . . . . . . . . . . . . . . . . . . . . . . 114
B.1.1 Two-proportion z-test . . . . . . . . . . . . . . . . . . . . . . 114
B.1.2 Mann-Whitney U-test . . . . . . . . . . . . . . . . . . . . . 115
B.1.3 Kullback–Leibler Divergence . . . . . . . . . . . . . . . . . . 115
B.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
B.3 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . 117
B.3.1 Clinical indicator distributions . . . . . . . . . . . . . . . . . 117
B.3.2 AUC plots of CRC prediction with subsampled metagenomic sequences . . . 118
Intra-dataset CRC prediction with subsampled metagenomic sequences . . . 118
Cross-datasets CRC prediction with subsampled metagenomic sequences . . . 139
Combined-datasets CRC prediction with subsampled metagenomic sequences . . . 181
List of Figures
2.1 The average AUC scores of the different machine learning methods and features. The averaged AUC scores are calculated by averaging the AUC scores across different hosts and different genome background distributions. The panels from top to bottom show the performance for word lengths k = 4, k = 6 and k = 8. The black segments on top of the bars are the standard deviations of the AUC scores across different hosts and different genome background distributions. . . . 15
2.2 Scheme of the RF method for the prediction of hosts of viral contigs of different lengths. For each of the 9 main host genera, we produced 4 training datasets with different sequence lengths by breaking the whole viral genomes into nonoverlapping contigs of lengths 1kbps, 3kbps, 5kbps, and the whole genomes with 0.05% sequencing errors added; similarly, we also generated 4 different testing datasets with different contig lengths. We then evaluated the performance of RF for each training dataset with a specific sequence length on all 4 testing datasets. . . . 22
2.3 The histograms of the RF scores for the viral sequences in the negative and positive data sets for a) T4-like, c) non-T4-like, and e) viral contigs, respectively. The corresponding fitted density functions are given in b), d), and f), respectively. In all 6 subfigures, the horizontal axis is the prediction score. In (a), (c) and (e), the right y-axis indicates the fraction and the left y-axis indicates the fraction divided by the bin size. . . . 25
3.1 Gender ratios in 3 datasets. Bars from left to right are gender ratios of 1. CRC patients in NC dataset; 2. Healthy controls in NC dataset; 3. CRC patients in PO dataset; 4. Healthy controls in PO dataset; 5. CRC patients in MSB dataset; 6. Healthy controls in MSB dataset. NC, PO and MSB are shortened forms of the NatureCom, PlosOne and MolSysBio datasets, respectively. . . . 33
3.2 BMI and Age distributions for 3 datasets. (a)(b)(c) are age distributions for the NatureCom dataset, PlosOne dataset and MolSysBio dataset. (d)(e)(f) are BMI distributions for the NatureCom dataset, PlosOne dataset and MolSysBio dataset. . . . 34
3.3 Performance of different methods and features for intra-NatureCom dataset. This figure shows average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions and supervised learning methods including logistic regression, LASSO, random forest (RF), SVM with linear kernel, polynomial kernel and RBF kernel (SVM(LN), SVM(PLN), SVM(RBF)) as well as multilayer perceptron (MLP) within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all 20 training data/testing data splittings. . . . 48
3.4 Performance of different methods and features for intra-PlosOne dataset. This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions and supervised learning methods including logistic regression, LASSO, random forest (RF), SVM with linear kernel, polynomial kernel and RBF kernel (SVM(LN), SVM(PLN), SVM(RBF)) as well as multilayer perceptron (MLP) within the PlosOne (PO) dataset. The ? on top of each AUC bar indicates the highest AUC score in all 20 training data/testing data splittings. . . . 49
3.5 Performance of different methods and features for intra-MolSysBio dataset. This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions and supervised learning methods including logistic regression, LASSO, random forest (RF), SVM with linear kernel, polynomial kernel and RBF kernel (SVM(LN), SVM(PLN), SVM(RBF)) as well as multilayer perceptron (MLP) within the MolSysBio (MSB) dataset. The ? on top of each AUC bar indicates the highest AUC score in all 20 training data/testing data splittings. . . . 50
3.6 Performance of different methods and features for cross-NatureCom-PlosOne-datasets. This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions and supervised learning methods based on using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all 50 training data/testing data splittings. . . . 53
3.7 Performance of different methods and features for cross-NatureCom-MolSysBio-datasets. This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions and supervised learning methods based on using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all 50 training data/testing data splittings. . . . 54
3.8 Performance of different methods and features for cross-PlosOne-NatureCom-datasets. This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions and supervised learning methods based on using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all 50 training data/testing data splittings. . . . 55
3.9 Performance of different methods and features for cross-PlosOne-MolSysBio-datasets. This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions and supervised learning methods based on using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all 50 training data/testing data splittings. . . . 56
3.10 Performance of different methods and features for cross-MolSysBio-NatureCom-datasets. This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions and supervised learning methods based on using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all 50 training data/testing data splittings. . . . 57
3.11 Performance of different methods and features for cross-MolSysBio-PlosOne-datasets. This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions and supervised learning methods based on using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all 50 training data/testing data splittings. . . . 58
3.12 Performance of different methods and features for combined-NatureCom-PlosOne-datasets. This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions and supervised learning methods based on using the NatureCom (NC) and PlosOne (PO) datasets as training data, and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all 50 training data/testing data splittings. . . . 61
3.13 Performance of different methods and features for combined-NatureCom-MolSysBio-datasets. This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions and supervised learning methods based on using the NatureCom (NC) and MolSysBio (MSB) datasets as training data, and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all 50 training data/testing data splittings. . . . 62
3.14 Performance of different methods and features for combined-PlosOne-MolSysBio-datasets. This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions and supervised learning methods based on using the PlosOne (PO) and MolSysBio (MSB) datasets as training data, and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all 50 training data/testing data splittings. . . . 63
3.15 The self-inconsistency and unrepresentativeness of subsampling with sample size N ∈ {10kbps, 50kbps, 400kbps, 2000kbps} for k-mer lengths k ∈ {3, 6, 9}. . . . 66
B.1 Age distribution for all of the samples in three datasets. . . 117
B.2 BMI distribution for all of the samples in three datasets. . 117
B.3 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Logistic regression with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 118
B.4 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Logistic regression with different sampling sizes within the PlosOne (PO) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 119
B.5 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Logistic regression with different sampling sizes within the MolSysBio (MSB) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 120
B.6 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for LASSO with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 121
B.7 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for LASSO with different sampling sizes within the PlosOne (PO) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 122
B.8 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for LASSO with different sampling sizes within the MolSysBio (MSB) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 123
B.9 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Random forest with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 124
B.10 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Random forest with different sampling sizes within the PlosOne (PO) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 125
B.11 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Random forest with different sampling sizes within the MolSysBio (MSB) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 126
B.12 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 127
B.13 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 128
B.14 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 129
B.15 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 130
B.16 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 131
B.17 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 132
B.18 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 133
B.19 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 134
B.20 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 135
B.21 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Multilayer perceptron with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 136
B.22 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Multilayer perceptron with different sampling sizes within the PlosOne (PO) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 137
B.23 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Multilayer perceptron with different sampling sizes within the MolSysBio (MSB) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 138
B.24 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Logistic regression with different sampling sizes using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 139
B.25 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Logistic regression with different sampling sizes using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 140
B.26 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Logistic regression with different sampling sizes using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 141
B.27 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Logistic regression with different sampling sizes using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 142
B.28 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Logistic regression with different sampling sizes using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 143
B.29 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Logistic regression with different sampling sizes using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 144
B.30 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for LASSO with different sampling sizes using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 145
B.31 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for LASSO with different sampling sizes using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 146
B.32 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for LASSO with different sampling sizes using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 147
B.33 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for LASSO with different sampling sizes using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 148
B.34 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for LASSO with different sampling sizes using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 149
B.35 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for LASSO with different sampling sizes using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 150
B.36 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Random Forest with different sampling sizes using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 151
B.37 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Random Forest with different sampling sizes using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 152
B.38 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Random Forest with different sampling sizes using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 153
B.39 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Random Forest with different sampling sizes using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 154
B.40 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Random Forest with different sampling sizes using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 155
B.41 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Random Forest with different sampling sizes using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 156
B.42 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 157
B.43 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 158
B.44 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 159
B.45 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 160
B.46 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 161
B.47 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 162
B.48 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 163
B.49 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 164
B.50 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 165
B.51 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 166
B.52 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 167
B.53 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 168
B.54 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 169
B.55 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 170
B.56 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 171
B.57 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 172
B.58 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 173
B.59 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 174
B.60 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Multilayer perceptron with different sampling sizes using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 175
B.61 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Multilayer perceptron with different sampling sizes using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 176
B.62 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Multilayer perceptron with different sampling sizes using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 177
B.63 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Multilayer perceptron with different sampling sizes using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 178
B.64 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Multilayer perceptron with different sampling sizes using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 179
B.65 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Multilayer perceptron with different sampling sizes using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 180
B.66 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Logistic Regression with different sampling sizes combining NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 181
B.67 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Logistic Regression with different sampling sizes combining NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 182
B.68 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Logistic Regression with different sampling sizes combining PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 183
B.69 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for LASSO with different sampling sizes combining NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 184
B.70 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for LASSO with different sampling sizes combining NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 185
B.71 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for LASSO with different sampling sizes combining PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 186
B.72 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Random Forest with different sampling sizes combining NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 187
B.73 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Random Forest with different sampling sizes combining NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 188
B.74 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Random Forest with different sampling sizes combining PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 189
B.75 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes combining NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 190
B.76 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes combining NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 191
B.77 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes combining PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 192
B.78 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes combining NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 193
B.79 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes combining NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 194
B.80 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes combining PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 195
B.81 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes combining NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 196
B.82 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes combining NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 197
B.83 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes combining PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 198
B.84 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Multilayer perceptron with different sampling sizes combining NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 199
B.85 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Multilayer perceptron with different sampling sizes combining NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 200
B.86 Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain model over 4 different feature definitions for Multilayer perceptron with different sampling sizes combining PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings. . . . 201
List of Tables
2.1 Description of the training and test data. For a specific year,
the positive training data set contains viruses infecting the corre-
sponding host identified before the specific year and the positive
testing data set contains viruses infecting the corresponding host
discovered after the specific year. The negative training data and
the negative testing data were chosen randomly without overlaps
from the viruses that were not identified to infect the host. . . . . . 7
2.2 The AUC scores of using RF combined with the first feature with different word pattern lengths across 9 different hosts. The highest score for each host is highlighted in bold. . . . . 14
2.3 Average Manhattan distances of word relative frequency
vectors between pairs of viruses infecting each host. . . . . . 16
2.4 The distribution of viruses among three major viral fami-
lies, Myoviridae, Podoviridae and Siphoviridae, for viruses
infecting each of the nine host genera. The last column is the
entropy of the distribution. . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Comparison of AUC scores of RF (random forest) combined with word frequency vector with that based on Manhattan distance and the $d_2^*$ statistic when k = 6. For the background model of the $d_2^*$ statistic, we considered the independent identically distributed (i.i.d.) model, first and second order Markov chains. . . . . 20
2.6 The AUC scores of the RF method for the prediction of hosts of viral contigs with different lengths using the models built from contigs of different lengths. . . . . 23
3.1 Description of the datasets. This table reflects the three different datasets we focus on. The locations where the data were collected are shown in the last column of the table. For each cohort, the number of individuals in different diagnostic groups (CRC, Adenoma, Healthy) is listed, and the mean and standard deviation of Age as well as BMI for each group are also included. . . . . 31
3.2 Pearson correlations between different clinical indicators. This table reflects the Pearson correlations of the pairwise combinations of Age, Gender, BMI and CRC diagnosis for the NatureCom (NC) dataset, PlosOne (PO) dataset, MolSysBio (MSB) dataset and the combination of all three datasets (Complete). . . . . 32
3.3 Data partitionings in 3 different evaluation scenarios. This table shows the data partitionings for evaluating different supervised learning techniques and features in the intra-dataset, cross-datasets and combined-datasets scenarios. . . . . 42
3.4 Best-performing features and methods. This table is a summary of the best-performing features, k-mer lengths and supervised learning methods for the intra-dataset, cross-datasets and combined-datasets scenarios. NC, PO and MSB are shortened forms of the NatureCom, PlosOne and MolSysBio datasets; SVM (LN) and SVM (RBF) are the shortened forms of SVM with linear and RBF kernels, respectively. . . . . 44
3.5 Performance of LASSO and SVM (Linear Kernel). This table shows the comparison of the highest AUC scores over all feature-method combinations with the AUC scores of $F_1$ and $F_4$ at k = 6 and k = 9 under the i.i.d. model for LASSO and SVM (Linear Kernel) in the intra-dataset, cross-datasets and combined-datasets scenarios. . . . . 45
3.6 Best-performing features and methods with subsampling in the intra-dataset scenario. This table is a summary of the best-performing features, k-mer lengths, background models and corresponding AUC scores for each supervised learning method based on subsampled metagenomic sequences with sampling sizes 10 kbps, 50 kbps, 400 kbps, 2000 kbps as well as non-sampled sequences (WGS) in the intra-dataset scenario. NC, PO and MSB are shortened forms of the NatureCom, PlosOne and MolSysBio datasets, respectively. SVM (LN), SVM (PL), SVM (RBF), RF and MLP are shortened forms of SVM with linear kernel, SVM with polynomial kernel, SVM with RBF kernel, Random Forest and Multilayer perceptron, respectively. F, k, m and s represent the feature, k-mer length, Markov order of the background model and AUC score, respectively. . . . . 69
3.7 Best-performing features and methods with subsampling in the cross-datasets scenario. This table is a summary of the best-performing features, k-mer lengths, background models and corresponding AUC scores for each supervised learning method based on subsampled metagenomic sequences with sampling sizes 10 kbps, 50 kbps, 400 kbps, 2000 kbps as well as non-sampled sequences (WGS) in the cross-datasets scenario. NC, PO and MSB are shortened forms of the NatureCom, PlosOne and MolSysBio datasets, respectively. SVM (LN), SVM (PL), SVM (RBF), RF and MLP are shortened forms of SVM with linear kernel, SVM with polynomial kernel, SVM with RBF kernel, Random Forest and Multilayer perceptron, respectively. F, k, m and s represent the feature, k-mer length, Markov order of the background model and AUC score, respectively. . . . . 70
3.8 Best-performing features and methods with subsampling in the combined-datasets scenario. This table is a summary of the best-performing features, k-mer lengths, background models and corresponding AUC scores for each supervised learning method based on subsampled metagenomic sequences with sampling sizes 10k, 50k, 400k, 2000k as well as non-sampled sequences (WGS) in the combined-datasets scenario. NC, PO and MSB are shortened forms of the NatureCom, PlosOne and MolSysBio datasets, respectively. SVM (LN), SVM (PL), SVM (RBF), RF and MLP are shortened forms of SVM with Linear Kernel, SVM with Polynomial Kernel, SVM with RBF Kernel, Random Forest and Multilayer perceptron, respectively. F, k, m and s represent the feature, k-mer length, Markov order of the background model and AUC score, respectively. . . . . 72
3.9 The performances of the alignment-free methods vs. the performances of the original studies. This table reflects the prediction accuracies in terms of AUC scores of our k-mer based approach and the approaches used in the original studies where the datasets were first produced. The AUC scores in the parentheses are the ones that have higher AUC scores with subsampled metagenomic sequencing data compared with using the whole metagenomic sequences. . . . . 73
3.10 Comparison of performance for OTU-based methods and the alignment-free method. This table shows the comparison of accuracies in terms of AUC scores of our k-mer-based CRC prediction method (Feature $F_1$ of Logistic regression with the i.i.d. model) and OTU-based methods for the intra-dataset, cross-datasets and combined-datasets evaluations. The AUC scores in the parentheses are the ones with subsampled metagenomic sequencing data that outperformed the performance of using the whole metagenomic sequences. . . . . 75
A.1 Number of viral genomes found to have infected the corresponding host bacterium genus up to the respective year. The NCBI phage genome database currently contains 1,426 completely sequenced viral genomes with precisely identified hosts. Among all the bacteria at the genus level, we focus on 9 bacterial genera with at least 45 identified infectious viruses, and together they have been identified as the hosts of 836 out of 1,426 complete viral genomes. . . . . 92
A.2 AUC scores of unsupervised learning based on the average Manhattan distance and $d_2^*$ dissimilarity of viruses in the testing data with the viruses in the positive training data. Three different word lengths: k = 4, k = 6, and k = 8. Four different background models were applied for the $d_2^*$ dissimilarity. . . . . 93
A.3 The average AUC scores for all the feature-method combinations. Three different word lengths (4, 6, 8) and four different background models (i.i.d. model, 1st, 2nd and 3rd order Markov chains) were used. For each feature-method combination trained purely on the training data, we apply it to the testing data and calculate the AUC score. We repeated our pipeline 50 times with different negative training data and negative testing data each time, and then calculated the average AUC score. . . . . 96
Abstract
The goal of this dissertation is to uncover the potential of features associated
with virus-host interaction and colorectal cancer in metagenomic studies. This
dissertation is comprised of two parts.
The first part is a study of virus-host infectious association by supervised learn-
ing methods. Uncovering the virus-host infectious association is important for
understanding the functions and dynamics of microbial communities. Both cel-
lular and fractionated viral metagenomic data generate a large number of viral
contigs with missing host information. Although relatively simple methods based on
the similarity between the word frequency vectors of viruses and bacterial hosts
have been developed to study virus-host associations, the problem is significantly
understudied. We hypothesize that machine learning methods based on word fre-
quencies can be efficiently used to study virus-host infectious associations.
The second part is an investigation of prediction methods of colorectal can-
cer (CRC) versus healthy controls, using alignment-free methods to construct the
feature vectors. We first evaluated the prediction performance using 3 different
datasets, and compared with the state-of-the-art research results. The performance
of alignment-free approaches is comparable with that of alignment-based methods
with intra-dataset, cross-datasets and combined-datasets assessment methods. We
then investigated the changes of performance in CRC prediction while using the
shallow subsamples instead of the whole metagenomic sequencing data. Accord-
ing to our results, the shallow subsamples of the whole metagenomic sequences are
more powerful when predicting CRC based on supervised learning on metagenomic
sequences with missing data.
Chapter 1
Introduction
Viruses are the most abundant organisms on earth, with the number of viruses over
10-fold higher than the number of bacteria [1,2]. Viruses play important roles in
almost all domains of life due to their wide distribution in both the environment
and the body of living organisms [3,4] including water [5,6], soil and the human
body [3,7]. To produce progeny, viral particles must infect a living organism,
namely, the hosts, by first infecting the host cell and later hijacking the host cellular
replication mechanisms. Bacteria, archaea and animals are the natural virus hosts.
Viral infections often cause cellular and physiological changes in the host cells, for
example, altering the genomic sequences of their hosts [8], and sometimes causing
dysfunctions in the hosts [9–12].
The class of viruses that specifically infect bacteria is known as bacteriophages.
They are of special interest to ecologists and microbiologists because of the close
connection that bacteria have with the human health and the environment. For
example, the human microbiomes can be affected by bacteriophages [13]. Some
bacteriophages have been shown to alter the composition of microbial communities,
leading to changes in these communities.
Despite the importance of viruses in microbial communities, the mechanisms
of viruses infecting hosts are not fully understood. Metagenomic studies using
next generation sequencing (NGS) technologies such as the Human Microbiome
Project (HMP) [14,15] and the global ocean survey (GOS) [16] generated a large
number of short read data targeting total genomic (cellular) or fractionated virus
particles. Many viral sequences are generated without knowing what their hosts
are. This opens up an opportunity for the study of virus-host association by
utilizing this wealth of sequencing information. Thus, the primary objective
of the first project is the development of computational approaches for
the prediction of infectious associations between viruses and prokaryotic
hosts.
Chapter 2
Study of virus-host infectious
relationship with supervised
learning methods
2.1 Background
As emphasized in the introduction section, it is important to match viruses with
their prokaryotic hosts. This problem has not been heavily investigated before,
and we are aware of two studies [17,18] that did relevant work. Ahmed et
al. [18] developed a computational method based on "oligostickiness" for studying
virus-host infectious association relationship. However, the authors based their
studies on only 25 viruses and 7 bacterial hosts and the software is not available
(per communications with the authors). Roux et al. [17] used Manhattan distance
between the frequency vectors of word patterns (k-tuple, gram) for a virus and
a potential bacterial host to study their relationships and some promising results
were obtained. The study showed that the word frequency vectors of viruses con-
tain information about their hosts. Based on this study, we hypothesize that
machine learning methods based on the word frequency vectors of viruses can be
used to predict virus-host infectious associations more accurately.
In this study, we collected 1,426 completely sequenced viral genomes with pre-
cisely identified hosts from the NCBI phage genome database. Among all the
bacteria at the genus level, we focus on 9 bacterial genera, each of which contains
at least 45 viruses infecting the hosts, providing large sample sizes to optimize the
machine learning methods. Together they have been identified as the hosts of 836
out of 1,426 viral genomes (Table A.1). In addition, most of these 9 hosts have
been shown to play important roles in microbiome studies and they are also closely
related to human diseases [19–24]. Therefore, identification of viral sequences that
infect each of the 9 hosts from the vast amount of newly generated viral sequence
data has high significance.
It has been hypothesized and supported by relevant data that the word pattern
usage between viruses and their hosts is more similar than that between random
virus-host pairs [17]. This hypothesis is based on the fact that the virus is depen-
dent on the molecular machinery of its host to replicate, so the virus is expected
to adopt similar word pattern usage of the host, evolving to maximize replication.
Therefore, based on the hypothesis that infectious virus-host pairs have similar
word pattern usage, we decided to represent each virus by a feature vector based
on the word pattern usage. However, it is not clear what the best feature vector
representation should be. In this study, we studied four different feature vector
representations based on the counts of word patterns.
Then we investigated different supervised learning methods based on these fea-
ture vectors to predict viruses that potentially infect a particular host. In this
study, we investigated the supervised learning methods including logistic regres-
sion, SVM, random forest, Gaussian naive Bayes and Bernoulli naive Bayes [25].
We next built frameworks for all of the feature-method combinations. By applying
the frameworks on viral complete genome sequence data of the nine main host gen-
era, we identified the best feature-method combination based on the area under the
receiver operating characteristic curve (AUC) scores. We also studied the effect
of word length and genome sequence length on the accuracy of the prediction
methods.
New technologies such as viral tagging [26] have been developed to associate
viruses with particular hosts. Like all high-throughput biotechnologies, there are
many potential false positives (observed associations that are not due to infec-
tion) and false negatives (associations missed by the experiments). It is important
to estimate the fraction of true infectious associations among observed associa-
tions, and to separate true infectious associations from false ones. We applied our
approach to estimate the fraction of true infectious associations among observed
virus-host associations from viral tagging experiment data [26].
2.2 Materials and Methods
2.2.1 Data description
We downloaded 1,426 complete viral genome sequences with known host informa-
tion from the NCBI viral genome database. The NCBI viral data file contains the
genome sequence, the host of the virus, and the year it was identified. We focused
on 9 bacterial host genera that have at least 45 infectious viruses identified so that
enough data are available for learning. Table A.1 shows the number of known
infectious viruses identified up to each year from 2010 to 2015 for the 9 bacterial
genera.
For each of the 9 bacterial host genera, we built a model to predict new viruses
that are potentially capable of infecting the corresponding host. In order to eval-
uate the performance of a supervised learning method, we need to partition the
data into training data and testing data. Instead of using cross-validation as in
most studies, we designed a more realistic scenario to predict future new discover-
ies of viruses infecting the host given previously known infectious viruses. Table
2.1 shows the positive training data and positive testing data as before and after
the chosen cutting year, respectively. The negative training data and the neg-
ative testing data were chosen randomly without overlaps from the viruses that
were not identified to infect the corresponding host. The sizes of positive train-
ing/testing and negative training/testing data were set equal. Besides, in order to
reduce the variation of performance introduced by selecting negative training data
and negative testing data, we selected the negative data randomly for 50 times.
The performance of any method was measured as the average performance over 50
repeats with different negative training and testing data.
In addition to identifying the optimal machine learning methods for predict-
ing virus-host infectious associations, we also applied our best prediction method
to estimate the fraction of true infectious associations (reliability) in viral tag-
ging experiments [26,27]. Viral tagging is a new high throughput experimental
procedure for detecting viruses infecting a particular host.
In a viral tagging experiment, a particular bacterial host of interest is used as
bait to fish out viruses potentially infecting the host. The viral sequences are
then sequenced using NGS. In the viral tagging experiment of Deng et al. [26], 30
cyanobacterial viral genomes from the assembled reads screened by viral tagging
against a particular host Synechococcus sp. WH7803 were obtained. 19 out of 30
candidate viral genomes were shown to be T4-like viruses of Synechococcus, and 11
of 30 viral genomes are from non-T4-like viral population. The sequence lengths of
the 30 genomes range from 31.5kbps to 197kbps with the average length of about
83kbps.
Table 2.1: Description of the training and test data. For a specific year, the positive training data set
contains viruses infecting the corresponding host identified before the specific year and the positive testing data
set contains viruses infecting the corresponding host discovered after the specific year. The negative training data
and the negative testing data were chosen randomly without overlaps from the viruses that were not identified to
infect the host.
Bacterial genus | Cutting year | # of viruses before cutting year | # of viruses after cutting year | # of non-infectious viruses
Bacillus 2012 31 31 1364
Escherichia 2012 141 32 1253
Lactococcus 2013 49 6 1371
Mycobacterium 2013 172 46 1208
Pseudomonas 2013 68 28 1330
Salmonella 2012 32 22 1372
Staphylococcus 2012 43 20 1363
Synechococcus 2012 30 17 1379
Vibrio 2012 39 29 1358
In addition to the 30 almost complete viral genomes, Deng et al. [26] also gener-
ated about 10,864 raw viral short reads with lengths ranging from 15bp to 580bp,
and the average length of the short reads is 183bp. We assembled these reads into
contigs using the state-of-the-art assembly program metaSPAdes [28] and we concen-
trated on 1,661 contigs with lengths at least 1.5kbps.
2.2.2 Feature definitions
We considered four different definitions of features. For each viral sequence, we counted the number of occurrences $N_w$ for every word of length $k$, $w \in \mathcal{A}^k$, where $\mathcal{A}$ is the set of the alphabet. For example, if we consider DNA sequences and $k = 2$, then $\mathcal{A}^k = \{AA, AC, AG, AT, CA, \cdots, TT\}$. The four features are defined as follows:
$$F_1 = \left\{ \frac{N_w}{L-k+1},\ w \in \mathcal{A}^k \right\}, \qquad F_2 = \left\{ \frac{N_w - E(N_w)}{E(N_w)},\ w \in \mathcal{A}^k \right\},$$
$$F_3 = \left\{ \frac{N_w - E(N_w)}{\sqrt{E(N_w)}},\ w \in \mathcal{A}^k \right\}, \qquad F_4 = \left\{ \frac{N_w - E(N_w)}{\sigma(N_w)},\ w \in \mathcal{A}^k \right\},$$
where $L$ is the length of the viral genome sequence; $k$ is the length of the words; and $E(N_w)$ and $\sigma(N_w)$ are the expectation and standard deviation of $N_w$ under a certain random model of the viral sequence. Ren et al. [29] proposed to use Markov chains (MC) to model genome sequences and showed promising results for alignment-free genome sequence comparison. In this study, we considered four different models of the viral sequences, including the independent and identically distributed (i.i.d.) model (0th order MC), and 1st, 2nd and 3rd order MCs. For each MC model, the probability transition matrix was calculated based on each virus's own genome sequence, and the resulting $E(N_w)$ and $\sigma(N_w)$ were calculated using the formulas in [30].
The first feature $F_1$ is the standard word frequency vector. The ideas of defining features $F_2$, $F_3$ and $F_4$ came from recent studies on alignment-free sequence comparison [29,31,32] showing that subtracting the expected word counts from the observed word counts can improve the efficiency of sequence comparison. These feature definitions differ in the denominator for normalizing the word counts. The second feature definition is based on the statistic in the CVtree from Hao's group [33]. The third and fourth feature definitions are based on the $d_2^*$ statistic in [31,32].
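To make the computation of the simplest feature concrete, the following Python sketch computes $F_1$ (the relative frequency of every length-$k$ word) for a single DNA sequence. This is a minimal illustration rather than the code used in our pipeline; the function name and the toy sequence are placeholders, and windows containing ambiguous bases (e.g. N) are simply skipped.

```python
from itertools import product

def kmer_relative_frequencies(seq, k):
    """Compute feature F1: relative frequencies of all 4^k words of length k.

    Overlapping windows are counted and each count is divided by
    L - k + 1, the number of windows in a sequence of length L.
    """
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    n_windows = len(seq) - k + 1
    for i in range(n_windows):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing ambiguous bases such as N
            counts[word] += 1
    return [counts[w] / n_windows for w in words]

# Toy example with k = 2 (16-dimensional feature vector)
print(kmer_relative_frequencies("ACGTACGTAC", k=2))
```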
2.2.3 Supervised learning methods
For a given bacterial host, suppose that there are $n$ viruses $\{V_1, V_2, \cdots, V_n\}$ infecting the host and $m$ viruses $\{V_{n+1}, V_{n+2}, \cdots, V_{n+m}\}$ not infecting the host. For a given feature definition, let $\mathbf{x}_i$ be the feature vector for the $i$-th virus, $i = 1, 2, \cdots, n+m$, and let $y_i = 1$ for $i \in \{1, 2, \cdots, n\}$ and $y_i = 0$ for $i \in \{n+1, n+2, \cdots, n+m\}$. We investigated the following machine learning methods for distinguishing infecting and non-infecting viruses for a particular host. These methods can be found in [25] and we outline them below. Details of these methods can be found in Section A.1.
Logistic Regression
Logistic Regression [34] is a commonly used supervised learning approach to predict binary-valued labels. For any virus whose infectious association with the given bacterial host we try to predict, let $Y$ be the binary label of the virus and $\mathbf{x}$ be the feature vector of the virus. In logistic regression, define
$$h_\beta(\mathbf{x}) = \frac{\exp(\beta^T \mathbf{x})}{1 + \exp(\beta^T \mathbf{x})},$$
where the superscript "T" indicates the transpose and $\beta = (\beta_1, \beta_2, \cdots, \beta_p)^T$.
We assume that the class label of a given virus with feature vector $\mathbf{x}$ follows the distribution
$$p(Y=1 \mid \mathbf{x}; \beta) = h_\beta(\mathbf{x}), \qquad p(Y=0 \mid \mathbf{x}; \beta) = 1 - h_\beta(\mathbf{x}),$$
which implies
$$p(Y=y \mid \mathbf{x}; \beta) = \left(h_\beta(\mathbf{x})\right)^{y} \left(1 - h_\beta(\mathbf{x})\right)^{1-y}.$$
When estimating the parameter $\beta$ with maximum likelihood estimation, in order to deal with the sparsity issue of the data, we added the LASSO regularization (least absolute shrinkage and selection operator) [35] and performed feature selection with $L_1$-norm penalization. Then the problem becomes finding $\beta$ such that
$$-\sum_{i=1}^{n+m} \log\left(p(Y=y_i \mid \mathbf{x}_i; \beta)\right) + \lambda \sum_{i=1}^{p} |\beta_i|$$
is minimized, where $\lambda$ acts as the penalty term for the number of parameters. In our study, $\lambda$ is set to the default value, which is 1, in the scikit-learn package [36].
After solving for $\beta$, for a new virus with feature vector $\mathbf{x}$, the prediction score is given by $\hat{y} = h_\beta(\mathbf{x})$.
Support vector machine with RBF kernel
The support vector machine (SVM) [37] is a popular method for binary classification and it has been successfully applied to many different problems. In general, SVM aims to find the optimal hyperplane that separates the data labeled with $y_i = 1$ from the data labeled with $y_i = 0$. SVM can be expressed as the following optimization problem:
$$\min \|\mathbf{w}\|^2 \quad \text{subject to} \quad \begin{cases} \mathbf{w} \cdot \Phi(\mathbf{x}_i) + b \geq 1 & \text{if } y_i = 1, \\ \mathbf{w} \cdot \Phi(\mathbf{x}_i) + b \leq 0 & \text{if } y_i = 0. \end{cases}$$
In our study, we used the Gaussian radial basis function (RBF) as the kernel. Mathematically, the RBF kernel is represented as $K_{RBF}(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \cdot \|\mathbf{x}_i - \mathbf{x}_j\|^2) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$. As a free parameter of the RBF, a small $\gamma$ represents the data as a Gaussian distribution with large variance. Naturally, feature vectors with high dimension will have high variation, so here we set $\gamma$ as the reciprocal of the dimension of the features [38].
After $\mathbf{w}$ and $b$ are solved, for the feature vector $\mathbf{x}$ corresponding to any new virus, the prediction score is determined by $\hat{y} = \mathbf{w} \cdot \Phi(\mathbf{x}) + b$.
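A minimal scikit-learn sketch of an RBF-kernel SVM with $\gamma$ set to the reciprocal of the feature dimension is shown below; the data are placeholders and the sketch only illustrates the method described above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_train = rng.random((60, 256))           # placeholder word-frequency features
y_train = np.array([1] * 30 + [0] * 30)
X_test = rng.random((20, 256))

# gamma="auto" sets gamma = 1 / (number of features), i.e. the reciprocal of the
# feature dimension described in the text.
svm = SVC(kernel="rbf", gamma="auto")
svm.fit(X_train, y_train)

# decision_function returns w . Phi(x) + b, used here as the prediction score.
svm_scores = svm.decision_function(X_test)
```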
Random forest
Random forest (RF) is a classification method that uses an ensemble of classification trees [39], with each tree constructed using a bootstrap sample of the data. At each split, only a random subset of the variables is considered. Each tree in the RF is allowed to grow fully to reduce bias in the decision process, while the randomness of variable selection reduces the correlation of the individual trees. Therefore, a decision made by the RF is an ensemble decision with low bias and low variation, because it is made by a collective of low-bias and weakly correlated trees. (See details in Section A.1.3.)
Naive Bayes
For any virus, let $Y$ be the binary label and $\mathbf{x}$ be the feature vector of the virus. From the Bayes theorem, we have
$$p(Y \mid \mathbf{x}) = \frac{p(Y)\,p(\mathbf{x} \mid Y)}{p(\mathbf{x})} \;\Longrightarrow\; \begin{cases} p(Y=1 \mid \mathbf{x}) = \alpha \cdot p(Y=1)\,p(\mathbf{x} \mid Y=1), \\ p(Y=0 \mid \mathbf{x}) = \alpha \cdot p(Y=0)\,p(\mathbf{x} \mid Y=0), \end{cases}$$
where $\alpha = \frac{1}{p(\mathbf{x})}$. The prediction score is then given by
$$\hat{y} = \alpha \cdot p(Y=1)\,p(\mathbf{x} \mid Y=1) = 1 - \alpha \cdot p(Y=0)\,p(\mathbf{x} \mid Y=0).$$
Depending on the different assumed distributions of $p(\mathbf{x} \mid Y)$, the naive Bayes [40,41] method can be further divided into Gaussian naive Bayes and Bernoulli naive Bayes (see details in Section A.1.4).
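For completeness, a minimal scikit-learn sketch of the two naive Bayes variants on placeholder data is given below; the binarization threshold used for Bernoulli naive Bayes is an illustrative choice, not taken from the original analysis.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB

rng = np.random.default_rng(3)
X_train = rng.random((60, 256))
y_train = np.array([1] * 30 + [0] * 30)
X_test = rng.random((20, 256))

# Gaussian naive Bayes models p(x | Y) with independent per-feature normal densities.
gnb_scores = GaussianNB().fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Bernoulli naive Bayes works on binarized features; here features are thresholded
# at the median training value purely for illustration.
bnb = BernoulliNB(binarize=float(np.median(X_train)))
bnb_scores = bnb.fit(X_train, y_train).predict_proba(X_test)[:, 1]
```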
2.2.4 Evaluation criteria
For each of the supervised learning methods with any of the four defined features,
we trained the model based purely on the training data and obtained a score func-
tion. Then we applied the model to the testing data and calculated the prediction
scores for each virus. For testing data, a higher prediction score indicates higher
probability that the virus infects the host. Based on the prediction scores of the
testing data, we calculated the AUC scores [42] (Section A.1.1) as the evaluation
criterion for each of the feature-method combinations.
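As a small illustration of this criterion, the AUC score for a set of toy labels and prediction scores can be computed with scikit-learn as follows.

```python
from sklearn.metrics import roc_auc_score

# Toy example: true labels of six test viruses and their prediction scores.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.90, 0.75, 0.40, 0.55, 0.20, 0.10]

# 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9.
print(roc_auc_score(y_true, y_score))
```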
2.2.5 Estimating the fraction of viruses infecting a host in
viral tagging experiments
We assumed that viral contigs derived from viral tagging experiments are a mixture of contigs from viruses infecting the host and those not infecting the host. For an optimal learning method built from the training data, we assumed that the distribution of the scores for viruses not infecting the host follows a beta distribution $\Phi_0(\alpha_0, \beta_0)$ based on our preliminary exploration of the data. Similarly, we assumed that the scores of viruses infecting the host follow another beta distribution $\Phi_1(\alpha_1, \beta_1)$ from the preliminary studies. The distribution of the scores for the observed viral contigs is a mixture of $\Phi_0(\alpha_0, \beta_0)$ and $\Phi_1(\alpha_1, \beta_1)$. Let $P_{obs}(\cdot)$ denote the distribution for the scores of the viral contigs from the experiments. Then
$$P_{obs}(\cdot) = (1-\gamma) \cdot \Phi_0(\alpha_0, \beta_0) + \gamma \cdot \Phi_1(\alpha_1, \beta_1),$$
where $\gamma$ is the fraction of contigs derived from viral sequences infecting the host. We first estimated the parameters $(\alpha_0, \beta_0)$ and $(\alpha_1, \beta_1)$ using the moment estimators from the scores for the negative and positive testing data, respectively. Then we used the maximum likelihood approach to estimate the fraction $\gamma$ and its confidence interval.
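The sketch below illustrates, under the assumptions stated above, how $\gamma$ can be estimated: the beta parameters by the method of moments and $\gamma$ by maximizing the mixture likelihood. It is a simplified illustration only (the confidence interval calculation is omitted), and it assumes that all scores lie strictly inside (0, 1).

```python
import numpy as np
from scipy import stats, optimize

def beta_moment_fit(scores):
    """Method-of-moments estimates (alpha, beta) for a beta distribution."""
    m, v = np.mean(scores), np.var(scores)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

def estimate_gamma(obs_scores, neg_scores, pos_scores):
    """MLE of gamma in the mixture (1 - gamma) * Phi0 + gamma * Phi1."""
    a0, b0 = beta_moment_fit(neg_scores)   # Phi0: viruses not infecting the host
    a1, b1 = beta_moment_fit(pos_scores)   # Phi1: viruses infecting the host
    f0 = stats.beta.pdf(obs_scores, a0, b0)
    f1 = stats.beta.pdf(obs_scores, a1, b1)
    neg_loglik = lambda g: -np.sum(np.log((1 - g) * f0 + g * f1))
    res = optimize.minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
    return res.x
```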
2.3 Results
2.3.1 Comparison of different supervised learning methods
The complete results on the performance of different combinations of features
and machine learning methods are given in Table A.3. To see which machine
learning method performs the best for a given feature definition, we calculated the
average AUC score across the 9 bacterial host genera as well as different background
sequence models. The results are shown in Figure 2.1. It can be seen from the
figure that for all the four features, the RF method outperforms others in general,
although there are some exceptions. If we fix the RF method, there is not much
difference in performance among the four features. Since the first feature definition is
the simplest and does not need background models for the sequences, we suggest
the use of RF method based on the relative frequencies of word patterns.
An important problem in using word patterns is the determination of the length
of word patterns. If the length of word patterns is too short, the frequency vectors
cannot fully capture the information in the viral sequences. On the other hand, if
the length of word patterns is too large, the frequency vector has high variation.
Therefore, appropriate choice of the length of word patterns is essential. Fixing
the first feature and the RF method, the AUC scores with different word lengths
for the 9 host genera are shown in Table 2.2. When k = 4, five out of the 9 host
genera can achieve AUC over 0.95, one with AUC between 0.90 to 0.95, and three
with AUC between 0.85 and 0.90. The average AUC is slightly increased for 6 out
of the 9 host genera when k is increased from 4 to 6. However, when k is increased
to 8, the average AUC is significantly decreased for some of the host genera.
Table 2.2: The AUC scores of using RF combined with the first feature with different word pattern
lengths across 9 different hosts. The highest score for each host is highlighted in bold.
k = 4 k = 6 k = 8
Bacillus 0.856 0.863 0.823
Escherichia 0.878 0.858 0.807
Lactococcus 0.972 1.000 0.988
Mycobacterium 0.987 0.985 0.984
Pseudomonas 0.978 0.981 0.967
Salmonella 0.889 0.896 0.891
Staphylococcus 0.993 0.987 0.983
Synechococcus 0.965 0.978 0.955
Vibrio 0.936 0.940 0.892
Figure 2.1: The average AUC scores of the different machine learning methods and features. The averaged AUC scores are calculated as the average of the AUC scores across different hosts and different genome background distributions. The figures from top to bottom are the performances for different word lengths k = 4, k = 6 and k = 8. The black segments on top of the bars are the standard deviations of the AUC scores across different hosts and different genome background distributions.
Potential explanations for the performance variation across different
host genera
We are interested in understanding the underlying reasons for the performance
variation across the different host genera. We hypothesized that the viral sequences
for the host genera with high prediction accuracy are more similar to each other
than those for the other host genera. To test this hypothesis, we first calculated the
Manhattan distances of the first feature vectors for pairs of viruses infecting each
host genus and they are shown in Table 2.3. Significant associations between the
AUC scores and the average Manhattan distances within a group were observed
except when k = 4 (Spearman correlation -0.450 (p-value = 0.22), -0.800 (p-value
= 0.01), and -0.683 (p-value = 0.04), for k = 4, 6, and 8, respectively.) This
observation indicates that the prediction performance can be partially explained
by the average distances among the viruses infecting a host genus.
Table 2.3: Average Manhattan distances of word relative frequency vectors between pairs of viruses
infecting each host.
k = 4 k = 6 k = 8
Bacillus 0.342 0.558 1.146
Escherichia 0.372 0.706 1.416
Lactococcus 0.194 0.417 1.020
Mycobacterium 0.292 0.513 1.044
Pseudomonas 0.379 0.633 1.241
Salmonella 0.324 0.568 1.249
Staphylococcus 0.266 0.455 0.984
Synechococcus 0.335 0.516 0.986
Vibrio 0.371 0.658 1.360
We next explored the taxonomic compositions of the viral sequences infecting
each host genus. Most viruses in our data belong to the order of Caudovirales that
is composed of three major groups: the myoviruses, podoviruses, and siphoviruses
(Table 2.4). These groups of viruses exhibit different host ranges. Myoviruses
often have the broadest host ranges and podoviruses and siphoviruses typically
have relatively narrow host ranges [43,44]. Table 2.4 shows that viruses infect-
ing the nine host genera have very different taxonomic profiles. Viruses infecting
Lactococcus and Mycobacterium are primarily siphoviruses. Most of the viruses
infecting Staphylococcus belong to either myoviruses or siphoviruses. The three
host genera, Lactococcus, Mycobacterium and Staphylococcus, have very high AUC
scores over 0.98 when k = 6. The viruses infecting Synechococcus are primarily
myoviruses and the AUC corresponding to this host genus is also high (0.978 when
k = 6). The only exception is the Pseudomonas genus, for which the viruses infecting
the host spread across the three groups while still keeping a high AUC score of
0.981 when k = 6. For the other host genera, the viruses infecting them generally
belong to all three groups and they have relatively low, although decent, AUC
scores. We also calculated the entropy of the viruses according to the different
groups of viruses for each host and found that the entropy was also highly associ-
ated with the AUC scores (Spearman correlation coefficients between entropy and
AUC scores are -0.600 (p-value = 0.09), -0.750 (p-value = 0.02), -0.783 (p-value =
0.01) for k = 4, 6, and 8, respectively.)
Table 2.4: The distribution of viruses among three major viral families, Myoviridae, Podoviridae
and Siphoviridae, for viruses infecting each of the nine host genera. The last column is the entropy of
the distribution.
Myoviruses Podoviruses Siphoviruses other Entropy
Bacillus 21 8 27 3 1.656
Escherichia 49 30 43 51 1.972
Lactococcus 0 2 53 0 0.472
Mycobacterium 10 0 206 2 0.384
Pseudomonas 38 29 21 8 1.829
Salmonella 10 19 21 4 1.789
Staphylococcus 19 4 35 5 1.535
Synechococcus 28 6 5 8 1.603
Vibrio 20 21 6 21 1.875
2.3.2 Comparison between RF, Manhattan and $d_2^*$ dissimilarity measures
Roux et al. [17] used the Manhattan distance between the word frequency vectors of viruses and bacterial hosts to predict virus-host infectious association. In addition, the $d_2^*$ statistic [31,32] was shown to have superb performance in measuring sequence dissimilarities. We compared the performances of RF with those based on the Manhattan distance and the $d_2^*$ statistic. The $d_2^*$ statistic between two sequences is defined as the uncentered correlation between two feature vectors according to the third definition of features.
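For illustration, given the $F_3$ feature vectors of two sequences, the $d_2^*$ dissimilarity can be sketched as below; rescaling the uncentered correlation into a dissimilarity via (1 - correlation)/2 is one common convention and is an assumption of this sketch.

```python
import numpy as np

def d2star_dissimilarity(f3_a, f3_b):
    """Turn the uncentered correlation between two F3 vectors into a dissimilarity.

    The (1 - correlation) / 2 rescaling maps the result into [0, 1] and is one
    common convention; smaller values indicate more similar word usage.
    """
    f3_a, f3_b = np.asarray(f3_a, float), np.asarray(f3_b, float)
    corr = np.dot(f3_a, f3_b) / (np.linalg.norm(f3_a) * np.linalg.norm(f3_b))
    return (1.0 - corr) / 2.0
```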
We first calculated the average Manhattan distance between the frequency vec-
tors of a viral sequence in the testing set with the viruses in the positive training
data. We predicted a virus to infect the host if the distance is smaller than a given
threshold. The predictions were then compared with the true infectious relation-
ships to obtain the false positive rate and the true positive rate. By changing the
threshold, we obtained the ROC curve and the AUC was calculated. The pro-
cedure was repeated 50 times in order to reduce the variation introduced by the
selection of the negative testing data.
We did a similar analysis using the $d_2^*$ dissimilarity measure. For $d_2^*$, the background Markov chain model was needed and we considered the independent identically distributed (i.i.d.) model, first, second and third order Markov chains. We compared the performances based on Manhattan, $d_2^*$ under the i.i.d., first, and second order MC background models, and random forest with k = 4, k = 6 and k = 8. The performances of different methods when k = 6 are given in Table 2.5. The results based on k = 4 and k = 8 as well as the third order MC background model for $d_2^*$ are given in Supplementary Material (Table A.3).
Identification of hosts of viral contigs in metagenomic studies
The primary motivation of our study is the identification of hosts of viral contigs
in metagenomic studies. In viral metagenomic studies, in which viral DNA is separated
from cellular DNA before sequencing, only viral genomes are sequenced (although
there is usually some contaminating cellular DNA) and their host information is
completely lost. It is important to match viral contigs with their corresponding
Table 2.5: Comparison of AUC scores of RF (random forest) combined with word frequency vector with that based on Manhattan distance and the $d_2^*$ statistic when k = 6. For the background model of the $d_2^*$ statistic, we considered the independent identically distributed (i.i.d.) model, first and second order Markov chains.
Host genus | Manhattan | $d_2^*$ (i.i.d.) | $d_2^*$ (1st-order MC) | $d_2^*$ (2nd-order MC) | RF-feat-1
Bacillus 0.829 0.752 0.873 0.851 0.863
Escherichia 0.880 0.833 0.958 0.945 0.856
Lactococcus 0.767 0.775 0.828 0.750 1.000
Mycobacterium 0.976 0.977 0.966 0.984 0.985
Pseudomonas 0.951 0.934 0.974 0.970 0.981
Salmonella 0.837 0.818 0.900 0.900 0.896
Staphylococcus 0.964 0.941 0.947 0.974 0.987
Synechococcus 0.929 0.906 0.994 0.993 0.978
Vibrio 0.841 0.733 0.854 0.817 0.940
hosts for the understanding of the virus-host infection dynamics in microbial com-
munities. Because intact viral genomes are rarely recovered as contigs in viral
metagenomic studies, the assembled viral contigs are generally much shorter than
whole genome sequences and thus we study the performance of RF for the identi-
fication of bacterial hosts of viral contigs of different lengths. In order to achieve
this objective, we studied the performance of RF for the prediction of hosts of viral
contigs with different lengths: 1kbps, 3kbps, 5kbps, and whole genome.
In addition to the RF method learned based on complete viral genomes, we can
also learn the RF methods based on contigs of different lengths. We hypothesized
that, to predict the hosts of contigs of a certain length, the best method should be
learned from contigs with similar lengths. To test this hypothesis, we carried out
the following study.
For both training and testing data corresponding to the 9 bacterial host genera,
we first broke the whole viral genomes into nonoverlapping contigs of lengths 1kbps,
3kbps, 5kbps and whole genomes, respectively. To incorporate sequencing errors of
NGS technologies, we modified the contigs with a 0.05% sequencing error rate. When
a sequencing error occurs at a nucleotide base, the original base was changed to any
of the other three bases with equal probability. Figure 2.2 shows the schema of our
study. We built RF predictors using the first feature of the training contigs using
different lengths and predicted the hosts of the testing contigs. Table 2.6 shows
the results. It can be seen from the table that the performances of the learned RF
model based on contigs of 3kbps and 5kbps are similar and are consistently among
the best predictors for all the sequence contigs. When the contig length is 1kbps,
the word frequency vectors are not stable and have too much variation resulting
in low performance of the learned model. On the other hand, the RF model based
on the whole genome sequences does not perform well for short contigs of lengths
1kbps or shorter. A potential explanation is that the frequency vectors of short
contigs differ significantly from that of the whole genomes.
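A minimal sketch of the contig-generation step under the error model described above is given below; the function name and the use of a fixed random seed are implementation choices for illustration.

```python
import random

def make_contigs(genome, contig_len, error_rate=0.0005, seed=0):
    """Break a genome into non-overlapping contigs of fixed length and add
    random substitution errors at the given rate (0.05% in our study).
    Each erroneous base is replaced by one of the other three bases with
    equal probability."""
    rng = random.Random(seed)
    contigs = []
    for start in range(0, len(genome) - contig_len + 1, contig_len):
        contig = list(genome[start:start + contig_len])
        for i, base in enumerate(contig):
            if rng.random() < error_rate:
                contig[i] = rng.choice([b for b in "ACGT" if b != base])
        contigs.append("".join(contig))
    return contigs
```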
Figure 2.2: Scheme of the RF method for the prediction of hosts of viral contigs of different lengths.
For each of the 9 main host genera, we produced 4 training datasets with different sequence lengths by breaking
the whole viral genomes into nonoverlapping contigs of lengths 1kbps, 3kbps, 5kbps and the whole genomes with
0.05% sequencing errors added; Similarly, we also generated 4 different testing datasets with different contig
lengths. We then evaluated the performances of RF for each training dataset with specific sequence length on all
4 testing datasets.
2.3.3 Estimation of the reliability of observed virus-host
infectious associations from viral-tagging experi-
ments
The above studies showed that RF with the first feature performs well in predicting
contigs coming from viral genomes infecting a host. For the host Synechococcus,
the best word length is k = 6. From the RF model that was trained by 30 positive
viral genomes and 30 negative viral genomes, for any viral sequence to be predicted,
a score between (0, 1) can be calculated. We first calculated the scores of the 17
positive viral genomes in the testing data, and we also calculated the scores of the
Table 2.6: The AUC scores of the RF method for the prediction of hosts of viral contigs with
different lengths using the models built from contigs of different lengths.
Test: 1kb
Baci. Esch. Lact. Myco. Pseu. Salm. Stap. Syne. Virb.
Train: 1kb 0.773 0.805 0.840 0.962 0.924 0.812 0.936 0.928 0.842
Train: 3kb 0.819 0.857 0.833 0.977 0.959 0.818 0.955 0.960 0.858
Train: 5kb 0.831 0.848 0.848 0.977 0.952 0.826 0.957 0.957 0.845
Train: wgs 0.821 0.718 0.818 0.886 0.833 0.792 0.948 0.890 0.774
Test: 3kb
Baci. Esch. Lact. Myco. Pseu. Salm. Stap. Syne. Virb.
Train: 1kb 0.766 0.862 0.842 0.979 0.947 0.843 0.961 0.961 0.880
Train: 3kb 0.823 0.878 0.868 0.980 0.975 0.866 0.966 0.967 0.898
Train: 5kb 0.850 0.899 0.889 0.985 0.978 0.880 0.976 0.983 0.917
Train: wgs 0.854 0.827 0.885 0.967 0.952 0.872 0.976 0.951 0.876
Test: 5kb
Baci. Esch. Lact. Myco. Pseu. Salm. Stap. Syne. Virb.
Train: 1kb 0.768 0.870 0.822 0.982 0.955 0.838 0.964 0.972 0.867
Train: 3kb 0.827 0.900 0.869 0.985 0.977 0.871 0.974 0.986 0.904
Train: 5kb 0.852 0.890 0.883 0.986 0.978 0.883 0.972 0.972 0.907
Train: wgs 0.858 0.865 0.888 0.983 0.970 0.882 0.979 0.965 0.900
Test: wgs
Baci. Esch. Lact. Myco. Pseu. Salm. Stap. Syne. Virb.
Train: 1kb 0.778 0.860 0.814 0.984 0.960 0.812 0.971 0.981 0.896
Train: 3kb 0.854 0.901 0.817 0.994 0.988 0.884 0.989 0.994 0.930
Train: 5kb 0.870 0.923 0.861 0.994 0.992 0.889 0.992 0.996 0.934
Train: wgs 0.862 0.859 1.0 0.985 0.979 0.893 0.987 0.981 0.938
negative viral genomes except the 30 negative genomes that were used in training
the model. As stated in Section 2.2, we assumed that the scores for the
negative and positive sequences follow beta distributions and the corresponding
parameters were estimated using moment estimators.
For the 30 candidate viral genomes identified to infect Synechococcus, we cal-
culated the RF scores of the 19 T4-like viruses and 11 non-T4-like viruses. Figure
2.3(a)(c) shows the histograms of the RF scores of the viral sequences from the
positive and negative testing data sets, and (a) the T4-like and (c) non-T4-like can-
didate genomes, respectively. The histogram of the scores of the T4-like viruses has
a significant overlap with that for the positive testing data set. This observation
strongly suggests that most of the identified T4-like viruses do infect Synechococ-
cus. On the other hand, the histogram for the scores of the identified non-T4-like
viruses peaked in the bins of viruses not infecting the host Synechococcus, but
mixed to a small extent with the viruses that infect the host Synechococcus. This
observation raises doubts that most of the identified non-T4-like viruses infect the
host.
We then estimated the fraction of true infectious viruses using the maximum
likelihood approach described in the “Materials and Methods" section. According
to the distribution of positive testing viruses and negative viruses with respect
to the host Synechococcus, the fitted distributions for the positive and negative
viruses are $\Phi_1(3.35, 1.10)$ and $\Phi_0(3.54, 8.78)$, respectively. The maximum likelihood
estimate of $\gamma$ is 0.949 with the 95% confidence interval [0.933, 0.964] for the T4-like
viruses. For the non-T4-like viruses, the estimated $\gamma$ is 0.288 with the 95% confidence
interval [0.265, 0.311]. The fitted density functions are given in Figure 2.3(b) and (d), respectively.
We also used the same method to estimate the fraction of contigs derived from viruses
infecting the host among the 1,661 contigs with lengths of at least 1.5kbps from the viral tagging experiment with the host Syne-
chococcus [26]. This fraction was estimated at 30.4% with 95% confidence interval
[0.287, 0.320] (Figure 2.3 (e) and (f)).
Figure 2.3: The histograms of the RF scores for the viral sequences in the negative and positive data sets, and (a) T4-like, (c) non-T4-like, and (e) viral contigs, respectively. The corresponding fitted density functions are given in (b), (d), and (f), respectively. In all of the 6 subfigures, the horizontal axis is the prediction score. In (a), (c) and (e), the right y-axis indicates the fraction and the left y-axis indicates the fraction divided by the bin size.
Chapter 3
Prediction of colorectal cancer by
alignment-free supervised
learning methods
3.1 Background
Colorectal cancer (CRC) has been identified as the fourth most common cancer
(after lung, breast and prostate cancers) and the second most common non-gender
specific tumor type [45,46]. In 2018 alone, an estimated 50,630 deaths were linked
to colorectal cancer in the U.S. [47]. Early diagnosis and non-invasive screening of
colorectal cancer are therefore critical for reducing the deaths caused by late-stage
tumor development. Although only a small number of colorectal cancer cases
are caused by genetic disorders, it has been demonstrated that tumorigenesis of
colorectal cancer is also highly associated with environmental factors such as age,
gender, body mass index (BMI), as well as related diseases such as inflammatory
bowel disease (IBD) [48]. In addition, life styles such as diet, smoking habits and
lack of physical exercises have also been linked to the disease.
Gut microbiota is increasingly being associated with CRC, and in particular,
several microbial species have been implied to be associated with disease causation
[49–52]. In the meantime, high-throughput metagenomic sequencing technology
from human fecal samples facilitated association studies between microbiome and
human diseases [53,54]. In recent years, several studies revealed the associations
between human colorectal carcinogenesis and human gut microbiome.
Depending on the data availability at the time of the study, CRC predicting
approaches can be assessed in three different ways: 1. Training data and testing
data from the same dataset (intra-dataset); 2. Training data and testing data are
two independent datasets (cross-datasets); 3. Training data is a combination of
multipledatasets, leavingonedatasetouttobeusedasthetestingdata(combined-
datasets).
Zeller et al. [55] first applied taxonomic profiling on intra-cohort patients' fecal
sequencing samples to identify CRC markers with relative abundance information,
which is then used to classify CRC patients versus healthy controls using LASSO.
Feng et al. [56] used a similar approach but additionally took into account the
metagenomic linkage groups from the human metagenomic gene abundance profile,
and the prediction was based on random forest. Yu et al. [57] used a similar
prediction strategy as Feng et al. [56] on a different intra-cohort dataset.
On the other hand, two recent publications focused on CRC classification across
different datasets and combination of datasets: Wirbel et al. [58] extended Zeller et
al. [55]’s protocol of classifying CRC patients versus healthy controls on 8 different
datasets with intra-, cross- and combined-datasets assessment. Thomas et al. [59]
performed a similar study on almost the same datasets as in Wirbel et al. [58],
with some minor differences in data preprocessing. Besides, Thomas et al. [59]
used random forests, instead of LASSO, as the classifier.
All of the above-mentioned metagenomic sequencing-based CRC versus normal
prediction studies relied on the process of OTU binning, in which metagenomic
sequences are aligned to known reference genomes. In particular, the prediction
procedures used features derived from relative abundance information in the tax-
onomic profile.
In our virus-host association study described in Chapter 2, we used alignment-
free approaches to extract features from metagenomic sequences, and achieved
impressive performance. We hypothesized that the association between micro-
biome and human gut environment is a similar scientific question as the virus-host
association. As a result, we extended our methods and applied them to the pre-
diction of CRC patients versus healthy controls, in which we extracted features
based on human gut metagenomic sequencing data as well as patients’ clinical
information such as age, gender and body mass index (BMI).
In our investigation, four different representations of alignment-free features
are engineered based on the word counts of the fecal sequencing data. We subse-
quently evaluated the performance of multiple supervised learning methods, includ-
ing Logistic regression, LASSO, SVM with three different kernel functions, Ran-
dom Forests, and Multilayer-perceptron (Deep Learning), using evaluation criteria
based on AUC scores of intra-, cross- and combined-datasets.
3.2 Materials and Methods
3.2.1 Data description
Lesions in the human colon can be cancerous or benign, which are called carcinoma and
adenoma, respectively. We therefore formulated the problem as binary prediction
of cancer versus non-cancer, and developed our methods and subsequent evalua-
tion using the following three independent CRC-related human fecal whole-genome
shotgun sequencing datasets:
The first dataset (NatureCom dataset) is from Feng et al. [56], which contains
a single cohort from Austria. In this dataset, there are 46 patients diagnosed with
colorectal cancer, 47 patients diagnosed with advanced adenoma and 63 healthy
controls. The second dataset (PlosOne dataset) is from Vogtmann et al. [60], which
includes fecal samples from 52 colorectal cancer patients and 52 healthy individuals
in the United States. The third dataset (MolSysBio dataset) is from Zeller et
al. [55], where the fecal samples were collected from four different countries. In
this dataset, patients with benign lesion were further classified into small adenoma
group and large adenoma group based on the size of the lesion area, which in total,
accounts for 297 cancer patients, 37 patients with large adenoma, 73 patients with
small adenoma and 205 healthy individuals.
Meanwhile, we also planned to include another relevant dataset [57] that
involves 47 carcinoma patients and 109 healthy individuals. However, the publicly
accessible data provided by the authors lacks diagnostic/clinical labels for each
individual sample, making verification of prediction impossible. We contacted the
authors by email hoping to obtain the missing information for the dataset but had
no response. Consequently we were not able to include this dataset in our study.
Both the NatureCom dataset [56] and the MolSysBio dataset [55] contain sam-
ples of adenoma. Feng et al. [56] treated adenoma samples as healthy controls, even
though the metagenomic gene abundance of adenoma samples is between the level of
CRC and healthy samples. Zeller et al. [55] claimed that although the metagenomic
compositions of healthy individuals and patients with adenoma (regardless of size)
were largely indistinguishable, large adenoma is widely thought to be a precursor
for CRC, and for this reason, they excluded patients with large adenoma from the
dataset and treated the small adenoma samples as healthy controls. Lastly for the
PlosOne dataset [60], there are only CRC patients and healthy controls with no
adenoma cases.
In order to be consistent with how these previous studies treated the datasets
with regard to adenoma samples, we consider all the adenoma samples in Nature-
Com dataset as healthy controls, while for the MolSysBio dataset we treated the
small adenoma samples as healthy controls and completely excluded the large ade-
noma samples.
All of the three datasets contain clinical indicators including age, gender, body
mass index (BMI) for each individual. The detailed compositions as well as the
average age, gender and BMI for each group in the three datasets are shown in
Table 3.1.
3.2.2 Clinical indicators as potential confounders for CRC
prediction
Before building any classifiers based on the fecal sequencing data for CRC patients
and healthy controls, we are interested in whether there are other potential con-
founders for colorectal cancer prediction, in terms of certain easily measurable
clinical indicators, such as age, gender, body mass index (BMI).
Figure 3.1 shows the gender distribution amongst CRC versus control groups
for all of the three datasets and a slightly higher proportion of males compared
to females is observed in the CRC group for the NatureCom and the MolSysBio
datasets. We performed two-proportion z-test [61] (Section B.1.1) on the pro-
portions of males in the CRC and the control groups for the NatureCom and the
MolSysBio datasets. The resulting p-values (0.468 and 0.072 for the NatureCom
and the MolSysBio dataset, respectively) do not provide strong evidence to infer
gender as a significant confounder in causing CRC.
Table 3.1: Description of the datasets. This table reflects three different datasets we focus on. The locations
where the data were collected are shown in the last column of the table. For each cohort, the number of
individuals in different diagnostic groups (CRC, Adenoma, Healthy) are listed, the mean and standard deviation
of Age as well as BMI for each group are also included.
Dataset Groups Age BMI Country
NatureCom
Control (63) 67±6.37 27.6±3.78
Adenoma (47) 66±7.86 28.0±4.69 Austria
CRC (46) 67±10.91 26.5±3.53
PlosOne
Control (52) 61±11.03 25.35±4.27
United States
CRC (52) 62±13.58 24.89±4.25
MolSysBio
Control (205) 56±11.79 29.00±5.85
Adenoma (110) 60±8.67 26.00±4.70 France, Germany
CRC (297) 65±12.23 26.00±4.46 Denmark & Spain
Total
Control (320) 60±11.49 28.17±5.55
Adenoma (157) 64±8.64 27.34±4.73
CRC (395) 64±12.40 25.85±4.21
Figure 3.2 demonstrates age and BMI distributions amongst CRC versus con-
trol groups for all of the three datasets. We performed Mann-Whitney U-test [61]
(Section B.1.2) on age and BMI distributions for all of the three datasets. The p-
values for age and BMI distributions are (0.243, 0.481, $6.524 \times 10^{-6}$) and (0.057,
0.310, 0.487) for NatureCom, PlosOne and MolSysBio datasets, respectively.
From the p-value perspective for the age and the BMI distribution, we did not
observe consistent shifts of the CRC group's distribution when compared with
that of the control group’s in all of the three datasets.
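For illustration, both tests can be reproduced with scipy and statsmodels as sketched below. The gender counts are those shown in Figure 3.1 for the NatureCom dataset (28 of 46 CRC patients and 60 of 110 controls are male), which recovers the p-value of about 0.468 reported above; the age arrays are placeholders, since the individual-level values are not reproduced here.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.proportion import proportions_ztest

# Mann-Whitney U-test on age (placeholder values for CRC patients vs controls).
age_crc = np.array([67, 71, 58, 64, 70])
age_ctrl = np.array([63, 66, 60, 59, 65])
_, p_age = mannwhitneyu(age_crc, age_ctrl, alternative="two-sided")

# Two-proportion z-test on the number of males per diagnostic group
# (NatureCom counts from Figure 3.1: 28/46 CRC vs 60/110 controls).
_, p_gender = proportions_ztest(count=[28, 60], nobs=[46, 110])

print(p_age, p_gender)   # p_gender is approximately 0.468
```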
Table 3.2 shows the Pearson correlations [62] for pairwise combinations of
age, BMI, gender and CRC diagnosis (cancer vs. normal). In each case, we did
not observe coherent correlation between these simple clinical indicators, making a
strong case for us to exclude them as confounders for CRC prediction. Our findings
are consistent with those of Wirbel et al. [58], who observed no prediction
bias introduced by age, gender and BMI in their model.
As an outcome of this, we chose not to include these clinical indicators when
constructing the feature representations of our datasets for CRC prediction.
Table 3.2: Pearson correlations between different clinical indicators. This table reflects the Pearson
Correlations of the paired-combinations in Age, Gender, BMI and CRC diagnosis for NatureCom (NC) dataset,
PlosOne (PO) dataset, MolSysBio (MSB) dataset and the combination of all three datasets (Complete).
Complete NC Dataset PO Dataset MSB Dataset
Age/Gender -0.0423 -0.1090 -0.1920 0.0439
Age/BMI 0.0506 0.0203 0.0882 -0.0116
Age/Diagnosis 0.1187 0.0117 -0.0112 0.2786
Gender/BMI 0.0150 0.1804 -0.0539 -0.0128
Gender/Diagnosis 0.0886 0.0621 0.0259 0.1005
BMI/Diagnosis -0.0923 -0.1426 -0.0543 -0.0427
Figure 3.1: Gender ratios in 3 datasets. Bars from left to right are gender ratios of 1. CRC patients in the NC dataset; 2. Healthy controls in the NC dataset; 3. CRC patients in the PO dataset; 4. Healthy controls in the PO dataset; 5. CRC patients in the MSB dataset; 6. Healthy controls in the MSB dataset. NC, PO and MSB are shortened forms of the NatureCom, PlosOne and MolSysBio datasets, respectively.
Figure 3.2: BMI and Age distributions for 3 datasets. (a)(b)(c) are age distributions for NatureCom
dataset, PlosOne dataset and MolSysBio dataset. (d)(e)(f) are BMI distributions for NatureCom dataset, PlosOne
dataset and MolSysBio dataset.
3.2.3 Feature definition
For each metagenomic sequencing sample (i.e., each patient's fecal sample), we counted
the number of occurrences N_w of every word w of length k, w ∈ A^k, where A is the
set of nucleotides {A, C, G, T}. We then define four different features as follows:

\[
F_1 = \left\{\frac{N_w}{L-k+1},\ w\in A^k\right\},\qquad
F_2 = \left\{\frac{N_w-E(N_w)}{E(N_w)},\ w\in A^k\right\},
\]
\[
F_3 = \left\{\frac{N_w-E(N_w)}{\sqrt{E(N_w)}},\ w\in A^k\right\},\qquad
F_4 = \left\{\frac{N_w-E(N_w)}{\sigma(N_w)},\ w\in A^k\right\},
\]

where L is the total length of the metagenomic sequence, k is the word pattern
length, and E(N_w) and σ(N_w) are the expectation and standard deviation of N_w under
a given random background model of the metagenomic sequence. In this study,
we consider two Markov models, the identically and independently distributed
(i.i.d.) model (0-th order MC) and the 1st-order MC, for each of the four feature
definitions. Notably, F_1 remains the same for any background model. The
probability transition matrix for each metagenomic sequencing sample is estimated
individually from that sample, and the resulting E(N_w) and σ(N_w) are calculated based
on the transition matrix [30].
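As a concrete illustration of these feature definitions, the sketch below counts k-mers in a set of reads and computes F_1 and, under the i.i.d. background model, F_2; it is a minimal, hypothetical helper (read handling, reverse complements and the 1st-order Markov background are omitted), not the exact pipeline used in this thesis.

```python
from collections import Counter
from itertools import product

def kmer_features(reads, k=6):
    """Compute F1 (relative word frequencies) and F2 under an i.i.d. background
    estimated from the reads themselves. `reads` is an iterable of DNA strings."""
    alphabet = "ACGT"
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter()
    base_counts = Counter()
    total_positions = 0                      # number of valid k-mer windows (L - k + 1 summed over reads)
    for r in reads:
        r = r.upper()
        base_counts.update(c for c in r if c in alphabet)
        for i in range(len(r) - k + 1):
            w = r[i:i + k]
            if set(w) <= set(alphabet):
                counts[w] += 1
                total_positions += 1
    # F1: relative word frequencies
    f1 = {w: counts[w] / total_positions for w in words}
    # i.i.d. background: expected count from single-nucleotide frequencies
    n_bases = sum(base_counts.values())
    p = {c: base_counts[c] / n_bases for c in alphabet}
    f2 = {}
    for w in words:
        expected = total_positions
        for c in w:
            expected *= p[c]
        f2[w] = (counts[w] - expected) / expected if expected > 0 else 0.0
    return f1, f2

f1, f2 = kmer_features(["ACGTACGTACGT", "TTGCAGGTCAAC"], k=3)
```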
3.2.4 Removing human DNA from metagenomic samples
The reads in metagenomic sequencing data are a mixture of microbial genomic
sequences and human genome sequences. Vincent et al. [63] reported that
the proportion of human DNA fragments in fecal samples ranges from 0% to 98% in
their study of Clostridium difficile infection (CDI). The fecal samples of hospitalized
CDI patients tend to have a greater proportion of human DNA fragments, while those
of healthy controls tend to have a smaller proportion. The same situation might apply
in the case of colorectal cancer (CRC) as well. The presence of human DNA in
varying proportions in fecal samples across different datasets might pose a great
challenge for estimating the correct background model as well as for extracting
consistent features from the sequencing samples.
In order to guarantee that our study is not affected by human DNA fragments in
the metagenomic sequencing data, before applying our feature engineering and
supervised learning prediction pipeline, we first go through a standardized process
to filter out any human DNA fragments that exist in the metagenomic sequencing data
with the following steps:
1. For each paired-end metagenomic sequencing sample, map all the reads
against the human reference genome (hg38) with Bowtie2 [64].
2. Gather the unmapped reads from the original metagenomic sequencing data
with SAMtools [65] and save them as new paired-end sequencing data with
BEDTools [66].
After the above filtration steps, we apply our feature engineering and
supervised learning prediction pipeline (Section 3.2.5) to the purified sequencing
data.
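A minimal sketch of this filtering step is given below, assuming a prebuilt hg38 Bowtie2 index and the intermediate file names shown; the flags used (for example -f 12 to keep pairs with both mates unmapped) follow common usage of these tools and may need adjusting for specific samtools/bedtools versions.

```python
import subprocess

def remove_human_reads(r1, r2, index="hg38", out_prefix="filtered"):
    """Map paired-end reads against hg38, keep pairs where both mates are
    unmapped, and write them back out as paired FASTQ (file names are placeholders)."""
    # 1. Map the paired-end reads against the human reference with Bowtie2.
    subprocess.run(["bowtie2", "-x", index, "-1", r1, "-2", r2,
                    "-S", "mapped.sam"], check=True)
    # 2. Keep pairs with both mates unmapped (-f 12), drop secondary alignments
    #    (-F 256), and name-sort the result for bamtofastq.
    subprocess.run("samtools view -b -f 12 -F 256 mapped.sam "
                   "| samtools sort -n -o unmapped.bam -", shell=True, check=True)
    # 3. Convert the unmapped pairs back to paired-end FASTQ with BEDTools.
    subprocess.run(["bedtools", "bamtofastq", "-i", "unmapped.bam",
                    "-fq", f"{out_prefix}_1.fastq", "-fq2", f"{out_prefix}_2.fastq"],
                   check=True)
```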
3.2.5 Supervised learning methods
Suppose a training dataset contains n CRC patients {S_1, S_2, ..., S_n} and m healthy
controls {S_{n+1}, S_{n+2}, ..., S_{n+m}}. For any of the feature definitions under a specific
Markov model, let x_i be the feature vector and y_i be the label of the i-th sample
in the training data, i = 1, 2, ..., n + m, with

\[
y_i =
\begin{cases}
1 & \text{if } i\in\{1,2,\cdots,n\}\\
0 & \text{if } i\in\{n+1,n+2,\cdots,n+m\}.
\end{cases}
\]
Besides the supervised learning methods Logistic Regression, Support Vector
Machine (SVM) with RBF kernel and Random Forest that we used in the virus-host
interaction project (Chapter 2), we also include SVM with linear kernel, SVM
with polynomial kernel, LASSO and the Multilayer Perceptron (MLP, a deep learning
model) in our study. We then investigated the performance of each supervised
learning method based on each of the four features defined above.
LASSO
LASSO [35] has achieved good performance in many binary-valued regression and
prediction problems. The problem formulations of LASSO and logistic regression
are quite similar except for the regularization penalty: logistic regression
uses an L_2-norm penalty, whereas LASSO uses an L_1-norm penalty instead, minimizing

\[
-\sum_{i=1}^{n+m}\log\big(p(Y=y_i\mid \vec{x}_i;\beta)\big) + \lambda\sum_{j=1}^{p}|\beta_j|.
\]
The regularization strength λ is set to 1 and 5-fold cross-validation is used.
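A minimal sketch of this L1-penalised classifier, using scikit-learn's logistic regression with an L1 penalty as a stand-in (the synthetic X and y below are placeholders for the feature matrix and labels defined above; C is the inverse regularisation strength, so λ = 1 corresponds to C = 1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((60, 4 ** 6))          # stand-in feature matrix (n+m samples x 4^k words)
y = rng.integers(0, 2, size=60)       # stand-in 0/1 CRC labels

# L1-penalised logistic regression as the "LASSO" classifier.
lasso_clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
auc = cross_val_score(lasso_clf, X, y, cv=5, scoring="roc_auc").mean()  # 5-fold CV
```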
Deep Learning (Multilayer Perceptron)
A Multilayer Perceptron (MLP) [67] is a feedforward artificial neural network
(ANN). An MLP is a fully connected network with one input layer, multiple hidden
layers and one output layer; each node i in a given layer connects to node j in the
next layer with weight w_{ij}. Let y_j be the output of the j-th node (j = 1 in our
case) and y_j = Φ(v_j) be the activation function, where v_j is the weighted sum of
the node's input connections. Let e_j(k) = y_j(k) − ŷ_j(k) be the error of the j-th
output node (j = 1 in our case) for the k-th sample, where y_j(k) is the true label
of the k-th sample and ŷ_j(k) is the predicted value. Using gradient descent to
minimize the total error

\[
\Xi(k) = \frac{1}{2}\sum_{j} e_j^2(k),
\]

the change in weight w_{ij} is given by

\[
\Delta w_{ij}(k) = -\eta\,\frac{\partial \Xi(k)}{\partial v_j(k)}\,y_i(k),
\]

where y_i is the output of node i in the previous layer and η is the learning rate.
After the weights of the layer next to the output layer are estimated, the weights
of the previous layers are then estimated by the backpropagation approach.
For the activation function, we used the most commonly applied hyperbolic tangent
function Φ(v_j) = tanh(v_j) and set the learning rate to η = 1. Since the
number of hidden layers and the number of neurons in each hidden layer are
difficult to choose a priori, we treated them as hyperparameters and determined them
by 5-fold cross-validation. The number of hidden layers is chosen from 1-, 2- and
3-hidden-layer models; the number of neurons (nodes) in each hidden layer is
chosen from 32, 64 and 128. Based on the performances on the validation sets, we
choose the optimal model and test it on the testing data.
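A minimal sketch of this model selection, using scikit-learn's MLPClassifier with a tanh activation as a stand-in for the thesis implementation (the learning-rate setting η = 1 is not reproduced here, and the small X and y are placeholders):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((100, 256))            # small placeholder matrix (real features have 4**k columns)
y = rng.integers(0, 2, size=100)      # placeholder labels

# Candidate architectures: 1-3 hidden layers, each with 32, 64 or 128 neurons.
param_grid = {"hidden_layer_sizes": [(n,) * d for d in (1, 2, 3) for n in (32, 64, 128)]}
mlp = MLPClassifier(activation="tanh", max_iter=300)
search = GridSearchCV(mlp, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)                      # 5-fold CV picks the best architecture
```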
3.2.6 Bootstrapping and subsampling
Different from the OTU (Operational Taxonomic Unit)-profiling based methods
which use 16S rRNA data, our study focuses on whole genome sequencing data.
Even after the filtration of human DNA as in Section 3.2.4, the total number of reads
in each of the paired-end samples is approximately 10,000,000 bps (∼10,000 kbps),
which is still quite large. We hope to downsize the sequencing data in such a way
that the downsized data remain sufficient to represent the k-mer distribution of
the original data. In that case, not only would the data preprocessing time be
vastly reduced, but CRC screening could potentially be done with shallow
sequencing at lower cost.
In their study of predicting environments/host phenotypes of microbial
communities, Asgari et al. [68] investigated the sufficiency of using a shallow
subsample instead of the entire 16S rRNA sequence data through a bootstrapping
framework. They then used the shallow subsampled 16S rRNA data for environment/host
phenotype prediction for the microbial communities, and obtained better performance
compared with the Operational Taxonomic Unit (OTU)-based methods.
Suppose a dataset contains M samples (paired-end fastq files) in total. Asgari
et al. [68] proposed the following bootstrapping framework:
1. Randomly select N reads (called a subsample) with replacement from each
sample in the dataset, where N is the subsampling size (a minimal sketch of this
step is given after this list).
2. Extract k-mer-based features from the subsampled N reads for each sample
in the dataset and then apply the supervised learning methods with respect to
the objective.
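The sketch below illustrates the subsampling step; reads are plain sequence strings and FASTQ parsing is omitted.

```python
import random

def subsample_reads(reads, n_reads, seed=0):
    """Draw a bootstrap subsample of n_reads reads, with replacement,
    from one sample's list of reads."""
    rng = random.Random(seed)
    return rng.choices(reads, k=n_reads)

# Example: one subsample of size N = 10,000 from a (placeholder) read list.
sample_reads = ["ACGTACGTAC", "TTGCAGGTCA", "GGCTAACGTT"]  # stand-in reads
subsample = subsample_reads(sample_reads, n_reads=10_000)
```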
In order to measure how well the subsamples represent the original samples,
Asgari et al. [68] proposed two metrics:
1. Self-consistency: if the subsampling is repeated N_R times independently,
the k-mer distributions of the N_R subsamples of a given original sample
should be consistent with each other.
2. Representativeness: if the subsampling is repeated N_R times independently,
the k-mer distributions of the N_R subsamples of a given sample should be
as similar to the k-mer distribution of the original sample as possible.
For quantification, for each sample i in the original dataset, they further
defined the self-inconsistency \bar{D}^S_i and the unrepresentativeness \bar{D}^R_i as follows:

\[
\bar{D}^S_i = \frac{1}{N_R(N_R-1)}\sum_{\substack{(p,q)\in\{1,2,\cdots,N_R\}\\ p\neq q}} D_{KL}\big(\theta^p_i(k),\,\theta^q_i(k)\big),
\]
\[
\bar{D}^R_i = \frac{1}{N_R}\sum_{p\in\{1,2,\cdots,N_R\}} D_{KL}\big(\theta^p_i(k),\,\theta_i(k)\big),
\]

where θ_i(k) is the k-mer distribution of the i-th original sample in the dataset;
θ^p_i(k) and θ^q_i(k) are the k-mer distributions of the p-th and q-th subsamples of the
i-th original sample, respectively; and D_{KL}(θ_1, θ_2) is the Kullback–Leibler divergence
between distributions θ_1 and θ_2 (see Section B.1.3).

Over all M samples in the dataset, the overall self-inconsistency and unrepresentativeness
are then defined as:

\[
\bar{D}^S = \frac{1}{M}\sum_{i=1}^{M}\bar{D}^S_i,\qquad
\bar{D}^R = \frac{1}{M}\sum_{i=1}^{M}\bar{D}^R_i.
\]
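A minimal sketch of these two quantities for a single sample, assuming the k-mer distributions are stored as NumPy arrays; the small pseudocount used to avoid infinite KL terms for unseen k-mers is an implementation assumption not spelled out in the definition above.

```python
import numpy as np
from itertools import permutations
from scipy.stats import entropy  # entropy(p, q) = KL(p || q)

def inconsistency_scores(theta_subs, theta_full, eps=1e-12):
    """Per-sample self-inconsistency and unrepresentativeness.
    theta_subs: (N_R, 4**k) array of subsample k-mer distributions.
    theta_full: (4**k,) array with the original sample's k-mer distribution."""
    subs = theta_subs + eps
    subs = subs / subs.sum(axis=1, keepdims=True)
    full = (theta_full + eps) / (theta_full + eps).sum()
    n_r = subs.shape[0]
    d_s = np.mean([entropy(subs[p], subs[q]) for p, q in permutations(range(n_r), 2)])
    d_r = np.mean([entropy(subs[p], full) for p in range(n_r)])
    return d_s, d_r

# Toy usage with N_R = 5 subsample distributions over 4**3 = 64 trinucleotides.
rng = np.random.default_rng(0)
theta_subs = rng.dirichlet(np.ones(64), size=5)
theta_full = rng.dirichlet(np.ones(64))
d_s, d_r = inconsistency_scores(theta_subs, theta_full)
```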
Following Asgari et al. [68], for our study we select the subsampling sizes
N ∈ {10 kbps, 50 kbps, 400 kbps, 2000 kbps}, which cover the total number of reads
in the original paired-end sequences at the 0.1%, 0.5%, 4% and 20% levels.
For k-mer lengths k ∈ {3, 6, 9}, we investigate the self-inconsistency and
unrepresentativeness of the subsamplings for each individual dataset as well as
for all three datasets combined, with different subsampling sizes. In our study,
we set the number of repeated subsamplings to N_R = 5.
Lastly, we repeat our feature engineering and supervised learning prediction
pipeline on the subsamples and compare the corresponding performances with the
performances without subsampling.
3.2.7 Evaluation Criteria
Based on the three independent datasets, we define the following three
data partitioning scenarios (Table 3.3) to evaluate the performance of the 7
supervised learning techniques and the 4 features we defined:
1. Intra-dataset evaluation: For each of the three datasets (NatureCom,
PlosOne and MolSysBio), the full dataset is randomly split into training and
testing data at a ratio of 60% to 40%. After the random split, we train the model
on the training data and test it on the testing data.
2. Cross-datasets evaluation: For the three datasets (NatureCom, PlosOne
and MolSysBio), we fix one dataset as the training data and test the model
on each of the other two datasets.
3. Combined-datasets evaluation: For the three datasets (NatureCom,
PlosOne and MolSysBio), we select two datasets and combine them as the
training data; the model is then tested on the remaining dataset.
For all 3 scenarios, when training the prediction model for each specific
combination of supervised learning method and feature, we use 5-fold cross-validation
on the training data. After the model is trained, the AUC score [42] is calculated
based on the prediction scores of the testing data. This training/testing data
splitting procedure is repeated 20 times, and we then calculate the mean AUC
score and the confidence interval (CI) of the AUC scores based on the 20 repeats.
Table 3.3: Data partitionings in 3 different evaluation scenarios. This table shows the data partitionings
for evaluating different supervised learning techniques and features in the intra-dataset, cross-datasets and
combined-datasets scenarios.

Scenario             Training Data              Testing Data
Intra-dataset        NatureCom (60%)            NatureCom (40%)
                     PlosOne (60%)              PlosOne (40%)
                     MolSysBio (60%)            MolSysBio (40%)
Cross-datasets       NatureCom                  PlosOne
                     NatureCom                  MolSysBio
                     PlosOne                    NatureCom
                     PlosOne                    MolSysBio
                     MolSysBio                  NatureCom
                     MolSysBio                  PlosOne
Combined-datasets    NatureCom & PlosOne        MolSysBio
                     NatureCom & MolSysBio      PlosOne
                     PlosOne & MolSysBio        NatureCom
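A minimal sketch of the intra-dataset evaluation loop described in this section, with stratified 60/40 splits as an assumption and a simple normal-approximation confidence interval; the classifier and data are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((150, 64))                 # placeholder features
y = rng.integers(0, 2, size=150)          # placeholder CRC / control labels

aucs = []
for seed in range(20):                    # 20 random 60/40 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

mean_auc = np.mean(aucs)
ci_half_width = 1.96 * np.std(aucs) / np.sqrt(len(aucs))   # approximate 95% CI
```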
3.3 Results
3.3.1 Comparison of different supervised learning methods
and features
Before applying the supervised learning methods, in order to eliminate the
interference introduced by human DNA fragments in the fecal samples, we first filtered
the metagenomic sequences to remove human DNA fragments using the protocol
in Section 3.2.4 for all 3 datasets.
For 7 different supervised learning methods, 4 different feature definitions with
3 different k-mer lengths (k ∈ {3, 6, 9}) and 2 different background models (the
i.i.d. model and the 1st-order Markov chain model), we evaluate the performance of
each combination in 3 different scenarios: intra-dataset, cross-datasets and
combined-datasets.
According to our results, it is difficult to identify an overall best method or
feature, since the optimal method and feature vary across different training and testing
datasets. Such results are consistent with the conclusion of LaPierre
et al. [69] in evaluating machine learning for metagenome-based disease prediction.
Though no method or feature outperformed the others with absolute superiority, we
still observed that features F_1 and F_4 at k-mer lengths k = 6 and k = 9 under the
i.i.d. background model with LASSO and SVM (linear kernel) performed the best
or close to the best (see Tables 3.4 and 3.5) in most cases.
Besides, across the three scenarios, we also noticed that the AUC scores of the
intra-dataset scenario outperformed those of the cross- and combined-datasets scenarios.
For the combined-datasets scenario, although the training data is enlarged compared
with the cross-datasets scenario, there is no significant improvement in the AUC
scores for CRC prediction.
Table 3.4: Best-performing features and methods. This table summarizes the best-performing features,
k-mer lengths and supervised learning methods for the intra-dataset, cross-datasets and combined-datasets scenarios.
NC, PO and MSB are shortened forms of the NatureCom, PlosOne and MolSysBio datasets; SVM (LN) and SVM
(RBF) are shortened forms of SVM with linear and RBF kernels, respectively.

Testing Data   Training Data   Scenario            Method       k   Feature          AUC score
NC             NC              Intra-dataset       LASSO        9   F_4              0.919
NC             PO              Cross-datasets      Logistic     9   F_2 (m = 1)      0.653
NC             MSB             Cross-datasets      LASSO        6   F_1              0.754
NC             PO & MSB        Combined-datasets   SVM (RBF)    6   F_1              0.720
PO             PO              Intra-dataset       SVM (LN)     6   F_1              0.688
PO             NC              Cross-datasets      MLP          6   F_1              0.610
PO             MSB             Cross-datasets      LASSO        9   F_1              0.653
PO             NC & MSB        Combined-datasets   SVM (LN)     6   F_2, F_3, F_4    0.625
MSB            MSB             Intra-dataset       SVM (RBF)    6   F_2, F_3, F_4    0.805
MSB            NC              Cross-datasets      MLP          6   F_1              0.692
MSB            PO              Cross-datasets      LASSO        9   F_1              0.739
MSB            NC & PO         Combined-datasets   LASSO        6   F_1              0.745
Table 3.5: Performance of LASSO and SVM (linear kernel). This table compares the highest AUC scores
over all feature-method combinations with the AUC scores of F_1 and F_4 at k = 6 and k = 9 under the i.i.d.
model for LASSO and SVM (linear kernel) in the intra-dataset, cross-datasets and combined-datasets scenarios.

                                          LASSO                             SVM (linear kernel)
Testing   Training     Highest   F_1     F_4     F_1     F_4       F_1     F_4     F_1     F_4
Data      Data                   k=6     k=6     k=9     k=9       k=6     k=6     k=9     k=9
NC        NC           0.919     0.876   0.887   0.906   0.919     0.718   0.729   0.776   0.768
NC        PO           0.653     0.608   0.534   0.568   0.515     0.516   0.591   0.575   0.632
NC        MSB          0.754     0.754   0.681   0.677   0.696     0.658   0.690   0.692   0.673
NC        PO & MSB     0.720     0.671   0.670   0.628   0.634     0.681   0.711   0.713   0.701
PO        PO           0.688     0.540   0.491   0.599   0.574     0.660   0.513   0.653   0.601
PO        NC           0.610     0.532   0.525   0.470   0.527     0.489   0.480   0.522   0.512
PO        MSB          0.653     0.602   0.589   0.653   0.634     0.525   0.589   0.567   0.618
PO        NC & MSB     0.625     0.623   0.586   0.617   0.583     0.525   0.625   0.548   0.624
MSB       MSB          0.805     0.766   0.747   0.733   0.718     0.644   0.798   0.738   0.782
MSB       NC           0.692     0.591   0.600   0.471   0.475     0.584   0.581   0.482   0.576
MSB       PO           0.739     0.684   0.555   0.739   0.630     0.653   0.691   0.711   0.726
MSB       NC & PO      0.745     0.745   0.713   0.738   0.685     0.688   0.742   0.581   0.710
Intra-dataset CRC prediction
In the intra-dataset scenario, the training data and testing data are split within
each single dataset. Figures 3.3, 3.4 and 3.5 show the resulting AUC scores
for the NatureCom, PlosOne and MolSysBio datasets, respectively. For each of the
3 datasets, the subfigures on the left and right show AUC bars under the i.i.d.
background model and the 1st-order Markov chain background model, respectively.
From top to bottom, the subfigures show AUC scores for k-mer lengths k = 3, k = 6
and k = 9.
For intra-dataset prediction overall, the i.i.d. background model consistently
outperformed the 1st-order Markov chain background model. Besides, the AUC
scores for k-mer length k = 3 underperformed those for k = 6 and k = 9. From the
perspective of supervised learning methods and features, the best-performing ones
vary across datasets.
1. For the NatureCom [56] dataset specifically, as can be observed from Figure 3.3
and Table 3.4, the performances of different features are quite close to each
other across different methods and k-mer lengths. Among the 7 supervised
learning methods, LASSO performed significantly better than the others and
reached its best at k-mer length k = 9. The average AUC scores for F_1,
F_2, F_3 and F_4 under the i.i.d. background at k = 9 with LASSO are 0.906,
0.904, 0.904 and 0.919, respectively.
2. For the PlosOne dataset [60], the processing procedure of the original samples
is significantly different from that of the other two datasets, which might be the
main factor causing the low prediction performance. The fecal samples
in this dataset were collected, freeze-dried and kept in cryopreservation for more
than 25 years before sequencing. Based on the results in Figure 3.4 and
Table 3.4, F_1 performed slightly better than the other features for k = 6 and
k = 9. However, it is hard to identify a supervised learning method that
performed overwhelmingly better than the others. Among all feature and method
combinations, the optimal AUC score (0.688) is achieved by F_1 with SVM
with linear kernel at k-mer length k = 6.
3. For the MolSysBio dataset [55], according to the results in Figure 3.5 and Table
3.4, F_2, F_3 and F_4 performed slightly better than F_1 for k-mer lengths k = 6
and k = 9. Among all supervised learning methods, SVM with RBF kernel
performed the best; the corresponding AUC score for SVM (RBF kernel) is
∼0.805 for F_2, F_3 and F_4 with k = 6.
For all three datasets, comparing the optimal AUC scores among all
feature-method combinations with the AUC scores of LASSO and SVM (linear
kernel) at either k = 6 or k = 9 under the i.i.d. background model (Table 3.5),
we see that LASSO and SVM (linear kernel) at these settings perform among, or
close to, the best-performing combinations.
Figure 3.3: Performance of different methods and features for the intra-NatureCom dataset. This
figure shows average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and
1st-order Markov chain models over 4 different feature definitions and supervised learning methods including logistic
regression, LASSO, random forest (RF), SVM with linear kernel, polynomial kernel and RBF kernel (SVM(LN),
SVM(PLN), SVM(RBF)) as well as multilayer perceptron (MLP) within the NatureCom (NC) dataset. The star on top
of each AUC bar indicates the highest AUC score over all 20 training data/testing data splittings.
Figure 3.4: Performance of different methods and features for the intra-PlosOne dataset. This figure
shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and
1st-order Markov chain models over 4 different feature definitions and supervised learning methods including logistic
regression, LASSO, random forest (RF), SVM with linear kernel, polynomial kernel and RBF kernel (SVM(LN),
SVM(PLN), SVM(RBF)) as well as multilayer perceptron (MLP) within the PlosOne (PO) dataset. The star on top of
each AUC bar indicates the highest AUC score over all 20 training data/testing data splittings.
Figure 3.5: Performance of different methods and features for the intra-MolSysBio dataset. This figure
shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and
1st-order Markov chain models over 4 different feature definitions and supervised learning methods including logistic
regression, LASSO, random forest (RF), SVM with linear kernel, polynomial kernel and RBF kernel (SVM(LN),
SVM(PLN), SVM(RBF)) as well as multilayer perceptron (MLP) within the MolSysBio (MSB) dataset. The star on
top of each AUC bar indicates the highest AUC score over all 20 training data/testing data splittings.
Cross-datasets CRC prediction
In the cross-datasets scenario, we fix one dataset as the training data and then test
the model on the other two datasets. As in the intra-dataset scenario, it is
difficult to identify an overall optimal feature and supervised learning method,
so we again analyze each training data and testing data combination separately.
1. With the NatureCom dataset fixed as the training data (Figures 3.6-3.7, Table
3.4), feature F_1 at k-mer length k = 6 (the same for the i.i.d. and 1st-order
Markov chain background models) has the best performance with the
multilayer perceptron for both the PlosOne dataset (AUC score: 0.610) and the
MolSysBio dataset (AUC score: 0.692) as testing data.
2. With the PlosOne dataset as training data and the NatureCom dataset as testing
data (Figure 3.8, Table 3.4), logistic regression performed slightly better than the
other methods with F_2 (AUC score: 0.653) at k = 9 under the 1st-order Markov
background model. When tested on the MolSysBio dataset (Figure 3.9), LASSO
is the best-performing method with F_1 (AUC score: ∼0.739).
3. With the MolSysBio dataset as training data (Figures 3.10-3.11, Table 3.4),
the overall performances of k = 6 and k = 9 are quite similar. If the model is
tested on the NatureCom dataset, F_1 at k = 6 under the i.i.d. background model
has the optimal AUC score with LASSO (AUC score: 0.754); if tested on the
PlosOne dataset, different methods performed quite similarly, and the optimal AUC
score is 0.653 with F_1 and LASSO at k = 9 under the i.i.d. background model.
In summary, for the cross-datasets scenario, LASSO with F_1 performed best for
most training data and testing data combinations (Table 3.5). However, when the
NatureCom dataset is used as the training data, the multilayer perceptron
outperforms the other methods for both the PlosOne and MolSysBio datasets
as testing data.
Figure 3.6: Performance of different methods and features for cross-NatureCom-PlosOne-datasets.
This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and
1st-order Markov chain models over 4 different feature definitions and supervised learning methods based on using the
NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The star on top of each AUC
bar indicates the highest AUC score over all 50 training data/testing data splittings.
Figure 3.7: Performance of different methods and features for cross-NatureCom-MolSysBio-datasets.
This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and
1st-order Markov chain models over 4 different feature definitions and supervised learning methods based on using the
NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The star on top of each
AUC bar indicates the highest AUC score over all 50 training data/testing data splittings.
Figure 3.8: Performance of different methods and features for cross-PlosOne-NatureCom-datasets.
This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and
1st-order Markov chain models over 4 different feature definitions and supervised learning methods based on using the
PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The star on top of each AUC
bar indicates the highest AUC score over all 50 training data/testing data splittings.
Figure 3.9: Performance of different methods and features for cross-PlosOne-MolSysBio-datasets.
This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and
1st-order Markov chain models over 4 different feature definitions and supervised learning methods based on using the
PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The star on top of each
AUC bar indicates the highest AUC score over all 50 training data/testing data splittings.
Figure 3.10: Performance of different methods and features for cross-MolSysBio-NatureCom-datasets.
This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and
1st-order Markov chain models over 4 different feature definitions and supervised learning methods based on using the
MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The star on top of each
AUC bar indicates the highest AUC score over all 50 training data/testing data splittings.
Figure 3.11: Performance of different methods and features for cross-MolSysBio-PlosOne-datasets.
This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and
1st-order Markov chain models over 4 different feature definitions and supervised learning methods based on using the
MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The star on top of each
AUC bar indicates the highest AUC score over all 50 training data/testing data splittings.
Combined-datasets CRC prediction
In the combined-datasets scenario, we combine two datasets as the training data and
test the model on the remaining dataset. Based on the three datasets, we have three
different cases of splitting training data and testing data.
According to the AUC scores in Figures 3.12, 3.13 and 3.14, we consistently
observe that, for most of the supervised learning methods, the performances at
k = 6 are as good as or even better than those at k = 9 under the i.i.d. model.
Besides, SVM with linear kernel is among the best methods for all three cases.
Lastly, F_2, F_3 and F_4 performed quite similarly and are slightly better than F_1.
Compared with cross-datasets prediction, the combined-datasets scenario has an
enlarged training dataset. We initially speculated that the prediction performance
would improve with the expanded training data; however, the results in Table
3.4 do not support this speculation.
1. Combining the NatureCom and PlosOne datasets as training data and testing the
model on the MolSysBio dataset, F_1 under the i.i.d. model with LASSO at k = 6
has the highest AUC score (0.745), whereas the AUC scores of F_2, F_3 and F_4 for
k = 6 under the i.i.d. model with SVM (linear kernel) are ∼0.742, which is quite
close to the highest AUC score. Compared with the highest AUC scores in the
cross-datasets scenario (Table 3.4), the optimal AUC score after combining the
NatureCom and PlosOne datasets as training data is slightly higher than the
optimal cross-datasets AUC score (0.739).
2. Combining the NatureCom and MolSysBio datasets as training data and testing
the model on the PlosOne dataset, the highest AUC scores are reached by F_2, F_3
and F_4 with SVM (linear kernel) at k = 6 under the i.i.d. model, at ∼0.625.
Compared with the results in the cross-datasets scenario (Table 3.4), the
optimal AUC score after combining the NatureCom and MolSysBio datasets as
training data lies between the optimal AUC scores obtained when using the
NatureCom dataset alone or the MolSysBio dataset alone as training data.
3. Combining the PlosOne and MolSysBio datasets as training data and testing the
model on the NatureCom dataset, the highest AUC score is 0.720 with SVM (RBF
kernel) for F_1, but the AUC scores of F_2, F_3 and F_4 for k = 6 under the i.i.d.
model with SVM (linear kernel) (∼0.711) are quite close to it. Compared with the
cross-datasets scenario, the optimal AUC score again lies between the two
corresponding optimal cross-datasets AUC scores.
Figure 3.12: Performance of different methods and features for combined-NatureCom-PlosOne-datasets.
This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and
1st-order Markov chain models over 4 different feature definitions and supervised learning methods based on using the
NatureCom (NC) and PlosOne (PO) datasets as training data, and the MolSysBio (MSB) dataset as testing data. The
star on top of each AUC bar indicates the highest AUC score over all 50 training data/testing data splittings.
Figure 3.13: Performance of different methods and features for combined-NatureCom-MolSysBio-datasets.
This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and
1st-order Markov chain models over 4 different feature definitions and supervised learning methods based on using the
NatureCom (NC) and MolSysBio (MSB) datasets as training data, and the PlosOne (PO) dataset as testing data. The
star on top of each AUC bar indicates the highest AUC score over all 50 training data/testing data splittings.
Figure 3.14: Performance of different methods and features for combined-PlosOne-MolSysBio-datasets.
This figure shows the average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and
1st-order Markov chain models over 4 different feature definitions and supervised learning methods based on using the
PlosOne (PO) and MolSysBio (MSB) datasets as training data, and the NatureCom (NC) dataset as testing data. The
star on top of each AUC bar indicates the highest AUC score over all 50 training data/testing data splittings.
3.3.2 Subsampling of metagenomic sequencing data and
CRC prediction with subsampled data
In their study of predicting environments/host phenotypes of microbial
communities, Asgari et al. [68] demonstrated that, with shallow subsamples of 16S rRNA
sequence data, using k-mer normalized frequencies as features (equivalent to our
feature F_1) for supervised learning methods outperforms the Operational
Taxonomic Unit (OTU)-based methods. They proposed a bootstrapping framework to
estimate the sufficiency of using subsamples to represent the complete 16S rRNA
sequence data with various sampling sizes for different k-mer lengths.
Following Asgari et al. [68], we select the subsampling sizes N ∈ {10
kbps, 50 kbps, 400 kbps, 2000 kbps}, which cover the total number of reads
in the original paired-end sequences at the 0.1%, 0.5%, 4% and 20% levels. For each
sampling size, we repeated the bootstrapping and subsampling for R = 5
rounds. The corresponding self-inconsistency and unrepresentativeness (Section
3.2.6) are calculated based on R = 5, and the AUC scores in CRC prediction with
subsampled metagenomic sequences are averaged over the 5 rounds.
Evaluation of subsampled metagenomic sequencing data
In order to validate the assertion that shallow subsamples of the whole metagenomic
sequences are sufficient for predicting CRC, and may even improve the prediction
accuracy, we performed our CRC prediction pipeline on the subsamples
with subsampling sizes N ∈ {10 kbps, 50 kbps, 400 kbps, 2000 kbps}.
Before applying our CRC prediction pipeline to the subsampled metagenomic
sequences, we first investigated the self-inconsistency and unrepresentativeness
(Section 3.2.6) of the subsamplings for the NatureCom, PlosOne and
MolSysBio datasets, as well as for the combination of all three datasets, with
different subsampling sizes (Figure 3.15), following Asgari et al. [68]'s approach.
Based on the definitions in Section 3.2.6, a lower self-inconsistency score
indicates a more stable k-mer distribution across the bootstrapping subsamples,
and a lower unrepresentativeness score means that the k-mer distributions of the
subsamples are closer to the k-mer distribution of the original sample.
As can be observed from Figure 3.15, for all three datasets (NatureCom,
PlosOne and MolSysBio) as well as their combination, the trends of self-inconsistency
for k = 6 and k = 9 are similar to each other and significantly different from
k = 3. For k = 6 and k = 9, once the sampling size is above 50 kbps, the self-
inconsistency does not fluctuate drastically. From the unrepresentativeness perspective,
k = 6 and k = 9 are again similar to each other and significantly different from
k = 3. The unrepresentativeness value is quite stable across different sampling
sizes, but a slight minimum can be observed at sampling size 50 kbps for each
single dataset as well as for the combination of all three datasets.
From the trends of the self-inconsistency and the unrepresentativeness in our
study, we conclude that the subsampling becomes more stable once the subsampling
size is greater than or equal to 50 kbps. Besides, from the k-mer distribution
perspective, subsamples with sampling size 50 kbps should be the best representation
of the original metagenomic sequences, though the differences in unrepresentativeness
among different sampling sizes are not significant.
Figure 3.15: The self-inconsistency and unrepresentativeness of subsampling with sample size N ∈ {10
kbps, 50 kbps, 400 kbps, 2000 kbps} for k-mer lengths k ∈ {3, 6, 9}.
NC, PO, MSB and CB are the shortened forms for the NatureCom, PlosOne,
MolSysBio and combined datasets.
CRC prediction with subsampled data
In order to validate the assertion that shallow subsamples of the whole metagenomic
sequences are sufficient for predicting CRC, and may even improve the prediction
accuracy, after subsampling the metagenomic sequences with sampling sizes
N ∈ {10 kbps, 50 kbps, 400 kbps, 2000 kbps} we repeated the feature extraction
pipeline and applied all 7 supervised learning methods, with the different k-mer
lengths and background models, to the subsampled data.
As can be observed from the summarized AUC bar plots with different sampling
sizes in Section B.3.2, B.3.2 and B.3.2 (for intra-dataset, cross-datasets
and combined-datasets, respectively), the overall AUC scores based on the
sampled metagenomic sequences are quite close to the AUC scores based on the
whole metagenomic sequences. However, it is hard to identify a sampling
size whose corresponding CRC prediction always outperforms the other
sampling sizes.
Tables 3.6, 3.7 and 3.8 summarize the best-performing sampling sizes, feature
definitions and background models at k-mer lengths k = 6 and k = 9 for
each supervised learning method in the intra-dataset, cross-datasets and combined-
datasets scenarios, respectively. From the results in the tables, we observed that
in most cases the AUC performance of the subsampled sequences is not superior to
that of the whole metagenomic sequences. However, we
noticed that when the PlosOne dataset is fixed as the testing data, in all three
scenarios (intra-dataset, cross-datasets and combined-datasets), the AUC scores
can be improved significantly.
In Table 3.6, the best AUC score for the intra-PlosOne dataset is 0.688 with SVM
(linear kernel) for F_1 at k = 6. With sampling size 10 kbps at k = 6, or sampling
size 2000 kbps at k = 9, with F_4 under the 1st-order Markov background model,
the AUC score of SVM (linear kernel) can be increased to 0.733.
In Table 3.7, using the NatureCom dataset as the training data and testing
on the PlosOne dataset, the optimal AUC score can be elevated from 0.610 (the
highest score using the whole metagenomic sequences) to 0.689 by using sampled
sequences with sampling size 2000 kbps for logistic regression with F_3 at k = 9
under the 1st-order Markov background model. A similar situation occurred
when using the MolSysBio dataset as the training data and testing on the PlosOne dataset.
In the combined-datasets scenario, as shown in Table 3.8, the optimal AUC
score can be improved from 0.625 (using the whole metagenomic sequences and SVM
with linear kernel for F_4 at k = 6 under the i.i.d. model) to 0.722 (using sampled
metagenomic sequences with sampling size 50 kbps and LASSO for F_4 at k = 6
under the 1st-order Markov background model).
As we have mentioned in Section 3.3.1, the PlosOne dataset is quite different
from the NatureCom and MolSysBio datasets. The original fecal samples of the
PlosOne dataset were kept in cryopreservation for at least 25 years before sequencing,
and we speculate that certain DNA fragments might be missing or damaged
in the metagenomic sequencing data of the PlosOne dataset. From our results
in Tables 3.6, 3.7 and 3.8, we may conclude that, compared with using the
whole genomic sequencing data, the shallow subsamples of the whole
metagenomic sequences are more powerful when predicting CRC based
on supervised learning on metagenomic sequences with missing data.
Table 3.6: Best-performing features and methods with subsampling in the intra-dataset scenario. This
table summarizes the best-performing features, background models and corresponding AUC scores at k = 6 and
k = 9 for each supervised learning method, based on subsampled metagenomic sequences with sampling sizes 10 kbps,
50 kbps, 400 kbps and 2000 kbps as well as the non-sampled sequences (wgs), in the intra-dataset scenario. NC, PO and
MSB are shortened forms of the NatureCom, PlosOne and MolSysBio datasets; SVM(LN), SVM(PL), SVM(RBF), RF
and MLP are shortened forms of SVM with linear kernel, SVM with polynomial kernel, SVM with RBF kernel,
random forest and multilayer perceptron, respectively. F and m denote the feature and the Markov order of the
background model.

Training/Testing: NC
Method      k = 6: size   F     m   AUC      k = 9: size   F     m   AUC
Logistic    50 kbps       F_1   0   0.771    50 kbps       F_1   0   0.777
LASSO       wgs           F_4   0   0.887    wgs           F_4   0   0.919
RF          10 kbps       F_1   0   0.690    wgs           F_1   0   0.695
SVM(LN)     50 kbps       F_4   0   0.758    2000 kbps     F_1   0   0.803
SVM(PL)     wgs           F_4   0   0.770    wgs           F_1   0   0.762
SVM(RBF)    wgs           F_4   0   0.744    2000 kbps     F_1   0   0.800
MLP         50 kbps       F_4   1   0.667    wgs           F_1   0   0.716
Best-performing in wgs: 0.919 (LASSO, F_4, k = 9, i.i.d.)

Training/Testing: PO
Method      k = 6: size   F     m   AUC      k = 9: size   F     m   AUC
Logistic    50 kbps       F_2   1   0.641    wgs           F_1   0   0.679
LASSO       2000 kbps     F_3   1   0.620    wgs           F_1   0   0.679
RF          400 kbps      F_4   1   0.669    400 kbps      F_4   1   0.677
SVM(LN)     10 kbps       F_4   1   0.733    2000 kbps     F_4   1   0.733
SVM(PL)     50 kbps       F_4   1   0.661    400 kbps      F_4   1   0.692
SVM(RBF)    10 kbps       F_4   1   0.727    2000 kbps     F_4   1   0.728
MLP         2000 kbps     F_1   0   0.665    400 kbps      F_4   1   0.581
Best-performing in wgs: 0.688 (SVM(LN), F_1, k = 6, i.i.d.)

Training/Testing: MSB
Method      k = 6: size   F     m   AUC      k = 9: size   F     m   AUC
Logistic    400 kbps      F_4   0   0.769    2000 kbps     F_4   0   0.773
LASSO       2000 kbps     F_1   0   0.777    wgs           F_1   0   0.750
RF          wgs           F_4   0   0.746    wgs           F_1   0   0.744
SVM(LN)     wgs           F_4   0   0.798    2000 kbps     F_4   0   0.795
SVM(PL)     wgs           F_1   0   0.719    10 kbps       F_1   0   0.738
SVM(RBF)    wgs           F_4   0   0.805    2000 kbps     F_4   0   0.797
MLP         400 kbps      F_1   0   0.716    2000 kbps     F_4   0   0.691
Best-performing in wgs: 0.805 (SVM(RBF), F_4, k = 6, i.i.d.)
Table 3.7: Best-performing features and methods with subsampling in the cross-datasets scenario. This
table summarizes the best-performing features, background models and corresponding AUC scores at k = 6 and
k = 9 for each supervised learning method, based on subsampled metagenomic sequences with sampling sizes 10 kbps,
50 kbps, 400 kbps and 2000 kbps as well as the non-sampled sequences (wgs), in the cross-datasets scenario. NC, PO and
MSB are shortened forms of the NatureCom, PlosOne and MolSysBio datasets; SVM(LN), SVM(PL), SVM(RBF), RF
and MLP are shortened forms of SVM with linear kernel, SVM with polynomial kernel, SVM with RBF kernel,
random forest and multilayer perceptron, respectively. F and m denote the feature and the Markov order of the
background model.

Training: NC, Testing: PO
Method      k = 6: size   F     m   AUC      k = 9: size   F     m   AUC
Logistic    2000 kbps     F_3   1   0.642    2000 kbps     F_3   1   0.689
LASSO       2000 kbps     F_3   1   0.642    wgs           F_1   0   0.570
RF          400 kbps      F_4   1   0.626    2000 kbps     F_4   1   0.629
SVM(LN)     50 kbps       F_1   0   0.568    2000 kbps     F_2   1   0.580
SVM(PL)     2000 kbps     F_2   1   0.532    10 kbps       F_1   0   0.544
SVM(RBF)    2000 kbps     F_4   1   0.647    wgs           F_1   0   0.571
MLP         2000 kbps     F_4   1   0.674    2000 kbps     F_1   0   0.486
Best-performing in wgs: 0.610 (MLP, F_1, k = 6, i.i.d.)

Training: NC, Testing: MSB
Method      k = 6: size   F     m   AUC      k = 9: size   F     m   AUC
Logistic    wgs           F_3   1   0.689    wgs           F_3   1   0.690
LASSO       wgs           F_4   1   0.651    wgs           F_3   1   0.655
RF          wgs           F_4   1   0.660    wgs           F_4   1   0.659
SVM(LN)     wgs           F_4   1   0.600    wgs           F_4   1   0.632
SVM(PL)     50 kbps       F_1   0   0.634    10 kbps       F_1   0   0.696
SVM(RBF)    wgs           F_4   1   0.639    wgs           F_2   1   0.632
MLP         2000 kbps     F_1   0   0.700    400 kbps      F_4   1   0.582
Best-performing in wgs: 0.692 (MLP, F_1, k = 6, i.i.d.)

Training: PO, Testing: NC
Method      k = 6: size   F     m   AUC      k = 9: size   F     m   AUC
Logistic    wgs           F_3   1   0.584    wgs           F_2   1   0.653
LASSO       wgs           F_1   0   0.608    wgs           F_4   1   0.647
RF          wgs           F_3   1   0.601    wgs           F_4   1   0.644
SVM(LN)     50 kbps       F_4   0   0.639    wgs           F_4   0   0.632
SVM(PL)     wgs           F_4   1   0.608    wgs           F_4   1   0.649
SVM(RBF)    50 kbps       F_1   0   0.595    wgs           F_4   0   0.594
MLP         wgs           F_1   0   0.463    2000 kbps     F_2   1   0.518
Best-performing in wgs: 0.653 (Logistic, F_2, k = 9, 1st-order MC)

Training: PO, Testing: MSB
Method      k = 6: size   F     m   AUC      k = 9: size   F     m   AUC
Logistic    2000 kbps     F_4   0   0.703    wgs           F_1   0   0.713
LASSO       wgs           F_1   0   0.684    wgs           F_1   0   0.739
RF          wgs           F_3   1   0.656    wgs           F_3   1   0.661
SVM(LN)     2000 kbps     F_4   0   0.729    wgs           F_4   0   0.726
SVM(PL)     2000 kbps     F_1   0   0.704    wgs           F_1   0   0.705
SVM(RBF)    2000 kbps     F_4   0   0.729    wgs           F_1   0   0.735
MLP         400 kbps      F_1   0   0.493    wgs           F_3   0   0.614
Best-performing in wgs: 0.739 (LASSO, F_1, k = 9, i.i.d.)

Training: MSB, Testing: NC
Method      k = 6: size   F     m   AUC      k = 9: size   F     m   AUC
Logistic    400 kbps      F_1   0   0.663    wgs           F_4   1   0.710
LASSO       wgs           F_1   0   0.754    wgs           F_1   0   0.687
RF          wgs           F_4   0   0.610    wgs           F_4   0   0.611
SVM(LN)     wgs           F_4   0   0.690    wgs           F_1   0   0.673
SVM(PL)     wgs           F_4   1   0.615    wgs           F_4   1   0.652
SVM(RBF)    wgs           F_1   0   0.695    wgs           F_1   0   0.685
MLP         wgs           F_3   1   0.618    wgs           F_3   1   0.564
Best-performing in wgs: 0.754 (LASSO, F_1, k = 6, i.i.d.)

Training: MSB, Testing: PO
Method      k = 6: size   F     m   AUC      k = 9: size   F     m   AUC
Logistic    2000 kbps     F_3   1   0.653    2000 kbps     F_3   1   0.676
LASSO       2000 kbps     F_3   1   0.675    wgs           F_1   0   0.653
RF          2000 kbps     F_4   1   0.652    2000 kbps     F_4   1   0.627
SVM(LN)     2000 kbps     F_4   1   0.683    400 kbps      F_1   0   0.641
SVM(PL)     2000 kbps     F_4   1   0.690    2000 kbps     F_4   1   0.707
SVM(RBF)    400 kbps      F_4   1   0.663    400 kbps      F_1   0   0.642
MLP         2000 kbps     F_4   1   0.685    2000 kbps     F_4   1   0.583
Best-performing in wgs: 0.653 (LASSO, F_1, k = 9, i.i.d.)
Table 3.8: Best-performing features and methods with subsampling in the combined-datasets scenario.
This table summarizes the best-performing features, background models and corresponding AUC scores at k = 6 and
k = 9 for each supervised learning method, based on subsampled metagenomic sequences with sampling sizes 10 kbps,
50 kbps, 400 kbps and 2000 kbps as well as the non-sampled sequences (wgs), in the combined-datasets scenario. NC, PO
and MSB are shortened forms of the NatureCom, PlosOne and MolSysBio datasets; SVM(LN), SVM(PL), SVM(RBF),
RF and MLP are shortened forms of SVM with linear kernel, SVM with polynomial kernel, SVM with RBF kernel,
random forest and multilayer perceptron, respectively. F and m denote the feature and the Markov order of the
background model.

Training: NC & PO, Testing: MSB
Method      k = 6: size   F     m   AUC      k = 9: size   F     m   AUC
Logistic    wgs           F_1   0   0.713    2000 kbps     F_1   0   0.726
LASSO       wgs           F_1   0   0.745    wgs           F_1   0   0.738
RF          wgs           F_4   1   0.655    2000 kbps     F_4   1   0.640
SVM(LN)     wgs           F_4   0   0.742    2000 kbps     F_4   0   0.724
SVM(PL)     2000 kbps     F_1   0   0.757    50 kbps       F_1   0   0.766
SVM(RBF)    wgs           F_1   0   0.730    50 kbps       F_1   0   0.728
MLP         wgs           F_1   0   0.699    50 kbps       F_4   1   0.621
Best-performing in wgs: 0.745 (LASSO, F_1, k = 6, i.i.d.)

Training: NC & MSB, Testing: PO
Method      k = 6: size   F     m   AUC      k = 9: size   F     m   AUC
Logistic    2000 kbps     F_4   1   0.676    2000 kbps     F_2   1   0.655
LASSO       50 kbps       F_4   1   0.722    wgs           F_1   0   0.617
RF          2000 kbps     F_4   1   0.621    2000 kbps     F_4   1   0.603
SVM(LN)     2000 kbps     F_4   0   0.649    wgs           F_4   0   0.624
SVM(PL)     2000 kbps     F_1   0   0.590    wgs           F_4   0   0.593
SVM(RBF)    wgs           F_1   0   0.569    50 kbps       F_1   0   0.612
MLP         wgs           F_1   0   0.615    wgs           F_2   0   0.550
Best-performing in wgs: 0.625 (SVM(LN), F_4, k = 6, i.i.d.)

Training: PO & MSB, Testing: NC
Method      k = 6: size   F     m   AUC      k = 9: size   F     m   AUC
Logistic    wgs           F_4   0   0.676    2000 kbps     F_4   0   0.690
LASSO       wgs           F_4   0   0.634    wgs           F_1   0   0.715
RF          2000 kbps     F_4   1   0.584    wgs           F_4   1   0.596
SVM(LN)     wgs           F_4   0   0.711    wgs           F_1   0   0.713
SVM(PL)     wgs           F_4   1   0.613    wgs           F_4   1   0.647
SVM(RBF)    wgs           F_1   0   0.720    wgs           F_1   0   0.716
MLP         2000 kbps     F_3   1   0.619    wgs           F_2   1   0.543
Best-performing in wgs: 0.720 (SVM(RBF), F_1, k = 6, i.i.d.)
3.3.3 Comparison of k-mer based method and OTU-based
methods
We first compare our k-mer-based CRC prediction results with those of the original
studies in which the three datasets were first produced. In Table 3.9, for the
NatureCom [56] dataset, even though the original AUC score is as high as 0.96, we
noticed an overfitting issue in how the original model was built, as the testing data
were included while training the model, so their evaluation might not be correct.
The AUC score for the original study of the PlosOne [60] dataset is not available
because that study did not focus on predicting CRC. For the MolSysBio dataset [55],
the AUC score of the original study outperformed our k-mer-based approach.
Table 3.9: Performance of the alignment-free method v.s. performance of the original studies. This
table compares prediction accuracies, in terms of AUC scores, between our k-mer-based approach and the approaches
used in the original studies where the datasets were first produced. AUC scores in parentheses are those cases where
subsampled metagenomic sequencing data gave a higher AUC score than using the whole metagenomic sequences.

Dataset       Highest AUC score of        Highest AUC score of
              original approach           k-mer-based approach
NatureCom     0.96                        0.92
PlosOne       NA                          0.69 (0.73)
MolSysBio     0.87                        0.81
In a parallel study by Wirbel et al. [58], they first conducted an Operational
Taxonomic Unit (OTU) analysis to identify the microbial species from the
metagenomic sequencing data; next, from the total of 849 species they selected a
set of 94 microbial species based on how significantly their relative abundances
differentiate the CRC versus control groups; lastly, they built a LASSO classifier
with features based on the abundance profiles of the 94 selected microbial species
and evaluated it across datasets. The resulting AUC scores were satisfactory despite
being lower than their intra-cohort evaluation results (Table 3.10).
Thomas et al. [59] performed a very similar analysis to Wirbel et al. [58]'s
study, with some differences in data pre-processing details. Also using features
based on the microbiome abundance profiles, they built a random forest
classifier for CRC patients versus controls. This study obtained AUC
scores comparable to Wirbel et al.'s (Table 3.10).
Comparing with the performances of Wirbel et al. [58] and Thomas et al. [59],
we found that our alignment-free approach is actually very competitive according
to the AUC scores in Table 3.9. In particular, even without subsampling,
our alignment-free approach outperformed the OTU-based methods for the PlosOne
dataset with missing data.
Table 3.10: Comparison of performance for OTU-based methods and the alignment-free method. This
table compares accuracies, in terms of AUC scores, of our k-mer-based CRC prediction method (feature F_1 with
logistic regression under the i.i.d. model) and the OTU-based methods for the intra-dataset, cross-datasets and
combined-datasets evaluations. Rows are training datasets and columns are testing datasets. AUC scores in parentheses
are those cases where subsampled metagenomic sequencing data outperformed using the whole metagenomic sequences.

Approach                     Training data    NatureCom   PlosOne       MolSysBio
Alignment-free approach      NatureCom        0.92        0.61 (0.69)   0.69 (0.70)
                             PlosOne          0.65        0.69 (0.73)   0.74
                             MolSysBio        0.75        0.65 (0.71)   0.81
                             Combined         0.72        0.63 (0.72)   0.75 (0.77)
Thomas et al.'s approach     NatureCom        0.92        0.60          0.79
                             PlosOne          0.77        0.63          0.75
                             MolSysBio        0.76        0.59          0.87
                             Combined         NA          NA            NA
Wirbel et al.'s approach     NatureCom        0.92        0.59          0.62
                             PlosOne          0.78        0.74          0.76
                             MolSysBio        0.76        0.64          0.85
                             Combined         0.86        0.80          0.84
Chapter 4
Conclusions
4.1 Virus-host infectious association
In this project, we developed methods to predict whether a given viral DNA sequence
(genome or large contig) comes from a virus that infects a particular host. First of
all, we implemented five supervised learning methods, including LASSO, gradient
boosting regression, SVM with RBF kernel, RF and naive Bayes (Gaussian
and Bernoulli), with four features proposed based on word frequencies under various
orders of Markov chains as background models for viral sequences. We compared
the different machine learning methods and feature representations based on
nine host genera with at least 45 associated infecting viruses. We concluded
that RF outperforms the other methods, with less dependence on the sequence background
model. Among the four proposed feature representations, the relative word frequency
representation (first feature) has the benefit of simplicity and has better or similar
performance compared with the other features. Besides, for word length selection, we compared
the performance of RF using k = 4, k = 6 and k = 8. For 6 out of the 9 host genera,
the performance of RF based on k = 6 is the best. For the host genera Escherichia,
Mycobacterium and Staphylococcus, k = 4 performed slightly better than k = 6.
When choosing word length k = 8, the performance of RF is lowered for all nine
host genera.
Second, for all nine main host genera, we studied the effect of contig length
on the performance of RF for predicting the virus-host infectious relationship.
According to our simulation results, constructing the model using contigs with
lengths of 3 kbps or 5 kbps performs generally well for contigs with lengths from 1
kbps up to the whole genome.
Third, we developed a maximum likelihood approach for estimating the fraction
of viruses infecting a bacterial host in viral tagging experiments [26] based on word
frequencies. We focused on two types of viruses: T4-like viruses and non-T4-like
viruses. We showed that about 95% of the identified T4-like viruses appear to
infect Synechococcus. On the other hand, only about 29% of the identified non-
T4-like viruses and 30% of the contigs over 1.5 kbps have sequence word patterns
that match known Synechococcus viruses, which raises doubts about whether the others
actually infect Synechococcus. The scores for the contigs based on RF can also be
used to prioritize the contigs for infection.
Finally, as viruses infecting their hosts can lead to changes in the metabolic
rates, cell fates and functions of some of the host genes, and can therefore impact
the whole community, it is important to study virus-host infections. Our study
not only has the potential to predict novel virus-host associations, but
can also be applied to estimate the fraction of true infectious associations in
high-throughput experiments such as viral tagging and SAG (single-cell amplified
genomes).
Our study also has some limitations. First, the machine learning methods
depend on a relatively large number of viruses infecting particular host genera. To
meet this requirement, we studied only nine hosts. Many bacteria are available,
and most of them do not yet have viruses identified that infect them. Therefore,
machine learning methods cannot be applied to such potential hosts. Second, we
only explored four feature representations of viruses based on word counts. Other
viral sequence representations may be more effective in the identification of viruses
infecting particular hosts; for example, we can allow some mismatches or gaps in
particular words for word counting [70, 71]. Finally, there are many variations of
a particular machine learning method and we only implemented one version. For
example, we only used the RBF kernel for SVM; other kernels, such as polynomial
or exponential kernels [25], may give better results. These are topics
for future studies. Despite these limitations, we clearly showed that RF can be
used to predict viruses infecting particular hosts with very high accuracy.
4.2 CRC prediction by alignment-free super-
vised learning method
In this project, we explored the potential application of alignment-free methods for
CRC classification from human metagenomic sequencing data based on 3 publicly
available CRC datasets. First of all, we did a preliminary analysis of
several simple clinical indicators, including age, gender and body mass index (BMI),
for the datasets, and the results imply that none of these factors is consistent
enough to be used in predicting CRC versus healthy controls.
Before building the CRC prediction pipeline with supervised learning methods, in
order to eliminate the influence of the human genome on CRC prediction, we first
filtered the metagenomic sequences with a standardized protocol to remove human
DNA from the data. With the purified metagenomic sequences, we applied seven
supervised learning methods, including Logistic Regression, LASSO, Support Vector
Machines with linear, polynomial and RBF kernels, Random Forest and the Multilayer
Perceptron, with four features defined based on word counts, to CRC prediction
from metagenomic sequencing data. All prediction models are built with word lengths
k = 3, k = 6 and k = 9 as well as two different background Markov models, the i.i.d.
model and the 1st-order Markov model. We observe improvement of AUC scores as the
k-mer length increases from k = 3 to k = 6 or k = 9. Though it is hard to single out
the best method or feature, the performances of LASSO and SVM with linear kernel were
mostly among the best-performing methods, and F_1 and F_4 at k = 6 or k = 9
under the i.i.d. model worked consistently well across different datasets and scenarios.
Comparing the 3 different scenarios, the intra-dataset scenario has the best
performance across datasets. Besides, even though the training data for the
combined-datasets scenario is enlarged, we did not observe significant improvement
in AUC scores compared with the cross-datasets scenario.
We then followed a bootstrapping framework of subsampling to investigate
the sufficiency of using a small portion of the metagenomic sequencing data in
CRC prediction. With different sampling sizes, the overall AUC scores based on
the sampled metagenomic sequences are quite close to the AUC scores based on
the whole metagenomic sequences. At the same time, we found that there are
significant improvements from subsampling when the PlosOne dataset is fixed as the
testing data. From this observation, we conclude that, compared with using the
whole genomic sequencing data, the shallow subsamples of the whole
metagenomic sequences are more powerful when predicting CRC based
on supervised learning on metagenomic sequences with missing data.
Lastly, we compared the performance of our alignment-free approach with the
performance of the microbial-species abundance-profiling-oriented approaches. We
found that our alignment-free approach is quite competitive and, in particular, with
the subsampling procedure, our alignment-free approach outperformed the OTU-based
methods significantly for the PlosOne dataset with missing data.
Reference List
[1] Lawrence CM, Menon S, Eilers BJ, Bothner B, Khayat R, Douglas T, et al.
Structural and functional studies of archaeal viruses. Journal of biological
chemistry. 2009;284(19):12599–12603.
[2] Edwards RA, Rohwer F. Viral metagenomics. Nature reviews microbiology.
2005;3(6):504–510.
[3] Minot S, Sinha R, Chen J, Li H, Keilbaugh SA, Wu GD, et al. The human
gut virome: inter-individual variation and dynamic response to diet. Genome
research. 2011;21(10):1616–1625.
[4] Wilson WH, Wommack KE, Wilhelm SW, Weitz JS. Re-examination of
the relationship between marine virus and microbial cell abundances. Nature
microbiology. 2016;1:15024.
[5] Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, et al.
The marine viromes of four oceanic regions. PLoS biology. 2006;4(11):e368.
[6] Brum JR, Ignacio-Espinoza JC, Roux S, Doulcier G, Acinas SG, Alberti A,
et al. Patterns and ecological drivers of ocean viral communities. Science.
2015;348(6237):1261498.
[7] Minot S, Bryson A, Chehoud C, Wu GD, Lewis JD, Bushman FD. Rapid
evolution of the human gut virome. Proceedings of the National Academy of
Sciences. 2013;110(30):12450–12455.
[8] Qin N, Yang F, Li A, Prifti E, Chen Y, Shao L, et al. Alterations of the
human gut microbiome in liver cirrhosis. Nature. 2014;513(7516):59–64.
[9] Cadwell K. The virome in host health and disease. Immunity. 2015;42(5):805–
813.
[10] Norman JM, Handley SA, Baldridge MT, Droit L, Liu CY, Keller BC, et al.
Disease-specific alterations in the enteric virome in inflammatory bowel dis-
ease. Cell. 2015;160(3):447–460.
[11] Reyes A, Haynes M, Hanson N, Angly FE, Heath AC, Rohwer F, et al. Viruses
in the faecal microbiota of monozygotic twins and their mothers. Nature.
2010;466(7304):334–338.
[12] Virgin HW. The virome in mammalian physiology and disease. Cell.
2014;157(1):142–150.
[13] De Paepe M, Leclerc M, Tinsley CR, Petit MA. Bacteriophages: an underes-
timated role in human and animal health? Frontiers in cellular and infection
microbiology. 2014;4:39.
[14] Consortium HMP, et al. A framework for human microbiome research. Nature.
2012;486(7402):215–221.
[15] Consortium HMP, et al. Structure, function and diversity of the healthy
human microbiome. Nature. 2012;486(7402):207–214.
[16] Williamson SJ, Allen LZ, Lorenzi HA, Fadrosh DW, Brami D, Thiagarajan
M, et al. Metagenomic exploration of viruses throughout the Indian Ocean.
PLoS One. 2012;7(10):e42047.
[17] Roux S, Hallam SJ, Woyke T, Sullivan MB. Viral dark matter and virus–
host interactions resolved from publicly available microbial genomes. Elife.
2015;4:e08490.
[18] Ahmed S, Saito A, Suzuki M, Nemoto N, Nishigaki K. Host–parasite relations
ofbacteriaandphagescanbeunveiledbyOligostickiness, ameasureofrelaxed
sequence similarity. Bioinformatics. 2009;25(5):563–570.
[19] Relman DA, Schmidt TM, MacDermott RP, Falkow S. Identification of the
uncultured bacillus of Whipple’s disease. New England journal of medicine.
1992;327(5):293–301.
[20] Darfeuille-Michaud A, Boudeau J, Bulois P, Neut C, Glasser AL, Barnich N,
et al. High prevalence of adherent-invasive Escherichia coli associated with
ileal mucosa in Crohn’s disease. Gastroenterology. 2004;127(2):412–421.
[21] Steidler L, Hans W, Schotte L, Neirynck S, Obermeier F, Falk W, et al.
Treatment of murine colitis by Lactococcus lactis secreting interleukin-10.
Science. 2000;289(5483):1352–1355.
[22] Feasey NA, Dougan G, Kingsley RA, Heyderman RS, Gordon MA. Invasive
non-typhoidal salmonella disease: an emerging and neglected tropical disease
in Africa. The Lancet. 2012;379(9835):2489–2499.
[23] Jarraud S, Mougel C, Thioulouse J, Lina G, Meugnier H, Forey F, et al.
Relationships between Staphylococcus aureus genetic background, virulence
factors, agr groups (alleles), and human disease. Infection and immunity.
2002;70(2):631–641.
[24] Blake PA, Merson MH, Weaver RE, Hollis DG, Heublein PC. Disease caused
by a marine vibrio: clinical characteristics and epidemiology. New England
journal of medicine. 1979;300(1):1–5.
[25] Friedman J, Hastie T, Tibshirani R. The elements of statistical learning.
Springer Series in Statistics Springer, Berlin;.
[26] Deng L, Ignacio-Espinoza JC, Gregory AC, Poulos BT, Weitz JS, Hugenholtz
P, et al. Viral tagging reveals discrete populations in Synechococcus viral
genome sequence space. Nature. 2014;513(7517):242–245.
[27] Deng L, Gregory A, Yilmaz S, Poulos BT, Hugenholtz P, Sullivan MB. Con-
trasting life strategies of viruses that infect photo-and heterotrophic bacteria,
as revealed by viral tagging. mBio. 2012;3(6):e00373–12.
[28] Nurk S, Meleshko D, Korobeynikov A, Pevzner P. metaSPAdes: a new versa-
tile de novo metagenomics assembler. arXiv preprint arXiv:160403071. 2016;.
[29] Ren J, Song K, Deng M, Reinert G, Cannon CH, Sun F. Inference of Marko-
vian properties of molecular sequences from NGS data and applications to
comparative genomics. Bioinformatics. 2016;32(7):993–1000.
[30] Waterman MS. Introduction to computational biology: maps, sequences and
genomes. CRC Press;.
[31] Wan L, Reinert G, Sun F, Waterman MS. Alignment-free sequence comparison
(II): theoretical power of comparison statistics. Journal of computational
biology. 2010;17(11):1467–1490.
[32] Reinert G, Chew D, Sun F, Waterman MS. Alignment-free sequence
comparison (I): statistics and power. Journal of computational biology.
2009;16(12):1615–1634.
[33] Qi J, Luo H, Hao B. CVTree: a phylogenetic tree reconstruction tool based
on whole genomes. Nucleic acids research. 2004;32(suppl 2):W45–W47.
[34] Hosmer Jr DW, Lemeshow S. Applied logistic regression. John Wiley & Sons;.
[35] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the
royal statistical society: series B (methodological). 1996;58(1):267–288.
[36] Buitinck L, Louppe G, Blondel M, Pedregosa F, Mueller A, Grisel O, et al.
API design for machine learning software: experiences from the scikit-learn
project. arXiv preprint arXiv:13090238. 2013;.
[37] Cristianini N, Shawe-Taylor J. An introduction to support vector machines
and other kernel-based learning methods. Cambridge University Press;.
[38] Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM
Transactions on intelligent systems and technology. 2011;2(3):27.
[39] Breiman L. Random forests. Machine learning. 2001;45(1):5–32.
[40] Rish I. An empirical study of the naive Bayes classifier. In: IJCAI 2001
workshop on empirical methods in artificial intelligence. vol. 3. IBM New
York; 2001. p. 41–46.
[41] Russell SJ, Norvig P, Canny JF, Malik JM, Edwards DD. Artificial intelli-
gence: a modern approach. Prentice Hall Upper Saddle River;.
[42] Hanley JA, McNeil BJ. The meaning and use of the area under a receiver
operating characteristic (ROC) curve. Radiology. 1982;143(1):29–36.
[43] Sullivan MB, Waterbury JB, Chisholm SW. Cyanophages infecting the
oceanic cyanobacterium Prochlorococcus. Nature. 2003;424(6952):1047–1051.
[44] Jenkins C, Hayes P. Diversity of cyanophages infecting the heterocys-
tous filamentous cyanobacterium Nodularia isolated from the brackish Baltic
Sea. Journal of the marine biological association of the United Kingdom.
2006;86(03):529–536.
[45] Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global
cancer statistics 2018: GLOBOCAN estimates of incidence and mortality
worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians.
2018;68(6):394–424.
[46] Ferlay J, Soerjomataram I, Dikshit R, Eser S, Mathers C, Rebelo M,
et al. Cancer incidence and mortality worldwide: sources, methods and
major patterns in GLOBOCAN 2012. International journal of cancer.
2015;136(5):E359–E386.
[47] Siegel RL, Miller KD, Jemal A. Cancer statistics, 2018. CA: A cancer journal for
clinicians. 2018;68:7–30.
[48] Lichtenstein P, Holm NV, Verkasalo PK, Iliadou A, Kaprio J, Koskenvuo M,
et al. Environmental and heritable factors in the causation of cancer—analyses of cohorts of twins from Sweden, Denmark, and Finland. New England journal
of medicine. 2000;343(2):78–85.
[49] Wu S, Rhee KJ, Albesiano E, Rabizadeh S, Wu X, Yen HR, et al. A human
colonic commensal promotes colon tumorigenesis via activation of T helper
type 17 T cell responses. Nature medicine. 2009;15(9):1016.
[50] Goodwin AC, Shields CED, Wu S, Huso DL, Wu X, Murray-Stewart TR,
etal. PolyaminecatabolismcontributestoenterotoxigenicBacteroidesfragilis-
induced colon tumorigenesis. Proceedings of the National Academy of Sci-
ences. 2011;108(37):15354–15359.
[51] Kostic AD, Chun E, Robertson L, Glickman JN, Gallini CA, Michaud M, et al.
Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates
the tumor-immune microenvironment. Cell host & microbe. 2013;14(2):207–
215.
[52] Rubinstein MR, Wang X, Liu W, Hao Y, Cai G, Han YW. Fusobacterium
nucleatum promotes colorectal carcinogenesis by modulating E-cadherin/β-
catenin signaling via its FadA adhesin. Cell host & microbe. 2013;14(2):195–
206.
[53] Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, et al. A
human gut microbial gene catalogue established by metagenomic sequencing.
Nature. 2010;464(7285):59.
[54] Consortium T, et al. Structure, function and diversity of the healthy human
microbiome. Nature. 2012;486(7402):207–214.
[55] Zeller G, Tap J, Voigt AY, Sunagawa S, Kultima JR, Costea PI, et al. Potential
of fecal microbiota for early-stage detection of colorectal cancer. Molecular
systems biology. 2014;10(11):766.
[56] Feng Q, Liang S, Jia H, Stadlmayr A, Tang L, Lan Z, et al. Gut microbiome
development along the colorectal adenoma–carcinoma sequence. Nature com-
munications. 2015;6:6528.
[57] Yu J, Feng Q, Wong SH, Zhang D, yi Liang Q, Qin Y, et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers
for colorectal cancer. Gut. 2017;66(1):70–78.
[58] Wirbel J, Pyl PT, Kartal E, Zych K, Kashani A, Milanese A, et al. Meta-
analysis of fecal metagenomes reveals global microbial signatures that are
specific for colorectal cancer. Nature medicine. 2019;p. 1.
[59] Thomas AM, Manghi P, Asnicar F, Pasolli E, Armanini F, Zolfo M, et al.
Metagenomic analysis of colorectal cancer datasets identifies cross-cohort
microbial diagnostic signatures and a link with choline degradation. Nature
medicine. 2019;p. 1.
[60] Vogtmann E, Hua X, Zeller G, Sunagawa S, Voigt AY, Hercog R, et al. Col-
orectal cancer and the human gut microbiome: reproducibility with whole-
genome shotgun sequencing. PloS one. 2016;11(5):e0155362.
[61] Casella G, Berger RL. Statistical inference. vol. 2. Duxbury Pacific Grove,
CA; 2002.
[62] Benesty J, Chen J, Huang Y, Cohen I. Pearson correlation coefficient. In:
Noise reduction in speech processing. Springer; 2009. p. 1–4.
[63] Vincent C, Mehrotra S, Loo VG, Dewar K, Manges AR. Excretion of host
DNA in feces is associated with risk of Clostridium difficile infection. Journal
of immunology research. 2015;2015.
[64] Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature
methods. 2012;9(4):357.
[65] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al.
The sequence alignment/map format and SAMtools. Bioinformatics.
2009;25(16):2078–2079.
[66] Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing
genomic features. Bioinformatics. 2010;26(6):841–842.
[67] Pal SK, Mitra S. Multilayer perceptron, fuzzy sets, and classification. IEEE
Transactions on neural networks. 1992;3(5):683–697.
[68] Asgari E, Garakani K, McHardy AC, Mofrad MR. MicroPheno: Predict-
ing environments and host phenotypes from 16S rRNA gene sequencing
using a k-mer based representation of shallow sub-samples. Bioinformatics.
2018;34(13):i32–i42.
[69] LaPierre N, Ju CJT, Zhou G, Wang W. MetaPheno: A critical evaluation of
deep learning and machine learning in metagenome-based disease prediction.
Methods. 2019;.
[70] Leimeister CA, Boden M, Horwege S, Lindner S, Morgenstern B. Fast
alignment-free sequence comparison using spaced-word frequencies. Bioin-
formatics. 2014;30(14):1991–1999.
[71] Göke J, Schulz MH, Lasserre J, Vingron M. Estimation of pairwise sequence
similarity of mammalian enhancers with word neighbourhood counts. Bioin-
formatics. 2012;28(5):656–663.
[72] Kullback S, Leibler RA. On information and sufficiency. The annals of math-
ematical statistics. 1951;22(1):79–86.
[73] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O,
et al. Scikit-learn: Machine learning in Python. Journal of machine learning
research. 2011;12(Oct):2825–2830.
[74] Gulli A, Pal S. Deep Learning with Keras. Packt Publishing Ltd; 2017.
Appendix A
Supplementary Materials for
Chapter 2
A.1 Supplementary Methods
A.1.1 AUC score
AUC is the abbreviation for the "Area Under the Curve" of the ROC (receiver operating characteristic) curve, which illustrates the performance of a binary classifier under a varying discrimination threshold. The ROC curve is created by plotting the true positive rate (along the vertical axis) against the false positive rate (along the horizontal axis) as the threshold changes. A larger AUC score indicates better performance of the binary classifier.
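As an illustration only (not part of the dissertation pipeline; the labels and scores below are made up), the following minimal Python sketch contrasts the rank-based definition of the AUC with scikit-learn's roc_auc_score:

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical binary labels and classifier scores, for illustration only.
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])

# The AUC equals the probability that a randomly chosen positive is scored
# above a randomly chosen negative (ties count one half), i.e. a normalized
# Mann-Whitney U statistic.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
auc_by_rank = np.mean([float(p > n) + 0.5 * float(p == n) for p in pos for n in neg])

print(auc_by_rank, roc_auc_score(y_true, y_score))  # the two values agree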
A.1.2 Gradient boost
The gradient boosting problem can be expressed as finding

f^* = \arg\min_{f} E_{x,y}\left[ l(y, f(x)) \right].

The loss function l(y, f^*(x)) we used is the least-squares sum. Gradient boosting starts with a constant value f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} l(y_i, \gamma), where n is the sample size. Then, iteratively, with the steepest descent step,

f_m(x) = f_{m-1}(x) - \frac{\gamma}{n} \sum_{i=1}^{n} \nabla_{f}\, l\big(y_i, f_{m-1}(x_i)\big).

After the predicting function f^*(x) is identified, the prediction score is given by \hat{y} = f^*(x).
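A minimal Python sketch of this procedure on made-up data follows; the step size gamma, tree depth and number of iterations are illustrative choices rather than the settings used in this study. Each step fits a shallow regression tree to the residuals, which are the negative gradient of the squared-error loss.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                              # hypothetical feature vectors
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

gamma, n_steps = 0.1, 100                                  # illustrative step size and iterations
pred = np.full_like(y, y.mean())                           # f_0(x): best constant under squared loss
for _ in range(n_steps):
    residual = y - pred                                    # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred = pred + gamma * tree.predict(X)                  # steepest-descent style update of f_m

print("training MSE:", np.mean((y - pred) ** 2))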
A.1.3 Random forest
Decision tree
Decision Tree is a non-parametric supervised learning method commonly used for classification and regression. From top to bottom, the decision tree has decision nodes and leaf nodes: at each decision node there are two or more branches corresponding to the different outcomes of a decision, and each leaf node of the tree represents a final classification result for the data point. In our model, each decision node tests whether the feature vector satisfies a certain threshold and makes a branching decision based on the outcome of the test, and the leaf node represents a classification label for the feature vector.
Bootstrap aggregating
Let \{(x_i, y_i), i = 1, 2, \cdots, n\} be the original training dataset, where the x_i's are the feature vectors and the y_i's are the class labels. Bootstrap aggregating (also called tree bagging) selects n random samples from the original training data \{(x_i, y_i), i = 1, 2, \cdots, n\} with replacement and fits a decision tree on the selected random sample. The procedure is repeated B times, and \hat{f}_b(x) (1 \le b \le B) denotes the decision function of the decision tree trained from the b-th random sample. Given the feature vector x of a test data point, the prediction is the vote over the decisions of all B decision trees:

\hat{f}(x) = \sum_{b=1}^{B} \hat{f}_b(x).

Bootstrap sampling is a good way to decrease the variance of the decision tree without adding bias to the model.
Difference between random forest and the bootstrap aggregating
Different from bootstrap aggregating, in each decision tree trained from the data sampled with replacement, the random forest only uses a randomly selected subset of the features instead of all of the features as in bootstrap aggregating. In this way, some features acting as strong predictors are de-correlated.
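The following minimal sketch on made-up data (not the dissertation pipeline) contrasts the two ideas: a hand-rolled bootstrap-aggregating loop in which every tree may use all features, versus scikit-learn's RandomForestClassifier, which additionally restricts each split to a random feature subset through max_features.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))                    # hypothetical feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)
B, n = 50, len(y)

# Bootstrap aggregating: each tree is fit on n points drawn with replacement.
votes = np.zeros(n)
for _ in range(B):
    idx = rng.integers(0, n, size=n)              # bootstrap resample of the training data
    votes += DecisionTreeClassifier().fit(X[idx], y[idx]).predict(X)
bagging_pred = (votes / B > 0.5).astype(int)      # majority vote over the B trees

# Random forest: same resampling, plus a random feature subset at every split.
rf = RandomForestClassifier(n_estimators=B, max_features="sqrt").fit(X, y)
print((bagging_pred == y).mean(), rf.score(X, y))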
A.1.4 Naive Bayes
Denote by Y the binary random variable of the virus label and by x the feature vector. Then, by Bayes' theorem,

p(Y \mid x) = \frac{p(Y)\, p(x \mid Y)}{p(x)} \;\Longrightarrow\;
p(Y = 1 \mid x) = \alpha \cdot p(Y = 1)\, p(x \mid Y = 1), \qquad
p(Y = 0 \mid x) = \alpha \cdot p(Y = 0)\, p(x \mid Y = 0),

where \alpha = 1 / p(x). The prediction score is then given by

\hat{y} = \alpha \cdot p(Y = 1)\, p(x \mid Y = 1) = 1 - \alpha \cdot p(Y = 0)\, p(x \mid Y = 0).

If \hat{y} > 0.5, the virus is classified into the infectious class; otherwise it is classified into the non-infectious class. Depending on the presumed distribution of p(x \mid Y), the naive Bayes approach can be subdivided into Gaussian naive Bayes and Bernoulli naive Bayes.
Gaussian naive Bayes
Let C_1 be the class of viruses with label y = 1 and C_0 be the class of viruses with label y = 0. Let \mu_1 and \Sigma_1 be the mean and covariance matrix of the x_i's \in C_1. Similarly, let \mu_0 and \Sigma_0 be the mean and covariance matrix of the x_j's \in C_0. Then, for any feature vector x from the testing data,

p(x = v \mid C_1) = \frac{1}{\sqrt{(2\pi)^k \, |\Sigma_1|}} \exp\!\left( -\frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) \right),

p(x = v \mid C_0) = \frac{1}{\sqrt{(2\pi)^k \, |\Sigma_0|}} \exp\!\left( -\frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) \right).
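A minimal sketch of these class-conditional densities, using scipy's multivariate normal on made-up feature vectors (the feature dimension and class sizes are illustrative only):

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
X1 = rng.normal(loc=1.0, size=(40, 3))            # hypothetical feature vectors in class C_1
X0 = rng.normal(loc=0.0, size=(60, 3))            # hypothetical feature vectors in class C_0
n1, n0 = len(X1), len(X0)
prior1, prior0 = n1 / (n1 + n0), n0 / (n1 + n0)

mu1, cov1 = X1.mean(axis=0), np.cov(X1, rowvar=False)
mu0, cov0 = X0.mean(axis=0), np.cov(X0, rowvar=False)

x = rng.normal(loc=0.8, size=3)                   # a test feature vector
lik1 = multivariate_normal.pdf(x, mean=mu1, cov=cov1)   # p(x | C_1)
lik0 = multivariate_normal.pdf(x, mean=mu0, cov=cov0)   # p(x | C_0)

# Posterior for class C_1; the factor alpha = 1 / p(x) is handled by normalizing.
post1 = prior1 * lik1 / (prior1 * lik1 + prior0 * lik0)
print(post1)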
Bernoulli naive Bayes
Let x_i = (x_i^1, x_i^2, \cdots, x_i^K), where K is the dimension of the feature vectors. As in Gaussian naive Bayes, let C_1 be the class whose data have label y = 1 and C_0 the class whose data have label y = 0. Then, for the x_i's \in C_1, we can calculate

p_{1k} = \frac{\sum_{i=1}^{N_1} \mathrm{sgn}(x_i^k)}{N_1},

where N_1 is the size of class C_1 and k \in \{1, \cdots, K\}. For the x_j's \in C_0,

p_{0k} = \frac{\sum_{i=1}^{N_0} \mathrm{sgn}(x_i^k)}{N_0},

where N_0 is the size of class C_0. Then, for any feature vector x from the testing data,

p(x \mid C_1) = \prod_{k=1}^{K} p_{1k}^{\mathrm{sgn}(x^k)} (1 - p_{1k})^{1 - \mathrm{sgn}(x^k)},

p(x \mid C_0) = \prod_{k=1}^{K} p_{0k}^{\mathrm{sgn}(x^k)} (1 - p_{0k})^{1 - \mathrm{sgn}(x^k)}.

*Note: \mathrm{sgn}(x) is the sign function with \mathrm{sgn}(x) = 1 if x > 0 and \mathrm{sgn}(x) = 0 if x = 0; in our case, the entries of the feature vectors are all non-negative.
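A minimal sketch of these quantities on made-up presence/absence feature vectors (the feature dimension and class sizes are illustrative only):

import numpy as np

rng = np.random.default_rng(3)
X1 = (rng.random((30, 8)) > 0.4).astype(float)    # hypothetical class C_1 feature vectors
X0 = (rng.random((50, 8)) > 0.7).astype(float)    # hypothetical class C_0 feature vectors

def sgn(a):
    return (a > 0).astype(float)                  # entries are non-negative, as in the text

p1k = sgn(X1).mean(axis=0)                        # p_{1k}, k = 1, ..., K
p0k = sgn(X0).mean(axis=0)                        # p_{0k}

def bernoulli_likelihood(x, p):
    s = sgn(x)
    return np.prod(p ** s * (1.0 - p) ** (1.0 - s))

x = (rng.random(8) > 0.5).astype(float)           # a test feature vector
print(bernoulli_likelihood(x, p1k), bernoulli_likelihood(x, p0k))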
A.2 Implementation
When comparing feature-method combinations, for all 9 different hosts, the negative training data and negative testing data were chosen randomly 50 times. The positive training data and positive testing data are the same for the 50 runs of the pipeline. For each of the 50 datasets, we trained the model purely based on the training data, and then tested the performance of the method on the testing data.
When applying our method to the T4-like / non-T4-like viruses with host Synechococcus, we also repeated our procedure 50 times with different negative training data. The confidence intervals were calculated based on the results from the 50 runs of the procedure.
The construction of the feature vectors with different word lengths and different sequence background models was implemented in C++. The supervised learning methods were implemented with the Python package scikit-learn.
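The following schematic Python sketch conveys the shape of that repeated evaluation rather than the actual pipeline: the feature matrices are random placeholders, and the classifier and set sizes are arbitrary illustrative choices.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
pos_train = rng.normal(1.0, size=(40, 64))        # fixed positive training features (placeholder)
pos_test = rng.normal(1.0, size=(10, 64))         # fixed positive testing features (placeholder)
neg_pool = rng.normal(0.0, size=(500, 64))        # pool of candidate negative viruses (placeholder)

aucs = []
for _ in range(50):                               # 50 runs with redrawn negative sets
    idx = rng.permutation(len(neg_pool))
    neg_train, neg_test = neg_pool[idx[:40]], neg_pool[idx[40:50]]
    X_train = np.vstack([pos_train, neg_train])
    y_train = np.r_[np.ones(len(pos_train)), np.zeros(len(neg_train))]
    X_test = np.vstack([pos_test, neg_test])
    y_test = np.r_[np.ones(len(pos_test)), np.zeros(len(neg_test))]
    clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    aucs.append(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

aucs = np.array(aucs)
print(aucs.mean(), np.percentile(aucs, [2.5, 97.5]))   # mean AUC and a simple interval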
A.3 Supplementary Tables
Table A.1: Number of viral genomes found to have infected the corresponding host bacterium genus up to the respective year. The NCBI phage genome database currently contains 1,426 completely sequenced viral genomes with precisely identified hosts. Among all the bacteria at the genus level, we focus on 9 bacterial genera with at least 45 identified infectious viruses, and together they have been identified as the hosts of 836 out of the 1,426 complete viral genomes.
Years
Bacterial genera 2010 2011 2012 2013 2014 2015
Bacillus 18 20 31 53 62 62
Escherichia 86 97 141 145 173 173
Lactococcus 48 48 48 49 55 55
Mycobacterium 34 38 89 172 215 218
Pseudomonas 34 39 57 68 96 96
Salmonella 16 22 32 47 52 54
Staphylococcus 21 24 43 52 62 63
Synechococcus 14 26 30 42 47 47
Vibrio 25 32 39 58 67 68
Table A.2: AUC scores of unsupervised learning based on the average Manhattan distance and d_2^* dissimilarity of viruses in the testing data with the viruses in the positive training data. Three different word lengths were used: k = 4, k = 6, and k = 8. Four different background models were applied for the d_2^* dissimilarity.
k = 4            Manhattan    d_2^* (i.i.d.)    d_2^* (1st-mc)    d_2^* (2nd-mc)    d_2^* (3rd-mc)
Bacillus 0.826 0.741 0.867 0.778 0.780
Escherichia 0.828 0.772 0.913 0.860 0.602
Lactococcus 0.844 0.778 0.697 0.533 0.794
Mycobacterium 0.977 0.975 0.938 0.962 0.856
Pseudomonas 0.941 0.921 0.941 0.910 0.936
Salmonella 0.818 0.762 0.894 0.868 0.818
Staphylococcus 0.954 0.930 0.861 0.945 0.896
Synechococcus 0.919 0.890 0.940 0.946 0.696
Vibrio 0.852 0.715 0.807 0.754 0.696
k = 6            Manhattan    d_2^* (i.i.d.)    d_2^* (1st-mc)    d_2^* (2nd-mc)    d_2^* (3rd-mc)
Bacillus 0.829 0.752 0.873 0.851 0.904
Escherichia 0.880 0.833 0.958 0.945 0.939
Lactococcus 0.767 0.775 0.828 0.750 0.836
Mycobacterium 0.976 0.977 0.966 0.984 0.985
Pseudomonas 0.951 0.934 0.974 0.970 0.990
Salmonella 0.837 0.818 0.900 0.900 0.828
Staphylococcus 0.964 0.941 0.947 0.974 0.983
Synechococcus 0.929 0.906 0.994 0.993 0.991
Vibrio 0.841 0.733 0.854 0.817 0.853
k = 8            Manhattan    d_2^* (i.i.d.)    d_2^* (1st-mc)    d_2^* (2nd-mc)    d_2^* (3rd-mc)
Bacillus 0.812 0.774 0.858 0.848 0.873
Escherichia 0.873 0.838 0.957 0.936 0.945
Lactococcus 0.861 0.761 0.872 0.828 0.814
Mycobacterium 0.986 0.982 0.984 0.983 0.992
Pseudomonas 0.961 0.938 0.981 0.982 0.989
Salmonella 0.770 0.822 0.887 0.904 0.850
Staphylococcus 0.946 0.928 0.963 0.967 0.977
Synechococcus 0.896 0.881 0.972 0.975 0.984
Vibrio 0.738 0.694 0.803 0.792 0.825
Table A.3: The average AUC scores for all the feature-method combinations. Three different word lengths (4, 6, 8) and four different background models (i.i.d. model, 1st-, 2nd- and 3rd-order Markov chains) were used. For each feature-method combination trained purely on the training data, we applied it to the testing data and calculated the AUC score. We repeated our pipeline 50 times with different negative training data and negative testing data each time, and then calculated the average AUC score.
Host: Bacillus            i.i.d. model                 1st-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.845 0.857 0.788 0.845 0.857 0.788
GBR 0.829 0.810 0.710 0.829 0.810 0.710
SVM 0.856 0.864 0.823 0.856 0.864 0.823
Random forest 0.856 0.864 0.823 0.856 0.864 0.823
Gaussian NB 0.811 0.770 0.731 0.811 0.770 0.731
Bernoulli NB 0.689 0.787 0.782 0.689 0.787 0.782
F_2
LASSO 0.845 0.857 0.782 0.861 0.847 0.729
GBR 0.828 0.807 0.707 0.776 0.787 0.639
SVM 0.777 0.837 0.723 0.798 0.823 0.671
Random forest 0.855 0.862 0.821 0.872 0.881 0.850
Gaussian NB 0.811 0.723 0.500 0.768 0.766 0.498
Bernoulli NB 0.784 0.771 0.707 0.803 0.808 0.663
F_3
LASSO 0.851 0.887 0.794 0.863 0.862 0.837
GBR 0.763 0.777 0.750 0.758 0.788 0.742
SVM 0.547 0.563 0.703 0.563 0.570 0.615
Random forest 0.843 0.861 0.805 0.866 0.874 0.807
Gaussian NB 0.758 0.740 0.713 0.740 0.776 0.640
Bernoulli NB 0.784 0.771 0.707 0.803 0.808 0.663
F_4
LASSO 0.834 0.878 0.798 0.840 0.864 0.866
GBR 0.809 0.822 0.776 0.809 0.755 0.769
SVM 0.823 0.845 0.790 0.786 0.766 0.618
Random forest 0.852 0.863 0.803 0.886 0.879 0.834
Gaussian NB 0.757 0.700 0.627 0.769 0.690 0.613
Bernoulli NB 0.784 0.771 0.707 0.803 0.808 0.663
Host: Bacillus            2nd-order mc                 3rd-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.845 0.857 0.788 0.845 0.857 0.788
GBR 0.829 0.810 0.710 0.829 0.810 0.710
SVM 0.856 0.864 0.823 0.856 0.864 0.823
Random forest 0.856 0.864 0.823 0.856 0.864 0.823
Gaussian NB 0.811 0.770 0.731 0.811 0.770 0.731
Bernoulli NB 0.689 0.787 0.782 0.689 0.787 0.782
F_2
LASSO 0.828 0.773 0.630 0.743 0.780 0.700
GBR 0.754 0.785 0.678 0.706 0.731 0.679
SVM 0.618 0.842 0.623 0.537 0.595 0.521
Random forest 0.901 0.893 0.840 0.864 0.843 0.835
Gaussian NB 0.798 0.803 0.492 0.505 0.698 0.490
Bernoulli NB 0.774 0.768 0.689 0.644 0.698 0.673
F_3
LASSO 0.808 0.839 0.805 0.676 0.800 0.783
GBR 0.698 0.758 0.734 0.785 0.751 0.756
SVM 0.595 0.574 0.560 0.565 0.640 0.547
Random forest 0.878 0.871 0.802 0.903 0.861 0.807
Gaussian NB 0.748 0.755 0.590 0.497 0.665 0.536
Bernoulli NB 0.774 0.768 0.689 0.644 0.698 0.673
F_4
LASSO 0.751 0.818 0.879 0.743 0.813 0.864
GBR 0.706 0.789 0.707 0.840 0.789 0.732
SVM 0.708 0.742 0.557 0.755 0.673 0.511
Random forest 0.872 0.859 0.821 0.912 0.851 0.816
Gaussian NB 0.690 0.668 0.605 0.498 0.660 0.584
Bernoulli NB 0.774 0.768 0.689 0.644 0.695 0.673
Host: Escherichia         i.i.d. model                 1st-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.798 0.740 0.715 0.798 0.740 0.715
GBR 0.775 0.721 0.704 0.775 0.721 0.704
SVM 0.652 0.602 0.500 0.652 0.602 0.500
Random forest 0.878 0.858 0.807 0.878 0.858 0.807
Gaussian NB 0.780 0.825 0.491 0.780 0.825 0.491
Bernoulli NB 0.602 0.463 0.395 0.602 0.463 0.395
F_2
LASSO 0.798 0.740 0.735 0.578 0.774 0.693
GBR 0.786 0.725 0.682 0.782 0.701 0.652
SVM 0.861 0.911 0.800 0.780 0.913 0.809
Random forest 0.882 0.857 0.807 0.900 0.845 0.776
Gaussian NB 0.780 0.808 0.525 0.755 0.514 0.559
Bernoulli NB 0.705 0.714 0.619 0.844 0.725 0.544
F_3
LASSO 0.717 0.735 0.717 0.560 0.721 0.690
GBR 0.803 0.758 0.660 0.780 0.689 0.642
SVM 0.789 0.755 0.775 0.773 0.769 0.797
Random forest 0.877 0.859 0.809 0.885 0.856 0.8081
Gaussian NB 0.592 0.605 0.752 0.617 0.753 0.639
Bernoulli NB 0.705 0.714 0.619 0.844 0.725 0.544
F_4
LASSO 0.824 0.756 0.717 0.788 0.734 0.736
GBR 0.792 0.777 0.698 0.845 0.753 0.710
SVM 0.910 0.902 0.850 0.858 0.708 0.500
Random forest 0.874 0.863 0.826 0.904 0.860 0.830
Gaussian NB 0.778 0.817 0.777 0.858 0.869 0.703
Bernoulli NB 0.705 0.714 0.619 0.844 0.725 0.544
Host: Escherichia         2nd-order mc                 3rd-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.798 0.740 0.715 0.798 0.740 0.715
GBR 0.775 0.721 0.704 0.775 0.721 0.704
SVM 0.652 0.602 0.500 0.652 0.602 0.500
Random forest 0.878 0.858 0.807 0.878 0.858 0.807
Gaussian NB 0.780 0.825 0.491 0.780 0.825 0.491
Bernoulli NB 0.602 0.463 0.395 0.602 0.463 0.395
F_2
LASSO 0.601 0.769 0.654 0.633 0.773 0.656
GBR 0.759 0.709 0.662 0.779 0.709 0.652
SVM 0.608 0.914 0.775 0.484 0.861 0.728
Random forest 0.848 0.795 0.758 0.839 0.754 0.754
Gaussian NB 0.797 0.544 0.744 0.452 0.467 0.725
Bernoulli NB 0.633 0.584 0.458 0.606 0.592 0.455
F_3
LASSO 0.596 0.769 0.712 0.728 0.735 0.684
GBR 0.731 0.683 0.654 0.798 0.679 0.644
SVM 0.811 0.799 0.795 0.464 0.827 0.753
Random forest 0.840 0.803 0.775 0.864 0.753 0.772
Gaussian NB 0.642 0.703 0.573 0.461 0.717 0.573
Bernoulli NB 0.633 0.584 0.458 0.606 0.592 0.455
F_4
LASSO 0.749 0.764 0.714 0.854 0.738 0.724
GBR 0.759 0.713 0.639 0.912 0.708 0.668
SVM 0.627 0.548 0.500 0.686 0.500 0.500
Random forest 0.837 0.815 0.821 0.923 0.790 0.816
Gaussian NB 0.717 0.808 0.678 0.483 0.795 0.667
Bernoulli NB 0.633 0.584 0.458 0.603 0.588 0.455
Host: Lactococcus         i.i.d. model                 1st-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.939 0.989 0.583 0.939 0.989 0.583
GBR 0.933 0.822 0.786 0.933 0.822 0.786
SVM 0.692 0.692 0.742 0.692 0.692 0.742
Random forest 0.972 1.00 0.988 0.972 1.00 0.988
Gaussian NB 0.642 0.825 0.692 0.642 0.825 0.692
Bernoulli NB 0.525 0.767 0.633 0.525 0.767 0.633
F_2
LASSO 0.939 0.989 0.581 0.914 0.581 0.792
GBR 0.933 0.811 0.794 0.800 0.744 0.654
SVM 0.883 0.808 0.583 0.583 0.667 0.500
Random forest 0.975 1.000 0.981 1.000 0.928 0.933
Gaussian NB 0.642 0.717 0.500 0.583 0.583 0.500
Bernoulli NB 0.792 0.767 0.658 0.583 0.500 0.583
F_3
LASSO 0.864 0.936 0.519 0.942 0.619 0.633
GBR 0.825 0.594 0.939 0.958 0.711 0.731
SVM 0.500 0.500 0.500 0.500 0.500 0.500
Random forest 1.000 0.938 0.942 0.974 0.900 0.967
Gaussian NB 0.492 0.467 0.500 0.500 0.500 0.500
Bernoulli NB 0.792 0.767 0.658 0.583 0.500 0.583
F_4
LASSO 0.964 0.989 0.492 0.858 0.708 0.678
GBR 0.956 0.775 0.608 0.686 0.689 0.925
SVM 0.717 0.775 0.583 0.575 0.583 0.500
Random forest 0.978 0.969 0.908 0.989 0.972 0.969
Gaussian NB 0.875 0.825 0.500 0.500 0.608 0.500
Bernoulli NB 0.792 0.767 0.658 0.583 0.500 0.583
Host: Lactococcus         2nd-order mc                 3rd-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.939 0.989 0.583 0.939 0.989 0.583
GBR 0.933 0.822 0.786 0.933 0.822 0.786
SVM 0.692 0.692 0.742 0.692 0.692 0.742
Random forest 0.972 1.00 0.988 0.972 1.00 0.988
Gaussian NB 0.642 0.825 0.692 0.642 0.825 0.692
Bernoulli NB 0.525 0.767 0.633 0.525 0.767 0.633
F_2
LASSO 0.925 0.683 0.672 0.714 0.597 0.744
GBR 0.739 0.631 0.772 1.000 0.842 0.847
SVM 0.667 0.583 0.500 0.550 0.558 0.500
Random forest 0.942 0.914 0.936 0.994 0.897 0.894
Gaussian NB 0.583 0.608 0.500 0.775 0.642 0.500
Bernoulli NB 0.583 0.500 0.583 1.000 0.500 0.583
F_3
LASSO 0.911 0.792 0.806 0.661 0.572 0.706
GBR 0.893 0.597 0.944 1.000 0.790 0.792
SVM 0.500 0.500 0.500 0.575 0.583 0.500
Random forest 0.906 0.913 0.967 1.000 0.911 0.967
Gaussian NB 0.467 0.500 0.500 0.750 0.500 0.500
Bernoulli NB 0.583 0.500 0.583 1.000 0.500 0.583
F_4
LASSO 0.869 0.839 0.569 0.892 0.581 0.414
GBR 0.917 0.675 0.797 1.000 0.692 0.833
SVM 0.650 0.525 0.500 0.600 0.500 0.500
Random forest 0.903 0.938 0.986 1.000 0.917 0.979
Gaussian NB 0.583 0.608 0.500 0.667 0.658 0.500
Bernoulli NB 0.583 0.500 0.583 1.000 0.500 0.583
Host: Mycobacterium       i.i.d. model                 1st-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.937 0.926 0.912 0.937 0.926 0.912
GBR 0.966 0.950 0.948 0.966 0.950 0.948
SVM 0.802 0.757 0.715 0.802 0.757 0.715
Random forest 0.987 0.985 0.984 0.987 0.985 0.984
Gaussian NB 0.937 0.933 0.804 0.937 0.933 0.804
Bernoulli NB 0.613 0.920 0.926 0.613 0.920 0.926
F_2
LASSO 0.936 0.926 0.878 0.795 0.890 0.866
GBR 0.967 0.948 0.943 0.950 0.949 0.943
SVM 0.941 0.980 0.924 0.939 0.980 0.774
Random forest 0.988 0.985 0.983 0.982 0.978 0.984
Gaussian NB 0.937 0.946 0.680 0.935 0.941 0.683
Bernoulli NB 0.920 0.911 0.920 0.920 0.937 0.924
F_3
LASSO 0.960 0.924 0.852 0.822 0.901 0.865
GBR 0.973 0.964 0.945 0.949 0.936 0.936
SVM 0.846 0.870 0.946 0.872 0.894 0.961
Random forest 0.993 0.992 0.989 0.986 0.983 0.981
Gaussian NB 0.952 0.950 0.946 0.935 0.959 0.859
Bernoulli NB 0.920 0.911 0.920 0.920 0.937 0.924
F_4
LASSO 0.936 0.908 0.849 0.924 0.877 0.824
GBR 0.965 0.960 0.946 0.947 0.933 0.955
SVM 0.961 0.989 0.983 0.891 0.904 0.946
Random forest 0.992 0.989 0.986 0.986 0.985 0.982
Gaussian NB 0.944 0.939 0.939 0.944 0.944 0.941
Bernoulli NB 0.920 0.911 0.920 0.920 0.937 0.924
Host: Mycobacterium       2nd-order mc                 3rd-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.937 0.926 0.912 0.937 0.926 0.912
GBR 0.966 0.950 0.948 0.966 0.950 0.948
SVM 0.802 0.757 0.715 0.802 0.757 0.715
Random forest 0.987 0.985 0.984 0.987 0.985 0.984
Gaussian NB 0.937 0.933 0.804 0.937 0.933 0.804
Bernoulli NB 0.613 0.920 0.926 0.613 0.920 0.926
F_2
LASSO 0.793 0.903 0.809 0.927 0.895 0.790
GBR 0.912 0.927 0.936 0.954 0.925 0.906
SVM 0.915 0.907 0.654 0.530 0.944 0.628
Random forest 0.982 0.978 0.984 0.985 0.975 0.984
Gaussian NB 0.944 0.944 0.683 0.715 0.839 0.704
Bernoulli NB 0.950 0.952 0.922 0.874 0.950 0.920
F_3
LASSO 0.820 0.888 0.827 0.969 0.851 0.822
GBR 0.932 0.924 0.943 0.950 0.921 0.905
SVM 0.922 0.928 0.937 0.746 0.967 0.952
Random forest 0.985 0.984 0.983 0.986 0.983 0.981
Gaussian NB 0.957 0.963 0.759 0.689 0.948 0.757
Bernoulli NB 0.950 0.952 0.922 0.874 0.950 0.920
F_4
LASSO 0.927 0.868 0.820 0.976 0.869 0.803
GBR 0.943 0.944 0.935 0.954 0.942 0.970
SVM 0.935 0.928 0.952 0.800 0.924 0.941
Random forest 0.988 0.984 0.983 0.965 0.982 0.982
Gaussian NB 0.948 0.950 0.946 0.694 0.950 0.946
Bernoulli NB 0.950 0.952 0.922 0.874 0.952 0.917
Host: Pseudomonas         i.i.d. model                 1st-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.836 0.905 0.889 0.836 0.905 0.889
GBR 0.946 0.908 0.891 0.946 0.908 0.891
SVM 0.673 0.695 0.738 0.673 0.695 0.738
Random forest 0.978 0.981 0.967 0.978 0.981 0.967
Gaussian NB 0.838 0.854 0.857 0.838 0.854 0.857
Bernoulli NB 0.595 0.789 0.861 0.595 0.789 0.861
F_2
LASSO 0.836 0.905 0.854 0.828 0.864 0.858
GBR 0.946 0.907 0.889 0.930 0.900 0.858
SVM 0.911 0.959 0.946 0.873 0.963 0.846
Random forest 0.977 0.980 0.968 0.986 0.979 0.961
Gaussian NB 0.838 0.857 0.863 0.923 0.938 0.882
Bernoulli NB 0.848 0.845 0.861 0.941 0.934 0.893
F_3
LASSO 0.868 0.890 0.861 0.829 0.887 0.827
GBR 0.939 0.910 0.823 0.899 0.899 0.866
SVM 0.889 0.905 0.921 0.902 0.921 0.925
Random forest 0.974 0.980 0.959 0.983 0.978 0.971
Gaussian NB 0.805 0.852 0.875 0.934 0.946 0.921
Bernoulli NB 0.848 0.845 0.861 0.941 0.934 0.893
F_4
LASSO 0.851 0.896 0.886 0.792 0.874 0.890
GBR 0.949 0.902 0.840 0.918 0.906 0.806
SVM 0.950 0.964 0.941 0.893 0.918 0.938
Random forest 0.985 0.981 0.969 0.988 0.981 0.974
Gaussian NB 0.854 0.850 0.873 0.938 0.896 0.886
Bernoulli NB 0.848 0.845 0.861 0.941 0.934 0.893
Host: Pseudomonas         2nd-order mc                 3rd-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.836 0.905 0.889 0.836 0.905 0.889
GBR 0.946 0.908 0.891 0.946 0.908 0.891
SVM 0.673 0.695 0.738 0.673 0.695 0.738
Random forest 0.978 0.981 0.967 0.978 0.981 0.967
Gaussian NB 0.838 0.854 0.857 0.838 0.854 0.857
Bernoulli NB 0.595 0.789 0.861 0.595 0.789 0.861
F_2
LASSO 0.826 0.808 0.841 0.725 0.775 0.817
GBR 0.903 0.876 0.842 0.909 0.873 0.810
SVM 0.675 0.921 0.768 0.505 0.941 0.750
Random forest 0.970 0.965 0.963 0.950 0.959 0.958
Gaussian NB 0.929 0.886 0.861 0.586 0.877 0.868
Bernoulli NB 0.893 0.896 0.891 0.809 0.929 0.888
F_3
LASSO 0.844 0.804 0.827 0.741 0.854 0.796
GBR 0.872 0.876 0.842 0.898 0.894 0.847
SVM 0.911 0.938 0.904 0.575 0.923 0.914
Random forest 0.964 0.964 0.965 0.937 0.962 0.964
Gaussian NB 0.934 0.955 0.889 0.552 0.975 0.879
Bernoulli NB 0.893 0.896 0.891 0.809 0.929 0.888
F_4
LASSO 0.678 0.845 0.871 0.735 0.908 0.891
GBR 0.868 0.924 0.896 0.889 0.861 0.874
SVM 0.827 0.909 0.934 0.689 0.932 0.929
Random forest 0.971 0.969 0.967 0.936 0.966 0.966
Gaussian NB 0.886 0.893 0.882 0.557 0.895 0.882
Bernoulli NB 0.893 0.896 0.891 0.796 0.929 0.888
Host: Salmonella          i.i.d. model                 1st-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.803 0.801 0.774 0.803 0.801 0.774
GBR 0.805 0.788 0.718 0.805 0.788 0.718
SVM 0.600 0.584 0.557 0.600 0.584 0.557
Random forest 0.889 0.896 0.891 0.889 0.896 0.891
Gaussian NB 0.757 0.798 0.689 0.757 0.798 0.689
Bernoulli NB 0.568 0.711 0.752 0.568 0.711 0.752
F_2
LASSO 0.803 0.800 0.751 0.816 0.661 0.657
GBR 0.805 0.790 0.709 0.893 0.817 0.735
SVM 0.743 0.784 0.711 0.832 0.864 0.741
Random forest 0.889 0.894 0.891 0.932 0.915 0.912
Gaussian NB 0.757 0.789 0.575 0.818 0.816 0.568
Bernoulli NB 0.784 0.784 0.777 0.850 0.848 0.839
F_3
LASSO 0.774 0.845 0.784 0.780 0.714 0.740
GBR 0.846 0.821 0.764 0.891 0.814 0.752
SVM 0.727 0.773 0.805 0.766 0.782 0.780
Random forest 0.890 0.895 0.880 0.937 0.918 0.897
Gaussian NB 0.755 0.773 0.800 0.800 0.832 0.764
Bernoulli NB 0.784 0.784 0.777 0.850 0.848 0.839
F_4
LASSO 0.831 0.813 0.806 0.800 0.741 0.805
GBR 0.867 0.785 0.774 0.883 0.836 0.769
SVM 0.814 0.818 0.821 0.859 0.850 0.602
Random forest 0.908 0.895 0.897 0.933 0.921 0.904
Gaussian NB 0.807 0.807 0.839 0.811 0.827 0.821
Bernoulli NB 0.784 0.784 0.777 0.850 0.848 0.839
Host: Salmonella          2nd-order mc                 3rd-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.803 0.801 0.774 0.803 0.801 0.774
GBR 0.805 0.788 0.718 0.805 0.788 0.718
SVM 0.600 0.584 0.557 0.600 0.584 0.557
Random forest 0.889 0.896 0.891 0.889 0.896 0.891
Gaussian NB 0.757 0.798 0.689 0.757 0.798 0.689
Bernoulli NB 0.568 0.711 0.752 0.568 0.711 0.752
F_2
LASSO 0.802 0.685 0.636 0.582 0.651 0.664
GBR 0.877 0.785 0.804 0.758 0.638 0.749
SVM 0.632 0.809 0.757 0.525 0.766 0.639
Random forest 0.899 0.900 0.904 0.864 0.888 0.900
Gaussian NB 0.827 0.780 0.568 0.584 0.777 0.573
Bernoulli NB 0.791 0.809 0.777 0.652 0.736 0.768
F_3
LASSO 0.796 0.656 0.696 0.572 0.640 0.773
GBR 0.832 0.731 0.743 0.744 0.627 0.759
SVM 0.782 0.768 0.746 0.511 0.777 0.716
Random forest 0.897 0.905 0.889 0.876 0.880 0.893
Gaussian NB 0.807 0.800 0.741 0.552 0.709 0.698
Bernoulli NB 0.791 0.809 0.777 0.652 0.736 0.768
F_4
LASSO 0.803 0.733 0.786 0.505 0.739 0.789
GBR 0.802 0.810 0.799 0.850 0.692 0.804
SVM 0.843 0.775 0.582 0.643 0.634 0.571
Random forest 0.897 0.897 0.890 0.903 0.894 0.886
Gaussian NB 0.807 0.809 0.768 0.575 0.809 0.741
Bernoulli NB 0.791 0.809 0.777 0.664 0.736 0.768
Host: Staphylococcus      i.i.d. model                 1st-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.872 0.939 0.937 0.872 0.939 0.937
GBR 0.956 0.973 0.923 0.956 0.973 0.923
SVM 0.740 0.788 0.748 0.740 0.788 0.748
Random forest 0.993 0.987 0.983 0.993 0.987 0.983
Gaussian NB 0.885 0.900 0.780 0.885 0.900 0.780
Bernoulli NB 0.668 0.880 0.923 0.668 0.880 0.923
F_2
LASSO 0.872 0.938 0.938 0.832 0.859 0.822
GBR 0.955 0.970 0.922 0.968 0.954 0.897
SVM 0.913 0.968 0.885 0.890 0.963 0.705
Random forest 0.993 0.987 0.983 0.985 0.982 0.975
Gaussian NB 0.885 0.883 0.525 0.953 0.958 0.528
Bernoulli NB 0.833 0.825 0.850 0.950 0.930 0.900
F_3
LASSO 0.876 0.923 0.938 0.858 0.922 0.933
GBR 0.960 0.926 0.914 0.962 0.952 0.955
SVM 0.850 0.875 0.868 0.875 0.880 0.905
Random forest 0.990 0.984 0.981 0.984 0.980 0.981
Gaussian NB 0.925 0.925 0.948 0.928 0.953 0.550
Bernoulli NB 0.833 0.825 0.850 0.950 0.930 0.900
F_4
LASSO 0.812 0.936 0.936 0.937 0.931 0.931
GBR 0.976 0.960 0.893 0.949 0.950 0.944
SVM 0.878 0.893 0.935 0.808 0.913 0.913
Random forest 0.991 0.987 0.985 0.990 0.9832 0.9894
Gaussian NB 0.8975 0.9100 0.8925 0.9325 0.9400 0.8500
Bernoulli NB 0.8325 0.8250 0.8500 0.9500 0.9300 0.9000
Host: Staphylococcus      2nd-order mc                 3rd-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.8718 0.9385 0.9370 0.8718 0.9385 0.9370
GBR 0.9548 0.9715 0.9262 0.9528 0.9715 0.9228
SVM 0.7400 0.7875 0.7475 0.7400 0.7875 0.7475
Random forest 0.9925 0.9870 0.9823 0.9927 0.9870 0.9833
Gaussian NB 0.8850 0.9000 0.7800 0.8850 0.9000 0.7800
Bernoulli NB 0.6675 0.8800 0.9225 0.6675 0.8800 0.9225
F_2
LASSO 0.8945 0.7987 0.7758 0.6990 0.8465 0.7740
GBR 0.9638 0.9200 0.8573 0.9955 0.8735 0.8949
SVM 0.7675 0.9300 0.5425 0.5200 0.8400 0.5250
Random forest 0.9835 0.9783 0.9773 0.9945 0.9733 0.9738
Gaussian NB 0.9525 0.9225 0.5275 0.7625 0.8175 0.5250
Bernoulli NB 0.9575 0.9200 0.9000 0.9600 0.9175 0.9050
F_3
LASSO 0.8773 0.8807 0.9112 0.7475 0.8863 0.9010
GBR 0.9234 0.9243 0.9102 0.9877 0.8794 0.9244
SVM 0.9075 0.8875 0.8450 0.5925 0.9250 0.8450
Random forest 0.9808 0.9743 0.9807 0.9925 0.9700 0.9774
Gaussian NB 0.9500 0.9275 0.5250 0.7525 0.8625 0.5375
Bernoulli NB 0.9575 0.9300 0.9000 0.9600 0.9175 0.9050
F_4
LASSO 0.8905 0.9285 0.9210 0.8153 0.9065 0.9143
GBR 0.9661 0.9563 0.9352 0.9642 0.9266 0.9209
SVM 0.8675 0.9075 0.8875 0.8575 0.9375 0.8275
Random forest 0.9860 0.9821 0.9859 0.9890 0.9794 0.9791
Gaussian NB 0.9575 0.9275 0.7650 0.7400 0.9525 0.8125
Bernoulli NB 0.9575 0.9200 0.9000 0.9600 0.9175 0.9050
Host: Synechococcus       i.i.d. model                 1st-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.9581 0.9457 0.9104 0.9581 0.9457 0.9104
GBR 0.9071 0.9035 0.8339 0.9064 0.9069 0.8336
SVM 0.7059 0.7353 0.5000 0.7059 0.7353 0.5000
Random forest 0.9652 0.9777 0.9552 0.9654 0.9798 0.9535
Gaussian NB 0.8794 0.8794 0.6324 0.8794 0.8794 0.6324
Bernoulli NB 0.6588 0.6735 0.6941 0.6588 0.6735 0.6941
F_2
LASSO 0.9581 0.9453 0.8651 0.8304 0.8522 0.7706
GBR 0.9064 0.9076 0.8346 0.8401 0.7526 0.8066
SVM 0.9206 0.9294 0.7471 0.7941 0.8000 0.6647
Random forest 0.9645 0.9777 0.9571 0.9564 0.9666 0.9251
Gaussian NB 0.8794 0.8853 0.7588 0.8794 0.8353 0.6794
Bernoulli NB 0.7471 0.7765 0.8118 0.8912 0.8941 0.8000
F_3
LASSO 0.9633 0.9512 0.8519 0.8924 0.8716 0.8343
GBR 0.9208 0.8810 0.8689 0.8400 0.8304 0.7995
SVM 0.5294 0.6471 0.7176 0.5824 0.7147 0.7529
Random forest 0.9353 0.9599 0.8965 0.9166 0.9225 0.8644
Gaussian NB 0.7529 0.7500 0.7500 0.7412 0.7559 0.7618
Bernoulli NB 0.7471 0.7765 0.8118 0.8912 0.8941 0.8000
F_4
LASSO 0.9581 0.9336 0.8543 0.8903 0.88334 0.8093
GBR 0.8948 0.9170 0.8631 0.9007 0.8945 0.8631
SVM 0.8882 0.9265 0.7676 0.8882 0.9029 0.7412
Random forest 0.9599 0.9806 0.9469 0.9782 0.9806 0.9740
Gaussian NB 0.8382 0.8706 0.7676 0.8588 0.8324 0.7882
Bernoulli NB 0.7471 0.7765 0.8118 0.8912 0.8941 0.8000
Host: Synechococcus       2nd-order mc                 3rd-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.95813 0.9457 0.9104 0.95813 0.9457 0.9104
GBR 0.9071 0.9038 0.8260 0.9085 0.9080 0.8356
SVM 0.7059 0.7353 0.5000 0.7059 0.7353 0.5000
Random forest 0.9656 0.9777 0.9595 0.9656 0.9820 0.9529
Gaussian NB 0.8794 0.8794 0.6324 0.8794 0.8794 0.6324
Bernoulli NB 0.6588 0.6735 0.6941 0.6588 0.6735 0.6941
F_2
LASSO 0.7446 0.7820 0.7035 0.4990 0.8682 0.7156
GBR 0.8208 0.8490 0.8236 0.9170 0.8062 0.7433
SVM 0.7294 0.8029 0.6618 0.5618 0.8530 0.5853
Random forest 0.9151 0.9660 0.9157 0.9405 0.9689 0.9007
Gaussian NB 0.8324 0.7882 0.6794 0.7088 0.7294 0.6529
Bernoulli NB 0.8441 0.7676 0.7647 0.9147 0.8235 0.7882
F_3
LASSO 0.7765 0.8270 0.7121 0.5578 0.8522 0.7384
GBR 0.7844 0.8242 0.8024 0.9410 0.8021 0.7419
SVM 0.6882 0.7206 0.7412 0.5912 0.7353 0.6971
Random forest 0.8483 0.9099 0.9142 0.9585 0.9668 0.9145
Gaussian NB 0.7382 0.7647 0.7588 0.7147 0.8000 0.8294
Bernoulli NB 0.8441 0.7676 0.7647 0.9147 0.8235 0.7882
F_4
LASSO 0.7464 0.78789 0.7820 0.5875 0.7785 0.7616
GBR 0.8623 0.8979 0.9024 0.9436 0.8779 0.8865
SVM 0.9265 0.8324 0.7294 0.8265 0.7265 0.6765
Random forest 0.9265 0.8324 0.7294 0.8265 0.7265 0.6765
Gaussian NB 0.7794 0.7647 0.7971 0.7353 0.7794 0.8353
Bernoulli NB 0.8441 0.7676 0.7647 0.9147 0.8235 0.7882
Host: Vibrio              i.i.d. model                 1st-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.8834 0.8648 0.7567 0.8834 0.8648 0.7567
GBR 0.8172 0.8253 0.7034 0.8155 0.8250 0.7007
SVM 0.6362 0.6690 0.6034 0.6362 0.6690 0.6034
Random forest 0.9361 0.9399 0.8920 0.9350 0.9379 0.8863
Gaussian NB 0.7897 0.7828 0.7121 0.7897 0.7828 0.7121
Bernoulli NB 0.5431 0.6224 0.6224 0.5431 0.6224 0.6224
F_2
LASSO 0.8834 0.8646 0.7401 0.8321 0.7979 0.7786
GBR 0.8175 0.8277 0.7008 0.7367 0.6893 0.6647
SVM 0.7845 0.8690 0.6224 0.7793 0.8397 0.5741
Random forest 0.9350 0.9376 0.8907 0.9377 0.9295 0.8796
Gaussian NB 0.7897 0.7862 0.5862 0.8776 0.8483 0.5862
Bernoulli NB 0.7259 0.7086 0.5879 0.8000 0.8379 0.6017
F_3
LASSO 0.8279 0.8791 0.7092 0.8202 0.7964 0.7753
GBR 0.8206 0.7985 0.6600 0.8121 0.7466 0.7555
SVM 0.6034 0.5983 0.6034 0.6000 0.6034 0.5983
Random forest 0.9232 0.9229 0.8768 0.9161 0.9446 0.8973
Gaussian NB 0.7586 0.7552 0.8052 0.8224 0.8121 0.5862
Bernoulli NB 0.7259 0.7086 0.5879 0.8000 0.8379 0.6017
F_4
LASSO 0.9101 0.8608 0.7306 0.8020 0.8405 0.8042
GBR 0.8419 0.8103 0.7088 0.7766 0.81249 0.8040
SVM 0.8431 0.8552 0.7138 0.8052 0.8121 0.5431
Random forest 0.9349 0.9337 0.9034 0.9379 0.9490 0.8820
Gaussian NB 0.7034 0.7397 0.8155 0.8276 0.8345 0.7034
Bernoulli NB 0.7259 0.7086 0.5879 0.8000 0.8379 0.6017
Host: Vibrio              2nd-order mc                 3rd-order mc
                          k = 4   k = 6   k = 8        k = 4   k = 6   k = 8
F_1
LASSO 0.8834 0.8648 0.7567 0.8834 0.8648 0.7567
GBR 0.8194 0.8256 0.7008 0.8183 0.8273 0.7011
SVM 0.6362 0.6690 0.6034 0.6362 0.6690 0.6034
Random forest 0.9344 0.9410 0.8857 0.9343 0.9388 0.8897
Gaussian NB 0.7897 0.7828 0.7121 0.7897 0.7828 0.7121
Bernoulli NB 0.5431 0.6224 0.6224 0.5431 0.6224 0.6224
F_2
LASSO 0.7553 0.7830 0.7420 0.5227 0.6895 0.6592
GBR 0.7143 0.6860 0.6658 0.7056 0.7279 0.6865
SVM 0.5793 0.8052 0.5638 0.5207 0.7190 0.5517
Random forest 0.8973 0.9068 0.8647 0.8082 0.8786 0.8487
Gaussian NB 0.8052 0.7310 0.5759 0.5121 0.7500 0.5862
Bernoulli NB 0.7672 0.7776 0.5759 0.7431 0.6310 0.5448
F_3
LASSO 0.7868 0.7881 0.7568 0.56433 0.6755 0.6301
GBR 0.7671 0.7600 0.7111 0.7421 0.7354 0.6674
SVM 0.5914 0.6034 0.5862 0.5483 0.6034 0.5862
Random forest 0.8926 0.9404 0.8534 0.8128 0.9226 0.8853
Gaussian NB 0.7241 0.6966 0.5862 0.4966 0.6293 0.5862
Bernoulli NB 0.7672 0.7776 0.5759 0.7431 0.6310 0.5448
F_4
LASSO 0.7566 0.8161 0.8121 0.5625 0.7597 0.7600
GBR 0.7338 0.7942 0.7165 0.7947 0.7207 0.6975
SVM 0.7414 0.8052 0.5138 0.6862 0.6086 0.5000
Random forest 0.8855 0.9337 0.8821 0.8624 0.9315 0.9031
Gaussian NB 0.8034 0.8310 0.6914 0.4897 0.7983 0.6603
Bernoulli NB 0.7672 0.7776 0.5759 0.7431 0.6310 0.5448
Appendix B
Supplementary Materials for
Chapter 3
B.1 Supplementary Methods
B.1.1 Two-proportion z-test
A two-proportion z-test is a hypothesis test for comparing two proportions to see whether they are the same.
For a fixed dataset:
• p_1: True proportion of male samples in the CRC group
• p̂_1: Observed proportion of male samples in the CRC group
• n_1: Total number of samples in the CRC group
• p_2: True proportion of male samples in the healthy control group
• p̂_2: Observed proportion of male samples in the healthy control group
• n_2: Total number of samples in the healthy control group
• p̂: Observed proportion of male samples in the whole dataset
Given the variable definitions above, the null hypothesis and alternative hypothesis are H_0: p_1 = p_2 and H_A: p_1 \neq p_2.
Then the z-score is calculated based on the formula

Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1 - \hat{p})\,\left(1/n_1 + 1/n_2\right)}}.
The corresponding p-value can be inferred from a z-score table.
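A minimal sketch of this test on made-up counts (the group sizes and male counts below are hypothetical):

import numpy as np
from scipy.stats import norm

male_crc, n_crc = 70, 120            # hypothetical male count / size of the CRC group
male_ctrl, n_ctrl = 55, 110          # hypothetical male count / size of the control group

p1_hat = male_crc / n_crc
p2_hat = male_ctrl / n_ctrl
p_hat = (male_crc + male_ctrl) / (n_crc + n_ctrl)     # pooled observed proportion

z = (p1_hat - p2_hat) / np.sqrt(p_hat * (1 - p_hat) * (1 / n_crc + 1 / n_ctrl))
p_value = 2 * norm.sf(abs(z))        # two-sided p-value from the normal distribution
print(z, p_value)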
B.1.2 Mann-Whitney U-test
The Mann-Whitney U-test is the non-parametric equivalent of the two-sample t-test for testing whether two samples originate from the same population. For either age or BMI, the null hypothesis is H_0: the CRC patients and the healthy controls come from the same population (i.e., age (or BMI) has the same distribution for CRC patients and healthy controls).
The U-statistic is then given by

U = R_1 - \frac{n_1 (n_1 + 1)}{2},

where the definitions of n_1 and n_2 are the same as in the two-proportion z-test, and R_1 is the sum of ranks in the sample of the CRC group.
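A minimal sketch of this statistic on made-up ages, checked against scipy's implementation:

import numpy as np
from scipy.stats import rankdata, mannwhitneyu

rng = np.random.default_rng(5)
age_crc = rng.normal(65, 8, size=30)      # hypothetical ages of CRC patients
age_ctrl = rng.normal(60, 8, size=40)     # hypothetical ages of healthy controls

ranks = rankdata(np.concatenate([age_crc, age_ctrl]))   # ranks in the combined sample
R1 = ranks[: len(age_crc)].sum()                        # sum of ranks of the CRC group
n1 = len(age_crc)
U = R1 - n1 * (n1 + 1) / 2

print(U, mannwhitneyu(age_crc, age_ctrl, alternative="two-sided"))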
B.1.3 Kullback–Leibler Divergence
The Kullback–Leibler divergence [72] is also called the relative entropy; it measures how a probability distribution (the target probability distribution) differs from a reference probability distribution. In our scenario, let \theta^T(k) = (p_1^T, \cdots, p_{4^k}^T) and \theta^R(k) = (p_1^R, \cdots, p_{4^k}^R) be the normalized target and reference k-mer distributions, respectively. The Kullback–Leibler divergence D_{KL}(\theta^T(k), \theta^R(k)) is defined as

D_{KL}\big(\theta^T(k), \theta^R(k)\big) = \sum_{i \in \{1, \cdots, 4^k\}} p_i^T \ln\!\left( \frac{p_i^T}{p_i^R} \right),

where ln is the natural logarithm with base e.
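A minimal sketch of this divergence for two made-up normalized k-mer frequency vectors (k = 2, so 4^k = 16 entries):

import numpy as np

rng = np.random.default_rng(6)
target = rng.random(16)
target /= target.sum()          # normalized target k-mer distribution theta^T(k)
reference = rng.random(16)
reference /= reference.sum()    # normalized reference k-mer distribution theta^R(k)

kl = np.sum(target * np.log(target / reference))   # natural log, base e
print(kl)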
B.2 Implementation
For all of the samples in the datasets, each individual has multiple sequencing runs. We randomly selected one sequencing run for each individual in the dataset, and repeated this random selection 20 times. When doing the supervised learning within each single dataset, the samples are shuffled before splitting into training data and testing data. The performance of each supervised learning method is averaged over the AUC scores of all 20 repeats.
The construction of the features of the sequencing data is implemented in C++. The supervised learning methods are implemented in Python with the package scikit-learn [73] for Logistic regression, LASSO, SVM and Random forest. The Multilayer Perceptron is implemented under the Keras framework [74].
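A schematic Python sketch of this 20-repeat evaluation follows; it is not the actual pipeline, and the feature tensor, labels and logistic-regression settings are placeholders for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n_individuals, runs_per_individual, n_features = 120, 3, 256
# features[i, r] holds the feature vector of sequencing run r for individual i.
features = rng.random((n_individuals, runs_per_individual, n_features))
labels = rng.integers(0, 2, size=n_individuals)        # 1 = CRC, 0 = healthy control

aucs = []
for _ in range(20):
    run_choice = rng.integers(0, runs_per_individual, size=n_individuals)
    X = features[np.arange(n_individuals), run_choice]  # one run drawn per individual
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, shuffle=True)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

print(np.mean(aucs))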
B.3 Supplementary Figures
B.3.1 Clinical indicator distributions
Figure B.1: Age distribution for all of the samples in three datasets.
Figure B.2: BMI distribution for all of the samples in three datasets.
B.3.2 AUC plots of CRC prediction with subsampled
metagenomic sequences
Intra-dataset CRC prediction with subsampled metagenomic sequences
Logistic Regression (NatureCom v.s. NatureCom)
Figure B.3: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Logistic regression with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Logistic Regression (PlosOne v.s. PlosOne)
Figure B.4: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Logistic regression with different sampling sizes within the PlosOne (PO) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Logistic Regression (MolSysBio v.s. MolSysBio)
Figure B.5: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Logistic regression with different sampling sizes within the MolSysBio (MSB) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
LASSO (NatureCom v.s. NatureCom)
Figure B.6: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for LASSO with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
LASSO (PlosOne v.s. PlosOne)
Figure B.7: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for LASSO with different sampling sizes within the PlosOne (PO) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
LASSO (MolSysBio v.s. MolSysBio)
Figure B.8: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for LASSO with different sampling sizes within the MolSysBio (MSB) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Random Forest (NatureCom v.s. NatureCom)
Figure B.9: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Random forest with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Random Forest (PlosOne v.s. PlosOne)
Figure B.10: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Random forest with different sampling sizes within the PlosOne (PO) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Random Forest (MolSysBio v.s. MolSysBio)
Figure B.11: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Random forest with different sampling sizes within the MolSysBio (MSB) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-Linear Kernel (NatureCom v.s. NatureCom)
Figure B.12: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-Linear Kernel (PlosOne v.s. PlosOne)
Figure B.13: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes within the PlosOne (PO) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-Linear Kernel (MolSysBio v.s. MolSysBio)
Figure B.14: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes within the MolSysBio (MSB) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-Polynomial Kernel (NatureCom v.s. NatureCom)
Figure B.15: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-Polynomial Kernel (PlosOne v.s. PlosOne)
Figure B.16: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes within the PlosOne (PO) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-Polynomial Kernel (MolSysBio v.s. MolSysBio)
Figure B.17: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes within the MolSysBio (MSB) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-RBF Kernel (NatureCom v.s. NatureCom)
Figure B.18: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-RBF Kernel (PlosOne v.s. PlosOne)
Figure B.19: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes within the PlosOne (PO) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-RBF Kernel (MolSysBio v.s. MolSysBio)
Figure B.20: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes within the MolSysBio (MSB) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Multilayer Perceptron (NatureCom v.s. NatureCom)
Figure B.21: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Multilayer perceptron with different sampling sizes within the NatureCom (NC) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Multilayer Perceptron (PlosOne v.s. PlosOne)
Figure B.22: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Multilayer perceptron with different sampling sizes within the PlosOne (PO) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Multilayer Perceptron (MolSysBio v.s. MolSysBio)
Figure B.23: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Multilayer perceptron with different sampling sizes within the MolSysBio (MSB) dataset. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Cross-datasets CRC prediction with subsampled metagenomic
sequences
Logistic regression (NatureCom v.s. PlosOne)
Figure B.24: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Logistic regression with different sampling sizes using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Logistic regression (NatureCom v.s. MolSysBio)
Figure B.25: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Logistic regression with different sampling sizes using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Logistic regression (PlosOne v.s. NatureCom)
Figure B.26: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Logistic regression with different sampling sizes using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Logistic regression (PlosOne v.s. MolSysBio)
Figure B.27: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Logistic regression with different sampling sizes using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Logistic regression (MolSysBio v.s. NatureCom)
Figure B.28: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Logistic regression with different sampling sizes using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Logistic regression (MolSysBio v.s. PlosOne)
Figure B.29: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Logistic regression with different sampling sizes using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
LASSO (NatureCom v.s. PlosOne)
Figure B.30: Average AUC scores with confidence intervals (CI) of different k-mer lengths under the i.i.d. and 1st-order Markov chain models over 4 different feature definitions for LASSO with different sampling sizes using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
LASSO (NatureCom v.s. MolSysBio)
Figure B.31: Average AUC scores with confidence intervals (CI) of different k-mer lengths under
i.i.d and 1
st
-order markov chain model over 4 different feature definitions for LASSO with different
sampling sizes using the NatureCom (NC) dataset as training data and the MolSysBio (MSB)
dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20
different random training data/testing data splittings.
146
LASSO (PlosOne v.s. NatureCom)
Figure B.32: Average AUC scores with confidence intervals (CI) of different k-mer lengths under
i.i.d and 1
st
-order markov chain model over 4 different feature definitions for LASSO with different
sampling sizes using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset
as testing data. The? on top of each AUC bar indicates the highest AUC score in all of the 20 different random
training data/testing data splittings.
147
LASSO (PlosOne v.s. MolSysBio)
Figure B.33: Average AUC scores with confidence intervals (CI) of different k-mer lengths under
i.i.d and 1
st
-order markov chain model over 4 different feature definitions for LASSO with different
sampling sizes using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset
as testing data. The? on top of each AUC bar indicates the highest AUC score in all of the 20 different random
training data/testing data splittings.
148
LASSO (MolSysBio v.s. NatureCom)
Figure B.34: Average AUC scores with confidence intervals (CI) of different k-mer lengths under
i.i.d and 1
st
-order markov chain model over 4 different feature definitions for LASSO with different
sampling sizes using the MolSysBio (MSB) dataset as training data and the NatureCom (NC)
dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20
different random training data/testing data splittings.
149
LASSO (MolSysBio v.s. PlosOne)
Figure B.35: Average AUC scores with confidence intervals (CI) of different k-mer lengths under
i.i.d and 1
st
-order markov chain model over 4 different feature definitions for LASSO with different
sampling sizes using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset
as testing data. The? on top of each AUC bar indicates the highest AUC score in all of the 20 different random
training data/testing data splittings.
150
Random Forest (NatureCom vs. PlosOne)
Figure B.36: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Random Forest with different sampling sizes, using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Random Forest (NatureCom vs. MolSysBio)
Figure B.37: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Random Forest with different sampling sizes, using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Random Forest (PlosOne vs. NatureCom)
Figure B.38: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Random Forest with different sampling sizes, using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Random Forest (PlosOne vs. MolSysBio)
Figure B.39: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Random Forest with different sampling sizes, using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Random Forest (MolSysBio vs. NatureCom)
Figure B.40: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Random Forest with different sampling sizes, using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Random Forest (MolSysBio vs. PlosOne)
Figure B.41: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Random Forest with different sampling sizes, using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-Linear Kernel (NatureCom vs. PlosOne)
Figure B.42: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes, using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Linear Kernel (NatureCom vs. MolSysBio)
Figure B.43: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes, using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Linear Kernel (PlosOne vs. NatureCom)
Figure B.44: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes, using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Linear Kernel (PlosOne vs. MolSysBio)
Figure B.45: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes, using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Linear Kernel (MolSysBio vs. NatureCom)
Figure B.46: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes, using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Linear Kernel (MolSysBio vs. PlosOne)
Figure B.47: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes, using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-Polynomial Kernel (NatureCom vs. PlosOne)
Figure B.48: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes, using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Polynomial Kernel (NatureCom vs. MolSysBio)
Figure B.49: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes, using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Polynomial Kernel (PlosOne vs. NatureCom)
Figure B.50: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes, using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Polynomial Kernel (PlosOne vs. MolSysBio)
Figure B.51: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes, using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Polynomial Kernel (MolSysBio vs. NatureCom)
Figure B.52: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes, using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Polynomial Kernel (MolSysBio vs. PlosOne)
Figure B.53: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes, using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-RBF Kernel (NatureCom vs. PlosOne)
Figure B.54: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes, using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-RBF Kernel (NatureCom vs. MolSysBio)
Figure B.55: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes, using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-RBF Kernel (PlosOne vs. NatureCom)
Figure B.56: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes, using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-RBF Kernel (PlosOne vs. MolSysBio)
Figure B.57: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes, using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-RBF Kernel (MolSysBio vs. NatureCom)
Figure B.58: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes, using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-RBF Kernel (MolSysBio vs. PlosOne)
Figure B.59: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes, using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Multilayer perceptron (NatureCom vs. PlosOne)
Figure B.60: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Multilayer perceptron with different sampling sizes, using the NatureCom (NC) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Multilayer perceptron (NatureCom vs. MolSysBio)
Figure B.61: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Multilayer perceptron with different sampling sizes, using the NatureCom (NC) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Multilayer perceptron (PlosOne vs. NatureCom)
Figure B.62: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Multilayer perceptron with different sampling sizes, using the PlosOne (PO) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Multilayer perceptron (PlosOne vs. MolSysBio)
Figure B.63: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Multilayer perceptron with different sampling sizes, using the PlosOne (PO) dataset as training data and the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Multilayer perceptron (MolSysBio vs. NatureCom)
Figure B.64: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Multilayer perceptron with different sampling sizes, using the MolSysBio (MSB) dataset as training data and the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Multilayer perceptron (MolSysBio vs. PlosOne)
Figure B.65: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Multilayer perceptron with different sampling sizes, using the MolSysBio (MSB) dataset as training data and the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Combined-datasets CRC prediction with subsampled metagenomic sequences

Logistic Regression (NatureCom & PlosOne vs. MolSysBio)
Figure B.66: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Logistic Regression with different sampling sizes, combining the NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Logistic Regression (NatureCom & MolSysBio vs. PlosOne)
Figure B.67: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Logistic Regression with different sampling sizes, combining the NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Logistic Regression (PlosOne & MolSysBio vs. NatureCom)
Figure B.68: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Logistic Regression with different sampling sizes, combining the PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
LASSO (NatureCom & PlosOne vs. MolSysBio)
Figure B.69: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for LASSO with different sampling sizes, combining the NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

LASSO (NatureCom & MolSysBio vs. PlosOne)
Figure B.70: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for LASSO with different sampling sizes, combining the NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

LASSO (PlosOne & MolSysBio vs. NatureCom)
Figure B.71: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for LASSO with different sampling sizes, combining the PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Random Forest (NatureCom & PlosOne vs. MolSysBio)
Figure B.72: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Random Forest with different sampling sizes, combining the NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Random Forest (NatureCom & MolSysBio vs. PlosOne)
Figure B.73: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Random Forest with different sampling sizes, combining the NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Random Forest (PlosOne & MolSysBio vs. NatureCom)
Figure B.74: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Random Forest with different sampling sizes, combining the PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-Linear Kernel (NatureCom & PlosOne vs. MolSysBio)
Figure B.75: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes, combining the NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Linear Kernel (NatureCom & MolSysBio vs. PlosOne)
Figure B.76: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes, combining the NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Linear Kernel (PlosOne & MolSysBio vs. NatureCom)
Figure B.77: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Linear Kernel) with different sampling sizes, combining the PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-Polynomial Kernel (NatureCom & PlosOne vs. MolSysBio)
Figure B.78: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes, combining the NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Polynomial Kernel (NatureCom & MolSysBio vs. PlosOne)
Figure B.79: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes, combining the NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-Polynomial Kernel (PlosOne & MolSysBio vs. NatureCom)
Figure B.80: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (Polynomial Kernel) with different sampling sizes, combining the PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
SVM-RBF Kernel (NatureCom & PlosOne vs. MolSysBio)
Figure B.81: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes, combining the NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-RBF Kernel (NatureCom & MolSysBio vs. PlosOne)
Figure B.82: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes, combining the NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

SVM-RBF Kernel (PlosOne & MolSysBio vs. NatureCom)
Figure B.83: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for SVM (RBF Kernel) with different sampling sizes, combining the PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Multilayer perceptron (NatureCom & PlosOne vs. MolSysBio)
Figure B.84: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Multilayer perceptron with different sampling sizes, combining the NatureCom (NC) and PlosOne (PO) datasets as training data, then using the MolSysBio (MSB) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Multilayer perceptron (NatureCom & MolSysBio vs. PlosOne)
Figure B.85: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Multilayer perceptron with different sampling sizes, combining the NatureCom (NC) and MolSysBio (MSB) datasets as training data, then using the PlosOne (PO) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.

Multilayer perceptron (PlosOne & MolSysBio vs. NatureCom)
Figure B.86: Average AUC scores with confidence intervals (CI) of different k-mer lengths under i.i.d. and 1st-order Markov chain models over 4 different feature definitions for Multilayer perceptron with different sampling sizes, combining the PlosOne (PO) and MolSysBio (MSB) datasets as training data, then using the NatureCom (NC) dataset as testing data. The ? on top of each AUC bar indicates the highest AUC score in all of the 20 different random training data/testing data splittings.
Abstract
The goal of this dissertation is to uncover the potential of features associated with virus-host interactions and colorectal cancer in metagenomic studies. The dissertation comprises two parts.

The first part is a study of virus-host infectious associations using supervised learning methods. Uncovering virus-host infectious associations is important for understanding the functions and dynamics of microbial communities. Both cellular and fractionated viral metagenomic data generate a large number of viral contigs with missing host information. Although relatively simple methods based on the similarity between the word frequency vectors of viruses and bacterial hosts have been developed to study virus-host associations, the problem remains significantly understudied. We hypothesize that machine learning methods based on word frequencies can be used effectively to study virus-host infectious associations.

The second part is an investigation of methods for predicting colorectal cancer (CRC) versus healthy controls, using alignment-free methods to construct the feature vectors. We first evaluated the prediction performance on 3 different datasets and compared the results with state-of-the-art studies. The performance of alignment-free approaches is comparable with that of alignment-based methods under intra-dataset, cross-datasets, and combined-datasets assessments. We then investigated how CRC prediction performance changes when shallow subsamples are used instead of the whole metagenomic sequencing data. Our results indicate that shallow subsamples of the whole metagenomic sequences remain powerful for supervised CRC prediction, even when much of the sequence data is missing.
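To make the alignment-free setup described above concrete, the following is a minimal sketch, not the dissertation's actual pipeline, of how a metagenomic sample can be represented by a normalized k-mer frequency vector, how reads can be shallowly subsampled, and how a classifier trained on one cohort can be evaluated on a different cohort with AUC. The helper names, the synthetic reads, and the choice of k = 4 are illustrative assumptions only; real data would be loaded from the sequencing files and study metadata of the cohorts.

# A minimal sketch, assuming Python with NumPy and scikit-learn available.
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def kmer_frequency_vector(reads, k=4):
    """Represent a sample (a list of reads) by its normalized k-mer frequency vector."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in index:  # k-mers containing N or other symbols are skipped
                counts[index[kmer]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def subsample_reads(reads, n_reads, rng):
    """Shallow subsampling: keep only n_reads randomly chosen reads from a sample."""
    if len(reads) <= n_reads:
        return reads
    keep = rng.choice(len(reads), size=n_reads, replace=False)
    return [reads[i] for i in keep]


def featurize(samples, k, n_reads, rng):
    """Turn (reads, label) pairs into a feature matrix X and a label vector y."""
    X, y = [], []
    for reads, label in samples:
        X.append(kmer_frequency_vector(subsample_reads(reads, n_reads, rng), k=k))
        y.append(label)
    return np.asarray(X), np.asarray(y)


# Toy cohorts so the sketch runs end to end; in practice one study would serve as
# training data and a different study as testing data.
rng = np.random.default_rng(0)
make_reads = lambda n, length: ["".join(rng.choice(list("ACGT"), size=length)) for _ in range(n)]
train_cohort = [(make_reads(200, 100), i % 2) for i in range(20)]
test_cohort = [(make_reads(200, 100), i % 2) for i in range(10)]

X_tr, y_tr = featurize(train_cohort, k=4, n_reads=100, rng=rng)
X_te, y_te = featurize(test_cohort, k=4, n_reads=100, rng=rng)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("cross-dataset AUC (toy data):", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

On real cohorts, this evaluation would be repeated over many random subsampling runs and training/testing splittings, and the resulting AUC distributions are what the bar plots in the appendix figures summarize.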
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Deep learning in metagenomics: from metagenomic contigs sorting to phage-bacterial association prediction
Application of machine learning methods in genomic data analysis
Enhancing phenotype prediction through integrative analysis of heterogeneous microbiome studies
Constructing metagenome-assembled genomes and mobile genetic element host interactions using metagenomic Hi-C
Statistical and computational approaches for analyzing metagenomic sequences with reproducibility and reliability
Big data analytics in metagenomics: integration, representation, management, and visualization
Developing statistical and algorithmic methods for shotgun metagenomics and time series analysis
Sharpening the edge of tools for microbial diversity analysis
Exploration of human microbiome through metagenomic analysis and computational algorithms
Predicting virus-host interactions using genomic data and applications in metagenomics
Clustering 16S rRNA sequences: an accurate and efficient approach
Applications and improvements of background adjusted alignment-free dissimilarity measures
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
Computational algorithms and statistical modelings in human microbiome analyses
Machine learning of DNA shape and spatial geometry
Data-driven learning for dynamical systems in biology
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
Fast search and clustering for large-scale next generation sequencing data
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
Exploring the genetic basis of complex traits
Asset Metadata
Creator
Zhang, Mengge (author)
Core Title
Feature engineering and supervised learning on metagenomic sequence data
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
11/06/2019
Defense Date
12/18/2019
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
bacteria, colorectal cancer, feature engineering, metagenomic sequencing data, machine-learning, OAI-PMH Harvest, virus
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Sun, Fengzhu (committee chair), Chen, Liang (committee member), Fuhrman, Jed (committee member)
Creator Email
menggezh@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-229178
Unique identifier
UC11673896
Identifier
etd-ZhangMengg-7896.pdf (filename), usctheses-c89-229178 (legacy record id)
Legacy Identifier
etd-ZhangMengg-7896.pdf
Dmrecord
229178
Document Type
Dissertation
Rights
Zhang, Mengge
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA