Deep learning in metagenomics: from metagenomic contigs sorting to phage-bacterial association prediction
DEEP LEARNING IN METAGENOMICS: FROM METAGENOMIC CONTIGS
SORTING TO PHAGE-BACTERIAL ASSOCIATION PREDICTION
by
Tianqi Tang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
December 2023
Copyright 2023 Tianqi Tang
Dedication
I dedicate this dissertation to my beloved parents, Zhuhua Tang and Chuanyi Wang. I also dedicate this dissertation to my grandmother, Yunfang Li.
Acknowledgements
First of all, I would like to thank my doctoral advisor, Dr. Fengzhu Sun. His unwavering guidance,
encouragement, and profound insights have been instrumental in my academic journey. His dedication to research, coupled with his patience and wisdom, has not only shaped this dissertation
but has also molded my growth as a scholar and thinker.
I would also like to thank my Doctoral qualifying exam and Dissertation committee members,
Dr. Jed A. Fuhrman, Dr. Liang Chen, Dr. Jinchi Lv, and Dr. Yan Liu, for their valuable feedback
and constructive criticisms. Their perspectives and expertise significantly enhanced the quality
and breadth of my work.
I would also like to thank my collaborators, Dr. Shengwei Hou and Dr. Jed A. Fuhrman, for their
great support in joint publications. Their invaluable insights, commitment, and collaborative
spirit significantly contributed to our joint publications, making our partnership both productive
and enriching.
I would like to thank professors in QCB, Dr. Peter Calabrese, Dr. Mark Chaisson, Dr. Liang
Chen, Dr. Doc Edge, Dr. Geoffrey Fudenberg, Dr. Vsevolod Katritch, Dr. Adam Maclean, Dr.
Remo Rohs, Dr. Rory Spence, and Dr. Andrew Smith for teaching me knowledge of different
branches in computational biology.
I would like to thank all my labmates at Sun lab: Dr. Xin Bai, Dr. Zifan Zhu, Dr. Kujin Tang,
Dr. Weili Wang, Yilin Gao, Beibei Wang, Ziye Wang, Yuxuan Du, Dallace Francis, Jiawei Huang,
and Wenxuan Zuo. I would also like to thank other friends at USC: Brendon Cooper, Tsung-Yu Lu,
Jingwen Ren, Bo Sun, George Wang, Xiaojun Wu, Quentin Yang, Qingyang Yin, Yuxiang Zhan
and many others.
Lastly, and most importantly, I owe an immeasurable debt of gratitude to my parents, Zhuhua
Tang and Chuanyi Wang, as well as my cherished grandmother, Yunfang Li. Their ceaseless
support and unwavering belief in me have been the pillars upon which I’ve built my life. Their
teachings, guidance, and constant presence by my side have shaped me into the person I am
today. Without them, I could not have achieved this greatness.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Computational sequence analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Sequence classification on metagenomics data . . . . . . . . . . . . . . . . . . . . 7
1.3 Phage-bacterial association prediction in metagenomic dataset . . . . . . . . . . . 11
1.4 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Authors and contributors to the dissertation . . . . . . . . . . . . . . . . . . . . . 14
Chapter 2: Metagenomics contig classification with DeepMicroClass . . . . . . . . . . . . 15
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Dataset preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 Benchmark dataset preparation . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 Model design and training . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.4 Use-case data preparation and analysis . . . . . . . . . . . . . . . . . . . . 23
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1 A CNN-based multi-class classifier . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 DeepMicroClass outperforms Tiara and Whokaryote in
eukaryotic host sequence prediction . . . . . . . . . . . . . . . . . . . . . 27
2.3.3 DeepMicroClass outcompetes PlasFlow, PPR-Meta and geNomad in
plasmid sequence prediction . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.4 DeepMicroClass achieves improved results in viral sequence
prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.5 DeepMicroClass outperforms PPR-Meta and geNomad in multi-class
prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.6 DeepMicroClass predicted more eukaryotic and viral contigs than
alignment-based predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.7 Computational cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4.1 Microbial eukaryotes and viruses infecting them are understudied . . . . . 41
2.4.2 The challenge of classifying prokaryotic host and plasmid sequences . . . 43
Chapter 3: Phage-bacterial contig association prediction with ContigNet . . . . . . . . . . 46
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1 Dataset preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.2 Feature representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.3 Deep learning model structure and training . . . . . . . . . . . . . . . . . 50
3.2.4 Investigation of the effects of contig lengths, sequencing errors and
chimeric contigs on the performance of ContigNet . . . . . . . . . . . . . 53
3.2.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.1 Sharing weights among base and codon paths improves model
generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.2 ContigNet increased the prediction performance compared to
existing methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.3 The performance of ContigNet increases with both viral and host contig
lengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3.4 Assessing the effects of different channels . . . . . . . . . . . . . . . . . . 61
3.3.5 Sequencing errors and chimeric contigs decrease the performance of
ContigNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.6 Performance of ContigNet on new datasets . . . . . . . . . . . . . . . . . . 67
3.3.7 Computational cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4.1 Proper software should be selected carefully for particular task . . . . . . 72
Chapter 4: Conclusions and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.1 Advancements in metagenomic sequence classification and association prediction 77
4.2 DeepMicroClass enables potential future advancement in sequence classification . 79
4.3 Expanding datasets unlock new potential for ContigNet . . . . . . . . . . . . . . . 80
Chapter 5: Supplementary materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1 Supplementary materials for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . 82
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
List of Tables
5.1 The composition of test datasets used in this study for benchmarking
different tools. PROK includes prokaryotic genomes, plasmids and prokaryotic
viruses; EUK includes eukaryotic genomes and viruses. Prok: prokaryotic
genomes, ProkVir: prokaryotic viruses/phages, Plas: plasmids, Euk: eukaryotic
genomes, EukVir: eukaryotic viruses. . . . . . . . . . . . . . . . . . . . . . . . . . 83
List of Figures
2.1 Sequence source composition of the 20 test datasets. Twenty equal-sized benchmark
datasets were constructed. The fractions of PROK (including prokaryotic hosts,
prokaryotic viruses, and plasmids) to EUK (including eukaryotic hosts and
eukaryotic viruses) sequences were determined using the ratios of 9:1, 7:3, 5:5,
3:7, and 1:9. For each fixed PROK:EUK ratio, the PROK fraction was further split
into prokaryotic hosts, prokaryotic viruses and plasmids based on the ratios of
5:1:1, 4:1:1, 3:1:1, and 2:1:1; and the EUK fraction was further split into eukaryotic
hosts and eukaryotic viruses according to the ratio of 5:1, 4:1, 3:1, and 2:1. . . . . 20
2.2 Schematic representation of the multi-class CNN structure used in
this study. The network has two convolutional paths, a base-path encodes
the nucleotide level information and a codon-path encodes the codon level
information. The hyperparameters used for each convolutional layer are marked
on the figure. For each strand, the output dimension of base- and codon-paths
are 256 and 256, respectively. The di-path outputs of forward and reverse
strands are concatenated into a 1024-dimensional vector, which is used as the
input of following linear layers. The final linear layer outputs a 5-dimensional
vector, with each dimension indicating the probability of the input contig being
eukaryotic host, eukaryotic virus, plasmid, prokaryotic host and prokaryotic
virus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 The distribution of viral confidence scores for (a) VirFinder and (b)
PPR-Meta. For both predictors, the same dataset was used and the predictions
were performed with default parameters. VirFinder uses VF-Scores to determine
the likelihood of input sequences to be viral or not, and PPR-Meta uses
phage scores to discern viruses from host chromosomes and plasmids. Both
predictors achieved a high recall for prokaryotic viruses, while the confidence
scores of eukaryotic viruses were more evenly spread across all confidence
regions. Besides, both predictors achieved a high performance in distinguishing
prokaryotic host sequences from prokaryotic viruses, but less so for eukaryotic
host sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 The ROC curves and AUC scores of different length models assessed on test
datasets. Each different panel shows the ROC curves for 5 sequence classes at
different contig lengths (500 bps, 1 kbps, 2 kbps, 3 kbps, 5 kbps, 10 kbps, 50 kbps
and 100 kbps). Euk, eukaryotic sequences; EukVir, eukaryotic viral sequences;
Plasmid, plasmid sequences; Prok, prokaryotic genome sequences; ProkVir,
prokaryotic viral sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Distribution patterns of accuracy (a) and F1 score (b) across 20 test datasets for
DeepMicroClass, Tiara and Whokaryote. The top panel shows the sequence type
composition of 20 test datasets, and the detailed composition ratios can be found
in Table 5.1. The dashed black lines indicate where accuracy or F1 score equals 0.8. 30
2.6 Distribution patterns of accuracy (a) and F1 score (b) across 20 test datasets for
DeepMicroClass, PlasFlow, PPR-Meta and geNomad on plasmid classification.
The dashed black lines indicate where accuracy or F1 score equals 0.8. The same
benchmarking datasets were used as in Figure 2.5. DMC, DeepMicroClass; PPR,
PPR-Meta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Distribution of (a) accuracy and (b) F1 scores across 20 test datasets for
DeepMicroClass (DMC), DeepVirFinder (DVF), VIBRANT, PPR-Meta (PPR) and
geNomad on prokaryotic viral contig classification. DeepMicroClass received
the highest scores in both accuracy and F1 score in all tested scenarios compared
to the other predictors. Increasing the fraction of eukaryotic related sequences
did not impair the performance of DeepMicroClass, but did for the other tools.
The dashed black lines indicate where accuracy or F1 score equals 0.8. The same
benchmarking datasets were used as in Figure 2.5. . . . . . . . . . . . . . . . . . 34
2.8 Distribution of (a) accuracy and (b) F1 scores across 20 test datasets for
DeepMicroClass and VirSorter2 on prokaryotic and eukaryotic viral contig
classification. DeepMicroClass received higher scores in both accuracy and F1
score in all tested scenarios compared to VirSorter2. Both DeepMicroClass and
VirSorter2 maintained accuracy and F1 above 0.9 and were not affected by
the composition of different sequence types in the dataset. The dashed black
lines indicate where accuracy or F1 score equals 0.8. The same benchmarking
datasets were used as in Figure 2.5. . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.9 Distribution of (a) accuracy and (b) F1 scores across 20 test datasets for
DeepMicroClass, PPR-Meta and geNomad on prokaryotic host, prokaryotic
virus and plasmid contig classification. DeepMicroClass received higher scores
in both accuracy and F1 score in all tested scenarios compared to PPR-Meta and
geNomad in multi-class classification. The dashed black lines indicate where
accuracy or F1 score equals 0.8. The same benchmarking datasets were used as in
Figure 2.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.10 Sequence classification and read abundance of a 1-300 µm size fraction
marine metagenomic dataset sampled off the coast of Southern California.
Metagenomic contigs were classified using DeepMicroClass, Kaiju and MetaEuk
at a length cutoff of 2 kb, and percentages of different sequence types were
calculated (a). Contigs predicted as Prokaryotes by both Kaiju and MetaEuk (b),
and contigs that were not classified by Kaiju (c) or MetaEuk (d) were further
broken down into DeepMicroClass’s classification. Clean reads were aligned
to metagenomic contigs and percentages of mappable reads were calculated
(e). Mapped read percentages were further summarized according to sequence
types of reference contigs as predicted by DeepMicroClass (f), Kaiju (g) and
MetaEuk (h). Prokaryotes included both prokaryotic hosts and plasmids.
UnclassifiedViruses were sequences predicted to be viruses but their taxonomy
couldn’t be further resolved by Kaiju or MetaEuk. . . . . . . . . . . . . . . . . . 40
3.1 Pie chart showing the number of phages having association with a host
species in Virus-Host DB. It highlights the top 10 species, while the remaining
species are grouped as "Others". E. coli represents the largest portion, accounting
for 9.58% of the total species. Since there is no dominant species, the dataset is
not biased towards any particular species. . . . . . . . . . . . . . . . . . . . . . . 48
3.2 The overview of the model structure of ContigNet. The network receives
two inputs: the one-hot matrix for the virus contig and the one-hot matrix for
the host contig. These inputs traverse two separate convolutional paths—the
base path and the codon path—after the base one-hot matrix is transformed
into the codon one-hot matrix using a codon transformer. The outputs of the
convolutional paths are then aggregated, concatenated, and passed through
a fully connected layer. Subsequently, the output is processed by a sigmoid
function, which is interpreted as the probability that the virus and host are
associated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 Boxplots comparing AUROC scores of ContigNet models with and
without shared weights between phage and host paths. The x-axis
differentiates between models with shared or distinct convolutional paths, while
the y-axis represents the corresponding AUROC scores. A Wilcoxon signed-rank test reveals that the ContigNet model with shared weights significantly
outperforms the model with distinct weights, with a p-value of $1.33 \times 10^{-25}$. . . . . 57
3.4 Performance comparison among $d_2^*$, WIsH and ContigNet. (a) The ROC
curves of the $d_2^*$ method on the validation set with contigs of different lengths. (b)
The ROC curves of WIsH on the validation set with contigs of different lengths,
and (c) the ROC curves of ContigNet on the validation set with contigs of different
lengths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5 Heatmap of AUROCs for different phage and host contig lengths. The
x-axis represents host contig lengths and the y-axis represents phage contig
lengths. The color gradient ranges from red to green, representing values from
60 to 100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 Heatmap of the AUROC value difference between ContigNet with di-path
model and (a) base-path only model, and (b) codon-path only model. The
x-axis represents host contig lengths and the y-axis represents phage contig
lengths. The color gradient ranges from red to green, representing values from
-10 to 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.7 Line plot describing the AUROC score of ContigNet under varying levels
of artificially introduced sequencing errors in the test set. The x-axis
represents µ from 0 to 0.1, and each solid line corresponds to a different level of δ,
the insertion/deletion rate, with value 0, 0.05 and 0.1. The dashed line represents
the baseline, where no errors were introduced. The y-axis denotes the AUROC
score. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.8 Line plot describing the AUROC score of ContigNet under varying
levels of artificially introduced assembly errors in the test set. The x-axis
represents the chimera portion from 0 to 0.2, and the solid lines represent the cases
where only the virus is perturbed, only the host is perturbed, or both the virus and host are perturbed. The
dashed line represents the baseline, where no chimeras were introduced. The
y-axis denotes the AUROC score. . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.9 Heatmap of AUROC for different contig lengths on the MGV dataset,
using a model trained with the Virus-Host DB. The x-axis represents host
contig lengths and the y-axis represents phage contig lengths. The color gradient
ranges from red to green, representing values from 60 to 100. . . . . . . . . . . . . 69
3.10 ContigNet can be used to predict plasmid-host associations with high
accuracy. (a) Heatmap of AUROC for different contig lengths on plasmid
dataset (PLSDB) for model trained with virus dataset (Virus-Host DB). The x-axis
represents host contig lengths and the y-axis represents plasmid contig lengths.
(b) The ROC curves of the $d_2^*$ method on PLSDB for contigs with different lengths.
(c) The ROC curves of WIsH on PLSDB for contigs with different lengths. . . . . . 71
3.11 Grouped bar chart illustrating the accuracy of host prediction for whole
genomes across various taxonomic levels, from genus to phylum. The
x-axis represents the taxonomy levels, arranged from lower to higher ranks,
while the y-axis indicates the prediction accuracy of each corresponding
method at different taxonomy levels. The bars represent different phage-host
association prediction methods (WIsH, PHP, HoPhage, VPF-Class, VHM-net,
vHULK, RaFAH, BLASTN, HostG, ContigNet), and are grouped together at each
taxonomic level. Notably, for the task of assigning a viral contig to a candidate
host, the performance of ContigNet is comparable to that of WIsH, but not as
high as other methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.12 Grouped bar chart illustrating the accuracy of host prediction for
whole genomes of phage-host pairs without alignment results across
various taxonomic levels, from genus to phylum. The x-axis represents the
taxonomy levels, arranged from lower to higher ranks, while the y-axis indicates
the prediction accuracy of each corresponding method at different taxonomy
levels. The bars represent different phage-host association prediction methods
(WIsH, PHP, HoPhage, VPF-Class, VHM-net, vHULK, RaFAH, BLASTN, HostG,
ContigNet), and are grouped together at each taxonomic level. The overall
relation between accuracies of different methods remains the same. . . . . . . . . 75
5.1 Performance of DeepMicroClass, Tiara and Whokaryote on eukaryotic
sequence classification. Both the accuracy and F1 score were compared based
on 20 designed test datasets. The sequence class composition of the 20 test
datasets can be found in Table 5.1. Values on top of the pairwise comparisons are
Bonferroni adjusted t-test p-values. The significance of the overall ANOVA test
was shown on the bottom left corner. . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 The distribution of misclassified sequence types by Tiara and Whokaryote. The sequence composition of these datasets can be found in Table 5.1. To
make the figure more visible, the range of y-axis is from 0 to 100 for Tiara and
from 0 to 500 for Whokaryote. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3 Performance of DeepMicroClass, PlasFlow, PPR-Meta and geNomad
on plasmid sequence classification. Both the accuracy and F1 score were
compared based on 20 designed test datasets. The sequence class composition
of the 20 test datasets can be found in Table 5.1. Values on top of the pairwise
comparisons are Bonferroni adjusted t-test p-values. The significance of the
overall ANOVA test was shown on the bottom left corner. . . . . . . . . . . . . . 86
5.4 The distribution of misclassified sequence types by PlasFlow, PPR-Meta
and geNomad. The sequence composition of these datasets can be found in
Table 5.1. The y-axis ranges from 0 to 400 for both panels. . . . . . . . . . . . . . 87
5.5 Performance of DeepMicroClass (DMC), DeepVirFinder (DVF), VIBRANT,
PPR-Meta (PPR) and geNomad on prokaryotic viral sequence classification. Both the accuracy and F1 score were compared based on 20 designed test
datasets. The sequence class composition of the 20 test datasets can be found in
Table 5.1. Values on top of the pairwise comparisons are Bonferroni adjusted
t-test p-values. The significance of the overall ANOVA test was shown on the
bottom left corner. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.6 Performance of DeepMicroClass and VirSorter2 on prokaryotic and
eukaryotic viral sequence classification. Both the accuracy and F1 score
were compared based on 20 designed test datasets. The sequence class
composition of the 20 test datasets can be found in Table 5.1. Values on top of the
pairwise comparisons are Bonferroni adjusted t-test p-values. The significance
of the overall ANOVA test was shown on the bottom left corner. . . . . . . . . . 89
5.7 The distribution of misclassified sequence types by PPR-Meta, DeepVirFinder, VIBRANT, geNomad and VirSorter2. For PPR-Meta, DeepVirFinder, VIBRANT and geNomad, only prokaryotic viruses are considered
as positive set, and for VirSorter2 both prokaryotic and eukaryotic viruses are
considered positive. To make the figure more visible, the range of y-axis is from
0 to 500 for PPR-Meta and DeepVirFinder, from 0 to 50 for VIBRANT and 0 to 80
for VirSorter2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.8 Performance of DeepMicroClass, PPR-Meta and geNomad on prokaryotic
host, prokaryotic virus and plasmid contig classification. Both the accuracy
and F1 score were compared based on 20 designed test datasets. The sequence
class composition of the 20 test datasets can be found in Table 5.1. Values on
top of the pairwise comparisons are Bonferroni adjusted t-test p-values. The
significance of the overall ANOVA test was shown on the bottom left corner. . . . 91
5.9 The distribution of misclassified sequence types by DeepMicroClass. The
sequence composition of these datasets can be found in Table 5.1. The total
number of errors in all possible dataset compositions can be at most 50 and we
can observe that the major source of the error is the misclassification between
prokaryotic hosts and plasmids. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.10 Correlation coefficients of Prokaryotic (a), Eukaryotic (b), ProkaryoticViral (c), and EukaryoticViral (d) sequence relative abundances of different
sequence classifiers. Coefficients highlighted in colors are significant ones
(p-value < 0.01). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Abstract
Metagenomic datasets exhibit unique characteristics, influenced either by experimental limitations or the intricacies of microbial communities. Addressing metagenomic sequences necessitates solutions to two principal challenges: identifying the sequences and understanding their
interrelationships. To address the identification challenge, we introduce DeepMicroClass, a deep
learning-based method that categorizes metagenomic contigs into five distinct sequence classes:
viruses targeting prokaryotic or eukaryotic hosts, eukaryotic or prokaryotic chromosomes, and
prokaryotic plasmids. To unravel the interactions among sequences in metagenomic datasets,
we introduce ContigNet, another deep learning-driven approach adept at predicting phage-host
associations, even when only shorter contigs are available. Both methods significantly outperform existing techniques. Together, they provide the scientific
community with two high-performance, user-centric tools.
Chapter 1
Introduction
Microbes are major players in global biogeochemical cycles owing to their high abundance, immense diversity, versatile metabolism, and ability to survive in any conceivable ecosystem on the
planet [1, 2]. Microbial communities are a collection of diverse biological entities, including
ribosome-encoding cellular organisms (REOs, cells, including eukaryotic, archaeal and bacterial organisms), capsid-encoding organisms (CEOs, e.g., viruses) that can only reproduce within
cells of REOs, and orphan replicons (plasmids, transposons, etc) that parasitize REOs or CEOs for
propagation [3]. Viruses and plasmids are extrachromosomal genetic elements that have important implications for the diversity and function of microbial communities owing to their roles in
transferring genetic materials between or within microbes. Thus, together with transposable elements, they are collectively referred to as mobile genetic elements (MGEs). Depending on where,
when and how metagenomic samples were collected, the microbial diversity within a sample can
range from a consortium of several dominant strains to a conglomerate of thousands of species.
Soon after the discovery of the small subunit rRNA gene (SSU) as a universally conserved phylogenetic marker [4], it became possible to assess the biodiversity and structure of environmental microbial communities
using SSU-based amplicon surveys [5, 6]. Microbial coding potentials can
be further probed using cloning libraries of natural microbial assemblages (e.g., cosmid and fosmid libraries) [6, 7, 8, 9, 10, 11, 12], which have been revolutionized by shotgun sequencing of
metagenomes to infer functional capabilities and ecological roles of uncultured microbes [13, 14,
15].
Shotgun sequencing of metagenomes involves the random fragmentation and sequencing of
all DNA present in an environmental sample, rather than targeting specific genes (as is done in
16S rRNA gene sequencing for bacterial community analyses) [13]. This approach provides a
comprehensive snapshot of the microbial diversity present in the sample, encompassing bacteria,
archaea, viruses, and eukaryotic microorganisms.
Shotgun sequencing offers several advantages over targeted sequencing
methods. One significant advantage is its ability to capture functional information about microbial communities. By sequencing all genetic material in a sample, researchers can identify not only which organisms
are present but also what metabolic capabilities they might possess [15]. This has proven invaluable in studying microbial roles in nutrient cycling, disease progression, and environmental
adaptations [16]. Furthermore, shotgun metagenomics allows for the discovery of novel genes,
metabolic pathways, and even entire organisms [17].
As metagenomic datasets have grown, so too have the catalogs of genes with unknown function and unclassified sequences, emphasizing the vast uncharted territory that still exists within
microbial ecosystems [18]. Therefore, the rapid expansion of metagenomic datasets presents
both opportunities and challenges for novel discoveries. The very nature of shotgun metagenomics, sequencing a complex mixture of genetic material from myriad organisms, poses challenges in data analysis. Addressing the associated biological problems requires strategic approaches. Sequence assembly, gene annotation, and comparative analysis
of metagenomic datasets, to name a few, all remain open problems that call for new algorithms
[19, 20, 21].
In this dissertation, we endeavor to tackle two primary challenges associated with metagenomic sequences: the classification of metagenomic contigs of unknown sources (who are they?)
and the discernment of viral-host associations based on these contigs (how do they interact?).
1.1 Computational sequence analysis
Before diving deeper into the two major topics of this dissertation, contig classification and
phage-host association prediction, it is crucial to introduce a foundational concept: computational sequence analysis. Metagenomic data consist primarily of sequence data generated from
the fragmented genomes in a microbiome. With the development of next-generation sequencing,
sequencing throughput has increased dramatically. As of 2022, the latest short-read sequencing platforms, such as the NovaSeq 6000 by Illumina, produce up to 6 terabases
per run. Even more impressively, newer long-read sequencing platforms, such as the PromethION by Oxford Nanopore, can generate up to 14 terabases per run with read lengths of up to 4 megabases
[22]. This volume of data far surpasses the capacity for manual processing,
making computer-aided analysis of sequences indispensable.
Fundamentally, computational sequence analysis involves the use of algorithms and statistical
techniques to derive meaningful information from DNA, RNA, and protein sequences [23]. In
the realm of computational sequence analysis, methodologies predominantly fall into one of two
categories: alignment-based methods, which use a technique called sequence alignment, and
alignment-free methods.
Sequence alignment, at its core, is the computational method of arranging sequences of DNA,
RNA, or protein to highlight regions of similarity that might arise from functional, structural, or
evolutionary relationships. The origin of sequence alignment traces back to the 1970s with the
pioneering work by Needleman and Wunsch, which introduced the concept of global alignment
of two sequences using dynamic programming [24]. This was subsequently expanded upon by
Smith and Waterman in the 1980s, who developed an algorithm for local sequence alignment,
ideal for identifying domains or motifs within long sequences [25]. Tools such as BLAST [26]
revolutionized the sequence analysis process, enabling rapid sequence comparisons against large
databases, such as RefSeq [27], GenBank [28], Ensembl [29], Uniprot [30], etc., leading to quick
identification and classification of unknown biological sequences based on sequence similarity.
However, even with more recent variations and optimizations, the time complexity of pairwise sequence alignment by dynamic programming is still O(mn), where m and n are the lengths
of the two sequences [31]. Sequence alignment can therefore be exceedingly time-consuming, given the wide range of sequence lengths in a metagenomic dataset. Multiple sequence alignment
is even more challenging, as it has been proven to be
NP-hard (no known algorithm solves all instances of the problem in polynomial time) [32].
Consequently, for a metagenomic dataset comprising millions of contigs, the runtime becomes
a practical bottleneck.
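To make the O(mn) cost concrete, the following is a minimal sketch of the dynamic programming recurrence behind global alignment in the style of Needleman and Wunsch. The scoring values (match = 1, mismatch = -1, linear gap = -1) and the function name are illustrative assumptions, not parameters taken from any tool cited here. The two nested loops over the sequence lengths are where the quadratic running time comes from.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score with a linear gap penalty (illustrative scoring)."""
    m, n = len(a), len(b)
    # prev[j] is the best score for aligning the first i-1 bases of a
    # with the first j bases of b; only two rows are kept in memory.
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, m + 1):          # m iterations
        curr = [i * gap] + [0] * n
        for j in range(1, n + 1):      # n iterations each, hence O(mn) time
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[n]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

Keeping only two rows reduces memory to O(n), but the running time stays quadratic, which is the bottleneck discussed above when millions of contigs must be compared.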
The nature of metagenomic datasets also complicates intra-dataset sequence comparison
with alignment-based methods. Although de novo assembly methods have been developed and many new
ones are emerging [33, 34, 35, 36, 37, 38, 39, 40], there is no guarantee that whole-genome
sequences can be reconstructed from the contigs. Even if two contigs originate from the
same organism, alignment cannot place them in the same bin if they come from different,
non-overlapping regions of the genome.
Another challenge encountered in metagenomic studies that rely on sequence alignment is the
inability to align a significant proportion of metagenomic reads to known sequences in databases.
The mapping rate can be influenced by various factors, one of which is the choice of tools employed for sequence alignment [41]. The subject of the study can also significantly sway the mapping rate. For instance, the human gut microbiome, because of its extensive study and relatively
stable composition, typically has a high mapping rate of over 80% [42]. Conversely, metagenomic
samples from open environments, such as air, soil, or ocean water, often exhibit a substantially
lower percentage of reads that align to established reference genomes. As an illustration, in
[43], despite the authors employing a comprehensive, pan-domain, and computationally intensive strategy, 43.74% of the sequences could not be classified into any kingdom. This vast reservoir
of unmapped reads represents a significant knowledge gap, hindering our full understanding of
these environments.
To overcome the various limitations of alignment-based methods, researchers have proposed
sequence analysis methods that do not require alignment. The concept of alignment-free sequence comparison was first proposed by Blaisdell in 1986 [44]. These methods mainly focus on
k-mers, short subsequences of DNA or RNA with a fixed length of k bases. For example, if k is
4, then a 4-mer is a sequence of 4 nucleotides, such as ATCG or GCAA. When comparing two
sequences, the k-mer frequencies of each sequence are computed and used to represent it,
without considering the order or structure of the k-mers in the original sequence. This
can be thought of as the "bag of words" approach from natural language processing (NLP),
adapted for biological sequences. Taking 4-mers as an example, there are a total of 4^4 = 256 possible 4-mers, so a sequence of any length can be represented as
a 256-dimensional vector, with each dimension holding the frequency of one 4-mer.
Given these vectors, different distance or dissimilarity
measures can be applied to analyze the relationship between two sequences. Traditional
distance measures, such as Euclidean, Manhattan, or Chebyshev distance, can
be used. There are also many measures designed specifically for sequence comparison, such as $d_2$ [44], $d_2^S$ and $d_2^*$ [45], and Jensen-Shannon divergence. Besides methods based on k-mer
frequencies, some model-based sequence comparison methods can also be categorized as alignment-free, such as WIsH [46], which builds
a Markov model from the reference sequence and calculates a score for the sequence of
interest.
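As a minimal sketch of the k-mer representation described above, the code below builds the 4^k-dimensional frequency vector of a sequence and compares two vectors with the three traditional distances named in the text. The function names are hypothetical, and skipping k-mers that contain non-ACGT characters is an assumption made for illustration.

```python
import math
from itertools import product


def kmer_frequencies(seq: str, k: int = 4) -> list:
    """Return the normalized k-mer frequency vector of seq (length 4**k)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])  # k-mers with non-ACGT bases are skipped
        if pos is not None:
            counts[pos] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


if __name__ == "__main__":
    x = kmer_frequencies("ACGTACGTACGTAAAC")
    y = kmer_frequencies("ACGTTTGTACGAAAAC")
    print(len(x), euclidean(x, y), manhattan(x, y), chebyshev(x, y))
```

Because the vector length depends only on k, sequences of very different lengths become directly comparable, which is what makes this representation convenient as input to the distance measures above or to a machine learning model.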
In addition to alignment-based or alignment-free methods, the advancement of machine learning and deep learning has opened up a new frontier. The methods based on machine learning
are difficult to categorize as either alignment-based or alignment-free. The training of a machine
learning model requires a reference dataset, so it is not "reference-free", but with adequate training, the model can learn the underlying distribution of nucleotides in a sequence, thus achieving
the same level of generalizability as alignment-free methods. The power of machine learning
transcends the alignment debate. In fact, several alignment methods have already been improved
with deep learning [37, 38, 39, 40]. For alignment-free methods, the k-mer frequencies can be
directly considered as the input vectors of a machine learning model, allowing the model to
adapt to the provided training set, thereby eliminating the need to develop a specialized distance
measure for sequence comparison. Machine learning can also combine alignment-based and alignment-free methods to improve performance on the task. For example,
VirHostMatcher-Net [47], a tool for assigning viruses to potential hosts in the training set, considers both kinds of measures. For alignment-based measures, it considers the alignment score of a viral sequence to the whole genome of the host and to the CRISPR
spacers of the host genome. For alignment-free measures, it considers the WIsH score [46] between the viral sequence and the host genome. The two novel software packages presented
in this dissertation, DeepMicroClass and ContigNet, are also based on machine learning and,
in our evaluations, outperform the traditional methods we compared against.
1.2 Sequence classification on metagenomics data
So far, the predominant method for analyzing metagenomic samples has been shotgun short-read
sequencing. While this technique generates a plethora of short DNA sequences, called reads,
sequences from all organisms in the environmental sample are intermixed and sequenced together, and therefore much of the original source information of a read is obscured during
the process. Overlapping reads are assembled into contiguous genomic regions called contigs.
By categorizing metagenomic contigs into distinct groups, the complexity of
metagenomes can be reduced to certain taxonomic levels, from coarse empires or domains (e.g.,
bacteria, eukarya, virus) to consensus species or strains. Metagenomic applications developed
to retrieve intended contigs can be briefly framed into two categories: supervised contig classification tools (e.g., viral contig predictors, plasmid contig predictors), and unsupervised contig
clustering tools (e.g., metagenomic binners, refer to [48] for a review of binning strategies).
Viruses are known to be prevalent in aquatic, soil [49, 50] and host-associated systems [51, 52],
and are presumably the most abundant biological entities on Earth [53, 54]. In marine systems,
viral lysis is crucial in redirecting carbon and energy flow to the lower trophic levels (termed
"Viral Shunt"), which have great implications for the global biogeochemical cycles [55, 56, 57].
As a result, metagenomic contig classification has been heavily focused on the prediction of viral sequences. Numerous software packages have been developed exclusively for viral contig
prediction. VirSorter [58], VirSorter2 [59], and VirFinder [60] are pioneering tools for identifying
viral contigs in metagenomic assemblies. VirSorter and VirSorter2 predict viral contigs based
on viral signals and categorize them into three tiers with different confidence levels. VirFinder
employs k-mer frequencies and logistic regression to classify contigs as either viral or host sequences, and outperforms VirSorter at different contig lengths, especially for shorter contigs
without detectable viral hallmark genes [60]. The success of k-mer based methods has inspired
the application of deep learning to viral sequence discovery, which led to the development of
DeepVirFinder [61] and PPR-Meta [62]. Both methods employ one-hot encoding to convert DNA
sequences into presence/absence matrices of nucleotides, and use neural networks to train virus-host classifiers at different contig lengths. In addition, PPR-Meta combines a nucleotide path and
a codon path in the encoding step, and classifies contigs into viruses, host chromosomes and plasmids [62]. Another noteworthy addition is VIBRANT [63], a recently published tool that uses
neural networks to distinguish prokaryotic dsDNA, ssDNA and RNA viruses based on “v-score”
metrics, which were calculated from significant protein hits to a collection of Hidden Markov
Model (HMM) profiles derived from public databases. It is worth noting that most of the aforementioned tools target bacteriophages. Eukaryotic virus predictors have also emerged in recent
years. One such tool is HostTaxonPredictor (HTP) [64], which utilizes four machine learning
methods to classify viral sequences as eukaryotic viruses or bacteriophages based on sequence
features including mono- and dinucleotide absolute frequencies and di- and trinucleotide relative frequencies.
Apart from viruses, plasmids represent another major type of mobile genetic element (MGE) heavily studied in environmental microbiomes, particularly in host-associated systems and wastewater treatment facilities.
By transferring among hosts or exchanging genes with their host genomes, plasmids facilitate
the acquisition of new traits by hosts [65]. Thus, by carrying genes related to resource utilization
[66], antibiotic/metal resistance [67, 68], and defense systems [69, 70], plasmids contribute to the
genetic and phenotypic plasticity of hosts and increase their fitness in changing environments. Besides PPR-Meta, multiple dedicated tools have been developed, such as cBar [71], PlasFlow [72], PlaScope [73] and PlasClass [74]. In principle, PlaScope employs a similarity search
approach based on species-specific databases, while cBar, PlasFlow and PlasClass use differential
k-mer frequencies with different machine learning methods.
Beyond viruses and plasmids, there is a paucity of applications targeting the classification
of eukaryotic contigs from metagenomes, while eukaryotes are indispensable to the ecological
functioning of natural microbial communities. Alignment-based applications such as Kaiju [75]
and MetaEuk [76] search for close matches in reference databases, thus can be used to assign
reads or contigs to taxonomic groups. However, the accuracy of these applications depends on the
completeness of reference databases, and their performance in classifying eukaryotic contigs is questionable given the lack of a comprehensive microbial eukaryotic database [77]. EukRep [78] is a
reference-independent application that uses k-mer frequencies and a linear SVM to classify metagenomic contigs into eukaryotic and prokaryotic sequences. It has been shown that, when EukRep is combined
with conventional metagenomic and metatranscriptomic analyses, such as reconstructing eukaryotic bins and gene co-abundance analysis, biological and ecological insights can be readily
obtained for uncultured eukaryotes [78, 79]. Other alignment-independent applications can also
identify eukaryotic sequences. Tiara [80] is a deep learning-based method for eukaryotic sequence identification in metagenomes, and Whokaryote [81] is a random forest classifier
that uses gene-structure-based features to distinguish eukaryotic from prokaryotic sequences.
Despite the significant progress made in the past years, no single tool can classify eukaryotic/prokaryotic genomes, eukaryotic/prokaryotic viruses, and plasmids in one shot.
In fact, all of these binary classifiers suffer from sequence types that are not modeled: for example, eukaryotic contigs or plasmids can be misclassified as viruses by viral predictors, and viral contigs
can be misclassified as plasmids by plasmid predictors. Thus, to achieve a more reliable
classification of the target sequences, one has to run several of these tools consecutively. Each
step is limited by the sensitivity and specificity of the tool used, and the errors propagate
through the workflow, resulting in less accurate and biased classification. During my doctoral study, we developed DeepMicroClass, a versatile multi-class metagenomic contig classifier
based on convolutional neural networks (CNN) that avoids the aforementioned problems.
The implementation of DeepMicroClass and the code for the experiments described in this dissertation can
be accessed at https://github.com/chengsly/DeepMicroClass. In this dissertation, we will show
that DeepMicroClass outperforms the existing tools we benchmarked in precision and sensitivity across all test
datasets with different sequence-type compositions. More importantly, DeepMicroClass classifies all sequence types simultaneously, which will greatly reduce
the time and computation resource usage compared to the conventional workflow of chaining
a set of different predictors. Using a coastal marine metagenomic dataset as a case study, we
also showed that DeepMicroClass captures more eukaryotic contigs than two alignment-based
classifiers, Kaiju and MetaEuk.
1.3 Phage-bacterial association prediction in metagenomic dataset
In addition to source information, shotgun short-read sequencing, the primary method for sequencing metagenomic samples, often loses information about the association between phages and their bacterial hosts [82]. As highlighted in Section 1.2, the
interaction between phages and their bacterial hosts is essential to understand [83], not only
from an ecological perspective but also for potential applications in medicine, agriculture, and
biotechnology [84, 85].
Experimentally, the association between a phage and its bacterial host is typically identified
by culturing known bacteria and observing telltale signs, such as bacterial lysis attributable to
phage infection or the direct interaction between the phage and the bacterium [86, 87]. However, this approach is incompatible with shotgun-based metagenomic studies. In such studies,
both phages and bacteria within a metagenomic sample are fragmented into small pieces. Consequently, not only are there no intact phages to test against potential bacterial hosts, but any associations between phages and unidentified bacteria are also obliterated in the process [16]. Given
these limitations, computational methods have become indispensable for identifying phage-host
associations within metagenomic samples [88].
Several computational methods have been developed to assign viruses to their putative hosts
[89, 46, 47, 90, 91, 92, 93, 94, 95]. Most of them assign phages to hosts with known bacterial
genomes or at certain taxonomic levels [47, 90, 91, 93, 94, 95]; the exceptions are VirHostMatcher (VHM) [89]
and WIsH [46].
On the other hand, genomes of novel uncultured microbes are usually incomplete due to
limitations of shotgun sequencing. Contigs of novel genomes recovered from metagenomes,
and novel genomic islands of known microbes, are absent or incomplete in current reference
databases, preventing host predictions using alignment-based methods. Similarly, most viral
genomes recovered from viromes are even more fragmented and unmappable, making the whole
viral genome based host prediction impractical.
The most widely used methods for phage-host association prediction are based on k-mer frequency or oligonucleotide frequency (ONF), alignment-based scores, matching of CRISPR spacers, or Markov models. For example, VirHostMatcher [89] uses the dissimilarity between oligonucleotide frequencies of the phage and bacterial sequences to predict their associations. The authors investigated the prediction accuracy based on Euclidean distance, Manhattan distance, and
$d_2^*$ dissimilarity [45, 96, 97], and showed that $d_2^*$ yielded the highest prediction accuracy [89].
WIsH [46] is another tool that trained a Markov model for each candidate bacterial genome and
calculated the likelihood of a phage sequence under the trained Markov model. This method was
also reported to have better performance than $d_2^*$ when fragments were used in the prediction
[46]. VirHostMatcher-Net [47] uses logistic regression to take advantage of multiple dissimilarity measures, including the aforementioned $d_2^*$ and WIsH scores, alignment-based scores, CRISPR
spacer matching, and virus similarity to predict phage-host associations. RaFAH [91] is another
recently developed method for phage-host association prediction by constructing Hidden Markov
Models (HMMs) according to the protein clusters identified from the pVOG [98] database and
then using a random forest model to predict phage-host associations.
Deep-learning based methods were also developed in recent years for predicting phage-host
associations. vHULK [90] applied deep neural networks to the phage-host association prediction
problem using features extracted from phage sequences as input. The more recently proposed method
HostG [94] uses a graph convolutional network to take both phage-host and phage-phage relationships into consideration, with alignment scores used to aid the construction of the
relationships. However, these predefined features extracted from sequences alone are not suitable
for training complex models to predict contig-contig relationships.
To address the aforementioned problem, during my doctoral study, we developed ContigNet,
a convolutional neural network (CNN)-based model that predicts the association status between phage and bacterial contigs and is, to our knowledge, the first method to investigate the relationship between two
contigs directly. The results showed that ContigNet significantly improved phage-host contig association prediction compared to currently available methods. Using the same
ContigNet architecture, we also tested it on an entirely different plasmid-host dataset, and the results
revealed its promise as an approach for predicting the hosts of mobile genetic elements
in general.
1.4 Dissertation outline
The dissertation can be split into 4 chapters. Chapter 1, which you are currently reading, delves
into the foundational concepts of metagenomics. This chapter delineates the background of
metagenomics studies and clearly articulates the two primary challenges this dissertation seeks
to address. Additionally, it provides an overview of the current methodologies and approaches
undertaken to address these challenges.
In Chapter 2, the focus shifts to the novel deep learning-based software package named DeepMicroClass. This innovative tool is meticulously designed to classify metagenomics contigs, offering a novel and easy to use approach to sequence classification.
After classifying contigs, the next question is whether the contigs interact with
each other. The third chapter introduces ContigNet, a pioneering software package crafted to
computationally predict associations between viral and bacterial contigs. Notably, its efficacy is
also demonstrated in predicting associations between plasmids and bacterial contigs, underlining
the model’s impressive adaptability and generalizability.
Finally, Chapter 4 encapsulates the significant contributions of this dissertation. Beyond reflecting on the accomplishments, this chapter looks forward, speculating on future avenues of research in computational sequence analysis. Such potential trajectories are contextualized within
the realm of recent advancements in experimental methodologies.
1.5 Authors and contributors to the dissertation
I was fortunate to be jointly supervised by Drs. Fengzhu Sun and Jed Fuhrman. Dr. Shengwei Hou
helped with the biological knowledge on different types of microbes and phage-host interactions,
as well as applications of the tools to metagenomic samples. I developed the computational tools
in this dissertation and applied them to analyze metagenomic samples.
Chapter 2
Metagenomics contig classification with DeepMicroClass
2.1 Introduction
In this chapter, we describe our contributions to a fundamental question arising from the
characteristics of shotgun metagenomic datasets: given its DNA sequence, what is
the source of a contig within the dataset? We introduce DeepMicroClass, a deep learning-based
software package we developed, to tackle this question. It classifies contigs, irrespective of their length, into one of five distinct sequence classes: viruses that infect prokaryotic
hosts, viruses targeting eukaryotic hosts, eukaryotic chromosomes, prokaryotic chromosomes,
and prokaryotic plasmids. With DeepMicroClass, we aim to address the challenges that shotgun sequencing poses for metagenomic datasets. By accounting for the
most prevalent sources of contigs simultaneously, we sidestepped the error accumulation that
typically arises when applying multiple binary metagenomic classifiers in succession. Moreover,
by leveraging the most recent publicly available datasets and fine-tuning our deep learning algorithm, we ensured that the model was both generalizable and superior in performance—even
when compared against state-of-the-art software released in 2023. Our confidence in DeepMicroClass stems from a rigorously devised performance evaluation protocol, meticulously designed
to minimize data leakage between training and testing datasets. In our assessments, DeepMicroClass achieved area under the receiver operating characteristic curve (AUC) scores exceeding
0.98 for most sequence classes, with the exception of distinguishing prokaryotic chromosomes
from plasmids. When benchmarked against 20 test datasets of varied sequence class compositions, DeepMicroClass exhibited average accuracy scores of ~0.99, ~0.97, and ~0.99 for the classification of eukaryotic, plasmid, and viral contigs, respectively. These figures are demonstrably
superior to other leading individual predictors. To further demonstrate its utility, we employed
a 1-300 µm daily time-series metagenomic dataset procured from the coastal regions of Southern
California. Our findings indicated that, with DeepMicroClass’s classification, the proportion of
metagenomic reads attributed to eukaryotic contigs could potentially double when compared to
results from other alignment-based classifiers. With its inclusive modeling and unprecedented
performance, we expect DeepMicroClass will be an invaluable addition to the toolbox of microbial ecologists, and will promote metagenomic studies of under-appreciated sequence types.
From a usability standpoint, DeepMicroClass has been structured to be intuitive and straightforward. Researchers, irrespective of their expertise, can conveniently install the tool using a simple
command: pip install deepmicroclass. To classify their available contigs, another single
command, DeepMicroClass predict -i input_fasta -o output_dir, will suffice.
2.2 Materials and Methods
2.2.1 Dataset preparation
We collected 5 classes of sequences: prokaryotic host, eukaryotic host, plasmid, prokaryotic viral and eukaryotic viral sequences. For prokaryotic chromosome sequences, we downloaded all
the prokaryotic genomes, including all the bacteria and archaea sequences from NCBI RefSeq on
Aug 22, 2022. The prokaryotic genomes were cleaned up by removing all sequences annotated as "Plasmid" in the assembly reports; sequences not annotated as plasmids but
sharing a sequence ID with a record in the plasmid dataset were also removed from the prokaryotic genomes.
The resulting sample set contains 40,208 sequences. The eukaryotic host sequence database includes eukaryotic sequences from the eukaryotic taxa used by Kaiju [75] and the PR2 database
[99]. Specifically, we selected microbial eukaryotic genomes under the taxon names "Amoebozoa",
"Apusozoa", "Cryptophyceae", "Euglenozoa", "Stramenopiles", "Alveolata", "Rhizaria", "Haptista",
"Heterolobosea", "Metamonada", "Rhodophyta", "Chlorophyta", and "Glaucocystophyceae" using
genome_updater (available at https://github.com/pirovc/genome_updater) on Aug 22, 2022. A
total of 612 eukaryotic sequences were downloaded. In addition to these eukaryotic genomes,
we also included 32,073,625 eukaryotic host sequences from the 678 marine eukaryotic transcriptomic re-assemblies [100] of cultured samples generated by the MMETSP project [77], which
included 306 pelagic and endosymbiotic marine eukaryotic species representing more than 40
phyla.
Plasmid sequences and corresponding metadata were retrieved from PLSDB [101] released
on Jun 23, 2021. The dataset contains 34,513 plasmid records. Viral sequences and associated
metadata were retrieved from Virus-Host DB [102] released on Jun 1, 2022, which contains 17,357
nucleic acid records, including 5,209 prokaryotic viruses and 12,148 eukaryotic viruses. In all
downloaded sequences, we further cross compared sequence IDs in each class, and any sequence
with an identical ID occurring in more than one class was removed so that we could reduce
potential erroneous annotation from the source database.
2.2.2 Benchmark dataset preparation
Sequences were split into two parts according to the dates submitted to NCBI, with Jan 1, 2020 as a
cutoff date. That is, sequences submitted before Jan 1, 2020 were used for training and validation,
with 80% as training and 20% as validation using stratified split, and the sequences submitted
after this date were used for testing.
To control the sequence similarity across training, validation, and test sets, the Mash distance,
as outlined by Ondov et al. [103], was adopted. Sequences within the test set exhibiting a Mash
distance less than 0.1 in relation to any sequence in the training or validation sets were excluded
from the test set. On the other hand, Virus-Host DB derived viral sequences [102] and MMETSP
derived eukaryotic sequences were not dated. Therefore, these sequences were randomly split
into training, validation and test sets with the proportions of 60%, 20% and 20%, respectively.
Similarly, sequences were removed from the test set when their Mash distance to any
sequence in the training or validation sets was less than 0.1.
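The filtering step can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in this work: the distance below applies the Mash distance formula, D = -(1/k) ln(2j / (1 + j)), to the exact k-mer Jaccard index j, whereas Mash itself estimates j from MinHash sketches of canonical k-mers. The function names, and the choice of k = 21, are hypothetical.

```python
import math


def kmer_set(seq: str, k: int) -> set:
    """All distinct k-mers of seq (no canonicalization, unlike Mash)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_like_distance(a: str, b: str, k: int = 21) -> float:
    """Mash-style distance computed from the exact Jaccard index (no sketching)."""
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    union = len(sa | sb)
    j = len(sa & sb) / union if union else 0.0
    if j == 0.0:
        return 1.0  # no shared k-mers: maximal distance
    return min(1.0, -math.log(2 * j / (1 + j)) / k)


def filter_test_set(test, reference, threshold=0.1, distance=mash_like_distance):
    """Keep test sequences whose distance to every reference sequence is >= threshold."""
    return [t for t in test
            if all(distance(t, r) >= threshold for r in reference)]


if __name__ == "__main__":
    train = ["ACGTACGTAC"]
    test = ["ACGTACGTAC", "TTTTTTTTTT"]
    d4 = lambda a, b: mash_like_distance(a, b, k=4)  # small k for the toy data
    print(filter_test_set(test, train, distance=d4))
```

Passing the distance as a parameter keeps the filter independent of how the distance is obtained, so precomputed Mash output could be substituted for the toy function.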
Given the inherent obscurity surrounding the composition of a metagenomic sample, imbalances among sequence classes may inadvertently influence classifier performance. Moreover,
most existing methods focus on the classification of singular sequence classes, such as eukaryotic hosts, prokaryotic viruses, or plasmids. Some tools are more versatile, capable of classifying
two or more sequence classes. For instance, PPR-Meta [62] is able to predict prokaryotic hosts,
phages and plasmids. In order to compare with tools developed for a specific sequence class
and for multiple sequence classes, we formulated 20 equal-sized benchmark datasets (1,000 contigs each, every contig 10 kb long)
with variable compositions of the 5 sequence classes.
The structure of these datasets was meticulously planned. Briefly, the fractions of PROK
(including prokaryotic hosts, prokaryotic viruses, and plasmids) to EUK (including eukaryotic
hosts and eukaryotic viruses) sequences were firstly selected from the ratios of 9:1, 7:3, 5:5, 3:7,
and 1:9. Then for each fixed PROK:EUK ratio, the PROK fraction was further split into prokaryotic
hosts, prokaryotic viruses and plasmids based on the ratios of 5:1:1, 4:1:1, 3:1:1, and 2:1:1; and the
EUK fraction was further split into eukaryotic hosts and eukaryotic viruses according to the ratio
of 5:1, 4:1, 3:1, and 2:1. Finally, the corresponding number of sequences was drawn from the test
sequence pool for each class using the ratios specified above. The final sequence source
composition of the 20 test datasets is tabulated in Table 5.1, and a visualization of the composition
can be found in Figure 2.1.
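The arithmetic behind the 20 compositions can be sketched as below. Two points are assumptions made for illustration: that the four PROK sub-ratios are paired with the four EUK sub-ratios in the order listed (5:1:1 with 5:1, and so on), which is consistent with 5 x 4 = 20 datasets, and that any rounding remainder is assigned to the host class. The actual counts are those tabulated in Table 5.1.

```python
from itertools import product

PROK_EUK_RATIOS = [(9, 1), (7, 3), (5, 5), (3, 7), (1, 9)]
# Assumed pairing of PROK (host:virus:plasmid) and EUK (host:virus) sub-ratios.
SUB_RATIOS = [((5, 1, 1), (5, 1)), ((4, 1, 1), (4, 1)),
              ((3, 1, 1), (3, 1)), ((2, 1, 1), (2, 1))]


def split_by_ratio(total: int, ratios) -> list:
    """Split an integer total by ratios; the rounding remainder goes to the first part."""
    denom = sum(ratios)
    parts = [total * r // denom for r in ratios]
    parts[0] += total - sum(parts)
    return parts


def benchmark_compositions(n_contigs: int = 1000) -> list:
    """Return the per-class contig counts for each of the 20 benchmark datasets."""
    datasets = []
    for (p, e), (prok_r, euk_r) in product(PROK_EUK_RATIOS, SUB_RATIOS):
        n_prok = n_contigs * p // (p + e)
        n_euk = n_contigs - n_prok
        prok, prok_virus, plasmid = split_by_ratio(n_prok, prok_r)
        euk, euk_virus = split_by_ratio(n_euk, euk_r)
        datasets.append({"Prok": prok, "ProkVirus": prok_virus, "Plasmid": plasmid,
                         "Euk": euk, "EukVirus": euk_virus})
    return datasets


if __name__ == "__main__":
    for i, comp in enumerate(benchmark_compositions(), start=1):
        print(f"DS_{i}", comp)
```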
[Figure 2.1 plot: x-axis, test datasets DS_1 to DS_20; y-axis, Percentage (%) from 0 to 100; legend: Prok, ProkVirus, Plasmid, Euk, EukVirus.]
Figure 2.1: Sequence source composition of the 20 test datasets. Twenty equal-sized benchmark
datasets were constructed. The fractions of PROK (including prokaryotic hosts, prokaryotic
viruses, and plasmids) to EUK (including eukaryotic hosts and eukaryotic viruses) sequences
were determined using the ratios of 9:1, 7:3, 5:5, 3:7, and 1:9. For each fixed PROK:EUK ratio, the
PROK fraction was further split into prokaryotic hosts, prokaryotic viruses and plasmids based
on the ratios of 5:1:1, 4:1:1, 3:1:1, and 2:1:1; and the EUK fraction was further split into eukaryotic
hosts and eukaryotic viruses according to the ratio of 5:1, 4:1, 3:1, and 2:1.
2.2.3 Model design and training
DeepMicroClass is constructed with a unique di-path convolutional neural network (CNN), diligently designed to classify input sequences into one of the five aforementioned categories. To
make use of information buried in the DNA sequences at both nucleotide base level and codon
level, the architecture of the network consists of a base-path and a codon-path, aiming at parsing
information specifically for each level.
For the base-path, the first step involves encoding the nucleotide sequence as a one-hot matrix. Here, the nucleotides A, C, G, and T are correspondingly translated into vectors: [1, 0, 0, 0],
[0, 1, 0, 0], [0, 0, 1, 0], and [0, 0, 0, 1], respectively. Notably, any non-standard (non-ACGT) nucleotide is represented as [0, 0, 0, 0]. Additionally, with such encoding method, obtaining the
one-hot encoding for the reverse complementary strand is straightforward: it is obtained by simply
reversing the forward one-hot matrix along both its rows and its columns.
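The encoding and the reverse-complement shortcut can be sketched in plain Python as follows. The function names are hypothetical, and nested lists are used only to keep the example dependency-free. Reversing the rows reverses the sequence; reversing the columns swaps A with T and C with G because of the A, C, G, T column order, while all-zero rows for non-ACGT bases are unchanged.

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def one_hot(seq: str) -> list:
    """Encode a DNA sequence as an L x 4 matrix; non-ACGT bases become [0, 0, 0, 0]."""
    matrix = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        idx = BASE_INDEX.get(base)
        if idx is not None:
            row[idx] = 1
        matrix.append(row)
    return matrix


def reverse_complement_one_hot(matrix: list) -> list:
    """Reverse-complement encoding: flip the matrix along both rows and columns."""
    return [row[::-1] for row in matrix[::-1]]


def reverse_complement(seq: str) -> str:
    """String-level reverse complement, used here only to check the matrix shortcut."""
    return "".join(COMPLEMENT.get(b, "N") for b in reversed(seq.upper()))


if __name__ == "__main__":
    seq = "ACGTNAAG"
    flipped = reverse_complement_one_hot(one_hot(seq))
    print(flipped == one_hot(reverse_complement(seq)))
```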
In parallel, the codon-path requires an additional layer of processing. The forward or reverse
base-path matrix undergoes a transformation into three distinct 64-dimensional one-hot matrices.
These matrices are created based on three distinct reading frames. Subsequently, these three
matrices are concatenated to form a single matrix. Thus, for each strand of an input contig, the
di-path efficiently incorporates information at both the base and codon levels. This encoded
information is then passed through the subsequent convolutional layers. An illustrative overview
of DeepMicroClass’s intricate network architecture is provided in Figure 2.2.
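A rough sketch of the codon-level encoding is given below. Several details are assumptions made for illustration and may differ from the DeepMicroClass implementation: the encoding is computed from the sequence string rather than from the base-path matrix, codons are indexed in lexicographic A, C, G, T order, codons containing a non-ACGT base are encoded as all zeros, incomplete trailing codons are dropped, and the three reading-frame matrices are concatenated along the length axis.

```python
from itertools import product

# Assumed codon ordering: AAA = 0, AAC = 1, ..., TTT = 63.
CODON_INDEX = {"".join(c): i for i, c in enumerate(product("ACGT", repeat=3))}


def frame_one_hot(seq: str, frame: int) -> list:
    """One-hot encode the codons of one reading frame as rows of length 64."""
    rows = []
    for start in range(frame, len(seq) - 2, 3):
        row = [0] * 64
        idx = CODON_INDEX.get(seq[start:start + 3].upper())
        if idx is not None:  # codons with non-ACGT bases stay all-zero
            row[idx] = 1
        rows.append(row)
    return rows


def codon_path_input(seq: str) -> list:
    """Concatenate the three reading-frame matrices along the length axis (assumed)."""
    return [row for frame in range(3) for row in frame_one_hot(seq, frame)]


if __name__ == "__main__":
    encoded = codon_path_input("ACGTAC")
    print(len(encoded), len(encoded[0]), encoded[0].index(1))
```

For the 6-base toy sequence, frame 0 contributes the codons ACG and TAC, frame 1 contributes CGT, and frame 2 contributes GTA, so the concatenated matrix has four rows of width 64.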
[Figure 2.2 diagram: the forward strand (L x 4) and its reverse complement (L x 4) are each passed through the convolutional layers. Base-path: Convolution (kernel (6, 4), 64 filters), Convolution (kernel 3, 128 filters) and Convolution (kernel 3, 256 filters), with PReLU, AvgPool(3) and BatchNorm blocks, ending in PReLU, GlobalAvgPool and BatchNorm (256 x 1). Codon-path: Codon Transformer, then Convolution (kernel (2, 64), 64 filters), Convolution (kernel 3, 128 filters) and Convolution (kernel 3, 256 filters), with PReLU, AvgPool(3) and BatchNorm blocks, ending in PReLU, GlobalAvgPool and BatchNorm (256 x 1). The path outputs are concatenated and passed to linear layers, giving a 5 x 1 output.]
Figure 2.2: Schematic representation of the multi-class CNN structure used in this study.
The network has two convolutional paths: a base-path that encodes nucleotide-level information
and a codon-path that encodes codon-level information. The hyperparameters used for each convolutional layer are marked on the figure. For each strand, the output dimensions of the base- and
codon-paths are 256 and 256, respectively. The di-path outputs of forward and reverse strands
are concatenated into a 1024-dimensional vector, which is used as the input of following linear
layers. The final linear layer outputs a 5-dimensional vector, with each dimension indicating the
probability of the input contig being eukaryotic host, eukaryotic virus, plasmid, prokaryotic host
and prokaryotic virus.
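The dimension bookkeeping in the caption can be checked with a small NumPy sketch. The four 256-dimensional vectors below are random stand-ins for the path outputs, a single linear layer with random weights replaces the unspecified linear layers, and the class order simply follows the caption; all of these are assumptions for illustration, not the trained model.

```python
import numpy as np

# Class order as listed in the caption (the actual output index order is an assumption).
CLASSES = ["eukaryotic host", "eukaryotic virus", "plasmid",
           "prokaryotic host", "prokaryotic virus"]

def softmax(z):
    z = z - z.max()          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
# Random stand-ins for the base- and codon-path outputs of both strands.
fwd_base, fwd_codon, rev_base, rev_codon = (rng.normal(size=256) for _ in range(4))

features = np.concatenate([fwd_base, fwd_codon, rev_base, rev_codon])  # 4 x 256 = 1024
W = rng.normal(scale=0.01, size=(len(CLASSES), features.size))         # hypothetical weights
b = np.zeros(len(CLASSES))
probs = softmax(W @ features + b)

print(features.shape, probs.shape)
print(dict(zip(CLASSES, probs.round(3))))
```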
The di-path CNN model was trained by minimizing the cross-entropy loss, which quantifies
the difference between predicted class probabilities and the true class labels of input sequences.
Our training was run for 3000 epochs with a learning rate of 0.001 and batch size of 256. During
each epoch, sequences from the whole training dataset were first subsampled using weighted
random sampling, with no replacement within an epoch. The weight for samples in each class i
is defined as
$$w_i = \frac{\text{number of samples}}{5 \times \text{number of samples in class } i}$$
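As a worked example of this weighting, the sketch below computes the class weights for a toy label list. The class counts are invented for illustration, and `class_weights` is a hypothetical helper name.

```python
from collections import Counter

def class_weights(labels, n_classes=5):
    """w_i = (number of samples) / (n_classes * number of samples in class i)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (n_classes * n) for c, n in counts.items()}

# Toy, deliberately imbalanced label list (hypothetical counts).
labels = (["Prok"] * 60 + ["ProkVir"] * 20 + ["Plasmid"] * 10
          + ["Euk"] * 8 + ["EukVir"] * 2)
weights = class_weights(labels)

# Every class carries the same total weight, total / 5, so each class is
# equally likely to be drawn under weighted sampling.
class_mass = {c: weights[c] * n for c, n in Counter(labels).items()}
print(weights)
print(class_mass)
```

With this definition a rare class receives a proportionally larger per-sample weight, which compensates for class imbalance in the training set.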
After the sequences were sampled, a contig length was randomly chosen from 500 bps, 1 kbps,
2 kbps, 3 kbps and 5 kbps, and a contig with the given length was sampled from the original
sequence to construct the batch. During the evaluation or testing phase, sequences with lengths
shorter than 5 kbps were fed directly to the model for classification. For sequences with lengths
exceeding 5 kbps, each input sequence was first split into multiple non-overlapping 5 kbps
chunks, then scores given by the model for each chunk were collected, and the mean score of all
chunks was used as the final output of the input sequence.
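The training-time cropping and the chunk-averaging rule used at evaluation time can be sketched as follows. The function names and the `score_fn` callable (standing in for the trained model) are hypothetical. Two details are assumptions not stated in the text: a trailing chunk shorter than 5 kbps is kept and scored, and a training sequence shorter than the drawn length is used whole.

```python
import numpy as np

CHUNK = 5000
TRAIN_LENGTHS = (500, 1000, 2000, 3000, 5000)

def random_crop(seq, rng):
    """Draw one of the training lengths and cut a random window of that length."""
    length = int(rng.choice(TRAIN_LENGTHS))
    if len(seq) <= length:
        return seq  # assumption: shorter sequences are used as they are
    start = int(rng.integers(0, len(seq) - length + 1))
    return seq[start:start + length]

def split_chunks(seq, chunk=CHUNK):
    """Split a long sequence into non-overlapping chunks; short ones pass through."""
    if len(seq) <= chunk:
        return [seq]
    return [seq[i:i + chunk] for i in range(0, len(seq), chunk)]

def predict_contig(seq, score_fn, chunk=CHUNK):
    """Average the per-chunk score vectors returned by score_fn."""
    scores = np.stack([score_fn(c) for c in split_chunks(seq, chunk)])
    return scores.mean(axis=0)

# Toy scorer: reports the chunk length in the first slot of a 5-vector.
toy_score = lambda c: np.array([len(c), 0, 0, 0, 0], dtype=float)
print(predict_contig("A" * 12000, toy_score))
```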
2.2.4 Use-case data preparation and analysis
The daily time-series metagenomic samples were taken off the coast of Southern California using
an Environmental Sample Processor (ESP), and the 1 µm A/E filters (Pall Gelman) collected during the day were used for DNA extraction as described previously [104]. Metagenomic libraries
were prepared using the Ovation® Ultralow V2 DNA-Seq library preparation kit (NuGEN, Tecan
Genomics) under the manufacturer’s instruction using 10 ng of starting DNA and amplified for
13 PCR cycles. Metagenomic libraries were sequenced on an Illumina NovaSeq 6000 platform
(2 × 150 bp chemistries) at Berry Genomics Co. (Beijing, China).
After demultiplexing, the raw reads were first checked with FastQC v0.11.2, then adapter and
low-quality regions were trimmed using fastp v0.21.0 [105] with the following parameters: -q 20
-u 20 -l 30 --cut_tail -W 4 -M 20 -c. PhiX174 and sequencing artifacts were removed using
bbduk.sh, and human genome sequences were removed using bbmap.sh with default parameters;
both scripts are part of the BBTools package v37.24 (https://jgi.doe.gov/data-and-tools/bbtools).
Metagenomic samples were assembled independently using metaSPAdes v3.13.0 [106]
with a custom kmer set (-k 21,33,55,77,99,127). The assembled contigs were further coassembled
as previously described [107]. Briefly, all the contigs were pooled and sorted into short (<2 kb)
or long (≥2 kb) contig sets. The short contig set was first coassembled using Newbler v2.9 [108],
and the resulting ≥2 kb contigs were further coassembled with the long contig set using minimus2 [109].
Minimum overlap thresholds of 80 nt and 200 nt were set for Newbler and minimus2, respectively. For both
coassembly steps, a minimum identity cutoff of 0.98 was applied. After coassembly, contigs were
further dereplicated at 0.98 identity using cd-hit v4.6.8 [110], and the resulting contigs were used as
reference contigs for sequence classification and read recruitment analysis.
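For reproducibility, the read-trimming step above can be expressed as an argument list for fastp. The trimming parameters are those quoted in the text; the input and output file names and the helper name `fastp_command` are hypothetical placeholders, and the paired-end options (-i, -I, -o, -O) are an assumption based on the 2 × 150 bp sequencing described earlier.

```python
def fastp_command(r1, r2, out1, out2):
    """Build the fastp argument list with the trimming parameters from the text."""
    return [
        "fastp",
        "-i", r1, "-I", r2,        # paired-end inputs (hypothetical file names)
        "-o", out1, "-O", out2,    # trimmed outputs (hypothetical file names)
        "-q", "20",                # base quality threshold
        "-u", "20",                # max percentage of unqualified bases
        "-l", "30",                # minimum read length
        "--cut_tail",              # sliding-window trimming from the 3' end
        "-W", "4",                 # window size
        "-M", "20",                # mean quality required within the window
        "-c",                      # overlap-based base correction for paired reads
    ]

cmd = fastp_command("sample_R1.fastq.gz", "sample_R2.fastq.gz",
                    "clean_R1.fastq.gz", "clean_R2.fastq.gz")
print(" ".join(cmd))
```

The list could be passed to `subprocess.run(cmd, check=True)` on a system where fastp is installed.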
Reference contigs were classified using Kaiju v1.7.3 [75] and MetaEuk v1 [76], as well as
DeepMicroClass v0.1.0 (in hybrid mode), and read counts assigned to each sequence class were summarized using custom Python scripts. Reads were mapped to reference contigs using bwa mem
v0.7.17 with default parameters, and the number of reads aligned >30 nt to reference contigs
was counted using bamcov v0.1 (available at https://github.com/fbreitwieser/bamcov) with default parameters.
2.3 Results
2.3.1 A CNN-based multi-class classifier
Identifying contigs of microbial eukaryotes and the viruses infecting them from metagenomic
assemblies is crucial for gaining a better understanding of their ecological roles. However, current state-of-the-art tools often fail to recognize most eukaryotic viruses and their
hosts. Here two commonly used viral contig predictors, VirFinder [60] and PPR-Meta [62], were
evaluated based on their predicted viral scores. As expected, both predictors gave high scores to
prokaryotic viral sequences and low scores to prokaryotic host sequences. However, the scores
for eukaryotic host and eukaryotic viral sequences were more evenly distributed (Figure 2.3),
revealing an insufficient accuracy in classifying these sequence classes. Out of 500 randomly
subsampled genomic sequences for each sequence type of prokaryotes, prokaryotic viruses, microbial eukaryotes, and eukaryotic viruses downloaded from NCBI, 454 prokaryotic viruses and
85 prokaryotic hosts had VirFinder-scores (VF-scores) above 0.5, while 238 eukaryotic viruses
and 157 eukaryotic hosts had VF-scores above this value (Figure 2.3a). A similar trend can be observed for PPR-Meta (Figure 2.3b), confirming these tools are not adequately equipped to handle
eukaryotic viral and host sequences. This emphasizes the need for novel predictors that consider
more sequence types during the model training process.
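The counts above translate directly into rates at a 0.5 score cutoff, as the short calculation below shows. Treating 0.5 as the decision threshold is an assumption made here for illustration.

```python
# Sequences with VF-score above 0.5, out of 500 per sequence type (from the text).
counts_above_half = {
    "prokaryotic virus": 454,
    "prokaryotic host": 85,
    "eukaryotic virus": 238,
    "eukaryotic host": 157,
}
N = 500

fractions = {name: n / N for name, n in counts_above_half.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} scored above 0.5")
```

At this cutoff roughly 91% of prokaryotic viruses but only about 48% of eukaryotic viruses would be called viral, while about 17% of prokaryotic hosts and 31% of eukaryotic hosts would be false positives.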
Figure 2.3: The distribution of viral confidence scores for (a) VirFinder and (b) PPR-Meta. For both predictors, the same dataset was used and the predictions were performed with
default parameters. VirFinder uses VF-Scores to determine the likelihood of input sequences to
be viral or not, and PPR-Meta uses phage scores to discern viruses from host chromosomes and
plasmids. Both predictors achieved a high recall for prokaryotic viruses, while the confidence
scores of eukaryotic viruses were more evenly spread across all confidence regions. Besides,
both predictors achieved a high performance in distinguishing prokaryotic host sequences from
prokaryotic viruses, but less so for eukaryotic host sequences.
Here the performance of DeepMicroClass on sequences with different lengths (500 bps, 1
kbps, 2 kbps, 3 kbps, 5 kbps, 10 kbps, 50 kbps, and 100 kbps) was evaluated on test data. The
model performance for each sequence type was visualized via the Receiver Operating Characteristics (ROC) curve using a one-versus-rest strategy (Figure 2.4). Overall, we showed that as the
sequence length increased, the model’s performance improved across most sequence types, as
indicated by the Area Under the ROC Curve (AUC) measurements (Figure
2.4). DeepMicroClass performed well on all sequence types when the input sequence length was
≥ 1 kbps, with the minimum AUC score being 0.963 on classifying prokaryotic sequences. At
the sequence length of 500 bps, DeepMicroClass achieved fairly high AUC scores for eukaryotic
(0.944) or prokaryotic (0.96) viruses, whilst the scores for both viral sequence types were always
≥ 0.99 at longer sequence lengths (≥ 2 kbps) (Figure 2.4). For non-viral sequences, the AUC
scores were highest for eukaryotic sequences, followed by plasmid and prokaryotic genome sequences. However, a slight drop in the True Positive Rate (TPR) could be observed for eukaryotic
sequences when the False Positive Rate (FPR) was near 0 (Figure 2.4). Further investigation suggested that the rough curve could be caused by the sharp drop in the number of available eukaryotic
sequences in the training dataset, which dropped from 16,002 to 255 when the contig length
changed from 10 kbps to 50 kbps.
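The one-versus-rest AUC used above can be computed without plotting, since the AUC equals the probability that a randomly chosen positive sequence receives a higher score than a randomly chosen negative one. The sketch below assumes NumPy; the function names and the toy labels and probabilities are hypothetical. The pairwise comparison uses memory proportional to the number of positive-negative pairs, so it is only suitable for small examples.

```python
import numpy as np

def auc_binary(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def one_vs_rest_auc(labels, prob_matrix, classes):
    """Treat each class in turn as positive and all others as negative."""
    labels = np.asarray(labels)
    prob_matrix = np.asarray(prob_matrix, dtype=float)
    return {c: auc_binary(labels == c, prob_matrix[:, j])
            for j, c in enumerate(classes)}

# Toy two-class example (hypothetical values).
classes = ["Euk", "Prok"]
labels = ["Euk", "Euk", "Prok", "Prok"]
probs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7], [0.6, 0.4]]
aucs = one_vs_rest_auc(labels, probs, classes)
print(aucs)
```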
2.3.2 DeepMicroClass outperforms Tiara and Whokaryote in
eukaryotic host sequence prediction
In the following three sections, we investigate the performance of DeepMicroClass for particular
classes of sequences. We used accuracy and F1 score as the metrics to assess model performance. The sequence type composition of the different benchmark datasets is described in
Section 2.2.2.
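For reference, the two metrics can be computed from true and predicted labels as sketched below, with one class treated as positive and all others as negative. This binary, per-class formulation and the helper name `accuracy_and_f1` are assumptions for illustration; the toy labels are invented.

```python
def accuracy_and_f1(y_true, y_pred, positive):
    """Binary accuracy and F1 with `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive and p != positive:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1

# Toy example (hypothetical labels): eukaryotes are the positive class.
y_true = ["Euk", "Euk", "Euk", "Prok", "Prok", "EukVir"]
y_pred = ["Euk", "Euk", "Prok", "Prok", "Euk", "Euk"]
acc, f1 = accuracy_and_f1(y_true, y_pred, positive="Euk")
print(acc, round(f1, 4))
```

Unlike accuracy, F1 ignores true negatives, so it is more sensitive when the positive class makes up a small share of a benchmark dataset.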
[Figure 2.4 plot: eight ROC panels, one per contig length; x-axis, False Positive Rate; y-axis, True Positive Rate. AUC values from the panel legends:]
Contig length   Euk     EukVir  Plasmid  Prok    ProkVir
500 bps         0.971   0.944   0.933    0.930   0.960
1000 bps        0.989   0.978   0.966    0.963   0.982
2000 bps        0.995   0.992   0.980    0.978   0.991
3000 bps        0.998   0.996   0.983    0.981   0.993
5000 bps        0.999   0.998   0.986    0.984   0.996
10000 bps       0.992   0.998   0.977    0.975   0.997
50000 bps       0.984   1.000   0.986    0.982   1.000
100000 bps      0.986   1.000   0.986    0.981   1.000
Figure 2.4: The ROC curves and AUC scores of different length models assessed on test datasets.
Each panel shows the ROC curves for 5 sequence classes at a different contig length (500
bps, 1 kbps, 2 kbps, 3 kbps, 5 kbps, 10 kbps, 50 kbps and 100 kbps). Euk, eukaryotic sequences;
EukVir, eukaryotic viral sequences; Plasmid, plasmid sequences; Prok, prokaryotic genome sequences; ProkVir, prokaryotic viral sequences.
First, we compared the performance of DeepMicroClass with Tiara [80] and Whokaryote
[81] on the classification of microbial eukaryotes. Tiara and Whokaryote are commonly used to
identify eukaryotic contigs from metagenomic assemblies without prior knowledge of microbial
phylogenetic affiliation. With the compiled test datasets, we showed that DeepMicroClass consistently outperformed both tools in all scenarios in terms of accuracy and F1 score (Figure 2.5,
5.1), and DeepMicroClass was robust to the different compositions of test datasets (Figure 2.5).
The average accuracy and F1 score across all test datasets for DeepMicroClass were both 0.99,
which were significantly higher than those of Tiara and Whokaryote (pairwise Wilcoxon
test p-values ≤ 9.5e-05 for both accuracy and F1 score). The accuracy of Whokaryote dropped
from ∼ to ∼0.75 as the proportion of eukaryotic sequences increased, and the F1 scores were
substantially lower than 0.8 in all test datasets. In contrast, Tiara maintained high accuracy and
F1 score across different eukaryotic proportions, though a slight decrease in accuracy could be
observed when the eukaryotic proportion was high. DeepMicroClass achieved accuracy and F1
score above 0.98 for all tested scenarios and was robust to variable sequence composition.
A further look into those misclassified sequences revealed that both Tiara and Whokaryote
suffered from lower sensitivity in distinguishing eukaryotic sequences from other types of sequences. For Whokaryote in particular, a substantial number of eukaryotic viruses were mistakenly
classified as eukaryotes (Figure 5.2).
[Figure 2.5 plot: panels (a) Accuracy and (b) F1 Score (y-axis, 0 to 1) against Dataset Number (DS_0 to DS_19); predictors: DMC, Tiara, Whokaryote.]
Figure 2.5: Distribution patterns of accuracy (a) and F1 score (b) across 20 test datasets for
DeepMicroClass, Tiara and Whokaryote. The top panel shows the sequence type composition of
20 test datasets, and the detailed composition ratios can be found in Table 5.1. The dashed black
lines indicate where accuracy or F1 score equals 0.8.
2.3.3 DeepMicroClass outcompetes PlasFlow, PPR-Meta and geNomad
in plasmid sequence prediction
Plasmids are mobile genetic elements of diverse prokaryotes and are one of the major agents of
horizontal gene transfer (HGT) among hosts. Here we compared the performance of DeepMicroClass to PlasFlow [72], PPR-Meta [62] and geNomad [111] in classifying plasmid sequences
using the same benchmark datasets described above. DeepMicroClass performed significantly better than PlasFlow, PPR-Meta and geNomad in all tested cases of plasmid classification
(pairwise Wilcoxon test adj.p-value ≤ 1.1e-07; Figure 2.6 & 5.3). Although PlasFlow, PPR-Meta
and geNomad achieved maximum F1 scores of 0.68, 0.74 and 0.86, respectively, their
performance was severely impaired by increasing proportions of eukaryotic sequences in the
dataset (Figure 2.6). In contrast, the F1 score of DeepMicroClass was consistently higher than 0.8,
though a slight decrease could also be observed with increasing eukaryotic proportions.
We further examined the misclassified sequences and found PlasFlow had high sensitivity but
low specificity, and the dominance of misclassified sequence types was in line with the composition of benchmark datasets (Figure 5.4). PPR-Meta might benefit from its modeling of prokaryotic
chromosomes and phages, while it still had a low specificity mainly due to the misclassification
of prokaryotic and eukaryotic host sequences into plasmids (Figure 5.4). geNomad, on the other
hand, suffered from misclassifying prokaryotic chromosomes as plasmids (Figure 5.4). It’s noteworthy that DeepMicroClass might further benefit from its modeling of eukaryotic hosts and
viruses, since eukaryotic host sequences were rarely classified as plasmids, though the misclassification rates between plasmids and prokaryotic hosts were still the highest among all misclassifications (Figure 5.9). Probable reasons for this observation are the high affinity and frequent genetic exchange between plasmids and prokaryotic hosts. Further improvements to the
neural network structure, or additional features extracted from gene- or operon-centric
approaches, might yield a better classifier.
[Figure 2.6 plot: panels (a) Accuracy and (b) F1 Score (y-axis, 0 to 1) against Dataset Number (DS_0 to DS_19); predictors: DMC, PlasFlow, PPR, geNomad.]
Figure 2.6: Distribution patterns of accuracy (a) and F1 score (b) across 20 test datasets for
DeepMicroClass, PlasFlow, PPR-Meta and geNomad on plasmid classification. The dashed black
lines indicate where accuracy or F1 score equals 0.8. The same benchmarking datasets were used
as in Figure 2.5. DMC, DeepMicroClass; PPR, PPR-Meta
2.3.4 DeepMicroClass achieves improved results in viral sequence
prediction
Viruses are ubiquitously found in every natural system where cellular organisms colonize. Significant advances have been made in recent years in developing tools to identify viral contigs from
metagenomic assemblies, using essentially gene-centric (e.g. VirSorter [58], VirSorter2 [59], VIBRANT [63]), or oligo-nucleotide-centric (e.g. VirFinder [60], DeepVirFinder [61], PPR-Meta [62])
approaches, or combining both approaches (e.g. geNomad [111]). Here we compared the performance of DeepMicroClass to VirSorter2 [59], geNomad [111], VIBRANT [63], DeepVirFinder [61]
and PPR-Meta [62] on viral contig prediction using the aforementioned benchmark datasets.
Among these methods, DeepVirFinder, VIBRANT, PPR-Meta and geNomad were designed
for prokaryotic virus identification [61, 62, 63, 111]. For a fair comparison with these tools,
we considered only prokaryotic viruses as positive samples and all other classes
as negative. In contrast, because VirSorter2 supports the prediction of both eukaryotic and
prokaryotic viruses [59], we considered both eukaryotic and prokaryotic viruses as
positive in the comparison with it. In both scenarios, DeepMicroClass achieved better
performance in terms of accuracy and F1 score than all the other tools in all tested datasets
(pairwise Wilcoxon test p-values ≤ 9.5e-05; Figure 2.7 & 5.5; Figure 2.8 & 5.6).
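The two evaluation scenarios differ only in which of the five classes count as positive. A minimal sketch of that relabeling is given below; the class abbreviations follow Figure 2.4, while the set and function names are hypothetical.

```python
# Scenario 1: tools designed for prokaryotic viruses (DeepVirFinder, VIBRANT,
# PPR-Meta, geNomad) are compared with only prokaryotic viruses as positives.
PROK_VIRUS_ONLY = {"ProkVir"}
# Scenario 2: the comparison with VirSorter2 counts both viral classes as positives.
ALL_VIRUSES = {"ProkVir", "EukVir"}

def binarize(labels, positive_classes):
    """Map five-class labels to True (positive) or False (negative)."""
    return [label in positive_classes for label in labels]

labels = ["Euk", "EukVir", "Plasmid", "Prok", "ProkVir"]
print(binarize(labels, PROK_VIRUS_ONLY))
print(binarize(labels, ALL_VIRUSES))
```

Under the first scenario a eukaryotic virus predicted as viral counts as a false positive, which is why tools trained without eukaryotic sequences are penalized as the eukaryotic fraction grows.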
In the comparison with DeepVirFinder [61], VIBRANT [63] and PPR-Meta [62], the tools fell into two performance tiers: DeepMicroClass and VIBRANT [63] performed similarly
to each other, and both performed better than DeepVirFinder [61] and PPR-Meta [62]. Both DeepMicroClass and VIBRANT [63] kept prediction accuracy near
1 for all 20 benchmark datasets. However, the F1 score of VIBRANT [63] dropped from around
0.94 to 0.80 as the proportion of eukaryotic host and virus sequences increased. DeepMicroClass, on the other hand, maintained a minimum F1 score of 0.90 for predicting prokaryotic
viruses across the benchmark datasets (Figure 2.7 & 5.5).
[Figure 2.7 plot: top panels show dataset composition by class (Plasmid, Prokaryote, Prokaryote Virus, Eukaryote, Eukaryote Virus; y-axis, 0 to 1); lower panels show (a) Accuracy and (b) F1 Score (y-axis, 0 to 1) against Dataset Number (DS_0 to DS_19); predictors: DMC, DVF, VIBRANT, PPR, geNomad.]
Figure 2.7: Distribution of (a) accuracy and (b) F1 scores across 20 test datasets for DeepMicroClass (DMC), DeepVirFinder (DVF), VIBRANT, PPR-Meta (PPR) and geNomad on prokaryotic
viral contig classification. DeepMicroClass received the highest scores in both accuracy and F1
score in all tested scenarios compared to the other predictors. Increasing the fraction of eukaryote-related sequences did not impair the performance of DeepMicroClass, but did impair that of the other
tools. The dashed black lines indicate where accuracy or F1 score equals 0.8. The same benchmarking datasets were used as in Figure 2.5.
When considering both prokaryotic and eukaryotic viral sequences as positive samples, DeepMicroClass and VirSorter2 were both able to achieve accuracy above 0.90 and F1 score above 0.80,
without being significantly affected by the different sequence compositions of the datasets.
DeepMicroClass consistently outperformed VirSorter2 in both metrics across all
benchmark datasets (Figure 2.8 & 5.6).
[Figure 2.8 plot: top panels show dataset composition by class (Plasmid, Prokaryote, Prokaryote Virus, Eukaryote, Eukaryote Virus; y-axis, 0 to 1); lower panels show (a) Accuracy and (b) F1 Score (y-axis, 0 to 1) against Dataset Number (DS_0 to DS_19); predictors: DeepMicroClass, VirSorter2.]
Figure 2.8: Distribution of (a) accuracy and (b) F1 scores across 20 test datasets for DeepMicroClass and VirSorter2 on prokaryotic and eukaryotic viral contig classification. DeepMicroClass
received higher scores in both accuracy and F1 score in all tested scenarios compared to VirSorter2. Both DeepMicroClass and VirSorter2 were able to maintain accuracy and F1 above 0.9
and were not affected by the composition of different sequence types in the dataset. The dashed black
lines indicate where accuracy or F1 score equals 0.8. The same benchmarking datasets were used
as in Figure 2.5.
The number of misclassified sequences by PPR-Meta, DeepVirFinder, VIBRANT, geNomad
and VirSorter2 is shown in Figure 5.7. The distribution of misclassified sequences by PPR-Meta,
DeepVirFinder and geNomad showed a similar pattern, that eukaryotic chromosomal and viral sequences were prone to be misidentified as prokaryotic viruses. This indicates tools or
models trained without knowledge of eukaryotic sequences are likely to behave similarly when
eukaryotes are not rare in the metagenomic community. Although VIBRANT and VirSorter2
had fewer misclassified sequences compared to PPR-Meta, DeepVirFinder and geNomad, both
suffered from misclassifying prokaryotic chromosomal or plasmid sequences into prokaryotic
viruses (Figure 5.7). Since both VIBRANT and VirSorter2 use a gene-centric approach, it’s possible that some of the viral signature genes or fragments could also be widely detected in prokaryotic genomes or plasmids as a result of frequent gene transfer among them. This contrasts with
the oligonucleotide-centric tools since cross-kingdom viral infection or plasmid conjugation and
gene transfer are less common.
2.3.5 DeepMicroClass outperforms PPR-Meta and geNomad in multiclass prediction
DeepMicroClass, PPR-Meta and geNomad are multiclass classifiers, so we compared their
accuracy and F1 score on multiclass sequence classification using the same benchmark
datasets (Figure 2.9 & 5.8). Here we only considered prokaryotic chromosomal, prokaryotic viral
and plasmid sequences for comparison with PPR-Meta and geNomad as they were not trained
for eukaryotic sequence classification. On the other hand, all five sequence types were considered for the evaluation of DeepMicroClass. Although DeepMicroClass was required to classify
more sequence classes than PPR-Meta, it still outperformed PPR-Meta in all tested cases in both
accuracy and F1 score metrics (pairwise Wilcoxon test p-values ≤ 1.9e-06; Figure 2.9 & 5.8).
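One way to read the reported bound is sketched below. With 20 paired datasets, an exact two-sided Wilcoxon signed-rank test cannot return a p-value smaller than 2 / 2^20, which is reached when one tool scores higher on every dataset. This assumes an exact test with no ties or zero differences; the implementation and any multiple-testing adjustment actually used are not stated in the text, so this is an interpretation rather than a reconstruction of the analysis.

```python
n_datasets = 20

# Under the null hypothesis each of the 2**n sign patterns of the paired
# differences is equally likely. Only one pattern gives the most extreme
# statistic in each tail, so the smallest two-sided exact p-value is 2 / 2**n.
p_min_two_sided = 2 / 2 ** n_datasets
print(f"{p_min_two_sided:.2e}")
```

The result, about 1.9e-06, is consistent with the bound quoted above when DeepMicroClass scores higher on all 20 datasets.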
[Figure 2.9 plot: panels (a) Accuracy and (b) F1 Score (y-axis, 0 to 1) against Dataset Number (DS_0 to DS_19); predictors: DMC, PPR-Meta, geNomad.]
Figure 2.9: Distribution of (a) accuracy and (b) F1 scores across 20 test datasets for DeepMicroClass, PPR-Meta and geNomad on prokaryotic host, prokaryotic virus and plasmid contig classification. DeepMicroClass received higher scores in both accuracy and F1 score in all tested
scenarios compared to PPR-Meta and geNomad in multi-class classification. The dashed black
lines indicate where accuracy or F1 score equals 0.8. The same benchmarking datasets were used
as in Figure 2.5.
2.3.6 DeepMicroClass predicted more eukaryotic and viral contigs than
alignment-based predictors
Alignment-based classifiers can suffer from incomplete genomic databases, particularly for complex natural environments such as marine or soil systems. To test the performance of DeepMicroClass in a real metagenomic context, we compared it with two other
sequence classifiers, Kaiju [75] and MetaEuk [76], using a 1-300 µm size fraction marine metagenomic dataset sampled off the coast of Southern California [104]. Using the co-assembled contigs
as the reference, we show that DeepMicroClass classified fewer prokaryotic but more eukaryotic, eukaryotic viral and prokaryotic viral contigs than Kaiju and MetaEuk (Figure 2.10a). Among all
the prokaryotic contigs classified by both Kaiju and MetaEuk, 73.6% of them were predicted to
be prokaryotic by DeepMicroClass, and 11.88%, 10.39%, and 4.14% of them were predicted to be
eukaryotic, prokaryotic viral and eukaryotic viral sequences, respectively (Figure 2.10b). Contigs that couldn’t be taxonomically determined by Kaiju (16.41%) or MetaEuk (10.01%) are mainly
dominated by eukaryotic sequences (57.13% / 38.3%) as predicted by DeepMicroClass (Figure 2.10c
& 2.10d). Although MetaEuk classified more eukaryotic contigs than Kaiju (21.88% vs 15.26%,
Figure 2.10a), the latter classified more prokaryotic viral contigs (4.38% vs 1.51%, Figure 2.10a).
This is consistent with the higher percentage of prokaryotic viral sequences in the unclassified
contigs of MetaEuk than Kaiju (28.86% vs 14.87%, Figure 2.10c & 2.10d). By mapping reads to
reference contigs, we calculated the read percentages recruited by different sequence types. The
average eukaryotic read percentage recruited by DeepMicroClass (6.15%) is considerably higher
than by MetaEuk (4.78%) or Kaiju (3.50%), at the expense of lower prokaryotic read percentages
(13.12%, 20.60% and 20.51%, respectively, Figure 2.10f-h). Similarly, the average read percentages
of prokaryotic viral and eukaryotic viral sequences recruited by DeepMicroClass (6.07%/1.24%)
are also higher than MetaEuk (0.49%/0.19%) and Kaiju (1.67%/0.37%) (Figure 2.10f-h). Notably,
though DeepMicroClass assigned fewer prokaryotic and more eukaryotic reads than the other classifiers, the relative abundance profiles across the whole time series are highly correlated (Figure 5.10a & b), as are, to a lesser extent, the prokaryotic viral read percentage profiles (Figure 5.10).
This is not the case for eukaryotic viral read abundance profiles, where Kaiju and MetaEuk are
highly correlated, but not to DeepMicroClass (Figure 5.10). To sum up, DeepMicroClass is more
correlated with MetaEuk in eukaryotic read profiles, and more correlated with Kaiju in prokaryotic and prokaryotic viral read profiles.
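The read percentages discussed above can be derived from two tables: the class assigned to each reference contig and the number of reads mapped to it. The sketch below assumes that percentages are taken over the total number of clean reads in a sample; the function name, contig names and counts are hypothetical.

```python
from collections import defaultdict

def read_percentages(contig_class, contig_reads, total_reads):
    """Sum mapped reads per predicted class and express them as percentages."""
    per_class = defaultdict(int)
    for contig, n_reads in contig_reads.items():
        # Contigs without a prediction are grouped as "Unclassified".
        per_class[contig_class.get(contig, "Unclassified")] += n_reads
    return {c: 100.0 * n / total_reads for c, n in per_class.items()}

# Toy sample (hypothetical values).
contig_class = {"c1": "Prok", "c2": "Euk", "c3": "ProkVir"}
contig_reads = {"c1": 130, "c2": 60, "c3": 60, "c4": 50}
pct = read_percentages(contig_class, contig_reads, total_reads=1000)
print(pct)
```

Repeating the calculation for each daily sample gives the per-class time series whose correlations between classifiers are compared above.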
Figure 2.10: Sequence classification and read abundance of a 1-300 µm size fraction marine
metagenomic dataset sampled off the coast of Southern California. Metagenomic contigs were
classified using DeepMicroClass, Kaiju and MetaEuk at a length cutoff of 2 kb, and percentages of
different sequence types were calculated (a). Contigs predicted as Prokaryotes by both Kaiju and
MetaEuk (b), and contigs that were not classified by Kaiju (c) or MetaEuk (d) were further broken
down into DeepMicroClass’s classification. Clean reads were aligned to metagenomic contigs
and percentages of mappable reads were calculated (e). Mapped read percentages were further
summarized according to sequence types of reference contigs as predicted by DeepMicroClass
(f), Kaiju (g) and MetaEuk (h). Prokaryotes included both prokaryotic hosts and plasmids. UnclassifiedViruses were sequences predicted to be viruses but their taxonomy couldn’t be further
resolved by Kaiju or MetaEuk.
2.3.7 Computational cost
The memory requirement depends on whether the whole prepared one-hot dataset is preloaded
into memory. When preloading is enabled, 512 GB of memory is recommended. In
our case, we used a workstation equipped with an AMD EPYC 7702P CPU, an NVIDIA
Tesla T4 GPU and 1 TB of memory, so dataset preloading was enabled during model training. This
configuration enabled us to complete the training in less than 5 hours, and testing on the 20
datasets took less than 5 minutes.
2.4 Discussion
2.4.1 Microbial eukaryotes and viruses infecting them are understudied
Microbial eukaryotes are prevalent in diverse ecosystems such as host-associated habitats [112],
deep-sea benthos [113], and geothermal springs [114], etc. Due to challenges in the cultivation
and whole genome-sequencing of microbial eukaryotes, biodiversity surveys of microbial eukaryotes were commonly performed using marker genes, such as the 18S rDNA hypervariable V4 or
V9 regions [115, 116]. The amplicon-based analysis provides valuable information on the taxonomy of microbial eukaryotes. In order to probe their metabolic potentials or ecological functions,
genomic and transcriptomic information are essential. Despite several achievements in collecting microbial eukaryotic genes [117, 79], transcripts [77] or single-cell amplified genomes (SAGs)
[118] towards a comprehensive microbial eukaryotic database, our knowledge is still limited
by the availability of diverse microbial eukaryotic genomes [119]. With the rapid accumulation
of metagenomic datasets and the availability of binning software, it’s appealing to recover eukaryotic genomes from natural microbial communities. EukRep was developed in such a context
to identify eukaryotic contigs for metagenomic binning [78]. This approach has enabled the
genome-resolved analysis of fungi, protists, and rotifers from human microbiome studies [78,
120]. Similar approaches have been applied to marine microbiome studies [121, 122], which recovered hundreds of eukaryotic metagenome-assembled genomes (MAGs) and provided insight
into the functional diversity and evolutionary histories of microbial eukaryotes beyond the taxonomic information.
Beyond microbial eukaryotes, current viromic studies are biased towards viruses infecting
prokaryotes. This could be introduced by the skewed distribution of viral genomes in the RefSeq database, which is dominated by phages and pathogenic viruses. By Sept 1, 2023, among
18,729 viral reference sequences, there were only 104 records belonging to algae-infecting Phycodnaviridae and 30 belonging to protists-infecting Mimiviridae. Both of the two viral families
are subgroups of the Nucleocytoplasmic Large DNA Viruses (NCLDV) [123]. Since most of the
commonly used viral predictors are trained on the RefSeq viral database, it’s expected that these
tools struggled to identify eukaryotic viruses in the test datasets (Figure 2.7, 2.8, 5.5, &
5.6). Given the high diversity of protists [124, 125], high-throughput metagenomes and single-cell genomes are expected to offer a culture-independent solution to rapidly expand the coverage
of viral databases. For instance, two recent studies reconstructed 2,074 and 501 NCLDV MAGs
from global environmental metagenomes [126, 127], dramatically increasing the known phylogenetic and
functional diversity of NCLDVs. Single-cell metagenomics was also employed to identify viruses
infecting marine microbial eukaryotes [128, 129]; these studies provided insights into the
viral encoded proteins and metabolic pathways.
These studies demonstrated that metagenomics and single-cell genomics can be promising
in studying microbial eukaryotes and viruses infecting them. However, most commonly used tools
are not optimized for classifying eukaryotes (Figure 2.5 & 5.1) or eukaryotic viruses (Figure 2.7 &
5.5). Given the high performance of DeepMicroClass and the evidence of abundant eukaryotic
contigs in marine ecosystems (Figure 2.10), we expect it will be a valuable addition to the toolbox
of marine ecologists.
2.4.2 The challenge of classifying prokaryotic host and plasmid sequences
DeepMicroClass has a relatively lower accuracy in classifying plasmids when compared to the
classification of eukaryotic or viral contigs (Figure 2.5, 2.6, 2.7, 2.8). The majority of the sequences
that were misclassified as plasmids were from prokaryotic host genomes (Figure 5.9), confirming that separating prokaryotic chromosomal from plasmid sequences is a limitation of DeepMicroClass
(Figure 2.4). In comparison, the other tested plasmid classifiers suffered from both prokaryotic
and eukaryotic sequences as we have benchmarked (Figure 2.6 & 5.4). It’s noteworthy that this
marginal advantage can be crucial in natural environments, such as marine environments as we
mentioned here (Figure 2.10), where eukaryotic sequences can have a substantial impact on the
classification of plasmid sequences. This also indicates that it is achievable to separate plasmid
sequences from eukaryotic sequences solely based on patterns of oligonucleotides, and current
plasmid predictors can benefit from using a more comprehensive training dataset including eukaryotic sequences.
It is understandable that more eukaryote related contigs were misclassified (Figure 5.9), given
the higher genome complexity of eukaryotes than prokaryotes [130], such as the coding density, prevalence of introns and repetitive sequences, etc. In contrast, it’s challenging to classify plasmids and prokaryotic chromosomal sequences for all the tested plasmid predictors (Figure 2.6). The reasons can be manifold, but plasmid transmission among microbial hosts and
plasmid-chromosome gene shuffling can be two fundamental ones. The host range of plasmids is
variable: it can be restricted to closely related species for narrow host range plasmids or span distant
phylogenetic groups for broad host range plasmids [131]. Broad host range plasmids can be important drivers of the gene flux among host microbes in natural environments [132, 133, 134]. For
instance, in natural soil microbial communities, the IncP- and IncPromA-type broad host range
plasmids could transfer from proteobacteria to diverse bacteria belonging to 11 bacterial phyla
[135]. When plasmid carriage could increase the hosts’ fitness, such as improving host survival
with antibiotic resistance, it can be rapidly adopted and persistently maintained in natural microbial communities [136, 137]. On the other hand, when the maintenance of plasmids imposed
a high fitness cost on the hosts, plasmids or plasmid-borne genes could be lost in the process
of purifying selection [65]. Interestingly, studies also suggested that sometimes this fitness cost
could be ameliorated by compensatory evolution [138, 139, 140], which was hypothesized to be
the major factor of plasmid survival and persistence [141]. Plasmid carriage also increases the
chance of plasmid-chromosome genetic exchange mediated by SOS-induced mutagenesis [142]
or mobile genetic elements such as transposons and integrons [142, 143]. For instance, genes
carried by transposons or in the variable regions were also frequently found on plasmids [144,
145]. Thus, the permissive transfer of plasmids across diverse hosts and the plasmid-chromosome
gene flow pose a challenge for current plasmid classifiers. The oligonucleotide-based approaches
might be complemented by gene-centric approaches using plasmid signature genes or enriched
gene functions, such as genes involved in mobilization or conjugation. In addition, a comprehensive plasmid database is also crucial for model training, and plasmid-enriched metagenomics
(plasmidome) can be a promising way to screen plasmids from environmental samples [146].
Chapter 3
Phage-bacterial contig association prediction with
ContigNet
3.1 Introduction
In this chapter, we delve into another pivotal challenge presented by shotgun metagenome datasets:
the computational determination of whether a given DNA sequence from a virus can potentially
infect a corresponding bacterial sequence. To address this challenge, we present ContigNet, our
deep learning-based software package designed to predict the probability of an association between a viral contig and its potential host contig.
Traditional taxonomy-based methods of predicting virus-host associations are hampered by
their inability to fully harness the information available in metagenomic datasets. Consequently,
these methods often disregard bacteria not present in existing datasets when identifying potential hosts for viruses. ContigNet circumvents this limitation by conducting a direct comparison
between viral and bacterial sequences.
While there are two existing methods, VirHostMatcher (VHM) [89] and WIsH [46], that can
perform such direct comparisons, their efficacy diminishes with shorter contigs. As we
demonstrate in Section 3.3.2, these methods handle short contigs poorly, whereas ContigNet achieves a higher AUROC when predicting the association between
a viral contig and its host. The functionality of ContigNet is straightforward: it accepts two DNA
sequences as input and returns a probability score between 0 and 1, representing the likelihood
of their association.
3.2 Materials and Methods
3.2.1 Dataset preparation
The first dataset we used was retrieved from Virus-Host DB [147]. Virus-Host DB gathered virus
and host information from multiple sources, including RefSeq, GenBank, UniProt, ViralZone,
and manual curation. The association between virus and host was represented as a pair of virus
genome sequence and host taxonomic ID in the database. The Virus-Host DB version used in this
study was released in March 2021, and this release contained 16,048 virus-host pairs with both
prokaryotic and eukaryotic hosts. We filtered out non-bacterial host entries from the database.
The hosts reported in the database varied in taxonomic ranks, and only phage-host pairs with host
taxonomy ID at species rank were used. We then removed phage-host pairs with host taxonomy
ID not present in GenBank [28]. Finally, we obtained a dataset with 2,589 phage-host pairs, including
2,548 phages and 301 host species. The species-level composition of the phage hosts is
shown in Figure 3.1, which highlights the 10 most abundant species and groups the rest as "Others". The pie
chart shows that the hosts of the phages are not overly biased towards any particular species.
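[Figure 3.1: pie chart. Slice counts as extracted: 268, 155, 129, 118, 106, 90, 81, 78, 72, 68, 1633. Legend entries as extracted: E. coli, L. lactis, P. aeruginosa, S. aureus, K. pneumoniae, G. terrae, C. acnes, E. faecalis, S. enterica, M. smegmatis, Others.]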
Figure 3.1: Pie chart showing the number of phages having association with a host species
in Virus-Host DB. It highlights the top 10 species, while the remaining species are grouped as
"Others". E. coli represents the largest single-species portion, accounting for 9.58% of the total. Since
there is no dominant species, the dataset is not biased towards any particular species.
The dataset was separated into training and validation sets. To prevent extremely similar
phage genomes appearing in both the training and validation sets, we used CD-Hit v4.8.1 [148] to
cluster phage genomes, which required >95% sequence identity and >50% alignment coverage for
the shorter sequence. Finally, we obtained 2,325 phage clusters and in each cluster we randomly
selected a phage chromosome to represent the sequence of other phages in the same cluster. After
redundant phage genomes were removed, the dataset was randomly separated into 80% training
set and 20% validation set. Any phage-host pair in the training set with the phage also appearing
in the validation set was moved to the validation set. A similar approach was also applied to
hosts. Given a particular taxonomy level, phage-host pairs with overlapping hosts between the
training and validation sets were moved to the validation set.
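The leakage-removal step above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `split_pairs`, the representation of a pair as a `(phage_id, host_label)` tuple, and the choice to repeat the move until no overlap remains are all assumptions, and `host_label` stands for the host taxon at whichever taxonomy level is chosen.

```python
import random

def split_pairs(pairs, val_fraction=0.2, seed=0):
    """Randomly split (phage_id, host_label) pairs, then move leaking pairs to validation."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    val, train = shuffled[:n_val], shuffled[n_val:]

    # Assumption: repeat until stable, because moving a pair can expose new overlaps.
    while True:
        val_phages = {phage for phage, _ in val}
        val_hosts = {host for _, host in val}
        leaking = [pair for pair in train
                   if pair[0] in val_phages or pair[1] in val_hosts]
        if not leaking:
            return train, val
        leaking_set = set(leaking)
        train = [pair for pair in train if pair not in leaking_set]
        val.extend(leaking)
```

Note that with this rule the validation set can grow well beyond the nominal 20% when many phages share hosts, so the realized split ratio should be checked after the call.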
Virus-Host DB included genome sequences for all viruses in the database, and their host
genome sequences were retrieved from GenBank according to their species taxonomy IDs. Some
host species, such as E. coli, have over 20,000 assemblies in GenBank, and it is not feasible to use
all of them for training and testing. Thus, for hosts with more than 10 assemblies, we randomly
downloaded 10 assemblies as representatives of that species. For host species with 10 or fewer
assemblies, we downloaded all available assemblies. All collected genome sequences were used
in the subsequent training procedures.
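A minimal sketch of the assembly subsampling rule is given below. The function name `pick_assemblies` and the input format (a dictionary mapping a species name to a list of assembly identifiers) are hypothetical; the actual download from GenBank is not shown.

```python
import random

def pick_assemblies(assemblies_by_species, max_per_species=10, seed=0):
    """Keep at most `max_per_species` randomly chosen assemblies per host species."""
    rng = random.Random(seed)
    chosen = {}
    for species, assemblies in assemblies_by_species.items():
        if len(assemblies) > max_per_species:
            # Sample without replacement so no assembly is picked twice.
            chosen[species] = rng.sample(assemblies, max_per_species)
        else:
            chosen[species] = list(assemblies)
    return chosen
```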
To assess the generalization ability of our developed model, we also included two different
datasets as test sets in our study. The first one is the Metagenomic Gut Virus (MGV) [149], and the
other is the PLSDB [150] database. Both datasets were preprocessed so that they did not overlap
with the training set. The details of the datasets, steps of preprocessing and rationale of choosing
these two datasets as test sets will be discussed in Section 3.3.6.
3.2.2 Feature representation
For a phage contig with length Lp, we used a one-hot matrix to represent the contig. Particularly,
A, C, G and T were represented by [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0] and [0, 0, 0, 1], respectively. We
used [0, 0, 0, 0] to represent bases other than A, C, G or T that might appear in the sequence. The
constructed matrix was denoted as Mpb, with size Lp × 4, where the subscript p represents the
phage and b represents nucleotide base. Similarly, for the host, we constructed the one-hot matrix
Mhb with size Lh × 4, where Lh represents the host contig length.
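The base encoding described above can be illustrated with a short sketch. The function name `one_hot_bases` is hypothetical, and treating lowercase letters as their uppercase equivalents is an assumption not stated in the text.

```python
BASES = "ACGT"

def one_hot_bases(seq):
    """Return an L x 4 matrix (list of rows); non-ACGT characters map to all zeros."""
    rows = []
    for ch in seq.upper():  # assumption: soft-masked (lowercase) bases are kept
        row = [0, 0, 0, 0]
        idx = BASES.find(ch)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows
```

For a contig of length Lp, the result has Lp rows and 4 columns, matching the size of Mpb.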
We also encoded the contigs based on codons using a 64-dimensional one-hot matrix, given
that there are 64 possible codons. An all-zero vector was used when the
remaining nucleotides were too few to form a codon or when a non-standard character
occurred in the codon. Contigs were split into codons in the three reading frames of
the forward strand. For example, for the fragment ATGCGTCAT, the three reading frames are
ATG/CGT/CAT, TGC/GTC/AT-, and GCG/TCA/T--. We first constructed a one-hot matrix for each
reading frame and got three individual matrices. The three matrices were concatenated together
and used as the input of ContigNet. We used Mpc and Mhc to denote the resulting matrices for
phage and host, respectively, where the subscript c indicates codon information for the phage or
the host.
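The three-frame codon encoding can be sketched as follows. The names `codon_one_hot` and `three_frame_codon_one_hot` are hypothetical, and the lexicographic ordering of the 64 codons is an assumption. The text does not specify the axis along which the three matrices are concatenated, so the sketch returns them as a list.

```python
from itertools import product

# Assumption: codons are indexed in lexicographic order over A, C, G, T.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}

def codon_one_hot(seq, offset):
    """One-hot encode the codons of one forward reading frame starting at `offset`."""
    seq = seq.upper()
    rows = []
    for start in range(offset, len(seq), 3):
        row = [0] * 64
        # Incomplete trailing codons and codons with non-ACGT characters stay all-zero.
        idx = CODON_INDEX.get(seq[start:start + 3])
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows

def three_frame_codon_one_hot(seq):
    """Return the codon one-hot matrices for reading frames 0, 1 and 2."""
    return [codon_one_hot(seq, offset) for offset in range(3)]
```

For ATGCGTCAT, frame 0 encodes ATG, CGT and CAT, while the last row of frames 1 and 2 is all-zero because AT- and T-- are incomplete.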
3.2.3 Deep learning model structure and training
ContigNet is a four-path convolutional neural network for the phage-host prediction task.
An overview of the model is depicted in Figure 3.2. It takes matrices Mpb and Mhb as input
and outputs a single numeric value indicating the probability that the query phage and host are
associated with each other. Each of the two input matrices is branched into two paths, a base-path to encode base information and a codon-path to encode the codon information contained in
the contig. In the codon-path, the codon one-hot matrices Mpc and Mhc are derived from
the base matrices with a series of convolutional operations, which is named the codon transformer in
the figure.
The codon transformer first constructs a filter F with size 64 × 3 × 4, i.e. the filter consists
of 64 kernels, where each kernel is a 3 × 4 matrix. Here we set each kernel to be a base one-hot
matrix of a possible codon. Consider a base one-hot matrix M with size 3 × 4, representing a
3-mer. We define the function
$$f(M) = \mathrm{ReLU}(F * M - 2 \cdot \mathbf{1}),$$
where the output of $*$ is defined as $F * M = \left[\sum_i \sum_j F_{0,i,j} M_{i,j}, \ldots, \sum_i \sum_j F_{63,i,j} M_{i,j}\right]$, $\mathbf{1}$ denotes
the vector with all entries equal to 1, and ReLU denotes the Rectified Linear Unit. The output
of f is a one-hot vector, where the dimension corresponding to the codon is 1 and all other
dimensions are 0. By applying the convolutional filter to the base one-hot matrix, we can obtain
the corresponding codon one-hot matrix.
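The codon transformer for a single 3-mer can be checked with a small sketch. The names `one_hot` and `codon_transform` and the lexicographic kernel order are assumptions. Each kernel response counts the positions at which the 3-mer matches the kernel's codon (0 to 3), so subtracting 2 and applying ReLU leaves 1 only for an exact match.

```python
from itertools import product

BASES = "ACGT"
# Assumption: the 64 kernels are ordered lexicographically by codon.
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

def one_hot(kmer):
    """Base one-hot matrix; characters outside ACGT give an all-zero row."""
    return [[1 if base == b else 0 for b in BASES] for base in kmer]

# F holds 64 kernels, each the 3 x 4 base one-hot matrix of one codon.
F = [one_hot(codon) for codon in CODONS]

def codon_transform(M):
    """Compute ReLU(F * M - 2) for one 3 x 4 base one-hot matrix M."""
    out = []
    for kernel in F:
        matches = sum(kernel[i][j] * M[i][j] for i in range(3) for j in range(4))
        out.append(max(0, matches - 2))
    return out
```

A 3-mer containing an unknown base matches at most two positions of any kernel, so its output is the all-zero vector, consistent with the codon encoding in Section 3.2.2.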
The structure of the base and codon channels consists of convolutional and max pooling layers
with ReLU activation function. Usually, convolutional neural networks require a fixed size input.
However, in metagenomics studies, the contig length can range from only a few hundred bps to
over thousands of bps. To solve the incompatibility between variable contig length and fixed CNN
input size, ContigNet exploits a property of the pooling layer that allows it to
accept contigs of theoretically unlimited length: the last layer of each convolution
channel in ContigNet is a global pooling layer, which produces a fixed-length output that can be
fed into the downstream fully connected layers, regardless of the length of the input. We forced the
base paths and codon paths of the phage and host to share the same weights; the rationale
is discussed in Section 3.3.1.
Figure 3.2: The overview of the model structure of ContigNet. The network receives two
inputs: the one-hot matrix for the virus contig and the one-hot matrix for the host contig. These
inputs traverse two separate convolutional paths (the base path and the codon path) after the
base one-hot matrix is transformed into the codon one-hot matrix using a codon transformer.
The outputs of the convolutional paths are then aggregated, concatenated, and passed through
a fully connected layer. Subsequently, the output is processed by a sigmoid function, which is
interpreted as the probability that the virus and host are associated.
The outputs of the four separate paths are collected, concatenated, and fed into the fully connected layer with a sigmoid function as the final output of the model, which has a range of (0, 1),
indicating the probability that the two input contigs are associated with each other.
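The way global pooling removes the dependence on contig length can be illustrated with a toy sketch. Several details are assumptions: the text does not say whether the global pooling is max or average pooling (max is used here), the feature maps are small hand-written lists rather than real convolution outputs, and the fully connected part is reduced to a single linear layer with hypothetical `weights` and `bias`.

```python
import math

def global_max_pool(feature_map):
    """Collapse a (length x channels) feature map to one value per channel."""
    n_channels = len(feature_map[0])
    return [max(row[c] for row in feature_map) for c in range(n_channels)]

def association_probability(path_outputs, weights, bias):
    """Pool each path, concatenate the results, and apply a linear layer plus sigmoid."""
    pooled = [value for fmap in path_outputs for value in global_max_pool(fmap)]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

Feature maps with 2 rows and with 5 rows both pool to a vector with one entry per channel, so the layer after the concatenation always sees an input of the same size.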
The model was optimized by minimizing the binary cross entropy loss between the output
and the target. The hyper-parameters for the model training were chosen from grid search using
ten-fold cross validation on the training set. The learning rate was chosen from 0.01, 0.001, 0.0001
and 0.00001, and the batch size was chosen from 16, 32, 64, and 128. Eventually, the model was
trained for up to 5000 epochs, with an early-stopping criterion of 0.00001 over 50 epochs. The learning rate was
finally set to 0.001 and the batch size was set to 32.
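For reference, the standard form of the binary cross-entropy loss is given below. The symbols are notation introduced here rather than taken from the text: $N$ is the number of phage-host pairs in a batch, $y_i \in \{0, 1\}$ is the target for pair $i$, and $\hat{y}_i$ is the model output. Averaging over the batch is an assumption, since the reduction used is not stated.

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]
```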
At each epoch, for a positive phage-host pair, we need to construct a negative phage-host pair.
In this study, we randomly selected a host that was not reported to interact with the phage in
Virus-Host DB to construct each negative phage-host pair. This way of selecting the negative
pairs can possibly include some phage-host pairs that are associated but have not been reported by previous studies. However, such potential positive phage-host pairs not reported in the database are
expected to be extremely rare given that the number of truly associated phage-host pairs is much
smaller than that of negatively associated pairs. Thus, we expect that such incorrectly chosen
pairs will have minimal impacts on our model. For each positive phage-host pair, we constructed
one negative phage-host pair resulting in the positive/negative ratio of 1:1.
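The negative sampling scheme can be sketched as follows. The function name `sample_negatives` and the data layout (pairs as `(phage, host)` tuples, hosts as a list) are hypothetical. The sketch assumes every phage has at least one host it is not reported to interact with; otherwise `rng.choice` raises an error on the empty candidate list.

```python
import random

def sample_negatives(positive_pairs, all_hosts, seed=0):
    """Draw one unreported host per positive pair, giving a 1:1 positive/negative ratio."""
    rng = random.Random(seed)
    known = {}
    for phage, host in positive_pairs:
        known.setdefault(phage, set()).add(host)

    negatives = []
    for phage, _ in positive_pairs:
        # Exclude every host reported for this phage, not only the current one.
        candidates = [h for h in all_hosts if h not in known[phage]]
        negatives.append((phage, rng.choice(candidates)))
    return negatives
```

Calling the function with a different `seed` at each epoch gives a fresh set of negative pairs, in line with constructing the negatives per epoch.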
3.2.4 Investigation of the effects of contig lengths, sequencing errors
and chimeric contigs on the performance of ContigNet
To investigate the effects of contig lengths on prediction accuracy, for each input batch during
training, validation, and testing, the phage contig length Lp was selected from 200 bps, 500 bps,
1 kbps, 5 kbps, 10 kbps and 50 kbps, and the host contig length Lh was selected randomly from
200 bps, 500 bps, 1 kbps, 5 kbps, 10 kbps, 50 kbps and 100 kbps. Contigs were then sampled from
the corresponding phage and host genomes. The contig lengths and contig sequences were all
re-sampled at the start of each training epoch to increase the diversity of the training dataset.
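The contig sampling procedure can be sketched as below. The names `sample_contig` and `sample_batch_lengths` are hypothetical, and returning the whole genome when it is shorter than the requested length is an assumption, since the text does not say how such cases were handled.

```python
import random

PHAGE_LENGTHS = [200, 500, 1_000, 5_000, 10_000, 50_000]
HOST_LENGTHS = PHAGE_LENGTHS + [100_000]

def sample_contig(genome, length, rng):
    """Return a random substring of `genome`; the whole genome if it is too short."""
    if len(genome) <= length:
        return genome
    start = rng.randrange(len(genome) - length + 1)
    return genome[start:start + length]

def sample_batch_lengths(rng):
    """Pick the phage and host contig lengths used for one input batch."""
    return rng.choice(PHAGE_LENGTHS), rng.choice(HOST_LENGTHS)
```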
Firstly, we evaluated the performance of the trained model on phage-host contig pairs directly
sampled as error-free substrings of phage and host genomes. The related results were reported
in Subsections 3.3.2 and 3.3.3.
In metagenomic samples, however, it is possible that the contigs have multiple sources of
errors including sequencing and assembly errors. Therefore, we also assessed the performance
of ContigNet with different types of errors. For sequencing, possible errors include insertions,
deletions and substitutions. The error profiles vary among next-generation sequencing (NGS) technologies, and most common sequencing technologies have an error rate below 0.1% [151]. To simulate
the general scenario, we set two parameters, µ as substitution rate and δ as insertion/deletion
rate. For each nucleotide, given that an error occurred, a substitution error occurred with probability
$\frac{\mu}{\mu+\delta}$, and the nucleotide was changed to one of the other three nucleotides with equal probability $\frac{1}{3}$. A deletion error occurred with probability $\frac{\delta}{2(\mu+\delta)}$. An insertion error occurred with probability $\frac{\delta}{2(\mu+\delta)}$,
and one of A, C, G, or T was inserted with equal probability $\frac{1}{4}$. We used this simple model to
show the impact of substitution and insertion/deletion errors on the performance of ContigNet.
Other sequencing error mechanisms can easily be simulated. In this experiment, we first trained
ContigNet with error-free training set, and the trained model was evaluated on the validation set
with artificially simulated errors with different combinations of δ and µ. The parameter δ was
chosen from 0, 0.05 and 0.1, and µ was chosen from 0 to 0.1 in steps of 0.02. The simulated
error rates could be over 100 times larger than typical error rates, and therefore the results
can be regarded as a lower bound of the performance of ContigNet on real metagenomic datasets. The
contig lengths for both phage and host were fixed at 5 kbps, and the AUROC of ContigNet on the
test sets was reported as the average of 10 repeated experiments.
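The error model can be sketched as follows. Two details are assumptions not stated in the text: each position is hit by an error with probability µ + δ, and an inserted base is placed directly after the current base. The function name `add_sequencing_errors` is hypothetical.

```python
import random

def add_sequencing_errors(seq, mu, delta, rng):
    """Apply substitution (mu) and insertion/deletion (delta) errors to a DNA string."""
    total = mu + delta
    out = []
    for base in seq:
        # Assumption: an error hits each position with probability mu + delta.
        if total == 0 or rng.random() >= total:
            out.append(base)
            continue
        r = rng.random() * total
        if r < mu:
            # Substitution: probability mu / (mu + delta), uniform over the other bases.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < mu + delta / 2:
            # Deletion: probability delta / (2 (mu + delta)); the base is dropped.
            continue
        else:
            # Insertion: probability delta / (2 (mu + delta)); a uniform base is added.
            out.append(base)
            out.append(rng.choice("ACGT"))
    return "".join(out)
```

With δ = 0 the output has the same length as the input and differs from it at roughly a fraction µ of the positions.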
Contigs in metagenomic studies may also contain assembly errors. Reads from different genomes can be assembled into the same contig, which is referred to as a chimeric contig. We assessed
the effect of the presence of chimeric contigs in the test set on the performance of ContigNet
trained on error-free data. To simulate the occurrence of chimeric contigs, for each phage/host contig, we exchanged 0%-20% of its sequence with a random contig in the validation set. We only
swapped the subsequences between phages and between hosts, because previous studies showed
that it is very rare to mix viral and host fragments in assembly [152, 153]. The AUROC scores
of ContigNet on the validation set with sequencing errors and chimeric contigs were reported in
Section 3.3.5.
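One possible way to simulate a chimeric contig is sketched below. The text only states that 0%-20% of a contig was exchanged with a random contig, so replacing a single contiguous segment at a random position with an equally long segment of the donor is an assumption, as is the function name `make_chimera`.

```python
import random

def make_chimera(contig, donor, fraction, rng):
    """Replace a contiguous `fraction` of `contig` with a segment taken from `donor`."""
    n_swap = min(int(round(len(contig) * fraction)), len(donor))
    if n_swap == 0:
        return contig
    start = rng.randrange(len(contig) - n_swap + 1)
    donor_start = rng.randrange(len(donor) - n_swap + 1)
    segment = donor[donor_start:donor_start + n_swap]
    # The chimera keeps the original contig length.
    return contig[:start] + segment + contig[start + n_swap:]
```

Following the text, the donor would be another phage contig for a phage and another host contig for a host, never a contig of the other type.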
3.2.5 Performance Evaluation
The performance of our deep learning method in the phage-host association prediction was assessed using the area under the receiver operating characteristic curve (AUROC) score. The receiver operating characteristic (ROC) curve visualizes the predictive performance of a binary
classifier by plotting the true positive rate against the false positive rate at various thresholds.
The AUROC can be used to quantify the performance of this binary classifier.
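The AUROC equals the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auroc` is hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation of associated and non-associated pairs.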
3.3 Results
3.3.1 Sharing weights among base and codon paths improves model
generalization
The convolutional part of ContigNet is essentially a feature extractor, where each path extracts
256 features from the base and codon paths of the input sequence. By analogy with traditional
alignment-free methods, we can regard the output of the convolution layers as the k-mer
frequencies, and the fully connected layers as the distance measure. When we calculate k-mer
frequencies, we treat phage and host sequences the same, and identical features are extracted from
them. Hence, it is also reasonable to let the convolutional paths for both sequence types
have the same weights.
To justify the use of shared weights, we compared two models, one having shared
weights between phage and host paths and the other having independent weights between phage
and host paths. For a fair comparison, for both models, we conducted a grid search using ten-fold
cross validation on the training set, with the learning rate chosen from 0.01, 0.001, 0.0001 and
0.00001, and the batch size chosen from 16, 32, 64 and 128. The final selected hyper parameters
were the same for both models, with the learning rate being 0.0001 and batch size being 32.
The trained models were then tested on the validation set. The AUROC scores of each model
on contigs with different lengths, ranging from 200 bps to 5 kbps, are shown in Figure 3.3. The
figure shows that ContigNet with shared weights for phages and hosts outperforms the model
with independent weights. The Wilcoxon signed-rank test also supports this conclusion with
a p-value of $1.33 \times 10^{-25}$. Therefore, in the rest of this chapter, we use the model with shared weights
between the phages and the hosts.
[Figure 3.3: boxplots. x-axis categories: shared, distinct; y-axis: AUROC, 0.70 to 0.86.]
Figure 3.3: Boxplots comparing AUROC scores of ContigNet models with and without
shared weights between phage and host paths. The x-axis differentiates between models
with shared or distinct convolutional paths, while the y-axis represents the corresponding AUROC scores. A Wilcoxon signed-rank test reveals that the ContigNet model with shared weights
significantly outperforms the model with distinct weights, with a p-value of $1.33 \times 10^{-25}$.
3.3.2 ContigNet increased the prediction performance compared to
existing methods
K-mer based alignment-free sequence comparison methods have been widely used for predicting
phage-host associations. Among the methods in this class, the best-performing
alignment-free method currently known is the $d_2^*$ dissimilarity [89]. WIsH [46] is another popular method
for predicting the association status between sequences. It first trains a Hidden Markov Model
(HMM) using reference sequences, and then assigns a score to each new sequence according
to the trained HMM. Since the construction of the HMM does not depend on contig lengths, it can
be used on any pair of contigs. We therefore first compared the performance of $d_2^*$, WIsH
and ContigNet.
We used Afann [154] to calculate the $d_2^*$ dissimilarity between phage and host pairs. The
negative phage-host pairs were selected in the same way as for training our model. To
assess the performance of $d_2^*$ on contigs, we used 200 bps, 1 kbps, 5 kbps and 50 kbps as the
contig lengths for both phage and host. For a given contig length l, we randomly sampled a
contig from each of the phage and host genomes in the
validation set, and the sampling was repeated k = 10 times. With the sampled contig
pairs, we then evaluated the performance of $d_2^*$ using AUROC. The same approach was applied
to WIsH, and the AUROC score for each contig length was recorded. Similarly, we evaluated the
performance of our trained model on contigs of different lengths, with
the contig pairs sampled from the phage-host pairs in the validation set. The
ROC curves for the three methods with different contig lengths are shown in Figure 3.4.
[Figure 3.4: three ROC panels (x-axis: FPR, y-axis: TPR, both from 0.0 to 1.0). AUC values from the panel legends:
(a) $d_2^*$: 50 kbps, 0.684; 5 kbps, 0.631; 1 kbps, 0.589; 200 bps, 0.549
(b) WIsH: 50 kbps, 0.676; 5 kbps, 0.592; 1 kbps, 0.515; 200 bps, 0.485
(c) ContigNet: 50 kbps, 0.830; 5 kbps, 0.836; 1 kbps, 0.808; 200 bps, 0.717]
Figure 3.4: Performance comparison among $d_2^*$, WIsH and ContigNet. (a) The ROC curves
of the $d_2^*$ method on the validation set with contigs of different lengths. (b) The ROC curves of WIsH
on the validation set with contigs of different lengths. (c) The ROC curves of ContigNet on
the validation set with contigs of different lengths.
Figure 3.4(a) shows that the AUROC of $d_2^*$ is markedly lower than that of ContigNet for virus-host contig association predictions. The AUROC score for $d_2^*$ is only 0.684 for contig pairs of 50
kbps and further drops to 0.589 for contig pairs of 1 kbps. Figure 3.4(b) depicts the performance
of WIsH for the same contig lengths and shows a similar trend: the performance degrades drastically for shorter contigs. In comparison, as shown in
Figure 3.4(c), ContigNet achieves a much higher AUROC of at least 0.808 for contig pairs of
1 kbps or longer. Even for contig pairs of length 200 bps, the AUROC is 0.717.
There are several other tools for phage-host association prediction. However, they are not
suitable for the question addressed in this study, as discussed in more detail in the Discussion
section.
3.3.3 The performance of ContigNet increases with both viral and host
contig lengths
Figure 3.4 shows that the performance of ContigNet increases with the contig length when the
viral and host contigs are of the same length. ContigNet can be applied to predict phage-host
associations even if the contigs are of different lengths and we next investigate how the performance of ContigNet changes with the viral or host contig lengths.
To assess the effect of different lengths of phage and host contigs on the performance of our
trained model, we extended the contig length and evaluated the AUROC score of our model with
contig length selected from 200 bps, 500 bps, 1 kbps, 5 kbps, 10 kbps and 50 kbps for phages
and 200 bps, 500 bps, 1 kbps, 5 kbps, 10 kbps, 50 kbps and 100 kbps for hosts. The results are
presented in Figure 3.5. The x-axis represents the host contig length, and y-axis marks the phage
contig length. Figure 3.5 shows an apparent increasing trend of AUROC with respect to both
phage and host contig lengths. When the phage contig length is about 1 kbps, the AUROC
stabilizes at around 0.82 once the host contig length is above 5 kbps. When the phage contig
length is above 5 kbps, the AUROC stabilizes at around 0.83 once the host contig length is
above 10 kbps.
[Figure 3.5: heatmap of AUROC (%); color bar from 60 to 100. Rows: phage contig length; columns: host contig length.]
Phage \ Host   200    500    1k     5k     10k    50k    100k
200            72.3   75.7   76.5   76.3   75.6   78.8   78.2
500            75.1   77.6   77.0   82.2   80.9   82.1   81.9
1k             74.0   78.6   78.0   81.9   82.5   82.8   82.5
5k             76.2   81.2   81.2   82.8   84.7   83.5   82.9
10k            79.4   80.0   83.5   81.7   83.2   83.1   84.9
50k            76.3   80.3   81.9   82.3   82.7   84.4   83.2
Figure 3.5: Heatmap of AUROCs for different phage and host contig lengths. The x-axis
represents host contig lengths and the y-axis represents phage contig lengths. The color gradient
ranges from red to green, representing values from 60 to 100.
This observation can be intuitively explained by the fact that longer contigs contain more
information that can be extracted by our network, resulting in a higher AUROC score.
3.3.4 Assessing the effects of different channels
For a DNA sequence, the information it contains can be classified into two granularities, base
level and codon level. To incorporate these two granularities, a di-path model was utilized, i.e.
two separate convolutional paths for each contig, one path for parsing base information and the
other for codon information.
Here we explore if the di-path model can improve the prediction results and the contributions
of each path to the final result. To assess the contributions, we retrained our model by keeping
only the base or codon path, assessed the performance of the single-path models, and compared
their performance to our original di-path model. Figure 3.6 is the heatmap showing the difference
of AUROC on corresponding contig lengths combinations between the di-path model and each
single-path model. In the heatmap, deeper green means that the di-path model performs better than
the single-path model, whereas red means that the di-path model underperforms.
Figure 3.6 shows that the base-path model does not perform as well as the di-path and codon-path models, while the di-path model and the codon-path model perform similarly. We used
the Wilcoxon signed-rank test to quantify the difference between models. For the base-path
model and the di-path model, we obtained $p = 2.41 \times 10^{-8}$ with the alternative hypothesis that the
AUROC of base-path model was smaller than that of the di-path model. This indicates that the
di-path model performs significantly better than the base-path model. However, when comparing
the codon-path model and the di-path model, we obtained p = 0.472 with a two-sided alternative hypothesis, and thus we cannot reject the null hypothesis that the two models perform the same.
From the observation, we can conclude that the di-path model was not able to provide any
significant advantage over using codon-path alone. Despite the aforementioned result, there was
no apparent drawback of keeping both paths in the model compared to using only the codon
path, and with multiple potential benefits of overparameterization in model convergence and
generalization [155], we kept the di-path structure in our model.
[Figure 3.6: two heatmaps of the AUROC (%) difference between the di-path model and a single-path model; color bar from -10.0 to 10.0. Rows: phage contig length; columns: host contig length.]
(a) Base
Phage \ Host   200    500    1k     5k     10k    50k    100k
200            1.1    2.0    2.2    1.3    -0.4   4.0    2.3
500            0.2    2.9    -0.4   2.7    3.2    3.5    1.9
1k             -0.2   2.4    0.6    3.3    2.4    2.7    1.4
5k             1.2    3.3    1.0    0.8    3.1    2.6    -0.7
10k            6.4    1.5    3.9    2.4    0.5    1.8    3.0
50k            4.2    1.7    2.9    1.5    2.6    3.7    1.5
(b) Codon
Phage \ Host   200    500    1k     5k     10k    50k    100k
200            1.2    2.0    -0.5   -1.1   -1.0   2.2    1.7
500            -0.0   -0.9   -2.6   1.7    1.7    0.5    1.9
1k             -1.7   -1.6   -2.0   -0.3   -0.8   0.1    -0.1
5k             -2.0   -0.2   1.2    -0.1   0.7    -0.4   -0.2
10k            2.5    0.1    1.7    -1.6   0.1    -0.8   0.7
50k            0.9    0.3    -0.3   0.7    -0.0   1.4    1.5
Figure 3.6: Heatmap of the AUROC value difference between ContigNet with di-path
model and (a) base-path only model, and (b) codon-path only model. The x-axis represents
host contig lengths and the y-axis represents phage contig lengths. The color gradient ranges
from red to green, representing values from -10 to 10.
3.3.5 Sequencing errors and chimeric contigs decrease the performance
of ContigNet
Figures 3.7 and 3.8 show the performance of ContigNet trained on the error-free training set and
tested on the validation set with artificially introduced errors. Figure 3.7 shows the change of performance with different levels of simulated sequencing error rates. The performance of ContigNet
decreases with both substitution and insertion/deletion rates as expected. However, even with
very high values of insertion/deletion rate δ = 0.1 and substitution rate µ = 0.1, ContigNet still
maintains a high AUROC score of above 0.805, a slight decrease from 0.840 when no errors are
introduced to the validation set.
[Figure 3.7: line plot. x-axis: substitution rate µ, 0.00 to 0.10; y-axis: AUROC, 0.805 to 0.840; legend: δ = 0, δ = 0.05, δ = 0.1, No error.]
Figure 3.7: Line plot describing the AUROC score of ContigNet under varying levels of
artificially introduced sequencing errors in the test set. The x-axis represents µ from 0 to
0.1, and each solid line corresponds to a different level of δ, the insertion/deletion rate, with value
0, 0.05 and 0.1. The dashed line represents the baseline, where no errors were introduced. The
y-axis denotes the AUROC score.
Figure 3.8 shows the AUROC scores of ContigNet with different levels of artificial chimeras
for phage contigs only, host contigs only and both phage and host contigs. The dashed line
shows the AUROC score of ContigNet with no chimeras as the baseline. Both the phage and host
contig lengths were set at 5 kbps. The figure shows that ContigNet is robust to phage chimeras,
with the AUROC staying close to that of ContigNet with no chimeras. The impact of host chimeras
on ContigNet is larger than that of phage chimeras. ContigNet still maintains an AUROC above
0.805 when the fraction of sequence from other contigs is below 0.20. The different impacts of
host chimeras and phage chimeras on ContigNet can possibly be attributed to the inherent
differences in diversity between phages and hosts. The phages are more diverse and are more likely
to exchange genetic materials compared to hosts.
[Figure 3.8: line plot. x-axis: chimera portion, 0.000 to 0.200; y-axis: AUROC, 0.805 to 0.840; legend: Virus Chimera, Host Chimera, Both Chimera, No Chimera.]
Figure 3.8: Line plot describing the AUROC score of ContigNet under varying levels of
artificially introduced assembly errors in the test set. The x-axis represents the chimera
portion from 0 to 0.2, and the solid lines represent the cases where only the virus is perturbed, only the host is perturbed, or
both virus and host are perturbed. The dashed line represents the baseline, where no chimeras
were introduced. The y-axis denotes the AUROC score.
3.3.6 Performance of ContigNet on new datasets
To further validate that ContigNet works well on novel data, we tested ContigNet on two separate
databases. The first database is the Metagenomic Gut Virus (MGV) catalogue [149]. The investigators used multiple viral-informative features, including the presence/absence of viral protein
families, the presence of viral nucleotide signatures, and multiple adjacent genes on the same
strand [149], to identify viral contigs in 11,810 distinct human gut metagenomic samples.
Viral genomes with completeness greater than 50% were then selected from all identified viral
genomes, and their hosts were identified using CRISPR-spacer matches and whole-genome alignment. Both methods require near-exact matches, so the predicted hosts for the viral genomes
have high specificity. We collected the phage-host pairs from the MGV database whose hosts
were predicted at the genus level, giving a total of 77,348 phage-host pairs. To
remove phages with high similarity to phages in Virus-Host DB, we clustered the phage genomes
from Virus-Host DB and MGV together. The total number of phages was around 80,000, and CD-HIT was not
able to give a result in a reasonable time because it did not fully utilize multi-threading. We therefore applied
MMseqs2 [156] using local sequence identity with a sequence identity threshold of 95%
and 50% alignment coverage of the shorter sequence, the same parameters we used when
clustering the Virus-Host DB alone. Phage-host pairs from MGV whose phages appeared in the
same clusters as phages from Virus-Host DB were removed. Phage-host pairs with hosts under
the same genus as hosts from Virus-Host DB were also removed. The resulting test set contains
39,916 phage-host pairs with 39,916 phages and 134 genera.
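The two removal rules above can be sketched as a simple filter over the clustering output. This is a minimal illustration, not the code used in this work: the function name `filter_novel_pairs` and the input structures (a phage-to-cluster mapping, the set of Virus-Host DB phage identifiers, and the set of Virus-Host DB host genera) are assumptions made for the example.

```python
def filter_novel_pairs(mgv_pairs, cluster_of, vhdb_phages, vhdb_genera):
    """Keep MGV pairs whose phage cluster and host genus are both absent from training.

    mgv_pairs:   list of (phage_id, host_genus) tuples from MGV
    cluster_of:  dict mapping every phage_id (both databases) to a cluster label
    vhdb_phages: set of phage_ids that belong to Virus-Host DB
    vhdb_genera: set of host genera that appear in Virus-Host DB
    """
    # Clusters that contain at least one Virus-Host DB phage.
    seen_clusters = {cluster_of[phage] for phage in vhdb_phages}
    kept = []
    for phage, genus in mgv_pairs:
        if cluster_of[phage] in seen_clusters:
            continue  # phage is too similar to a training phage
        if genus in vhdb_genera:
            continue  # host genus already appears in the training data
        kept.append((phage, genus))
    return kept
```

Here `cluster_of` would be derived from the cluster assignments produced by the clustering tool; how that file is parsed is left out because it depends on the tool's output format.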
The ContigNet model was trained on the entire Virus-Host DB and tested on the MGV dataset.
The AUROC results for different combinations of phage and host contig lengths are shown in
Figure 3.9. As a baseline, we tested $d_2^*$ and WIsH on the MGV dataset with 50 kbps contigs.
Positive and negative pairs were selected in the same way as described above. The
AUROC scores were 0.516 and 0.518, respectively. Shorter contigs were not tested because the
AUROC scores for the longer contigs were already low. In comparison, the AUROC of ContigNet
with 200 bps contig pairs is 0.601, a significant improvement over $d_2^*$ and WIsH.
With contig pairs of 50 kbps, the AUROC increases further to 0.698.
The score converges as we further increase the contig lengths, remaining at 0.698 for a
phage contig of 50 kbps and a host contig of 100 kbps. The AUROC scores are generally lower than those in Figure
3.5, probably due to sequencing errors and chimeric contigs.
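For readers less familiar with the metric, the AUROC reported above equals the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auroc` and the quadratic-time loop are choices made for clarity here, not the evaluation code used in this work.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) score pairs ranked correctly.

    Ties contribute 0.5. This runs in O(len(pos) * len(neg)) time, which is
    fine for illustration but slow for large test sets.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

Under this definition a value of 0.5 corresponds to random ranking, which is why baseline scores close to 0.5 indicate almost no discriminative power.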
[Figure 3.9 heatmap values: AUROC (%); rows are phage contig lengths, columns are host contig lengths.]
Phage \ Host   200    500    1k     5k     10k    50k    100k
200            60.1   60.9   62.0   62.7   63.1   63.4   63.4
500            61.4   63.1   63.7   65.1   65.7   66.1   66.1
1k             62.2   63.8   64.9   66.4   66.7   67.2   67.5
5k             63.1   65.5   66.7   68.3   68.4   69.3   69.9
10k            63.3   65.3   66.6   68.2   69.0   69.4   69.7
50k            62.2   64.9   66.1   68.4   68.8   69.8   69.8
Figure 3.9: Heatmap of AUROC for different contig lengths on the MGV dataset, using a
model trained with the Virus-Host DB. The x-axis represents host contig lengths and the
y-axis represents phage contig lengths. The color gradient ranges from red to green, representing
values from 60 to 100.
The second database we used was PLSDB [150]. We asked whether the ContigNet model we developed can also be used to predict plasmid-host associations. For each plasmid
entry in PLSDB, a corresponding host species is provided. To evaluate the performance of our model on PLSDB, we first trained it on the entire Virus-Host DB. For each
positive plasmid-host pair in PLSDB, a different host species was randomly selected to construct
a negative plasmid-host pair. The AUROC was then calculated for different combinations of plasmid and host
contig lengths ranging from 200 bps to 5 kbps. Because plasmids and phages are two different classes of MGEs, there was
no overlap between the Virus-Host DB training set and the PLSDB test set, so no
redundancy removal steps were required. The results are shown in Figure 3.10(a). The figure
shows that the resulting AUROCs range from 0.734 to 0.85. For comparison, we calculated the
AUROC scores when using $d_2^*$ and WIsH to predict the association status for contig
lengths of 200 bps, 1 kbps, 5 kbps and 50 kbps. The results are shown in Figure
3.10(b) and (c), where the AUROC ranged from 0.489 to 0.771. By comparing the results, we can
see that ContigNet provides a significant improvement even though the model was trained on a
totally different dataset.
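The construction of negative pairs described above can be sketched as follows. This is an illustrative version under stated assumptions: the function name `sample_negative_pairs` is hypothetical, candidate hosts are drawn from the host species that occur in the positive set, and any species already known to host the plasmid is excluded, which is slightly stricter than only requiring a different species.

```python
import random


def sample_negative_pairs(positive_pairs, rng):
    """For each positive (plasmid, host) pair, draw one host the plasmid is not paired with.

    Assumption: the candidate pool is the set of host species seen in `positive_pairs`.
    """
    hosts = sorted({host for _, host in positive_pairs})
    known_hosts = {}
    for plasmid, host in positive_pairs:
        known_hosts.setdefault(plasmid, set()).add(host)
    negatives = []
    for plasmid, _ in positive_pairs:
        candidates = [h for h in hosts if h not in known_hosts[plasmid]]
        if not candidates:
            raise ValueError(f"no alternative host available for {plasmid}")
        negatives.append((plasmid, rng.choice(candidates)))
    return negatives
```

Drawing one negative per positive keeps the two classes balanced, so the AUROC is computed over equally sized sets of positive and negative pairs.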
[Figure 3.10(a), ContigNet: heatmap values, AUROC (%); rows are plasmid contig lengths, columns are host contig lengths.]
Plasmid \ Host   200    500    1k     5k     10k    50k    100k
200              73.4   76.9   77.6   79.3   80.3   78.7   80.1
500              76.2   79.6   82.0   82.4   84.2   84.2   84.6
1k               79.1   81.6   83.4   85.3   86.4   85.1   86.2
5k               79.7   82.9   84.6   86.3   86.6   88.3   88.4
10k              78.8   84.2   84.8   87.5   87.6   88.5   88.5
50k              76.7   81.9   82.4   83.3   85.3   84.3   85.0
[Figure 3.10(b), $d_2^*$: ROC curves (TPR versus FPR). 50 kbps, AUC = 0.771; 5 kbps, AUC = 0.708; 1 kbps, AUC = 0.648; 200 bps, AUC = 0.572.]
[Figure 3.10(c), WIsH: ROC curves (TPR versus FPR). 50 kbps, AUC = 0.770; 5 kbps, AUC = 0.645; 1 kbps, AUC = 0.531; 200 bps, AUC = 0.488.]
Figure 3.10: ContigNet can be used to predict plasmid-host associations with high accuracy. (a) Heatmap of AUROC for different contig lengths on the plasmid dataset (PLSDB) for the model
trained with the virus dataset (Virus-Host DB). The x-axis represents host contig lengths and the y-axis represents plasmid contig lengths. (b) The ROC curves of the $d_2^*$ method on PLSDB for contigs
with different lengths. (c) The ROC curves of WIsH on PLSDB for contigs with different lengths.
The performance of ContigNet on PLSDB may seem unexpected, considering the distinct lifestyles
of plasmids and phages. However, from an evolutionary perspective, both MGEs depend on the cellular machinery of their hosts to replicate and have to adapt to the internal environment of a particular host,
including its nucleotide, codon, and di-codon biases. The experiment above suggests that our
model is able to capture this relationship between phages and hosts and to apply it to plasmid-host association prediction.
3.3.7 Computational cost
The model training process is designed to preload the entire Virus-Host DB and corresponding
hosts into memory to minimize I/O overhead. Therefore, it is recommended to use a machine
with more than 128GB of memory. In our case, we utilized a workstation equipped with an AMD
EPYC 7742 CPU and an NVIDIA RTX 2080Ti GPU. This configuration enabled us to complete the
training in less than 5 hours and the testing in under 1 hour.
3.4 Discussion
In this chapter, we present ContigNet, a deep learning-based method for predicting phage-host
contig interactions, the first deep learning method for this task. We showed that ContigNet outperforms other methods, such as the k-mer-based $d_2^*$ and the Markov-chain-based WIsH, for predicting phage-host contig associations.
3.4.1 Software should be selected carefully for the particular task
Several state-of-the-art tools are available for phage-host association prediction, including PHP
[92], HoPhage (consisting of HoPhage-G and HoPhage-S) [95], VPF-Class [93], VHM-Net [47],
vHULK [90], RaFAH [91] and HostG [94]. Each of these tools has limitations that make it
unsuitable for predicting phage-host contig associations in natural metagenomic
settings. For instance, HoPhage-G, VPF-Class, VHM-Net, vHULK and RaFAH are multiclass classifiers with a fixed set of candidate hosts and thus cannot be used for phage-host contig association
prediction. PHP is another k-mer-based method and was shown to underperform WIsH when
the viral contigs were shorter than 10 kbps [92]. HoPhage-S uses coding sequences to build hidden Markov models, but short contigs may not contain coding sequences. HostG uses BLASTN
matches to construct phage-host connections when building the graph and when integrating new hosts
into the graph. However, such matches can be scarce between contigs. Therefore, none of the aforementioned tools is suitable for predicting reference-independent, contig-level phage-host
associations.
On the other hand, ContigNet can also take whole phage and host genomes as input because it supports contigs of any length. We compared its performance with other methods on
whole genomes following the experimental steps described in [94], testing the methods on the whole dataset and on the subset of phage-host pairs without alignment results.
The results are shown in Figure 3.11 and Figure 3.12. From the figures we can observe that
ContigNet still outperforms WIsH. However, ContigNet underperforms other state-of-the-art approaches for phage-host genome association that were optimized for that purpose. Therefore,
users are advised to use existing methods specialized for whole-genome association if the
whole host genome is available.
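Accuracy at different taxonomic levels, as used in this comparison, can be illustrated with a short sketch: a prediction counts as correct at a given rank if the predicted host and the true host share the same taxon at that rank. The function name `accuracy_by_rank` and the dictionary-based lineage representation are assumptions made for this example, not the evaluation code of [94].

```python
RANKS = ("genus", "family", "order", "class", "phylum")


def accuracy_by_rank(predicted, truth):
    """Fraction of phages whose predicted host matches the true host at each rank.

    predicted, truth: dicts mapping phage_id -> {rank: taxon_name}.
    A phage without a prediction counts as incorrect at every rank.
    """
    if not truth:
        raise ValueError("truth must contain at least one phage")
    result = {}
    for rank in RANKS:
        correct = sum(
            1
            for phage, lineage in truth.items()
            if phage in predicted and predicted[phage].get(rank) == lineage[rank]
        )
        result[rank] = correct / len(truth)
    return result
```

Because a host that matches at the genus level also matches at every higher rank, accuracy computed this way can only stay the same or increase from genus to phylum.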
[Figure 3.11 bar chart: prediction accuracy (y-axis, 0 to 1) at each taxonomic level (Genus, Family, Order, Class, Phylum) for WIsH, PHP, HoPhage, VPF-Class, VHM-net, vHULK, RaFAH, BLASTN, HostG and ContigNet.]
Figure 3.11: Grouped bar chart illustrating the accuracy of host prediction for whole
genomes across various taxonomic levels, from genus to phylum. The x-axis represents
the taxonomic levels, arranged from lower to higher ranks, while the y-axis indicates the prediction accuracy of each method at each taxonomic level. The bars represent
different phage-host association prediction methods (WIsH, PHP, HoPhage, VPF-Class, VHM-Net, vHULK, RaFAH, BLASTN, HostG, ContigNet) and are grouped together at each taxonomic
level. Notably, for the task of assigning a viral contig to a candidate host, the performance of
ContigNet is comparable to that of WIsH, but not as high as that of the other methods.
[Figure 3.12 bar chart: prediction accuracy (y-axis, 0 to 1) at each taxonomic level (Genus, Family, Order, Class, Phylum) for WIsH, PHP, VHM-net, HoPhage, VPF-Class, vHULK, RaFAH, HostG and ContigNet.]
Figure 3.12: Grouped bar chart illustrating the accuracy of host prediction for whole
genomes of phage-host pairs without alignment results across various taxonomic levels, from genus to phylum. The x-axis represents the taxonomic levels, arranged from lower to
higher ranks, while the y-axis indicates the prediction accuracy of each method at
each taxonomic level. The bars represent different phage-host association prediction methods (WIsH, PHP, VHM-Net, HoPhage, VPF-Class, vHULK, RaFAH, HostG, ContigNet)
and are grouped together at each taxonomic level. The overall relation between the accuracies of
the different methods remains the same as in Figure 3.11.
With the incorporation of shared weights in the convolutional paths for phage and host, we
observed increased performance and generalizability. If we consider ContigNet as consisting of
two components, the convolutional layers and the linear layers, the convolutional layers can be
considered a feature extractor and the linear layers a classifier accepting
the 512 features extracted from each of the phage and host contigs. The applications of
the extracted features are not restricted to phage-host association prediction itself:
the derived model can also be used for plasmid-host contig association prediction, a completely
different problem. Our results show that the model has great potential to be adapted for much
broader applications, either by directly using it for sequence feature extraction or by using it
as a foundation and fine-tuning its parameters for new problems through transfer learning.
Therefore, in our software, we provide a feature extraction mode as well as the trained model so
that other investigators can use the features to train new models for different questions.
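The split into a shared feature extractor and a classifier can be made concrete with a deliberately tiny, dependency-free sketch. It is not the ContigNet architecture: the filter count, filter width, random weights, and function names (`init_filters`, `extract_features`, `association_score`) are all assumptions chosen to keep the example small. What it does show is the weight sharing, since the same filters are applied to both the phage and the host sequence before a linear layer scores the concatenated features.

```python
import math
import random

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values; unknown bases become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def init_filters(n_filters, width, rng):
    """Hypothetical random filters; a real model would learn these by training."""
    return [[[rng.uniform(-1.0, 1.0) for _ in BASES] for _ in range(width)]
            for _ in range(n_filters)]


def extract_features(seq, filters):
    """Shared feature extractor: 1D convolution followed by global max pooling."""
    encoded = one_hot(seq)
    features = []
    for filt in filters:
        width = len(filt)
        if len(encoded) < width:
            raise ValueError("sequence is shorter than the filter width")
        responses = [
            sum(filt[i][j] * encoded[start + i][j]
                for i in range(width) for j in range(len(BASES)))
            for start in range(len(encoded) - width + 1)
        ]
        features.append(max(responses))
    return features


def association_score(phage, host, filters, weights, bias=0.0):
    """Classifier: one linear layer plus a sigmoid over the concatenated features."""
    # The same `filters` object is used for both inputs, i.e. the weights are shared.
    features = extract_features(phage, filters) + extract_features(host, filters)
    logit = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
filters = init_filters(n_filters=8, width=3, rng=rng)
weights = [rng.uniform(-0.5, 0.5) for _ in range(2 * len(filters))]
score = association_score("ACGTACGTAGCTAGCT", "TTGACGTACGATCGAT", filters, weights)
```

In this picture, a feature extraction mode amounts to calling only `extract_features` and handing the resulting vector to a different downstream model, and global max pooling is what lets inputs of any length map to a fixed-size feature vector.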
However, there are still more valuable topics to explore for this method in the future. Poor
explainability has long been a shortcoming for deep learning related methods in different applications, and this applies to our model too. In traditional methods, the features extracted from a
sequence have clear meanings, for example, k-mer frequencies are the frequencies of oligonucleotides in the given sequence. Among the 512 features extracted from a sequence, 256 of them
are from base information and 256 of them are from codon information, but there is no clear
biological meaning for each individual feature. Therefore, an explainable feature extractor is a
topic for future studies.
In conclusion, ContigNet showed competitive performance in identifying the relationship
between phage and host contigs. It can be a useful tool for biological researchers studying
novel metagenomic samples from diverse natural environments, particularly those poorly
represented in genomic databases.
Chapter 4
Conclusions and future work
4.1 Advancements in metagenomic sequence classification
and association prediction
In this dissertation, we have presented our efforts towards computationally addressing the challenges associated with shotgun metagenomics.
DeepMicroClass, as a versatile multi-class classifier, enables the accurate classification of five
different metagenomic sequence types in one shot while avoiding the time-consuming and
error-prone preprocessing steps that could propagate errors to the final classification.
The inclusive modeling of all common sequence types in metagenomes also lets DeepMicroClass attain better performance than other state-of-the-art individual predictors because of reduced cross-misclassification. We also detected high relative abundances of marine eukaryotes
in a daily time-series dataset, which were underestimated by alignment-based classifiers due to
the limitations of public reference databases. Our case study indicates that both host and viral
sequences are essential components of cellular metagenomes, and that robust ecological patterns
can be obtained with DeepMicroClass even for coarse sequence types. We argue that by using
DeepMicroClass as a preliminary classification step on metagenomic/viromic assemblies, one can
then focus on the sequence types of interest in downstream analyses, such as metagenomic
binning of prokaryotic or eukaryotic contigs or comparative genomic analysis of viral or plasmid
sequences. We conclude that DeepMicroClass achieves higher performance than the other benchmarked predictors, and that its application can facilitate studies of under-appreciated sequence types,
such as microbial eukaryotic or viral sequences.
ContigNet, on the other hand, emerges as a robust tool, capable of accurately predicting
the likelihood of association between viral and bacterial contigs. Unlike traditional taxonomy
assignment-based prediction methods, ContigNet directly harnesses the wealth of information
embedded within the metagenomics dataset, preventing the exclusion of bacteria not present in
pre-existing databases. In terms of direct comparison capabilities, it stands out, outperforming
existing tools like VirHostMatcher (VHM) and WIsH, especially when it comes to shorter contigs. Moreover, through its intuitive interface, ContigNet requires only two DNA sequences as
input and delivers an association likelihood score ranging between 0 and 1. We advocate for the
integration of ContigNet into metagenomic research pipelines, asserting that it not only outpaces
other predictors in performance but also enriches the research arena, emphasizing sequences that
have traditionally been overlooked, like specific viral or host contigs.
4.2 DeepMicroClass enables potential future advancement
in sequence classification
The journey of DeepMicroClass, since its inception, has been driven by the evolving nature of
metagenomics and the unique computational challenges metagenomics data presents. While
DeepMicroClass has made significant strides in multi-class classification of metagenomic sequences, the dynamic and ever-expanding realm of genomics demands a perpetual evolution of
tools and methodologies.
While iterating on the methods, we observed that the training set is essential
to the performance of the model, even though we made sure that there was no data
leakage. Therefore, with the exponential growth of sequence data, DeepMicroClass could employ
a continuous learning approach, in which the model is regularly updated with new data
so that its predictions remain state-of-the-art.
Beyond mere classification, understanding the functional roles of these microbial entities is
also crucial. The next iteration of DeepMicroClass could incorporate elements of functional prediction, offering insights into not just "what a sequence is", but also "what it does."
For many researchers, the ability to visualize data in an intuitive manner is invaluable. Enhancing the suite of visualization tools that accompany DeepMicroClass can aid in better data
interpretation and insights. In the same sense, simplifying the user experience, through a graphical user interface (GUI) or more streamlined command-line operations, can also make DeepMicroClass accessible to a broader audience, including researchers with minimal computational
backgrounds.
In conclusion, the potential of DeepMicroClass is boundless. As genomics and metagenomics
continue to evolve, so too must our computational tools. With focused development, DeepMicroClass has the potential to remain at the forefront of metagenomic classification, offering invaluable insights to researchers across the globe.
4.3 Expanding datasets unlock new potential for ContigNet
The trajectory of computational methodologies has always been intertwined with advancements
in sequencing technologies. As the horizon of sequencing continues to expand, particularly with
the advent of long-read sequencing, there’s a compelling need to re-engineer our computational
tools to ensure they harness the full potential of the data they process.
For instance, while short read sequencers like Illumina have dominated the genomics landscape for years, the recent advancements in long-read sequencing technologies, such as the Oxford Nanopore R10.4, have shown promising results, often rivaling the accuracy and error rate
of their short-read counterparts [157, 158, 159]. Yet, they provide a much broader canvas by offering substantially longer reads. This paradigm shift, albeit advantageous, also presents unique
computational challenges.
It’s evident from our findings with ContigNet that our current algorithms are optimized for
shorter contigs, showing a performance saturation beyond 5 kbps. However, with the average
read length of Oxford Nanopore now reaching up to 20 kbps and pushing boundaries with reads as
long as 4 Mbps, it is imperative to re-evaluate and re-structure our models. The vast information
carried by these longer sequences should not be left unexplored.
Therefore, as the nature of sequencing data transforms, so should our approach to analysis. To
truly harness the power of long reads, there’s a pressing need to evolve our algorithms, making
them more sensitive to the intricacies of longer sequences. This not only means adapting our
current algorithms but also innovating novel methods tailored explicitly for longer reads. By
synchronizing computational advancements with experimental methods, we can pave the way
for more profound insights into genomics and a more holistic understanding of the intricate web
of life.
Besides long-read sequencing, as we have shown in Section 3.3.6, our model, when trained
solely on virus-host pairs, already shows a commendable capability to predict
plasmid-host associations. Given this potential, we expect that with data from more
relationships among kingdoms, such as plasmid-host, prokaryote-prokaryote, and prokaryote-eukaryote associations, we might be able to extend the model into a universal predictor capable
of discerning a wide variety of organismal associations.
Another potential improvement to the ContigNet model stems from the fact that the microbiome community is ever changing and evolving, so a historical infection by a bacteriophage
does not guarantee a future infection. The CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) system in bacteria is a natural adaptive immune system that provides defense
against viruses that have previously invaded the bacteria. It is a great tool for revealing previous
infections. However, to look at what is currently happening, metatranscriptomic
data can also be used, allowing us to study not only the history but also the present, and ultimately guiding us
to predict the future.
Chapter 5
Supplementary materials
5.1 Supplementary materials for Chapter 2
Dataset  Prok  ProkVir  Plas  Euk  EukVir  PROK  EUK  PROK:EUK  Prok:ProkVir:Plas | Euk:EukVir
DS_1     643   129      129   83   17      901   100  9:1       5:1:1 | 5:1
DS_2     600   150      150   80   20      900   100  9:1       4:1:1 | 4:1
DS_3     540   180      180   75   25      900   100  9:1       3:1:1 | 3:1
DS_4     450   225      225   67   33      900   100  9:1       2:1:1 | 2:1
DS_5     500   100      100   250  50      700   300  7:3       5:1:1 | 5:1
DS_6     467   117      117   240  60      701   300  7:3       4:1:1 | 4:1
DS_7     420   140      140   225  75      700   300  7:3       3:1:1 | 3:1
DS_8     350   175      175   200  100     700   300  7:3       2:1:1 | 2:1
DS_9     357   71       71    417  83      499   500  5:5       5:1:1 | 5:1
DS_10    333   83       83    400  100     499   500  5:5       4:1:1 | 4:1
DS_11    300   100      100   375  125     500   500  5:5       3:1:1 | 3:1
DS_12    250   125      125   333  167     500   500  5:5       2:1:1 | 2:1
DS_13    214   43       43    583  117     300   700  3:7       5:1:1 | 5:1
DS_14    200   50       50    560  140     300   700  3:7       4:1:1 | 4:1
DS_15    180   60       60    525  175     300   700  3:7       3:1:1 | 3:1
DS_16    150   75       75    467  233     300   700  3:7       2:1:1 | 2:1
DS_17    71    14       14    750  150     99    900  1:9       5:1:1 | 5:1
DS_18    67    17       17    720  180     101   900  1:9       4:1:1 | 4:1
DS_19    60    20       20    675  225     100   900  1:9       3:1:1 | 3:1
DS_20    50    25       25    600  300     100   900  1:9       2:1:1 | 2:1
Table 5.1: The composition of test datasets used in this study for benchmarking different
tools. PROK includes prokaryotic genomes, plasmids and prokaryotic viruses; EUK includes eukaryotic genomes and viruses. Prok: prokaryotic genomes, ProkVir: prokaryotic viruses/phages,
Plas: plasmids, Euk: eukaryotic genomes, EukVir: eukaryotic viruses.
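The counts in Table 5.1 are consistent with splitting a nominal total of 1,000 sequences by the listed ratios and rounding each class independently, which would explain why some PROK totals are 901, 701, 499, 99 or 101. The sketch below reproduces the counts under that assumption; the total of 1,000, the half-up rounding rule, and the function names `split_by_ratio` and `compose_dataset` are inferred from the table for illustration rather than stated in the text.

```python
def split_by_ratio(total, ratio):
    """Split `total` according to `ratio`, rounding each share half up independently."""
    denom = sum(ratio)
    # Integer form of floor(total * r / denom + 0.5), which avoids float rounding.
    return [(2 * total * r + denom) // (2 * denom) for r in ratio]


def compose_dataset(prok_to_euk, prok_ratio, euk_ratio, total=1000):
    """Return per-class counts for one test dataset (assumed nominal total of 1,000)."""
    prok_total, euk_total = split_by_ratio(total, prok_to_euk)
    prok, prokvir, plas = split_by_ratio(prok_total, prok_ratio)
    euk, eukvir = split_by_ratio(euk_total, euk_ratio)
    return {
        "Prok": prok, "ProkVir": prokvir, "Plas": plas,
        "Euk": euk, "EukVir": eukvir,
        "PROK": prok + prokvir + plas, "EUK": euk + eukvir,
    }


ds_1 = compose_dataset((9, 1), (5, 1, 1), (5, 1))
ds_18 = compose_dataset((1, 9), (4, 1, 1), (4, 1))
```

Because each class is rounded on its own, the class counts need not sum exactly to the nominal group total, which matches the small deviations visible in the PROK column.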
[Figure 5.1 plots: accuracy and F1 score (each on a y-axis from 0 to 1) for the predictors DMC, Tiara and Whokaryote. Accuracy panel: pairwise p-values 9.5e-05, 1.9e-06, 1.9e-06; ANOVA, p = 5.7e-16. F1 score panel: pairwise p-values 1.9e-06, 1.9e-06, 1.9e-06; ANOVA, p < 2.2e-16.]
Figure 5.1: Performance of DeepMicroClass, Tiara and Whokaryote on eukaryotic sequence classification. Both the accuracy and F1 score were compared based on 20 designed test
datasets. The sequence class composition of the 20 test datasets can be found in Table 5.1. Values
on top of the pairwise comparisons are Bonferroni adjusted t-test p-values. The significance of the
overall ANOVA test is shown in the bottom left corner.
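As a reminder of what the Bonferroni adjustment in these captions does, each raw p-value is multiplied by the number of comparisons and capped at 1. The sketch below shows only this adjustment step; the function name `bonferroni_adjust` is hypothetical, and the raw p-values are assumed to come from whatever pairwise test was run beforehand.

```python
def bonferroni_adjust(p_values):
    """Bonferroni correction: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With three pairwise comparisons, for example, a raw p-value of 0.01 becomes 0.03 after adjustment, and any raw value above one third is reported as 1.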
[Figure 5.2 bar charts: number of misclassified samples per dataset (DS_1 to DS_20) for Tiara (y-axis 0 to 100) and Whokaryote (y-axis 0 to 500); error categories: Prok->Euk, ProkVir->Euk, Euk->Non-Euk, EukVir->Euk, Plas->Euk.]
Figure 5.2: The distribution of misclassified sequence types by Tiara and Whokaryote.
The sequence composition of these datasets can be found in Table 5.1. To make the figure more
visible, the range of y-axis is from 0 to 100 for Tiara and from 0 to 500 for Whokaryote.
[Figure 5.3 plots: accuracy and F1 score (each on a y-axis from 0 to 1) for the predictors DMF, PlasFlow and PPR. Accuracy panel: pairwise p-values 6.8e-08, 6.8e-08, 0.062; ANOVA, p < 2.2e-16. F1 score panel: pairwise p-values 1.5e-11, 1.5e-11, 0.7; ANOVA, p = 1e-13.]
Figure 5.3: Performance of DeepMicroClass, PlasFlow, PPR-Meta and geNomad on plasmid sequence classification. Both the accuracy and F1 score were compared based on 20 designed test datasets. The sequence class composition of the 20 test datasets can be found in Table
5.1. Values on top of the pairwise comparisons are Bonferroni adjusted t-test p-values. The significance of the overall ANOVA test was shown on the bottom left corner.
[Figure 5.4 bar charts: number of misclassified samples per dataset (DS_1 to DS_20) for PlasFlow, PPR-Meta and geNomad (y-axis 0 to 400 in each panel); error categories: Prok->Plas, ProkVir->Plas, Euk->Plas, EukVir->Plas, Plas->Non-Plas.]
Figure 5.4: The distribution of misclassified sequence types by PlasFlow, PPR-Meta and
geNomad. The sequence composition of these datasets can be found in Table 5.1. The y-axis
ranges from 0 to 400 for all three panels.
[Figure 5.5 plots: accuracy and F1 score (each on a y-axis from 0 to 1) for the predictors DMC, DVF, VIBRANT, PPR and geNomad. Accuracy panel: pairwise p-values 1.9e-06, 1.9e-06, 9.5e-05, 9.5e-05; ANOVA, p < 2.2e-16. F1 score panel: pairwise p-values 1.9e-06, 1.9e-06, 1.9e-06, 1.9e-06; ANOVA, p < 2.2e-16.]
Figure 5.5: Performance of DeepMicroClass (DMC), DeepVirFinder (DVF), VIBRANT,
PPR-Meta (PPR) and geNomad on prokaryotic viral sequence classification. Both the
accuracy and F1 score were compared based on 20 designed test datasets. The sequence class
composition of the 20 test datasets can be found in Table 5.1. Values on top of the pairwise
comparisons are Bonferroni adjusted t-test p-values. The significance of the overall ANOVA test
is shown in the bottom left corner.
[Figure 5.6 plots: accuracy and F1 score (each on a y-axis from 0 to 1) for the predictors DMC and VirSorter. Accuracy panel: pairwise p-value 9.5e-05; ANOVA, p < 2.2e-16. F1 score panel: pairwise p-value 1.9e-06; ANOVA, p < 2.2e-16.]
Figure 5.6: Performance of DeepMicroClass and VirSorter2 on prokaryotic and eukaryotic viral sequence classification. Both the accuracy and F1 score were compared based on
20 designed test datasets. The sequence class composition of the 20 test datasets can be found in
Table 5.1. Values on top of the pairwise comparisons are Bonferroni adjusted t-test p-values. The
significance of the overall ANOVA test was shown on the bottom left corner.
[Figure 5.7 bar charts: number of misclassified samples per dataset (DS_1 to DS_20) for PPR-Meta (y-axis 0 to 500), DeepVirFinder (0 to 500), VIBRANT (0 to 50), geNomad (0 to 400) and VirSorter2 (0 to 80). Error categories for PPR-Meta, DeepVirFinder, VIBRANT and geNomad: Prok->ProkVir, Euk->ProkVir, EukVir->ProkVir, Plas->ProkVir, ProkVir->NonProkVir. Error categories for VirSorter2: Prok->Vir, Euk->Vir, Plas->Vir, ProkVir->NonVir, EukVir->NonVir.]
Figure 5.7: The distribution of misclassified sequence types by PPR-Meta, DeepVirFinder, VIBRANT, geNomad and VirSorter2. For PPR-Meta, DeepVirFinder, VIBRANT
and geNomad, only prokaryotic viruses are considered as the positive set, and for VirSorter2 both
prokaryotic and eukaryotic viruses are considered positive. To make the figure easier to read, the
y-axis ranges from 0 to 500 for PPR-Meta and DeepVirFinder, from 0 to 50 for VIBRANT, from 0 to 400 for geNomad and
from 0 to 80 for VirSorter2.
[Figure 5.8 plots: accuracy and F1 score (each on a y-axis from 0.00 to 1.00) for the predictors DMC, PPR and geNomad. Accuracy panel: pairwise p-values 1.9e-06, 1.9e-06; ANOVA, p < 2.2e-16. F1 score panel: pairwise p-values 1.9e-06, 1.9e-06; ANOVA, p < 2.2e-16.]
Figure 5.8: Performance of DeepMicroClass, PPR-Meta and geNomad on prokaryotic host,
prokaryotic virus and plasmid contig classification. Both the accuracy and F1 score were
compared based on 20 designed test datasets. The sequence class composition of the 20 test
datasets can be found in Table 5.1. Values on top of the pairwise comparisons are Bonferroni
adjusted t-test p-values. The significance of the overall ANOVA test is shown in the bottom
left corner.
[Figure 5.9 bar chart: number of misclassified samples per dataset (DS_1 to DS_20) for DeepMicroClass (y-axis 0 to 50); error categories: Euk->EukVir, Euk->Plasmid, Euk->Prok, Euk->ProkVir, EukVir->Euk, EukVir->Plasmid, EukVir->Prok, EukVir->ProkVir, Plasmid->Euk, Plasmid->EukVir, Plasmid->Prok, Plasmid->ProkVir, Prok->Euk, Prok->EukVir, Prok->Plasmid, Prok->ProkVir, ProkVir->Euk, ProkVir->EukVir, ProkVir->Plasmid, ProkVir->Prok.]
Figure 5.9: The distribution of misclassified sequence types by DeepMicroClass. The sequence
composition of these datasets can be found in Table 5.1. The total number of errors is at most 50
for every dataset composition, and we can observe that the major source of error is
the misclassification between prokaryotic hosts and plasmids.
Figure 5.10: Correlation coefficients of Prokaryotic (a), Eukaryotic (b), ProkaryoticViral
(c), and EukaryoticViral (d) sequence relative abundances of different sequence classifiers. Coefficients highlighted in colors are significant ones (p-value < 0.01).
Bibliography
[1] Paul G. Falkowski, Tom Fenchel, and Edward F. Delong. “The Microbial Engines That
Drive Earth’s Biogeochemical Cycles”. In: Science 320.5879 (May 2008), pp. 1034–1039.
issn: 0036-8075, 1095-9203. doi: 10.1126/science.1153213.
[2] Farooq Azam and Alexandra Z. Worden. “Oceanography. Microbes, molecules, and
marine ecosystems”. In: Science (New York, N.Y.) 303.5664 (Mar. 2004), pp. 1622–1624.
issn: 1095-9203. doi: 10.1126/science.1093892.
[3] Didier Raoult and Patrick Forterre. “Redefining viruses: lessons from Mimivirus”. In:
Nature Reviews Microbiology 6.4 (Apr. 2008), pp. 315–319. issn: 1740-1534. doi:
10.1038/nrmicro1858.
[4] C. R. Woese and G. E. Fox. “Phylogenetic structure of the prokaryotic domain: the
primary kingdoms”. In: Proceedings of the National Academy of Sciences of the United
States of America 74.11 (Nov. 1977), pp. 5088–5090. issn: 0027-8424. doi:
10.1073/pnas.74.11.5088.
[5] Norman R. Pace, David A. Stahl, David J. Lane, and Gary J. Olsen. “The Analysis of
Natural Microbial Populations by Ribosomal RNA Sequences”. In: Advances in Microbial
Ecology. Ed. by K. C. Marshall. Advances in Microbial Ecology. Springer US, 1986,
pp. 1–55. isbn: 978-1-4757-0611-6. doi: 10.1007/978-1-4757-0611-6_1.
[6] G. J. Olsen, D. J. Lane, S. J. Giovannoni, N. R. Pace, and D. A. Stahl. “Microbial ecology
and evolution: a ribosomal RNA approach”. In: Annual Review of Microbiology 40 (1986),
pp. 337–365. issn: 0066-4227. doi: 10.1146/annurev.mi.40.100186.002005.
[7] T. M. Schmidt, E. F. DeLong, and N. R. Pace. “Analysis of a marine picoplankton
community by 16S rRNA gene cloning and sequencing”. In: Journal of Bacteriology
173.14 (July 1991), pp. 4371–4378. issn: 0021-9193. doi:
10.1128/jb.173.14.4371-4378.1991.
[8] J. L. Stein, T. L. Marsh, K. Y. Wu, H. Shizuya, and E. F. DeLong. “Characterization of
uncultivated prokaryotes: isolation and analysis of a 40-kilobase-pair genome fragment
from a planktonic marine archaeon”. In: Journal of Bacteriology 178.3 (Feb. 1996),
pp. 591–599. issn: 0021-9193.
[9] Kevin L. Vergin, Ena Urbach, Jeffery L. Stein, Edward F. DeLong, Brian D. Lanoil, and
Stephen J. Giovannoni. “Screening of a Fosmid Library of Marine Environmental
Genomic DNA Fragments Reveals Four Clones Related to Members of the Order
Planctomycetales”. In: Applied and Environmental Microbiology 64.8 (Aug. 1998),
pp. 3075–3078. issn: 0099-2240.
[10] M. R. Rondon, P. R. August, A. D. Bettermann, S. F. Brady, T. H. Grossman, M. R. Liles,
K. A. Loiacono, B. A. Lynch, I. A. MacNeil, C. Minor, et al. “Cloning the soil
metagenome: a strategy for accessing the genetic and functional diversity of uncultured
microorganisms”. In: Applied and Environmental Microbiology 66.6 (June 2000),
pp. 2541–2547. issn: 0099-2240. doi: 10.1128/AEM.66.6.2541-2547.2000.
[11] O. Béjà, M. T. Suzuki, E. V. Koonin, L. Aravind, A. Hadd, L. P. Nguyen, R. Villacorta,
M. Amjadi, C. Garrigues, S. B. Jovanovich, et al. “Construction and analysis of
bacterial artificial chromosome libraries from a marine microbial assemblage”. In:
Environmental Microbiology 2.5 (Oct. 2000), pp. 516–529. issn: 1462-2912. doi:
10.1046/j.1462-2920.2000.00133.x.
[12] Boris A. Legault, Arantxa Lopez-Lopez, Jose Carlos Alba-Casado, W. Ford Doolittle,
Henk Bolhuis, Francisco Rodriguez-Valera, and R. Thane Papke. “Environmental
genomics of “Haloquadratum walsbyi” in a saltern crystallizer indicates a large pool of
accessory genes in an otherwise coherent species”. In: BMC Genomics 7.1 (July 2006),
p. 171. issn: 1471-2164. doi: 10.1186/1471-2164-7-171.
[13] Jo Handelsman, Michelle R Rondon, Sean F Brady, Jon Clardy, and Robert M Goodman.
“Molecular biological access to the chemistry of unknown soil microbes: a new frontier
for natural products”. In: Chemistry & biology 5.10 (1998), R245–R249.
[14] J. Craig Venter, Karin Remington, John F. Heidelberg, Aaron L. Halpern, Doug Rusch,
Jonathan A. Eisen, Dongying Wu, Ian Paulsen, Karen E. Nelson, William Nelson,
et al. “Environmental genome shotgun sequencing of the Sargasso Sea”. In: Science (New
York, N.Y.) 304.5667 (Apr. 2004), pp. 66–74. issn: 1095-9203. doi: 10.1126/science.1093857.
[15] Jo Handelsman. “Metagenomics: application of genomics to uncultured
microorganisms”. In: Microbiology and molecular biology reviews: MMBR 68.4 (Dec. 2004),
pp. 669–685. issn: 1092-2172. doi: 10.1128/MMBR.68.4.669-685.2004.
[16] Thomas J Sharpton. “An introduction to the analysis of shotgun metagenomic data”. In:
Frontiers in plant science 5 (2014), p. 209.
[17] Lu Zhang, FengXin Chen, Zhan Zeng, Mengjiao Xu, Fangfang Sun, Liu Yang,
Xiaoyue Bi, Yanjie Lin, YuanJiao Gao, HongXiao Hao, et al. “Advances in metagenomics
and its application in environmental microorganisms”. In: Frontiers in microbiology 12
(2021), p. 766364.
[18] Christian S Riesenfeld, Patrick D Schloss, and Jo Handelsman. “Metagenomics: genomic
analysis of microbial communities”. In: Annu. Rev. Genet. 38 (2004), pp. 525–552.
[19] Torsten Thomas, Jack Gilbert, and Folker Meyer. “Metagenomics-a guide from sampling
to data analysis”. In: Microbial informatics and experimentation 2 (2012), pp. 1–12.
[20] Alla L Lapidus and Anton I Korobeynikov. “Metagenomic data assembly–the way of
decoding unknown microorganisms”. In: Frontiers in Microbiology 12 (2021), p. 613791.
[21] Gauri S Navgire, Neha Goel, Gifty Sawhney, Mohit Sharma, Prashant Kaushik,
Yugal Kishore Mohanta, Tapan Kumar Mohanta, and Ahmed Al-Harrasi. “Analysis and
Interpretation of metagenomics data: An approach”. In: Biological Procedures Online 24.1
(2022), pp. 1–22.
[22] Muhammad Tariq Pervez, Syed Hassan Abbas, Mahmoud F Moustafa, Naeem Aslam,
Syed Shah Muhammad Shah, et al. “A comprehensive review of performance of
next-generation sequencing platforms”. In: BioMed Research International 2022 (2022).
[23] Richard Durbin, Sean R Eddy, Anders Krogh, and Graeme Mitchison. Biological sequence
analysis: probabilistic models of proteins and nucleic acids. Cambridge university press,
1998.
[24] Saul B Needleman and Christian D Wunsch. “A general method applicable to the search
for similarities in the amino acid sequence of two proteins”. In: Journal of molecular
biology 48.3 (1970), pp. 443–453.
[25] Temple F Smith, Michael S Waterman, et al. “Identification of common molecular
subsequences”. In: Journal of molecular biology 147.1 (1981), pp. 195–197.
[26] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman.
“Basic local alignment search tool”. In: Journal of molecular biology 215.3 (1990),
pp. 403–410.
[27] Nuala A O’Leary, Mathew W Wright, J Rodney Brister, Stacy Ciufo, Diana Haddad,
Rich McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei,
et al. “Reference sequence (RefSeq) database at NCBI: current status, taxonomic
expansion, and functional annotation”. In: Nucleic acids research 44.D1 (2016),
pp. D733–D745.
[28] Dennis A Benson, Mark Cavanaugh, Karen Clark, Ilene Karsch-Mizrachi,
David J Lipman, James Ostell, and Eric W Sayers. “GenBank”. In: Nucleic Acids Research
41.D1 (2012), pp. D36–D42.
[29] Fiona Cunningham, James E Allen, Jamie Allen, Jorge Alvarez-Jarreta,
M Ridwan Amode, Irina M Armean, Olanrewaju Austine-Orimoloye, Andrey G Azov,
If Barnes, Ruth Bennett, et al. “Ensembl 2022”. In: Nucleic acids research 50.D1 (2022),
pp. D988–D995.
[30] UniProt Consortium. “UniProt: a worldwide hub of protein knowledge”. In: Nucleic acids
research 47.D1 (2019), pp. D506–D515.
[31] Maria Chatzou, Cedrik Magis, Jia-Ming Chang, Carsten Kemena, Giovanni Bussotti,
Ionas Erb, and Cedric Notredame. “Multiple sequence alignment modeling: methods and
applications”. In: Briefings in bioinformatics 17.6 (2016), pp. 1009–1023.
[32] Lusheng Wang and Tao Jiang. “On the complexity of multiple sequence alignment”. In:
Journal of computational biology 1.4 (1994), pp. 337–348.
[33] Daniel R Zerbino and Ewan Birney. “Velvet: algorithms for de novo short read assembly
using de Bruijn graphs”. In: Genome research 18.5 (2008), pp. 821–829.
[34] Ruiqiang Li, Hongmei Zhu, Jue Ruan, Wubin Qian, Xiaodong Fang, Zhongbin Shi,
Yingrui Li, Shengting Li, Gao Shan, Karsten Kristiansen, et al. “De novo assembly of
human genomes with massively parallel short read sequencing”. In: Genome research
20.2 (2010), pp. 265–272.
[35] Toshiaki Namiki, Tsuyoshi Hachiya, Hideaki Tanaka, and Yasubumi Sakakibara.
“MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from
short sequence reads”. In: Nucleic Acids Research 40.20 (Nov. 2012), e155. issn: 0305-1048.
doi: 10.1093/nar/gks678.
[36] Ruibang Luo, Binghang Liu, Yinlong Xie, Zhenyu Li, Weihua Huang, Jianying Yuan,
Guangzhu He, Yanxiang Chen, Qi Pan, Yunjie Liu, et al. “SOAPdenovo2: an empirically
improved memory-efficient short-read de novo assembler”. In: Gigascience 1.1 (2012),
p. 18. issn: 2047-217X.
[37] Afiahayati, Kengo Sato, and Yasubumi Sakakibara. “MetaVelvet-SL: an extension of the
Velvet assembler to a de novo metagenomic assembler utilizing supervised learning”. In:
DNA research 22.1 (2015), pp. 69–77.
[38] Ramchalam Kinattinkara Ramakrishnan, Jaspal Singh, and Mathieu Blanchette. “Rlalign:
a reinforcement learning approach for multiple sequence alignment”. In: 2018 IEEE 18th
International Conference on Bioinformatics and Bioengineering (BIBE). IEEE. 2018,
pp. 61–66.
[39] Ming-Feng Hsieh, Chin Lung Lu, and Chuan Yi Tang. “Clover: a clustering-oriented de
novo assembler for Illumina sequences”. In: BMC bioinformatics 21.1 (2020), pp. 1–13.
[40] Kuo-ching Liang and Yasubumi Sakakibara. “MetaVelvet-DL: a MetaVelvet deep learning
extension for de novo metagenome assembly”. In: BMC bioinformatics 22.6 (2021),
pp. 1–21.
[41] Stephanie Schaarschmidt, Axel Fischer, Ellen Zuther, and Dirk K Hincha. “Evaluation of
seven different RNA-seq alignment tools based on experimental data from the model
plant Arabidopsis thaliana”. In: International journal of molecular sciences 21.5 (2020),
p. 1720.
[42] Sigal Leviatan, Saar Shoer, Daphna Rothschild, Maria Gorodetski, and Eran Segal. “An
expanded reference map of the human gut microbiome reveals hundreds of previously
unknown species”. In: Nature communications 13.1 (2022), p. 3863.
[43] Chao Jiang, Xin Wang, Xiyan Li, Jingga Inlora, Ting Wang, Qing Liu, and
Michael Snyder. “Dynamic human environmental exposome revealed by longitudinal
personal monitoring”. In: Cell 175.1 (2018), pp. 277–291.
[44] B Edwin Blaisdell. “A measure of the similarity of sets of sequences not requiring
sequence alignment.” In: Proceedings of the National Academy of Sciences 83.14 (1986),
pp. 5155–5159.
[45] Gesine Reinert, David Chew, Fengzhu Sun, and Michael S Waterman. “Alignment-free
sequence comparison (I): statistics and power”. In: Journal of Computational Biology
16.12 (2009), pp. 1615–1634.
[46] Clovis Galiez, Matthias Siebert, François Enault, Jonathan Vincent, and Johannes Söding.
“WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage
contigs”. In: Bioinformatics 33.19 (2017), pp. 3113–3114.
[47] Weili Wang, Jie Ren, Kujin Tang, Emily Dart, Julio Cesar Ignacio-Espinoza,
Jed A Fuhrman, Jonathan Braun, Fengzhu Sun, and Nathan A Ahlgren. “A
network-based integrated framework for predicting virus–prokaryote interactions”. In:
NAR Genomics and Bioinformatics 2.2 (2020), lqaa044.
[48] Karel Sedlar, Kristyna Kupkova, and Ivo Provaznik. “Bioinformatics strategies for
taxonomy independent binning and visualization of sequences in shotgun
metagenomics”. In: Computational and Structural Biotechnology Journal 15 (2017),
pp. 48–55. issn: 2001-0370. doi: 10.1016/j.csbj.2016.11.005.
[49] Kurt E Williamson, Mark Radosevich, and K Eric Wommack. “Abundance and diversity
of viruses in six Delaware soils”. In: Applied and environmental microbiology 71.6 (2005),
pp. 3119–3125.
[50] Noah Fierer. “Embracing the unknown: disentangling the complexities of the soil
microbiome”. In: Nature Reviews Microbiology 15.10 (2017), pp. 579–590.
[51] Breck A Duerkop and Lora V Hooper. “Resident viruses and their interactions with the
immune system”. In: Nature immunology 14.7 (2013), pp. 654–659.
[52] Herbert W Virgin. “The virome in mammalian physiology and disease”. In: Cell 157.1
(2014), pp. 142–150.
[53] Curtis A. Suttle. “Viruses in the sea”. In: Nature 437.7057 (2005), pp. 356–361. doi:
10.1038/nature04160.
[54] Curtis A. Suttle. “Marine viruses — major players in the global ecosystem”. In: Nature
Reviews Microbiology 5.10 (Oct. 2007), pp. 801–812. issn: 1740-1534. doi:
10.1038/nrmicro1750.
[55] Jed A. Fuhrman. “Marine viruses and their biogeochemical and ecological effects”. In:
Nature 399.6736 (June 1999), pp. 541–548. issn: 1476-4687. doi: 10.1038/21119.
[56] Steven W. Wilhelm and Curtis A. Suttle. “Viruses and Nutrient Cycles in the Sea: Viruses
play critical roles in the structure and function of aquatic food webs”. In: BioScience
49.10 (Oct. 1999), pp. 781–788. issn: 0006-3568. doi: 10.2307/1313569.
[57] Forest Rohwer and Rebecca Vega Thurber. “Viruses manipulate the marine
environment”. In: Nature 459.7244 (2009), pp. 207–212.
[58] Simon Roux, Francois Enault, Bonnie L. Hurwitz, and Matthew B. Sullivan. “VirSorter:
mining viral signal from microbial genomic data”. In: PeerJ 3 (May 2015), e985. issn:
2167-8359. doi: 10.7717/peerj.985.
[59] Jiarong Guo, Ben Bolduc, Ahmed A. Zayed, Arvind Varsani,
Guillermo Dominguez-Huerta, Tom O. Delmont, Akbar Adjie Pratama,
M. Consuelo Gazitúa, Dean Vik, Matthew B. Sullivan, and Simon Roux. “VirSorter2: a
multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses”. In:
Microbiome 9.1 (Feb. 2021), p. 37. issn: 2049-2618. doi: 10.1186/s40168-020-00990-y.
[60] Jie Ren, Nathan A. Ahlgren, Yang Young Lu, Jed A. Fuhrman, and Fengzhu Sun.
“VirFinder: a novel k-mer based tool for identifying viral sequences from assembled
metagenomic data”. In: Microbiome 5 (July 2017), p. 69. issn: 2049-2618. doi:
10.1186/s40168-017-0283-5.
[61] Jie Ren, Kai Song, Chao Deng, Nathan A. Ahlgren, Jed A. Fuhrman, Yi Li, Xiaohui Xie,
Ryan Poplin, and Fengzhu Sun. “Identifying viruses from metagenomic data using deep
learning”. In: Quantitative Biology (Jan. 2020). issn: 2095-4697. doi:
10.1007/s40484-019-0187-4.
[62] Zhencheng Fang, Jie Tan, Shufang Wu, Mo Li, Congmin Xu, Zhongjie Xie, and
Huaiqiu Zhu. “PPR-Meta: a tool for identifying phages and plasmids from metagenomic
fragments using deep learning”. In: GigaScience 8.6 (June 2019). issn: 2047-217X. doi:
10.1093/gigascience/giz066.
[63] Kristopher Kieft, Zhichao Zhou, and Karthik Anantharaman. “VIBRANT: automated
recovery, annotation and curation of microbial viruses, and evaluation of viral
community function from genomic sequences”. In: Microbiome 8.1 (June 2020), p. 90.
issn: 2049-2618. doi: 10.1186/s40168-020-00867-0.
[64] Wojciech Gałan, Maciej Bąk, and Małgorzata Jakubowska. “Host Taxon Predictor - A
Tool for Predicting Taxon of the Host of a Newly Discovered Virus”. In: Scientific Reports
9.1 (Mar. 2019), p. 3436. issn: 2045-2322. doi: 10.1038/s41598-019-39847-2.
[65] James P. J. Hall, A. Jamie Wood, Ellie Harrison, and Michael A. Brockhurst. “Source-sink
plasmid transfer dynamics maintain gene mobility in soil bacterial communities”. In:
Proceedings of the National Academy of Sciences of the United States of America 113.29
(July 2016), pp. 8260–8265. issn: 1091-6490. doi: 10.1073/pnas.1600974113.
[66] Anastasia Kottara, James PJ Hall, Ellie Harrison, and Michael A Brockhurst. “Variable
plasmid fitness effects and mobile genetic element dynamics across Pseudomonas
species”. In: FEMS microbiology ecology 94.1 (2018), fix172.
[67] Julian Davies and Dorothy Davies. “Origins and evolution of antibiotic resistance”. In:
Microbiology and molecular biology reviews 74.3 (2010), pp. 417–433.
[68] Alessandra Carattoli. “Plasmids and the spread of resistance”. In: International journal of
medical microbiology 303.6-7 (2013), pp. 298–304.
[69] Laurence Van Melderen and Manuel Saavedra De Bast. “Bacterial toxin–antitoxin
systems: more than selfish entities?” In: PLoS genetics 5.3 (2009), e1000437.
[70] Jean Cury, Marie Touchon, and Eduardo PC Rocha. “Integrative and conjugative
elements and their hosts: composition, distribution and organization”. In: Nucleic acids
research 45.15 (2017), pp. 8943–8956.
[71] Fengfeng Zhou and Ying Xu. “cBar: a computer program to distinguish plasmid-derived
from chromosome-derived sequence fragments in metagenomics data”. In:
Bioinformatics 26.16 (Aug. 2010), pp. 2051–2052. issn: 1367-4803. doi:
10.1093/bioinformatics/btq299.
[72] Pawel S. Krawczyk, Leszek Lipinski, and Andrzej Dziembowski. “PlasFlow: predicting
plasmid sequences in metagenomic data using genome signatures”. In: Nucleic Acids
Research 46.6 (Apr. 2018), e35–e35. issn: 0305-1048. doi: 10.1093/nar/gkx1321.
[73] G. Royer, J. W. Decousser, C. Branger, M. Dubois, C. Médigue, E. Denamur, and
D. Vallenet. “PlaScope: a targeted approach to assess the plasmidome from genome
assemblies at the species level”. In: Microbial Genomics 4.9 (Sept. 2018). issn: 2057-5858.
doi: 10.1099/mgen.0.000211.
[74] David Pellow, Itzik Mizrahi, and Ron Shamir. “PlasClass improves plasmid sequence
classification”. In: PLoS computational biology 16.4 (Apr. 2020), e1007781. issn:
1553-7358. doi: 10.1371/journal.pcbi.1007781.
[75] Peter Menzel, Kim Lee Ng, and Anders Krogh. “Fast and sensitive taxonomic
classification for metagenomics with Kaiju”. In: Nature Communications 7 (Apr. 2016),
p. 11257. issn: 2041-1723. doi: 10.1038/ncomms11257.
[76] Eli Levy Karin, Milot Mirdita, and Johannes Söding. “MetaEuk-sensitive,
high-throughput gene discovery, and annotation for large-scale eukaryotic
metagenomics”. In: Microbiome 8.1 (2020), p. 48. issn: 2049-2618. doi:
10.1186/s40168-020-00808-x.
[77] Patrick J. Keeling, Fabien Burki, Heather M. Wilcox, Bassem Allam, Eric E. Allen,
Linda A. Amaral-Zettler, E. Virginia Armbrust, John M. Archibald, Arvind K. Bharti,
Callum J. Bell, et al. “The Marine Microbial Eukaryote Transcriptome Sequencing
Project (MMETSP): Illuminating the Functional Diversity of Eukaryotic Life in the
Oceans through Transcriptome Sequencing”. In: PLOS Biology 12.6 (June 2014),
e1001889. issn: 1545-7885. doi: 10.1371/journal.pbio.1001889.
[78] Patrick T. West, Alexander J. Probst, Igor V. Grigoriev, Brian C. Thomas, and
Jillian F. Banfield. “Genome-reconstruction for eukaryotes from complex natural
microbial communities”. In: Genome Research 28.4 (2018), pp. 569–580. issn: 1549-5469.
doi: 10.1101/gr.228429.117.
[79] Alexey Vorobev, Marion Dupouy, Quentin Carradec, Tom O. Delmont, Anita Annamalé,
Patrick Wincker, and Eric Pelletier. “Transcriptome reconstruction and functional
analysis of eukaryotic marine plankton communities via high-throughput
metagenomics and metatranscriptomics”. In: Genome Research 30.4 (Apr. 2020),
pp. 647–659. issn: 1088-9051, 1549-5469. doi: 10.1101/gr.253070.119.
[80] Michał Karlicki, Stanisław Antonowicz, and Anna Karnkowska. “Tiara: deep
learning-based classification system for eukaryotic sequences”. In: Bioinformatics
38.2 (Jan. 2022). Ed. by Inanc Birol, pp. 344–350. issn: 1367-4803, 1460-2059. doi:
10.1093/bioinformatics/btab672.
[81] Lotte J.U. Pronk and Marnix H. Medema. “Whokaryote: distinguishing eukaryotic and
prokaryotic contigs in metagenomes based on gene structure”. In: Microbial Genomics
8.5 (2022), p. 000823. issn: 2057-5858. doi: 10.1099/mgen.0.000823.
[82] John C Wooley, Adam Godzik, and Iddo Friedberg. “A primer on metagenomics”. In:
PLoS computational biology 6.2 (2010), e1000667.
[83] Martha RJ Clokie, Andrew D Millard, Andrey V Letarov, and Shaun Heaphy. “Phages in
nature”. In: Bacteriophage 1.1 (2011), pp. 31–45.
[84] Stephen T Abedon, Sarah J Kuhl, Bob G Blasdel, and Elizabeth Martin Kutter. “Phage
treatment of human infections”. In: Bacteriophage 1.2 (2011), pp. 66–85.
[85] Jeffrey B Jones, Gary E Vallad, Fanny B Iriarte, Aleksa Obradović, Mine H Wernsing,
Lee E Jackson, Botond Balogh, Jason C Hong, and M Timur Momol. “Considerations for
using bacteriophages for plant disease control”. In: Bacteriophage 2.4 (2012), e23857.
[86] MM Doolittle, JJ Cooney, and DE Caldwell. “Tracing the interaction of bacteriophage
with bacterial biofilms using fluorescent and chromogenic probes”. In: Journal of
industrial microbiology and biotechnology 16.6 (1996), pp. 331–341.
[87] Joshua S Weitz, Timothée Poisot, Justin R Meyer, Cesar O Flores, Sergi Valverde,
Matthew B Sullivan, and Michael E Hochberg. “Phage–bacteria infection networks”. In:
Trends in microbiology 21.2 (2013), pp. 82–91.
[88] Rob Knight, Janet Jansson, Dawn Field, Noah Fierer, Narayan Desai, Jed A Fuhrman,
Phil Hugenholtz, Daniel Van Der Lelie, Folker Meyer, Rick Stevens, et al. “Unlocking the
potential of metagenomics through replicated experimental design”. In: Nature
biotechnology 30.6 (2012), pp. 513–520.
[89] Nathan A Ahlgren, Jie Ren, Yang Young Lu, Jed A Fuhrman, and Fengzhu Sun.
“Alignment-free oligonucleotide frequency dissimilarity measure improves prediction of
hosts from metagenomically-derived viral sequences”. In: Nucleic Acids Research 45.1
(2017), pp. 39–53.
[90] Deyvid Amgarten, Bruno Koshin Vázquez Iha, Carlos Morais Piroupo,
Aline Maria da Silva, and João Carlos Setubal. “vHULK, a new tool for bacteriophage
host prediction based on annotated genomic features and deep neural networks”. In:
bioRxiv (2020).
[91] Felipe Hernandes Coutinho, Asier Zaragoza-Solas, Mario López-Pérez, Jakub Barylski,
Andrzej Zielezinski, Bas E Dutilh, Robert Edwards, and Francisco Rodriguez-Valera.
“RaFAH: Host prediction for viruses of Bacteria and Archaea based on protein content”.
In: Patterns 2 (2021), p. 100274.
[92] Congyu Lu, Zheng Zhang, Zena Cai, Zhaozhong Zhu, Ye Qiu, Aiping Wu, Taijiao Jiang,
Heping Zheng, and Yousong Peng. “Prokaryotic virus host predictor: a Gaussian model
for host prediction of prokaryotic viruses in metagenomics”. In: BMC Biology 19 (2021),
p. 5.
[93] Joan Carles Pons, David Paez-Espino, Gabriel Riera, Natalia Ivanova, Nikos C Kyrpides,
and Mercè Llabrés. “VPF-Class: taxonomic assignment and host prediction of
uncultivated viruses based on viral protein families”. In: Bioinformatics 37.13 (Jan. 2021),
pp. 1805–1813. issn: 1367-4803. doi: 10.1093/bioinformatics/btab026. eprint:
https://academic.oup.com/bioinformatics/article-pdf/37/13/1805/39353167/btab026.pdf.
[94] Jiayu Shang and Yanni Sun. “Predicting the hosts of prokaryotic viruses using
GCN-based semi-supervised learning”. In: BMC Biology 19 (2021), p. 250.
[95] Jie Tan, Zhencheng Fang, Shufang Wu, Qian Guo, Xiaoqing Jiang, and Huaiqiu Zhu.
“Hophage: an ab initio tool for identifying hosts of phage fragments from metaviromes”.
In: Bioinformatics 38.2 (2022), pp. 543–545.
[96] Lin Wan, Gesine Reinert, Fengzhu Sun, and Michael S Waterman. “Alignment-free
sequence comparison (II): theoretical power of comparison statistics”. In: Journal of
Computational Biology 17.11 (2010), pp. 1467–1490.
[97] Kai Song, Jie Ren, Zhiyuan Zhai, Xuemei Liu, Minghua Deng, and Fengzhu Sun.
“Alignment-free sequence comparison based on next-generation sequencing reads”. In:
Journal of Computational Biology 20.2 (2013), pp. 64–79.
[98] Ana Laura Grazziotin, Eugene V Koonin, and David M Kristensen. “Prokaryotic Virus
Orthologous Groups (pVOGs): a resource for comparative genomics and protein family
annotation”. In: Nucleic Acids Research 45.Database issue (2017), p. D491.
[99] Laure Guillou, Dipankar Bachar, Stéphane Audic, David Bass, Cédric Berney,
Lucie Bittner, Christophe Boutte, Gaétan Burgaud, Colomban de Vargas, Johan Decelle,
et al. “The Protist Ribosomal Reference database (PR2): a catalog of unicellular
eukaryote Small Sub-Unit rRNA sequences with curated taxonomy”. In: Nucleic Acids
Research 41.D1 (Jan. 2013), pp. D597–D604. issn: 0305-1048. doi: 10.1093/nar/gks1160.
[100] Lisa K Johnson, Harriet Alexander, and C Titus Brown. “Re-assembly, quality evaluation,
and annotation of 678 microbial eukaryotic reference transcriptomes”. In: GigaScience
8.4 (Apr. 2019). issn: 2047-217X. doi: 10.1093/gigascience/giy158.
[101] Valentina Galata, Tobias Fehlmann, Christina Backes, and Andreas Keller. “PLSDB: a
resource of complete bacterial plasmids”. In: Nucleic Acids Research 47.D1 (Jan.
2019), pp. D195–D202. issn: 1362-4962. doi: 10.1093/nar/gky1050.
[102] Tomoko Mihara, Yosuke Nishimura, Yugo Shimizu, Hiroki Nishiyama, Genki Yoshikawa,
Hideya Uehara, Pascal Hingamp, Susumu Goto, and Hiroyuki Ogata. “Linking Virus
Genomes with Host Taxonomy”. In: Viruses 8.3 (Mar. 2016), p. 66. issn: 1999-4915.
doi: 10.3390/v8030066.
[103] Brian D. Ondov, Todd J. Treangen, Páll Melsted, Adam B. Mallonee,
Nicholas H. Bergman, Sergey Koren, and Adam M. Phillippy. “Mash: fast genome and
metagenome distance estimation using MinHash”. In: Genome Biology 17.1 (June 2016),
p. 132. issn: 1474-760X. doi: 10.1186/s13059-016-0997-x.
[104] David M. Needham, Erin B. Fichot, Ellice Wang, Lyria Berdjeb, Jacob A. Cram,
Cédric G. Fichot, and Jed A. Fuhrman. “Dynamics and interactions of highly resolved
marine plankton via automated high-frequency sampling”. In: The ISME Journal (June
2018), p. 1. issn: 1751-7370. doi: 10.1038/s41396-018-0169-y.
[105] Shifu Chen, Yanqing Zhou, Yaru Chen, and Jia Gu. “fastp: an ultra-fast all-in-one FASTQ
preprocessor”. In: Bioinformatics (Oxford, England) 34.17 (Sept. 2018), pp. i884–i890. issn:
1367-4811. doi: 10.1093/bioinformatics/bty560.
[106] Sergey Nurk, Dmitry Meleshko, Anton Korobeynikov, and Pavel A. Pevzner.
“metaSPAdes: a new versatile metagenomic assembler”. In: Genome Research 27.5 (May
2017), pp. 824–834. issn: 1088-9051, 1549-5469. doi: 10.1101/gr.213959.116.
[107] Andrew M. Long, Shengwei Hou, J. Cesar Ignacio-Espinoza, and Jed A. Fuhrman.
“Benchmarking microbial growth rate predictions from metagenomes”. In: The ISME
Journal 15.1 (Jan. 2021), pp. 183–195. issn: 1751-7370. doi: 10.1038/s41396-020-00773-1.
[108] Marcel Margulies, Michael Egholm, William E Altman, Said Attiya, Joel S Bader,
Lisa A Bemben, Jan Berka, Michael S Braverman, Yi-Ju Chen, Zhoutao Chen, et al.
“Genome sequencing in microfabricated high-density picolitre reactors”. In: Nature
437.7057 (Sept. 2005), pp. 376–380.
[109] Todd J. Treangen, Dan D. Sommer, Florent E. Angly, Sergey Koren, and Mihai Pop.
“Next generation sequence assembly with AMOS”. In: Current Protocols in Bioinformatics
Chapter 11 (Mar. 2011), Unit 11.8. issn: 1934-340X. doi: 10.1002/0471250953.bi1108s33.
[110] Weizhong Li and Adam Godzik. “Cd-hit: a fast program for clustering and comparing
large sets of protein or nucleotide sequences”. In: Bioinformatics 22.13 (July 2006),
pp. 1658–1659. issn: 1367-4803. doi: 10.1093/bioinformatics/btl158.
[111] Antonio Pedro Camargo, Simon Roux, Frederik Schulz, Michal Babinski, Yan Xu,
Bin Hu, Patrick SG Chain, Stephen Nayfach, and Nikos C Kyrpides. “You can move, but
you can’t hide: identification of mobile genetic elements with geNomad”. In: bioRxiv
(2023), pp. 2023–03.
[112] Laura Wegener Parfrey, William A. Walters, and Rob Knight. “Microbial eukaryotes in
the human microbiome: ecology, evolution, and future directions”. In: Frontiers in
Microbiology 2 (2011), p. 153. issn: 1664-302X. doi: 10.3389/fmicb.2011.00153.
[113] Holly M. Bik, Way Sung, Paul De Ley, James G. Baldwin, Jyotsna Sharma,
Axayácatl Rocha-Olivares, and W. Kelley Thomas. “Metagenetic community analysis of
microbial eukaryotes illuminates biogeographic patterns in deep-sea and shallow water
sediments”. In: Molecular Ecology 21.5 (2012), pp. 1048–1059. issn: 1365-294X. doi:
10.1111/j.1365-294X.2011.05297.x.
[114] Angela M. Oliverio, Jean F. Power, Alex Washburne, S. Craig Cary, Matthew B. Stott,
and Noah Fierer. “The ecology and diversity of microbial eukaryotes in geothermal
springs”. In: The ISME Journal 12.8 (Aug. 2018), pp. 1918–1928. issn: 1751-7370. doi:
10.1038/s41396-018-0104-2.
[115] Jan Pawlowski, Stéphane Audic, Sina Adl, David Bass, Lassaâd Belbahri, Cédric Berney,
Samuel S. Bowser, Ivan Cepicka, Johan Decelle, Micah Dunthorn, et al. “CBOL
Protist Working Group: Barcoding Eukaryotic Richness beyond the Animal, Plant, and
Fungal Kingdoms”. In: PLOS Biology 10.11 (Nov. 2012), e1001419. issn: 1545-7885. doi:
10.1371/journal.pbio.1001419.
[116] Linda A. Amaral-Zettler, Elizabeth A. McCliment, Hugh W. Ducklow, and
Susan M. Huse. “A Method for Studying Protistan Diversity Using Massively Parallel
Sequencing of V9 Hypervariable Regions of Small-Subunit Ribosomal RNA Genes”. In:
PLOS ONE 4.7 (July 2009), e6372. issn: 1932-6203. doi: 10.1371/journal.pone.0006372.
[117] Quentin Carradec, Eric Pelletier, Corinne Da Silva, Adriana Alberti, Yoann Seeleuthner,
Romain Blanc-Mathieu, Gipsi Lima-Mendez, Fabio Rocha, Leila Tirichine,
Karine Labadie, et al. “A global ocean atlas of eukaryotic genes”. In: Nature
Communications 9.1 (Jan. 2018), p. 373. issn: 2041-1723. doi:
10.1038/s41467-017-02342-1.
[118] M. E. Sieracki, N. J. Poulton, O. Jaillon, P. Wincker, C. de Vargas, L. Rubinat-Ripoll,
R. Stepanauskas, R. Logares, and R. Massana. “Single cell genomics yields a wide
diversity of small planktonic protists across major ocean ecosystems”. In: Scientific
Reports 9.1 (Apr. 2019), pp. 1–11. issn: 2045-2322. doi: 10.1038/s41598-019-42487-1.
[119] Fabien Burki, Andrew J. Roger, Matthew W. Brown, and Alastair G. B. Simpson. “The
New Tree of Eukaryotes”. In: Trends in Ecology & Evolution 35.1 (Jan. 2020), pp. 43–55.
issn: 0169-5347. doi: 10.1016/j.tree.2019.08.008.
[120] Matthew R. Olm, Patrick T. West, Brandon Brooks, Brian A. Firek, Robyn Baker,
Michael J. Morowitz, and Jillian F. Banfield. “Genome-resolved metagenomics of
eukaryotic populations during early colonization of premature infants and in hospital
rooms”. In: Microbiome 7.1 (Feb. 2019), p. 26. issn: 2049-2618. doi:
10.1186/s40168-019-0638-1.
[121] A. Duncan, K. Barry, C. Daum, E. Eloe-Fadrosh, S. Roux, S. G. Tringe, K. Schmidt,
K. U. Valentin, N. Varghese, I. V. Grigoriev, et al. “Metagenome-assembled genomes
of phytoplankton communities across the Arctic Circle”. In: bioRxiv (June 2020),
p. 2020.06.16.154583. doi: 10.1101/2020.06.16.154583.
[122] Tom O. Delmont, Morgan Gaia, Damien D. Hinsinger, Paul Fremont,
Antonio Fernandez Guerra, A. Murat Eren, Chiara Vanni, Artem Kourlaiev, Leo d’Agata,
Quentin Clayssen, et al. “Functional repertoire convergence of distantly related
eukaryotic plankton lineages revealed by genome-resolved metagenomics”. In: bioRxiv
(Oct. 2020), p. 2020.10.15.341214. doi: 10.1101/2020.10.15.341214.
[123] Lakshminarayan M. Iyer, L. Aravind, and Eugene V. Koonin. “Common Origin of Four
Diverse Families of Large Eukaryotic DNA Viruses”. In: Journal of Virology 75.23 (Dec.
2001), pp. 11720–11734. issn: 0022-538X. doi: 10.1128/JVI.75.23.11720-11734.2001.
[124] W. Foissner. “Protist diversity: estimates of the near-imponderable”. In: Protist 150.4
(Dec. 1999), pp. 363–368. issn: 1434-4610. doi: 10.1016/S1434-4610(99)70037-4.
[125] Jan Slapeta, David Moreira, and Purificación López-García. “The extent of protist
diversity: insights from molecular ecology of freshwater eukaryotes”. In: Proceedings.
Biological Sciences 272.1576 (Oct. 2005), pp. 2073–2081. issn: 0962-8452. doi:
10.1098/rspb.2005.3195.
[126] Frederik Schulz, Simon Roux, David Paez-Espino, Sean Jungbluth, David Walsh,
Vincent J. Denef, Katherine D. McMahon, Konstantinos T. Konstantinidis,
Emiley A. Eloe-Fadrosh, Nikos Kyrpides, et al. “Giant virus diversity and host
interactions through global metagenomics”. In: Nature (Jan. 2020), pp. 1–7. issn:
1476-4687. doi: 10.1038/s41586-020-1957-x.
[127] Mohammad Moniruzzaman, Carolina A. Martinez-Gutierrez, Alaina R. Weinheimer, and
Frank O. Aylward. “Dynamic genome evolution and complex virocell metabolism of
globally-distributed giant viruses”. In: Nature Communications 11.1 (Apr. 2020),
pp. 1–11. issn: 2041-1723. doi: 10.1038/s41467-020-15507-2.
[128] David M. Needham, Camille Poirier, Elisabeth Hehenberger, Valeria Jiménez,
Jarred E. Swalwell, Alyson E. Santoro, and Alexandra Z. Worden. “Targeted
metagenomic recovery of four divergent viruses reveals shared and distinctive
characteristics of giant viruses of marine eukaryotes”. In: Philosophical Transactions of
the Royal Society B: Biological Sciences 374.1786 (Nov. 2019), p. 20190086. doi:
10.1098/rstb.2019.0086.
[129] David M. Needham, Susumu Yoshizawa, Toshiaki Hosaka, Camille Poirier,
Chang Jae Choi, Elisabeth Hehenberger, Nicholas A. T. Irwin, Susanne Wilken,
Cheuk-Man Yung, Charles Bachy, et al. “A distinct lineage of giant viruses brings a
rhodopsin photosystem to unicellular marine predators”. In: Proceedings of the National
Academy of Sciences (Sept. 2019), p. 201907517. issn: 0027-8424, 1091-6490. doi:
10.1073/pnas.1907517116.
[130] Michael Lynch and John S. Conery. “The Origins of Genome Complexity”. In: Science
302.5649 (Nov. 2003), pp. 1401–1404. issn: 0036-8075, 1095-9203. doi:
10.1126/science.1089370.
[131] Aayushi Jain and Preeti Srivastava. “Broad host range plasmids”. In: FEMS Microbiology
Letters 348.2 (Nov. 2013), pp. 87–96. issn: 0378-1097. doi: 10.1111/1574-6968.12241.
[132] Holger Heuer and Kornelia Smalla. “Horizontal gene transfer between bacteria”. In:
Environmental Biosafety Research 6.1–2 (Jan. 2007), pp. 3–13. issn: 1635-7922,
1635-7930. doi: 10.1051/ebr:2007034.
[133] Krystyna I. Wolska. “Horizontal DNA transfer between bacteria in the environment”. In:
Acta Microbiologica Polonica 52.3 (2003), pp. 233–243. issn: 0137-1320.
[134] John Davison. “Genetic Exchange between Bacteria in the Environment”. In: Plasmid
42.2 (Sept. 1999), pp. 73–91. issn: 0147-619X. doi: 10.1006/plas.1999.1421.
[135] Uli Klümper, Leise Riber, Arnaud Dechesne, Analia Sannazzarro, Lars H. Hansen,
Søren J. Sørensen, and Barth F. Smets. “Broad host range plasmids can invade an
unexpectedly diverse fraction of a soil bacterial community”. In: The ISME Journal 9.4
(Apr. 2015), pp. 934–945. issn: 1751-7370. doi: 10.1038/ismej.2014.191.
[136] Liguan Li, Arnaud Dechesne, Jonas Stenløkke Madsen, Joseph Nesme, Søren J. Sørensen,
and Barth F. Smets. “Plasmids persist in a microbial community by providing fitness
benefit to multiple phylotypes”. In: The ISME Journal 14.5 (May 2020), pp. 1170–1181.
issn: 1751-7370. doi: 10.1038/s41396-020-0596-4.
[137] Xavier Bellanger, Hélène Guilloteau, Bérengère Breuil, and Christophe Merlin. “Natural
microbial communities supporting the transfer of the IncP-1 plasmid pB10 exhibit a
higher initial content of plasmids from the same incompatibility group”. In: Frontiers in
Microbiology 5 (2014). issn: 1664-302X. doi: 10.3389/fmicb.2014.00637.
[138] A. San Millan, R. Peña-Miller, M. Toll-Riera, Z. V. Halbert, A. R. McLean, B. S. Cooper,
and R. C. MacLean. “Positive selection and compensatory adaptation interact to stabilize
non-transmissible plasmids”. In: Nature Communications 5.1 (Oct. 2014), p. 5208. issn:
2041-1723. doi: 10.1038/ncomms6208.
[139] Ellie Harrison, David Guymer, Andrew J. Spiers, Steve Paterson, and
Michael A. Brockhurst. “Parallel compensatory evolution stabilizes plasmids across the
parasitism-mutualism continuum”. In: Current Biology 25.15 (Aug. 2015),
pp. 2034–2039. issn: 1879-0445. doi: 10.1016/j.cub.2015.06.024.
[140] Wesley Loftie-Eaton, Kelsie Bashford, Hannah Quinn, Kieran Dong, Jack Millstein,
Samuel Hunter, Maureen K. Thomason, Houra Merrikh, Jose M. Ponciano, and
Eva M. Top. “Compensatory mutations improve general permissiveness to antibiotic
resistance plasmids”. In: Nature Ecology & Evolution 1.9 (Sept. 2017), pp. 1354–1363. issn:
2397-334X. doi: 10.1038/s41559-017-0243-2.
[141] James P. J. Hall, Michael A. Brockhurst, Calvin Dytham, and Ellie Harrison. “The
evolution of plasmid stability: Are infectious transmission and compensatory evolution
competing evolutionary trajectories?” In: Plasmid. SI: ISPB Plasmid 2016 91 (May 2017),
pp. 90–95. issn: 0147-619X. doi: 10.1016/j.plasmid.2017.04.003.
[142] Jerónimo Rodríguez-Beltrán, Javier DelaFuente, Ricardo León-Sampedro,
R. Craig MacLean, and Álvaro San Millán. “Beyond horizontal gene transfer: the role of
plasmids in bacterial evolution”. In: Nature Reviews Microbiology 19.6 (June 2021),
pp. 347–359. issn: 1740-1534. doi: 10.1038/s41579-020-00497-1.
[143] Laura S. Frost, Raphael Leplae, Anne O. Summers, and Ariane Toussaint. “Mobile genetic
elements: the agents of open source evolution”. In: Nature Reviews Microbiology 3.9
(Sept. 2005), pp. 722–732. issn: 1740-1534. doi: 10.1038/nrmicro1235.
[144] William G. Eberhard. “Evolution in Bacterial Plasmids and Levels of Selection”. In: The
Quarterly Review of Biology 65.1 (Mar. 1990), pp. 3–22. issn: 0033-5770. doi:
10.1086/416582.
[145] Jinshui Zheng, Ziyu Guan, Shiyun Cao, Donghai Peng, Lifang Ruan, Daohong Jiang, and
Ming Sun. “Plasmids are vectors for redundant chromosomal genes in the Bacillus
cereus group”. In: BMC Genomics 16.1 (Jan. 2015), p. 6. issn: 1471-2164. doi:
10.1186/s12864-014-1206-5.
[146] Yanhong Shi, Hong Zhang, Zhe Tian, Min Yang, and Yu Zhang. “Characteristics of
ARG-carrying plasmidome in the cultivable microbial community from wastewater
treatment system under high oxytetracycline concentration”. In: Applied Microbiology
and Biotechnology 102.4 (Feb. 2018), pp. 1847–1858. issn: 1432-0614. doi:
10.1007/s00253-018-8738-6.
[147] Tomoko Mihara, Yosuke Nishimura, Yugo Shimizu, Hiroki Nishiyama, Genki Yoshikawa,
Hideya Uehara, Pascal Hingamp, Susumu Goto, and Hiroyuki Ogata. “Linking virus
genomes with host taxonomy”. In: Viruses 8.3 (2016), p. 66.
[148] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. “CD-HIT:
accelerated for clustering the next-generation sequencing data”. In: Bioinformatics 28.23
(2012), pp. 3150–3152.
[149] Stephen Nayfach, David Páez-Espino, Lee Call, Soo Jen Low, Hila Sberro,
Natalia N Ivanova, Amy D Proal, Michael A Fischbach, Ami S Bhatt, Philip Hugenholtz,
et al. “Metagenomic compendium of 189,680 DNA viruses from the human gut
microbiome”. In: Nature Microbiology 6.7 (2021), pp. 960–970.
[150] Valentina Galata, Tobias Fehlmann, Christina Backes, and Andreas Keller. “PLSDB: a
resource of complete bacterial plasmids”. In: Nucleic Acids Research 47.D1 (2019),
pp. D195–D202.
[151] Xiaotu Ma, Ying Shao, Liqing Tian, Diane A Flasch, Heather L Mulder,
Michael N Edmonson, Yu Liu, Xiang Chen, Scott Newman, Joy Nakitandwe, et al.
“Analysis of error profiles in deep next-generation sequencing data”. In: Genome Biology
20 (2019), p. 50.
[152] Miguel Pignatelli and Andrés Moya. “Evaluating the fidelity of de novo short read
metagenomic assembly using simulated data”. In: PLOS One 6.5 (2011), e19984.
[153] Jonathan D Magasin and Dietlind L Gerloff. “Pooled assembly of marine metagenomic
datasets: enriching annotation through chimerism”. In: Bioinformatics 31.3 (2015),
pp. 311–317.
[154] Kujin Tang, Jie Ren, and Fengzhu Sun. “Afann: bias adjustment for alignment-free
sequence comparison based on sequencing data using neural network regression”. In:
Genome Biology 20 (2019), p. 266.
[155] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. “Learning and generalization in
overparameterized neural networks, going beyond two layers”. In: Advances in Neural
Information Processing Systems (2019).
[156] Martin Steinegger and Johannes Söding. “MMseqs2 enables sensitive protein sequence
searching for the analysis of massive data sets”. In: Nature Biotechnology 35.11 (2017),
pp. 1026–1028.
[157] Mantas Sereika, Rasmus Hansen Kirkegaard, Søren Michael Karst,
Thomas Yssing Michaelsen, Emil Aarre Sørensen, Rasmus Dam Wollenberg, and
Mads Albertsen. “Oxford Nanopore R10.4 long-read sequencing enables the generation
of near-finished bacterial genomes from pure cultures and metagenomes without
short-read or reference polishing”. In: Nature Methods 19.7 (2022), pp. 823–826.
[158] Ying Ni, Xudong Liu, Zemenu Mengistie Simeneh, Mengsu Yang, and Runsheng Li.
“Benchmarking of Nanopore R10.4 and R9.4.1 flow cells in single-cell whole-genome
amplification and whole-genome shotgun sequencing”. In: Computational and Structural
Biotechnology Journal 21 (2023), pp. 2352–2364.
[159] Jörg Linde, Hanka Brangsch, Martin Hölzer, Christine Thomas, Mandy C Elschner,
Falk Melzer, and Herbert Tomaso. “Comparison of Illumina and Oxford Nanopore
Technology for genome analysis of Francisella tularensis, Bacillus anthracis, and
Brucella suis”. In: BMC Genomics 24.1 (2023), pp. 1–15.
Asset Metadata
Creator
Tang, Tianqi (author)
Core Title
Deep learning in metagenomics: from metagenomic contigs sorting to phage-bacterial association prediction
Contributor
Electronically uploaded by the author (provenance)
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Degree Conferral Date
2023-12
Publication Date
10/20/2023
Defense Date
10/17/2023
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
deep learning, machine learning, metagenomics, OAI-PMH Harvest, phage-bacterial association prediction, sequence classification, virus-host interaction prediction
Format
theses (aat)
Language
English
Advisor
Sun, Fengzhu (committee chair), Chen, Liang (committee member), Fuhrman, Jed A. (committee member)
Creator Email
tianqit@usc.edu,tianqitang88@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113759510
Unique identifier
UC113759510
Identifier
etd-TangTianqi-12430.pdf (filename)
Legacy Identifier
etd-TangTianqi-12430
Document Type
Dissertation
Rights
Tang, Tianqi
Internet Media Type
application/pdf
Type
texts
Source
20231020-usctheses-batch-1102 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
uscdl@usc.edu
Abstract
Metagenomic datasets exhibit unique characteristics, shaped either by experimental limitations or by the intricacies of microbial communities. Analyzing metagenomic sequences requires addressing two principal challenges: identifying the sequences and understanding their interrelationships. To address the identification challenge, we introduce DeepMicroClass, a deep learning-based method that classifies metagenomic contigs into five sequence classes: viruses infecting prokaryotic hosts, viruses infecting eukaryotic hosts, eukaryotic chromosomes, prokaryotic chromosomes, and prokaryotic plasmids. To unravel the interactions among sequences in metagenomic datasets, we introduce ContigNet, another deep learning-based approach that predicts phage-host associations even when only short contigs are available. Both methods significantly outperform existing techniques. Together, they provide the scientific community with two high-performance, user-friendly tools.