PREDICTING VIRUS-HOST INTERACTIONS USING GENOMIC DATA
AND APPLICATIONS IN METAGENOMICS
by
Weili Wang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND INFORMATICS)
May 2020
Copyright 2020 Weili Wang
Acknowledgments
First and foremost, I would like to express my deepest gratitude to my advisor,
Prof. Fengzhu Sun, for his unreserved support and patient guidance in my Ph.D.
studies. Over the past years, his insight, hard work and dedication to details have
inspired me a lot. I am very fortunate to be one of his students.
I would also like to express my gratitude and appreciation to Prof. Michael
Waterman, Prof. Yingying Fan, Prof. Jed Fuhrman, and Prof. Mark Chaisson for
serving on my qualification and dissertation committee and providing insightful
suggestions to guide my research.
I would like to thank my collaborators, especially Dr. Jie Ren and Dr. Nathan
Ahlgren, for their great support and effort. I would also like to thank my labmates,
Dr. Kujin Tang, Dr. Yang Lu and Dr. Mengge Zhang for all the discussions we
had. I could not have achieved the work without them.
I would like to thank Prof. Minping Qian for introducing me to the field of
bioinformatics when I was an undergraduate. I am always impressed by her
long-lasting curiosity and great passion in research.
I would like to thank Kathleen Boeck, Douglas Burleson and Luigi Manna for
their assistance and service over the years at the QCB program. I would like to
thank all the members of Sun lab present and past including Wangshu Zhang,
Han Li, Kaida Ning, Zifan Zhu, Xin Bai, Tianqi Tang, Siliangyu Chen, Yilin Gao
and Wenxuan Zuo. I would also like to thank all of my friends at USC including
Jianghan Qu, Junsong Zhao, Maoqi Xu, Wenzheng Li, Meng Zhou, Tsu-pei Chiu,
Ben Decato, Beibei Xin, Chao Deng, Xiaojing Ji, Jinsen Li, Nan Hua, Rishvanth
Prabakar, Haiyang Zhang, Yingjun Lyu, Zhu Liu, Bo Sun, Yingfei Wang and many
others. It was such a journey and I cherished every moment we spent together.
Finally, I would like to thank my parents and my beloved girlfriend, Mengqian.
It was their tremendous love and support that helped me through all the ups and
downs. This dissertation is dedicated to them.
Contents
Acknowledgments
List of Tables
List of Figures
1 Introduction
2 Materials and Methods
2.1 Data sets
2.2 Outline of the model
2.3 A MRF approach for virus-host interactions
2.4 The similarity between two VHPs and the generalized probability model for a VHP to interact
2.5 Sharing of CRISPR spacers between the virus and the host
2.6 The fraction of virus genome aligned to the host genome
2.7 Incorporation of WIsH score for predicting hosts of virus contigs
2.8 Model training and evaluation
2.9 Clustering of viral contigs
2.10 Consideration of virus-host co-abundance in host prediction
2.11 Software
3 Results
3.1 A novel network-based integrated framework for predicting virus-host interaction
3.2 Feature scores are significantly different between positive and negative virus-host pairs
3.3 Integrated approach markedly increases host prediction accuracy
3.4 Integrated approach improves host prediction accuracy of short viral sequences
3.5 Thresholding on the prediction score further improves accuracy
3.6 Prediction accuracy varies for different viral families
3.7 Prediction of the host of crAss-like phage ΦcrAss001
3.8 Host prediction for marine environmental viral genomes
3.9 Host prediction for metagenomic viral contigs from various habitats
3.10 Computational cost
4 Discussion
Bibliography
A Supplementary Figures
B Supplementary Tables
List of Tables
3.1 The estimated coefficients and corresponding p-values for host prediction features.
3.2 Proportions of congruent predictions for viral contigs between our method and those in Paez-Espino et al.
B.1 The predictions for ΦcrAss001 in the host species tested in Shkoporov et al.
B.2 The comparison of prediction accuracies when excluding a certain group of viruses while training.
B.3 The prediction accuracies for different models when evaluated on the 1,462 validation viruses.
B.4 The predicted hosts of the selected group of 160 marine viral contigs from Paez-Espino et al.
B.5 The predicted hosts of the selected group of 173 human associated viral contigs from Paez-Espino et al.
List of Figures
2.1 Overview of the network prediction framework.
3.1 Distributions of the different feature values among 826 interacting and non-interacting virus-host pairs
3.2 Prediction accuracies of the different approaches for 1,462 viruses.
3.3 Prediction accuracies of the different approaches for viral contigs of length 5kb.
3.4 Prediction accuracies of the different approaches for simulated viral contigs of length 5kb.
3.5 Improvement in host prediction by thresholding on the prediction score.
3.6 Differences in prediction accuracy across viral families
3.7 Relatedness of 24 newly discovered viral contigs in human associated metagenomes based on shared gene content.
3.8 Shared gene relatedness of 102 newly discovered viral contigs and 31 previously isolated viruses that infect Cellulophaga.
A.1 Comparison of prediction accuracies using virus and host co-abundance.
A.2 ROC curves for predicting virus-host interactions using $s_2^*$ and WIsH on 352 positive and negative virus-host pairs.
A.3 The effect of prediction accuracies when the hosts in the true genus level are excluded.
A.4 Improvement in host prediction by thresholding on the prediction score for viral contigs of different lengths across different families of Caudoviruses.
Chapter 1
Introduction
Viruses are the most abundant and highly diverse biological entities on earth
[7, 8]. Viruses infect all domains of life including archaea, bacteria, and eukary-
otes. For prokaryotic viruses, especially those that infect bacteria, there have been
extensive studies about their diversity [22, 35], functions [71, 9, 43], and impact on
microbial communities through virus-host interactions [44, 32, 28, 58]. In particu-
lar, prokaryotic viruses can significantly impact human health [47, 52, 42] and the
functioning of many ecosystems [65, 55, 53] such as marine and soil habitats.
Therefore, characterizing virus-host interactions is a critical component to understanding
how biological systems work. Viruses are traditionally studied using culture-based
isolation techniques that provide direct identification of virus-host pairs. Isolation
approaches are, however, low throughput and limited to hosts that are cultivable.
Compared to the predicted number of extant viruses, a relatively small number
of viruses have been discovered via isolation based approaches with current esti-
mates indicating that 75-85% of viruses remain uncharacterized [58, 12]. With the
advent of metagenomic sequencing technologies, genetic material from microbes
including viruses, regardless of cultivability, can be sequenced. Metagenomic shot-
gun sequencing, especially the metagenomic sequencing of virus-like particles, has
tremendously accelerated the discovery of previously unknown viruses. An exam-
ple is crAss-like phages, a highly abundant family of ubiquitous human gut viruses,
originally discovered from the cross-assembly of fecal viral metagenomic samples
[20].
Identifying the hosts of viruses is important for understanding the impact of
viruses on the host dynamics and thus host community diversity and function.
Computational methods have been developed to infer the hosts of new viruses.
Many bacteria and archaea possess CRISPR virus defence systems whereby the
host incorporates some virus DNA fragments into its own genome, forming clustered
regularly interspaced short palindromic repeats (CRISPR) spacers. Therefore, shared CRISPR
regions are direct evidence supporting virus-host interactions [55, 20] and have
been used for host prediction for viruses in previous studies [21, 73]. Genome
alignment matches between virus and host genomes due to integrated prophages
or horizontal gene transfer are another piece of strong evidence used in predicting
the host of a virus [71, 55]. However, the above methods are limited by their low
accessibility: it is estimated that CRISPRs are only present in approximately 10%
of sequenced bacterial genomes [11, 30]; many viruses infect hosts under lytic mode
without integration to the host genome; and many viruses do not extensively share
host genes. Thus, CRISPRs and alignment-based approaches are not applicable
for predicting many viral hosts.
Several investigators have utilized the fact that viruses are often more similar
to their hosts, compared to non-host species, in terms of genome-wide signature,
i.e. k-mer usage because viruses and their hosts live in the same environment and
viruses use the hosts’ replication mechanism for replication [58, 21, 2, 27]. This
information has been used to predict the host of a virus as the one closest to the
viral genome based on some similarity measures using k-mers. These methods
in general have decent prediction accuracy, though the mechanism behind this
phenomenon is not fully understood. One plausible explanation is that viruses
tend to adopt the codon usage of their hosts in order to utilize the hosts’
translational machinery [13, 29]. The recently developed dissimilarity measure $d_2^*$,
which subtracts the expected k-mer frequency from the observed frequency, achieves the
highest reported host prediction accuracy among all current genomic signature-based
measures, including commonly used Euclidean and Manhattan distances [2]. Sim-
ilarly, Galiez et al. [27] predicted the host of a virus to be the one for which
results from a Markov chain model analysis had the highest likelihood score. The
method has good prediction accuracy for short viral fragments. These genomic
signature-based measures are often referred to as alignment-free sequence compar-
ison measures. The high correlation between virus and host abundance profiles
across different samples also serves as evidence for virus-host interaction [20], but
its accuracy is not as high as the above methods [21]. Edwards et al. [21] recently
provided a comprehensive evaluation of several different computational approaches
for virus-host predictions.
In addition to the methods using features defined between a pair of virus and
host genomes, some researchers have used virus-virus similarity networks to infer
the host of a query virus [70, 77]. The high similarity between viruses may indicate
a common host or very close host relatedness. Network-based prediction models,
whereby unknown entities are predicted based on the features of their neighbors
in a network, have been successfully applied to many biological problems, includ-
ing predicting protein functions using protein-protein interaction networks [19, 36],
inferring disease genes based on gene-gene networks [37, 26, 79], and predicting the
target of new drugs using drug-drug, drug-target and target-target similarity net-
works [14]. A few attempts have been made to exploit the possibility of predicting
viral hosts based on virus-virus network information. Different principles, such as
gene homology [60, 41], protein family [48] and genome similarity [70, 76, 78] were
used to define the virus-virus relationships in networks. Villarroel et al. proposed
HostPhinder [70], a method to predict the host of a virus by searching for the virus
that shares the most k-mers from a database of viruses with known hosts. Zhang
et al. [77] identified the important k-mer features of viruses infecting the same
host genera, and built a classifier to predict whether or not a new virus belongs to
the same group of viruses. One drawback of the network-based approach is that
the performance can diminish if the query virus is highly divergent from the known
viruses in the current network.
Though various methods have been proposed for predicting virus-host inter-
actions, the highest accuracy is only 43% at the genus level using a single type
of information. With the increasing number of viruses being discovered, there
is a demand for a tool that is able to accurately and rapidly predict the hosts
of viruses, incorporating all types of virus-host and virus-virus features. In this
study, we have developed a network-based integrated framework for predicting
virus-prokaryote interactions based on multiple types of information: virus-virus
similarity, virus-host alignment-free similarity, virus-host shared CRISPR spacers
and virus-host alignment-based matches. To the best of our knowledge, this is the
first time that multiple types of features are effectively integrated into a network to
complement each other and enhance the prediction accuracy of virus-prokaryote
interactions. This integrated method markedly improved the accuracies in pre-
dicting virus-prokaryote interactions for complete viral genomes from 43% to 59%
at the genus level, and yielded 86% accuracy at the phylum level, the highest
among all the existing methods. The prediction framework also had decent accu-
racy for shorter viral contigs even as short as 5 kb. We have used our method
to infer the host of the first isolated strain of the crAssphage, 1,811 marine viral
genomes, and >100,000 viral contigs from various environments. We have provided
a user-friendly program, VirHostMatcher-Net, that uses this framework to predict
virus-prokaryote interactions. Finally, VirHostMatcher-Net provides a flexible and
expandable network-based framework for on-going refinement of virus-prokaryote
prediction methods.
Chapter 2
Materials and Methods
2.1 Data sets
All data analyzed in this study are available from previously published studies
[46, 48, 62] or included in this dissertation and its appendix. We collected 2,288
RefSeq viral genomes with known hosts at the genus level from NCBI as of Nov.
11, 2019. Among them, 826 viruses have specific hosts (at strain level) and those
were used for training. The training set includes 817 viruses that infect bacteria
and 9 that infect archaea. The other 1,462 viruses were used for validation. The
hosts from which the viruses were originally isolated were collected
based on the key words ‘isolate_host=’ or ‘host=’ within each GenBank file.
Furthermore, for a subset of 826 viral genomes, the hosts were reported at the
strain, subspecies, or serovar level, and only a single host genome was reported in the
NCBI genome database for that particular strain, subspecies or serovar. We used
these 826 viruses with known specific host genomes as the training set. The other
viruses either have more than one specific host strain or have host taxonomic
information only down to the genus or species level.
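As an illustration of the host lookup described above, the following is a minimal sketch, not the script used in this study. It assumes the key words appear as GenBank-style qualifiers of the form /host="..." or /isolate_host="..."; the function name extract_hosts and the regular expression are hypothetical.

```python
import re

# Assumed qualifier layout: /host="..." or /isolate_host="..." in a GenBank record.
# Requiring the leading "/" avoids matching other qualifiers such as /lab_host.
HOST_QUALIFIER = re.compile(r'/(?:isolate_host|host)="([^"]*)"')


def extract_hosts(genbank_text):
    """Return the host names found in one GenBank record, in order of appearance."""
    # Qualifier values may wrap over several lines; join continuation lines first.
    flattened = re.sub(r"\s*\n\s+", " ", genbank_text)
    return [name.strip() for name in HOST_QUALIFIER.findall(flattened)]
```

A record that yields no match would simply be treated as having no annotated host in this sketch.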
We applied our method to a set of 1,811 marine virus genomes that were studied
in Nishimura et al. [46]. The data set is available at
ftp://ftp.genome.jp/pub/db/community/EVG2017. In addition, we predicted the hosts
of 111,167 viral contigs that were assembled previously from various environmental
metagenomic samples [48]. Accession numbers of those viral contigs are available in
Table S19 of Paez-Espino et al. [48].
2.2 Outline of the model
We formulate the virus-host interactions using a Markov random field (MRF)
model [19, 40, 62]. Given a set of viruses $\{v_1, v_2, \ldots, v_n\}$ and a set of hosts
$\{b_1, b_2, \ldots, b_m\}$, we define the set of virus-host pairs (VHP) and their
interaction statuses,
\[
K = \{\kappa_{ij} = I(v_i, b_j),\ i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, m\},
\]
where $I(v, b) = 1$ if $v$ infects $b$ and $I(v, b) = 0$ otherwise. We construct a VHP
network where nodes are VHPs and edge weights are the pairwise similarity between
two VHPs.
The interaction statuses of all VHPs depend on two essential components: 1)
the likelihood of the interaction status of each individual VHP and 2) the linkage
between each VHP and all others. In the following sections, we first show how a
MRF model can take both components into consideration. Next we introduce
a similarity measure that describes the linkage between a pair of VHPs. Then
we define the other features that can be used to estimate the first component.
Finally, we derive two models for host prediction given virus genomes and contigs,
respectively.
2.3 A MRF approach for virus-host interactions
We model the likelihood of virus-host interaction statuses by considering two
components: the fraction of interacting VHPs among all the VHPs and the similar-
ity network among the VHPs. For the first component, we use a Bernoulli model
that assumes the interaction statuses of VHPs are independent. For the second
component, we use a network model based on the similarity network among the
VHPs. The two components are integrated by multiplying the probabilities from
both components. More specifically, the likelihood of an assignment $K$ of the
infection statuses for all the VHPs in the network is proportional to the likelihood of the
assignments of the VHP nodes and the likelihood of the pairwise labels of VHPs
given the network. Let $\pi$ be the probability for a VHP to interact. Then for each
pair $(v_i, b_j)$, the likelihood of the interaction status, $P(K_{ij} = \kappa_{ij})$, can be
expressed as $\pi^{\kappa_{ij}} (1-\pi)^{1-\kappa_{ij}}$ according to the Bernoulli model. By
considering all VHPs and assuming their assignments are independent, the likelihood of an
assignment of $K$ equals the product of the likelihoods for all the virus-host pairs, that is,
\[
\prod_{i,j} \pi^{\kappa_{ij}} (1-\pi)^{1-\kappa_{ij}}
= \left(\frac{\pi}{1-\pi}\right)^{F_1} (1-\pi)^{F}
= \lambda \exp(\beta F_1), \tag{2.1}
\]
where $F_1 = \sum_{i,j} \kappa_{ij}$, $F = \|K\|$ is the size of $K$,
$\beta = \log\left(\frac{\pi}{1-\pi}\right)$, and $\lambda = (1-\pi)^{F}$.
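The identity in Eq. (2.1) can be checked numerically. The sketch below is only an illustration; the function names and the example values of $\pi$ and $\kappa_{ij}$ are made up for this purpose.

```python
import math


def bernoulli_likelihood(kappa, pi):
    """Left-hand side of Eq. (2.1): product of per-VHP Bernoulli terms."""
    likelihood = 1.0
    for k in kappa:
        likelihood *= pi ** k * (1.0 - pi) ** (1 - k)
    return likelihood


def exponential_form(kappa, pi):
    """Right-hand side of Eq. (2.1): lambda * exp(beta * F_1)."""
    f1 = sum(kappa)                    # number of interacting VHPs
    f = len(kappa)                     # total number of VHPs
    beta = math.log(pi / (1.0 - pi))   # log-odds of interaction
    lam = (1.0 - pi) ** f
    return lam * math.exp(beta * f1)


# Illustrative values only: five VHPs, two of which interact.
example_kappa = [1, 0, 0, 1, 0]
example_pi = 0.2
```

Both functions return $0.2^2 \times 0.8^3 = 0.02048$ for the example values, up to floating-point error.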
Next consider the relationship between two VHPs in the network. The probability
of two similar VHPs having the same 0-1 status is higher than the probability
of having different 0-1 assignments. Let $S_{ij,i'j'}$ be the similarity between two VHPs
$(v_i, b_j)$ and $(v_{i'}, b_{j'})$. Conditional on the similarity between two VHPs, we model
the probability for them to be labelled as (1,1), (1,0) and (0,0) by $a^{S_{ij,i'j'}}$,
$b^{S_{ij,i'j'}}$ and $c^{S_{ij,i'j'}}$, respectively, where $a$, $b$, and $c$ are parameters.
Mathematically, we can write the probability of $(v_i, b_j)$ labeled as $\kappa_{ij}$ and
$(v_{i'}, b_{j'})$ labeled as $\kappa_{i'j'}$ by
\[
\begin{aligned}
P(K_{ij} = \kappa_{ij},\, K_{i'j'} = \kappa_{i'j'})
&= a^{\kappa_{ij}\kappa_{i'j'} S_{ij,i'j'}}\;
   b^{(1-\kappa_{ij})\kappa_{i'j'} S_{ij,i'j'} + (1-\kappa_{i'j'})\kappa_{ij} S_{ij,i'j'}}\;
   c^{(1-\kappa_{ij})(1-\kappa_{i'j'}) S_{ij,i'j'}} \\
&= \exp\Big(\gamma_2\, \kappa_{ij}\kappa_{i'j'} S_{ij,i'j'}
   + \gamma_1 \big((1-\kappa_{ij})\kappa_{i'j'} S_{ij,i'j'} + (1-\kappa_{i'j'})\kappa_{ij} S_{ij,i'j'}\big) \\
&\qquad\quad + \gamma_0\, (1-\kappa_{ij})(1-\kappa_{i'j'}) S_{ij,i'j'}\Big),
\end{aligned}
\]
where $\gamma_2 = \log(a)$, $\gamma_1 = \log(b)$, and $\gamma_0 = \log(c)$. We assume that the
labelings of the VHP pairs are independent. Then we can multiply the above equation over all
the VHP pairs to obtain
\[
\exp(\gamma_2 F_{11} + \gamma_1 F_{01} + \gamma_0 F_{00}), \tag{2.2}
\]
where $F_{cc'}$ is defined as the sum of similarities among VHP pairs labeled as
$(c, c')$, $c, c' = 0, 1$, namely
\[
\begin{aligned}
F_{11} &= \sum_{(i,j) \neq (i',j') \in K} \kappa_{ij}\kappa_{i'j'} S_{ij,i'j'}, \\
F_{01} &= \sum_{(i,j) \neq (i',j') \in K} (1-\kappa_{ij})\kappa_{i'j'} S_{ij,i'j'} + (1-\kappa_{i'j'})\kappa_{ij} S_{ij,i'j'}, \\
F_{00} &= \sum_{(i,j) \neq (i',j') \in K} (1-\kappa_{ij})(1-\kappa_{i'j'}) S_{ij,i'j'}.
\end{aligned}
\]
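By multiplying equations 2.1 and 2.2 and then normalizing to a probability
distribution, we model the probability of the assignment conditional on the similarity
network as
\[
\Pr(K \mid \theta) = \frac{1}{Z(\theta)} \exp(U(K))
= \frac{1}{Z(\theta)} \exp(\beta F_1 + \gamma_2 F_{11} + \gamma_1 F_{01} + \gamma_0 F_{00}),
\]
where $\theta = (\beta, \gamma_2, \gamma_1, \gamma_0)$ are the parameters,
$U(K) = \beta F_1 + \gamma_2 F_{11} + \gamma_1 F_{01} + \gamma_0 F_{00}$,
and $Z(\theta)$ is the normalizing factor.
With this distribution function, for any $\kappa_{ij} \in K$, we can calculate
\[
\frac{\Pr(\kappa_{ij} = 1 \mid K_{[-ij]})}{\Pr(\kappa_{ij} = 0 \mid K_{[-ij]})}
= \exp\left(\beta + (\gamma_2 - \gamma_1)\, m^{ij}_1 + (\gamma_1 - \gamma_0)\, m^{ij}_0\right),
\]
where
\[
K_{[-ij]} = K \setminus \kappa_{ij}, \qquad
m^{ij}_1 = \sum_{\kappa_{i'j'} \in K_{[-ij]},\ \kappa_{i'j'} = 1} S_{ij,i'j'}, \qquad
m^{ij}_0 = \sum_{\kappa_{i'j'} \in K_{[-ij]},\ \kappa_{i'j'} = 0} S_{ij,i'j'}.
\]
Then the log-odds of the probability $\Pr(\kappa_{ij} = 1 \mid K_{[-ij]}, \theta)$ is
\[
\mathrm{logit}\left(\Pr(\kappa_{ij} = 1 \mid K_{[-ij]}, \theta)\right)
= \beta + (\gamma_2 - \gamma_1)\, m^{ij}_1 + (\gamma_1 - \gamma_0)\, m^{ij}_0.
\]
Denote $\gamma_+ = \gamma_2 - \gamma_1$ and $\gamma_- = \gamma_1 - \gamma_0$. We have
\[
\mathrm{logit}\left(\Pr(\kappa_{ij} = 1 \mid K_{[-ij]}, \theta)\right)
= \beta + \gamma_+ m^{ij}_1 + \gamma_- m^{ij}_0.
\]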
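The conditional log-odds can be verified on a toy network by comparing it with the difference in $U(K)$ when a single label is flipped. The sketch below assumes that the sums defining $F_{11}$, $F_{01}$ and $F_{00}$ run over unordered pairs of VHPs; the node names, similarities and parameter values are hypothetical.

```python
from itertools import combinations


def energy(labels, sim, beta, g2, g1, g0):
    """U(K) = beta*F1 + g2*F11 + g1*F01 + g0*F00, summing over unordered VHP pairs."""
    f1 = sum(labels.values())
    f11 = f01 = f00 = 0.0
    for p, q in combinations(sorted(labels), 2):
        s = sim[frozenset((p, q))]
        kp, kq = labels[p], labels[q]
        f11 += kp * kq * s
        f01 += ((1 - kp) * kq + (1 - kq) * kp) * s
        f00 += (1 - kp) * (1 - kq) * s
    return beta * f1 + g2 * f11 + g1 * f01 + g0 * f00


def conditional_logit(target, labels, sim, beta, g2, g1, g0):
    """beta + (g2 - g1)*m1 + (g1 - g0)*m0 for one VHP given all other labels."""
    others = [(o, k) for o, k in labels.items() if o != target]
    m1 = sum(sim[frozenset((target, o))] for o, k in others if k == 1)
    m0 = sum(sim[frozenset((target, o))] for o, k in others if k == 0)
    return beta + (g2 - g1) * m1 + (g1 - g0) * m0


# Toy network of three VHPs (hypothetical similarities and parameters).
toy_labels = {"p": 1, "q": 0, "r": 1}
toy_sim = {frozenset("pq"): 0.5, frozenset("pr"): 0.2, frozenset("qr"): 0.8}
toy_params = dict(beta=-1.0, g2=0.7, g1=-0.3, g0=0.1)
```

For the VHP "q", setting its label to 1 versus 0 changes $U(K)$ by exactly the value returned by conditional_logit, because $Z(\theta)$ cancels in the ratio.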
2.4 The similarity between two VHPs and the generalized probability model for a VHP to interact
The MRF network model is constructed based on the similarity between pairs
of VHPs, $S_{ij,i'j'}$. Various similarity measures between VHPs can be defined. In this
study, we define the similarity between two VHPs as the similarity between the
two viruses plus the similarity between the two hosts. To measure the similarity
between two genomic sequences, we previously developed dissimilarity measures $d_2^*$
and $d_2^S$ for alignment-free sequence comparison using k-mers as genomic signatures
[64, 63, 72, 49], and showed that the dissimilarity measures $d_2^*$ and $d_2^S$ have high
correlation with alignment-based distance measures [51]. Since viruses are highly
diverse and alignments of highly divergent sequences are challenging, alignment-free
measures are more suitable for sequence comparison than the alignment-based
methods. Furthermore, Ahlgren et al. [2] showed that $d_2^*$ outperformed $d_2^S$ for the
comparison of virus and bacterial sequences for the purpose of virus-host interaction
prediction. Therefore, here we choose to use $d_2^*$ and transform it to $s_2^*$ to
measure the similarity between two sequences.
For each sequence, we represent it by the normalized k-mer frequency vector
$(\tilde{f}_w, w \in A^k)$, where $A$ is the alphabet $\{A, C, G, T\}$, $k$ is the length of the
k-mer, and
\[
\tilde{f}_w = (N_w - E_w) / \sqrt{E_w},
\]
with $N_w$ and $E_w$ being the observed and expected numbers of occurrences of word
$w$ in the sequence. The expected count is calculated under a Markov chain model
for the sequence as described below. Since it was shown in [2] that $k = 6$ and a
second order Markov chain performed well in virus-host interaction prediction,
we choose $k = 6$ and a second order Markov chain in this study. The similarity
between two sequences, $s_2^*$, is defined as the un-centered correlation between their
corresponding normalized frequency vectors. That is,
\[
s_2^*(v, b) = 1 - 2 d_2^*(v, b) = \sum_{w \in A^k} \bar{f}^{(v)}_w \bar{f}^{(b)}_w,
\]
where $d_2^*(v, b)$ is the dissimilarity measure used in the previous studies,
$\bar{f}_w = \tilde{f}_w / \|f\|$ with $\|f\|$ being the Euclidean norm of the feature vector
$f = (\tilde{f}_w, w \in A^k)$, and the superscript indicates the virus $v$ or bacterial $b$
sequence. Thus, we define the similarity
\[
S_{ij,i'j'} = s_2^*(v_i, v_{i'})\, I(b_j = b_{j'}) + s_2^*(b_j, b_{j'})\, I(v_i = v_{i'}).
\]
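The computation of $s_2^*$ can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this study: the expected count $E_w$ is estimated with the usual plug-in formula for a Markov chain of the given order (a product of counts of words of length order+1 divided by a product of counts of the inner words of length order), only the forward strand is counted, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def count_words(seq, length):
    """Count overlapping words of the given length in seq."""
    return Counter(seq[i:i + length] for i in range(len(seq) - length + 1))


def normalized_kmer_vector(seq, k=6, order=2):
    """Unit-length vector of (N_w - E_w) / sqrt(E_w) over all k-mers (order >= 1)."""
    n_k = count_words(seq, k)
    n_hi = count_words(seq, order + 1)   # words of length order + 1
    n_lo = count_words(seq, order)       # words of length order
    vec = []
    for w in map("".join, product("ACGT", repeat=k)):
        # Assumed plug-in estimate of the expected count under the Markov chain.
        num = math.prod(n_hi[w[i:i + order + 1]] for i in range(k - order))
        den = math.prod(n_lo[w[i:i + order]] for i in range(1, k - order))
        expected = num / den if den else 0.0
        vec.append((n_k[w] - expected) / math.sqrt(expected) if expected else 0.0)
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def s2_star(seq_a, seq_b, k=6, order=2):
    """Un-centered correlation between the two normalized k-mer vectors."""
    fa = normalized_kmer_vector(seq_a, k, order)
    fb = normalized_kmer_vector(seq_b, k, order)
    return sum(x * y for x, y in zip(fa, fb))


def d2_star(seq_a, seq_b, k=6, order=2):
    """Dissimilarity recovered from s2* = 1 - 2 * d2*."""
    return 0.5 * (1.0 - s2_star(seq_a, seq_b, k, order))
```

By construction $s_2^*$ lies in $[-1, 1]$ and equals 1 when a sequence with a non-zero vector is compared with itself, so $d_2^*$ lies in $[0, 1]$.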
Plugging $S_{ij,i'j'}$ into the logit function, we have
\[
\mathrm{logit}\left(\Pr(\kappa_{ij} = 1 \mid K_{[-ij]}, \theta)\right)
= \beta + \gamma_+ SV^{ij}_+ + \delta_+ SB^{ij}_+ + \gamma_- SV^{ij}_- + \delta_- SB^{ij}_-, \tag{2.3}
\]
where
\[
\begin{aligned}
SV^{ij}_+ &= \sum_{I(v', b_j) = 1,\ v' \neq v_i} s_2^*(v', v_i), &
SB^{ij}_+ &= \sum_{I(v_i, b') = 1,\ b' \neq b_j} s_2^*(b', b_j), \\
SV^{ij}_- &= \sum_{I(v', b_j) = 0,\ v' \neq v_i} s_2^*(v', v_i), &
SB^{ij}_- &= \sum_{I(v_i, b') = 0,\ b' \neq b_j} s_2^*(b', b_j).
\end{aligned}
\]
The above formulation takes into account both the similarity network between
viruses, and the similarity network between hosts. In our data set, however, each
virus has only one reported host. So when we train the model using the current
data set, both $SB^{ij}_+$ and $SB^{ij}_-$ are set to zero. Then the model reduces to
\[
\mathrm{logit}\left(\Pr(\kappa_{ij} = 1 \mid K_{[-ij]}, \theta)\right)
= \beta + \gamma_+ SV^{ij}_+ + \gamma_- SV^{ij}_-.
\]
Though the terms $SB^{ij}_+$ and $SB^{ij}_-$ cannot be used given the current data set, as
more virus-host pairs are collected in the training data, the host-host similarity
network will contribute to the prediction model and the two-layer MRF network
will be fully utilized based on Eq. (2.3).
Incorporating similarity between virus and host for interaction prediction.
The assumption that any VHP has the same probability $\pi$ for interaction is not
realistic. Different pairs of virus and host have different features that affect the
probability of interaction. For example, the probability can be associated with the
similarity between the virus and the host [2]. Thus, the probability $\pi$ is modelled
specifically to each individual pair $(v_i, b_j)$,
\[
\log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = \alpha + \beta s_2^*(v_i, b_j). \tag{2.4}
\]
Then the logit model with the generalized probability can be written as
\[
\mathrm{logit}\left(\Pr(\kappa_{ij} = 1 \mid K_{[-ij]}, \theta)\right)
= \alpha + \beta s_2^*(v_i, b_j) + \gamma_+ SV^{ij}_+ + \gamma_- SV^{ij}_-.
\]
Therefore, the network-based MRF for predicting virus-host interaction is
finally written as a logistic regression model where the predictors are the features
of virus-virus similarity and virus-host similarity,
\[
\mathrm{logit}(\Pr(I(v, b) = 1)) = \alpha + \beta s_2^*(v, b) + \gamma_+ SV_+(v, b) + \gamma_- SV_-(v, b), \tag{2.5}
\]
where $\alpha$ is a constant and $(\beta, \gamma_+, \gamma_-)$ measure the contributions of the
features $s_2^*(v, b)$, $SV_+(v, b)$, and $SV_-(v, b)$, respectively. We expect $\beta$ and
$\gamma_+$ to be positive and $\gamma_-$ to be negative. However, we do not impose these
constraints and let the data inform us of the values of these parameters. To learn the
parameters, we trained the model in a smaller training data set, and predicted virus-host
interactions in the network of all viruses and hosts. Since the scales of $SV_+(v, b)$ and
$SV_-(v, b)$ are proportional to the size of the data set, in practice we used the normalized
variables, that is,
\[
SV_+(v, b) = \frac{1}{\|H_b\|} \sum_{v' \in H_b} s_2^*(v, v'), \qquad
SV_-(v, b) = \frac{1}{\|H_b^c\|} \sum_{v' \in H_b^c} s_2^*(v, v'),
\]
where $H_b = \{v' \mid I(v', b) = 1,\ v' \neq v\}$, $H_b^c = \{v' \mid I(v', b) = 0,\ v' \neq v\}$,
and $\|\cdot\|$ is the size of the set. When $H_b$ or $H_b^c$ is an empty set, the value of
$SV_+(v, b)$ or $SV_-(v, b)$ is set to zero.
To achieve the best performance, in addition to the similarity score $s_2^*$, we
integrate other types of features, including the CRISPR score and the alignment
score between the virus $v$ and host $b$ into the framework.
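The normalized virus-virus features $SV_+(v, b)$ and $SV_-(v, b)$ defined above can be sketched as follows. The sketch assumes that each reference virus has exactly one known host, as in the current data set, and that a function returning $s_2^*$ between two viruses is available; the names sv_features, known_hosts and similarity are hypothetical.

```python
def sv_features(virus, host, known_hosts, similarity):
    """Return (SV_plus, SV_minus) for a query virus and a candidate host.

    known_hosts: dict mapping each reference virus to its single known host.
    similarity:  callable giving s2*(virus, other_virus).
    """
    # H_b: reference viruses known to infect the candidate host (query excluded).
    infecting = [u for u, h in known_hosts.items() if u != virus and h == host]
    # H_b^c: reference viruses not known to infect the candidate host.
    others = [u for u, h in known_hosts.items() if u != virus and h != host]

    def mean_similarity(group):
        # An empty set contributes a feature value of zero.
        if not group:
            return 0.0
        return sum(similarity(virus, u) for u in group) / len(group)

    return mean_similarity(infecting), mean_similarity(others)
```

Averaging rather than summing keeps the two features on the same scale regardless of how many reference viruses are in the network.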
2.5 Sharing of CRISPR spacers between the virus and the host
The CRISPR systems play an important role as an adaptive and heritable
immune system for prokaryotes. They help the host fight against the invasion of
specific viruses by inserting small fragments of viral genomes (typically 21-72 bp)
as spacers into a CRISPR locus. The spacers are transcribed and are used as a
guide by a Cas complex to target the degradation of the corresponding viral DNA
[34].
Given a host genome, the CRISPR locus can be computationally located and
thus the spacers can be extracted. In our study, we used the CRISPR Recognition
Tool (CRT) [6] to find spacers. The spacers in a host genome (if available) were
aligned to a viral genome by blastn [3], and alignments with E-values less than 1
were recorded. This threshold is the same as the one used in a previous
study [21]. Since a lower E-value between a spacer and a virus genome indicates
higher similarity between them, we use $-\log(\text{E-value})$ to measure the strength of
association between the spacer and the virus genome. A host genome may contain
multiple spacers, and the strongest association between these
spacers and the virus genome indicates the strength of association between the
host and the virus. Therefore, for each pair of virus and host, we define the
score $S_{\mathrm{CRISPR}}(v, b)$ as the largest value of $-\log(\text{E-value})$ over the
spacers of the host. If there is no match
between a virus and host, a score of zero is assigned. We used CRT1.2-CLI [6]
to find CRISPRs in all bacterial genomes, with parameters -minNR 3 -minRL
20 -maxRL 50 -minSL 20 -maxSL 60 -searchWL 7. All identified CRISPRs
were merged into one file to construct a BLAST database using makeblastdb
(BLAST 2.6.0). We then searched all viral genomes against the database using
blastn with parameters -evalue 1 -gapopen 10 -penalty -1 -gapextend 2
-word_size 7 -perc_identity 90 -dust no -task blastn-short.
With the CRISPR information, we modify the model of $\pi_{ij}$ in Eq. (2.4) to
\[
\log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right)
= \alpha + \beta s_2^*(v_i, b_j) + \eta S_{\mathrm{CRISPR}}(v_i, G_{b_j}),
\]
and our logistic regression model in Eq. (2.5) to
\[
\mathrm{logit}(\Pr(I(v, b) = 1)) = \alpha + \beta s_2^*(v, b) + \gamma_+ SV_+(v, b)
+ \gamma_- SV_-(v, b) + \eta S_{\mathrm{CRISPR}}(v, G_b), \tag{2.6}
\]
where $G_b$ is the set of hosts that belong to the same genus as host $b$, and
\[
S_{\mathrm{CRISPR}}(v, G_b) = \max_{b' \in G_b} S_{\mathrm{CRISPR}}(v, b').
\]
Due to the limited availability of CRISPR information in the training data, as
shown in Fig. 3.1, we group hosts by genus for the CRISPR feature.
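The CRISPR score and its genus-level grouping can be sketched as below. Several details are assumptions made for illustration only: the blastn hits are taken to be in the standard 12-column tabular format (query id, subject id, ..., E-value in column 11, bit score in column 12) with viral genomes as queries and spacers as subjects, the natural logarithm is used, and E-values reported as 0 are floored at a small constant so that the logarithm is defined. The function and argument names are hypothetical.

```python
import math


def crispr_scores(blast_lines, spacer_host, evalue_cutoff=1.0, evalue_floor=1e-180):
    """Map (virus, host) to the largest -log(E-value) over the host's spacers.

    blast_lines: iterable of tab-separated hits (assumed 12-column tabular format).
    spacer_host: dict mapping each spacer id to the host genome it came from.
    """
    scores = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        virus, spacer, evalue = fields[0], fields[1], float(fields[10])
        if evalue >= evalue_cutoff:
            continue  # only hits with E-value below the cutoff are recorded
        key = (virus, spacer_host[spacer])
        score = -math.log(max(evalue, evalue_floor))  # floor is an assumption
        scores[key] = max(scores.get(key, 0.0), score)
    return scores


def genus_score(virus, host, scores, host_genus):
    """S_CRISPR(v, G_b): maximum score over all hosts in the same genus as host."""
    genus = host_genus[host]
    return max(
        (s for (v, h), s in scores.items() if v == virus and host_genus[h] == genus),
        default=0.0,  # no match anywhere in the genus gives a score of zero
    )
```

With this grouping, a candidate host without any CRISPR array of its own can still receive a non-zero score from a congeneric genome.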
2.6 The fraction of virus genome aligned to the host genome
Viruses and their hosts frequently exchange genetic material, and viruses play
important roles in horizontal gene transfer. Therefore, similar regions in virus and
host genomes can provide strong evidence for linking a virus to its potential
host. On the one hand, phages, especially temperate phages, are able to
integrate their own genomes into the host genomes. On the other hand, phages can obtain
genetic material from their hosts. If a genetic element brings an evolutionary
advantage to the virus, the borrowed genetic segment will be preserved in the
viral genome [21]. One example is cyanophages, phages that infect cyanobacteria.
Many cyanophages acquire and express host photosystem genes that are thought
to bolster photosynthetic energy during infection [61].
Similar to the method in [21], we used blastn to find similarities between
each pair of virus and host genomes. For each virus-host pair, their similarity,
$S_{\mathrm{blastn}}(v, b)$, is defined as the fraction of the virus genome that can be mapped
to the host genome. Only matches with percent identity higher than 90% are
used for prediction. Note that different parts of the virus genome can be matched
to different positions on the host genome and all contribute to the coverage
percentage. We used the same parameter setting as in [21] for our analysis. To
generate blastn results, a BLAST database was created for all bacterial genomes
by makeblastdb (BLAST 2.6.0). We then searched all viral genomes against the
database by blastn with parameters -word_size 11 -evalue 0.01 -reward 1
-penalty -2 -gapopen 0 -gapextend 0 -perc_identity 90.
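Because several matches may overlap on the virus genome, the covered fraction is most safely computed by merging the matched intervals before summing their lengths. The sketch below shows one way to do this; it assumes 1-based inclusive coordinates on the virus genome and keeps matches at or above the identity threshold (the text above says "higher than 90%", so the exact boundary handling is an assumption). The function name aligned_fraction is hypothetical.

```python
def aligned_fraction(hits, genome_length, min_identity=90.0):
    """Fraction of the virus genome covered by at least one retained match.

    hits: iterable of (start, end, percent_identity) on the virus genome,
          using 1-based inclusive coordinates.
    """
    intervals = sorted(
        (min(start, end), max(start, end))
        for start, end, identity in hits
        if identity >= min_identity
    )
    covered = 0
    current_start = current_end = None
    for start, end in intervals:
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping match: extend the current merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / genome_length
```

For example, matches at positions 1-100 and 51-150 cover 150 bases rather than 200, so overlapping regions are not counted twice.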
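Finally, with the CRISPR feature and the alignment-based similarity, we have
the following model:
\[
\begin{aligned}
\mathrm{logit}(\Pr(I(v, b) = 1)) = {} & \alpha + \beta s_2^*(v, b) + \gamma_+ SV_+(v, b)
+ \gamma_- SV_-(v, b) \\
& + \eta S_{\mathrm{CRISPR}}(v, G_b) + \delta S_{\mathrm{blastn}}(v, b).
\end{aligned} \tag{2.7}
\]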
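Once the coefficients are estimated, Eq. (2.7) turns the feature values of each candidate host into a probability, and the hosts can be ranked by that probability. The following sketch illustrates this step only; the feature names and any coefficient values passed to it are placeholders, not the fitted estimates reported in this dissertation.

```python
import math


def sigmoid(z):
    """Numerically stable inverse of the logit function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def interaction_probability(features, coef):
    """Pr(I(v, b) = 1) from a linear predictor of the form of Eq. (2.7).

    features: dict of feature values, e.g. keys "s2", "sv_plus", "sv_minus",
              "crispr", "blastn" (hypothetical names).
    coef:     dict with an "intercept" entry plus one weight per feature name.
    """
    z = coef["intercept"] + sum(coef[name] * value for name, value in features.items())
    return sigmoid(z)


def rank_hosts(candidates, coef):
    """Sort candidate hosts by predicted interaction probability, highest first."""
    scored = {host: interaction_probability(f, coef) for host, f in candidates.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

The top-ranked host is then the predicted host for the query virus; a threshold on the probability could be applied to withhold low-confidence predictions.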
2.7 Incorporation of WIsH score for predicting
hosts of virus contigs
In many metagenomic studies, the whole genome of a virus may not be available.
Instead, only parts of the virus genome, referred to as contigs, that were assembled
from shotgun reads are known. Several algorithms, such as VirFinder and
VirSorter [50, 56, 57, 38, 10, 69], can be used to decide whether the contigs come
from virus genomes. Our objective is to predict the hosts for full virus genomes as
well as viral contigs.
Galiez et al. [27] recently developed a program, WIsH, to predict the hosts of
viral contigs and showed that WIsH outperforms $d_2^{*}$ for predicting the hosts of viral
contigs as short as 5 kb. WIsH trains a homogeneous Markov chain model for each
host genome, and calculates the likelihood of a viral contig under each Markov
chain model. Instead of using $s_2^{*}(v,b)$ as a feature, we replace it with the
log-likelihood of viral contig $v$ under the Markov chain model of bacterium $b$,
$S_{\mathrm{WIsH}}(v,b)$. WIsH [27] scores were computed using WIsH 1.0 with the default
parameters. The model for predicting the host $b$ of viral contig $v$ then becomes
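$$
\operatorname{logit}(\Pr(I(v,b) = 1)) = \alpha + \beta S_{\mathrm{WIsH}}(v,b) + \gamma_{+} SV_{+}(v,b) + \gamma_{-} SV_{-}(v,b) + \eta S_{\mathrm{CRISPR}}(v,G_b), \tag{2.8}
$$
corresponding to Eq. (2.6), and
$$
\operatorname{logit}(\Pr(I(v,b) = 1)) = \alpha + \beta S_{\mathrm{WIsH}}(v,b) + \gamma_{+} SV_{+}(v,b) + \gamma_{-} SV_{-}(v,b) + \eta S_{\mathrm{CRISPR}}(v,G_b) + \delta S_{\mathrm{blastn}}(v,b), \tag{2.9}
$$
corresponding to Eq. (2.7).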
Note that both $SV_{+}(v,b)$ and $SV_{-}(v,b)$ are still computed by $s_2^{*}$, since WIsH
is not able to depict the similarities between viral contigs.
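The idea of scoring a contig under a host-specific homogeneous Markov chain can be sketched as follows. This is not the WIsH implementation: the chain order, the pseudocount smoothing, and the per-base normalisation below are assumptions chosen to keep the example small, and the function names are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov_chain(genome, order=2, pseudocount=1.0):
    """Return a function giving log P(next base | previous `order` bases)."""
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(order, len(genome)):
        context, base = genome[i - order:i], genome[i]
        # Skip windows containing ambiguous characters such as N.
        if base in ALPHABET and all(c in ALPHABET for c in context):
            counts[context][base] += 1.0

    def log_prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return math.log((row.get(base, 0.0) + pseudocount) / total)

    return log_prob

def log_likelihood_per_base(contig, log_prob, order=2):
    """Average log-likelihood of the contig under a trained host model."""
    terms = []
    for i in range(order, len(contig)):
        context, base = contig[i - order:i], contig[i]
        if base in ALPHABET and all(c in ALPHABET for c in context):
            terms.append(log_prob(context, base))
    return sum(terms) / len(terms) if terms else float("nan")
```

A contig whose k-mer transitions resemble those of the host genome receives a higher (less negative) score than a contig with transitions rarely seen in that host, which is the property exploited when the score is used as a virus-host similarity feature.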
[Figure 2.1 diagram: virus-virus, virus-host, and host-host networks built from virus-virus, virus-host, and host-host similarity; virus-host features are host CRISPR spacers, virus-host alignment (blastn), and alignment-free similarity (WIsH, s2*); an integrated network-based machine learning framework predicts virus-host interaction (yes/no).]
Figure 2.1: A novel two-layer network is constructed for representing virus-virus,
host-host, and virus-host similarities. Viruses (red circles) are connected based on
sequence similarity (red edges). Similarly, hosts (blue squares) are connected based
on sequence similarity (blue edges). The thickness of an edge indicates the degree
of similarity. The interaction between a virus and host pair (green edges) can
be predicted using multiple types of features: 1) the similarity between the virus
and other viruses infecting the host; 2) the similarity between the host and other
hosts infected by the virus; 3) the alignment-free sequence similarity between the
virus and the host based on k-mer frequencies; 4) the existence of shared CRISPR
spacers between the virus and the host; 5) alignment-based sequence matches
between the virus and the host. Finally, a network-based machine learning model
is used to integrate all different types of features and to predict the likelihood of
the interaction of a virus-host pair.
2.8 Model training and evaluation
Among the 2,288 viruses, we used the set of 826 viruses whose exact host
genome sequences were known and the set of their corresponding 185 hosts as
the positive training set. We randomly selected 826 virus-host pairs from among the
826 viruses and 185 hosts as negative training data. To reduce potential false
negative interactions, we required that the selected host for each virus not belong to
the same phylum as the true host. We then fit each of the models on the
training data. We repeated the selection of negative training
sets 100 times. For real applications and the software, we set the coefficients
by averaging over the 100 training runs to reduce randomness.
It is possible that the 826 selected non-interacting pairs contain some
positive-yet-unknown interaction pairs, which may influence the training and test
results. We recognize this possibility but assume that the fraction of such pairs is
low: virus-host interactions are specific, so the overall fraction
of interacting pairs among all possible pairs is very small. The additional
requirement that the host in a negative virus-host pair come from a different
phylum further mitigates this potential problem.
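The negative sampling scheme can be sketched as below. The sketch assumes one negative host is drawn per training virus, uniformly from the training hosts whose phylum differs from that of the true host; the one-per-virus reading, the data structures, and the function name are assumptions made for this example.

```python
import random

def sample_negative_pairs(true_host, host_phylum, seed=0):
    """Draw one presumed non-interacting host per virus from a different phylum.

    true_host:   dict mapping virus -> its known host
    host_phylum: dict mapping host -> phylum name
    """
    rng = random.Random(seed)
    hosts = sorted(host_phylum)
    negatives = []
    for virus in sorted(true_host):
        forbidden = host_phylum[true_host[virus]]
        candidates = [h for h in hosts if host_phylum[h] != forbidden]
        if not candidates:
            raise ValueError(f"no host outside phylum {forbidden!r} for {virus!r}")
        negatives.append((virus, rng.choice(candidates)))
    return negatives
```

Repeating the call with different seeds gives the replicate negative sets over which the fitted coefficients can be averaged.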
The trained models were then used to predict the hosts of the remaining 1,462
viruses against 62,493 candidate prokaryotic hosts. For each virus, we estimated
its probability of infecting each candidate host, and the host with the highest
probability was predicted as its host. For a taxonomic group $S$ at a higher taxonomic
level containing a set of hosts, we define the prediction score between $v$ and $S$ as the
maximum probability between $v$ and all hosts in $S$, that is,
$$
P(I(v,S) = 1) = \max_{b \in S} P(I(v,b) = 1).
$$
We predicted the host group of the virus $v$ as the one with the highest prediction
score $P(I(v,S) = 1)$. In case of ties, we counted the hosts attaining
the highest probability in each group and chose the group with the largest number of
such hosts. If more than one taxon still had
the same number of hosts attaining the highest probability, all such taxa were
reported.
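The group-level prediction and its tie-breaking rule can be written compactly. In the sketch below, the function name and the dictionary inputs are assumptions, and ties are detected by exact equality of probabilities, which is a simplification.

```python
from collections import Counter

def predict_host_groups(host_probs, host_group):
    """Return the predicted taxonomic group(s) for one virus.

    host_probs: dict mapping host -> predicted probability of interaction
    host_group: dict mapping host -> taxonomic group at the level of interest
    """
    # The group score is the maximum over its hosts, so the best groups are
    # exactly those containing a host that attains the global maximum.
    best = max(host_probs.values())
    top_counts = Counter(host_group[h] for h, p in host_probs.items() if p == best)
    # Tie-break by the number of hosts attaining the maximum; report all
    # groups that remain tied after that.
    most = max(top_counts.values())
    return sorted(group for group, n in top_counts.items() if n == most)
```

Returning a list rather than a single group keeps unresolved ties explicit, so they can be down-weighted when accuracy is computed.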
We then compared the predicted host taxonomic groups with the true taxonomic
group of every virus at several taxonomic levels: genus, family, order, class,
and phylum. At a particular taxonomic level $L$, let $T_v$ be the set of predicted
groups and $C_L(v) = I(h_v, T_v)/\|T_v\|$, where $I(h_v, T_v) = 1$ if the true host of $v$, $h_v$,
belongs to the set of the predicted host groups $T_v$, and $I(h_v, T_v) = 0$ otherwise.
The prediction accuracy for a certain taxonomic level is defined as
$$
\mathrm{Acc}_L = \frac{1}{\|V\|} \sum_{v \in V} C_L(v),
$$
where $V$ is the set of viruses for prediction.
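A direct transcription of this accuracy definition is shown below as a minimal sketch; the function name and the dictionary-based inputs are assumptions. Each virus contributes 1 divided by the number of reported groups if the true group is among them, and 0 otherwise.

```python
def accuracy_at_level(predicted_groups, true_group):
    """Tie-aware accuracy at one taxonomic level.

    predicted_groups: dict mapping virus -> set of predicted groups (ties allowed)
    true_group:       dict mapping virus -> true group of its host
    """
    total = 0.0
    for virus, truth in true_group.items():
        groups = predicted_groups.get(virus, set())
        if groups and truth in groups:
            total += 1.0 / len(groups)
    return total / len(true_group)
```

For example, a virus with two tied predicted groups, one of which is correct, contributes 0.5 rather than 1.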
2.9 Clustering of viral contigs
To examine the relatedness of viral contigs for novel host predictions, proteins
encoded on viral contigs were predicted by Prodigal 2.6.3 (with default parameters).
BLASTp 2.6.0 was then used to search for similar proteins shared between
viral contigs. The percentage of genes shared between two contigs was defined as
the number of pairs of homologous proteins between the two contigs divided by
the average number of proteins of the two contigs.
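As a small worked example of this ratio, the sketch below takes the number of homologous protein pairs and the protein counts of the two contigs as inputs. How homologous pairs are called from the BLASTp output is not specified here, so the count is treated as given; the function name is hypothetical.

```python
def shared_gene_percentage(n_homologous_pairs, n_proteins_a, n_proteins_b):
    """Percentage of genes shared between two contigs."""
    mean_proteins = (n_proteins_a + n_proteins_b) / 2.0
    if mean_proteins == 0:
        # Assumption: contigs with no predicted proteins share nothing.
        return 0.0
    return 100.0 * n_homologous_pairs / mean_proteins
```

For two contigs with 40 and 20 predicted proteins and 12 homologous pairs, the average protein count is 30 and the shared percentage is 40.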
2.10 Consideration of virus-host co-abundance
in host prediction
To investigate whether co-abundance can help the prediction of virus-host
interactions, we incorporated this feature into the model on a smaller data set to
evaluate its contribution. The data set included a subset of 2,695 prokaryotic reference
genomes and 1,403 viruses (see below). A total of 148 stool metagenomic samples
from the Human Microbiome Project (HMP) [17] and 103 metagenomes from
the Tara Oceans project (filter size 0.22 to 3 μm) [68] were collected. We used Centrifuge
[38] (centrifuge-1.0.3-beta) to compute the abundance of virus and bacterial
genomes in each of the metagenomes, resulting in a 251-dimensional abundance
profile for every virus and host genome. The co-abundance feature
$S_{\text{co-abundance}}(v,b)$ was defined as the Pearson correlation between the abundance
profiles of the virus and the bacterium. We then modified the integrated model
to
$$
\operatorname{logit}(P\{I(v,b) = 1\}) = \alpha + \beta s_2^{*}(v,b) + \gamma_{+} SV_{+}(v,b) + \gamma_{-} SV_{-}(v,b) + \delta S_{\text{co-abundance}}(v,b). \tag{2.10}
$$
We compared the performance of this model with that of the model in Eq. (2.5).
Both models were trained on a subset of 308 viruses and 50 hosts, comprising
308 true interacting pairs and 308 randomly chosen negative pairs. After
both models were trained, we predicted the hosts of 1,095 viruses. The results are
shown in Fig. A.1. The co-abundance feature by itself had weak predictive ability,
and adding it to the model did not improve prediction. Therefore, we did not consider
it as a feature in the final model presented in the main text.
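The co-abundance feature is a plain Pearson correlation between two abundance profiles. A minimal standard-library sketch follows; returning 0.0 for a constant profile (for example, a genome absent from every sample) is an assumption of this example, not a documented behaviour of the original analysis.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation between two equal-length abundance profiles."""
    if len(x) != len(y) or not x:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        # Assumption: a constant profile carries no co-abundance signal.
        return 0.0
    return sxy / math.sqrt(sxx * syy)
```

In the setting described above, each profile would have 251 entries, one per metagenome.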
2.11 Software
We developed a computational tool, VirHostMatcher-Net, implementing our
network-based integrated method for virus-host predictions. The software is pub-
licly available at https://github.com/WeiliWw/VirHostMatcher-Net. The tool
supports parallel computing and has the option of choosing the type of query
viruses (complete genomes or contigs). It also provides the option of specifying a
customized subset of candidate hosts for prediction. The tool provides informative
outputs including all the feature scores of the query viruses against all candidate
hosts, and a summarized table listing top predictions for each virus with their
feature scores, score percentiles, and accuracy. The score percentile of a virus-host
pair is defined as the percentile of this score among all scores between that virus
and all the candidate hosts. A large percentile suggests high relevance of the feature
score. The percentile of $SV_{-}$, the only feature with a negative coefficient,
is reversed to be consistent with the other feature score percentiles. The percentile
information helps to better understand how relevant each feature score is for a
particular prediction. We also provide an “accuracy” value that gives the fraction of
correct predictions when virus-host pairs with prediction scores above a given
threshold are declared as interacting.
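The score percentile, including the reversal for the feature with a negative coefficient, can be sketched as follows. The exact percentile convention used by the software is not stated here; this example assumes the percentage of candidate-host scores less than or equal to the given score, and the percentage greater than or equal to it when reversed. The function name is hypothetical.

```python
def score_percentile(score, candidate_scores, reverse=False):
    """Percentile of one virus-host score among that virus's candidate hosts.

    reverse=True is meant for a feature where smaller values are more
    supportive, so that a high percentile always means high relevance.
    """
    if reverse:
        count = sum(1 for s in candidate_scores if s >= score)
    else:
        count = sum(1 for s in candidate_scores if s <= score)
    return 100.0 * count / len(candidate_scores)
```

With this convention the smallest score of a reversed feature receives a percentile of 100, matching the interpretation of the other features.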
Chapter 3
Results
3.1 A novel network-based integrated framework
for predicting virus-host interaction
We collected from NCBI the genomes of a set of known virus-host interaction
pairs, $S_{+}$, and generated a set of random virus-host pairs that most likely do not
interact, $S_{-}$, as the training data for this study. Our objective was to develop
a machine learning approach to predict the probability of interaction between a
query virus-host pair $(v,b)$, denoted as $P(I(v,b) = 1)$, where $I(v,b)$ denotes the
interaction status of a virus $v$ and a host $b$, with value 1 indicating interaction
and 0 indicating no interaction. To achieve the best performance, we
considered several factors that contribute to the interaction of
a virus-host pair $(v,b)$. First, if a virus is genetically close to viruses infecting
a particular host, this virus is highly likely to infect the same host [77, 70].
Conversely, if a virus infects a host, the virus should be genetically distant
from the viruses that do not infect the host. Second, the similarity among hosts
indicates the possibility of infection by the same virus [24, 25]. If a potential
host belongs to the same taxon as the known host of the virus, then that host is
likely to be infected by the virus. Third, the similarity between a virus and a host
in terms of genomic signatures reflects the likelihood of interaction [2]. If a virus
genome is similar to a host genome in terms of the alignment-free k-mer usage
pattern, the pair is predicted to have a high probability of interacting. Finally, the
existence of virus-host shared CRISPR spacers and alignment-based matches
(i.e., BLASTn) are strong evidence of interaction.
All together, virus-virus similarity, host-host similarity, and virus-host similar-
ity can be integrated to form a two-layer network connecting viruses and hosts.
Thus, we constructed a virus-host pairs (VHP) network where nodes are VHPs
and edge weights are the pairwise similarities between VHPs. We developed an
integrated network-based Markov random field (MRF) approach that systemati-
cally and comprehensively integrates various types of features to predict interact-
ing virus-host pairs. The probability of a given VHP to be interactive is based
on the characteristics of this VHP itself, and the connectivity between this VHP
and its neighbor VHPs in the network. Intuitively, the characteristics of a VHP
itself include alignment-free score, the fraction of alignment-based matches, and
the existence of shared CRISPR spacers. The connectivity between this VHP and
other VHPs is defined based on the genome similarity between the virus and other
viruses infecting the same host. The outline of the framework is illustrated in
Fig. 2.1. The details of the models for this framework can be found in the Materials
and Methods chapter.
3.2 Feature scores are significantly different
between positive and negative virus-host
pairs
We incorporated multiple types of features that contribute to the prediction
of virus-host interactions. To assess the discriminatory power of each feature, we
compared the distributions of the feature values between the virus-host interacting
pairs and the non-interacting pairs. A set of 826 known virus-host interacting pairs
was used as the positive set, and a set of the same number of randomly selected
virus-host pairs was used as the negative set. See the Methods section for details
of the data collection and the simulation of negative pairs. We used a one-sided
t-test to test whether the feature values in the positive set are significantly higher or
lower than those in the negative set.
First, the alignment-free similarity score $s_2^{*}(v,b)$ was used to measure the
similarity between virus and host pairs, where $s_2^{*} = 1 - 2d_2^{*}$ and the k-mer based
dissimilarity score $d_2^{*}$ is defined in our previous work [2]. The measure $s_2^{*}$ has an
advantage over other classical similarity measures because of its precise correction
of background noise, and has shown superior accuracy for predicting virus-host
interactions [2]. See the Methods section for the definition of $s_2^{*}(v,b)$. The $s_2^{*}$
score had significantly higher values (p-value < 2.2e-16, one-sided t-test) for positive
virus-host pairs than for negative pairs (Fig. 3.1a). The mean $s_2^{*}$ similarity
score between positive pairs was 0.52, while the mean $s_2^{*}$ similarity between negative
pairs was 0.24.
The WIsH score, proposed by Galiez et al. [27], is another alignment-free similarity
measure for a virus-host pair. It uses a log-likelihood score of a Markov chain
model to measure similarity between viruses and hosts. We computed the WIsH
scores for both positive and negative virus-host pairs, and found that the WIsH
scores for positive virus-host interacting pairs were significantly higher than those for
the negative virus-host pairs (p-value = 1e-10) (Fig. 3.1b). In fact, we observed
that the WIsH and $s_2^{*}$ scores were highly correlated (Pearson correlation coefficient
ρ = 0.85, p-value < 2.2e-16). We predicted a virus-host pair as interacting if a
similarity measure, $s_2^{*}$ or WIsH, was above a threshold and, by varying the
threshold, plotted the corresponding receiver operating characteristic (ROC) curve.
The area under the ROC curve (AUROC),
which measures the discriminative ability between positive and negative pairs, was
0.91 for $s_2^{*}$ and 0.86 for WIsH (Fig. A.2). Though the distinguishing power of
WIsH was lower than that of $s_2^{*}$ on complete genomes, WIsH was previously
shown to be more effective than $s_2^{*}$ when predicting hosts of partial viral genomes
[27]. Therefore, we decided to use $s_2^{*}$ to measure virus-host alignment-free similarity
when the length of the viral sequence is close to the size of a complete genome,
and to use WIsH to measure the virus-host similarity for short contigs.
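The AUROC of a single score can be computed without tracing the ROC curve explicitly, because it equals the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair (ties counted as one half). The sketch below uses this pairwise formulation; it is an illustration and not the code used to produce the reported values.

```python
def auroc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

The double loop is quadratic in the number of pairs, which is acceptable for sets of a few hundred positive and negative pairs; a rank-based formulation would be preferable for much larger sets.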
Second, for a given virus-host pair $(v,b)$, we defined the similarity between a
virus $v$ and other viruses infecting the host $b$, denoted as $SV_{+}(v,b)$, and likewise,
the similarity between virus $v$ and other viruses not infecting the host $b$, denoted as
$SV_{-}(v,b)$. See Methods for the details of their definitions. We hypothesized that,
for a true interacting virus-host pair $(v,b)$, other viruses that infect the same host
$b$ should exhibit high similarity to the virus $v$, resulting in a high $SV_{+}(v,b)$. At
the same time, other viruses not infecting the host $b$ should have low similarity to
the virus $v$, resulting in a low $SV_{-}(v,b)$. For a non-interacting virus-host pair, the
trends of $SV_{+}(v,b)$ and $SV_{-}(v,b)$ should be reversed. Consistent with our
hypothesis, $SV_{+}(v,b)$ scores were significantly higher for positive virus-host pairs
than for negative pairs, and vice versa for $SV_{-}(v,b)$ scores (both p-values < 2.2e-16,
Fig. 3.1c-d).
Third, we included information from CRISPR matches and alignment-based
genome similarity between viruses and hosts. The CRISPR score was defined as
the highest alignment score between the predicted CRISPR spacers in a host and a
viral genome, and the alignment-based matching score was defined as the fraction
of the virus genome that significantly matches the host genome using blastn (> 90%
identity, see Methods). For simplicity, we refer to the alignment-based matching
score as the BLAST score. Both CRISPR and BLAST scores were significantly
higher for the true interacting virus-host pairs than for the non-interacting pairs, with
p-values of 0.0001 and < 2.2e-16 for one-sided t-tests, respectively. Fig. 3.1e-f
shows the limited frequency of CRISPR and BLAST matches between viruses and
hosts.
[Figure 3.1 boxplots, each comparing positive and negative pairs: a) s2*, similarity; b) WIsH, similarity; c) SV+, score; d) SV-, score; e) BLAST, percent of virus genome aligned; f) CRISPR, -log[evalue].]
Figure 3.1: Distributions of the different feature values among 826 interacting
and non-interacting virus-host pairs. The positive set consists of 826 known
infecting virus-host pairs, and the same number of randomly selected virus-host pairs
were used as the non-interacting, negative set. a) Boxplots of similarity defined by
$s_2^{*}(v,b)$; b) boxplots of the log-likelihood scores given by WIsH; c) boxplots of the
$SV_{+}(v,b)$ scores; d) boxplots of the $SV_{-}(v,b)$ scores; e) boxplots of BLAST scores; f)
boxplots of the CRISPR scores. For all panels, the horizontal bar displays the
median; boxes display the first and third quartiles; whiskers depict minimum and
maximum values; and points depict outliers beyond the whiskers.
3.3 Integrated approach markedly increases host
prediction accuracy
We integrated the multiple types of features proposed previously to predict
virus-host interactions using a general framework of MRF, where the nodes were
virus-host pairs (VHP) and edges were the similarity between the VHPs. We
investigated the prediction accuracies of the newly developed integrated models in
Equations (2.6) and (2.7) (see Methods), and compared the accuracies with those
using the individual features. The model in Eq. (2.6) incorporates the network
features, including the virus-virus similarities $SV_{+}$ and $SV_{-}$, the virus-host
similarity $s_2^{*}$, and the CRISPR score. The model in Eq. (2.7) combines the features
in Eq. (2.6) with the BLAST scores. For each of the integrated models, we learned the
parameters using the 826 positive and the same number of negative virus-host pairs, and
then tested the trained model on the remaining 1,462 viruses whose true hosts
are known against 62,493 candidate hosts.
We assessed the prediction accuracies of the trained models using an independent
set of 1,462 viruses at different taxonomic levels, including genus, family,
order, class, and phylum. For each virus, we computed the prediction scores
between this virus and all candidate hosts (n = 62,493) using the trained models,
and predicted the host as the one having the highest prediction score. The prediction
accuracy was calculated as the percentage of viruses whose predicted hosts
had the same taxonomy as their respective known hosts. Host prediction accuracies
were markedly higher for the integrated approach using network features and
CRISPR scores than using $s_2^{*}$ or CRISPR scores alone (Fig. 3.2). For example, at
the genus level, prediction accuracy was 31% and 43% when using $s_2^{*}$ and CRISPR,
respectively. Combining network similarity features with the CRISPR score
(Eq. (2.6)) increased prediction accuracy to 59%, a 1.4-fold increase relative to
CRISPR alone.
Alignment-based BLAST scores alone had a prediction accuracy of 41%, comparable
to that based on CRISPR scores. However, incorporating BLAST into the
network model in Eq. (2.5) or Eq. (2.6) did not yield better performance than
the model in Eq. (2.6) (Fig. 3.2). Therefore, the model in Eq. (2.6), which incorporates
the network features, virus-host similarity $s_2^{*}$, and CRISPR, had the highest
accuracy and was used in the subsequent host prediction applications. For the
higher taxonomic levels of family, order, class, and phylum, the network-based
integrated framework also achieved large improvements over the prediction accuracy
of individual features, yielding 70%, 78%, 83%, and 86% prediction accuracy,
respectively. At the species level, the prediction accuracy was 43%. The estimated
coefficients and the corresponding p-values of the features are shown in Table 3.1.
All the coefficients had the expected signs, consistent with the observations
in Fig. 3.1, and the p-values for the coefficients were
all < 0.05.
[Figure 3.2 bar chart: accuracy (0 to 0.9) by taxonomic level (species, genus, family, order, class, phylum) for s2*, BLAST, CRISPR, Network, Network+BLAST, Network+CRISPR+BLAST, and Network+CRISPR.]
Figure 3.2: Prediction accuracies of the different approaches for 1,462
viruses. Prediction accuracies for 1,462 viral genomes whose true hosts are known
against 62,493 candidate hosts, binned by taxonomic level. The first three bars
show results using individual features of $s_2^{*}(v,b)$, CRISPR score, or alignment-based
similarity score (blastn), respectively. The remaining bars show results with
integrated network models, trained using 826 positive and the same number of
negative virus-host pairs as in Fig. 3.1. In order, these are the model in Eq. (2.5),
which incorporates the network-based features $SV_{+}(v,b)$ and $SV_{-}(v,b)$ and the
alignment-free virus-host similarity $s_2^{*}(v,b)$, in addition to the blastn scores
(“Network+BLAST”); the model in Eq. (2.7) (“Network+CRISPR+BLAST”); and the model
in Eq. (2.6) (“Network+CRISPR”). Error bars for the network-based results depict 95%
confidence intervals using 100 replicates of negative training sets (random virus-host pairs).
Model                                     s2*      S_WIsH   SV+      SV-      S_CRISPR
Complete genomes        Coeff.    16.41    -        4.44     -27.38   0.13
using Eq. (2.6) (a)     p-value   <2e-16   -        <2e-16   <2e-16   0.0002
Short contigs           Coeff.    -        25.96    6.46     -15.29   0.19
using Eq. (2.8) (b)     p-value   -        <2e-16   <2e-16   <2e-16   0.0069

Table 3.1: The estimated coefficients and corresponding p-values for host prediction
features.
(a) Results for complete viral genomes using the network-based integrated model in
Eq. (2.6).
(b) Results for short viral contigs using the model in Eq. (2.8).
“Coeff.” = coefficient. Since different negative training sets yielded slightly different
estimated coefficients of the features, we show one example here.
3.4 Integrated approach improves host predic-
tion accuracy of short viral sequences
Viral contigs assembled from metagenomic data often represent partial viral
genomes. We tested an integrated model in Eq. (2.8) that uses WIsH scores
instead of $s_2^{*}$ for measuring the alignment-free similarity between viruses and hosts.
We evaluated the accuracy of the model for predicting the hosts of viral contigs
of various lengths, and investigated the effect of viral sequence length on the
prediction accuracy. To evaluate the performance of host prediction for short viral
contigs, we randomly sub-sampled fragments of different lengths (1 kb, 2 kb, 5
kb, 10 kb, and 20 kb) from each of the 1,462 viral genomes. For a given viral
genome and a fixed contig length, we chose a segment of that length
uniformly at random from the genome. If the fixed length was longer than the size of the
complete genome, we took the entire genome. This procedure was repeated 10
times for each contig length. We then computed all the features of the contigs
using the same procedure as for the complete viral genome analyses, with the only
difference being that the $s_2^{*}$ similarity was replaced with the WIsH score [27]. The
model was trained with the same set of 826 virus-host positive pairs and the same
number of negative pairs, using the same scheme as before but with $s_2^{*}$ replaced by
the WIsH likelihood score. With the trained model, we predicted the hosts for
all sub-sampled contigs. The results for different models on viral contigs of length
5 kb are shown in Fig. 3.3. With the WIsH score alone, the prediction accuracy at
the genus level was 35%. Adding the network features $SV_{+}$ and $SV_{-}$ improved
the accuracy to 48%. Similar to the results for complete viral genomes,
the model in Eq. (2.8) performed best (Fig. 3.3). For viral contigs of length 5 kb,
the model had 53% prediction accuracy at the genus level and 85% at the phylum
level.
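The contig simulation described above can be sketched in a few lines. The function names and the seeding scheme are assumptions of this example; the logic follows the stated procedure of drawing a uniformly placed segment of fixed length, falling back to the whole genome when the genome is shorter than the requested length, and repeating the draw 10 times.

```python
import random

def subsample_contig(genome, contig_length, rng):
    """Draw one contig of the given length uniformly from the genome."""
    if contig_length >= len(genome):
        # Requested length exceeds the genome: use the entire genome.
        return genome
    start = rng.randrange(len(genome) - contig_length + 1)
    return genome[start:start + contig_length]

def subsample_replicates(genome, contig_length, n_replicates=10, seed=0):
    """Repeat the draw to obtain independent replicate contigs."""
    rng = random.Random(seed)
    return [subsample_contig(genome, contig_length, rng)
            for _ in range(n_replicates)]
```

Calling `subsample_replicates` once per genome for each of the lengths 1,000, 2,000, 5,000, 10,000, and 20,000 would reproduce the layout of the simulated data set.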
The average prediction accuracies for each contig length are shown in Fig. 3.4.
Our model (solid lines) achieved a large improvement compared to the results of
WIsH alone (dashed lines). For example, when the contig length was 20 kb, the
prediction accuracy using our model was about 19-26% higher than that of WIsH
at the genus, family, and order levels. As expected, the prediction accuracy of
our model (solid lines) increased with contig length. For instance, at the genus
level, the accuracy increased from 42% for contigs of length 1 kb to 48% for 2 kb,
53% for 5 kb, 55% for 10 kb, and 57% for 20 kb (Fig. 3.4). Given these results,
we provide our framework with two models for host prediction: one for complete
or nearly complete viral genomes using the model in Eq. (2.6), and one for short
viral contigs using the model in Eq. (2.8).
[Figure 3.3 bar chart: accuracy (0 to 0.9) by taxonomic level (genus, family, order, class, phylum) for WIsH, Network, Network+CRISPR+BLAST, and Network+CRISPR.]
Figure 3.3: Prediction accuracies of the different approaches for viral
contigs of length 5 kb. Prediction accuracies for viral contigs of length 5 kb,
binned by taxonomic level. The first bar shows results using the WIsH method alone,
as in [27]. The remaining bars show results with integrated network models, similar
to Fig. 3.2. All bars are calculated based on the average accuracies for 10 different
sets of viral contigs.
[Figure 3.4 line plot: accuracy (0.3 to 0.9) by taxonomic level (genus, family, order, class, phylum); contig lengths 20k, 10k, 5k, 2k, and 1k; models: network and WIsH.]
Figure 3.4: Prediction accuracies of the different approaches for simulated
viral contigs of length 5kb. Prediction accuracies for contigs subsampled at
various lengths from the 1,462 virus genomes. Mean accuracies are shown at
different taxonomic levels using WIsH scores only (dashed lines) or the integrated
model in Eq. (2.8) (solid lines) that uses WIsH scores in place of $s_2^{*}$ scores.
3.5 Thresholding on the prediction score further
improves accuracy
In many situations, investigators are interested in making sure the predicted
hosts are as accurate as possible, i.e., that the predictions have high precision or a
low false discovery rate. Therefore, we investigated how the accuracy changes when
thresholding on the predicted probability of interaction $P(I(v,b) = 1)$. In the above
analysis, we predicted the host of every virus as the one with the highest score.
However, sometimes the highest score was relatively low. For example, as shown
in Fig. 3.5, the highest prediction score among the 62,493 hosts for some viruses
in the complete genome test set was as low as 0.31. Low scores may occur, for
example, when the true host is not in the database of potential hosts. To
improve the prediction accuracy, we can set a threshold such that host predictions
are only made if the score is above that threshold. For instance, when the threshold
was set at 0.95, prediction accuracy improved at all taxonomic
levels. Specifically, at the genus level, accuracy improved by 13 percentage points,
from 59% to 72%; at the phylum level, it improved by 4 percentage points, from 86% to 90%.
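The trade-off between accuracy and the fraction of viruses receiving a prediction can be computed as in the following sketch. The function name and input layout are assumptions; "recall" here means the proportion of viruses whose top score exceeds the threshold, and accuracy is computed only over those viruses.

```python
def accuracy_and_recall(top_scores, is_correct, threshold):
    """Accuracy among thresholded predictions and the fraction of viruses predicted.

    top_scores: highest prediction score of each virus
    is_correct: whether the top prediction of each virus matches its true host group
    """
    kept = [ok for score, ok in zip(top_scores, is_correct) if score > threshold]
    recall = len(kept) / len(top_scores)
    # Accuracy is undefined when no prediction passes the threshold.
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return accuracy, recall
```

Sweeping the threshold over a grid of values and plotting both returned quantities gives curves of the kind used to choose an operating point.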
[Figure 3.5 line plot: accuracy and recall (0.4 to 1.0) versus prediction score threshold (0.4 to 1.0), with accuracy curves for genus, family, order, class, and phylum and a recall curve.]
Figure 3.5: Improvement in host prediction by thresholding on the pre-
diction score. By applying a given threshold, predictions were made only when
the prediction score is above the threshold. Predictions were made using the whole
genomes of 1,462 viruses whose true hosts are known among 62,493 hosts as in Fig.
3.2. The proportion of viruses that can be predicted (recall rate) decreases as the
prediction accuracy at all levels increases.
3.6 Prediction accuracy varies for different viral
families
Viruses from three major families, Siphoviridae, Myoviridae, and Podoviridae,
are highly represented in our evaluation data set (42%, 24%, and 18%, respectively).
Previous host predictions with $s_2^{*}$ showed notable differences in prediction accuracy
among these families [2]. Therefore, we examined prediction accuracies using our
model (Fig. 3.6). We found that the Siphoviridae family of viruses in our data set
had generally higher prediction accuracy than other families of viruses, achieving
72% accuracy compared with the average accuracy of 59% for all types of viruses,
consistent with previous results using the $s_2^{*}$ scores alone [2]. The prediction
accuracies for the different virus families with various thresholds on the prediction
score are shown in Fig. A.4. We also noticed that the top prediction scores for the
Siphoviridae family of viruses were significantly higher than those for the other two
families (Kolmogorov-Smirnov test, p-value < 1e-15). These observations may
be explained by the facts that 1) Siphoviridae is the most abundant viral family
in the training data (75%, n = 618); and 2) siphoviruses typically have relatively narrow
host ranges, whereas podoviruses and myoviruses often have broader host ranges
[67, 75, 15], though recent studies suggest that current isolation techniques may
result in the under-representation of broad host range viruses and that the true
host range of viruses is hard to define [24, 54].
To investigate whether the high host prediction accuracy for siphoviruses is due to their
high abundance in the training set, we trained a new model only on podoviruses
(n = 76) and myoviruses (n = 113), and tested the model on siphoviruses in the
validation set (n = 607). Comparing the performance of this model with that of the model
trained with the full training set, we found that the difference in prediction accuracy was
less than 1% at each taxonomic level, from species to phylum (Table B.2).
To further investigate the sensitivity of the model to the training data, we similarly
trained a model excluding a certain group of viruses from the training set and
evaluated the host prediction accuracy for that group of viruses in the validation
set. The same procedure was conducted for several groups, including the other two
major virus families (Myoviridae and Podoviridae) and groups of viruses infecting
the common host taxonomic groups (E. coli, Proteobacteria, Actinobacteria, and
Firmicutes). The overall decrease in host prediction accuracy for the excluded
groups of viruses was on average 2.6%. The detailed results are in Table B.2.
[Figure 3.6 bar chart, "Performance on different viral groups": accuracy (0.0 to 1.0) by taxonomic level (genus, family, order, class, phylum) for all, Siphoviridae, Myoviridae, Podoviridae, and others.]
Figure 3.6: Differences in prediction accuracy across viral families. Prediction
accuracies for different virus families within the order Caudovirales: Siphoviridae,
Myoviridae, and Podoviridae. For comparison, accuracies are shown for all
viruses (“all”) and for viruses outside of the Caudovirales or whose virus
families were not listed in the GenBank files (“other”). Predictions were made
using whole viral genomes with no thresholding.
3.7 Prediction of the host of crAss-like phage
ΦcrAss001
CrAssphage was first discovered through the cross-assembly of human fecal
metagenomes and was originally published as an individual genome that is referred
to as the prototypical crAssphage (p-crAssphage) [20]. Though crAssphage is ubiquitous
in human gut samples and comprises up to 90% of the sequencing reads in some
fecal viral metagenomes [20], little is known about the biological significance and
the hosts of crAssphage, due to the difficulty of culturing crAssphage and the high
divergence between crAssphage and known viruses. Different methods have been
used to predict the hosts of crAssphage. Dutilh et al. [20] predicted its host to belong
to the phylum Bacteroidetes using the co-occurrence profile between crAssphage and 404
potential human gut bacterial hosts across 151 human gut metagenomes from the
Human Microbiome Project (HMP). Ahlgren et al. [2] compared the alignment-free
similarity between crAssphage and the potential hosts, and found that the genera
Bacteroides, Coprobacillus, and Fusobacterium had significantly high
similarity to crAssphage.
Recently, Shkoporov et al. [62] isolated a particular strain of crAssphage,
ΦcrAss001, by enriching viral fraction gut samples on a collection of 54 bacteria
strains from the human gut. They subsequently showed that ΦcrAss001 specifi-
cally infects only one of 14 strains of Bacteroides tested, Bacteroides intestinalis
919/174. We first predicted the host of ΦcrAss001 using 22 species of the bacteria
used to enrich ΦcrAss001 (and whose genomes are available) that span four phyla
and 14 genera (Table. B.1). A Bacteroides intestinalis strain had the highest pre-
diction score of 0.962, congruent with the experimental results of Shkoporov et al.
[62]. Alignment-based scores such as CRISPR and BLAST were all 0 and did not
contribute to the prediction. The main contribution comes from the alignment-free
similarity score $s_2^*$ of 0.5. We then applied the integrated
approach to predict the host of ΦcrAss001 using the large database of 62,493 host
genomes and found that all of the top 25 predictions belong to the Bacteroidetes
phylum, including 23 belonging to the genus Prevotella. ΦcrAss001 was classified
as a genus VI crAssphage [62]. Guerin et al. [31] previously hypothesized that
genus VI crAss-like phages infect Prevotella based on the observation that these
two genera of virus and host were both enriched in malnourished and healthy
Malawian infants. Our host prediction of ΦcrAss001 is therefore consistent with
this hypothesis.
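The tally of top-ranked candidates by genus can be sketched as below. This is only an illustration: the genome identifiers, scores, and the `genus_of` mapping are made-up placeholders, and the function name is hypothetical.

```python
from collections import Counter

def top_k_genus_tally(scores, genus_of, k=25):
    """Rank candidate host genomes by prediction score and count genera among the top k."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return Counter(genus_of[genome] for genome in ranked)

# Placeholder data, not results from this work.
scores = {"g1": 0.99, "g2": 0.97, "g3": 0.96, "g4": 0.40}
genus_of = {
    "g1": "Prevotella",
    "g2": "Prevotella",
    "g3": "Bacteroides",
    "g4": "Escherichia",
}
tally = top_k_genus_tally(scores, genus_of, k=3)
```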
3.8 Host prediction for marine environmental
viral genomes
Metagenomic sequencing has provided access to a broad range of viral genomes
and has played an important role in studying uncharacterized marine viral genetic
materials. Nishimura et al. [46] compiled a set of 1,811 marine environmental
viral genomes (EVGs) including those newly assembled from the Tara Oceans [9]
and Osaka Bay viromes and previously reported EVGs [39, 5, 45]. They pre-
dicted putative hosts of the EVGs based on the gene-based similarity between
the EVGs and the cultured viral genomes with known hosts. In particular, they
compiled another set of cultured viral genomes as a reference (RVGs) and created
a proteomic tree for all EVGs and RVGs by the all-against-all distance matrix
calculated from tBLASTx. They first assigned hosts by directly comparing the
proteomic similarity between the EVGs and RVGs resulting in host assignment
for 29 EVGs. They then constructed genus-level genomic operational taxonomic
units (gOTUs) according to the proteomic tree. Based on the identification and
phylogenetic analysis of various functional genes in EVGs and their closeness to
related RVGs in the proteomic tree, they predicted the hosts of gOTUs at different
host taxonomic levels (phylum to genus). In total they predicted the hosts for 564
EVGs.
We used our integrated model in Eq. (2.6) to predict the hosts for the 1,811
EVGs using a set of 4,034 marine bacteria as host candidates. We set a cutoff
of 0.95 on the prediction score to ensure 90% prediction accuracy at the phylum
level (Fig. 3.5). With this cutoff, our model was able to make host predictions
for 676 EVGs, among which 233 EVGs also had phylum-level host predictions by
Nishimura et al. Compared with the predictions of Nishimura et al., our method
had consistent predictions for 172 (74%) of the 233 EVGs at the phylum level and
156 (77%) of the 203 EVGs at the class level (only 203 EVGs have class-level predictions
from both our method and Nishimura et al.). In particular, our predictions were consistent
with the previous predictions for the entire group of 16 cyanobacteria viruses.
For a group of viruses that Nishimura et al. predicted to infect Proteobacteria,
our predictions agree with theirs in 24 out of 39 cases at the phylum level. For
another group of 158 viruses that were previously predicted as Flavobacteriaceae
(within the phylum Bacteroidetes) phages, our predictions were consistent with
theirs for 127 viruses at the family level. Note that the inconsistency between
our predictions and those of Nishimura et al. may be due to the different choices of features
used for prediction. Predictions in Nishimura et al. are based on the similarity
between virus genomes, while our method uses not only the similarity between
viruses, but also the CRISPR scores between virus and host genomes, which are
direct evidence for interactions. In addition, our method was able to predict more
hosts at lower taxonomic levels compared with the previous method. We had all
233 EVGs predicted at the order or lower host taxonomic level, a 9% increase in
the number of EVGs that the previous method was able to predict.
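The agreement rates quoted above are simple proportions over the EVGs that both methods predicted at a given rank. A minimal sketch of that comparison follows; the dictionary layout (virus ID mapped to a rank-to-taxon dictionary) and the toy lineages are assumptions made for illustration.

```python
def agreement(ours, theirs, rank):
    """Return (matching, compared) over viruses that both methods predicted at `rank`."""
    shared = [
        v for v in ours
        if v in theirs and ours[v].get(rank) and theirs[v].get(rank)
    ]
    matching = sum(ours[v][rank] == theirs[v][rank] for v in shared)
    return matching, len(shared)

# Toy predictions (hypothetical virus IDs and lineages).
ours = {
    "v1": {"phylum": "Cyanobacteria", "class": "Cyanophyceae"},
    "v2": {"phylum": "Proteobacteria"},
    "v3": {"phylum": "Bacteroidetes"},
}
theirs = {
    "v1": {"phylum": "Cyanobacteria", "class": "Cyanophyceae"},
    "v2": {"phylum": "Bacteroidetes", "class": "Flavobacteriia"},
}
phylum_match = agreement(ours, theirs, "phylum")
class_match = agreement(ours, theirs, "class")

# The reported rates follow from the counts given in the text.
phylum_rate = round(100 * 172 / 233)
class_rate = round(100 * 156 / 203)
```

Restricting the denominator to viruses predicted by both methods at that rank is why the class-level comparison covers fewer EVGs than the phylum-level one.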
For the 443 viruses whose hosts were not predicted previously and only pre-
dicted by our method, their predicted hosts include 4 phyla and 22 genera. In
particular, we discovered 11 viruses infecting 8 novel host genera that are absent
from the data set of 2,288 isolate virus genomes.
3.9 Host prediction for metagenomic viral con-
tigs from various habitats
Paez-Espino et al. [48] analyzed over three thousand geographically diverse
metagenomic samples and identified 125,842 putative metagenomic viral contigs
of median length 11 kb, revealing the extended viral genetic diversity in various
environments [48]. In the original prediction, the metagenomic viral contigs and
other 2,536 isolated contigs were first clustered into viral groups or singletons.
They predicted the hosts of the viral contigs using a series of analyses including
projecting the isolate viral-host information onto viral groups, matching viral con-
tigs to a database of 3.5 million CRISPR spacers found in prokaryotic genomes,
and identifying tRNA sequences in corresponding hosts. The analysis predicted
hosts for 9,992 (7.7%) viral contigs. To evaluate our integrated approach for host
prediction, we first used our method in Eq. (2.8) to predict the hosts of those
putative metagenomic viral contigs. We then compared our predictions with those
of Paez-Espino et al. by concentrating on 5,105 metagenomic contigs whose pre-
viously predicted host families were present in our host database and having a
prediction score above 0.95. Our predictions were consistent with the vast major-
ity of the original predictions, having 96% consistency at the phylum level (Table
3.2). Our predictions matched the previous predictions at an even higher rate
(97% at the phylum level) for 62.7% of viruses whose hosts were previously inferred
based on direct evidence of CRISPR spacer matches or tRNA matches to the
hosts. For viruses whose hosts were inferred indirectly based on the hosts of other
viruses in the same viral groups, our predicted hosts had 93% consistency with
those based on the previous method at the phylum level. Thus, the inconsistent
predictions mostly occurred for the viruses whose hosts were previously inferred
based on viral group membership. For those viruses with inconsistent predictions,
88% of our predictions had significant network scores (above the 95th percentile), 86% had
significant WIsH scores, and 43% had significant CRISPR scores.
                                     Genus  Family  Order  Class  Phylum
Overall (a)                          82%    86%     90%    90%    96%
Extensive predictions only (b)       75%    78%     82%    82%    93%
Excluding extensive predictions (c)  86%    91%     95%    95%    97%
Table 3.2: Proportions of congruent predictions for viral contigs between our
method and those in Paez-Espino et al. [48]
a - Calculated based on all 5,105 metagenomic viral contigs.
b - Calculated based on 1,902 metagenomic viral contigs whose predictions were
previously inferred indirectly from group membership instead of direct evidence.
c - Calculated based on 3,203 metagenomic viral contigs whose previous predictions
were inferred directly by CRISPR spacer matches or tRNA matches.
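As a quick consistency check on Table 3.2, the overall row should be the contig-weighted average of the two subsets. The sketch below assumes the 62.7% share of directly inferred hosts quoted in the text and uses the rounded percentages from the table, so agreement is only expected to the nearest percent.

```python
RANKS = ["genus", "family", "order", "class", "phylum"]

# Rounded percentages from Table 3.2.
direct = [86, 91, 95, 95, 97]      # excluding extensive predictions
indirect = [75, 78, 82, 82, 93]    # extensive predictions only
direct_share = 0.627               # share of directly inferred hosts, from the text

overall = [
    round(direct_share * d + (1 - direct_share) * i)
    for d, i in zip(direct, indirect)
]
by_rank = dict(zip(RANKS, overall))
```

The weighted averages reproduce the overall row of the table (82%, 86%, 90%, 90%, 96%).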
We then predicted the hosts for the remaining available contigs that were not
predicted in Paez-Espino et al. (n=101,343, note not all of the contigs from Paez-
Espino et al. are accessible at IMG/VR). Viruses were parsed by the type of
sample from which they were obtained (human-associated, marine, and all other
environments/sample types) and predictions were made against collections of host
genomes corresponding to the sample type (human-related genomes, n=9,097;
marine genomes, n=4,034; or all 62,493 host genomes, respectively). This resulted
in 7,653, 12,014, and 8,013 viral contigs, respectively, with prediction scores above
0.95, or 27,680 viral contigs in total. In combination with viral contigs with overlapping
predictions by Paez-Espino et al., we were able to make confident host predictions for
27% of all the remaining viral contigs, representing 2.7-fold more host predictions
than previously made by Paez-Espino et al.
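The score filter and the totals above can be reproduced with a few lines. In this sketch the tuple layout of a prediction and the function name are hypothetical, and the assignment of the three counts to sample types assumes they are listed in the same order as the sample types in the text.

```python
def confident(predictions, threshold=0.95):
    """Keep (contig, host, score) predictions whose score is above the threshold."""
    return [p for p in predictions if p[2] > threshold]

# Counts reported in the text; order of sample types assumed to match the text.
confident_counts = {"human-associated": 7653, "marine": 12014, "other": 8013}
remaining_contigs = 101343

total_confident = sum(confident_counts.values())
percent_confident = 100 * total_confident / remaining_contigs

# Toy predictions (placeholder contig names and scores).
toy = [("contig_a", "Prevotella", 0.97), ("contig_b", "Neisseria", 0.80)]
kept = confident(toy)
```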
We analyzed more specifically the predicted hosts for contigs with length ≥ 10
kb and for which ≥ 90% of their genes belong to known viral protein families (a
criterion used in the original paper). There were 545 contigs from human-associated
samples that met the above criterion, and we restricted our host predictions to
9,097 human-associated bacterial genomes. In total, 173 human-associated viral
contigs were successfully predicted by our method with a score above 0.95 (Table
B.5). The predicted hosts of these 173 viral contigs belonged to 12 host genera.
In particular, we discovered 24 viral contigs predicted to infect 4 host genera that
have no known infecting viruses. To study the virus diversity within those hosts,
we clustered the 24 viral contigs based on their percentage of shared genes using
the UPGMA hierarchical clustering method (Fig. 3.7). Some viruses infecting
the same host genus were found in the same habitat. For example, all 3 viruses
predicted to infect Prevotella were found in human tongue dorsum; the 2 viruses
predicted to infect Neisseria were found in human supragingival plaque. On the
other hand, the 18 viruses predicted to infect Veillonella were found in human
tongue dorsum, throat and saliva, probably indicating a higher viral diversity in
this host genus. Meanwhile, the large cluster of 10 viruses of host genus Veillonella
were from different samples in multiple studies (as assessed by contig IDs; WUGC,
Baylor, and LANL representing different studies), indicating that those virus-host pairs were
common across individuals.
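Clustering by shared gene content with UPGMA can be sketched in pure Python as below. Two assumptions are made for illustration: the shared-gene percentage is taken relative to the smaller gene set, and distance is one minus that fraction. The exact definitions used for Fig. 3.7 may differ, and the contig names and gene labels here are placeholders.

```python
from itertools import combinations

def shared_gene_fraction(a, b):
    """Fraction of genes shared, relative to the smaller gene set (assumed definition)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def upgma(gene_sets):
    """Naive UPGMA (average linkage); returns merges as (cluster_a, cluster_b, distance)."""
    names = list(gene_sets)
    dist = {
        frozenset((x, y)): 1.0 - shared_gene_fraction(gene_sets[x], gene_sets[y])
        for x, y in combinations(names, 2)
    }

    def average(c1, c2):
        # Unweighted mean of all pairwise distances between the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((c1, c2, average(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Placeholder contigs and gene families.
gene_sets = {
    "contig_1": {"g1", "g2", "g3", "g4"},
    "contig_2": {"g1", "g2", "g3", "g5"},
    "contig_3": {"g6", "g7", "g8", "g9"},
}
merges = upgma(gene_sets)
```

The naive search over cluster pairs is quadratic per merge, which is acceptable for a few dozen contigs; a library implementation would be preferable for larger sets.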
[Figure 3.7 dendrogram; scale: percentage of genes shared (0 to 1); legend: host genus (Neisseria, Prevotella, Roseburia, Veillonella). Leaves from top to bottom, listed as contig | predicted host genus | sample habitat:]
SRS020856_Baylor_scaffold_5869 | Veillonella | Human tongue dorsum
SRS057205_LANL_scaffold_16276 | Veillonella | Human tongue dorsum
SRS017120_Baylor_scaffold_80625 | Prevotella | Human tongue dorsum
SRS054687_LANL_scaffold_13892 | Prevotella | Human tongue dorsum
SRS014573_WUGC_scaffold_57471 | Veillonella | Human tongue dorsum
SRS018591_WUGC_scaffold_31262 | Veillonella | Human tongue dorsum
SRS015941_WUGC_scaffold_21622 | Veillonella | Human tongue dorsum
SRS014573_WUGC_scaffold_57741 | Veillonella | Human tongue dorsum
SRS022602_Baylor_scaffold_118713 | Neisseria | Human supragingival plaque
SRS065099_LANL_scaffold_101779 | Neisseria | Human supragingival plaque
SRS054956_LANL_scaffold_45718 | Roseburia | Human stool
SRS050244_LANL_scaffold_73744 | Prevotella | Human tongue dorsum
SRS018145_Baylor_scaffold_48460 | Veillonella | Human tongue dorsum
SRS019607_WUGC_scaffold_13340 | Veillonella | Human tongue dorsum
SRS014689_WUGC_scaffold_23508 | Veillonella | Human throat
SRS014692_WUGC_scaffold_5365 | Veillonella | Human saliva
SRS014689_WUGC_scaffold_11775 | Veillonella | Human throat
SRS016086_WUGC_scaffold_26851 | Veillonella | Human tongue dorsum
SRS023352_LANL_scaffold_84250 | Veillonella | Human tongue dorsum
SRS018591_WUGC_scaffold_4777 | Veillonella | Human tongue dorsum
SRS045715_LANL_scaffold_59715 | Veillonella | Human tongue dorsum
SRS058336_LANL_scaffold_6266 | Veillonella | Human tongue dorsum
SRS018300_Baylor_scaffold_3053 | Veillonella | Human tongue dorsum
SRS051791_LANL_scaffold_5624 | Veillonella | Human tongue dorsum
Figure 3.7: Relatedness of newly discovered 24 viral contigs in human associated
metagenomes based on shared gene content.
Similarly, we applied our method to a set of 558 marine viral contigs that were
not predicted by Paez-Espino et al. using the same criteria as above. Prediction
was restricted to the set of 4,034 marine hosts defined previously by Ahlgren et al.
Our model predicted hosts for 160 viral contigs using a score threshold of 0.95. The
predicted hosts belonged to 4 host genera (Table B.4). In particular, the newly
identified virus-host pairs expanded the universe of known Cellulophaga viral diversity,
a nascent marine heterotrophic model system. Previously, Holmfeldt et al.
[33], by sequencing 31 viral isolates, demonstrated the existence of several viral
genera associated with this marine group. Here we found an additional 102 viral
contigs that putatively infect Cellulophaga. Using the same gene-based method for
hierarchical clustering as in [33], the newly discovered 102 viruses clustered into
multiple groups, including one having 31 contigs (group A) and one having 17 con-
tigs (group B), which are separate from the group containing the 31 known isolates
(Fig. 3.8). Overall, we identified at least 3 novel genera with each having more
than 10 viral contigs, representing a sizable increase from the previously known
diversity. Genera were defined, for consistency as in Holmfeldt et al., as pairs of
genomes sharing more than 40% of their gene content. Those new virus groups were
found in multiple locations such as Delaware Coast, Pacific Ocean and North Sea,
indicating their ubiquity and potential impacts on communities of Cellulophaga,
an important degrader of complex organic matter. In addition, our method predicted
49 viruses as cyanobacterial phages (cyanophages) infecting Prochlorococcus,
a group of globally abundant marine cyanobacteria [23]. We independently con-
firmed that 33 of these are actually cyanophages based on significant nucleotide
or protein similarity to cyanophage isolate genomes (≥70% nucleotide identity for
≥10% of the contig or ≥50% of proteins on the contig shared ≥40% identity to
cyanophage proteins). The remaining 16 contigs thus represent potentially novel
lineages that have no significant nucleotide similarity to known cyanophage iso-
lates. This showcases both the diversity of virus-host interactions and the power
of our method to capture groups with relatively few known representatives.
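The confirmation criterion for cyanophage contigs stated above can be written as a small predicate. This is a sketch under stated assumptions: the input layout is hypothetical, nucleotide alignments are assumed not to overlap so their lengths can be summed, and each protein is represented by its best percent identity to a cyanophage protein (0 when there is no hit).

```python
def confirmed_cyanophage(nt_hits, contig_length, protein_identities):
    """Apply the two alternative similarity criteria to one contig.

    nt_hits: list of (aligned_length, percent_identity) nucleotide alignments,
             assumed non-overlapping.
    protein_identities: best percent identity per protein on the contig.
    """
    covered = sum(length for length, identity in nt_hits if identity >= 70.0)
    nucleotide_ok = covered >= 0.10 * contig_length

    similar = sum(1 for identity in protein_identities if identity >= 40.0)
    protein_ok = bool(protein_identities) and similar >= 0.50 * len(protein_identities)

    return nucleotide_ok or protein_ok

# Placeholder examples.
by_nucleotide = confirmed_cyanophage([(1500, 85.0)], 10000, [])
by_protein = confirmed_cyanophage([], 10000, [45.0, 60.0, 10.0, 0.0])
neither = confirmed_cyanophage([(500, 90.0)], 10000, [45.0, 10.0, 0.0, 0.0])
```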
[Figure 3.8 dendrogram; scale: percentage of genes shared (0 to 1); legend (sample source): Delaware Coast, Isolate, Kolumbo Volcano mats, Monterey Bay, Nags Head, North Sea, Pacific Ocean, Saanich Inlet, Santorini caldera mats, Tahi Moana; annotated cluster: Group A.]
Figure 3.8: Shared gene relatedness of 102 newly discovered viral contigs and 31
previously isolated viruses that infect Cellulophaga.
3.10 Computational cost
For a set of 1,500 complete viral genomes, host prediction requires no more than
16 GB of memory. However, because of the implementation of the WIsH score,
up to 100 GB is required for the same number of query viral contigs. In
practice, we recommend analyzing the viral contigs in smaller batches if
memory is a major constraint. Using an 8-core E5-2640v3 CPU, the analysis
takes less than 1 hour for 1,500 complete genomes and less than 4 hours for the
same number of viral contigs.
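Splitting the query contigs into batches, as recommended above, can be done with a simple generator. The batch size and contig names below are illustrative assumptions; a suitable size depends on the available memory.

```python
def batches(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder contig identifiers.
contig_ids = [f"contig_{i}" for i in range(1500)]
groups = list(batches(contig_ids, 500))
```

Each batch can then be passed to the prediction step separately and the results concatenated afterwards.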
Chapter 4
Discussion
The interactions between virus and prokaryotic hosts play important roles
in human health and ecosystems. Millions of new viruses have been identified
using high-throughput metagenomic sequencing technologies, but little is known
about their biological functions and the prokaryotic hosts with which they inter-
act. We developed a network-based integrated framework for predicting the hosts
of prokaryotic viruses. The new method provides a sizable improvement on predic-
tion accuracy compared with previous methods by integrating multiple measures
for informing host prediction. Based on the evaluation of the methods using a
large benchmark data set containing 1,462 viruses and 62,493 hosts, the method
achieves 59% and 86% prediction accuracy at the genus and the phylum levels,
respectively, yielding 16% and 6% improvements at the genus and the phylum
levels compared to the highest accuracy achieved by previous single methods.
The novel two-layer network of virus-virus, host-host, and virus-host genomic
similarity lays the foundation for this method. The employment of a two-layer
network is inspired by underlying biological phenomena. First, it is observed that
genetically similar viruses tend to infect closely related hosts [24, 25]. So the host
of a new virus can be partly inferred based on the similarity to related viruses
with known hosts. Similarly, the host of new viruses could potentially be inferred
through similarity of hosts. Second, because viruses depend on the cellular machinery
of their host to replicate, viruses often share highly similar patterns in codon
usage or short nucleotide words with their hosts. The host of a new virus can be
predicted using nucleotide word similarity between the virus and candidate hosts
[58, 21, 2]. Thus, the two-layer network model is a natural formulation of the
biological relationships described above. Despite the fact that the viruses in our
current database only have one reported host for each virus such that host-host
network connections cannot be incorporated into the prediction model, the novel
two-layer network can be fully implemented in the future as multiple hosts of
viruses are revealed.
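The idea that a virus and its host share short nucleotide word patterns can be illustrated with k-mer frequency profiles. The sketch below uses plain cosine similarity as a stand-in; it is not the $s_2^*$ measure used in this work, and the sequences are artificial.

```python
from collections import Counter
from itertools import product
import math

def kmer_profile(seq, k=3):
    """Relative frequencies of all 4**k DNA words of length k, in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]

def cosine(u, v):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Artificial sequences: host_a shares word usage with the virus, host_b does not.
virus = kmer_profile("ATATATATATAT")
host_a = kmer_profile("TATATATATATA")
host_b = kmer_profile("GCGCGCGCGCGC")

similar = cosine(virus, host_a)
dissimilar = cosine(virus, host_b)
```

Ranking candidate hosts by such a similarity gives the alignment-free signal that the network-based model combines with the other features.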
Multiple types of features, including shared sequences between host CRISPR
spacers and viral genomes and virus-host BLAST matches, combined with the
network-based features, were tested in the integrated framework for host prediction.
The CRISPR and BLAST features rest on the observation that some viruses and
their hosts share portions of their genomes as a result of CRISPR defense systems,
horizontal gene transfer, or prophage integration. Although these
features have been investigated individually in previous studies [21, 2, 27, 60], this
is the first time that multiple types of features have been integrated into a unified
framework for virus-host prediction. Interestingly, adding the BLAST feature did
not significantly improve on the model that included CRISPR and k-mer frequency
similarity, possibly because the information carried by BLAST matches is already
captured by the CRISPR matching feature. In the future, more
sophisticated and sensitive approaches, beyond simple BLAST searches, could be
developed for identifying genes shared between hosts and their phages via horizontal
gene transfer. Our results show that the integrated method combining multiple
features achieves a higher prediction accuracy than the use of any individual type of
information.
Our model also markedly improved the host prediction accuracy on shorter
viral fragments at all taxonomic levels when compared to WIsH [27], a recently
developed probabilistic method for predicting hosts of viral contigs. Our method
was able to obtain 57%, 55% and 53% prediction accuracies at the genus level for
20 kb, 10 kb and 5 kb sequence lengths, respectively. The prediction accuracies
for 20 kb, 10 kb, and 5 kb contigs were all above 84% at the phylum level.
Setting a minimum score threshold for making predictions led to a notable improvement
in accuracy. We also investigated the host prediction accuracy for different
groups of viruses. Specifically, our observations indicate that viruses in the
Siphoviridae family have higher prediction accuracy than the other Caudovirales
families, consistent with the fact that siphoviruses tend to have a narrower range
of target hosts [75, 15]. Likewise, restricting the possible hosts from all available
prokaryotic genomes to a focused set of relevant microbes can help improve pre-
diction accuracy, as was the case of predicting hosts of human associated viruses
using the 9,097 human-related host genomes and predicting marine viruses using
4,034 marine host genomes.
Our model was trained on a selected set of known virus-host pairs, mostly
represented by well-studied virus-host systems (e.g. E. coli viruses). Therefore it
was important to assess the sensitivity of our approach to the sets of viruses used
for model training. The model was tested by excluding several groups of viruses,
either by virus family (Myoviridae, Podoviridae, Siphoviridae) or by the taxonomic
group of hosts they infect, and then assessing accuracies for predicting the
hosts of those groups of viruses (Table B.2). Prediction accuracies were largely
similar when using models trained with all available or the restricted sets of viruses
and hosts, strongly supporting that our integrated approach can be extended to
make predictions on novel groups of viruses. Indeed, in the applications above,
we make confident new predictions for viruses whose predicted host taxa
are not represented in the training dataset. We conjecture that the applicability of
this approach to novel viruses reflects that the features used and their underly-
ing molecular processes are common across viral groups. In particular, CRISPR
defense systems have been found across many prokaryotic phyla and more impor-
tantly, the mechanisms and thus the molecular signals underlying the CRISPR
defense systems are conserved.
We utilized our model to predict the host of a new strain of crAss-like phages,
ΦcrAss001. Until the recent isolation of ΦcrAss001, the host of crAss-like phages
was unknown, but surmised to be Bacteroidetes based on bioinformatic analyses
[20]. It was recently isolated and was found to infect Bacteroides intestinalis
among a set of 54 strains belonging to 22 bacterial species [62]. Our computational
prediction for the host of ΦcrAss001 against the 22 species for which genomes
were available is consistent with the culture-based results. When we predicted
its host against the 62,493 candidate genomes, the genus Prevotella within the
Bacteroidetes phylum was the top predicted host. Although this genus is different
from the experimentally determined host, the prediction of Prevotella is consistent
with the hypothesis of Guerin et al. [31] that genus VI crAss-like phages, to which
ΦcrAss001 belongs, infect Prevotella.
We also applied our method to predict hosts for viruses in two large-scale
metagenomic data sets, one focusing on marine viral genomes such as those discovered
in Tara Oceans, and the other including viral contigs in over three thousand
geographically diverse metagenomic samples including marine and HMP samples.
Our predictions had high consistency with previous predictions made using simple
methods such as CRISPR or tRNA matches or gene-based similarity to known
reference viruses. More importantly, our method greatly increased the number of
viruses for which predictions could be made, nearly three-fold more viruses than by
Paez-Espino et al. These predictions were made using a minimum score threshold
of 0.95, with a false discovery rate of <10% for nearly complete genomes and contigs
of length > 10 kb at the phylum level. The newly predicted virus-host pairs
revealed viruses for hosts without known infecting viruses, and also expanded the
diversity of viruses for hosts with known isolate viruses, showcasing the usefulness
of our method in expanding knowledge of hosts in both ways.
A major advantage of our network-based integrated framework is that it can
be easily extended to incorporate more meaningful features that can better inform
virus-host interactions in the future. Virus-host co-abundance profiles have been
shown to provide some evidence of virus-host interactions [66, 18], but Edwards
et al. [21] suggested that its performance on host prediction was relatively poor
compared to other measures such as CRISPR and sequence homology. Coenen et
al. [16] also showed that virus-host correlations are poor predictors of virus-host
interactions. Our preliminary analysis of incorporating such co-abundance data as
a feature likewise showed the model did not benefit from adding the co-abundance
feature (see Fig. A.1). In general, co-abundance can be a misleading feature
because virus-host interactions may not always yield positive or negative correla-
tions depending on the complexity of virus lifestyles (e.g. lytic vs. lysogenic) [74].
In fact, we noticed that the feature coefficient for co-abundance when incorporated
into the model was not statistically significant, indicating that co-abundance
is not a consistently useful predictor. Moreover, virus-host interactions are
dynamic with delays and fluctuate over time, while metagenomic sampling only
captures the community at a single time point. The interactions can also be nonlinear
because of the complicated many-to-many virus-host networks [16]. Likewise,
hosts and viruses that do not interact can exhibit spurious correlations because of the
bias inherent in compositional data, where the abundance vector is constrained
to a constant sum. Similarly, viruses may be incorrectly predicted to infect
certain hosts because those hosts coincidentally share similar niches and dynamics
with their true hosts. Significant co-abundance between a virus and a host nonetheless is consistent
with and can support in some cases the discovery of a true virus-host interaction,
but co-abundance evidence alone should be taken with caution. Although we do
not exclude the possibility that co-abundance could be useful under certain envi-
ronments or for certain types of viruses, it is not likely that a simple co-abundance
measure based on non-time series samples can well describe the virus-host dynam-
ics in general. More sophisticated model-based approaches that utilize virus and
host abundances for host prediction could in theory be incorporated in our model
in the future.
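The compositional effect mentioned above can be demonstrated with a small simulation. The sketch assumes a deliberately extreme community with only two members whose absolute abundances are drawn independently; converting to relative abundances (a constant sum of 1) then forces a perfect negative correlation even though the two do not interact. The distributions and sample size are arbitrary choices for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rng = random.Random(0)
# Independent absolute abundances of a virus and an unrelated host.
virus = [rng.uniform(1.0, 10.0) for _ in range(2000)]
host = [rng.uniform(1.0, 10.0) for _ in range(2000)]
absolute_r = pearson(virus, host)

# Relative abundances: each sample is rescaled to sum to 1.
virus_rel = [v / (v + h) for v, h in zip(virus, host)]
host_rel = [h / (v + h) for v, h in zip(virus, host)]
relative_r = pearson(virus_rel, host_rel)
```

With more community members the induced correlation is weaker than in this two-member case, but the constant-sum constraint still biases correlations between relative abundances.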
If other promising predictive virus-host features are discovered in the future,
these can easily be incorporated into our framework. As noted above, inclusion
of the BLAST feature did not significantly improve the prediction model. Sim-
ple nucleotide BLAST results, however, may not be best suited for detection of
genes shared between cross-infecting viruses and their hosts. The discovery of
auxiliary metabolic genes (AMGs) in viruses has emerged as a valuable means
to connect viruses to their hosts [59, 55, 1, 4]. Protein-based homology searches
or phylogenetic-based detection of AMGs may be more informative means for
host prediction, and further development and incorporation of an improved AMG-
matching feature in our model framework could further improve host prediction.
Sequence-based and alignment-based measures such as CRISPR and BLAST
scores generally have limited availability, but can provide solid evidence for virus-
host interactions when such signals are present. On the other hand, alignment-free
$s_2^*$ similarity can be computed for any virus-host pairs, but may not always
perform as well as CRISPR and BLAST. We compared the prediction accuracies
for $s_2^*$ score and BLAST score when the hosts belonging to the true host genus
of the viruses are removed from the candidates. The result showed that when
the specific hosts were removed, the prediction accuracy for BLAST at the family
level decreased markedly to 0.20, while the accuracy for $s_2^*$ was 0.32 (Fig. A.3).
Therefore, alignment-based methods depend heavily on the existence of the true
host in the database, and they can perform much worse than the alignment-free
based methods for predicting hosts of new viruses when the true host genus is
not in the host candidate set. These results highlight again how the integrated
framework combining both alignment and alignment-free based features helps to
complement the two types of methods and improve the overall prediction accuracy.
Although the new model makes sizable improvements over existing methods
for both complete viral genomes and viral contigs at different taxonomic levels,
the prediction accuracy at the genus level is still 59% for complete genomes and
55% for 10kb contigs. It is expected that with an increased data set of hosts
and virus-host interactions for training our models, the prediction accuracy of our
method will further increase. Our host data set will be gradually updated to
include more newly discovered virus-host pairs for training and testing. However,
we note that prediction accuracy at the phylum level is already very high (∼90%).
Since there are many prokaryotic phyla (>75%) for which viruses have yet
to be identified, our tool has the potential to greatly expand the characterization of novel
groups of viruses.
In summary, our novel network-based integrated approach demonstrates how
integrating multiple features informative of virus-host interactions significantly
improves host prediction over any single feature. Application of our method to
several datasets of metagenomically assembled contigs demonstrates the strong
predictive ability of the model: its predictions are largely congruent with previous
methods but, more importantly, it generates many more host predictions and
identifies more novel virus-host interactions than previous approaches. This approach will
be valuable for identifying the putative hosts of newly discovered viral genomes
particularly for the flood of new viral metagenomic data currently being gener-
ated. The flexible nature of our prediction framework also has the potential to be
updated as new computational theories and biological understanding in virus-host
interactions become available.
Bibliography
[1] Nathan A Ahlgren, Clara A Fuchsman, Gabrielle Rocap, and Jed A Fuhrman.
Discovery of several novel, widespread, and ecologically distinct marine thau-
marchaeota viruses that encode amoc nitrification genes. The ISME journal,
13(3):618–631, 2019.
[2] Nathan A Ahlgren, Jie Ren, Yang Young Lu, Jed A Fuhrman, and Fengzhu
Sun. Alignment-free $d_2^*$ oligonucleotide frequency dissimilarity measure
improves prediction of hosts from metagenomically-derived viral sequences.
Nucleic Acids Research, 45(1):39–53, 2017.
[3] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J
Lipman. Basic local alignment search tool. Journal of Molecular Biology,
215(3):403–410, 1990.
[4] Karthik Anantharaman, Melissa B Duhaime, John A Breier, Kathleen A
Wendt, Brandy M Toner, and Gregory J Dick. Sulfur oxidation genes in
diverse deep-sea viruses. Science, 344(6185):757–760, 2014.
[5] Christopher Mark Bellas, Alexandre Magno Anesio, and Gary Barker. Analy-
sis of virus genomes from glacial environments reveals novel virus groups with
unusual host interactions. Frontiers in Microbiology, 6:656, 2015.
[6] Charles Bland, Teresa L Ramsey, Fareedah Sabree, Micheal Lowe, Kyndall
Brown, Nikos C Kyrpides, and Philip Hugenholtz. Crispr recognition tool
(crt): a tool for automatic detection of clustered regularly interspaced palin-
dromic repeats. BMC Bioinformatics, 8(1):209, 2007.
[7] Mya Breitbart and Forest Rohwer. Here a virus, there a virus, everywhere
the same virus? Trends in Microbiology, 13(6):278–284, 2005.
[8] Mya Breitbart, Peter Salamon, Bjarne Andresen, Joseph M Mahaffy, Anca M
Segall, David Mead, Farooq Azam, and Forest Rohwer. Genomic analysis of
uncultured marine viral communities. Proceedings of the National Academy
of Sciences, 99(22):14250–14255, 2002.
[9] Jennifer R Brum, J Cesar Ignacio-Espinoza, Simon Roux, Guilhem Doulcier,
Silvia G Acinas, Adriana Alberti, Samuel Chaffron, Corinne Cruaud, Colom-
ban De Vargas, Josep M Gasol, et al. Patterns and ecological drivers of ocean
viral communities. Science, 348(6237):1261498, 2015.
[10] Benjamin Buchfink, Chao Xie, and Daniel H Huson. Fast and sensitive protein
alignment using diamond. Nature Methods, 12(1):59–60, 2015.
[11] David Burstein, Christine L Sun, Christopher T Brown, Itai Sharon, Karthik
Anantharaman, Alexander J Probst, Brian C Thomas, and Jillian F Ban-
field. Major bacterial lineages are essentially devoid of crispr-cas viral defence
systems. Nature Communications, 7:10613, 2016.
[12] Alan James Cann, Sarah Elizabeth Fandrich, and Shaun Heaphy. Analysis of
the virus population present in equine faeces indicates the presence of hun-
dreds of uncharacterized virus genomes. Virus Genes, 30(2):151–156, 2005.
[13] Alessandra Carbone. Codon bias is a major factor explaining phage evolution
in translationally biased hosts. Journal of Molecular Evolution, 66(3):210–223,
2008.
[14] Feixiong Cheng, Chuang Liu, Jing Jiang, Weiqiang Lu, Weihua Li, Guixia
Liu, Weixing Zhou, Jin Huang, and Yun Tang. Prediction of drug-target
interactions and drug repositioning via network-based inference. PLoS Com-
putational Biology, 8(5):e1002503, 2012.
[15] Sandra Chibani-Chennoufi, Anne Bruttin, Marie-Lise Dillmann, and Harald
Brüssow. Phage-host interaction: an ecological perspective. Journal of Bac-
teriology, 186(12):3677–3686, 2004.
[16] Ashley R Coenen and Joshua S Weitz. Limitations of correlation-based infer-
ence in complex virus-microbe communities. mSystems, 3(4):e00084–18, 2018.
[17] Human Microbiome Project Consortium et al. Structure, function and diver-
sity of the healthy human microbiome. Nature, 486(7402):207–214, 2012.
[18] Felipe H Coutinho, Cynthia B Silveira, Gustavo B Gregoracci, Cristiane C
Thompson, Robert A Edwards, Corina PD Brussaard, Bas E Dutilh, and
Fabiano L Thompson. Marine viruses discovered via metagenomics shed light
on viral strategies throughout the oceans. Nature Communications, 8:15955,
2017.
[19] Minghua Deng, Kui Zhang, Shipra Mehta, Ting Chen, and Fengzhu Sun.
Prediction of protein function using protein–protein interaction data. Journal
of Computational Biology, 10(6):947–960, 2003.
[20] Bas E Dutilh, Noriko Cassman, Katelyn McNair, Savannah E Sanchez, Geni-
valdo GZ Silva, Lance Boling, Jeremy J Barr, Daan R Speth, Victor Seguri-
tan, Ramy K Aziz, et al. A highly abundant bacteriophage discovered in the
unknown sequences of human faecal metagenomes. Nature Communications,
5:4498, 2014.
[21] Robert A Edwards, Katelyn McNair, Karoline Faust, Jeroen Raes, and Bas E
Dutilh. Computational approaches to predict bacteriophage–host relation-
ships. FEMS Microbiology Reviews, 40(2):258–272, 2016.
[22] Noah Fierer, Mya Breitbart, James Nulton, Peter Salamon, Catherine
Lozupone, Ryan Jones, Michael Robeson, Robert A Edwards, Ben Felts, Steve
Rayhawk, et al. Metagenomic and small-subunit rrna analyses reveal the
genetic diversity of bacteria, archaea, fungi, and viruses in soil. Applied and
Environmental Microbiology, 73(21):7059–7066, 2007.
[23] Pedro Flombaum, José L Gallegos, Rodolfo A Gordillo, José Rincón, Lina L
Zabala, Nianzhi Jiao, David M Karl, William KW Li, Michael W Lomas,
Daniele Veneziano, et al. Present and future global distributions of the marine
cyanobacteria prochlorococcus and synechococcus. Proceedings of the National
Academy of Sciences, 110(24):9824–9829, 2013.
[24] Cesar O Flores, Justin R Meyer, Sergi Valverde, Lauren Farr, and Joshua S
Weitz. Statistical structure of host–phage interactions. Proceedings of the
National Academy of Sciences, 108(28):E288–E297, 2011.
[25] Cesar O Flores, Sergi Valverde, and Joshua S Weitz. Multi-scale structure
and geographic drivers of cross-infection within marine bacteria and phages.
The ISME Journal, 7(3):520, 2013.
[26] Jan Freudenberg and P Propping. A similarity-based method for
genome-wide prediction of disease-relevant human genes. Bioinformatics,
18(suppl_2):S110–S115, 2002.
[27] Clovis Galiez, Matthias Siebert, François Enault, Jonathan Vincent, and
Johannes Söding. Wish: who is the host? predicting prokaryotic hosts from
metagenomic phage contigs. Bioinformatics, 33(19):3113–3114, 2017.
[28] Pedro Gómez and Angus Buckling. Bacteria-phage antagonistic coevolution
in soil. Science, 332(6025):106–109, 2011.
[29] Manolo Gouy and Christian Gautier. Codon usage in bacteria: correlation
with gene expressivity. Nucleic Acids Research, 10(22):7055–7074, 1982.
[30] Ibtissem Grissa, Gilles Vergnaud, and Christine Pourcel. The crisprdb
database and tools to display crisprs and to generate dictionaries of spacers
and repeats. BMC Bioinformatics, 8(1):172, 2007.
[31] Emma Guerin, Andrey Shkoporov, Stephen R Stockdale, Adam G Clooney,
Feargal J Ryan, Thomas DS Sutton, Lorraine A Draper, Enrique Gonzalez-
Tortuero, R Paul Ross, and Colin Hill. Biology and taxonomy of crass-like
bacteriophages, the most abundant virus in the human gut. Cell Host &
Microbe, 24(5):653–664, 2018.
[32] Geoffrey D Hannigan, Melissa B Duhaime, Danai Koutra, and Patrick D
Schloss. Biogeography and environmental conditions shape bacteriophage-
bacteria networks across the human microbiome. PLoS Computational Biol-
ogy, 14(4):e1006099, 2018.
[33] Karin Holmfeldt, Natalie Solonenko, Manesh Shah, Kristen Corrier, Lasse
Riemann, Nathan C VerBerkmoes, and Matthew B Sullivan. Twelve previ-
ously unknown phage genera are ubiquitous in global oceans. Proceedings of
the National Academy of Sciences, 110(31):12798–12803, 2013.
[34] Philippe Horvath and Rodolphe Barrangou. Crispr/cas, the immune system
of bacteria and archaea. Science, 327(5962):167–170, 2010.
[35] Bonnie L Hurwitz and Matthew B Sullivan. The pacific ocean virome (pov):
a marine viral metagenomic dataset and associated protein clusters for
quantitative viral ecology. PLoS One, 8(2):e57355, 2013.
[36] Ronald Jansen, Haiyuan Yu, Dov Greenbaum, Yuval Kluger, Nevan J Kro-
gan, Sambath Chung, Andrew Emili, Michael Snyder, Jack F Greenblatt, and
Mark Gerstein. A bayesian networks approach for predicting protein-protein
interactions from genomic data. Science, 302(5644):449–453, 2003.
[37] Rui Jiang, Mingxin Gan, and Peng He. Constructing a gene semantic similar-
ity network for the inference of disease genes. BMC Systems Biology, 5(2):S2,
2011.
[38] Daehwan Kim, Li Song, Florian P Breitwieser, and Steven L Salzberg. Cen-
trifuge: rapid and sensitive classification of metagenomic sequences. Genome
Research, 26(12):1721–1729, 2016.
[39] Jessica M Labonté, Brandon K Swan, Bonnie Poulos, Haiwei Luo, Sergey
Koren, Steven J Hallam, Matthew B Sullivan, Tanja Woyke, K Eric Wom-
mack, and Ramunas Stepanauskas. Single-cell genomics-based analysis of
virus–host interactions in marine surface bacterioplankton. The ISME Jour-
nal, 9(11):2386, 2015.
[40] Stan Z Li. Markov random field models in computer vision. In European
Conference on Computer Vision, pages 361–370. Springer, 1994.
[41] Gipsi Lima-Mendez, Jacques Van Helden, Ariane Toussaint, and Raphaël Leplae.
Reticulate representation of evolutionary and functional relationships
between phage genomes. Molecular Biology and Evolution, 25(4):762–777,
2008.
[42] Susan Mills, Fergus Shanahan, Catherine Stanton, Colin Hill, Aidan Coffey,
and R Paul Ross. Movers and shakers: influence of bacteriophages in shaping
the mammalian gut microbiota. Gut Microbes, 4(1):4–16, 2013.
[43] Samuel Minot, Rohini Sinha, Jun Chen, Hongzhe Li, Sue A Keilbaugh, Gary D
Wu, James D Lewis, and Frederic D Bushman. The human gut virome:
inter-individual variation and dynamic response to diet. Genome Research,
21(10):1616–1625, 2011.
[44] Mohammadali Khan Mirzaei and Corinne F Maurice. Menage a trois in the
human gut: interactions between host, bacteria and phages. Nature Reviews
Microbiology, 15(7):397–408, 2017.
[45] Carolina Megumi Mizuno, Francisco Rodriguez-Valera, Nikole E Kimes, and
Rohit Ghai. Expanding the marine virosphere using metagenomics. PLoS
Genetics, 9(12):e1003987, 2013.
[46] Yosuke Nishimura, Hiroyasu Watai, Takashi Honda, Tomoko Mihara, Kim-
iho Omae, Simon Roux, Romain Blanc-Mathieu, Keigo Yamamoto, Pascal
Hingamp, Yoshihiko Sako, et al. Environmental viral genomes shed new light
on virus-host interactions in the ocean. mSphere, 2(2):e00359–16, 2017.
[47] Jason M Norman, Scott A Handley, Megan T Baldridge, Lindsay Droit,
Catherine Y Liu, Brian C Keller, Amal Kambal, Cynthia L Monaco, Guoyan
Zhao, Phillip Fleshner, et al. Disease-specific alterations in the enteric virome
in inflammatory bowel disease. Cell, 160(3):447–460, 2015.
[48] David Paez-Espino, Emiley A Eloe-Fadrosh, Georgios A Pavlopoulos, Alex D
Thomas, Marcel Huntemann, Natalia Mikhailova, Edward Rubin, Natalia N
Ivanova, and Nikos C Kyrpides. Uncovering earth’s virome. Nature,
536(7617):425–430, 2016.
[49] Gesine Reinert, David Chew, Fengzhu Sun, and Michael S Waterman.
Alignment-free sequence comparison (i): statistics and power. Journal of
Computational Biology, 16(12):1615–1634, 2009.
[50] Jie Ren, Nathan A Ahlgren, Yang Young Lu, Jed A Fuhrman, and Fengzhu
Sun. Virfinder: a novel k-mer based tool for identifying viral sequences from
assembled metagenomic data. Microbiome, 5(1):69, 2017.
[51] Jie Ren, Kai Song, Minghua Deng, Gesine Reinert, Charles H Cannon, and
Fengzhu Sun. Inference of markovian properties of molecular sequences
from ngs data and applications to comparative genomics. Bioinformatics,
32(7):993–1000, 2015.
[52] Alejandro Reyes, Laura V Blanton, Song Cao, Guoyan Zhao, Mark Man-
ary, Indi Trehan, Michelle I Smith, David Wang, Herbert W Virgin, Forest
Rohwer, et al. Gut DNA viromes of Malawian twins discordant for
severe acute malnutrition. Proceedings of the National Academy of Sciences,
112(38):11941–11946, 2015.
[53] Forest Rohwer, David Prangishvili, and Debbie Lindell. Roles of viruses in
the environment. Environmental Microbiology, 11(11):2771–2774, 2009.
[54] Alexa Ross, Samantha Ward, and Paul Hyman. More is better: selecting for
broad host range bacteriophages. Frontiers in Microbiology, 7:1352, 2016.
[55] Simon Roux, Jennifer R Brum, Bas E Dutilh, Shinichi Sunagawa, Melissa B
Duhaime, Alexander Loy, Bonnie T Poulos, Natalie Solonenko, Elena Lara,
Julie Poulain, et al. Ecogenomics and potential biogeochemical impacts of
globally abundant ocean viruses. Nature, 537(7622):689–693, 2016.
[56] Simon Roux, Francois Enault, Bonnie L Hurwitz, and Matthew B Sullivan.
Virsorter: mining viral signal from microbial genomic data. PeerJ, 3:e985,
2015.
[57] Simon Roux, Michaël Faubladier, Antoine Mahul, Nils Paulhe, Aurélien
Bernard, Didier Debroas, and François Enault. Metavir: a web server dedi-
cated to virome analysis. Bioinformatics, 27(21):3074–3075, 2011.
[58] Simon Roux, Steven J Hallam, Tanja Woyke, and Matthew B Sullivan.
Viral dark matter and virus–host interactions resolved from publicly
available microbial genomes. eLife, 4:e08490, 2015.
[59] Simon Roux, Alyse K Hawley, Monica Torres Beltran, Melanie Scofield,
Patrick Schwientek, Ramunas Stepanauskas, Tanja Woyke, Steven J Hallam,
and Matthew B Sullivan. Ecology and evolution of viruses infecting
uncultivated sup05 bacteria as revealed by single-cell- and meta-genomics.
eLife, 3:e03125, 2014.
[60] Jason W Shapiro and Catherine Putonti. Gene networks provide a high-
resolution view of bacteriophage ecology. bioRxiv, page 148668, 2017.
[61] Itai Sharon, Natalia Battchikova, Eva-Mari Aro, Carmela Giglione, Thierry
Meinnel, Fabian Glaser, Ron Y Pinter, Mya Breitbart, Forest Rohwer, and
Oded Béjà. Comparative metagenomics of microbial traits within oceanic
viral communities. The ISME Journal, 5(7):1178, 2011.
[62] Andrey N Shkoporov, Ekaterina V Khokhlova, C Brian Fitzgerald, Stephen R
Stockdale, Lorraine A Draper, R Paul Ross, and Colin Hill. φcrass001 repre-
sents the most abundant bacteriophage family in the human gut and infects
bacteroides intestinalis. Nature Communications, 9(1):4781, 2018.
[63] K. Song, J. Ren, G. Reinert, M. Deng, M. S. Waterman, and F. Sun. New
developments of alignment-free sequence comparison: measures, statistics and
next-generation sequencing. Briefings in Bioinformatics, 15(3):343–353, 2014.
[64] K. Song, J. Ren, Z. Zhai, X. Liu, M. Deng, and F. Sun. Alignment-free
sequence comparison based on next-generation sequencing reads. Journal of
Computational Biology, 20(2):64–79, 2013.
[65] Sharath Srinivasiah, Jaysheel Bhavsar, Kanika Thapar, Mark Liles, Tom
Schoenfeld, and K Eric Wommack. Phages across the biosphere: con-
trasts of viruses in soil and aquatic environments. Research in Microbiology,
159(5):349–357, 2008.
[66] Adi Stern, Eran Mick, Itay Tirosh, Or Sagy, and Rotem Sorek. Crispr tar-
geting reveals a reservoir of common phages associated with the human gut
microbiome. Genome Research, 22(10):1985–1994, 2012.
[67] Matthew B Sullivan, John B Waterbury, and Sallie W Chisholm.
Cyanophages infecting the oceanic cyanobacterium prochlorococcus. Nature,
424(6952):1047–1051, 2003.
[68] Shinichi Sunagawa, Luis Pedro Coelho, Samuel Chaffron, Jens Roat Kultima,
Karine Labadie, Guillem Salazar, Bardya Djahanschiri, Georg Zeller, Daniel R
Mende, Adriana Alberti, et al. Structure and function of the global ocean
microbiome. Science, 348(6237):1261359, 2015.
[69] Duy Tin Truong, Eric A Franzosa, Timothy L Tickle, Matthias Scholz, George
Weingart, Edoardo Pasolli, Adrian Tett, Curtis Huttenhower, and Nicola
Segata. Metaphlan2 for enhanced metagenomic taxonomic profiling. Nature
Methods, 12(10):902–903, 2015.
[70] Julia Villarroel, Kortine Annina Kleinheinz, Vanessa Isabell Jurtz, Henrike
Zschach, Ole Lund, Morten Nielsen, and Mette Voldby Larsen. Hostphinder:
a phage host prediction tool. Viruses, 8(5):116, 2016.
[71] Alison S Waller, Takuji Yamada, David M Kristensen, Jens Roat Kultima,
Shinichi Sunagawa, Eugene V Koonin, and Peer Bork. Classification and
quantification of bacteriophage taxa in human gut metagenomes. The ISME
Journal, 8(7):1391–1402, 2014.
[72] Lin Wan, Gesine Reinert, Fengzhu Sun, and Michael S Waterman.
Alignment-free sequence comparison (ii): theoretical power of comparison
statistics. Journal of Computational Biology, 17(11):1467–1490, 2010.
[73] Jinfeng Wang, Yuan Gao, and Fangqing Zhao. Phage–bacteria interaction
network in human oral microbiome. Environmental Microbiology,
18(7):2143–2158, 2016.
[74] Joshua S Weitz, Stephen J Beckett, Jennifer R Brum, BB Cael, and Jonathan
Dushoff. Lysis, lysogeny and virus–microbe ratios. Nature, 549(7672):E1,
2017.
[75] Antje Wichels, Stefan S Biel, Hans R Gelderblom, Thorsten Brinkhoff, Ger-
ard Muyzer, and Christian Schütt. Bacteriophage diversity in the north sea.
Applied and Environmental Microbiology, 64(11):4128–4133, 1998.
[76] Guohong Albert Wu, Se-Ran Jun, Gregory E Sims, and Sung-Hou Kim.
Whole-proteome phylogeny of large dsdna virus families by an alignment-free
method. Proceedings of the National Academy of Sciences, 106(31):12826–
12831, 2009.
[77] Mengge Zhang, Lianping Yang, Jie Ren, Nathan A Ahlgren, Jed A Fuhrman,
and Fengzhu Sun. Prediction of virus-host infectious association by supervised
learning methods. BMC Bioinformatics, 18(3):60, 2017.
[78] Qian Zhang, Se-Ran Jun, Michael Leuze, David Ussery, and Intawat Nookaew.
Viral phylogenomics using an alignment-free method: A three-step approach
to determine optimal length of k-mer. Scientific Reports, 7:40712, 2017.
[79] Wangshu Zhang, Fengzhu Sun, and Rui Jiang. Integrating multiple protein-
protein interaction networks to prioritize disease genes: a bayesian regression
approach. BMC Bioinformatics, 12(1):S11, 2011.
Appendix A
Supplementary Figures
[Figure A.1 plot: "Performance on 1,095 viruses"; taxonomic levels: Genus,
Family, Order, Class, Phylum; axis range 0.0 to 0.8; legend: co-abundance,
s2star, Network, Network+co-abundance.]
Figure A.1: The prediction accuracies of virus-host interactions for different
methods: from left to right, the individual features co-abundance and $s_2^*$,
the integrated network-based model in Eq. 2.10, and the integrated
network-based model combined with the co-abundance feature. The results are
binned by taxonomic level. Error bars show the 95% confidence intervals of the
accuracies, based on 100 different (randomly selected) negative training sets.
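As an illustration of how error bars of this kind can be summarized, the following minimal sketch reports the mean accuracy over repeated runs together with an empirical 95% percentile interval. The percentile method and the names `percentile` and `accuracy_interval` are assumptions made for illustration; the caption does not specify how the intervals were computed.

```python
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an ascending list, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def accuracy_interval(accuracies, level=95.0):
    """Return (mean, lower, upper) for accuracies from repeated runs.

    Each entry is assumed to be the accuracy obtained with one randomly
    selected negative training set (hypothetical input layout).
    """
    if not accuracies:
        raise ValueError("need at least one accuracy value")
    ordered = sorted(accuracies)
    tail = (100.0 - level) / 2.0
    return (
        statistics.mean(ordered),
        percentile(ordered, tail),
        percentile(ordered, 100.0 - tail),
    )
```

With 100 replicate accuracies, `accuracy_interval(values)` returns the bar height and the two ends of the error bar for one method at one taxonomic level.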
[Figure A.2 plot: x-axis: False Positive Rate (0.0 to 1.0); y-axis: True
Positive Rate (0.0 to 1.0); legend: s2* ROC curve (area = 0.91), WIsH ROC curve
(area = 0.86).]
Figure A.2: ROC curves for predicting virus-host interactions using $s_2^*$ and
WIsH on 352 positive and negative virus-host pairs.
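For readers who want to reproduce an area-under-the-curve value like those in the legend, the sketch below computes the ROC area directly from scores through its rank interpretation: the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair, with ties counted as one half. The function name `roc_auc` is hypothetical, and the sketch assumes that a higher score indicates a more likely interaction; a dissimilarity measure would need to be negated first.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve from positive-pair and negative-pair scores.

    Uses the pairwise (Mann-Whitney) formulation, which equals the area
    under the empirical ROC curve. Assumes higher score = more likely host.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0   # positive pair ranked above negative pair
            elif p == n:
                wins += 0.5   # ties contribute half a win
    return wins / (len(scores_pos) * len(scores_neg))
```

The quadratic loop is adequate for a few hundred pairs; for much larger sets a sort-based rank computation would be the usual choice.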
[Figure A.3 plot: "Performance when leaving out hosts in the correct genus";
taxonomic levels: Family, Order, Class, Phylum; axis range 0.0 to 0.7; legend:
CRISPR, blast, d2star.]
Figure A.3: The prediction accuracies of CRISPR, BLAST and $s_2^*$,
respectively, when all hosts in the true genus are excluded. Predictions were
made after excluding all the true hosts at the genus level and were evaluated at
higher taxonomic levels. Average prediction accuracies for the set of 1,075
viruses are shown. The alignment-free measure $s_2^*$ was the least affected by
this situation, in which the true host(s) were missing from the candidates.
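The evaluation described in this caption can be sketched as follows: candidates from the true genus are removed, the best remaining candidate is taken as the prediction, and the prediction is then checked at each higher taxonomic level. This is a minimal illustration under assumptions, not the thesis implementation: the dictionary layout, the function names, and the placeholder taxon labels are hypothetical, and a higher score is assumed to mean a more likely host.

```python
HIGHER_LEVELS = ("family", "order", "class", "phylum")


def predict_without_true_genus(candidates, true_genus):
    """Return the best-scoring candidate whose genus differs from true_genus.

    candidates: list of dicts with keys "score", "genus" and HIGHER_LEVELS.
    Returns None when every candidate belongs to the true genus.
    """
    remaining = [c for c in candidates if c["genus"] != true_genus]
    if not remaining:
        return None
    return max(remaining, key=lambda c: c["score"])


def correct_levels(prediction, true_lineage):
    """Map each higher taxonomic level to True if the prediction matches."""
    return {lvl: prediction[lvl] == true_lineage[lvl] for lvl in HIGHER_LEVELS}


# Placeholder taxa (G1, F1, ...) stand in for real lineages.
true_host = {"genus": "G1", "family": "F1", "order": "O1",
             "class": "C1", "phylum": "P1"}
candidates = [
    dict(true_host, score=0.9),
    {"score": 0.7, "genus": "G2", "family": "F1", "order": "O1",
     "class": "C1", "phylum": "P1"},
    {"score": 0.8, "genus": "G3", "family": "F2", "order": "O1",
     "class": "C1", "phylum": "P1"},
]
best = predict_without_true_genus(candidates, true_host["genus"])
levels = correct_levels(best, true_host)
```

Averaging the per-level booleans over all viruses gives one accuracy per taxonomic level for each method.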
[Figure A.4 plot: three panels (Myoviridae, Podoviridae, Siphoviridae); x-axis:
Prediction score threshold (0.4 to 1.0); y-axis: Accuracy and recall (0.4 to
1.0); legend: recall, Genus, Family, Order, Class, Phylum.]
Figure A.4: Improvement in host prediction by thresholding on the prediction
score for viral contigs of different lengths across different families of Caudoviruses.
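The trade-off shown in this figure can be illustrated with a short sketch. It assumes, as a working definition that the caption does not spell out, that recall is the fraction of viral contigs whose best prediction score reaches the threshold and that accuracy is the fraction of those retained contigs whose predicted host is correct at a given taxonomic level. The function name and input layout are hypothetical.

```python
def accuracy_and_recall(predictions, threshold):
    """Return (accuracy, recall) after keeping predictions with score >= threshold.

    predictions: list of (score, is_correct) pairs, one per viral contig,
    where is_correct refers to one chosen taxonomic level.
    Accuracy is NaN when no prediction passes the threshold.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    kept = [ok for score, ok in predictions if score >= threshold]
    recall = len(kept) / len(predictions)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return accuracy, recall


# Toy data: raising the threshold trades recall for accuracy.
toy = [(0.99, True), (0.96, True), (0.70, False), (0.50, True)]
strict = accuracy_and_recall(toy, 0.95)
loose = accuracy_and_recall(toy, 0.40)
```

Sweeping `threshold` over a grid and plotting both values against it yields curves of the kind summarized in the figure.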
Appendix B
Supplementary Tables
Table B.1: The predictions for ΦcrAss001 in the host species tested in Shkoporov
et al. The table lists the highest scores for 20 of the 22 host species tested in
Shkoporov et al. (The other two host species, Peptoclostridium difficile and
Agathobacter rectalis, are not present in our host database.)
NCBI name | Host species | Genome name | Score
GCF_902364365.1 | Bacteroides intestinalis | Bacteroides intestinalis | 0.962
GCF_000273015.1 | Bacteroides cellulosilyticus | Bacteroides cellulosilyticus CL02T12C19 | 0.956
GCF_000273195.1 | Bacteroides ovatus | Bacteroides ovatus CL02T12C04 | 0.948
GCF_000699845.1 | Bacteroides vulgatus | Bacteroides vulgatus str. 3775 SR(B) 19 | 0.940
GCF_003463165.1 | Bacteroides caccae | Bacteroides caccae | 0.937
GCF_009020455.1 | Bacteroides uniformis | Bacteroides uniformis | 0.929
GCF_001579785.1 | Clostridium perfringens | Clostridium perfringens | 0.883
GCF_009493995.1 | Prevotella copri | Prevotella copri | 0.835
GCF_000235885.1 | Prevotella stercorea | Prevotella stercorea DSM 18206 | 0.700
GCF_002190295.1 | Escherichia coli | Escherichia coli | 0.600
GCF_000392875.1 | Enterococcus faecalis | Enterococcus faecalis ATCC 19433 | 0.546
GCF_001405455.1 | Blautia obeum | Blautia obeum | 0.497
GCF_003671805.1 | Parabacteroides distasonis | Parabacteroides distasonis | 0.484
GCF_000424345.1 | Dorea longicatena | Dorea longicatena AGR2136 | 0.254
GCF_900080095.1 | Enterococcus faecium | Enterococcus faecium | 0.211
GCF_002549935.1 | Faecalibacterium prausnitzii | Faecalibacterium prausnitzii | 0.153
GCF_001404835.1 | Anaerostipes hadrus | Anaerostipes hadrus | 0.127
GCF_901542395.1 | Enterococcus casseliflavus | Enterococcus casseliflavus | 0.080
GCF_000157955.1 | Subdoligranulum variabile | Subdoligranulum variabile DSM 15176 | 0.052
GCF_902501405.1 | Collinsella aerofaciens | Collinsella aerofaciens | 0.020
Table B.2: Comparison of prediction accuracies between models trained after
excluding a certain group of viruses from the training data and the model
trained on the full training set, each evaluated on the same group of viruses in
the validation set. Groups of viruses analyzed include Siphoviridae,
Myoviridae, Podoviridae, and the groups of viruses infecting E. coli,
Proteobacteria, Actinobacteria and Firmicutes, respectively.
Trained ... | Predict hosts for | Genus | Family | Order | Class | Phylum
without E. coli (n=808) | 191 E. coli viruses | 0.51 | 0.64 | 0.72 | 0.73 | 0.75
with all data (n=1462) | 191 E. coli viruses | 0.51 | 0.64 | 0.72 | 0.73 | 0.75
without Proteobacteria (n=696) | 809 Proteobacteria phages | 0.44 | 0.63 | 0.72 | 0.79 | 0.81
with all data (n=1462) | 809 Proteobacteria phages | 0.43 | 0.6 | 0.69 | 0.75 | 0.79
Trained with Siphoviridae and Podoviridae (n=694) | 358 Myoviridae viruses | 0.34 | 0.43 | 0.51 | 0.57 | 0.61
with all data (n=1462) | 358 Myoviridae viruses | 0.44 | 0.56 | 0.62 | 0.68 | 0.71
Trained with Siphoviridae and Myoviridae (n=731) | 265 Podoviridae viruses | 0.48 | 0.62 | 0.75 | 0.82 | 0.89
with all data (n=1462) | 265 Podoviridae viruses | 0.45 | 0.6 | 0.73 | 0.79 | 0.87
Trained with Podoviridae and Myoviridae (n=189) | 607 Siphoviridae viruses | 0.7 | 0.81 | 0.87 | 0.92 | 0.93
with all data (n=1462) | 607 Siphoviridae viruses | 0.71 | 0.82 | 0.88 | 0.93 | 0.93
without Actinobacteria (n=305) | 145 Actinobacteria viruses | 0.72 | 0.79 | 0.92 | 0.98 | 0.98
with all data (n=1462) | 145 Actinobacteria viruses | 0.68 | 0.75 | 0.9 | 0.98 | 0.98
without Firmicutes (n=753) | 399 Firmicutes viruses | 0.87 | 0.87 | 0.93 | 0.98 | 0.99
with all data (n=1462) | 399 Firmicutes viruses | 0.89 | 0.9 | 0.94 | 0.98 | 0.99
Table B.3: The prediction accuracies for different models when evaluated on the
1,462 validation viruses. Specifically, “Network” refers to the model in Eq. (2.2).
Model | Species | Genus | Family | Order | Class | Phylum
$s_2^*$ | 0.108 | 0.318 | 0.47 | 0.556 | 0.676 | 0.759
BLAST | 0.202 | 0.408 | 0.487 | 0.472 | 0.683 | 0.763
CRISPR | 0.202 | 0.43 | 0.531 | 0.619 | 0.727 | 0.805
BLAST+CRISPR | 0.302 | 0.47 | 0.54 | 0.595 | 0.708 | 0.8
Network | 0.374 | 0.493 | 0.61 | 0.706 | 0.79 | 0.82
Network+BLAST | 0.396 | 0.518 | 0.604 | 0.676 | 0.76 | 0.796
Network+CRISPR+BLAST | 0.411 | 0.558 | 0.642 | 0.71 | 0.787 | 0.824
Network+CRISPR | 0.432 | 0.59 | 0.701 | 0.78 | 0.833 | 0.862
Table B.4: The predicted hosts of the selected group of 160 marine viral contigs
from Paez-Espino et al. with scores ≥ 0.95 by the integrated model. None of the
hosts of these viral contigs had been predicted previously; they were predicted
exclusively by our method.
Viral contig Host genus Host species Predicted host
DelMOSpr2010_c10000300 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
DelMOSum2010_c10002168 Prochlorococcus Prochlorococcus marinus GCF_000015665.1
DelMOWin2010_c10001045 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
JGI20151J14362_10003856 Prochlorococcus Prochlorococcus marinus GCF_000015665.1
JGI24005J15628_10001962 Cellulophaga Cellulophaga baltica GCF_000468575.1
KVWGV2_10011187 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
JGI24006J15134_10000686 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24006J15134_10000909 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI25128J35275_1000241 Prochlorococcus Prochlorococcus marinus GCF_000015665.1
DelMOSpr2010_c10002122 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOSum2010_c10004178 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOWin2010_c10000077 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOSum2011_c10002572 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24004J15324_10000131 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
JGI24004J15324_10000235 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI26064J46334_1000053 Prochlorococcus Prochlorococcus marinus GCF_000011465.1
DelMOSum2011_c10002320 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI20157J14317_10003734 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24513J20088_1000158 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI25133J35611_10001926 Prochlorococcus Prochlorococcus marinus GCF_000015665.1
KVRMV2_100026167 Prochlorococcus Prochlorococcus marinus GCF_000015665.1
DelMOSpr2010_c10000530 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
DelMOSpr2010_c10002229 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOSpr2010_c10002622 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI20151J14362_10000498 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
JGI24003J15210_10000630 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24004J15324_10001133 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24005J15628_10001129 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24005J15628_10001235 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24005J15628_10001262 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24006J15134_10001875 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI25134J35505_10000777 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
KVRMV2_100038217 Cellulophaga Cellulophaga baltica GCF_000468575.1
KVWGV2_10007996 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24005J15628_10000740 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24005J15628_10001947 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24006J15134_10000334 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI26253J51717_1000764 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
DelMOWin2010_c10003149 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24004J15324_10000060 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
JGI24006J15134_10002234 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24005J15628_10000988 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI25134J35505_10000759 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
JGI26380J51729_10001197 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
DelMOSpr2010_c10001524 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOWin2010_c10000818 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
DelMOWin2010_c10001421 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
JGI20152J14361_10000819 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI20154J14316_10003155 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI20156J14371_10004249 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24005J15628_10001328 Prochlorococcus Prochlorococcus marinus GCF_000012645.1
JGI24006J15134_10001317 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOSpr2010_c10001694 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24004J15324_10001093 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24005J15628_10001392 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI26084J50262_1001496 Cellulophaga Cellulophaga baltica GCF_000468575.1
KVWGV2_10013378 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
TahiMoana_1001459 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOWin2010_c10001283 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24005J15628_10001461 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24005J15628_10001573 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24005J15628_10001711 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI26086J50260_1002326 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOSpr2010_c10001249 Cellulophaga Cellulophaga baltica GCF_000468595.1
DelMOSum2011_c10003133 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI25128J35275_1000513 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOSum2011_c10002082 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI20154J14316_10001441 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24005J15628_10001808 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24513J20088_1000131 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI26083J51738_10000097 Prochlorococcus Prochlorococcus marinus GCF_000015665.1
DelMOSpr2010_c10002787 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOSum2010_c10001543 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOSum2010_c10004062 Candidatus Pelagibacter Candidatus Pelagibacter ubique GCF_000012345.1
DelMOWin2010_c10002784 Candidatus Pelagibacter Candidatus Pelagibacter ubique GCF_000012345.1
GOS2229_1052262 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24003J15210_10000591 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
JGI24004J15324_10000171 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24004J15324_10000443 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
JGI24004J15324_10000904 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24006J15134_10001788 Candidatus Pelagibacter Candidatus Pelagibacter ubique GCF_000012345.1
JGI25132J35274_1000514 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
JGI25133J35611_10000742 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
JGI25133J35611_10001051 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOSpr2010_c10000243 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
DelMOSum2011_c10001421 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOWin2010_c10000964 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
DelMOWin2010_c10001617 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI20154J14316_10002223 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24003J15210_10001253 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24006J15134_10002219 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24513J20088_1000115 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI25127J35165_1000572 Prochlorococcus Prochlorococcus marinus GCF_000012465.1
JGI25134J35505_10000947 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
KVRMV2_100009144 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
DelMOSpr2010_c10001968 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOSpr2010_c10002288 Prochlorococcus Prochlorococcus marinus GCF_000015665.1
DelMOSpr2010_c10002558 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOSum2010_c10003208 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI20158J14315_10000669 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24003J15210_10000144 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
JGI24005J15628_10001291 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI25133J35611_10001888 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
KVRMV2_100037395 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
DelMOWin2010_c10000800 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
JGI24005J15628_10000250 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI26114J46594_1000386 Cellulophaga Cellulophaga baltica GCF_000468575.1
DelMOSum2011_c10002656 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24004J15324_10000247 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24005J15628_10001646 Cellulophaga Cellulophaga baltica GCF_000468575.1
JGI24006J15134_10001077 Prochlorococcus Prochlorococcus marinus GCF_000015685.1
Table B.5: The predicted hosts of the selected group of 173 human-associated
viral contigs from Paez-Espino et al. with scores ≥ 0.95 by the integrated
model. None of the hosts of these viral contigs had been predicted previously;
they were predicted exclusively by our method.
Viral contig Host genus Host species Predicted host
SRS014573_WUGC_scaffold_10762 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS015941_WUGC_scaffold_21622 Veillonella Veillonella dispar GCF_004166985.1
SRS015941_WUGC_scaffold_27153 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS048791_LANL_scaffold_14612 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS057205_LANL_scaffold_13380 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS015762_WUGC_scaffold_5122 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS019607_WUGC_scaffold_56474 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS014271_WUGC_scaffold_25970 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS017533_Baylor_scaffold_41658 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS018300_Baylor_scaffold_16016 Streptococcus Streptococcus pneumoniae GCF_000014365.2
SRS018739_WUGC_scaffold_56127 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS019027_WUGC_scaffold_20392 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS019974_Baylor_scaffold_52037 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS021496_Baylor_scaffold_69240 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS014692_WUGC_scaffold_15540 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS015369_Baylor_scaffold_19085 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS017533_Baylor_scaffold_53265 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS019128_WUGC_scaffold_55262 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS045715_LANL_scaffold_25870 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS048791_LANL_scaffold_21876 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS052697_LANL_scaffold_40832 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS055450_LANL_scaffold_61509 Streptococcus Streptococcus pneumoniae GCF_001329615.1
SRS062761_LANL_scaffold_52092 Streptococcus Streptococcus pneumoniae GCF_001154345.1
SRS013950_WUGC_scaffold_2682 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS014470_WUGC_scaffold_44951 Fusobacterium Fusobacterium periodonticum GCF_000158215.3
SRS015794_Baylor_scaffold_15925 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS019128_WUGC_scaffold_25110 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS024388_LANL_scaffold_16685 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS049268_LANL_scaffold_74546 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS050244_LANL_scaffold_73744 Prevotella Prevotella amnii GCF_000759315.1
SRS051031_LANL_scaffold_75870 NA Lachnospiraceae bacterium GCF_000242315.1
SRS054687_LANL_scaffold_13892 Prevotella Prevotella amnii GCF_000177355.1
SRS057539_LANL_scaffold_44738 Fusobacterium Fusobacterium periodonticum GCF_000297655.1
SRS058053_LANL_scaffold_85905 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS014470_WUGC_scaffold_48890 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS015215_WUGC_scaffold_58669 Streptococcus Streptococcus pneumoniae GCF_001328955.1
SRS019045_WUGC_scaffold_18982 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS019219_WUGC_scaffold_74087 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS020856_Baylor_scaffold_30603 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS022621_Baylor_scaffold_5055 Fusobacterium Fusobacterium periodonticum GCF_000163935.1
SRS023617_Baylor_scaffold_74968 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS045715_LANL_scaffold_111263 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS015038_WUGC_scaffold_25448 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS016086_WUGC_scaffold_26851 Veillonella Veillonella parvula GCF_002847925.1
SRS018791_WUGC_scaffold_34979 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS064774_LANL_scaffold_4415 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS018145_Baylor_scaffold_48460 Veillonella Veillonella dispar GCF_000160015.1
SRS023617_Baylor_scaffold_75044 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS042643_WUGC_scaffold_47164 Campylobacter Campylobacter coli GCF_001491235.1
SRS044373_WUGC_scaffold_11731 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS045713_WUGC_scaffold_312 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS055378_LANL_scaffold_95424 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS017191_Baylor_scaffold_18706 Escherichia Escherichia coli GCF_000800845.1
SRS021954_Baylor_scaffold_50357 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS045715_LANL_scaffold_32028 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS045715_LANL_scaffold_8331 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS047113_LANL_scaffold_82624 Fusobacterium Fusobacterium nucleatum GCF_002591585.1
SRS015038_WUGC_scaffold_35208 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS019026_WUGC_scaffold_26253 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS045715_LANL_scaffold_59715 Veillonella Veillonella dispar GCF_000160015.1
SRS057205_LANL_scaffold_24211 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS013818_Baylor_scaffold_33895 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS021954_Baylor_scaffold_20032 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS022621_Baylor_scaffold_48975 Streptococcus Streptococcus pneumoniae GCF_001328955.1
SRS024138_Baylor_scaffold_42548 Fusobacterium Fusobacterium ulcerans GCF_900683735.1
SRS049389_WUGC_scaffold_3272 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS063193_LANL_scaffold_48013 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS014689_WUGC_scaffold_23508 Veillonella Veillonella dispar GCF_000160015.1
SRS015797_WUGC_scaffold_34018 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS015941_WUGC_scaffold_12463 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS019030_Baylor_scaffold_4207 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS021954_Baylor_scaffold_56821 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS023958_Baylor_scaffold_44922 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS053398_LANL_scaffold_27298 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS053603_LANL_scaffold_26069 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS057791_LANL_scaffold_98 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS064423_LANL_scaffold_37946 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS014470_WUGC_scaffold_41846 Fusobacterium Fusobacterium periodonticum GCF_000158215.3
SRS019022_WUGC_scaffold_24071 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS024580_LANL_scaffold_41942 Fusobacterium Fusobacterium nucleatum GCF_000178895.1
SRS044373_WUGC_scaffold_19666 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS049147_LANL_scaffold_64187 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS049900_LANL_scaffold_9614 Parabacteroides Parabacteroides distasonis GCF_009025675.1
SRS064774_LANL_scaffold_3234 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS064774_LANL_scaffold_67187 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS014689_WUGC_scaffold_11775 Veillonella Veillonella parvula GCF_002847925.1
SRS015745_WUGC_scaffold_9922 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS018791_WUGC_scaffold_40500 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS022143_LANL_scaffold_92267 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS051791_LANL_scaffold_5624 Veillonella Veillonella dispar GCF_000160015.1
SRS057205_LANL_scaffold_16276 Veillonella Veillonella parvula GCF_902374055.1
SRS018145_Baylor_scaffold_50013 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS021954_Baylor_scaffold_52093 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS023352_LANL_scaffold_13194 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS017209_Baylor_scaffold_70438 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS018591_WUGC_scaffold_31262 Veillonella Veillonella dispar GCF_000160015.1
SRS022621_Baylor_scaffold_12359 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS049389_WUGC_scaffold_15582 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS052227_LANL_scaffold_22306 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS065099_LANL_scaffold_101779 Neisseria Neisseria elongata GCF_003044605.1
SRS014573_WUGC_scaffold_57741 Veillonella Veillonella parvula GCF_000448705.1
SRS019607_WUGC_scaffold_41229 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS020856_Baylor_scaffold_5869 Veillonella Veillonella parvula GCF_003584215.1
SRS022602_Baylor_scaffold_118713 Neisseria Neisseria subflava GCF_003044355.1
SRS049147_LANL_scaffold_47393 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS058336_LANL_scaffold_6266 Veillonella Veillonella dispar GCF_000160015.1
SRS014684_WUGC_scaffold_22896 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS014692_WUGC_scaffold_5365 Veillonella Veillonella dispar GCF_000160015.1
SRS015057_WUGC_scaffold_76413 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS018791_WUGC_scaffold_24769 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS045127_LANL_scaffold_39500 Streptococcus Streptococcus pneumoniae GCF_000014365.2
SRS049389_WUGC_scaffold_43846 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS062878_LANL_scaffold_84527 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS019607_WUGC_scaffold_13340 Veillonella Veillonella sp. oral taxon 158 GCF_000183505.1
SRS050244_LANL_scaffold_57624 Fusobacterium Fusobacterium varium GCF_000159915.2
SRS063193_LANL_scaffold_26038 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS011140_Baylor_scaffold_54981 Fusobacterium Fusobacterium nucleatum GCF_002591585.1
SRS011343_Baylor_scaffold_55508 Fusobacterium Fusobacterium periodonticum GCF_000297655.1
SRS014271_WUGC_scaffold_28305 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS020334_Baylor_scaffold_44363 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS045645_Baylor_scaffold_13177 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS019027_WUGC_scaffold_21036 Fusobacterium Fusobacterium nucleatum GCF_000178895.1
SRS023352_LANL_scaffold_54382 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS023617_Baylor_scaffold_75074 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS024561_LANL_scaffold_42852 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS048791_LANL_scaffold_65898 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS050752_LANL_scaffold_33630 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS051791_LANL_scaffold_55695 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS013800_Baylor_scaffold_15724 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS013800_Baylor_scaffold_15939 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS019125_WUGC_scaffold_11585 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS049147_LANL_scaffold_57666 Clostridium Clostridium butyricum GCF_003459015.1
SRS057692_LANL_scaffold_59029 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS057791_LANL_scaffold_31442 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS016575_Baylor_scaffold_66457 Fusobacterium Fusobacterium nucleatum GCF_000218655.1
SRS018978_WUGC_scaffold_13759 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS019022_WUGC_scaffold_40278 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS019391_WUGC_scaffold_2804 Streptococcus Streptococcus pneumoniae GCF_001329615.1
SRS019587_WUGC_scaffold_9641 Streptococcus Streptococcus pneumoniae GCF_001329615.1
SRS019974_Baylor_scaffold_30775 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS024318_LANL_scaffold_80249 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS050244_LANL_scaffold_28980 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS013164_Baylor_scaffold_26422 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS017076_Baylor_scaffold_29149 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS018300_Baylor_scaffold_53659 Fusobacterium Fusobacterium nucleatum GCF_000178895.1
SRS024441_LANL_scaffold_16796 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS063288_LANL_scaffold_62879 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS065278_LANL_scaffold_54349 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS062878_LANL_scaffold_39249 Streptococcus Streptococcus pneumoniae GCF_000014365.2
SRS019128_WUGC_scaffold_56023 Fusobacterium Fusobacterium varium GCF_000159915.2
SRS050244_LANL_scaffold_103368 Streptococcus Streptococcus pyogenes GCF_000007425.1
SRS014573_WUGC_scaffold_57471 Veillonella Veillonella parvula GCF_000215025.1
SRS015038_WUGC_scaffold_4323 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS015797_WUGC_scaffold_13994 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS019045_WUGC_scaffold_41982 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS023987_Baylor_scaffold_2892 Streptococcus Streptococcus pneumoniae GCF_001329615.1
SRS054956_LANL_scaffold_45718 Roseburia Roseburia intestinalis GCF_003475515.1
SRS057791_LANL_scaffold_82339 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS018300_Baylor_scaffold_3053 Veillonella Veillonella sp. oral taxon 158 GCF_000183505.1
SRS023352_LANL_scaffold_29715 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS044373_WUGC_scaffold_18975 Fusobacterium Fusobacterium periodonticum GCF_000297655.1
SRS017120_Baylor_scaffold_80625 Prevotella Prevotella amnii GCF_001553225.1
SRS018739_WUGC_scaffold_28123 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS023352_LANL_scaffold_84250 Veillonella Veillonella dispar GCF_000160015.1
SRS057205_LANL_scaffold_57058 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS064423_LANL_scaffold_58411 Staphylococcus Staphylococcus epidermidis GCF_000390405.1
SRS011271_WUGC_scaffold_40288 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS015057_WUGC_scaffold_74789 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS017821_Baylor_scaffold_25008 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS018591_WUGC_scaffold_4777 Veillonella Veillonella dispar GCF_004166985.1
SRS062540_LANL_scaffold_47785 Streptococcus Streptococcus pyogenes GCF_000007425.1
SRS063288_LANL_scaffold_2091 Staphylococcus Staphylococcus aureus GCF_000709475.1
SRS075404_LANL_scaffold_40900 Staphylococcus Staphylococcus aureus GCF_000709475.1
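
Each row of the table above follows the same whitespace-separated layout: viral contig ID, host genus, host species, and the accession of the predicted host genome. As a minimal sketch, assuming that layout holds for every row, the rows could be parsed and the host genera tallied as shown below. The names `HostPrediction`, `parse_row`, and `tally_genera` are hypothetical and chosen only for illustration; they are not part of the original analysis, and the three sample rows are copied from the table.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class HostPrediction(NamedTuple):
    contig: str     # viral contig ID
    genus: str      # predicted host genus
    species: str    # predicted host species (may contain spaces)
    accession: str  # accession of the predicted host genome


def parse_row(line: str) -> HostPrediction:
    """Split one table row into its four fields.

    Assumption: the first token is the contig, the second the genus,
    the last the accession, and everything in between the species name.
    """
    tokens = line.split()
    if len(tokens) < 4:
        raise ValueError(f"expected at least 4 fields, got: {line!r}")
    return HostPrediction(
        contig=tokens[0],
        genus=tokens[1],
        species=" ".join(tokens[2:-1]),
        accession=tokens[-1],
    )


def tally_genera(lines: Iterable[str]) -> Counter:
    """Count how many contigs are assigned to each host genus."""
    return Counter(parse_row(line).genus for line in lines if line.strip())


# Small illustrative sample: three rows copied from the table.
SAMPLE_ROWS = [
    "SRS014573_WUGC_scaffold_10762 Staphylococcus Staphylococcus aureus GCF_000709475.1",
    "SRS015941_WUGC_scaffold_21622 Veillonella Veillonella dispar GCF_004166985.1",
    "SRS019607_WUGC_scaffold_13340 Veillonella Veillonella sp. oral taxon 158 GCF_000183505.1",
]

if __name__ == "__main__":
    for genus, count in tally_genera(SAMPLE_ROWS).most_common():
        print(genus, count)
```

Taking the species as everything between the genus and the final accession keeps multi-word names such as "Veillonella sp. oral taxon 158" intact.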
Abstract
Metagenomic sequencing has greatly enhanced the discovery of viral genomic sequences
Asset Metadata
Creator: Wang, Weili (author)
Core Title: Predicting virus-host interactions using genomic data and applications in metagenomics
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Computational Biology and Bioinformatics
Publication Date: 04/15/2020
Defense Date: 03/20/2020
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: machine learning, metagenomics, OAI-PMH Harvest, virome
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Sun, Fengzhu (committee chair), Fan, Yingying (committee member), Waterman, Michael Spencer (committee member)
Creator Email: weiliw@usc.edu, wlwang1116@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-281772
Unique identifier: UC11673398
Identifier: etd-WangWeili-8262.pdf (filename), usctheses-c89-281772 (legacy record id)
Legacy Identifier: etd-WangWeili-8262.pdf
Dmrecord: 281772
Document Type: Dissertation
Rights: Wang, Weili
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA