Alignment-free sequence comparison methods and applications to comparative genomics
[Figure S1. Two panels (A, B): AUROC and AUPRC (y-axes, 0.5 to 1.0) versus fraction of viral contigs (10%, 50%, 90% virus). Series: contig sizes 500 bp, 1,000 bp, 3,000 bp, 5,000 bp, 10,000 bp.]
[Figure S2. Two panels (A, B): true positive rate (recall, 0 to 1) versus contig size (500, 1,000, 3,000, 5,000, 10,000 bp). Series: VirSorter (cat. I & II); VirFinder at VirSorter FPR; VirFinder at 0.001 FPR; VirFinder at 0.005 FPR; VirFinder at 0.01 FPR.]
[Figure S3. AUROC (0.7 to 1.0) versus contig length (500, 1,000, 3,000, 5,000, 10,000 bp). Series: mutation rates 0.01, 0.001, 0.0001, and 0 (no mutation).]
[Figure S4. Two panels (A, B).
ROC curves (true positive rate versus false positive rate), including and excluding chimeras:
  500-1,000 bp: AUROC include chimeras 0.91, exclude chimeras 0.90
  1,000-3,000 bp: AUROC include chimeras 0.94, exclude chimeras 0.94
  >3,000 bp: AUROC include chimeras 0.98, exclude chimeras 0.96
AUPRC (0.5 to 1) versus number of reads (10M, 20M) and viral fraction (10%, 50%, 90% virus). Series: 500-1,000 bp; 1,000-3,000 bp; >3,000 bp; all sequences >500 bp; all sequences >1,000 bp.]
[Figure S5. True positive rate (recall, 0.0 to 1.0) by virus prediction method used (VirSorter categories I, I&II, I&II&III, compared with VirFinder). Panel A: 10% viral contigs. Panel B: 90% viral contigs. Asterisks annotate individual bars.]
[Figure S6. True positive rate (recall, 0.0 to 0.8) for VS and VF at Cat. I, Cat. I&II, and Cat. I-III. Panels are arranged by viral fraction (10%, 50%, 90% virus) and by sequence set (all sequences >500 bp; all sequences >1,000 bp). Asterisks annotate individual bars.]
[Figure S7. Histogram of contig sizes. x-axis: log10(contig size in bp), 3.0 to 5.5; y-axis: frequency, 0 to 80,000.]
[Figure S8. Histograms (y-axis: frequency) of cumulative frequencies for all the included k-mers (x-axis, x 1e-3), for virus and host. Columns: words with negative coefficients; words with positive coefficients. Rows: top 100, top 500, and top 1,000 most highly scored words, and all words (n=6082, n=6269).]
[Figure S9. Comparison of two contigs with crAssphage. Contigs: k99_1820233_flag_0_multi_1_0066_len_10533 and k99_1695388_flag_0_multi_1_0095_len_12742. Axes: position along contig (kb); position along genome (kb). Color scale: amino acid % identity (20 to 70). Numeric labels along the crAssphage genome run from 023 to 071.]
[Figure S10. AUROC (0.6 to 1.0) versus contig size (500, 1,000, 3,000, 5,000, 10,000 bp). Series: 14,772 prokaryotic genomes from Roux et al. 2015 with proviruses removed; 14,772 prokaryotic genomes from Roux et al. 2015; 31,986 prokaryotic genomes used in this study.]
[Figure S11. AUROC (0.5 to 1.0) versus contig size (500, 1,000, 3,000, 5,000, 10,000 bp). Series: 5% viral contigs added to host training database; control (no viral contigs added into host training database).]
Asset Metadata
Core Title: FigureS1
Tag: oai:digitallibrary.usc.edu:usctheses, OAI-PMH Harvest
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC11256150
Unique identifier: UC11256150
Legacy Identifier: etd-RenJie-5618-sup_chapter4_Figures
Tags: alignment-free; comparative genomics; machine learning; Markov chain; metagenomics; next generation sequencing; virus-host interaction
Linked assets
Alignment-free sequence comparison methods and applications to comparative genomics