Alignment-free sequence comparison methods and applications to comparative genomics
FigureS1
(USC Thesis Other)
A
B
Figure S1
[Plots: AUROC and AUPRC (0.5 to 1.0) versus fraction of viral contigs (10%, 50%, 90% virus), one series per contig size (500 bp, 1,000 bp, 3,000 bp, 5,000 bp, 10,000 bp).]
[Plots (two panels): true positive rate (recall), 0 to 1, versus contig size (500, 1,000, 3,000, 5,000, 10,000 bp) for VirSorter (cat. I & II), VirFinder at VirSorter FPR, and VirFinder at 0.001, 0.005, and 0.01 FPR.]
A
B
Figure S2
[Plot: AUROC (0.7 to 1.0) versus contig length (500, 1000, 3000, 5000, 10000 bp); series labeled 0.01, 0.001, 0.0001, and 0 (no mutation). Numeric annotations in the plot: 0.0000, 0.0004, 0.002, 0.008, 0.007, 0.000, 0.000, 0.007, 0.009, 0.011.]
Figure S3
[Plot: ROC curves, true positive rate (recall) versus false positive rate (0.0 to 1.0), with series "Include chimeras" and "Exclude chimeras".]
500−1,000 bp: AUROC include chimeras 0.91; AUROC exclude chimeras 0.90
1,000−3,000 bp: AUROC include chimeras 0.94; AUROC exclude chimeras 0.94
>3,000 bp: AUROC include chimeras 0.98; AUROC exclude chimeras 0.96
[Plot: AUPRC (0.5 to 1) versus number of reads (10M, 20M) and viral fraction (10%, 50%, 90% virus); series: 500-1,000 bp, 1,000-3,000 bp, > 3,000 bp, all sequences > 500 bp, all sequences > 1,000 bp.]
A
B
Figure S4
[Plots (two panels): true positive rate (recall), 0.0 to 1.0, by virus prediction method used (VirSorter and VirFinder) for VirSorter categories I, I&II, and I&II&III; asterisk markers appear in the plots.]
Panel A: 10% viral contigs
Panel B: 90% viral contigs
Figure S5
[Plots (six panels): true positive rate (recall), 0.0 to 0.8, for VirSorter (VS) and VirFinder (VF) at Cat. I, Cat. I&II, and Cat. I-III; panels labeled 10% Virus, 50% Virus, and 90% Virus, grouped under "All sequences > 500 bp" and "All sequences > 1,000 bp"; asterisk markers appear in the plots.]
Figure S6
[Histogram: frequency (0 to 80000) versus log10(contig size in bp) (3.0 to 5.5).]
Figure S7
[Histograms: frequency (0.00 to 0.30) for Virus and Host; column titles: "Words with negative coefficients" and "Words with positive coefficients"; row labels: "Top 100 most highly scored words" and "All words".]
Figure S8
[Histogram residue: x-axis "Cumulative frequencies for all the included k-mers" (x 1e-3); row labels: "Top 500 most highly scored words" and "Top 1000 most highly scored words"; n=6082, n=6269.]
[Plot: amino acid % identity (20 to 70) versus position along contig (kb) for contigs k99_1820233_flag_0_multi_1_0066_len_10533 and k99_1695388_flag_0_multi_1_0095_len_12742, and versus position along genome (kb) for crAssphage; numeric labels in the plot: 026, 035, 033, 031, 029, 027, 025, 024, 023, 037, 057, 059, 060, 062, 069, 070, 71.]
Figure S9
Figure S10
[Plot: AUROC (0.6 to 1.0) versus contig size (500, 1,000, 3,000, 5,000, 10,000 bp); series: 14,772 prokaryotic genomes from Roux et al. 2015 with proviruses removed; 14,772 prokaryotic genomes from Roux et al. 2015; 31,986 prokaryotic genomes used in this study.]
Figure S11
[Plot: AUROC (0.5 to 1.0) versus contig size (500, 1,000, 3,000, 5,000, 10,000 bp); series: 5% viral contigs added to host training database; control (no viral contigs added into host training database).]
Asset Metadata
Core Title: FigureS1
Tag: OAI-PMH Harvest
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC11256150
Unique identifier: UC11256150
Legacy Identifier: etd-RenJie-5618-sup_chapter4_Figures
Tags: alignment-free, comparative genomics, machine learning, Markov chain, metagenomics, next generation sequencing, virus-host interaction