Applications and improvements of background adjusted alignment-free dissimilarity measures
(USC Thesis Other)
Applications and Improvements of Background
Adjusted Alignment-free Dissimilarity Measures
by
Kujin Tang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Quantitative and Computational Biology)
May 2020
Acknowledgements
I would like to express my deepest appreciation to my advisor, Dr. Fengzhu Sun, for
his guidance during my Ph.D. studies. The success of my research would not have been
possible without his valuable advice and unwavering support.
I also had the great pleasure of working with collaborators both within and outside
of USC on jointly published findings. They are Dr. Michael S. Waterman, Dr. Richard
Cronn, Dr. David Erickson, Dr. Brook Milligan, Dr. Meaghan Parker-Forney, Dr. John
L. Spouge, Dr. Jie Ren, Dr. Yang Lu and Xin Bai.
I am also thankful to my oral committee and dissertation committee members, Dr.
Michael S. Waterman, Dr. Jed A. Fuhrman, Dr. Liang Chen, Dr. Mark Chaisson and Dr.
Jiang F. Zhong.
I want to thank my teammates at SunLab: Dr. Wangshu Zhang, Dr. Mengge Zhang,
Dr. Han Li, Weili Wang, Zifan Zhu, Siliangyu Cheng, Yilin Gao, Tianqi Tang, Wenxuan
Zuo, Yuxuan Du and Jiawei Huang.
I want to thank other friends at USC: Dr. Junsong Zhao, Dr. Long Pei, Dr. Jianghan
Qu, Dr. Meng Zhou, Dr. Chao Deng, Dr. Wenzheng Li, Dr. Beibei Xin, Dr. Nan Hua,
Jinsen Li, Kaida Ning, Saket Choudhary, Rishvanth Prabakar, Katrina Sherbina, Mary
Same, Bo Sun, Jingwen Ren, Haiyang Zhang, Shiyue Huang, Yiwei He, Zhongying Wang,
Xin Li, Fang Zhang, Beibei Wang and many others.
Contents
Acknowledgements
List of Figures
List of Tables
1 Introduction
1.1 Alignment-free distance/dissimilarity measures
1.2 Applications of background adjusted alignment-free dissimilarity measures
1.2.1 Applications on horizontal gene transfer (HGT) detection
1.2.2 Applications on continental origin prediction of white oak
1.3 Improvements of background adjusted alignment-free dissimilarity measures
1.3.1 Bias adjustment for background adjusted alignment-free dissimilarity measures
2 Background Adjusted Alignment-free Dissimilarity Measures Improve the Detection of Horizontal Gene Transfer
2.1 Introduction
2.2 Materials and Methods
2.2.1 Artificial genome simulation
2.2.2 Distance calculation
2.2.3 Predicting HGT regions
2.2.4 Evaluation Criteria
2.2.5 Investigating the effect of evolution relationship between the host and donor genomes and window size on the performance of different methods
2.2.6 Investigation of HGT within 118 genomes and E. faecalis V583
2.3 Results
2.3.1 Background adjusted dissimilarity measures outperform non-background adjusted methods for HGT detection based on the E. coli artificial genome
2.3.2 The performance of the alignment-free methods increases with the genetic distance between the donor genome and the host genome
2.3.3 The performance of the alignment-free methods increases with the window size within the range of 3kbp to 8kbp
2.3.4 Robustness of the relative performance of the different methods with respect to different host genomes
2.3.5 Applications to real HGT data support the good performance of background adjusted dissimilarity measures
2.4 Discussion
3 Alignment-free Genome Comparison Enables Accurate Geographic Sourcing of White Oak DNA
3.1 Introduction
3.2 Materials and Methods
3.2.1 NGS data from white oak samples
3.2.2 Dissimilarity measures between genomes based on NGS data
3.2.3 Circular plots and principal coordinate analysis
3.2.4 Intra- and inter-continental $d_2^*$ dissimilarity distributions
3.2.5 Continental origin prediction by KNN and $d_2^*$
3.2.6 Effect of laboratory and sequencer error on accuracy
3.3 Results
3.3.1 Genomic dissimilarity analyses based on $d_2^*$ resolve oak geographic origins
3.3.2 Mean $d_2^*$ is smaller within continents than among continents
3.3.3 KNN predictions are robust to multiple sources of errors
3.3.4 KNN predictions are robust to sequencing technologies
3.3.5 Assigning confidence to the predicted continental origins
3.4 Discussion
4 Afann: bias adjustment for alignment-free sequence comparison based on sequencing data using neural network regression
4.1 Introduction
4.2 Results
4.2.1 Alignment-free methods overestimate distance between NGS samples
4.2.2 Bias adjustment by a neural network regression model
4.2.3 The correlation between the adjusted dissimilarity measures based on NGS samples and genomes of 21 primates is markedly increased
4.2.4 The correlation between the adjusted dissimilarity measures based on NGS samples and genomes of 28 mammals is markedly increased
4.2.5 The accuracy on predicting continental origins of white oak NGS samples using k-NN is markedly increased
4.2.6 The prediction accuracy of geographic origin at finer scales for white oak NGS samples is markedly increased
4.2.7 The correlation between the adjusted dissimilarity measures based on NGS samples and genomes of 67 vertebrates is markedly increased
4.2.8 Running time and memory
4.3 Discussion
4.4 Conclusion
4.5 Methods
4.5.1 Developing a bias adjustment model
4.5.2 Model training and evaluation
4.5.3 White oak continental origin prediction by k-NN and $d_2^*$
5 Conclusion and future work
6 Supplementary materials
6.1 Chapter 2 Supplementary materials
6.2 Chapter 3 Supplementary materials
6.3 Chapter 4 Supplementary materials
Bibliography
List of Figures
2.1 Precision-Recall Curves (PRC) for all the methods. (a) shows PRC for
CVTree with different word lengths. (b) shows PRC for all the $d_2^*$ methods.
(c) shows PRC for all $d_2^S$ methods. (d) shows PRC for Manhattan, Euclidean
and $d_2$.
2.2 The Precision-Recall Curves (PRC) of different HGT detection methods
along artificial genomes using E. coli as the host genome. (a) PRC when using
S. sonnei as the donor genome; no method performs well. (b) PRC when using
B. abortus as the donor genome; CVT(3), CVT(4), $d_2^*(3, 1)$ and $d_2^*(4, 1)$
outperform the other methods. (c) PRC when using C. coli as the donor genome;
all methods perform reasonably well.
2.3 The Precision-Recall Curves (PRC) of the different methods based on 118
genomes with known HGT genomic islands.
3.1 The circular plots of 92 white oak tree samples based on the six dissimilarity
measures: $d_2^*$, $d_2^S$, $d_2$, CVTree, Euclidean, and Manhattan, using 100 Mbp
of next generation sequencing data. Different sectors correspond to different
continents, with NA in red, EU in orange and AS in blue. Within each sector,
samples are sorted by their longitude, so that samples that are geographically
close are also close to each other in the figure. The most similar tree sample
to each sample is linked. The k-mer length is 12 and the Markov order of the
background sequence is 10 for $d_2^*$, $d_2^S$ and CVTree. The most similar sample
to each sample according to $d_2^*$ and $d_2^S$ is from the same continent of origin.
3.2 The principal coordinate plots (PCoA) of 92 white oak tree samples based
on the $d_2^*$ dissimilarity values using 50, 100 and 300 Mbp of next generation
sequencing data. The k-mer length is 12 and the Markov order of the
background sequence is 10. With sequence quantities of 100 and 300 Mbp,
the Europe (EU) and Asia (AS) samples are clearly separated in the PCoA
plots. Four western North America samples separate from the other eastern
North America samples.
3.3 Comparison of intra- and inter-continental $d_2^*$ dissimilarities with sequence
quantity of 100 Mbp. The k-mer length is 12 and the Markov order of
the background sequence is 10. The p-values were calculated based on the
Wilcoxon-Mann-Whitney test statistic and by permuting the continental labels
of the white oak tree samples $10^7$ times. The inter-continental $d_2^*$
dissimilarities are significantly higher than the intra-continental $d_2^*$ dissimilarities.
3.4 The circular plots for independent samples sequenced using a) Illumina NGS
of a California Valley Oak tree, b) a mixture of short and long reads from
both Illumina and PacBio sequencing of the Pedunculate Oak tree,
and c) RAD-seq of seven diverse tree samples. The $d_2^*$ dissimilarity measures
of each independent sample with the 92 reference samples were calculated
and the two most similar reference samples are linked.
4.1 Bias caused by NGS sampling. (a) shows two genomes that differ by only
one bp, marked in red. Both of their NGS samples (red arrows) perfectly
cover their genomes. (b) shows two genomes that are exactly the same.
Their NGS samples (blue arrows) only partially cover their genomes.
4.2 Relationship between pairwise $d_2^S$ estimated from primate genomes and from NGS
samples of different numbers of reads, using K = 14 and M = 12, without bias
adjustment. The X-axis is the pairwise $d_2^S$ estimated from genomes and the Y-axis is
the pairwise $d_2^S$ estimated from NGS samples. (a)-(h) show the relationship
between $d_2^S$ estimated from primate genomes and $d_2^S$ estimated from NGS
samples of only 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M or 15 M reads,
respectively. (i) shows pairwise $d_2^S$ estimated from mixed NGS samples.
NGS samples of different numbers of reads are colored accordingly. `Mix'
means the two NGS samples have different numbers of reads (e.g., between 1 M
and 5 M or between 7 M and 11 M) and is colored in grey. The root mean
squared error (RMSE) and Spearman correlation coefficients (SPC) between
pairwise $d_2^S$ estimated from NGS samples and from genomes are shown on each
subplot.
4.3 Relationship between pairwise $d_2^S$ estimated from primate genomes and from NGS
samples of different numbers of reads, using K = 14 and M = 12, with bias
adjustment. The X-axis is the pairwise $d_2^S$ estimated from genomes and the Y-axis
is the pairwise $d_2^S$ estimated from NGS samples after bias adjustment.
(a)-(h) show the relationship between $d_2^S$ estimated from primate genomes and
adjusted $d_2^S$ based on NGS samples of only 1 M, 3 M, 5 M, 7 M, 9 M, 11 M,
13 M or 15 M reads, respectively. (i) shows pairwise adjusted $d_2^S$ based on
mixed NGS samples. NGS samples of different numbers of reads are colored
accordingly. `Mix' means the two NGS samples have different numbers of reads
(e.g., between 1 M and 5 M or between 7 M and 11 M) and is colored in grey.
The root mean squared error (RMSE) and Spearman correlation coefficients
(SPC) between pairwise $d_2^S$ estimated from NGS samples and from genomes
are shown on each subplot.
4.4 Relationship between pairwise $d_2^S$ estimated using K = 14 and M = 12 based
on 28 mammalian genomes and NGS samples of different numbers of reads.
(a) shows the relationship before bias adjustment. (b) shows the relationship
after bias adjustment for NGS $d_2^S$. The root mean squared error (RMSE)
decreased and the Spearman correlation coefficient (SPC) between pairwise
genome $d_2^S$ and NGS $d_2^S$ increased after bias adjustment.
4.5 The circular plots of 92 white oak tree samples based on $d_2^*$ with K = 12
and M = 10 before and after bias adjustment. Different sectors correspond
to different continents, with North America (NA) in red, Europe (EU) in
orange and Asia (AS) in blue. Within each sector, samples are sorted by
their longitude, so that samples that are geographically close are also close
to each other in the figure. The most similar tree sample to each sample is
linked.
4.6 Relationship between pairwise $d_2^*$ estimated using K = 14 and M = 12
based on 67 vertebrate genomes and NGS samples of different numbers of
reads. (a) shows the relationship before bias adjustment. (b) shows the
relationship after bias adjustment for NGS $d_2^*$. The root mean squared
error decreased and the Spearman correlation coefficient between pairwise
genome $d_2^*$ and NGS $d_2^*$ increased after bias adjustment.
4.7 Diagram of hyperparameter tuning and evaluation. 1. The training set is
augmented. 2. The training set after data augmentation is used to fit the model.
3. The trained model is used to predict $d(A_G, B_G)$ for the validation set.
For each combination of hyperparameters, we repeated steps 1-3 10 times to
calculate the average $R^2$, and the combination of hyperparameters with the
highest average $R^2$ was chosen. 4. After hyperparameter tuning, the final
model is tested on the test set.
List of Tables
2.1 Complete evaluation results for different dissimilarity measures with different
word lengths k and Markov orders when needed. Numbers in the
brackets in the first column indicate the word length k and Markov order
used by methods $d_2^*$ and $d_2^S$. For example, $d_2^*(3, 1)$ means that $d_2^*$ was the
dissimilarity measure with word length 3 and Markov order 1. Optimal $F_1$
is the highest average $F_1$-score that can be achieved by this method under
a certain threshold. Optimal r for each method is the value of r, which is
used to set the threshold, that achieves the optimal $F_1$. The corresponding average
precision and average recall for the optimal $F_1$ are recorded in the second
and the third columns. Standard deviations of precision, recall and optimal
$F_1$-score over 10 simulations are shown as superscripts. Highlighted are the
top 4 $F_1$-scores for the different methods.
2.2 Performance of different alignment-free HGT detection methods over 20
artificial genomes with different donor genomes. The first column shows the
donor genome of the artificial genome. The top 12 species are in the same
order as E. coli and the bottom 8 species are in different orders
from E. coli. The second column is the Manhattan distance between the donor
genome and E. coli K12 based on tetranucleotide frequency. The third to
the ninth columns are the optimal $F_1$-scores of different methods over
different artificial genomes. The optimal $F_1$-scores for each donor genome are
highlighted.
2.3 Performance of different methods over artificial genomes by using different
window sizes. Values in the second column are the window sizes. All the
other columns are the same as in Table 2.2. The optimal $F_1$-scores for each
donor genome by using different window sizes are highlighted. *WS: window
size.
2.4 Performance of different methods over 118 genomes with known HGT genomic
islands in [1] based on a) optimal accuracy and b) optimal $F_1$-score.
The second and third columns show the precision and recall to achieve the
optimal accuracy given in the fourth column. The fifth and sixth columns
show the precision and recall corresponding to the optimal $F_1$-score given in
the seventh column.
2.5 The distances between each gene and the E. faecalis V583 genome were calculated
and genes were ranked by their distances. The first to seventh rows
show the ranks of EF2293-EF2299 among all E. faecalis V583 genes calculated
by different methods. The eighth and ninth rows show the median and
mean of the ranks of the seven genes.
3.1 KNN accuracy on test data for different sample sizes, test sizes, training
sizes and different numbers of neighbors K used.
3.2 KNN accuracy on test data with 5% simulated sequencing error for different
sample sizes, test sizes, training sizes and different numbers of neighbors.
4.1 Prediction accuracy using k-NN on the dataset of 92 white oak samples of mixed
sequence quantity based on $d_2^*$ before and after bias adjustment for different query
sizes, reference sizes and different numbers of neighbors k used. For each
combination of query size and reference size, the dataset was randomly split 100 times
and an average prediction accuracy was calculated over the 100 splits.
4.2 Kmer counting time, dissimilarity calculation time and total time as well
as memory usage of Cafe and Afann when calculating the pairwise $d_2^S$, $d_2^*$
and CVTree using K = 12 and M = 10 among a dataset of 92 white
oak NGS samples of 300 Mbp. Afann-$d_2^*$-fast and Afann-CVTree-fast stand
for the fast mode of $d_2^*$ and CVTree supported in Afann. Running time
and memory usage of Mash and Skmer were also included. Mash_min and
Skmer_min used K = 12 and $s = 10^3$, which require the minimum computing
power. Mash_opt and Skmer_opt used K = 31 and $s = 10^7$, which have the
optimal performance among Mash and Skmer using different combinations
of kmer lengths and sketch sizes as shown in Table S2 and Table S3.
Chapter 1
Introduction
Molecular sequence comparison is one of the most fundamental problems in computational
biology, with applications ranging from phylogeny reconstruction to genome
assembly. Since the 1980s, a large number of alignment-based methods and tools have been
created to solve sequence comparison problems, including BLAST [2], FASTA [3] and ClustalW
[4]. Despite their extensive applications, alignment-based methods have several limitations.
First, they are memory- and time-consuming, often requiring quadratic
time in the length of the sequences, so that aligning two long DNA sequences is infeasible
in practice. Second, it is difficult to use alignment-based methods to reconstruct
phylogenies when there are many genome rearrangements such as duplications, translocations
and horizontal gene transfer [5]. Third, they cannot be applied directly to next-generation
sequencing (NGS) samples without first assembling the short reads into long contigs; they are
therefore of limited use when the sequencing coverage is low, and their performance depends
largely on the assembly tools. Therefore, alignment-free methods, as alternatives to
alignment-based methods, have recently received increasing attention because they are
generally more memory and time efficient [6, 7, 8, 9, 10, 11, 12, 13, 14]. Moreover,
alignment-free methods, especially kmer-based approaches that use the frequencies of kmers
(k-words or k-grams) for sequence comparison, can be naturally adapted to shotgun NGS
data without assembly [15, 16, 9, 10, 17, 13, 14].
The history of using kmer-based approaches to compare molecular sequences traces back
to the work of Carl Woese and colleagues from the early 1970s to the mid-1980s,
when they generated oligonucleotide catalogs of 16S ribosomal RNA (rRNA) sequences
from about 400 organisms [18, 19]. They showed a positive correlation (0.40) between the
kmer-based dissimilarity of two sequences and the distance calculated by alignment [20].
Since then, many word count-based methods for sequence comparison have been developed,
including the uncentered correlation of word count vectors between two sequences ($d_2$) [21],
$\chi^2$ statistics [22, 23], composition vectors (CVTree) [6], nucleotide relative
abundances [24], and the recently developed $d_2^*$ and $d_2^S$ statistics [25, 7].
The background of alignment-free methods for sequence comparison has been reviewed
in depth in several excellent publications [11, 26, 27, 12, 28]. Recently, Zielezinski et al. [14]
published a comprehensive comparison of 74 alignment-free methods for five research
applications: cis-regulatory module detection, protein sequence classification, gene
tree inference, genome-based phylogeny and reconstruction of species trees under sequence
rearrangements.
Based on the rationale that similar sequences share similar kmer frequency profiles,
also known as genomic signatures [29], kmer-based alignment-free methods first count the
number of occurrences of kmers along a sequence or in an NGS sample and represent
each sequence or NGS sample as a feature vector of length $4^k$. Second, a transformation
can be applied to normalize the kmer count vector or to remove the random background
of kmer counts using a Markov model [6, 7]. Alignment-free methods that remove the
random background, such as CVTree [6], $d_2^S$ [7] and $d_2^*$ [7], are known as
background-adjusted methods. Third, dissimilarity measures such as Manhattan distance,
Euclidean distance, Mash (Jaccard distance) [10] and cosine distance are used to compare any
pair of feature vectors. The performance of kmer-based methods can vary widely depending on
the choice of k and of the dissimilarity measure between kmer vectors.
1.1 Alignment-free distance/dissimilarity measures
Among kmer-based methods, the Manhattan, Euclidean and $d_2$ [30] distances between kmer
frequency vectors are widely used because of their simplicity. Given two genomic sequences
i and j and a word length k, we first count the number of occurrences of all kmers
in sequence i and sequence j, respectively. The full set of kmers of length k is defined as
$\mathcal{A}^k$, where $\mathcal{A} = \{A, T, C, G\}$ for nucleotide sequences. For a given
kmer w, its number of occurrences in i is defined as $N_w^{(i)}$ and the frequency or
relative abundance of this kmer is defined as
$$f_w^{(i)} = \frac{N_w^{(i)}}{\sum_{w} N_w^{(i)}}.$$
Manhattan
The Manhattan distance (Ma) is defined as:
$$Ma = \sum_{w \in \mathcal{A}^k} \left| f_w^{(i)} - f_w^{(j)} \right|$$
Euclidean
The Euclidean distance (Eu) is defined as:
$$Eu = \sqrt{\sum_{w \in \mathcal{A}^k} \left| f_w^{(i)} - f_w^{(j)} \right|^2}$$
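$d_2$ [30]
The $d_2$ distance is defined as:
$$d_2 = \frac{1}{2}\left(1 - \frac{\sum_{w \in \mathcal{A}^k} f_w^{(i)} f_w^{(j)}}{\sqrt{\sum_{w \in \mathcal{A}^k} \left(f_w^{(i)}\right)^2}\,\sqrt{\sum_{w \in \mathcal{A}^k} \left(f_w^{(j)}\right)^2}}\right)$$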
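The three measures above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this thesis: the function names (`kmer_frequencies`, `manhattan`, `euclidean`, `d2`) are hypothetical, kmers containing symbols other than A, C, G and T are simply ignored, and the sequence is assumed to contain at least one valid kmer.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_frequencies(seq, k):
    """Return the relative frequency f_w of every kmer w, in a fixed word order."""
    counts = Counter(seq[p:p + k] for p in range(len(seq) - k + 1))
    words = ["".join(letters) for letters in product(ALPHABET, repeat=k)]
    total = sum(counts[w] for w in words)  # kmers with non-ACGT symbols are ignored
    return [counts[w] / total for w in words]


def manhattan(f_i, f_j):
    """Ma: sum of absolute frequency differences."""
    return sum(abs(a - b) for a, b in zip(f_i, f_j))


def euclidean(f_i, f_j):
    """Eu: square root of the sum of squared frequency differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))


def d2(f_i, f_j):
    """d2: half of one minus the uncentered correlation of the two vectors."""
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norm_i = sqrt(sum(a * a for a in f_i))
    norm_j = sqrt(sum(b * b for b in f_j))
    return 0.5 * (1 - dot / (norm_i * norm_j))
```

For example, `d2(kmer_frequencies(x, 4), kmer_frequencies(y, 4))` would give the tetranucleotide-based $d_2$ distance between two sequences `x` and `y`.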
Recently, several background-adjusted dissimilarity measures for sequence comparison
based on kmer frequency vectors have been developed, including CVTree [6], $d_2^*$
and $d_2^S$ [7, 31, 32]. They assume that the background sequence follows an m-th order Markov
model and remove the expected background from the kmer counts. The expected number
of occurrences of word w, $EN_w^{(i)}$, can be calculated from the stationary probability of the
first m-mer $w[1:m]$ and the transition probabilities from the n-th m-mer $w[n:n+m-1]$
to the (n+m)-th nucleotide $w[n+m]$:
$$EN_w^{(i)} = \left(L^{(i)} - k + 1\right)\,\pi(w[1:m]) \prod_{n=1}^{k-m} \tau\left(w[n:n+m-1],\, w[n+m]\right)$$
where $L^{(i)}$ is the length of sequence i, $\pi$ is the stationary probability and $\tau$ is the
transition probability, both of which can be estimated from the sequence data. The difference
between the number of occurrences of kmer w and its expected number of occurrences is defined as
$$\tilde{N}_w^{(i)} = N_w^{(i)} - EN_w^{(i)}.$$
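As a rough illustration of the expected-count calculation, the sketch below estimates the stationary and transition probabilities directly from the sequence and multiplies them along each kmer. The estimators are assumptions made for this example (m-mer relative frequencies for the stationary probability, and (m+1)-mer counts divided by the count of their m-mer prefix for the transition probability); the helper names `count_words`, `pi`, `tau` and `expected_kmer_counts` are hypothetical, and the sketch requires 1 <= m < k.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def count_words(seq, length):
    """Count all overlapping words of the given length in seq."""
    return Counter(seq[p:p + length] for p in range(len(seq) - length + 1))


def expected_kmer_counts(seq, k, m):
    """Expected count of every kmer under an m-th order Markov background (1 <= m < k)."""
    m_counts = count_words(seq, m)       # occurrences of each m-mer
    m1_counts = count_words(seq, m + 1)  # occurrences of each (m+1)-mer
    total_m = sum(m_counts.values())

    def pi(word):
        # Assumed estimator: relative frequency of the m-mer in the sequence.
        return m_counts[word] / total_m

    def tau(context, nxt):
        # Assumed estimator: fraction of occurrences of `context` followed by `nxt`.
        denom = sum(m1_counts[context + a] for a in ALPHABET)
        return m1_counts[context + nxt] / denom if denom else 0.0

    expected = {}
    for letters in product(ALPHABET, repeat=k):
        w = "".join(letters)
        value = (len(seq) - k + 1) * pi(w[:m])
        for n in range(k - m):  # 0-based version of n = 1, ..., k - m
            value *= tau(w[n:n + m], w[n + m])
        expected[w] = value
    return expected
```

Subtracting these expected counts from the observed kmer counts gives the centered counts used by the background-adjusted measures.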
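CVTree [6]
The CVTree dissimilarity is defined as:
$$\mathrm{CVTree} = \frac{1}{2}\left(1 - \frac{\sum_{w \in \mathcal{A}^k} \hat{f}_w^{(i)} \hat{f}_w^{(j)}}{\sqrt{\sum_{w \in \mathcal{A}^k} \left(\hat{f}_w^{(i)}\right)^2}\,\sqrt{\sum_{w \in \mathcal{A}^k} \left(\hat{f}_w^{(j)}\right)^2}}\right)$$
where $\hat{f}_w^{(i)} = \tilde{N}_w^{(i)} / EN_w^{(i)}$. CVTree calculates $EN_w^{(i)}$ by assuming a
$(k-2)$-th order Markov chain for genomic sequences.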
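$d_2^*$ [7]
The $d_2^*$ dissimilarity is defined as:
$$d_2^* = \frac{1}{2}\left(1 - \frac{\sum_{w \in \mathcal{A}^k} f_w^{*(i)} f_w^{*(j)}}{\sqrt{\sum_{w \in \mathcal{A}^k} \left(f_w^{*(i)}\right)^2}\,\sqrt{\sum_{w \in \mathcal{A}^k} \left(f_w^{*(j)}\right)^2}}\right)$$
where $f_w^{*(i)} = \tilde{N}_w^{(i)} / \sqrt{EN_w^{(i)}}$.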
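$d_2^S$ [7]
The $d_2^S$ dissimilarity is defined as:
$$d_2^S = \frac{1}{2}\left(1 - \frac{\sum_{w \in \mathcal{A}^k} \tilde{f}_w^{(i)} \tilde{f}_w^{(j)}}{\sqrt{\sum_{w \in \mathcal{A}^k} \left(\tilde{f}_w^{(i)}\right)^2}\,\sqrt{\sum_{w \in \mathcal{A}^k} \left(\tilde{f}_w^{(j)}\right)^2}}\right)$$
where
$$\tilde{f}_w^{(i)} = \frac{\tilde{N}_w^{(i)}}{\left(\left(\tilde{N}_w^{(i)}\right)^2 + \left(\tilde{N}_w^{(j)}\right)^2\right)^{1/4}} \quad \text{and} \quad \tilde{f}_w^{(j)} = \frac{\tilde{N}_w^{(j)}}{\left(\left(\tilde{N}_w^{(i)}\right)^2 + \left(\tilde{N}_w^{(j)}\right)^2\right)^{1/4}}.$$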
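The three background-adjusted measures share the same "half of one minus the uncentered correlation" form and differ only in how the centered counts are scaled. The following sketch makes that explicit. It assumes that the observed counts and the expected counts are already available as equally ordered lists, and it sets a scaled component to zero whenever its denominator is zero; that convention and all function names are assumptions of this example, not part of the definitions above.

```python
from math import sqrt


def _half_one_minus_cosine(u, v):
    """Shared outer form: 0.5 * (1 - uncentered correlation of u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 0.5 * (1 - dot / (norm_u * norm_v))


def _centered(counts, expected):
    """Observed minus expected counts for every kmer."""
    return [n - e for n, e in zip(counts, expected)]


def cvtree(n_i, e_i, n_j, e_j):
    """CVTree: centered counts divided by the expected counts."""
    f_i = [(n - e) / e if e > 0 else 0.0 for n, e in zip(n_i, e_i)]
    f_j = [(n - e) / e if e > 0 else 0.0 for n, e in zip(n_j, e_j)]
    return _half_one_minus_cosine(f_i, f_j)


def d2_star(n_i, e_i, n_j, e_j):
    """d2*: centered counts divided by the square root of the expected counts."""
    f_i = [(n - e) / sqrt(e) if e > 0 else 0.0 for n, e in zip(n_i, e_i)]
    f_j = [(n - e) / sqrt(e) if e > 0 else 0.0 for n, e in zip(n_j, e_j)]
    return _half_one_minus_cosine(f_i, f_j)


def d2_s(n_i, e_i, n_j, e_j):
    """d2S: centered counts divided by the fourth root of their summed squares."""
    t_i, t_j = _centered(n_i, e_i), _centered(n_j, e_j)
    scale = [(a * a + b * b) ** 0.25 for a, b in zip(t_i, t_j)]
    f_i = [a / s if s > 0 else 0.0 for a, s in zip(t_i, scale)]
    f_j = [b / s if s > 0 else 0.0 for b, s in zip(t_j, scale)]
    return _half_one_minus_cosine(f_i, f_j)
```

All three functions return values between 0 and 1, with 0 for identical scaled vectors; they raise a division error if one of the scaled vectors is entirely zero.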
These background-adjusted dissimilarity measures have been shown to outperform
commonly used measures such as Manhattan and Euclidean distances on different
problems, including evolutionary distance estimation [33], virus-host interaction prediction
[34], and metagenome and metatranscriptome comparison [35, 15].
1.2 Applications of background adjusted alignment-free dissimilarity measures
In Chapters 2 and 3, we studied applications of background adjusted alignment-free
methods to detecting horizontal gene transfers in bacterial genomes and predicting the
continental origins of white oaks.
1.2.1 Applications on horizontal gene transfer (HGT) detection
Horizontal gene transfer (HGT) plays an important role in the evolution of microbial
organisms including bacteria. Alignment-free methods based on single genome compositional
information have been used to detect HGT. Currently, Manhattan and Euclidean distances
based on tetranucleotide frequencies are the most commonly used alignment-free
dissimilarity measures to detect HGT.
In Chapter 2, by testing on simulated bacterial sequences and real data sets with known
horizontally transferred genomic regions, we found that more advanced alignment-free
dissimilarity measures such as CVTree and $d_2^*$, which take into account the background
Markov sequences, can solve HGT detection problems with significantly improved performance.
We also studied the influence of different factors, such as the evolutionary distance between
host and donor sequences, the size of the sliding window, and the host genome composition, on the
performance of alignment-free methods for detecting HGT. Our study showed that alignment-free
methods can predict HGT accurately when the host and donor genomes belong to different
taxonomic orders. Among all methods, CVTree with word length 3, $d_2^*$ with word length 3 and
Markov order 1, and $d_2^*$ with word length 4 and Markov order 1 outperform the others in terms of
their highest F1-score and their robustness under the influence of different factors.
1.2.2 Applications on continental origin prediction of white oak
The application of genomic data and bioinformatics to the identification of restricted or
illegally sourced natural products is urgently needed. The taxonomic identity and geographic
provenance of raw and processed materials have implications for sustainable-use
commercial practices and are relevant to the enforcement of laws that regulate or restrict
illegally harvested materials, such as timber. Improvements in genomics make it possible to
capture and sequence partial-to-complete genomes from challenging tissues, such as wood
and wood products.
In Chapter 3, we show that our approach based on $d_2^*$ and KNN can identify the continental
origins of white oaks at close to 100% accuracy and that our prediction is robust to sequencing
errors and to different sequencing technologies. This method offers an approach based
on genome-scale data, rather than panels of pre-selected markers for specific taxa. The
method provides a generalizable platform for the identification and sourcing of materials
using a unified next generation sequencing and analysis framework.
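The KNN step of this approach can be illustrated with a minimal sketch. It assumes that the dissimilarities between a query sample and each reference sample have already been computed, for instance with a background-adjusted measure; the function name `knn_predict`, the label strings and the numbers in the usage note are hypothetical and chosen only for illustration.

```python
from collections import Counter


def knn_predict(dissimilarities, labels, k):
    """Predict the label of a query sample by majority vote among its k nearest references.

    dissimilarities[r] is the dissimilarity between the query and reference sample r,
    and labels[r] is the known label (e.g. continent) of reference sample r.
    """
    nearest = sorted(range(len(labels)), key=lambda r: dissimilarities[r])[:k]
    votes = Counter(labels[r] for r in nearest)
    return votes.most_common(1)[0][0]
```

With reference labels `["NA", "NA", "EU", "AS"]` and query dissimilarities `[0.10, 0.12, 0.11, 0.40]`, calling `knn_predict` with `k=3` returns `"NA"`, because two of the three nearest reference samples carry that label.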
1.3 Improvements of background adjusted alignment-free dissimilarity measures
1.3.1 Bias adjustment for background adjusted alignment-free dissimilarity measures
The development of next-generation sequencing (NGS) technologies increases the capacity
to sequence a large number of genomes quickly and economically, but also poses a challenge
to traditional alignment-based methods, which depend on assembly and alignment. To
overcome this challenge, alignment-free methods, which are usually more time and memory
efficient, have been widely used as alternatives for sequence comparison based on
raw NGS samples without assembly or alignment.
In Chapter 4, we showed that alignment-free dissimilarities calculated from NGS
samples can be overestimated compared with the dissimilarities calculated from the corresponding
genomes, due to the stochastic distribution of short reads. This bias can significantly
decrease the performance of alignment-free analysis, especially when NGS samples are of
different sequencing depths. To adjust for this bias, we developed a de novo method, Afann
(Alignment-Free methods Adjusted by Neural Network), based on neural network regression,
and we showed that the performance of two background-adjusted dissimilarity measures,
$d_2^S$ and $d_2^*$, in estimating the evolutionary relationships among primates, mammals and
vertebrates and in predicting the continental origins of white oaks significantly increased after bias
adjustment.
Chapter 2
Background Adjusted
Alignment-free Dissimilarity
Measures Improve the Detection
of Horizontal Gene Transfer
2.1 Introduction
As opposed to vertical transmission, in which DNA is transferred from parent to offspring,
horizontal gene transfer (HGT) or lateral gene transfer (LGT) is defined as the movement
of genetic material between organisms that are not in a parent-offspring relationship. HGT
plays an important role in bacterial evolution as it is the primary reason underlying
bacterial adaptations such as metabolic adaptation [36] and antibiotic resistance [37].
Both alignment-based and alignment-free methods have been used to infer horizontal gene
transfer [38, 39, 40, 41, 5, 42, 43, 44, 45, 46]. Alignment-based methods are often considered
the gold standard [47] for HGT detection because of their explicit model. Such methods
detect horizontal gene transfer by integrating information from multiple organisms to find
genes whose phylogenetic relationships among multiple organisms differ significantly from
those of other genes [38, 39]. Despite their extensive applications in horizontal gene transfer
detection, finding topological incongruences is time-consuming, uses a large amount of memory, and
requires that the genomes of interest be annotated and that their phylogenetic relationships
be known. In addition, alignment-based methods can only be applied to gene or protein
sequences, which limits their ability to detect horizontal transfer in non-coding regions.
On the other hand, alignment-free methods detect horizontal gene transfer by identifying
regions in a genome with atypical word pattern (kmer, ktuple, kgram, etc.) composition.
These methods are based on the observation that different microbial
species have their own genomic word pattern signatures [40], so that sequences transferred
from a donor genome are likely to have composition signatures different from those of the host
genome. DNA acquired via horizontal gene transfer will, over time, acquire the composition
signatures of the host genome through a process called amelioration [48]. Widely used
alignment-free methods apply a sliding window to scan a single genome and calculate the
distance between the composition of each window and that of the whole genome. Consecutive
windows whose distance from the whole genome is higher than a threshold are inferred as
HGT. The performance of alignment-free methods depends largely on the choice of genomic
signature. Commonly used genomic signatures include, but are not limited to, GC content
[43], codon usage [43] and oligonucleotide (kmer) frequencies [44]. Becq et al. [45] reviewed
alignment-free methods for horizontal gene transfer detection and showed that kmer-based
methods with a 5kbp sliding window outperformed other alignment-free methods based on
features such as GC content [43], codon usage [43] and dinucleotides [40]. However, they
only tested Euclidean distance with kmer length 4 as the genomic signature [49] for kmer-based
methods. In fact, the performance of kmer-based methods can vary greatly depending on
the choice of k and of the dissimilarity measure between kmer vectors.
For kmer-based methods, Manhattan and Euclidean distances between the kmer
frequency vector of a genomic region and that of the whole genome are the most frequently
used measures for detecting HGTs because of their simplicity. For example, Dufraigne et al.
analyzed HGT regions of 22 genomes by using Euclidean distance with kmer length 4
[49]. In addition, they compared the genomic signatures of HGT regions with those of 12,000
species from GenBank by Euclidean distance to find their potential donors. Rajan et
al. used Manhattan distance with kmer length 5 to detect HGT in 50 diverse bacterial
genomes [50]. Tsirigos and Rigoutsos [44] proposed using relative kmer frequencies, defined
as the absolute kmer frequency divided by the expected frequency under the independent
identically distributed (IID) model, for HGT detection. They also investigated several
dissimilarity measures between the relative frequencies of a genomic region and those of the
whole genome, including correlation, covariance, Manhattan distance, Mahalanobis distance, and
Kullback-Leibler (KL) distance, for HGT detection. They showed that kmers of length 6-8
with covariance dissimilarity performed the best in their simulations. Several
review papers on the use of kmers for the detection of HGT are available [38, 39, 46]. As
in most studies of HGT, we concentrate on the use of kmers for HGT detection by using
a single genome in this paper.
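The relative kmer frequency described above can be illustrated with a minimal sketch. It assumes that the expected frequency of a kmer under the IID model is the product of the single-nucleotide frequencies of the same sequence; the function name `relative_kmer_frequencies` and the toy sequence are hypothetical and are not taken from [44].

```python
from collections import Counter
from itertools import product


def relative_kmer_frequencies(seq, k):
    """Observed kmer frequency divided by its expected frequency under an IID model."""
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    # IID background: single-nucleotide frequencies estimated from the sequence itself.
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    counts = Counter(seq[i:i + k] for i in range(n_kmers))
    rel = {}
    for kmer in map("".join, product("ACGT", repeat=k)):
        expected = 1.0
        for base in kmer:
            expected *= base_freq[base]
        observed = counts[kmer] / n_kmers
        # Leave the ratio undefined (None) when the expected frequency is zero.
        rel[kmer] = observed / expected if expected > 0 else None
    return rel


# Toy sequence for illustration only.
rel = relative_kmer_frequencies("ACGTACGTACGTAAAA", 2)
```

A ratio above 1 marks a kmer that is over-represented relative to the nucleotide composition, and a ratio below 1 marks an under-represented one.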
Recently, several new dissimilarity measures for sequence comparison based on kmer
frequency vectors have been developed, including CVTree [6], $d_2^*$ and $d_2^s$ [7, 31, 32]. They
have been shown to outperform commonly used measures such as Manhattan and
Euclidean distances on several problems, including evolutionary distance estimation
[33], virus-host interaction prediction [34], and metagenome and metatranscriptome
comparison [35, 15].
However, these dissimilarity measures have not been used for HGT detection. It is
important to know whether these new dissimilarity measures perform better than
available methods for detecting horizontal gene transfers. In addition, it is important to
study the influence of the evolutionary distance between host and donor genomes and of the
sliding window size on the performance of kmer-based alignment-free methods for HGT
detection. In this study, we address all these issues.
2.2 Materials and Methods
2.2.1 Artificial genome simulation
We chose Escherichia coli K12 (E. coli) as the host genome and Bacillus subtilis 168 (B.
subtilis), Haemophilus influenzae Rd KW20 (H. influenzae), Helicobacter pylori 26695 (H.
pylori), Mycobacterium tuberculosis H37Rv (M. tuberculosis), and Streptococcus pneumoniae
R6 (S. pneumoniae) as donor genomes. Each time, we picked a fragment at random
from a donor genome, with length chosen uniformly between 8kbp and 40kbp, and inserted
it at a uniformly chosen random position in the E. coli K12 genome, until the simulated
HGT made up to 10% of the artificial genome, since the HGT proportions in most
bacterial genomes range from 2% to 15% [51]. We named the simulated genome "E.
coli artificial". To make our results more reliable, we ran 10 simulations. Table S1 in
the supplementary material shows the detailed composition of one of these 10 simulated
genomes.
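The simulation procedure can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name, the seed handling, the stopping rule (stop once the inserted fraction reaches the target) and the choice to resample a position that would split an earlier insert are all assumptions.

```python
import random


def simulate_artificial_genome(host, donors, target_fraction=0.10,
                               min_len=8_000, max_len=40_000, seed=0):
    """Insert random donor fragments into the host until they make up
    roughly target_fraction of the artificial genome.

    Returns the artificial genome and a sorted list of half-open
    (start, end) intervals marking the inserted fragments.
    """
    rng = random.Random(seed)
    genome = host
    inserted = []          # intervals in coordinates of the current genome
    inserted_len = 0
    while inserted_len < target_fraction * len(genome):
        donor = rng.choice(donors)
        frag_len = min(rng.randint(min_len, max_len), len(donor))
        start = rng.randint(0, len(donor) - frag_len)
        fragment = donor[start:start + frag_len]
        pos = rng.randint(0, len(genome))
        # Assumption: resample rather than split a previously inserted fragment.
        if any(s < pos < e for s, e in inserted):
            continue
        # Shift earlier inserts that lie at or after the insertion point.
        inserted = [(s + frag_len, e + frag_len) if s >= pos else (s, e)
                    for s, e in inserted]
        inserted.append((pos, pos + frag_len))
        genome = genome[:pos] + fragment + genome[pos:]
        inserted_len += frag_len
    return genome, sorted(inserted)


# Toy example with short sequences so that it runs instantly.
toy_genome, toy_regions = simulate_artificial_genome(
    "A" * 1000, ["C" * 300, "G" * 300], min_len=20, max_len=50, seed=1)
```

Keeping the insert coordinates alongside the genome is what later allows precision and recall to be computed against the simulated truth.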
One of the challenges in evaluating HGT detection methods is the lack of benchmark
data. The host genome may contain genes historically transferred from other genomes, but
they are not part of the simulated transferred regions. If an HGT detection method predicts
such a gene as an HGT, the prediction is correct but will be reported
as a false positive, since the gene was not transferred through the simulation. Therefore, the
reported false positive rate may be higher than the true false positive rate. On the other
hand, this problem is common to all the HGT detection methods, so their relative
performances are still valid. Therefore, we can still use artificial genomes to compare the
relative performance of different methods.
2.2.2 Distance calculation
As in most studies [44, 49], we used a sliding window approach for the detection of HGT.
Starting from the 5'-end of the E. coli artificial genome, we divided the genome into
overlapping windows of size b with a sliding step of 500 bp. As suggested by Dufraigne et al.
[49], we first used b = 5kbp. We used CAFE [32], an accelerated alignment-free sequence
analysis tool, to calculate the dissimilarity between each window and the
whole genome under the different alignment-free dissimilarity measures, with different
kmer lengths and Markov orders as needed. For the measures $d_2$, Euclidean, and Manhattan,
which do not require Markov order information, we used k = 3, 4, 5. For $d_2^*$ and $d_2^s$, we tested
them with k = 3, 4, 5 and Markov order = 0, 1, 2, 3. For CVTree, which assumes a Markov
chain of order (k - 2), we tested it with k = 3, 4, 5. For all methods, a double-strand
signature was used to remove strand compositional asymmetry [52]; that is, we counted
kmer occurrences in both the sequence and its reverse complement.
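The sliding-window computation can be sketched in plain Python for the two simplest measures. This sketch does not call CAFE and makes no claim about its interface; the function names and the toy genome are hypothetical, and only the Manhattan and Euclidean distances are shown.

```python
from collections import Counter
from itertools import product
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def double_strand_kmer_freq(seq, k):
    """Kmer frequency vector counted on the sequence and its reverse complement."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    counts = Counter()
    for strand in (seq, rev):
        counts.update(strand[i:i + k] for i in range(len(strand) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in kmers)
    return [counts[w] / total if total else 0.0 for w in kmers]


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def window_distances(genome, k=4, window=5_000, step=500, dist=manhattan):
    """Return (start, distance) for each sliding window versus the whole genome."""
    genome_freq = double_strand_kmer_freq(genome, k)
    result = []
    for start in range(0, len(genome) - window + 1, step):
        win_freq = double_strand_kmer_freq(genome[start:start + window], k)
        result.append((start, dist(win_freq, genome_freq)))
    return result


# Toy genome: a compositionally distinct block (positions 200-280) inside a repeat.
toy = "ACGT" * 50 + "GGGGCCCC" * 10 + "ACGT" * 50
toy_scores = window_distances(toy, k=2, window=40, step=20)
most_atypical_start = max(toy_scores, key=lambda item: item[1])[0]
```

Because both strands are counted, a sequence and its reverse complement yield the same frequency vector, which is the property the double-strand signature is meant to provide.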
2.2.3 Predicting HGT regions
Windows with high dissimilarity to the whole genome are more likely to have been transferred
from other genomes. Therefore, a window is predicted to be an HGT region if its dissimilarity
D with the whole genome is above a certain threshold T. We used the same criterion as
in [45] to determine the threshold, that is,

$$T = Q_3 + r\,(Q_3 - Q_1),$$

where $Q_1$ and $Q_3$ are the first and third quartiles of the distribution of dissimilarity values
between all the windows and the whole genome, and r is a parameter used to set the
threshold that ranges from 0.25 to 10.00 with a step of 0.25. Therefore, for each alignment-
free method with a given word length k and Markov order m, we could define 40 thresholds.
Windows whose distance from the whole genome was above the threshold were defined as atypical
windows. Overlapping atypical windows were then concatenated to form atypical regions,
which were predicted as HGT regions.
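The thresholding and merging steps can be sketched as follows. The quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, since the text does not specify one, and the function names and toy distances are hypothetical.

```python
from statistics import quantiles


def hgt_threshold(distances, r):
    """T = Q3 + r * (Q3 - Q1) over all window distances."""
    q1, _, q3 = quantiles(distances, n=4, method="inclusive")
    return q3 + r * (q3 - q1)


def atypical_regions(window_starts, distances, window, r):
    """Merge overlapping windows whose distance exceeds the threshold."""
    t = hgt_threshold(distances, r)
    regions = []
    for start, d in zip(window_starts, distances):
        if d <= t:
            continue
        end = start + window
        if regions and start < regions[-1][1]:
            regions[-1][1] = max(regions[-1][1], end)   # overlaps the previous region
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]


# The 40 values of r: 0.25, 0.50, ..., 10.00.
R_GRID = [0.25 * i for i in range(1, 41)]

# Toy example: 12 windows of length 30 with a sliding step of 10.
toy_starts = list(range(0, 120, 10))
toy_distances = [1.0, 1.1, 0.9, 1.0, 1.2, 5.0, 5.5, 1.0, 0.8, 1.1, 1.0, 6.0]
toy_regions = atypical_regions(toy_starts, toy_distances, window=30, r=1.0)
```

In this toy example the windows starting at 50 and 60 overlap and are merged into one region, while the window starting at 110 forms a region of its own.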
2.2.4 Evaluation Criteria
By comparing the detected HGT with the truly transferred fragments in E. coli artificial,
we calculated the recall (sensitivity) and precision. Recall is the length of
the sequence shared by detected HGT and simulated HGT divided by the total
length of the simulated HGT fragments. Precision is the length of the
sequence shared by detected HGT and simulated HGT divided by the total length of detected
HGT. A commonly used measure that combines precision and recall is their harmonic mean,
the traditional $F_1$-measure or balanced $F_1$-score, defined as

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

Given an E. coli artificial genome, for each threshold, we calculated the precision, recall
and $F_1$-score for each method. We then calculated the average precision, recall and
$F_1$-score for each threshold over the 10 simulated genomes and plotted the precision-recall
curve. We report the optimal $F_1$-score for each dissimilarity measure.
Since most parts of the host genome are not transferred from other genomes, the receiver
operating characteristic (ROC) curve, which shows the relationship between the false positive
rate (FPR, 1 - specificity) and the true positive rate (TPR, recall or sensitivity), is not
well suited to comparing the different dissimilarity measures, because the area under the ROC
curve (AUC) and the specificity are generally very high. Therefore, we used the precision-recall
curve (PRC) and the $F_1$-score as our criteria for comparing the different dissimilarity measures.
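A small worked example shows why the false positive rate can look favorable when negatives dominate. The counts are hypothetical and chosen only for illustration, with 10% of 10,000 windows truly transferred.

```python
def rates(tp, fp, fn, tn):
    tpr = tp / (tp + fn)          # recall / sensitivity
    fpr = fp / (fp + tn)          # 1 - specificity
    precision = tp / (tp + fp)
    return tpr, fpr, precision


# Hypothetical counts: 1,000 transferred windows and 9,000 native windows.
tpr, fpr, precision = rates(tp=900, fp=900, fn=100, tn=8100)
```

With these counts the TPR is 0.9 and the FPR is only 0.1, yet half of the predicted windows are wrong (precision 0.5), a weakness that the precision-recall curve exposes and the ROC curve largely hides.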
2.2.5 Investigating the effect of the evolutionary relationship between the host
and donor genomes and of window size on the performance of different
methods
In the simulated genome above, we assumed that all the donor genomes can contribute to
the host genome through HGT. Since closely related genomes have similar kmer frequencies,
it will be difficult to detect HGT from closely related genomes. On the other hand, if the
donor genome has a high evolutionary distance from the host genome, it will be relatively
easy to identify HGT with any reasonable method. Therefore, we next investigated how
the evolutionary relationship between the donor genome and the host genome affects
the relative performance of the different HGT detection methods.
In our simulations, we still used E. coli K12, which belongs to the Proteobacteria phylum,
Gammaproteobacteria class, Enterobacteriales order, Enterobacteriaceae family and
Escherichia genus, as the host genome and chose 20 donor genomes with different evolutionary
relationships to E. coli. Four of them are different species of the Escherichia genus
(Escherichia albertii KF1 (E. albertii), Escherichia fergusonii ATCC 35469 (E. fergusonii),
Escherichia hermannii NBRC 105704 (E. hermannii), Escherichia vulneris NBRC 102420 (E.
vulneris)), four are in different genera of the Enterobacteriaceae family (Enterobacter
cloacae ATCC 13047 (E. cloacae), Klebsiella pneumoniae HS11286 (K. pneumoniae),
Salmonella typhimurium LT2 (S. typhimurium), Shigella sonnei 53G (S. sonnei)), four
are in different families of the Enterobacteriales order (Yersinia pestis KIM 10+ (Y.
pestis), Photorhabdus luminescens TT01 (P. luminescens), Pantoea ananatis LMG 20103
(P. ananatis), Brenneria goodwinii OBR1 (B. goodwinii)), four are in different
orders of the Gammaproteobacteria class (Legionella pneumophila Philadelphia 1 (L.
pneumophila), Pseudomonas aeruginosa PAO1 (P. aeruginosa), Vibrio parahaemolyticus RIMD
2210633 (V. parahaemolyticus), Xanthomonas axonopodis Xac29-1 (X. axonopodis)), and
four are in different classes of the Proteobacteria phylum (Burkholderia
pseudomallei K96243 (B. pseudomallei), Brucella abortus 2308 (B. abortus), Campylobacter
coli RM4661 (C. coli), Acidithiobacillus ferrooxidans ATCC 23270 (A. ferrooxidans)). By
transferring fragments of between 8kbp and 40kbp, picked uniformly from these genomes, into
E. coli K12, we constructed 20 artificial genomes, each consisting of 10% HGT from a
single donor genome. We then detected the HGT using the different alignment-free
methods and compared them using the same criteria as above.
To study the effect of window length, we continued to use the 20 artificial
genomes generated above. Instead of using 5kbp as the length of the sliding window, we
changed the window size to 3kbp and 8kbp, respectively. Finally, we used the $F_1$-score to
evaluate the different methods.
To see whether our results are consistent across host genomes, we changed the host
genome from E. coli to B. abortus and K. pneumoniae, respectively. We then carried out the
same analyses as for E. coli.
2.2.6 Investigation of HGT within 118 genomes and E. faecalis V583
To evaluate the performance of alignment-free methods for HGT detection on real data,
we used a data set constructed in [1]. In that study, the authors selected 118 genomes
from 117 different strains and used a comparative genomics approach to detect genomic
islands resulting from horizontal gene transfer. This benchmark was constructed using
alignments and did not use nucleotide composition information. Therefore, the data set
can be used to evaluate different alignment-free HGT detection methods. For each genome,
the authors provided positive and negative regions of HGT. As in [1], we used precision,
recall and overall accuracy to evaluate the performance of alignment-free methods for HGT
prediction over these 118 chromosomes, where the accuracy is the fraction of all predictions
that are true positives or true negatives. In addition, we used the
optimal $F_1$-score and the precision-recall curve to compare the different methods.
We also used the different methods to identify HGT regions in Enterococcus faecalis
V583 (E. faecalis), which contains seven genes known to have been transferred from other
genomes. Since we do not know the whole set of HGT genes, we only investigated whether these
seven genes are ranked higher than other genes. The higher these genes are ranked by a
particular method, the better that method performs in predicting HGT.
2.3 Results
2.3.1 Background adjusted dissimilarity measures outperform non-background
adjusted methods for HGT detection based on the E. coli artificial
genome
Table 2.1 shows the precision and recall yielding the highest average $F_1$-score for the
different alignment-free methods, for each word size k and, when needed, Markov order m. The
highest $F_1$-score of 0.88 is obtained for CVT(4), followed by CVT(3), $d_2^*(3,1)$ and $d_2^*(4,1)$
(the first number in the parentheses is the word length and the second number is the
order of the Markov chain), with average $F_1$-scores of at least 0.87. In comparison with the
background adjusted dissimilarity measures, the widely used Manhattan and Euclidean
distances both have $F_1$-scores of at most 0.80.
In addition to comparing the different methods at the optimal $F_1$ level, we also plotted
the precision-recall curves for the different methods, shown in Figure 2.1. Figure 2.1(d)
shows that the non-background adjusted methods Ma, Eu and $d_2$ showed similar performance.
Figure 2.1(a), (b) and (c) show that background adjusted methods performed better than
non-background adjusted methods when k = 3 or k = 4. Among all methods, CVT(3),
CVT(4), $d_2^*(3,1)$ and $d_2^*(4,1)$ had the best performance in terms of precision-recall curves.
The conclusions about the relative performance of the different methods are the same whether
based on the $F_1$-score or on the precision-recall curves.
The better performance of the background adjusted methods (CVTree, $d_2^*$ and $d_2^s$)
over the non-background adjusted methods (Manhattan, Euclidean, and $d_2$) can probably
be explained as follows. By removing the background counts of the
word patterns, the signals from the kmers most representative of the host genome
are amplified, while the contributions of irrelevant kmers are mitigated. Therefore, the
background adjusted dissimilarity measures perform well in HGT detection.
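To make the idea of background adjustment concrete, the following minimal sketch computes a $d_2^*$-style dissimilarity under an order-0 (IID) background. It assumes the commonly cited form in which each count is centered by its expected value, scaled by the square root of that expectation, and the two resulting vectors are compared through one half of one minus their cosine similarity. The exact definitions used in the study are those of [7, 31, 32]; single-strand counting and the function names here are simplifications made for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt


def centered_counts(seq, k):
    """Kmer counts minus their IID expectation, scaled by sqrt(expectation)."""
    seq = seq.upper()
    n = len(seq) - k + 1
    base_p = {b: seq.count(b) / len(seq) for b in "ACGT"}
    counts = Counter(seq[i:i + k] for i in range(n))
    vec = []
    for kmer in map("".join, product("ACGT", repeat=k)):
        p = 1.0
        for b in kmer:
            p *= base_p[b]
        expected = n * p
        vec.append((counts[kmer] - expected) / sqrt(expected) if expected > 0 else 0.0)
    return vec


def d2_star(seq_x, seq_y, k):
    x, y = centered_counts(seq_x, k), centered_counts(seq_y, k)
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    if norm == 0:
        return 0.5   # assumption: treat an undefined correlation as neutral
    return 0.5 * (1 - sum(a * b for a, b in zip(x, y)) / norm)


# Toy sequences for illustration only.
same = d2_star("ACGTACGTAAGGCTTA", "ACGTACGTAAGGCTTA", 2)
other = d2_star("ACGTACGTAAGGCTTA", "GGGGCCCCGGGGCCCC", 2)
```

Because the expected counts are subtracted first, kmers that are frequent only because of nucleotide composition contribute little, and the comparison is driven by kmers that deviate from the background.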
Based on the performance of the different methods shown in Table 2.1 and Figure 2.1,
we only present results for the top performing methods in the rest of the paper. We
chose CVT(3), CVT(4), $d_2^*(3,1)$, and $d_2^*(4,1)$ to represent background adjusted methods
and Ma(5), Eu(5), and $d_2(5)$ to represent non-background adjusted methods in
the following studies.
Method          Precision      Recall         Optimal $F_1$   Optimal r
CVT(3)          0.77 ± 0.01    0.99 ± 0.01    0.87 ± 0.00    4.50
CVT(4)          0.81 ± 0.01    0.95 ± 0.02    0.88 ± 0.01    2.75
CVT(5)          0.70 ± 0.02    0.71 ± 0.05    0.71 ± 0.03    1.25
$d_2^*(3,0)$    0.70 ± 0.04    0.65 ± 0.11    0.67 ± 0.08    4.75
$d_2^*(3,1)$    0.77 ± 0.01    0.99 ± 0.00    0.87 ± 0.01    4.25
$d_2^*(4,0)$    0.56 ± 0.01    0.96 ± 0.02    0.71 ± 0.01    2.00
$d_2^*(4,1)$    0.77 ± 0.01    0.99 ± 0.01    0.87 ± 0.01    3.75
$d_2^*(4,2)$    0.77 ± 0.01    0.96 ± 0.02    0.86 ± 0.01    2.25
$d_2^*(5,0)$    0.58 ± 0.01    0.93 ± 0.03    0.71 ± 0.01    2.00
$d_2^*(5,1)$    0.76 ± 0.01    0.98 ± 0.01    0.86 ± 0.01    3.00
$d_2^*(5,2)$    0.82 ± 0.01    0.90 ± 0.03    0.86 ± 0.02    2.25
$d_2^*(5,3)$    0.54 ± 0.03    0.78 ± 0.05    0.64 ± 0.03    1.00
$d_2^s(3,0)$    0.39 ± 0.12    0.82 ± 0.21    0.49 ± 0.03    0.50
$d_2^s(3,1)$    0.75 ± 0.01    0.99 ± 0.01    0.85 ± 0.01    2.50
$d_2^s(4,0)$    0.54 ± 0.10    0.83 ± 0.19    0.63 ± 0.04    0.75
$d_2^s(4,1)$    0.76 ± 0.06    0.79 ± 0.18    0.76 ± 0.09    1.00
$d_2^s(4,2)$    0.74 ± 0.02    0.79 ± 0.13    0.76 ± 0.06    1.00
$d_2^s(5,0)$    0.58 ± 0.02    0.80 ± 0.09    0.67 ± 0.03    1.00
$d_2^s(5,1)$    0.74 ± 0.03    0.89 ± 0.08    0.80 ± 0.03    1.50
$d_2^s(5,2)$    0.83 ± 0.02    0.87 ± 0.06    0.85 ± 0.03    1.50
$d_2^s(5,3)$    0.63 ± 0.02    0.67 ± 0.08    0.65 ± 0.04    1.00
Ma(3)           0.75 ± 0.04    0.79 ± 0.12    0.76 ± 0.07    2.50
Ma(4)           0.80 ± 0.03    0.80 ± 0.12    0.80 ± 0.07    3.00
Ma(5)           0.79 ± 0.03    0.81 ± 0.12    0.80 ± 0.07    3.25
Eu(3)           0.76 ± 0.03    0.78 ± 0.13    0.76 ± 0.07    2.50
Eu(4)           0.80 ± 0.02    0.77 ± 0.12    0.79 ± 0.07    2.75
Eu(5)           0.79 ± 0.02    0.80 ± 0.12    0.79 ± 0.07    2.75
$d_2(3)$        0.80 ± 0.04    0.76 ± 0.12    0.78 ± 0.07    5.00
$d_2(4)$        0.77 ± 0.04    0.82 ± 0.12    0.79 ± 0.06    4.50
$d_2(5)$        0.81 ± 0.03    0.81 ± 0.12    0.81 ± 0.07    4.50

Table 2.1: Complete evaluation results for different dissimilarity measures with different
word lengths k and Markov orders when needed. Numbers in the parentheses in the first
column indicate the word length k and, for the methods $d_2^*$ and $d_2^s$, the Markov order. For
example, $d_2^*(3,1)$ means that $d_2^*$ was the dissimilarity measure with word length 3 and
Markov order 1. Optimal $F_1$ is the highest average $F_1$-score that can be achieved by the
method under a certain threshold. Optimal r for each method is the value of r, which
is used to set the threshold, that achieves the optimal $F_1$. The corresponding average
precision and average recall for the optimal $F_1$ are recorded in the second and third columns.
Standard deviations of precision, recall and optimal $F_1$-score over 10 simulations are shown
after the ± sign. The top four $F_1$-scores are those of CVT(3), CVT(4), $d_2^*(3,1)$ and $d_2^*(4,1)$.
Figure 2.1: Precision-Recall Curves (PRC) for all the methods. (a) shows PRC for CVTree
with different word lengths. (b) shows PRC for all the $d_2^*$ methods. (c) shows PRC for all
$d_2^s$ methods. (d) shows PRC for Manhattan, Euclidean and $d_2$.
2.3.2 The performance of the alignment-free methods increases with the
genetic distance between the donor genome and the host genome
We next investigated the influence of the evolutionary distance between the donor genome and
the host genome on the performance of the methods CVT(3), CVT(4), $d_2^*(3,1)$,
$d_2^*(4,1)$, Ma(5), Eu(5), and $d_2(5)$, based on the 20 artificial genomes described in the
"Materials and Methods" section. The results are given in Table 2.2. The 20 donor genomes
were sorted by the Manhattan distance between the tetramer frequencies of the donor and those
of E. coli K12.
We divided the donor genomes into three groups, separated by horizontal lines in Table
2.2. For the top group of donor genomes, with Manhattan distance between the donor and
host genomes less than 0.12, none of the methods has an $F_1$ value greater than 0.30, indicating
that none of them can successfully detect HGT when the donor genome and host genome are
very close. For the second group of donor genomes, with Manhattan distance between 0.12
and 0.31, the optimal $F_1$ scores are moderate, between 0.32 and 0.71, for eight out of ten
donor genomes, the exceptions being V. parahaemolyticus and B. abortus. Except for E. cloacae,
E. vulneris and K. pneumoniae, the background adjusted dissimilarity measures outperform the
non-background adjusted measures, sometimes by a significant margin. For example, when
the donor genome is V. parahaemolyticus, the $F_1$-scores for CVT(3), CVT(4), $d_2^*(3,1)$ and
$d_2^*(4,1)$ are all at least 0.85, while the $F_1$-scores for Ma(5), Eu(5) and $d_2(5)$ are at most 0.58.
Within this group of donor genomes, CVT(4) seems to perform better than CVT(3) when
the Manhattan distance between the donor and host genomes is between 0.12 and 0.22,
while CVT(3) is slightly better than CVT(4) when the Manhattan distance is between
0.22 and 0.31. The results are reasonable, since when the donor and host genomes are
relatively close, relatively long kmers are needed to separate the transferred fragments from
the background. On the other hand, when the donor and host genomes are relatively far
apart, relatively short kmers are more discriminative. For the last group of donor genomes,
with large distances between the donor and host genomes, all seven methods perform
reasonably well, with CVT(3), $d_2^*(4,1)$ and Ma(5) generally the best performers.

Donor                Distance  CVT(3)     CVT(4)     $d_2^*(3,1)$  $d_2^*(4,1)$  Ma(5)      Eu(5)      $d_2(5)$
S. sonnei            0.027     0.18±0.03  0.18±0.03  0.17±0.03     0.17±0.04     0.16±0.02  0.16±0.02  0.17±0.03
E. fergusonii        0.038     0.19±0.02  0.15±0.02  0.19±0.02     0.18±0.02     0.17±0.02  0.16±0.02  0.18±0.02
E. albertii          0.044     0.21±0.02  0.17±0.02  0.21±0.01     0.21±0.02     0.17±0.02  0.17±0.02  0.18±0.02
S. typhimurium       0.090     0.23±0.02  0.19±0.02  0.23±0.02     0.22±0.02     0.25±0.01  0.27±0.01  0.27±0.01
E. hermannii         0.119     0.16±0.01  0.27±0.02  0.14±0.02     0.15±0.02     0.26±0.02  0.26±0.02  0.25±0.02
------------------------------------------------------------------------------------------------------------
P. ananatis          0.123     0.23±0.02  0.38±0.02  0.19±0.02     0.21±0.03     0.26±0.01  0.26±0.01  0.25±0.02
B. goodwinii         0.124     0.27±0.02  0.44±0.02  0.27±0.02     0.29±0.02     0.30±0.02  0.32±0.03  0.32±0.02
E. cloacae           0.141     0.23±0.02  0.28±0.02  0.19±0.03     0.21±0.02     0.32±0.02  0.30±0.02  0.30±0.02
Y. pestis            0.160     0.51±0.02  0.61±0.02  0.50±0.02     0.56±0.02     0.33±0.03  0.30±0.03  0.37±0.02
E. vulneris          0.223     0.39±0.02  0.27±0.02  0.29±0.01     0.33±0.01     0.46±0.02  0.44±0.02  0.43±0.02
K. pneumoniae        0.228     0.28±0.03  0.26±0.03  0.21±0.02     0.23±0.01     0.46±0.02  0.46±0.03  0.43±0.01
V. parahaemolyticus  0.261     0.87±0.01  0.85±0.01  0.88±0.01     0.88±0.00     0.55±0.02  0.51±0.04  0.58±0.01
P. luminescens       0.283     0.60±0.02  0.65±0.03  0.59±0.01     0.63±0.02     0.56±0.01  0.55±0.01  0.57±0.01
A. ferrooxidans      0.301     0.71±0.02  0.68±0.02  0.62±0.02     0.63±0.02     0.54±0.01  0.52±0.03  0.52±0.01
B. abortus           0.308     0.86±0.01  0.82±0.01  0.85±0.01     0.85±0.01     0.63±0.01  0.60±0.01  0.55±0.01
------------------------------------------------------------------------------------------------------------
L. pneumophila       0.449     0.84±0.01  0.78±0.01  0.84±0.01     0.87±0.00     0.85±0.01  0.82±0.02  0.84±0.02
X. axonopodis        0.487     0.87±0.01  0.86±0.01  0.83±0.01     0.82±0.01     0.85±0.03  0.85±0.03  0.76±0.02
P. aeruginosa        0.550     0.89±0.00  0.79±0.01  0.86±0.01     0.81±0.01     0.90±0.01  0.89±0.01  0.81±0.01
B. pseudomallei      0.682     0.96±0.01  0.87±0.01  0.95±0.01     0.94±0.02     0.90±0.02  0.90±0.03  0.88±0.03
C. coli              0.713     0.97±0.00  0.94±0.01  0.97±0.00     0.98±0.00     0.98±0.00  0.97±0.00  0.97±0.00

Table 2.2: Performance of different alignment-free HGT detection methods over 20 artificial
genomes with different donor genomes. The first column shows the donor genome of the
artificial genome. The top 12 species belong to the same order as E. coli and the bottom
8 species belong to different orders. The second column is the Manhattan
distance between the donor genome and E. coli K12 based on tetranucleotide frequency. The
third to the ninth columns are the optimal $F_1$-scores (mean ± standard deviation) of the
different methods over the different artificial genomes. The optimal $F_1$-score for each donor
genome is the largest value in its row.
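The comparison within the second group of donors can be checked directly from the optimal $F_1$-scores in Table 2.2. The short sketch below, whose variable names are hypothetical, lists the donors for which no background adjusted method beats the best non-background adjusted method.

```python
# Column order: CVT(3), CVT(4), d2*(3,1), d2*(4,1), Ma(5), Eu(5), d2(5).
# Mean optimal F1-scores copied from the second group of Table 2.2.
f1_intermediate = {
    "P. ananatis":         [0.23, 0.38, 0.19, 0.21, 0.26, 0.26, 0.25],
    "B. goodwinii":        [0.27, 0.44, 0.27, 0.29, 0.30, 0.32, 0.32],
    "E. cloacae":          [0.23, 0.28, 0.19, 0.21, 0.32, 0.30, 0.30],
    "Y. pestis":           [0.51, 0.61, 0.50, 0.56, 0.33, 0.30, 0.37],
    "E. vulneris":         [0.39, 0.27, 0.29, 0.33, 0.46, 0.44, 0.43],
    "K. pneumoniae":       [0.28, 0.26, 0.21, 0.23, 0.46, 0.46, 0.43],
    "V. parahaemolyticus": [0.87, 0.85, 0.88, 0.88, 0.55, 0.51, 0.58],
    "P. luminescens":      [0.60, 0.65, 0.59, 0.63, 0.56, 0.55, 0.57],
    "A. ferrooxidans":     [0.71, 0.68, 0.62, 0.63, 0.54, 0.52, 0.52],
    "B. abortus":          [0.86, 0.82, 0.85, 0.85, 0.63, 0.60, 0.55],
}

# The first four columns are background adjusted, the last three are not.
exceptions = [donor for donor, scores in f1_intermediate.items()
              if max(scores[:4]) <= max(scores[4:])]
```

The resulting list contains E. cloacae, E. vulneris and K. pneumoniae, the three exceptions named in the text.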
In addition to comparing the different methods based on the optimal $F_1$-score,
we also plotted the precision-recall curves for three donor genomes, S. sonnei, B. abortus,
and C. coli, in Figure 2.2 as examples of each group. The relative performance of the
different methods was similar to that observed with the $F_1$-scores.
Figure 2.2: The Precision-Recall Curves (PRC) of different HGT detection methods on
artificial genomes using E. coli as the host genome. (a) PRC when using S. sonnei as the donor
genome; no method performs well. (b) PRC when using B. abortus as the donor genome;
CVT(3), CVT(4), $d_2^*(3,1)$ and $d_2^*(4,1)$ outperform the other methods. (c) PRC when using
C. coli as the donor genome; all methods perform reasonably well.
2.3.3 The performance of the alignment-free methods increases with the
window size within the range of 3kbp to 8kbp
We further studied the influence of window size on the different methods, since the performance
of alignment-free methods always relies on the sequence being long enough
to represent the genomic signature. Besides the 5kbp window size with a 500 bp sliding step, we
also checked the performance of the different methods with a 3kbp window size and a 300 bp
sliding step and with an 8kbp window size and an 800 bp sliding step, using the same evaluation
approach. Among the 20 artificial genomes generated to study the influence
of the genetic distance between the donor genome and the host genome, we chose the 8
whose donors belong to a different order from E. coli K12. The optimal $F_1$ scores of the
different methods using different window sizes over these 8 genomes are shown in Table
2.3. All methods showed a similar trend: their mean $F_1$ score increases as the window
length increases from 3kbp to 8kbp. However, CVT(3) is the most robust to the window
size, and its performance suffers less from a decrease in window size than that of the other
methods.
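The trend in mean $F_1$-score can be reproduced from the values in Table 2.3. The sketch below does this for two of the seven methods, CVT(3) and Ma(5); the variable names are hypothetical and the numbers are the mean optimal $F_1$-scores copied from the table, one per donor in table order.

```python
from statistics import mean

# Window size (kbp) -> optimal F1-score for each of the 8 donors in Table 2.3.
f1_by_window = {
    "CVT(3)": {3: [0.84, 0.65, 0.79, 0.79, 0.81, 0.86, 0.92, 0.96],
               5: [0.87, 0.71, 0.86, 0.84, 0.87, 0.89, 0.96, 0.97],
               8: [0.95, 0.82, 0.94, 0.91, 0.96, 0.96, 0.97, 0.93]},
    "Ma(5)":  {3: [0.47, 0.51, 0.55, 0.79, 0.78, 0.84, 0.90, 0.97],
               5: [0.55, 0.54, 0.63, 0.85, 0.85, 0.90, 0.90, 0.98],
               8: [0.62, 0.61, 0.68, 0.88, 0.86, 0.90, 0.89, 0.97]},
}

mean_f1 = {method: {w: mean(values) for w, values in by_window.items()}
           for method, by_window in f1_by_window.items()}
```

For both methods the mean over the eight donors rises from the 3kbp window to the 5kbp window and again to the 8kbp window.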
2.3.4 Robustness of the relative performance of the different methods
with respect to different host genomes
To assess the robustness of our results on the relative performance of the different
alignment-free HGT detection methods with respect to host genomes, we changed the host genome
from E. coli to B. abortus and K. pneumoniae, respectively. The complete results are
given as Table S2 and Table S3 in the supplementary material. From both tables, it can
be seen that the conclusions about the relative performance of the different methods hold
regardless of the host genome.
Donor                WS* (kbp)  CVT(3)     CVT(4)     $d_2^*(3,1)$  $d_2^*(4,1)$  Ma(5)      Eu(5)      $d_2(5)$
V. parahaemolyticus  3          0.84±0.01  0.82±0.01  0.85±0.01     0.87±0.01     0.47±0.02  0.43±0.02  0.53±0.02
V. parahaemolyticus  5          0.87±0.01  0.85±0.01  0.88±0.01     0.88±0.00     0.55±0.02  0.51±0.04  0.58±0.01
V. parahaemolyticus  8          0.95±0.01  0.91±0.01  0.95±0.01     0.94±0.01     0.62±0.02  0.60±0.02  0.64±0.01
A. ferrooxidans      3          0.65±0.02  0.62±0.01  0.58±0.02     0.59±0.02     0.51±0.02  0.47±0.02  0.49±0.02
A. ferrooxidans      5          0.71±0.02  0.68±0.02  0.62±0.02     0.63±0.02     0.54±0.01  0.52±0.03  0.52±0.01
A. ferrooxidans      8          0.82±0.03  0.72±0.02  0.68±0.04     0.66±0.02     0.61±0.03  0.56±0.03  0.56±0.02
B. abortus           3          0.79±0.01  0.77±0.02  0.78±0.01     0.78±0.01     0.55±0.01  0.52±0.01  0.50±0.01
B. abortus           5          0.86±0.01  0.82±0.01  0.85±0.01     0.85±0.01     0.63±0.01  0.60±0.01  0.55±0.01
B. abortus           8          0.94±0.01  0.88±0.02  0.93±0.01     0.91±0.01     0.68±0.01  0.66±0.02  0.61±0.02
L. pneumophila       3          0.79±0.02  0.73±0.01  0.79±0.01     0.84±0.01     0.79±0.01  0.77±0.01  0.79±0.01
L. pneumophila       5          0.84±0.01  0.78±0.01  0.84±0.01     0.87±0.00     0.85±0.01  0.82±0.02  0.84±0.02
L. pneumophila       8          0.91±0.02  0.82±0.01  0.89±0.01     0.93±0.01     0.88±0.02  0.86±0.01  0.87±0.01
X. axonopodis        3          0.81±0.01  0.81±0.02  0.74±0.01     0.74±0.01     0.78±0.02  0.78±0.03  0.69±0.02
X. axonopodis        5          0.87±0.01  0.86±0.01  0.83±0.01     0.82±0.01     0.85±0.03  0.85±0.03  0.76±0.02
X. axonopodis        8          0.96±0.01  0.92±0.01  0.92±0.01     0.91±0.02     0.86±0.03  0.82±0.02  0.81±0.02
P. aeruginosa        3          0.86±0.01  0.73±0.02  0.79±0.01     0.72±0.01     0.84±0.01  0.83±0.02  0.76±0.01
P. aeruginosa        5          0.89±0.00  0.79±0.01  0.86±0.01     0.81±0.01     0.90±0.01  0.89±0.01  0.81±0.01
P. aeruginosa        8          0.96±0.01  0.84±0.01  0.95±0.01     0.90±0.01     0.90±0.01  0.88±0.03  0.86±0.01
B. pseudomallei      3          0.92±0.01  0.83±0.01  0.91±0.01     0.90±0.01     0.90±0.03  0.91±0.02  0.84±0.03
B. pseudomallei      5          0.96±0.01  0.87±0.01  0.95±0.01     0.94±0.02     0.90±0.02  0.90±0.03  0.88±0.03
B. pseudomallei      8          0.97±0.01  0.93±0.01  0.96±0.01     0.96±0.01     0.89±0.02  0.89±0.03  0.86±0.02
C. coli              3          0.96±0.01  0.90±0.00  0.97±0.01     0.97±0.01     0.97±0.00  0.95±0.01  0.96±0.00
C. coli              5          0.97±0.00  0.94±0.01  0.97±0.00     0.98±0.00     0.98±0.00  0.97±0.00  0.97±0.00
C. coli              8          0.93±0.01  0.96±0.01  0.94±0.00     0.96±0.01     0.97±0.00  0.96±0.01  0.96±0.00

Table 2.3: Performance of different methods over artificial genomes using different
window sizes. Values in the second column are the window sizes. All the other columns are
the same as in Table 2.2. For each donor genome and window size, the optimal $F_1$-score is the
largest value in the row. *WS: window size.
2.3.5 Applications to real HGT data support the good performance of
background adjusted dissimilarity measures
Evaluation of different methods based on 118 genomes with known HGT genomic islands
We next applied the various dissimilarity measures to identify genomic islands generated
by HGT in the 118 genomes described in the "Materials and Methods" section. As in our
simulation studies, we used 40 thresholds for each method and calculated
the optimal accuracy, that is, the highest accuracy a method can achieve under any of these
thresholds. The results are shown in part a) of Table 2.4. The values of the optimal
accuracy for the different methods are not markedly different, but the background adjusted
dissimilarity measures CVT(3), CVT(4), $d_2^*(3,1)$ and $d_2^*(4,1)$ achieve accuracy at least
as high as, and mostly slightly higher than, the non-background adjusted dissimilarity measures
Eu(5), Ma(5) and $d_2(5)$. Similarly, we also evaluated the different methods based on
the optimal $F_1$-score, as shown in part b) of Table 2.4. The conclusions on the relative
performance of the methods based on the $F_1$-score are essentially the same as those based on
optimal accuracy. In addition, we plotted the precision-recall curves of the different
methods based on this data set; the resulting curves are shown in Figure 2.3. It is clear
from the figure that CVT(3), $d_2^*(3,1)$ and $d_2^*(4,1)$ perform much better than the other
methods. In [1], SIGI-HMM and IslandPath/DIMOB showed the highest accuracy of 0.86.
We did not include them in our comparison because they incorporate other information,
such as codon usage, dinucleotide bias, gene expression and mobility, that can only be used
when the genome is annotated. However, in terms of accuracy, $d_2^*(4,1)$ achieves the
same performance as SIGI-HMM and IslandPath/DIMOB while detecting HGT purely from
the genomic composition.
Figure 2.3: The Precision-Recall Curves (PRC) of the different methods based on 118
genomes with known HGT genomic islands.
                a) Based on accuracy                      b) Based on $F_1$ score
Method          Precision  Recall  Optimal accuracy       Precision  Recall  Optimal $F_1$ score
CVT(3)          0.68       0.41    0.84                   0.54       0.60    0.57
CVT(4)          0.62       0.31    0.83                   0.50       0.56    0.53
$d_2^*(3,1)$    0.72       0.38    0.85                   0.57       0.58    0.58
$d_2^*(4,1)$    0.72       0.45    0.86                   0.58       0.63    0.61
Ma(5)           0.67       0.26    0.83                   0.48       0.68    0.56
Eu(5)           0.58       0.46    0.83                   0.50       0.63    0.55
$d_2(5)$        0.60       0.30    0.82                   0.45       0.67    0.53

Table 2.4: Performance of different methods over 118 genomes with known HGT genomic
islands in [1] based on a) optimal accuracy and b) optimal $F_1$-score. The second and third
columns show the precision and recall that achieve the optimal accuracy given in the fourth
column. The fifth and sixth columns show the precision and recall corresponding to the
optimal $F_1$-score given in the seventh column.
Evaluation of different methods based on E. faecalis V583 with seven known
HGT genes
In E. faecalis V583, a genomic region containing 7 genes (EF2293-EF2299) that confer
vancomycin resistance on E. faecalis is known to have been horizontally transferred
[44]. In this case, we calculated the distance between each gene and the E. faecalis V583
genome using the different methods. We then ranked all 3112 E. faecalis V583 genes by
distance in descending order, so that the first gene has the largest distance to the E. faecalis
V583 genome. Better HGT detection methods should give EF2293-EF2299 lower ranks, that is,
ranks closer to the top of the list. The ranks of EF2293-EF2299 and the median and mean rank
of these 7 genes for all the methods are shown in Table 2.5. $d_2^*(3,1)$ gives lower median and
mean ranks for EF2293-EF2299 than the other methods. In comparison with $d_2^*$, the median and
mean ranks given by the more commonly used Manhattan and Euclidean distances are larger than
1000, which is unreasonably high considering that the HGT proportions in most bacterial
genomes range from only 2% to 15% [51].
Gene     CVT (3)  CVT (4)  $d_2^*(3, 1)$  $d_2^*(4, 1)$  Ma(5)   Eu(5)   $d_2(5)$
EF2293   607      815      688            605            854     1001    511
EF2294   325      1874     222            447            1302    1373    719
EF2295   138      855      109            219            1169    1273    520
EF2296   379      1613     313            385            1392    1491    850
EF2297   618      2638     665            1245           1117    1165    551
EF2298   660      1355     702            772            1978    1924    1025
EF2299   687      1084     477            607            814     820     384
Median   607      1355     477            605            1169    1273    551
Mean     487.7    1462.0   453.7          611.4          1232.3  1292.4  651.4
Table 2.5: The distances between each gene and the E. faecalis V583 genome were calculated
and genes were ranked by their distances. The first to seventh rows show the ranks of
EF2293-EF2299 among all E. faecalis V583 genes calculated by different methods. The
eighth and ninth rows show the median and mean of the ranks of the seven genes.
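The ranking step and the summary rows of Table 2.5 can be sketched as follows. This is a minimal illustration, not the code used in the study: `rank_descending` and the toy gene distances are hypothetical, while the seven ranks are copied from the $d_2^*(3, 1)$ column of Table 2.5.

```python
from statistics import mean, median


def rank_descending(distances):
    """Map each gene to its 1-based rank, largest distance first."""
    ordered = sorted(distances, key=distances.get, reverse=True)
    return {gene: i + 1 for i, gene in enumerate(ordered)}


# Toy gene-to-genome distances (hypothetical values, for illustration only).
toy = {"geneA": 0.12, "geneB": 0.45, "geneC": 0.30}
toy_ranks = rank_descending(toy)

# Ranks of EF2293-EF2299 under d2*(3, 1), copied from Table 2.5.
d2star_3_1 = [688, 222, 109, 313, 665, 702, 477]
summary = (median(d2star_3_1), round(mean(d2star_3_1), 1))
```

Here `summary` holds the median and mean of the seven ranks, which can be compared with the last two rows of the table.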
2.4 Discussion
Kmer-based alignment-free methods have been used to detect horizontal gene transfers in
bacterial genomes [44, 49, 50]. There are a number of advantages of kmer-based methods
over other alignment-free methods or alignment-based methods. First of all, kmer-based
methods are time-efficient and memory-friendly because they avoid alignment and topological
data analysis. Secondly, kmer-based methods do not rely on phylogenetic relationships
among multiple organisms, which enables them to detect HGTs from a single unannotated
genome. In addition, kmer-based methods are able to detect HGTs in both coding and
non-coding regions.
In this study, we investigated the potential of using recently developed alignment-free
sequence comparison statistics that adjust for the background word frequencies, in
particular CVTree, $d_2^*$ and $d_2^S$, for horizontal gene transfer detection. Although many
composition-based methods have been used for HGT detection, to the best of our knowledge,
the background-adjusted statistics have not been used for HGT detection.
We first generated simulated artificial genomes with HGT by using E. coli K12 as
the host genome and inserting sequences uniformly chosen from other genomes into it.
We then evaluated the performance of kmer-based alignment-free methods with different
distance measures, kmer lengths and Markov orders on HGT detection in the artificial genomes.
Based on the results, we reduced our set to CVTree (k = 3), CVTree (k = 4), $d_2^*$ (k = 3, m = 1),
$d_2^*$ (k = 4, m = 1), Ma (k = 5), Eu (k = 5), and $d_2$ (k = 5) for more detailed comparisons, including
the influence of different factors and their performance on real data sets.
In conclusion, we evaluated the performance of kmer-based alignment-free methods
with different dissimilarity measures, kmer lengths and Markov orders on both artificial
genomes and real data sets. Our results suggest that the background-adjusted dissimilarity
measures, CVTree, $d_2^*$ and $d_2^S$, generally perform better than the non-background-adjusted
measures based on Euclidean and Manhattan distances or $d_2$. In terms of word length,
k = 3 or k = 4 seems to perform well in both our simulation and real data analysis.
Although kmer-based alignment-free methods for HGT detection are more time- and
memory-efficient than alignment-based methods and do not depend on genome annotation
or an evolutionary tree, they also have limitations. First, their performance depends on
the evolutionary distance between host and donor genomes. Our study showed that
alignment-free methods are suitable for HGT detection when the host and donor genomes
differ at the order level. In addition, the size of the sliding window is the smallest length
of HGT that can be detected by the kmer-based alignment-free methods, so they are not
suitable for identifying HGT smaller than 5 kbp. Furthermore, they are unlikely to detect
HGT that occurred in the very distant past, as sequences transferred from the donor genome
ameliorate over time to reflect the DNA composition of the host genome [48]. Finally,
the detected atypical regions could have other explanations. For example,
rRNA regions can have their own genomic signatures [49, 53], which differ from the host
signature, but this does not imply that they are horizontally transferred.
Therefore, alignment-free methods are not intended to replace alignment-based methods
in all cases. Instead, the two are complementary: each has unique advantages in different
scenarios, and they also tend to find complementary sets of HGT regions [54]. Alignment-free
methods are preferred when no evolutionary trees are available or genomes are not
annotated, which is common in many studies. The findings of our study suggest that CVTree
with word length 3, $d_2^*$ with word length 3 and Markov order 1, and $d_2^*$ with word length 4
and Markov order 1 perform well in most situations.
Chapter 3
Alignment-free Genome
Comparison Enables Accurate
Geographic Sourcing of White Oak
DNA
3.1 Introduction
The annual trade in natural resources represents $3.35T of global imports and $3.25T of
global exports [55], and it supports the world's supply of food, building materials, and
fiber. Disturbingly, Global Financial Integrity estimates the annual, transnational illegal
trade in natural resources to be valued at $90B-$276B [56]. Illegal trade in forest products
is the single largest component of illegal trade, accounting for approximately 50% of estimated
annual losses. Illegal logging contributes significantly to deforestation and forest degradation, and
these have cascading impacts on natural resource conservation, global biodiversity, climate
change mitigation, and the economic health of billions of people [57]. To mitigate trafficking
of illegally sourced wood, the United States (2008), European Union (2010) and Australia
(2012) adopted regulations that prohibit the import, export, transport, purchase or sale
of illegally harvested timber and plant products. These regulations can impose civil and
criminal penalties on buyers and suppliers of wood products who fail to adopt "due care"
controls. A key component of due care is that wood or wood products entering or exiting
the U.S. must declare the scientific name and geographic source of the wood. Despite this
requirement, mislabeling and document falsification are widespread because few methods
are available to validate these declarations [58].
Historically, verification of wood has been accomplished using features such as density,
scent, cellular composition, and vessel distribution [59]. This approach is rapid, but generally
incapable of identifying trees to species or predicting their geographic origin [60, 61].
Chemical [62, 63] and genetic [64, 65] approaches are increasingly used to provide more
accurate species identifications [58], but determining geographic origin continues to be a
daunting task [66, 67, 68, 69]. Here, we demonstrate an efficient use of next generation
sequencing (NGS) data to predict the geographic source of white oak species (Quercus subg.
Quercus). Unlike traditional genetic analysis, our approach uses whole genome DNA
sequence data without a priori selection of marker loci. This work extends studies showing
that background-adjusted alignment-free sequence comparison measures (CVTree [6]; $d_2^*$
and $d_2^S$ [7, 70, 16, 25]) offer improvements over other comparison measures (Euclidean,
Manhattan, and $d_2$ distances [31, 32, 15]) for the comparison of molecular sequences. We chose
white oaks for this analysis for three reasons: white oaks include hardwood species with
the highest export volume from the U.S. [71]; they cannot be readily discriminated using
wood anatomy; and at least one species (Q. mongolica from Russia) is protected by the
Convention on International Trade in Endangered Species of Wild Fauna and Flora and
was the focus of a U.S. Lacey Act conviction [72].
Using NGS data from 92 white oaks from North America (NA), Europe (EU), and
Asia (AS), we show that, for each sample, the two most similar white oak trees according
to the $d_2^*$ dissimilarity measure are from the same geographic provenance, even with small
sequencing quantities (e.g., 50 Mbp). Finally, we show that K-nearest neighbors (KNN)
classification yields close to 100% classification accuracy of geographic provenance, even
with data generated from different sequencing platforms and genome reduction methods.
Our study demonstrates that the continental origin of trees can be accurately predicted using
KNN coupled with the $d_2^*$ dissimilarity, and that the method offers a simple and unified approach
for geographic and taxonomic identification that can be applied to any biological sample.
3.2 Materials and Methods
3.2.1 NGS data from white oak samples
NGS whole genome shotgun (WGS) sequencing data of 99 white oaks from North America,
Europe and Asia were downloaded from NCBI BioProject PRJNA269970. The sequence
data for these samples, which were previously published [69], were derived from leaf tissue
of specimens collected in the field. Four samples (SRR2053123, SRR2053080,
SRR2053066, SRR2053060) had less than 8 Mbp of sequence data and were discarded
due to insufficient data. Two-dimensional PCoA of the remaining 95 tree samples based on all six
dissimilarity measures identified three samples (SRR2053124 [Q. robur], SRR2053125 [Q.
robur], SRR2053082 [Q. dentata]) as extreme outliers (Figure S1). These samples also
had low sequence yields (237 Mbp, 274 Mbp, 473 Mbp), which could be indicative of poor
library quality; for this reason, these samples were also removed from analysis, leaving 92
samples with sequence yields in the range of 360 Mbp to 1765 Mbp.
Mean dissimilarity measures used in this study are weakly and inversely correlated with
sequence quantity. To reduce confounding effects caused by different sequence quantities,
we down-sampled data for all 92 samples to produce three different datasets. Two datasets
consisted of random samples of reads totaling 50 Mbp and 100 Mbp for each sample,
respectively. The third consisted of reads totaling 300 Mbp. All samples were divided
into three geographic categories based on their continental origin. Samples from the United
States and Canada were categorized as North America (NA), samples from west of 60°E
longitude were categorized as Europe (EU), and samples from east of 60°E longitude were
categorized as Asia (AS).
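The down-sampling and labelling steps can be sketched as below. This is only an illustration under stated assumptions: the text does not say how reads were drawn, so `downsample_reads` simply shuffles the reads and keeps them until the target number of bases is reached, and `continent` is a hypothetical helper that assigns a sample at exactly 60°E to Asia, a case the text leaves open.

```python
import random


def downsample_reads(reads, target_bp, seed=0):
    """Return a random subset of reads whose total length first reaches target_bp."""
    rng = random.Random(seed)
    order = list(range(len(reads)))
    rng.shuffle(order)
    picked, total = [], 0
    for i in order:
        if total >= target_bp:
            break
        picked.append(reads[i])
        total += len(reads[i])
    return picked


def continent(country, longitude_east):
    """Assign NA / EU / AS following the rule in the text (60 degrees E boundary)."""
    if country in {"United States", "Canada"}:
        return "NA"
    # Assumption: a sample at exactly 60 degrees E is treated as Asia.
    return "EU" if longitude_east < 60.0 else "AS"


# Toy usage: 1000 reads of 100 bp, down-sampled to roughly 5 kbp.
toy_reads = ["ACGT" * 25] * 1000
subset = downsample_reads(toy_reads, 5000)
```

For the quantities used in the study, `target_bp` would be 50, 100 or 300 million bases per sample.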
3.2.2 Dissimilarity measures between genomes based on NGS data
We used six alignment-free distance/dissimilarity measures based on the relative frequencies
of k-mers (k-grams, k-tuples, k-words) to compare any pair of samples. These are
the traditional Manhattan, Euclidean, and $d_2$ [30] distances, and three recently developed
background-adjusted dissimilarity measures: CVTree [6], $d_2^*$ and $d_2^S$ [7, 70, 16]. The
background-adjusted dissimilarity measures are based on a model of the background
DNA sequence using an m-th order Markov chain, with m estimated using the
method developed for NGS short read data [18]. For the available white oak NGS data,
m = 10. Previous studies showed that $d_2^*$ and $d_2^S$ performed well when k = m + 2 [70]. Therefore,
we used k = 12 and Markov order m = 10 to calculate the dissimilarity between pairs of
samples. For comparison, we also used k = 12 in the calculation of the traditional Manhattan,
Euclidean and $d_2$ distances. All calculations of the pairwise dissimilarity values were
carried out using the software package CAFE [32], a user-friendly and efficient package for
calculating 28 alignment-free sequence dissimilarity measures.
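To make the background adjustment concrete, the following sketch computes a $d_2^*$-style dissimilarity for two short sequences. It is not the CAFE implementation. It assumes the commonly cited normalized form, in which observed k-mer counts are centred by their expected counts under an order-m Markov chain fitted to each sequence and the result is rescaled to lie between 0 and 1; the exact definition in [7, 70, 16] should be checked before relying on it. The function names and toy sequences are hypothetical, and the sketch works on whole sequences rather than on NGS reads.

```python
from collections import Counter
from itertools import product
from math import sqrt


def count_words(seq, k):
    """Count all overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_counts(seq, k, m):
    """Expected k-mer counts under an order-m Markov chain fitted to seq."""
    n_words = len(seq) - k + 1
    c_m = count_words(seq, m)        # m-mer counts, used for the initial state
    c_m1 = count_words(seq, m + 1)   # (m+1)-mer counts, used for transitions
    total_m = sum(c_m.values())
    context = Counter()              # how often each m-mer is followed by a base
    for word, c in c_m1.items():
        context[word[:m]] += c
    expected = {}
    for letters in product("ACGT", repeat=k):
        w = "".join(letters)
        p = c_m[w[:m]] / total_m
        for i in range(m, k):
            if context[w[i - m:i]] == 0:
                p = 0.0
                break
            p *= c_m1[w[i - m:i + 1]] / context[w[i - m:i]]
        expected[w] = n_words * p
    return expected


def d2_star(seq_x, seq_y, k, m):
    """d2*-style dissimilarity in [0, 1] (assumed normalized form, see lead-in)."""
    cx, cy = count_words(seq_x, k), count_words(seq_y, k)
    ex, ey = expected_counts(seq_x, k, m), expected_counts(seq_y, k, m)
    num = sx = sy = 0.0
    for w in ex:
        scale = sqrt(ex[w] * ey[w])
        if scale == 0.0:
            continue  # skip words with zero expectation in either sequence
        xt, yt = cx[w] - ex[w], cy[w] - ey[w]
        num += xt * yt / scale
        sx += xt * xt / scale
        sy += yt * yt / scale
    return 0.5 * (1.0 - num / (sqrt(sx) * sqrt(sy)))


# Toy sequences (hypothetical); the study itself uses k = 12 and m = 10.
x = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCGATCGATTAGC"
y = "TTTTACGGGGCCCCAAAATTTGGGCCCAAATTGGCCAATGCATGC"
```

Calling `d2_star(x, y, 3, 1)` gives a value between 0 and 1, and comparing a sequence with itself gives 0 up to rounding. The sketch needs k of at least m + 2, since with k = m + 1 the fitted model reproduces the observed counts almost exactly and the centred counts vanish.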
3.2.3 Circular plots and principal coordinate analysis
For each sample, we found the most similar samples to it and linked them using the circular
visualization tool [73], based upon each of the six pairwise distance/dissimilarity measures.
Of the six dissimilarity measures, the circular plots of the 92 samples show that the most
similar samples are from the same continent-of-origin using the $d_2^*$ and $d_2^S$ dissimilarities,
while the others contain some mistakes. Since $d_2^*$ is simpler to calculate than $d_2^S$, and $d_2^S$ is more
sensitive to sequencing platforms, we focused on $d_2^*$ for the remaining studies. Pairwise
dissimilarities among the 92 white oak samples (16 samples from Asia, 33 samples from
Europe, 43 samples from North America) at 50, 100, and 300 Mbp were used for principal
coordinate analysis using R.
3.2.4 Intra- and inter-continental $d_2^*$ dissimilarity distributions
For each quantity of sequence (50, 100 and 300 Mbp), we contrasted pairwise
intra-continental $d_2^*$ dissimilarities with pairwise inter-continental dissimilarities. The hypothesis
that intra-continental dissimilarities should be lower than inter-continental dissimilarities
was tested with the Wilcoxon-Mann-Whitney (WMW) test statistic. To obtain a p-value,
we permuted the continental labels of the tree samples $10^7$ times and then compared the
intra- with inter-continental $d_2^*$ dissimilarities using the WMW statistic for the permuted
samples. We approximated the p-value by the fraction of times that the WMW values for
the permuted data were higher than that for the original labelled data.
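The permutation procedure described above can be sketched as follows. This is a minimal illustration under assumptions: the WMW statistic is taken to be the Mann-Whitney U count of (intra, inter) pairs in which the inter-continental value is larger, ties count one half, and the function names, toy distance matrix and default `n_perm` are hypothetical (the study used $10^7$ permutations).

```python
import random
from itertools import combinations


def wmw_u(intra, inter):
    """Mann-Whitney U: number of (intra, inter) pairs with inter > intra; ties count 0.5."""
    u = 0.0
    for a in intra:
        for b in inter:
            if b > a:
                u += 1.0
            elif b == a:
                u += 0.5
    return u


def split_pairs(dist, labels):
    """Split pairwise dissimilarities into intra- and inter-continental lists."""
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        (intra if labels[i] == labels[j] else inter).append(dist[i][j])
    return intra, inter


def permutation_p_value(dist, labels, n_perm=1000, seed=0):
    """Fraction of label permutations whose U exceeds the observed U."""
    rng = random.Random(seed)
    observed = wmw_u(*split_pairs(dist, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if wmw_u(*split_pairs(dist, shuffled)) > observed:
            hits += 1
    return hits / n_perm


# Toy data (hypothetical): two well-separated groups on a line.
positions = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
toy_labels = ["NA", "NA", "NA", "EU", "EU", "EU"]
p = permutation_p_value(toy_dist, toy_labels, n_perm=200)
```

With the strict "higher than" comparison stated in the text, a data set in which no permutation beats the observed statistic yields an estimate of 0, which would be reported as a bound such as p < 1/n_perm.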
3.2.5 Continental origin prediction by KNN and $d_2^*$
A K-nearest neighbors (KNN) algorithm was used to predict the continental origins of
white oaks. For each quantity of sequence (50, 100 and 300 Mbp), samples were randomly
divided into training and test data sets, with the training set making up 91, 75, 60, 45,
30 or 15 of the total 92 samples. For each sample in the test set, we found its K nearest
neighbors in the training set, as measured by $d_2^*$, and predicted its continental origin by a
majority vote. One hundred distinct splits of the data into training and test data sets were
constructed, for each of which the origin was predicted for a range of K from 1 to 10.
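The KNN step with a precomputed dissimilarity matrix can be sketched as below. This is an illustration, not the study's code: `knn_predict` and `evaluate_split` are hypothetical names, and the tie-breaking rule (prefer the tied label whose nearest member is closest) is an assumption, since the text only specifies a majority vote.

```python
import random
from collections import Counter


def knn_predict(dists_to_train, train_labels, k):
    """Predict a label by majority vote among the k nearest training samples."""
    order = sorted(range(len(train_labels)), key=lambda i: dists_to_train[i])
    nearest = order[:k]
    votes = Counter(train_labels[i] for i in nearest)
    best = max(votes.values())
    # Assumed tie-break: among tied labels, take the one with the closest neighbor.
    for i in nearest:
        if votes[train_labels[i]] == best:
            return train_labels[i]


def evaluate_split(dist, labels, n_train, k, rng):
    """Accuracy of KNN on one random training/test split of a distance matrix."""
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    train, test = idx[:n_train], idx[n_train:]
    train_labels = [labels[i] for i in train]
    correct = sum(
        knn_predict([dist[t][i] for i in train], train_labels, k) == labels[t]
        for t in test
    )
    return correct / len(test)


# Toy data (hypothetical): two well-separated groups on a line.
positions = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
toy_labels = ["NA", "NA", "NA", "EU", "EU", "EU"]
accuracy = evaluate_split(toy_dist, toy_labels, n_train=4, k=1, rng=random.Random(0))
```

Averaging `evaluate_split` over 100 random splits for each training size and each K from 1 to 10 mirrors the evaluation described above. Under the assumed tie-break, K = 2 always returns the same label as K = 1.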
To investigate the effects of sequencing error on the prediction accuracy of KNN, we
randomly mutated the sequences by altering individual nucleotides at a rate of 5%; erroneous
bases were selected with equal probability without regard to transition/transversion
bias or regional nucleotide composition. We recalculated $d_2^*$ dissimilarities between test
samples and training samples and then calculated the prediction accuracy, and repeated
the process of evaluating the KNN prediction accuracy 100 times. We then compared
the average KNN prediction accuracy with simulated errors to the KNN prediction accuracy
without simulated errors; both cases included the non-zero background of naturally
occurring errors.
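The error model described above can be sketched as a per-base substitution. This is a minimal illustration, assuming each base is altered independently with probability 5% and replaced by one of the other three nucleotides chosen uniformly; the function name `mutate` and the toy read are hypothetical.

```python
import random


def mutate(seq, rate=0.05, seed=0):
    """Substitute each base with probability `rate` by a different base chosen uniformly."""
    rng = random.Random(seed)
    out = []
    for base in seq:
        if rng.random() < rate:
            # Equal probability for every alternative base, ignoring
            # transition/transversion bias and local composition.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


# Toy usage on a hypothetical 20 kbp sequence.
original = "ACGT" * 5000
noisy = mutate(original, rate=0.05, seed=1)
observed_rate = sum(a != b for a, b in zip(original, noisy)) / len(original)
```

Because the replacement base always differs from the original, the observed substitution rate is close to the requested rate for long sequences.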
3.2.6 Effect of laboratory and sequencer error on accuracy
To test whether our computational method of predicting continental origin is sensitive to variation
in NGS library construction method or sequencing platform, we assembled data from other
genomics studies of white oaks that were unrelated to the training data set, and used the
data to predict continent-of-origin for each sample. For each of these comparisons, we
randomly chose 100 Mbp from each data set, calculated the $d_2^*$ dissimilarity between these
datasets and the 92 samples in our reference data set, and used KNN to predict the
continental origins of the test samples. Three data sets were used and the characteristics
of all samples are given in Table S1.
(A) Total genomic data derived from one North American California Valley Oak (Q. lobata
Née), a white oak member of Sect. Quercus (https://www.ncbi.nlm.nih.gov/bioproject/308314;
https://valleyoak.ucla.edu) [74]. Samples were sequenced using Illumina HiSeq2500 with
different library preparation methods and read lengths than those used to construct our
reference library. These data allowed us to test whether different library construction
methods produce accurate geographic predictions.
(B) Total genomic data derived from one European Swiss Pedunculate Oak (Q. robur
L.), another white oak member of Sect. Quercus (https://www.ncbi.nlm.nih.gov/bioproject/327502).
DNA was isolated from leaves of two branches of a 234-year-old oak tree [75]. This project
includes 30 SRA experiments, 8 using the Illumina HiSeq 2000 (similar to our reference
data), and 22 using long single-molecule (2,489 bp to 7,622 bp) real-time sequencing from
the PacBio-SMRT platform. These data allow us to test whether different sequencing
platforms with different error profiles produce accurate geographic predictions.
(C) Targeted genomic data derived from a restriction site-associated DNA Sequencing
(RAD-Seq) [76] study of multiple white oaks [77]. This study used the restriction
enzyme PstI to selectively enrich genomic regions for targeted Illumina sequencing. For
our study, we predicted the continental origins of seven white oak samples based on RAD-Seq
data: North America (Q. bicolor: SRR5632514, Q. stellata: SRR5632513, Q. lobata:
SRR5632586), Europe (Q. robur: SRR5632600, Q. petraea: SRR5284338), and Asia (Q.
dentata: SRR5632587, Q. mongolica: SRR5284345). Importantly, two of these RAD-Seq
samples were derived from the identical DNA preparation from single trees that were
used in our reference database of shotgun DNA sequences (Q. petraea: shotgun library
SRR2053073; Q. mongolica: shotgun library SRR2053072) [69, 78].
3.3 Results
3.3.1 Genomic dissimilarity analyses based on $d_2^*$ resolve oak geographic
origins
We used six alignment-free distance/dissimilarity measures (Manhattan, Euclidean, $d_2$ [30],
CVTree [6], $d_2^*$ and $d_2^S$ [7, 70, 16]) based on the relative frequencies of k-mers to calculate
pairwise distances of white oak tree samples based on DNA samples of 50, 100 and 300
Mbp. Figure 3.1 shows the circular plots [73] of the oak trees at a sequencing quantity of
100 Mbp using the six dissimilarity measures (circular plots at sequencing quantities of
50 Mbp and 300 Mbp are shown in Figure S2). In each plot, the most similar sample to
each of the reference specimens is linked. Of the six dissimilarity measures, only $d_2^*$ and $d_2^S$
showed 100% accuracy in linking a sample to its continent-of-origin.
Principal coordinate analysis of the pairwise $d_2^*$ dissimilarities among 92 samples at
sequencing quantities of 50, 100 and 300 Mbp showed that the first three principal
coordinates accounted for 25% of variance in dissimilarities. Samples could be separated
into three distinct groups corresponding to their continental origins (Figure 3.2) using the
first three principal coordinates. The first principal coordinate separates all samples from
different primary continents, i.e., North America and Europe/Asia, and the third principal
coordinate separates samples from Europe (EU) versus Asia (AS). Although the second
principal coordinate of the AS tree samples is generally larger than that of the EU tree
samples, it does not completely separate the EU tree samples from the AS tree samples.
The latitudes of the tree samples are not strongly associated with the first three principal
coordinates (Figure S3).
[Figure 3.1 panels: (a) $d_2^*$, (b) $d_2$, (c) Euclidean, (d) $d_2^S$, (e) CVTree, (f) Manhattan. Each panel is a circular plot whose rim lists the sample labels 2053033 to 2053131; legend: NA, AS, EU.]
Figure 3.1: The circular plots of 92 white oak tree samples based on the six dissimilarity
measures: $d_2^*$, $d_2^S$, $d_2$, CVTree, Euclidean, and Manhattan, using 100 Mbp of next generation
sequencing data. Different sectors correspond to different continents, with NA in red,
EU in orange and AS in blue. Within each sector, samples are sorted by their longitude,
so that samples that are geographically close are also close to each other in the figure. The
most similar tree sample to each sample is linked. The k-mer length is 12 and the Markov
order of the background sequence is 10 for $d_2^*$, $d_2^S$ and CVTree. The most similar sample
to each sample according to $d_2^*$ and $d_2^S$ is from the same continent-of-origin.
[Figure 3.2 panels: (a) samples of 50 Mbp, (b) samples of 100 Mbp, (c) samples of 300 Mbp. Axes: PCoA coordinates 1, 2 and 3; legend: AS, NA, EU.]
Figure 3.2: The principal coordinate analysis (PCoA) plots of 92 white oak tree samples based
on the $d_2^*$ dissimilarity values using 50, 100 and 300 Mbp of next generation sequencing
data. The k-mer length is 12 and the Markov order of the background sequence is 10.
With sequence quantities of 100 and 300 Mbp, the Europe (EU) and Asia (AS) samples are
clearly separated in the PCoA plots. Four western North America samples separate from
the other eastern North America samples.
3.3.2 Mean $d_2^*$ is smaller within continents than among continents
The distributions of $d_2^*$ dissimilarity values were compared for white oaks within and
among continents across all samples, and at three sequencing quantities (100 Mbp in Figure
3.3, 50 and 300 Mbp in Figure S4). Mean pairwise dissimilarities within a continent
are significantly smaller than dissimilarities among different continents (Wilcoxon-Mann-Whitney
test; p < 0.001). Within continents, white oak samples from EU show the highest
similarity, followed by white oaks from AS; white oaks from NA showed the greatest average
within-continent divergence. Among-continent comparisons showed that EU and AS have
the highest similarity, and that white oaks from NA are almost equally dissimilar to EU
and AS white oaks; these dissimilarities mirror the chloroplast genome-based phylogenetic
estimates for these same taxa [78]. Our observations suggest that the continental origin of
white oak samples can be predicted by KNN, in which the continent-of-origin for a sample
is predicted as the continent containing the closest neighbors based on $d_2^*$.
3.3.3 KNN predictions are robust to multiple sources of errors
Based on the $d_2^*$ dissimilarity measure, we used KNN to build a predictive model for the
continental origin of tree samples. Table 3.1 shows the mean prediction accuracy using
KNN over 100 training and test data sets for different sequencing quantities, values of K,
and sizes of training data. In all cases, K = 1 and K = 2 have similarly high prediction
accuracies. The prediction accuracy of KNN increases with sequencing quantity and the
size of the training set. For example, when K = 1 and the training size is at least 75 reference
samples, KNN prediction accuracy can reach 100%, even when the quantity of sequence is
as low as 50 Mbp. With only 15 reference samples, KNN prediction accuracy ranges from
89% (50 Mbp) to 96% (300 Mbp).
Table 3.2 shows the average prediction accuracy of KNN with 5% additional simulated
[Figure 3.3 panels: (a) AS_AS, AS_EU, AS_NA; (b) EU_EU, AS_EU, NA_EU; (c) NA_NA, NA_EU, AS_NA. Horizontal axis: Region; vertical axis: Distance (0.1 to 0.4). Panels are annotated with permutation p-values (p=0.001, p=7.1e-6, p=1e-7 and p<1e-7) and pair counts (N=120, 528, 688, 903, 1419).]
Figure 3.3: Comparison of intra- and inter-continental $d_2^*$ dissimilarities with sequence
quantity of 100 Mbp. The k-mer length is 12 and the Markov order of the background
sequence is 10. The p-values were calculated based on the Wilcoxon-Mann-Whitney test
statistic and by permuting the continental labels of the white oak tree samples $10^7$ times.
The inter-continental dissimilarities are significantly higher than the intra-continental $d_2^*$
dissimilarities.
Test size Training size K=1 K=2 K=3 K=4 K=5 K=6 K=7 K=8 K=9 K=10
Samples of 50 MBases
1 91 1.00 1.00 1.00 1.00 1.00 1.00 0.93 0.97 0.88 0.94
17 75 1.00 1.00 0.99 0.99 0.97 0.97 0.95 0.96 0.94 0.96
32 60 0.99 0.99 0.97 0.97 0.94 0.95 0.93 0.95 0.93 0.95
47 45 0.98 0.98 0.95 0.96 0.93 0.94 0.93 0.95 0.91 0.93
62 30 0.95 0.95 0.92 0.93 0.90 0.93 0.90 0.92 0.89 0.91
77 15 0.89 0.89 0.85 0.87 0.81 0.82 0.79 0.77 0.73 0.67
Samples of 100 MBases
1 91 1.00 1.00 1.00 1.00 1.00 1.00 0.98 1.00 0.96 1.00
17 75 1.00 1.00 0.99 0.99 0.98 0.98 0.97 0.99 0.97 0.99
32 60 1.00 1.00 0.98 0.99 0.97 0.98 0.96 0.97 0.95 0.96
47 45 0.99 0.99 0.97 0.98 0.96 0.97 0.95 0.97 0.95 0.97
62 30 0.98 0.98 0.95 0.96 0.94 0.96 0.93 0.95 0.90 0.91
77 15 0.93 0.93 0.90 0.91 0.86 0.84 0.81 0.78 0.70 0.67
Samples of 300 MBases
1 91 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
17 75 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 1.00
32 60 1.00 1.00 0.99 1.00 0.99 1.00 0.98 0.99 0.98 0.99
47 45 1.00 1.00 0.99 0.99 0.98 0.99 0.98 0.99 0.97 0.98
62 30 0.99 0.99 0.97 0.98 0.96 0.97 0.94 0.95 0.91 0.92
77 15 0.96 0.96 0.92 0.93 0.86 0.86 0.81 0.79 0.74 0.71
Table 3.1: KNN accuracy on test data for different sample sizes, test sizes, training sizes
and different numbers of neighbors K used.
Test size Training size K=1 K=2 K=3 K=4 K=5 K=6 K=7 K=8 K=9 K=10
Samples of 50 MBases
1 91 0.93 0.93 0.90 0.92 0.90 0.91 0.84 0.85 0.71 0.81
17 75 0.88 0.88 0.84 0.87 0.82 0.83 0.76 0.79 0.71 0.78
32 60 0.86 0.86 0.80 0.83 0.78 0.79 0.74 0.79 0.75 0.82
47 45 0.80 0.80 0.73 0.76 0.69 0.74 0.71 0.76 0.73 0.80
62 30 0.77 0.77 0.68 0.75 0.71 0.77 0.74 0.79 0.78 0.81
77 15 0.66 0.66 0.64 0.68 0.70 0.74 0.74 0.73 0.69 0.67
Samples of 100 MBases
1 91 1.00 1.00 0.98 1.00 1.00 1.00 0.99 0.99 0.82 0.92
17 75 0.99 0.99 0.96 0.98 0.93 0.93 0.86 0.90 0.82 0.88
32 60 0.96 0.96 0.92 0.94 0.87 0.89 0.84 0.87 0.83 0.88
47 45 0.93 0.93 0.87 0.90 0.83 0.87 0.83 0.88 0.83 0.88
62 30 0.86 0.86 0.79 0.84 0.78 0.83 0.80 0.84 0.82 0.85
77 15 0.77 0.77 0.72 0.76 0.73 0.75 0.73 0.72 0.68 0.65
Samples of 300 MBases
1 91 1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.99 0.95 0.98
17 75 1.00 1.00 1.00 1.00 0.98 0.99 0.95 0.97 0.93 0.95
32 60 0.99 0.99 0.97 0.98 0.94 0.95 0.92 0.95 0.93 0.95
47 45 0.98 0.98 0.94 0.95 0.92 0.93 0.91 0.94 0.92 0.95
62 30 0.95 0.95 0.90 0.93 0.90 0.93 0.89 0.91 0.87 0.89
77 15 0.88 0.88 0.84 0.86 0.81 0.82 0.78 0.76 0.72 0.70
Table 3.2: KNN accuracy on test data with 5% simulated sequencing error for different
sample sizes, test sizes, training sizes and different numbers of neighbors.
sequencing error in test data for different sequence quantities, values of K, and sizes of
training data. Although prediction accuracy decreases when sequencing errors are added,
it can still reach 100% at a sequencing quantity of 100 Mbp with a training
set of 91 samples. At a sequencing quantity of 300 Mbp, the prediction accuracy can reach
100% with a training set of 75 samples. For most modern sequencers, the per-position sequencing
error rate is much lower than 5%; for example, the sequencing error rate for Illumina is
about 0.1% [79]. Our results show that KNN can predict the continental origins of oaks
based on $d_2^*$ dissimilarity values with high accuracy, and that the prediction accuracy is robust to
sequencing errors if the sequencing quantity in the training data set is at least 100 Mbp.
3.3.4 KNN predictions are robust to sequencing technologies
We next applied the KNN approach to predict the continental origins of white oak sam-
ples from independent laboratories using a) dierent Illumina sequencing platforms, b)
various short- and long- reads sequencing technologies, and c) RAD sequencing. The re-
sults are summarized in Figure 3.4. The corresponding gures using d
s
2
and Manhattan
dissimilarity measures are shown in Figure S5. The rst example is a white oak sample
from NCBI using Illumina NGS data produced by independent laboratories (described in
Methods). For all 11 NGS data sets from the Californian Valley Oak genome project, the
most similar sequence in our training set was a tree from the same species (Quercus lobata;
SRR2053043), also from California. For these libraries, the second most similar sequence
from our training set came from a phylogenetically closely-related species from a proximal
geographic region in western North America (Q. garryana, SRR2053062; Oregon, USA)
[78]. For all the 11 data sets, the top 20 most closely related samples were all from NA.
Therefore, KNN with K = 1 to 20 can accurately predict the continental origins of the
tree, irrespective of the library preparation methods and Illumina sequencing technologies
with different read lengths (e.g., 100 bp single-end vs. 150 bp paired-end).
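The KNN step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `knn_predict`, the reference IDs, the dissimilarity values, and the tie-breaking rule (prefer the label of the closest reference among tied labels) are hypothetical and are not taken from the study.

```python
from collections import Counter

def knn_predict(dissimilarities, labels, k):
    """Predict a label by majority vote among the k nearest references.

    dissimilarities: dict mapping reference ID -> dissimilarity to the query
    labels: dict mapping reference ID -> continent label ("NA", "EU", "AS")
    """
    # Sort reference IDs from most similar (smallest dissimilarity) to least.
    ranked = sorted(dissimilarities, key=dissimilarities.get)
    votes = Counter(labels[ref] for ref in ranked[:k])
    # Assumed tie-break: among labels with the most votes, return the label
    # of the reference that is closest to the query.
    top = max(votes.values())
    for ref in ranked[:k]:
        if votes[labels[ref]] == top:
            return labels[ref]

# Hypothetical toy values, not taken from the study.
dissim = {"ref1": 0.23, "ref2": 0.24, "ref3": 0.31, "ref4": 0.40}
continent = {"ref1": "EU", "ref2": "EU", "ref3": "AS", "ref4": "NA"}
print(knn_predict(dissim, continent, k=3))
```

Because only the ranking of the dissimilarities matters, the same function can be applied unchanged to any of the dissimilarity measures compared in this chapter.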
The second example is a white oak sample from NCBI that was sequenced using a
mix of short- and long-read sequencing technologies (described in Methods). For the eight
Illumina data sets from the Pedunculate Oak genome project (Q. robur; provenance near
Lausanne, Switzerland), the smallest $d_2^*$ dissimilarity between each data set and the 92
training samples ranged from 0.23 to 0.24. The two most similar sequences in our training
set were from a closely related European species (Q. petraea; SRR2053113, SRR2053111)
of German provenance. For these libraries, the five most similar samples included Q. robur
and Q. petraea, all from EU. Therefore, we can make accurate predictions for
continent-of-origin based on KNN with K = 1 to 5. This genome project also used long-read PacBio
sequencing [80], a method that shows a higher per-position error rate of 11% - 15%. The
smallest $d_2^*$ dissimilarity values between the Pedunculate Oak genome PacBio sequence
and the training data were 0.42, indicating substantial differences attributable to the
different sequencing platforms. Despite these large dissimilarities, the most similar training
samples still included Q. robur and Q. petraea from EU. Therefore, if we use 1NN as
the classifier with $d_2^*$ dissimilarities, the prediction accuracy is 100% irrespective of the
sequencing platform used. In only two cases were the second most similar samples from
AS.
Genome-reduction methods are increasingly used in population genomic analysis, so we
tested KNN classification for continent-of-origin using seven white oak samples sequenced
with the RAD-Seq technique [76], including 2 from AS, 2 from EU, and 3 from NA [77].
These samples were highly divergent from our 92 white oak reference samples, with $d_2^*$
dissimilarities ranging between 0.483 and 0.491. In two cases, identical DNAs were compared
by RAD-Seq and shotgun sequencing: these include Q. mongolica ($d_2^*$ = 0.488) and
Q. petraea ($d_2^*$ = 0.485). Overall, we found that the top two highest similarity comparisons for all
Figure 3.4: Circular plots for independent samples sequenced using a) Illumina NGS of
a California Valley Oak tree, b) a mixture of short and long reads from both Illumina
and PacBio sequencing of the Pedunculate Oak tree, and c) RAD-Seq of seven diverse tree
samples. The $d_2^*$ dissimilarity between each independent sample and each of the 92
reference samples was calculated, and the two most similar reference samples are linked.
Legend: NA, AS, EU.
RAD-Seq samples were reference sequences from the correct continental origin, indicating
100% prediction accuracy. For the identical DNAs sampled by two sequencing methods,
RAD-Seq samples did not show minimum $d_2^*$ dissimilarity with their corresponding shotgun
Illumina samples, but were instead ranked 5th (Q. mongolica) and 6th (Q. petraea)
among the 92 pairwise comparisons. These results indicate that KNN classification of $d_2^*$
dissimilarities lacks the specificity required for individual identification for samples obtained
using different genome sampling methodologies, but that geographic prediction accuracy
is sufficiently high for continent-of-origin prediction in white oaks.
3.3.5 Assigning confidence to the predicted continental origins
The predicted continent of origin of a wood sample depends on the reference samples and
the NGS data. To evaluate the influence of reference sample composition and of read
sampling on prediction accuracy, we defined reference-confidence (RC) by random sampling
of the references and NGS-confidence (NC) by random sampling of reads (see Methods section
for details). We
calculated the RC and the NC of the predicted continental origins of the California Valley
Oak tree and the Swiss Pedunculate Oak tree. For the 11 NGS data sets derived from the
California Valley Oak, all the RC and NC values indicate 100% accuracy for K = 1 to 10. Among
the 30 data sets derived from the Swiss Pedunculate Oak, 14 data sets have an RC of 100%
and 13 data sets have an RC value between 99 and 100%. Only three data sets, SRR3860432,
SRR3860434, and SRR3860435, have RC values of 66, 90, and 74%, respectively. All of these
data sets were predicted to come from EU/AS with 100% confidence. In terms of
prediction variation due to NGS read data, among the 207 non-overlapping data sets from
the 30 read sets, the predicted continental origin was EU for 205 sets and AS for 2 sets.
Therefore, the predictions were not substantially affected by the choice of
non-overlapping subsets of DNA sequences.
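A resampling-based confidence of this kind could be sketched as follows. The exact RC definition is given in the Methods section, which is not reproduced here, so everything in this block is an assumption for illustration: the function names, the subsampling fraction of 0.8, the 1000 repeats, and the rule that RC is the share of random reference subsets that reproduce the full-set prediction.

```python
import random
from collections import Counter

def knn_label(dissim, labels, refs, k):
    """Majority-vote label among the k references in `refs` closest to the query."""
    nearest = sorted(refs, key=lambda r: dissim[r])[:k]
    return Counter(labels[r] for r in nearest).most_common(1)[0][0]

def reference_confidence(dissim, labels, k, fraction=0.8, repeats=1000, seed=0):
    """Share of random reference subsets that reproduce the full-set prediction.

    `fraction`, `repeats` and `seed` are illustrative assumptions, not values
    from the Methods section.
    """
    rng = random.Random(seed)
    refs = sorted(dissim)
    full_prediction = knn_label(dissim, labels, refs, k)
    size = max(k, int(fraction * len(refs)))
    agree = 0
    for _ in range(repeats):
        subset = rng.sample(refs, size)  # sample references without replacement
        if knn_label(dissim, labels, subset, k) == full_prediction:
            agree += 1
    return full_prediction, agree / repeats

# Hypothetical toy values: the single closest reference is EU, the next two are AS.
dissim = {"e1": 0.23, "a1": 0.24, "a2": 0.25, "n1": 0.40, "n2": 0.50}
labels = {"e1": "EU", "a1": "AS", "a2": "AS", "n1": "NA", "n2": "NA"}
print(reference_confidence(dissim, labels, k=1))
```

In this toy setting the prediction flips to AS whenever the single EU reference is left out of the subset, which is the kind of dependence on reference composition that a confidence below 100% is meant to expose.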
3.4 Discussion
We evaluated six alignment-free sequence comparison dissimilarity measures for predicting
continent-of-origin based on NGS short read data from 92 white oak trees sampled in North
America, Europe, and Asia. The recently developed background-adjusted dissimilarity
measures $d_2^s$ and $d_2^*$ predicted the continent-of-origin with the highest accuracy, and
we explored prediction accuracies of $d_2^*$-based KNN classification for the continental origins
of white oak samples. We found that prediction accuracy reaches 100% with as little as 50
Mbp sequence data (< 1/10 the size of the white oak genome), small values of K (1 - 2),
and a modest training database of 75 samples. With a larger training database of 92 trees,
the prediction accuracy is 100% for 100 Mbp of sequence data and larger values of K (<
6). Although the prediction accuracy of KNN decreases with increasing sequencing error,
it can be as high as 100% with 5% additional errors over the
observed experimental errors, as long as the sequencing quantity is at least 100 Mbp. This
suggests that $d_2^*$-based classification is sufficiently accurate for portable nanopore-based
sequencers [81]. This would expand the utility of field-based DNA sequencing beyond simple
organisms with small genomes to organisms with larger genomes, and open new applications
for remote field-based studies that put DNA-based identification closer to supply regions
with the greatest risk of illegal harvesting.
To evaluate the applicability of $d_2^*$-based KNN prediction of continent-of-origin for oak
DNA sequence data from different library preparation methods and different sequencing
platforms, we predicted the continental origins of tree genome sequences obtained from
NCBI that were based on whole genome sequencing (Q. lobata from NA; Q. robur from
EU) and one genome reduction technique (RAD-Seq; seven trees from AS, EU and NA). We
found that different library preparation methods and laboratories had the smallest impact
on $d_2^*$ dissimilarity, that different sequencing platforms (Illumina versus PacBio) had a
larger effect, and that different genome sampling methods (shotgun sampling versus
RAD-Seq) had the largest effect. Surprisingly, KNN still predicted continental origins of oak
trees perfectly for all of these methodological permutations (laboratory; sequencer; genome
sampling), as long as the query tree sample NGS data was compared with reference tree
data derived from a single, accurate sequencing platform (Illumina, in our case). Although
technical and sampling errors in data acquired using PacBio or RAD-Seq are larger than
those in the reference data sets generated using Illumina sequencing, data from these
alternative methods still show smaller dissimilarities to the correct geographic assignment,
and this allows KNN to accurately predict the continental origins of tree samples.
White oaks are notorious for exhibiting high intraspecific variation, low reproductive
barriers, and genetic variation that transcends species boundaries [82, 83]. This makes
white oaks an exceptionally challenging group to classify based on DNA variation.
Evolutionary studies based on chloroplast and nuclear genome partitions have shown that the
combined influences of hybridization, geographic isolation, and evolutionary divergence
[78, 84, 85] have created a network of genealogies that cannot be translated into simple
classifications [69, 86] (e.g., DNA barcodes). The scale of `continent' is where phylogenetic
and geographic signals show the greatest congruence in white oaks [78], and this is the
signal we are able to capture with our alignment-free method. In this particular case, the
taxonomic identity of samples within the group of white oaks cannot be determined, but
the continental geographic origin can be determined with great accuracy. In less complex
biota than the genus Quercus, the $d_2^*$-based KNN prediction approach may be informative
at finer geographic scales (e.g., specific countries, provinces, or conservation reserves), and
it has the potential to be extended to determining taxonomic identity.
The fact that our approach works so well for differentiating white oaks, one of the most
complicated groups of trees, is a strong indication that it has wider utility for determining the
geographic origin and taxonomic identity of a broader array of biological samples. Not only
is our approach effective, it can also be implemented uniformly across any taxon that can
be sampled by shotgun sequencing or `genome skimming' [87]. The practical implications
of this are enormous, given the recent rapid growth in DNA sequencing capacity, as well
as the massive scale of commerce involving biological material and the high prevalence
of provenance and taxonomic mislabeling. The improvements in identification described
here can directly aid ongoing domestic and international efforts to improve legality, an
important facet of sustainability.
Chapter 4
Afann: bias adjustment for
alignment-free sequence
comparison based on sequencing
data using neural network
regression
4.1 Introduction
With the advent of next-generation sequencing (NGS) technologies, enormous amounts of
sequence data are emerging rapidly. Although alignment-based approaches for sequence
comparison are generally accurate and powerful, their applications are being challenged
by the size of sequence data that increases at an exponential rate. More importantly,
the application of alignment-based methods in NGS analysis could also be limited when
the sequencing depth is low so that assembled contigs might not share long homologous
regions that could be aligned. Throughout the paper, the sequencing depth (fold coverage)
is measured by the total number of sequenced bases divided by the genome length.
Therefore, alignment-free methods, alternatives to alignment-based methods, have recently
received increasing attention because they are generally more memory and time
efficient [6, 7, 8, 9, 10, 11, 12, 13, 14]. Moreover, alignment-free methods, especially
kmer-based approaches that use the frequencies of kmers (k-words or k-grams) for sequence
comparison, can be naturally adapted to shotgun NGS sequencing data without assembly
[15, 16, 9, 10, 17, 13, 14]. Recently, Zielezinski et al. [14] published a comprehensive
comparison of 74 alignment-free methods for five research applications including cis-regulatory
module detection, protein sequence classification, gene tree inference, genome-based
phylogeny and reconstruction of species trees under sequence rearrangements.
Based on the rationale that similar sequences share similar kmer frequency profiles,
also known as genomic signatures [29], kmer-based alignment-free methods first count the
number of occurrences of kmers along a sequence or in an NGS sample and characterize
each sequence or NGS sample as a feature vector of length $4^K$. Second, a transformation
can be applied to normalize the kmer count vector or to remove the random background
of kmer counts using a Markov model [6, 7]. Alignment-free methods that remove the
random background, such as CVTree [6], $d_2^s$ [7] and $d_2^*$ [7], are also known as
background-adjusted methods. In addition, dissimilarity measures such as Manhattan distance, Euclidean
distance, Mash (Jaccard distance) [10] and Cosine distance are used to compare any pair
of sequence-representing feature vectors.
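The counting and comparison steps can be illustrated with a minimal sketch. It covers only the simple, non-background-adjusted measures (Manhattan, Euclidean, cosine); the Markov background removal used by CVTree, $d_2^s$ and $d_2^*$ is deliberately left out. The function names are hypothetical, only the forward strand is counted, and the two input strings are the fictitious 12-bp genomes of Figure 4.1 (a).

```python
import math
from itertools import product

def kmer_frequencies(seq, k):
    """Return the normalized kmer frequency vector of length 4**k for `seq`."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:          # skip kmers containing N or other symbols
            counts[index[kmer]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

# The two fictitious 12-bp genomes of Figure 4.1 (a), forward strand only.
x = kmer_frequencies("AACTGACGTTAT", 2)
y = kmer_frequencies("AACTGACGTTGT", 2)
print(len(x), manhattan(x, y), euclidean(x, y), cosine_distance(x, y))
```

With K = 2 the vectors have 16 entries; at the K = 14 used later in this chapter a dense list of $4^{14}$ entries would be wasteful, and a sparse dictionary of observed kmers is the more practical representation.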
Since kmer frequency can be counted directly from raw NGS samples, kmer-based
alignment-free methods can be easily adapted to compare NGS samples without assembly.
This adaptation relies on a strong assumption that the sequence-representing feature vectors
of NGS samples can be used as alternatives to the sequence-representing feature vectors of
their genomes and thus the alignment-free dissimilarity calculated based on NGS samples
should be close to the dissimilarity calculated based on their genomes. While this assumption
is reasonable when sequencing depth is high because of the law of large numbers, it
can nevertheless be compromised by low sequencing depth, sequencing error and sequencing
bias. For example, for any alignment-free method, the dissimilarity between a genome
and itself should be 0 because their feature vectors are exactly the same, whereas
the dissimilarity between two NGS samples sampled from the same genome will be greater
than 0 since their feature vectors will differ due to the stochastic distribution of
reads along the genomes. Therefore, it is expected that the dissimilarity calculated based
on NGS samples will most likely be higher than the dissimilarity calculated based
on their genomes and that the overestimation will increase as the sequencing depth decreases,
which has also been revealed in several studies based on various alignment-free methods
[9, 17, 13]. This bias, which refers to the overestimated dissimilarity based on NGS samples,
is a common problem for all alignment-free methods since it results from the intrinsic
stochastic distribution of short reads regardless of the choice of dissimilarity measures.
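The effect can be reproduced with a small simulation. The sketch below is a toy under stated assumptions, not the simulation used in this chapter: the genome is a random 5 kbp string, reads are error-free and drawn uniformly, K = 4, and plain Manhattan distance on normalized kmer frequencies stands in for the background-adjusted measures. All names and parameter values are hypothetical.

```python
import random

def kmer_freqs(reads, k):
    """Pooled, normalized kmer frequencies over a list of reads."""
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def manhattan(p, q):
    return sum(abs(p.get(m, 0.0) - q.get(m, 0.0)) for m in set(p) | set(q))

def sample_reads(genome, depth, read_len, rng):
    """Draw error-free reads uniformly at random until `depth` fold coverage."""
    n_reads = max(1, round(depth * len(genome) / read_len))
    starts = (rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]

rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))

def self_dissimilarity(depth, k=4, read_len=50):
    """Dissimilarity between two independent samples of the SAME genome."""
    a = kmer_freqs(sample_reads(genome, depth, read_len, rng), k)
    b = kmer_freqs(sample_reads(genome, depth, read_len, rng), k)
    return manhattan(a, b)

low = self_dissimilarity(depth=0.2)
high = self_dissimilarity(depth=10)
print(f"depth 0.2x: {low:.3f}   depth 10x: {high:.3f}")
```

Both values are positive although the underlying genome is identical, and the low-depth pair is the more dissimilar one, which is the depth-dependent overestimation described above.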
The alignment-free dissimilarity between two NGS samples A and B is determined
by three factors: the alignment-free dissimilarity between their genomes,
the bias caused by random sampling of NGS sample A, and the bias caused by random
sampling of NGS sample B. Comparing NGS samples without bias adjustment may thus
be misleading and prone to conclusions that are inconsistent with analysis based
on their genomes. This is because a high dissimilarity between two
NGS samples does not necessarily imply a high dissimilarity between their genomes; it
could also result from the large bias caused by low sequencing depth. Therefore, the relative
order of pairwise dissimilarities between NGS samples can differ from that between genomes
if the sequencing depths of the NGS samples are different. For example, suppose
genome A is closer to genome B than to genome C based on their complete genomes.
All three genomes are sequenced using NGS and the sequencing depth of genome B is
lower than that of genome C. Since the dissimilarity between two genomes using NGS
data increases as sequencing depth decreases, it is possible that the dissimilarity between
A and B is higher than that between A and C based on NGS data, resulting in incorrect
relationships among the genomes A, B and C.
One feasible solution is to downsample all NGS samples to the same number of reads,
or to the same total number of sequenced bases if the read lengths differ [17].
Although the biases are not adjusted, they are controlled at the same level after
downsampling. As a result, the dissimilarity between NGS samples is affected by the same
level of bias and the relative order of pairwise dissimilarity between NGS samples should
be determined only by their genome dissimilarity. However, this method wastes a large
number of reads, since all samples are downsampled to the sequencing quantity of
the smallest sample and thus a vast majority of informative reads in the other samples,
which could have been included to improve the performance, are discarded.
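A minimal sketch of this downsampling strategy is shown below, assuming reads are held in memory as lists of strings. The function name, the greedy stop rule (stop at the first read that would exceed the target) and the toy read sets are illustrative assumptions rather than the procedure of [17].

```python
import random

def downsample_to_smallest(samples, seed=0):
    """Subsample every read set (without replacement) to the smallest total base count.

    samples: dict mapping sample name -> list of read strings
    Returns (downsampled samples, fraction of bases discarded per sample).
    """
    rng = random.Random(seed)
    target = min(sum(len(r) for r in reads) for reads in samples.values())
    kept, discarded = {}, {}
    for name, reads in samples.items():
        shuffled = reads[:]
        rng.shuffle(shuffled)
        chosen, bases = [], 0
        for read in shuffled:
            if bases + len(read) > target:
                break              # adding this read would exceed the target
            chosen.append(read)
            bases += len(read)
        total = sum(len(r) for r in reads)
        kept[name] = chosen
        discarded[name] = 1 - bases / total
    return kept, discarded

# Toy read sets with equal read lengths: 15 reads versus 1 read.
toy = {"deep": ["ACGT"] * 15, "shallow": ["ACGT"] * 1}
kept, discarded = downsample_to_smallest(toy)
print({name: len(reads) for name, reads in kept.items()}, discarded)
```

The 15-to-1 ratio in the toy input mirrors the 15 M versus 1 M read samples analyzed in the Results, where 14 of every 15 reads of the deepest sample would be thrown away.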
Another solution is to modify the formula of alignment-free dissimilarity by considering
sequencing depth and sequencing error. To the best of our knowledge, AAF [9] and Skmer
[13] are the only existing methods that account for sequencing depth and sequencing error
and adjust the alignment-free dissimilarity accordingly. AAF first infers a phylogenetic
tree of a group of genomes and then corrects all branch lengths (tip correction) based on
the average fold coverage of all NGS samples. However, since samples of high sequencing
depth tend to group together as aforementioned, tip correction after phylogeny inference
cannot correct the structure of the misleading phylogeny. In addition, AAF
corrects every branch length by the same amount, which does not solve the problem caused
by samples with different biases. Moreover, this correction depends on the estimation of
sequencing depth and sequencing error rate, which complicates the problem. On the other
hand, Skmer is able to adjust the bias between any pair of NGS samples differently, but it
also requires estimating the sequencing depth and sequencing error rate first and then adjusting
the formula of Mash (Jaccard distance) [10] accordingly. Although this bias adjustment
method works for simple dissimilarity measures such as Jaccard distance, adjusting the
formula of more complicated background-adjusted methods such as CVTree [6], $d_2^s$ [7], and
$d_2^*$ [7] can be a daunting, if not impossible, task.
Therefore, a method that can adjust the bias for alignment-free dissimilarity based
on NGS samples without downsampling and without introducing new estimations such as
sequencing depth or sequencing error rate is necessary. Since background-adjusted
dissimilarity measures have been shown to outperform other methods for solving different
problems ranging from evolutionary distance estimation [32] to virus-host interaction
prediction [88], geographic location prediction [17], horizontal gene transfer detection [89] and
metagenome and metatranscriptome comparison [35, 15], we focused on the bias adjustment
for two background-adjusted dissimilarity measures, $d_2^s$ and $d_2^*$, in this study.
Nevertheless, our method can be naturally generalized to adjust the bias for other alignment-free
methods.
4.2 Results
4.2.1 Alignment-free methods overestimate distance between NGS samples
The bias caused by NGS samples can be illustrated by a simplified example in Figure 4.1.
Figure 4.1 (a) shows two fictitious 12-bp genomes that differ by one base pair (A-T $\leftrightarrow$ G-C)
and Figure 4.1 (b) shows two 12-bp genomes that are exactly the same. The dissimilarity
measured by any reasonable alignment-free method between the two genomes in Figure 4.1 (b)
should be 0 and is thus smaller than the dissimilarity between the two genomes in Figure 4.1
(a). However, the dissimilarity between their NGS samples can show the opposite result. For
example, if the short reads (red arrows) in the NGS samples fully cover the two genomes in
Figure 4.1 (a) whereas the short reads (blue arrows) only partially cover the two genomes
in Figure 4.1 (b), it is clear that the dissimilarity based on the two NGS samples in Figure
4.1 (b) is greater than the dissimilarity based on the two NGS samples in Figure 4.1 (a).
This apparent contradiction can be explained by the different biases of NGS samples caused
by different sequencing depths.
(a)
5' AACTGACGTTAT 3'
3' TTGACTGCAATA 5'
5' AACTGACGTTGT 3'
3' TTGACTGCAACA 5'
(b)
5' AACTGACGTTAT 3'
3' TTGACTGCAATA 5'
5' AACTGACGTTAT 3'
3' TTGACTGCAATA 5'
Figure 4.1: Bias caused by NGS sampling. (a) shows two genomes that differ by only one
bp, marked in red. Both of their NGS samples (red arrows) perfectly cover their genomes.
(b) shows two genomes that are exactly the same. Their NGS samples (blue arrows) only
partially cover their genomes.
Although Figure 4.1 illustrates this bias with a simplified and extreme example, we also used
a real dataset of 21 primates from [90] and simulated NGS samples to show this bias. In
our previous study [32], we calculated pairwise $d_2^s$ and $d_2^*$ using K = 5 to K = 14, where K
is the length of the kmer, with Markovian order M = K - 2 for the background sequences,
between these 21 primate genomes and compared them with their pairwise evolutionary
distances estimated by alignment-based methods. Our results showed that pairwise $d_2^s$ and
$d_2^*$ with K = 14 and M = 12 are highly correlated with their evolutionary distances based
on alignments, with Spearman correlation coefficients of 0.979 for $d_2^s$ and 0.970 for $d_2^*$
(Figure S1-S4).
To study the influence of sequencing depth on $d_2^s$ and $d_2^*$, we simulated 8 NGS samples
with different numbers of 150-bp Illumina reads (1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M and
15 M) for each primate genome, corresponding to sequencing depths from 0.05× to 0.75×
(see Methods). A total of 8 × 21 = 168 NGS samples with different sequencing depths were
generated and mixed together. We then calculated their pairwise $d_2^s$ and $d_2^*$ values and
compared them with the pairwise $d_2^s$ and $d_2^*$ calculated based on their complete genomes.
The result for $d_2^s$ using K = 14 and M = 12 is shown in Figure 4.2. Results for $d_2^s$ using
other kmer lengths and Markovian orders are shown in Figure S5. Results for $d_2^*$ using
different kmer lengths are shown in Figure S6-S7. Both $d_2^s$ and $d_2^*$ have been transformed
to their corresponding similarity measures, where $s_2^s = 1 - 2 d_2^s$ and $s_2^* = 1 - 2 d_2^*$.
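The depth range and the similarity transform quoted above are simple arithmetic, sketched here for concreteness. The genome length of 3 Gbp is an assumed round figure used only for illustration (the actual primate genome lengths differ), and the statement that the dissimilarities lie in [0, 1] is likewise an assumption of this sketch.

```python
def sequencing_depth(n_reads, read_length, genome_length):
    """Fold coverage: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_length

def to_similarity(d):
    """Map a dissimilarity d (assumed in [0, 1]) to the similarity s = 1 - 2d."""
    return 1.0 - 2.0 * d

GENOME_LENGTH = 3_000_000_000  # assumed ~3 Gbp; actual primate genome lengths differ
for millions in (1, 3, 5, 7, 9, 11, 13, 15):
    depth = sequencing_depth(millions * 1_000_000, 150, GENOME_LENGTH)
    print(f"{millions:>2} M reads -> {depth:.2f}x")
```

Under that assumed genome length, 1 M and 15 M reads of 150 bp correspond to 0.05× and 0.75× coverage, the two ends of the range used in the simulation.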
As shown in Figure 4.2 (a)-(h), $s_2^s$ estimated from NGS samples is lower
than $s_2^s$ estimated from genomes, as all scatter points are below the dashed blue diagonal
line, and the bias, visualized as the gap between scatter points and the diagonal,
decreases when the sequencing depth increases. In addition, Figure 4.2 (a)-(h) clearly
illustrates that $s_2^s$ calculated based on NGS samples strongly correlates with $s_2^s$ calculated
based on the whole genomes if all samples have the same sequencing depth, even when the
sequencing depth is as low as 0.05× (1 M). The Spearman correlation coefficient between $s_2^s$ based
on NGS samples of 1 M reads and $s_2^s$ based on genomes is as high as 0.969. However,
if not all samples have the same sequencing depth, the Spearman correlation coefficient
[Figure 4.2 plot: scatter panels (a)-(i); x-axis: Genome $s_2^s$, y-axis: NGS $s_2^s$, both from 0.0 to 1.0.
Panel values: (a) 1 M reads, RMSE 0.372, SPC 0.969; (b) 3 M reads, RMSE 0.298, SPC 0.982;
(c) 5 M reads, RMSE 0.252, SPC 0.988; (d) 7 M reads, RMSE 0.220, SPC 0.989;
(e) 9 M reads, RMSE 0.195, SPC 0.994; (f) 11 M reads, RMSE 0.176, SPC 0.994;
(g) 13 M reads, RMSE 0.160, SPC 0.995; (h) 15 M reads, RMSE 0.148, SPC 0.996;
(i) Mix, RMSE 0.256, SPC 0.802.]
Figure 4.2: Relationship between pairwise $s_2^s$ estimated from primate genomes and from NGS
samples of different numbers of reads, using K = 14 and M = 12, without bias adjustment.
The x-axis is the pairwise $s_2^s$ estimated from genomes and the y-axis is the pairwise $s_2^s$
estimated from NGS samples. (a)-(h) show the relationship between $s_2^s$ estimated from primate
genomes and $s_2^s$ estimated from NGS samples of only 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M or
15 M reads, respectively. (i) shows pairwise $s_2^s$ estimated from mixed NGS samples.
NGS samples of different numbers of reads are colored accordingly. `Mix' means the two NGS
samples have different numbers of reads (e.g., between 1 M and 5 M or between 7 M and 11
M) and is colored in grey. The root mean squared error (RMSE) and Spearman correlation
coefficient (SPC) between pairwise $s_2^s$ estimated from NGS samples and genomes are
shown on each subplot.
dropped significantly even when we analyzed a larger total number of reads, as shown by
comparing Figure 4.2 (a) and Figure 4.2 (i). In Figure 4.2 (a), all samples have only 1 M
reads whereas in Figure 4.2 (i) each sample has a different number of reads ranging from 1
M to 15 M. The most likely reason is that if $s_2^s$ calculated between two NGS samples A and
B of 15 M reads (Figure 4.2 (h)) is greater than $s_2^s$ calculated between two NGS samples
C and D of 1 M reads (Figure 4.2 (a)), it does not necessarily mean that the genome $s_2^s$ between
A and B is greater than that between C and D. Samples of 15 M reads
have smaller bias than samples of 1 M reads, and therefore $s_2^s$ calculated from samples of 15
M reads will generally be greater than that from samples of 1 M reads regardless of their
genome $s_2^s$.
This observation supports our argument that bias caused by different sequencing depths
markedly decreases the performance of alignment-free analysis based on NGS sequencing
data. The same observation can be made for $d_2^s$ using different kmer lengths (Figure S5)
and for $d_2^*$ (Figure S6-S7). More detailed results for the `Mix' label in Figure 4.2 (i) are
reported in Figure S8 (a), in which `Mix' is divided into more specific labels such as `1 M
& 5 M', `1 M & 15 M' and `5 M & 15 M'.
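The drop in rank correlation under mixed depths can be reproduced with a toy calculation. The sketch below assumes a purely additive bias that depends only on depth, which is a simplification made for illustration and not the model used in this chapter; the similarity values and bias sizes are hypothetical. Spearman correlation is implemented directly as the Pearson correlation of average ranks.

```python
def ranks(values):
    """Ranks starting at 1, with ties receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical genome similarities for four sample pairs.
genome = [0.9, 0.8, 0.7, 0.6]
# Toy additive bias model (an assumption): low depth lowers similarity more.
BIAS_LOW_DEPTH, BIAS_HIGH_DEPTH = 0.35, 0.10

same_depth = [g - BIAS_LOW_DEPTH for g in genome]
mixed_depth = [g - b for g, b in zip(
    genome, [BIAS_LOW_DEPTH, BIAS_HIGH_DEPTH, BIAS_LOW_DEPTH, BIAS_HIGH_DEPTH])]

print(spearman(genome, same_depth))
print(spearman(genome, mixed_depth))
```

A uniform bias shifts every value by the same amount and leaves the ranking intact, whereas depth-dependent biases reorder the pairs even though more reads are used overall.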
To show that this bias is a common problem for all alignment-free methods, we did the
same analysis for another state-of-the-art alignment-free method, Mash [10], which is based
on Jaccard distance. We first calculated pairwise Mash distances based on the 21 primate
genomes using K = 14 (the same kmer length as we used for $d_2^s$ and $d_2^*$), K = 21 (default
kmer length for Mash), K = 31 (maximum kmer length allowed by Mash) and sketch sizes
$s = 10^3$, $s = 10^5$, $s = 10^7$, and compared them with the pairwise evolutionary distances
estimated by alignment-based methods. Figure S9 shows that the pairwise Mash distances
and the evolutionary distances have the highest Spearman correlation coefficient, 0.984,
when using K = 21 and $s = 10^7$.
We then chose kmer length K = 21 and sketch size $s = 10^7$ and compared Mash
distances estimated from primate genomes and Mash distances estimated from primate
NGS samples. Results are shown in Figure S10, where Mash distance has been transformed
to the corresponding Mash similarity, which equals 1 - Mash distance. Similar to $s_2^*$ and $s_2^s$,
Mash similarity estimated from NGS samples is also lower than Mash similarity estimated
from genomes and this bias increases as the sequencing depth decreases, as shown in Figure
S10 (a)-(h). As a consequence, the Spearman correlation coefficient (0.860) between Mash
similarity based on genomes and Mash similarity based on NGS samples of 1 M to 15 M
reads (Figure S10 (i)) is even lower than the corresponding Spearman correlation coefficient
(0.943) based on NGS samples of only 1 M reads (Figure S10 (a)).
As aforementioned, one solution is to downsample all NGS samples to have the same
number of reads as the smallest sample, which is 1 M reads in this example, as shown in
Figure 4.2 (a). This method does not adjust the bias of $s_2^s$ calculated based on NGS samples
but it ensures that all samples have similar biases. The performance after downsampling is
acceptable, with a Spearman correlation coefficient of 0.969 (Figure 4.2 (a)), and is better than
the performance without bias adjustment or downsampling (Figure 4.2 (i)). However, the
vast majority of reads are discarded by downsampling and thereby much information is
lost. For instance, in order to downsample a sample of 15 M reads to 1 M reads, we need
to discard 93.3% of reads in this sample.
4.2.2 Bias adjustment by a neural network regression model
We characterize the bias adjustment process as a regression problem that predicts the
dissimilarity based on genomes from the dissimilarity based on NGS samples and their biases.
It can be clearly seen from Figure 4.2, Figure S8 (a) and Figure S10 that the alignment-free
dissimilarity between any pair of NGS samples, $d(A_{NGS}, B_{NGS})$, is determined by the
alignment-free dissimilarity based on their genomes, $d(A_G, B_G)$, and the bias caused by each
NGS sample, $\mathrm{Bias}(A_{NGS})$ and $\mathrm{Bias}(B_{NGS})$:
$$d(A_{NGS}, B_{NGS}) = F(d(A_G, B_G), \mathrm{Bias}(A_{NGS}), \mathrm{Bias}(B_{NGS}))$$
In other words, if we know the function F, the alignment-free dissimilarity between a pair of
NGS samples and their corresponding biases, then the alignment-free dissimilarity based
on their genomes, which is not biased by the sequencing depths of the NGS samples, can be
predicted. Although it is hard to infer a closed-form formula for the function F for
background-adjusted methods such as CVTree [6], $d_2^s$ [7], and $d_2^*$ [7], a neural network regression model
can be trained to approximate it. See the Methods section for more details about the definition
of $\mathrm{Bias}(A_{NGS})$ and $\mathrm{Bias}(B_{NGS})$ and about model training and evaluation.
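One way to write the regression target implied by this description is given below. It is a sketch, not the formulation from the Methods section: the symbol $G_\theta$ for the trained network with parameters $\theta$, the estimate $\hat{d}$, and the squared-error objective over training pairs are assumptions introduced here for illustration.

```latex
\hat{d}(A_G, B_G) = G_\theta\big(d(A_{NGS}, B_{NGS}),\ \mathrm{Bias}(A_{NGS}),\ \mathrm{Bias}(B_{NGS})\big),
\qquad
\theta^{*} = \arg\min_{\theta} \sum_{(A,B)}
  \Big( G_\theta\big(d(A_{NGS}, B_{NGS}),\ \mathrm{Bias}(A_{NGS}),\ \mathrm{Bias}(B_{NGS})\big) - d(A_G, B_G) \Big)^{2}
```

Read this way, $G_\theta$ plays the role of an inverse of F in its first argument: it takes the three observable quantities and returns an estimate of the genome-level dissimilarity.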
4.2.3 The correlation between the adjusted dissimilarity measures based
on NGS samples and genomes of 21 primates is markedly increased
Two neural network regression models were trained on the 21-primate dataset, one for $d_2^S$
and one for $d_2^*$, and used to adjust the bias of the primate NGS samples (see Methods).
Using the resulting models, we adjusted the pairwise $d_2^S$ and $d_2^*$ dissimilarity
measures described in the above subsection. We then calculated the Spearman correlation
between the adjusted dissimilarity measures and the corresponding values computed from the
whole genomes. The results for $d_2^S$ using K = 14 and M = 12 were transformed to $s_2^S$
and are shown in Figure 4.3 and Figure S8 (b), in which `Mix' is divided into more specific
labels such as `1 M & 5 M'. Results of bias adjustment for $d_2^S$ using other kmer lengths
and Markovian orders are shown in Figure S11. Results of bias adjustment for $d_2^*$ using
different kmer lengths were transformed to $s_2^*$ and are shown in Figures S12-S13.
Comparing Figure 4.2 with Figure 4.3, we conclude that our model successfully adjusted the
bias between NGS $s_2^S$ and genome $s_2^S$, as most scatter points fall on the diagonal
in Figure 4.3. In addition, the root mean squared error decreased and the Spearman
correlation coefficient increased after bias adjustment. More importantly, Figure 4.3,
Figure S8 and Figures S11-S13 show that our bias adjustment method works for both $d_2^S$
and $d_2^*$, regardless of the chosen kmer length, Markovian order or sequencing depth. It
should be noted that our bias adjustment model increases the Spearman correlation
coefficients even when all samples have the same number of reads, as can be seen by comparing
Figure 4.3 (a)-(h) with the corresponding Figure 4.2 (a)-(h). A possible explanation is
that the same number of reads does not guarantee the same sequencing depth if genome lengths
differ. Moreover, the bias might also be caused by other factors, such as sequencing
errors and sequencing bias, that cannot be controlled by downsampling. Therefore, we
suggest always using our model to adjust the bias in alignment-free analysis based on
NGS sequencing data, even when each sample has a similar number of reads, to achieve
better performance.
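The two summary statistics used throughout this section can be computed as in the following sketch, which uses only the Python standard library. The helper names are illustrative, and the toy values are hypothetical rather than taken from the figures. The example also shows why both statistics are reported: a bias that preserves the ordering of pairs leaves the Spearman correlation at 1 while the root mean squared error stays above 0.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(x, y):
    """Root mean squared error between paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical similarities: NGS values are lower but keep the same order.
genome = [0.1, 0.2, 0.3, 0.4]
ngs = [0.05, 0.12, 0.20, 0.31]
print(spearman(genome, ngs), rmse(genome, ngs))
```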
We also evaluated the performance of Skmer [13], a recent alignment-free method that
corrects the formula of Mash distance for NGS samples by estimating the sequencing depth
and sequencing error rate, on the same primate dataset using kmer length K = 21 and sketch
size $s = 10^7$. The relationship between the Skmer distances using the whole genomes and
the Skmer distances using the NGS samples is shown in Figure S14, where the Skmer distance
has been transformed to the corresponding Skmer similarity, which equals 1 - Skmer distance.
By comparing Figure S10 (a)-(h) to the corresponding Figure S14 (a)-(h), we can see that
Skmer adjusted Mash similarity by increasing its value estimated from NGS samples to
compensate for the low sequencing depths and sequencing errors, as more points fall on the
diagonals in Figure S14. However, Skmer decreased the Spearman correlation coefficients,
especially when NGS samples have different sequencing depths, as seen by
[Figure 4.3: nine scatter plots, panels (a)-(i). X-axis: Genome $s_2^S$; Y-axis: Adjusted
NGS $s_2^S$; both axes run from 0.0 to 1.0. Legend: 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M,
15 M, Mix. Values printed on the panels:
(a) 1 M reads: RMSE 0.030, SPC 0.985
(b) 3 M reads: RMSE 0.034, SPC 0.996
(c) 5 M reads: RMSE 0.016, SPC 0.998
(d) 7 M reads: RMSE 0.013, SPC 0.997
(e) 9 M reads: RMSE 0.011, SPC 0.999
(f) 11 M reads: RMSE 0.011, SPC 0.999
(g) 13 M reads: RMSE 0.012, SPC 0.998
(h) 15 M reads: RMSE 0.013, SPC 0.998
(i) Mix: RMSE 0.022, SPC 0.989]
Figure 4.3: Relationship between pairwise $s_2^S$ estimated from primate genomes and from
NGS samples of different numbers of reads, using K = 14 and M = 12, with bias adjustment.
The X-axis is the pairwise $s_2^S$ estimated from genomes and the Y-axis is the pairwise
$s_2^S$ estimated from NGS samples after bias adjustment. (a)-(h) show the relationship
between $s_2^S$ estimated from primate genomes and adjusted $s_2^S$ based on NGS samples of
only 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M or 15 M reads, respectively. (i) shows pairwise
adjusted $s_2^S$ based on mixed NGS samples. NGS samples of different numbers of reads are
colored accordingly. `Mix' means that the two NGS samples have different numbers of reads
(e.g. between 1 M and 5 M or between 7 M and 11 M) and is colored in grey. The root mean
squared error (RMSE) and Spearman correlation coefficient (SPC) between pairwise $s_2^S$
estimated from NGS samples and from genomes are shown on each subplot.
comparing the coefficient of Mash (0.860) in Figure S10 (i) with the coefficient of Skmer
(0.766) in Figure S14 (i). A possible explanation is that the formula that Skmer uses
in [13] to correct Mash distance by estimating sequencing depth and sequencing error rate
is not accurate when two NGS samples have different sequencing depths. By comparison,
Figure 4.3, Figure S10 and Figure S12 show that adjusted $d_2^S$ and $d_2^*$ outperform
Mash and Skmer in all tested settings, especially when the sequencing depth is low (< 9 M
reads) or when samples have different sequencing depths.
4.2.4 The correlation between the adjusted dissimilarity measures based
on NGS samples and genomes of 28 mammals is markedly increased
We tested our model for $d_2^S$ bias adjustment on an independent dataset of 28 mammals
from [91]. In our previous study [32], we calculated pairwise $d_2^S$ using K = 14 and
M = 12 between these 28 mammalian genomes and showed that the pairwise $d_2^S$ values are
highly correlated with the pairwise evolutionary distances estimated by alignment-based
methods, with a Spearman correlation coefficient of 0.927 (Figure S15). We simulated 3 NGS
samples with different numbers of 150-bp Illumina reads (1 M, 5 M and 15 M) for each
mammalian genome, corresponding to sequencing depths from 0.05× to 0.75×, resulting in a
total of 28 × 3 = 84 samples (see Methods). We then calculated pairwise $d_2^S$ between all
84 NGS samples, adjusted them using our neural network model and compared them with the
pairwise $d_2^S$ calculated from their complete genomes. The result was transformed to
$s_2^S$ and is shown in Figure 4.4. Figure 4.4 (a) shows that pairwise NGS $d_2^S$ was
overestimated before adjustment, since all scatter points lie below the diagonal, whereas
most scatter points after bias adjustment in Figure 4.4 (b) fall on the diagonal, which
indicates that our model successfully adjusted the bias of $d_2^S$. In addition, the root
mean squared error decreased from 0.049 to 0.007 and the Spearman correlation coefficient
increased from 0.889 to 0.967 after bias adjustment.
[Figure 4.4: two scatter plots. X-axis: Genome $s_2^S$; Y-axis: NGS $s_2^S$ in (a) and
Adjusted NGS $s_2^S$ in (b); both axes run from 0.0 to 1.0. Legend: 1 M, 5 M, 15 M, Mix.
(a) Before bias adjustment: RMSE 0.049, SPC 0.889
(b) After bias adjustment: RMSE 0.007, SPC 0.967]
Figure 4.4: Relationship between pairwise $s_2^S$ estimated using K = 14 and M = 12 based
on 28 mammalian genomes and NGS samples of different numbers of reads. (a) shows
the relationship before bias adjustment. (b) shows the relationship after bias adjustment
for NGS $s_2^S$. The root mean squared error (RMSE) decreased and the Spearman
correlation coefficient (SPC) between pairwise genome $s_2^S$ and NGS $s_2^S$ increased
after bias adjustment.
We next tested the performance of Mash and Skmer on the same mammalian dataset.
We first calculated pairwise Mash distances based on the 28 mammalian genomes using
K = 14, K = 21, K = 31 and sketch sizes $s = 10^3$, $s = 10^5$, $s = 10^7$ and compared
them with the pairwise evolutionary distances estimated by alignment-based methods. Figure
S16 shows that pairwise Mash distances and the evolutionary distances have the highest
Spearman correlation coefficient, 0.943, when using K = 31 and $s = 10^7$. We then chose
kmer length K = 31 and sketch size $s = 10^7$ and compared the Mash and Skmer distances
estimated from mammalian genomes with those estimated from NGS samples. Results are shown
in Figure S17. The Spearman correlation coefficient (0.967) between adjusted $d_2^S$ based
on NGS samples and genomes is markedly higher than those for Mash (0.789) and Skmer (0.688).
4.2.5 The accuracy of predicting continental origins of white oak NGS
samples using k-NN is markedly increased
We tested our model for $d_2^*$ bias adjustment on a dataset of 92 white oak NGS samples
collected from 3 continents (North America, Asia and Europe). In our previous study [17],
we downsampled each sample to 3 different sequencing quantities (50 Mbp, 100 Mbp and
300 Mbp), corresponding to sequencing depths from 0.07× to 0.42×. At each sequencing
quantity, samples were randomly divided into a reference set and a query set. For each
sample in the query set, we found its k-nearest neighbors (k-NN) in the reference set,
measured by $d_2^*$ with K = 12 and M = 10, and predicted its continental origin by a
majority vote (see Methods). The k-NN accuracy at all 3 sequencing quantities is shown in
Table S1; the accuracy increases with sequencing quantity.
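The k-NN majority vote described above can be sketched as follows. The function name is illustrative, and the tie-breaking rule (prefer the label of the nearest neighbor among the tied labels) is an assumption, since the rule used in the study is specified in Methods rather than here. The dissimilarity values in the example are hypothetical.

```python
from collections import Counter


def knn_predict(query, reference_labels, dissimilarity, k):
    """Predict a label for `query` by majority vote among its k nearest references.

    reference_labels: {reference sample id: label}
    dissimilarity:    function (query id, reference id) -> float
    """
    # Sort references by dissimilarity to the query (id as a stable tie-break).
    neighbors = sorted(
        reference_labels, key=lambda ref: (dissimilarity(query, ref), ref)
    )[:k]
    votes = Counter(reference_labels[ref] for ref in neighbors)
    top = max(votes.values())
    tied = {label for label, count in votes.items() if count == top}
    # Assumption: ties go to the label of the nearest neighbor among tied labels.
    for ref in neighbors:
        if reference_labels[ref] in tied:
            return reference_labels[ref]


# Hypothetical dissimilarities between one query and three reference samples.
labels = {"r1": "Asia", "r2": "Asia", "r3": "Europe"}
toy = {"r1": 0.2, "r2": 0.3, "r3": 0.1}
print([knn_predict("q", labels, lambda q, r: toy[r], k) for k in (1, 2, 3)])
```

In this toy case the single nearest neighbor is a Europe sample with a small dissimilarity, which is the kind of misclassification that a depth-related bias can produce when k is small.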
We randomly selected 30 samples from the 50 Mbp dataset, 31 samples from the 100 Mbp
dataset and 31 samples from the 300 Mbp dataset and mixed them together to build a new
dataset of NGS samples with different sequencing quantities. We predicted the continental
origins of samples in the query set using the same method; results are shown at the
top of Table 4.1. Unsurprisingly, the accuracy was even lower than when we downsampled
all samples to 50 Mbp (Table S1), because a sample from Asia of 50 Mbp might have a smaller
$d_2^*$ to a sample from Europe of 300 Mbp than to another sample from Asia of 50 Mbp, and
it is then likely to be misclassified. We used our model for $d_2^*$ to adjust the bias and
predicted the continental origins again based on the dissimilarity after bias adjustment;
the prediction accuracy is shown at the bottom of Table 4.1. Our bias adjustment model
increased the accuracy markedly, especially when the reference size is small. It should be
noted that the accuracy after bias adjustment is higher than the accuracy obtained by
downsampling all samples to 50 Mbp or 100 Mbp and is comparable to the accuracy when all
samples are of 300 Mbp. This shows that bias adjustment can achieve better performance than
downsampling, since the vast majority of reads are discarded by downsampling whereas bias
adjustment still analyzes all the reads.
The prediction accuracy of Mash and Skmer was tested on the same oak dataset using
K = 12, K = 21, K = 31 and sketch sizes $s = 10^3$, $s = 10^5$, $s = 10^7$, and results are
shown in Table S2 and Table S3, respectively. It is clear that the adjusted $d_2^*$ has
higher prediction accuracy than Mash and Skmer, especially when the reference size is small.
For instance, when there are only 15 samples in the reference set, the adjusted $d_2^*$ can
still achieve an average prediction accuracy of 0.96, whereas the highest average prediction
accuracies of Mash and Skmer using K = 31 and $s = 10^7$ are 0.64 and 0.73, respectively.
4.2.6 The prediction accuracy of geographic origin at finer scales for
white oak NGS samples is markedly increased
Without downsampling, we calculated the pairwise $d_2^*$ using K = 12 and Markovian order
M = 10 between all 92 white oak tree NGS samples, with sequencing quantity ranging from
379 Mbp to 1,852 Mbp, corresponding to sequencing depths from 0.53× to 2.59×. For
each sample, we first found its closest sample according to $d_2^*$ and linked them together
as shown in Figure 4.5 (a). Although the most similar sample to each sample according
to $d_2^*$ is from the same continent of origin, $d_2^*$ does not perform well at finer
geographic scales. There are two sink nodes, SRR2053099 (blue arrow) and SRR2053115
(orange arrow), in Figure 4.5 (a). According to $d_2^*$, SRR2053099 was predicted as the
most similar sample to 19 out of 31 Asian samples and SRR2053115 was predicted as the most
Query size  Reference size  k=1   k=2   k=3   k=4   k=5   k=6   k=7   k=8   k=9   k=10
Before bias adjustment
1           91              0.97  0.97  0.97  1.00  1.00  1.00  0.98  0.95  0.91  0.91
17          75              0.98  0.98  0.96  0.99  0.96  0.98  0.96  0.95  0.91  0.91
32          60              0.97  0.97  0.94  0.96  0.94  0.95  0.91  0.92  0.88  0.89
47          45              0.95  0.95  0.93  0.94  0.91  0.91  0.88  0.89  0.87  0.88
62          30              0.93  0.93  0.88  0.89  0.85  0.87  0.83  0.84  0.82  0.81
77          15              0.84  0.84  0.77  0.78  0.75  0.74  0.69  0.70  0.67  0.65
After bias adjustment
1           91              1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
17          75              1.00  1.00  1.00  1.00  1.00  1.00  0.99  1.00  1.00  1.00
32          60              1.00  1.00  1.00  1.00  0.99  1.00  0.99  1.00  0.99  0.99
47          45              1.00  1.00  0.99  1.00  0.99  0.99  0.98  0.99  0.97  0.98
62          30              0.99  0.99  0.97  0.97  0.94  0.95  0.92  0.93  0.90  0.91
77          15              0.96  0.96  0.92  0.93  0.87  0.87  0.81  0.79  0.74  0.70
Table 4.1: Prediction accuracy using k-NN on the 92 white oak dataset of mixed sequencing
quantity based on $d_2^*$ before and after bias adjustment, for different query sizes,
reference sizes and numbers of neighbors k. For each query size and reference size,
the dataset was randomly split 100 times and the average prediction accuracy was calculated
over the 100 splits.
similar sample to 8 out of 16 European samples. The reason is that SRR2053099 (1,414
Mbp) is one of the largest samples among all samples from Asia and SRR2053115 (1,852
Mbp) is the largest sample among all samples from Europe, so they have the smallest
biases among samples from Asia and Europe, respectively. Therefore, they are more likely to
be predicted as the closest samples to other samples according to $d_2^*$.
Figure 4.5: The circular plots of 92 white oak tree samples based on $d_2^*$ with K = 12
and M = 10 before and after bias adjustment. Different sectors correspond to different
continents, with North America (NA) in red, Europe (EU) in orange and Asia (AS) in
blue. Within each sector, samples are sorted by their longitude, so that samples that are
geographically close are also close to each other in the figure. The most similar tree
sample to each sample is linked.
We adjusted the biases of $d_2^*$ using Afann; results are shown in Figure 4.5 (b).
Figure 4.5 (b) contains no sink nodes like SRR2053099 and SRR2053115 in Figure 4.5 (a),
which indicates that the adjusted $d_2^*$ is not biased by sequencing depth. In
order to show that bias adjustment can improve the prediction accuracy at finer geographic
scales, we calculated the average distance between each sample and its closest sample
according to $d_2^*$ before and after bias adjustment. In Figure 4.5, all samples are sorted
by their longitude, so we define the distance between each sample and its closest sample as
their distance in the circular plots. For each sample, this distance takes its minimum
value of 1 if and only if its closest sample according to $d_2^*$ is next to it in the
circular plots. The average distance between all 92 samples and their closest samples is
7.42 before bias adjustment and 5.85 after bias adjustment. A paired-sample t-test showed
that the average distance after bias adjustment is significantly lower than before
adjustment, with a p-value of 0.023. Therefore, although $d_2^*$ based on NGS samples
without downsampling or bias adjustment can successfully predict continental origins, bias
adjustment can further increase the prediction accuracy at finer geographic scales.
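The distance in the circular plots can be sketched as the number of positions separating two samples in the plotted order. The helper names are illustrative, and treating the order as wrapping around the circle (so that the first and last samples are adjacent) is an assumption; the text above only fixes that adjacent samples have distance 1.

```python
def circular_distance(i, j, n):
    """Number of steps between positions i and j on a circle of n positions."""
    step = abs(i - j) % n
    return min(step, n - step)


def mean_nearest_distance(order, nearest):
    """Average circular distance between each sample and its closest sample.

    order:   sample ids in the order they appear around the circular plot
    nearest: {sample id: id of its closest sample under the dissimilarity}
    """
    position = {sample: idx for idx, sample in enumerate(order)}
    n = len(order)
    total = sum(
        circular_distance(position[s], position[nearest[s]], n) for s in order
    )
    return total / n


# Hypothetical six-sample plot; "c" links to a sample three positions away.
order = ["a", "b", "c", "d", "e", "f"]
nearest = {"a": "b", "b": "a", "c": "f", "d": "e", "e": "d", "f": "a"}
print(mean_nearest_distance(order, nearest))
```

A sink node such as those in Figure 4.5 (a) raises this average, because many samples link to one distant sample instead of to their neighbors in the plot.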
4.2.7 The correlation between the adjusted dissimilarity measures based
on NGS samples and genomes of 67 vertebrates is markedly increased
Since our previous datasets all consist of sequencing reads coming from closely related
species, we constructed a dataset that contains samples from 67 vertebrates to evaluate
the performance of our method on diverse datasets. It contains vertebrate genomes of 67
species from 5 different classes: fish, amphibians, reptiles, birds and mammals.
Among these 67 vertebrate genomes, we randomly selected 23, 22 and 22 genomes,
simulated their NGS samples of 1 M, 5 M and 15 M 150-bp Illumina reads, respectively, and
mixed all 67 NGS samples together. The sequencing depths of all NGS samples range from
0.024× to 3.49×.
We then calculated pairwise $d_2^*$ and $d_2^S$ using K = 14 and M = 12 between all 67 NGS
samples with and without bias adjustment, and compared them with the pairwise $d_2^*$ and
$d_2^S$ calculated from their complete genomes. The result for $d_2^*$ was transformed to
$s_2^*$ and is shown in Figure 4.6. Figure 4.6 shows that our method markedly
decreased the root mean squared error (from 0.043 to 0.006) and increased the Spearman
correlation coefficient from 0.727 to 0.959. The result for $d_2^S$ was transformed to
$s_2^S$ and is shown in Figure S18; the bias adjustment of $d_2^S$ increased the Spearman
correlation coefficient from 0.701 to 0.935.
[Figure 4.6: two scatter plots. X-axis: Genome $s_2^*$; Y-axis: NGS $s_2^*$ in (a) and
Adjusted NGS $s_2^*$ in (b); both axes run from 0.0 to 1.0. Legend: 1 M, 5 M, 15 M, Mix.
(a) Before bias adjustment: RMSE 0.043, SPC 0.727
(b) After bias adjustment: RMSE 0.006, SPC 0.959]
Figure 4.6: Relationship between pairwise $s_2^*$ estimated using K = 14 and M = 12 based
on 67 vertebrate genomes and NGS samples of different numbers of reads. (a) shows the
relationship before bias adjustment. (b) shows the relationship after bias adjustment for
NGS $s_2^*$. The root mean squared error decreased and the Spearman correlation coefficient
between pairwise genome $s_2^*$ and NGS $s_2^*$ increased after bias adjustment.
The performance of Mash and Skmer using K = 31 and $s = 10^7$ was tested on the
same vertebrate dataset and is shown in Figure S19. The Spearman correlation coefficients
between NGS samples and genomes for adjusted $d_2^*$ (0.959) and adjusted $d_2^S$ (0.935)
are markedly higher than those for Mash (0.747) and Skmer (0.735).
4.2.8 Running time and memory
Although background-adjusted alignment-free methods such as CVTree [6], $d_2^S$ [7] and
$d_2^*$ [7] have been shown to achieve better performance than simple Manhattan and
Euclidean distances [32, 17, 89, 35, 15], their applications have been limited due to the
high time and memory cost of the random background removing step. To overcome this
bottleneck, we improved the speed and memory usage of background-adjusted methods in Afann
by hashing, multi-threading and vectorization and compared it with our previous program
Cafe [32].
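As an illustration of the hashing idea only, the following sketch counts kmers by encoding each base in 2 bits and updating an integer key with a rolling shift, so that every kmer maps to a number in the range 0 to $4^K - 1$. This is a generic technique and not Afann's actual implementation; the function names and the choice to restart the window at ambiguous bases are assumptions.

```python
from collections import Counter

ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(reads, k):
    """Count kmers over all reads, keyed by a 2-bit-per-base integer code."""
    counts = Counter()
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases in the key
    for read in reads:
        code, valid = 0, 0
        for base in read.upper():
            if base not in ENCODE:
                # Assumption: an ambiguous base (e.g. N) restarts the window.
                code, valid = 0, 0
                continue
            code = ((code << 2) | ENCODE[base]) & mask
            valid += 1
            if valid >= k:
                counts[code] += 1
    return counts


def decode(code, k):
    """Turn an integer kmer code back into its DNA string."""
    return "".join("ACGT"[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))


counts = count_kmers(["ACGTAC"], 2)
print({decode(code, 2): n for code, n in counts.items()})
```

Integer keys of this kind can also index a dense array of length $4^K$, which is one way to trade memory for speed when K is small.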
Both tools were used to calculate the pairwise $d_2^S$, $d_2^*$ and CVTree among the white
oak dataset of 92 NGS samples with 300 Mbp sequencing quantity using K = 12, M = 10
and among the primate dataset of 21 genomes using K = 14, M = 12. Comparisons of
time and memory based on the white oak dataset and the primate dataset are shown in Table
4.2 and Table S4, respectively. The total time was divided into kmer counting time and
dissimilarity calculation time. Table 4.2 shows that the total speedup
ratio of Afann is around 100 for all 3 background-adjusted methods, whereas the memory
usage is only about 1/5 of that of Cafe. Afann also supports a fast calculation mode for
$d_2^*$ and CVTree, which further increases the calculation speed by using more memory. The
memory usage is $O(4^K)$ for normal mode and $O(N \cdot 4^K)$ for the fast mode, where K is
the kmer length and N is the number of samples. Because the counting time is
$O(N \cdot 4^K)$ whereas the calculation time is $O(N^2 \cdot 4^K)$, the total speedup
ratio approaches the speedup ratio of dissimilarity calculation as the number of samples
N increases. We suggest using fast mode when memory allows. For example, it is common
to compare the pairwise dissimilarity among thousands of bacterial genomes using a small
kmer length of 5 or 6, which does not require much memory; in that case the speedup ratio
of fast mode can be more than 5,000.
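The growth rates above can be made concrete with a short calculation of the number of kmer table entries and sample pairs. The function names are illustrative, and the sketch counts entries only; the number of bytes per entry depends on the implementation and is not assumed here.

```python
def kmer_table_entries(k, n_samples=1):
    """Entries in a dense kmer table: 4^K per sample held in memory."""
    return n_samples * 4 ** k


def pairwise_comparisons(n_samples):
    """Number of unordered sample pairs, which drives the O(N^2) term."""
    return n_samples * (n_samples - 1) // 2


# White oak setting from this section: K = 12 and N = 92 samples.
print(kmer_table_entries(12))        # normal mode, one table at a time
print(kmer_table_entries(12, 92))    # fast mode, all tables in memory
print(pairwise_comparisons(92))      # pairs to compare
# Small kmer length, as in the bacterial example: K = 6.
print(kmer_table_entries(6))
```

With K = 6 a table has only 4,096 entries, which is why keeping thousands of tables in memory for the fast mode is feasible for short kmers.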
The running time and memory usage of the other fast alignment-free tools Mash [10] and
Skmer [13] on the same oak NGS and primate genome datasets were also measured and are
reported in Table 4.2 and Table S4, respectively. The running time and memory usage of
an alignment-free genome comparison tool, FFP [8], on the primate genome dataset using
K = 16, as suggested in [8], are reported in Table S4. It should be noted that the running
times of Mash and Skmer using K = 12, $s = 10^3$ for the oak NGS dataset and K = 14,
$s = 10^3$ for the primate genome dataset were only included to test their speed when using
the same kmer length as $d_2^*$ and $d_2^S$. In practice, a kmer length shorter than 21 is
not recommended for Mash and Skmer [13]. We can see from Table 4.2 and Table S4 that Cafe
calculates $d_2^S$, $d_2^*$ and CVTree much more slowly than Mash, Skmer and FFP, whereas
Afann calculates $d_2^S$, $d_2^*$ and CVTree in an amount of time comparable to Mash, Skmer
and FFP. The pairwise dissimilarity measures among primate genomes calculated by different
methods were compared with their evolutionary distances estimated by alignment-based
methods in [90] and are shown in Figure S20. All dissimilarity measures except FFP are
highly correlated with the evolutionary distance, with Spearman correlation coefficients
higher than 0.95, which demonstrates the applicability of alignment-free methods in genome
comparisons. However, our evaluations based on different independent datasets in previous
sections showed that Afann markedly outperforms the others in comparing NGS samples.
4.3 Discussion
Alignment-free sequence comparison methods, especially kmer-based methods, have been
widely used in NGS analysis without assembly or alignment. However, several studies have
revealed that the alignment-free dissimilarity calculated based on NGS samples tends to
Method               Counting (min)  Calculation (min)  Total Time (min)  Memory (MB)
Cafe-$d_2^S$         450.2           4260.2             4710.4            1916
Afann-$d_2^S$        21.9            31.2               53.1              449
Cafe-$d_2^*$         450.2           4224.1             4764.3            1928
Afann-$d_2^*$        21.9            14.2               36.1              304
Afann-$d_2^*$-fast   21.9            0.3                22.2              11953
Cafe-CVTree          450.2           4295.4             4745.6            1960
Afann-CVTree         21.9            14.1               36.0              304
Afann-CVTree-fast    21.9            0.3                22.2              11953
Mash_min             21.5            0.1                21.5              3
Mash_opt             125.6           25.5               151.1             20830
Skmer_min            NA              NA                 111.9             565
Skmer_opt            NA              NA                 656.9             2556
Table 4.2: Kmer counting time, dissimilarity calculation time and total time, as well as
memory usage, of Cafe and Afann when calculating the pairwise $d_2^S$, $d_2^*$ and CVTree
using K = 12 and M = 10 among a dataset of 92 white oak NGS samples of 300 Mbp.
Afann-$d_2^*$-fast and Afann-CVTree-fast stand for the fast mode of $d_2^*$ and CVTree
supported in Afann. Running time and memory usage of Mash and Skmer are also included.
Mash_min and Skmer_min used K = 12 and $s = 10^3$, which require the minimum computing
power. Mash_opt and Skmer_opt used K = 31 and $s = 10^7$, which have the optimal performance
among Mash and Skmer using different combinations of kmer lengths and sketch sizes, as
shown in Table S2 and Table S3.
be over-estimated compared to the alignment-free dissimilarity calculated based on their
genomes, owing to the stochastic distribution of short reads [9, 17, 13]. In this study, we
showed that this bias can markedly decrease the performance of alignment-free analysis
based on NGS samples of different sequencing depths by investigating four independent
datasets. For the primate, mammalian and vertebrate datasets, the correlation between
pairwise NGS dissimilarity and pairwise genome dissimilarity dropped markedly if NGS
samples of different numbers of reads were mixed together. For the white oak dataset, the
k-NN prediction accuracy of continental origins based on a dataset of samples with
50 Mbp, 100 Mbp and 300 Mbp sequencing quantity is even lower than the accuracy based
on a dataset in which all samples have only 50 Mbp sequencing quantity.
This problem was previously solved by downsampling [17] or by modifying the specific
dissimilarity formula using estimates of sequencing depth and sequencing error rate [9, 13].
However, the first method discards the vast majority of reads that could have been
informative, and the second method depends on the estimation of sequencing depth and
sequencing error rate and cannot be generalized to adjust the bias for other alignment-free
methods calculated by a different formula. In addition, it can be extremely hard to adjust
the formula of several complicated background-adjusted methods such as CVTree [6], $d_2^S$
[7] and $d_2^*$ [7].
Therefore, we introduced a de-novo method in this study to adjust bias without estimating
sequencing depth or sequencing error rate explicitly, by defining
$\mathrm{Bias}(A_{NGS}) = d(A_{NGS}, A_{NGS}^{R})$. This bias estimator increases as the
sequencing depth decreases or the sequencing error rate increases and thus implicitly
captures information about sequencing depth and sequencing error rate. Therefore, bias
adjustment can be characterized as a regression problem that uses $d(A_{NGS}, B_{NGS})$,
$\mathrm{Bias}(A_{NGS})$ and $\mathrm{Bias}(B_{NGS})$ to predict $d(A_G, B_G)$. Two neural
network regression models were trained for $d_2^S$ and $d_2^*$ separately
using the primate dataset. Our results showed that bias was successfully adjusted for
NGS samples of different sequencing depths and for different kmer lengths,
supported by the large improvement in root mean squared error and Spearman correlation
coefficient between dissimilarity based on NGS samples after bias adjustment and
dissimilarity based on genomes.
Without changing any parameters, the performance of our models was tested on three
independent datasets. A dataset of 28 mammals was used to test our bias adjustment model
for $d_2^S$. For each genome, 3 NGS samples of 1 M, 5 M and 15 M 150-bp Illumina reads were
simulated, and all samples were mixed together. Pairwise $d_2^S$ values using K = 14 and
M = 12 between all 84 samples were calculated without and with bias adjustment and compared
with pairwise genome $d_2^S$. Results showed that our method successfully adjusted the bias
and greatly improved both root mean squared error and Spearman correlation coefficient. A
dataset of 92 white oaks was used to test our bias adjustment model for $d_2^*$. We randomly
selected 30 samples from the 50 Mbp dataset, 31 samples from the 100 Mbp dataset and 31
samples from the 300 Mbp dataset and mixed them together. Pairwise $d_2^*$ values using
K = 12 and M = 10 between all 92 samples were calculated without and with bias adjustment,
and k-NN was used to predict the continental origins of test samples based on $d_2^*$. Our
results showed that, before bias adjustment, the prediction accuracy on the mixed dataset
using k-NN is even lower than that on a dataset of all 50 Mbp samples. After bias
adjustment, the k-NN accuracy markedly increased and was comparable to the accuracy based
on a dataset of all 300 Mbp samples. In addition, we showed that bias adjustment can
increase the accuracy of prediction not only at the continental level but also at finer
geographic scales. Finally, a dataset of 67 vertebrates from 5 different classes was used to
demonstrate the reliability of our method on datasets composed of diverse species. In all
datasets, our method outperformed other alignment-free methods, including Mash [10] and
Skmer [13], in terms of the Spearman correlation coefficient and prediction accuracy.
It should be noted that while our bias adjustment method is capable of successfully
predicting the alignment-free dissimilarity based on genomes regardless of the chosen kmer
length, it is nevertheless important to choose a proper kmer length so that the
alignment-free dissimilarity based on genomes is highly correlated with the evolutionary
distance. For instance, while Figure S11 shows that our model successfully adjusted the bias
of $d_2^S$ using K = 5 to K = 13 based on primate NGS samples, Figure S4 shows that the
correlation between $d_2^S$ based on primate genomes and their evolutionary
distance is lower than 0.90 when K < 10 is used. Therefore, even if our model can adjust
the bias for K < 10, the performance might not increase as expected. There have been
several investigations into the choice of optimal kmer length [92, 8, 7, 93]. In practice,
shorter kmers are optimal when sequences are short or obviously different, whereas longer
kmers should be used when sequences are from very closely related species, in order to
reduce the probability that a kmer appears in a sequence by chance [92, 8, 7, 10].
In conclusion, our study showed that bias adjustment is a necessary step to increase
the performance of alignment-free methods based on NGS samples. Although our models
were trained only on the primate dataset, they were able to adjust the bias for independent
mammalian, vertebrate and white oak datasets, which indicates that our models generalize
to NGS samples from different species and sequencing depths and to different kmer lengths.
Future work could train our models using more samples from different species with more
varied sequencing depths to further improve the performance. In
addition, since our bias adjustment method relies only on the alignment-free dissimilarity
calculated between $A_{NGS}$ and $A_{NGS}^{R}$, without estimating sequencing depths or
sequencing error rates or considering the actual formula of the dissimilarity measures, it
can be easily generalized to adjust the bias of other alignment-free dissimilarity measures
by training their own regression models. In this paper, we showed the success of our bias
adjustment model for two background-adjusted methods, $d_2^S$ and $d_2^*$; the framework
developed here can be easily adapted to adjust bias in other alignment-free dissimilarity
measures.
4.4 Conclusion
Afann is a fast tool to calculate the background-adjusted alignment-free dissimilarity
measures CVTree, $d_2^*$ and $d_2^S$ between genome sequences or NGS samples. In addition,
it can adjust the biases caused by NGS samples of different sequencing depths for $d_2^*$
and $d_2^S$ without downsampling or estimating the sequencing depth. Our results showed that
the adjusted $d_2^*$ and $d_2^S$ are not biased by sequencing depth and can markedly
increase the performance of studies based on NGS samples.
4.5 Methods
See Appendix A of the supplementary material for more details about different alignment-free
dissimilarity measures including CVTree [6], $d_2^S$ [7] and $d_2^*$ [7].
Genomic datasets and simulation of NGS samples
The primate dataset consists of 21 complete primate genome sequences downloaded from
NCBI. In [90], the author estimated the evolutionary distances among 186 primates based
on the alignment of 54 nuclear gene regions. In our previous study [32], we found 21
complete representative genomes on NCBI among these 186 primates and demonstrated
that their pairwise $d_2^S$ and $d_2^*$ with K = 14 and M = 12 are highly correlated with
their evolutionary distances estimated in [90]. The species names, assembly accession
numbers and total sequence lengths of these 21 primate genomes are shown in Table S5. For
each genome, we used ART [94] to simulate different numbers of Illumina HiSeq 2500 reads of
length 150 bp with the default sequencing error profile. We produced 8 different datasets
with 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M and 15 M reads for each NGS sample. We then
mixed all 21 × 8 = 168 NGS samples to generate a new dataset of primate NGS samples.
Similarly, the mammalian dataset consists of 28 complete vertebrate genome sequences
downloaded from NCBI, with evolutionary distances calculated by the alignment-based
method in [91]. The species names, assembly accession numbers and total sequence lengths
of these 28 mammalian genomes are shown in Table S6. For each genome, we used ART
[94] to simulate different numbers of Illumina HiSeq 2500 reads of length 150 bp with the
default sequencing error profile. We produced 3 different datasets with 1 M, 5 M and 15
M reads for each NGS sample. We then mixed all 28 × 3 = 84 NGS samples to generate a
new dataset of mammalian NGS samples.
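The relation between the number of simulated reads and the sequencing depth can be checked with a one-line calculation: depth equals the number of reads times the read length divided by the genome length. In the sketch below the genome length of 3 Gbp is a hypothetical round figure for a mammalian genome, not a value taken from Table S6, and the function name is illustrative.

```python
def sequencing_depth(n_reads, read_length, genome_length):
    """Average coverage: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_length


GENOME_LENGTH = 3_000_000_000  # assumption: a 3 Gbp genome, for illustration

for n_reads in (1_000_000, 5_000_000, 15_000_000):
    depth = sequencing_depth(n_reads, 150, GENOME_LENGTH)
    print(f"{n_reads:>10,d} reads of 150 bp -> {depth:.2f}x")
```

Because the depth scales inversely with genome length, samples with the same number of reads from genomes of different sizes do not have the same depth.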
The white oak tree dataset consists of whole-genome shotgun (WGS) sequencing data
of 92 white oaks from North America, Europe and Asia, with sequencing quantity ranging
from 379 Mbp to 1,852 Mbp, from NCBI BioProject PRJNA269970 [95]. The run accession
numbers, numbers of bases and continental origins of these 92 samples are shown in Table
S7. We downsampled all 92 samples to produce three different datasets with 50 Mbp, 100
Mbp and 300 Mbp for each sample, respectively. Then, we randomly chose 30 samples
from the 50 Mbp dataset, 31 samples from the 100 Mbp dataset and 31 samples from the
300 Mbp dataset and mixed them together to generate a new dataset of 92 NGS samples with
different sequencing quantities. All samples were divided into three geographic categories
(North America, Europe and Asia) based on their continental origins.
The vertebrate dataset consists of 67 complete vertebrate genome sequences downloaded
from NCBI. The species are from 5 different classes, including 15 fish, 7 amphibians,
15 reptiles, 15 birds and 15 mammals. For each class except amphibians, 15 species were
randomly selected; for amphibians, only 7 complete genome sequences were available on NCBI,
so all of them were included in the dataset.
The species names, classes, assembly accession numbers and total sequence lengths of these
67 vertebrate genomes are shown in Table S8. Among these 67 vertebrate genomes, we
randomly selected 23, 22 and 22 genomes, simulated their NGS samples of 1 M, 5 M
and 15 M 150-bp Illumina reads, respectively, using ART [94] and mixed them together to
generate a dataset of 67 vertebrate NGS samples.
4.5.1 Developing a bias adjustment model
For any pair of NGS samples, their alignment-free dissimilarity $d(A_{NGS}, B_{NGS})$ is
determined by three variables: the alignment-free dissimilarity based on their genomes,
$d(A_G, B_G)$, and the bias caused by each sample, $\mathrm{Bias}(A_{NGS})$ and $\mathrm{Bias}(B_{NGS})$:

$$d(A_{NGS}, B_{NGS}) = F\big(d(A_G, B_G),\ \mathrm{Bias}(A_{NGS}),\ \mathrm{Bias}(B_{NGS})\big) \tag{4.1}$$
We define the bias of an NGS sample A by the following equation:

$$\mathrm{Bias}(A_{NGS}) = d(A_{NGS}, A^{R}_{NGS}) \tag{4.2}$$
where $A_{NGS}$ is the original NGS sample and $A^{R}_{NGS}$ is the sample obtained by mapping
each read in $A_{NGS}$ to its reverse complement. For
example, the NGS sample on the left of Figure 4.1 (a) has reads {AACT, GACG, TTAT,
ATAA, CGTC, AGTT}; mapping each read to its reverse complement gives
$A^{R}_{NGS}$ = {AGTT, CGTC, ATAA, TTAT, GACG, AACT}, which is exactly the same set of reads
as $A_{NGS}$. The NGS sample on the left of Figure
4.1 (b) has reads {ACTG, GTTA, ATAA}, so its $A^{R}_{NGS}$ is {CAGT, TAAC, TTAT},
which is clearly different from $A_{NGS}$.
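The two examples above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical and reads are assumed to contain only the uppercase letters A, C, G and T.

```python
# Map every read in an NGS sample to its reverse complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read: str) -> str:
    """Return the reverse complement of a DNA read (uppercase A/C/G/T only)."""
    return read.translate(COMPLEMENT)[::-1]


def reverse_complement_sample(reads):
    """Build the mapped sample: one reverse-complemented read per original read."""
    return [reverse_complement(r) for r in reads]


# Reads from the two examples in the text (Figure 4.1 (a) and (b)).
sample_a = ["AACT", "GACG", "TTAT", "ATAA", "CGTC", "AGTT"]
sample_b = ["ACTG", "GTTA", "ATAA"]

rc_a = reverse_complement_sample(sample_a)
rc_b = reverse_complement_sample(sample_b)

# Compare the samples as collections of reads, ignoring read order.
same_a = sorted(rc_a) == sorted(sample_a)
same_b = sorted(rc_b) == sorted(sample_b)
print(rc_a, same_a)
print(rc_b, same_b)
```

Sample (a) is closed under reverse complementation, so the mapped sample contains the same reads; sample (b) is not.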
Given a dissimilarity measure, such as $d_2^s$ or $d_2^*$, $\mathrm{Bias}(A_{NGS})$ can then be calculated
between $A_{NGS}$ and $A^{R}_{NGS}$. We expect that $\mathrm{Bias}(A_{NGS})$ will increase as the sequencing
depth of $A_{NGS}$ decreases or the sequencing error rate increases, as shown in Figure 4.1
(b). The advantage of defining $\mathrm{Bias}(A_{NGS})$ in this way is that we do not need to estimate
the sequencing depth or sequencing error rate explicitly; this information is
implicitly taken into account when we compare $A_{NGS}$ with $A^{R}_{NGS}$.
Given $d(A_G, B_G)$, $d(A_{NGS}, B_{NGS})$ will increase as $\mathrm{Bias}(A_{NGS})$ or $\mathrm{Bias}(B_{NGS})$
increases. This is shown in Figure 4.2 (i), where samples of high sequencing depth and thus low bias
(red points) have higher NGS $s_2^s$ (lower NGS $d_2^s$) than samples of low sequencing depth
and thus high bias (blue points), even when their genomes of interest are the same. In
addition, if $\mathrm{Bias}(A_{NGS})$ and $\mathrm{Bias}(B_{NGS})$ do not change, $d(A_{NGS}, B_{NGS})$ will increase
as $d(A_G, B_G)$ increases, as shown in Figure 4.2 (a)-(h): since NGS samples in the same
subplot have the same number of reads and thus similar $\mathrm{Bias}(A_{NGS})$ and $\mathrm{Bias}(B_{NGS})$,
their pairwise $d(A_{NGS}, B_{NGS})$ value increases with $d(A_G, B_G)$.
Because of this partial monotonic relationship between $d(A_{NGS}, B_{NGS})$ and $d(A_G, B_G)$
given $\mathrm{Bias}(A_{NGS})$ and $\mathrm{Bias}(B_{NGS})$, equation (4.1) can be rewritten as:

$$d(A_G, B_G) = G\big(d(A_{NGS}, B_{NGS}),\ \mathrm{Bias}(A_{NGS}),\ \mathrm{Bias}(B_{NGS})\big) \tag{4.3}$$
where $G$ is a general function. Therefore, the bias adjustment process can be characterized
as a regression problem: predicting the real genome dissimilarity
$d(A_G, B_G)$ for any pair of NGS samples. To solve this supervised learning problem,
we first train our regression models on datasets with known $d(A_G, B_G)$, $d(A_{NGS}, B_{NGS})$,
$\mathrm{Bias}(A_{NGS})$ and $\mathrm{Bias}(B_{NGS})$. Then, for any new pair of NGS samples, we first calculate
their $d(A_{NGS}, B_{NGS})$, $\mathrm{Bias}(A_{NGS})$ and $\mathrm{Bias}(B_{NGS})$ and use our model to predict their
$d(A_G, B_G)$. After bias adjustment, our sequence comparison can be based on the predicted
unbiased $d(A_G, B_G)$ instead of the biased $d(A_{NGS}, B_{NGS})$.
4.5.2 Model training and evaluation
1. Creating training samples
We trained two neural network regression models, a class of models widely used for nonlinear
regression problems, for $d_2^s$ and $d_2^*$ separately using the 21-primate dataset. Instead of
training on the NGS samples we generated previously to plot Figure 4.2, we generated a new
dataset by again simulating 8 NGS samples with different numbers of reads (1 M, 3 M, 5 M, 7
M, 9 M, 11 M, 13 M and 15 M) for each genome and mixing them together. The
samples are denoted $P^{1}_{NGS}$ to $P^{168}_{NGS}$, respectively. We describe how we trained the
bias adjustment model for $d_2^s$ in the following section. The same training method was used
for $d_2^*$ and can be easily generalized to other alignment-free methods.
For each pair of NGS samples $P^{i}_{NGS}$ and $P^{j}_{NGS}$, we calculated their NGS dissimilarity
$d_2^s(P^{i}_{NGS}, P^{j}_{NGS})$, their genome dissimilarity $d_2^s(P^{i}_{G}, P^{j}_{G})$, $\mathrm{Bias}(P^{i}_{NGS})$ and $\mathrm{Bias}(P^{j}_{NGS})$
using k-mer lengths from 5 to 14 and Markovian order $k - 2$. For each k-mer length $k$, there
are $168 \times 167 = 28{,}056$ ordered pairs, so $X_k$ is a matrix of dimension $28{,}056 \times 3$ and
$y_k$ is a vector of length 28,056, as shown below. To ensure that our model treats
$\mathrm{Bias}(P^{i}_{NGS})$ and $\mathrm{Bias}(P^{j}_{NGS})$ symmetrically, both $d_2^s(P^{i}_{NGS}, P^{j}_{NGS})$ and $d_2^s(P^{j}_{NGS}, P^{i}_{NGS})$
were included in the training samples; the resulting symmetry was verified after model training and is shown
in Figure S21. In order to build a regression model capable of adjusting the bias for
different k-mer lengths, we concatenated $X_k$ from $X_5$ to $X_{14}$ vertically and likewise concatenated
$y_k$ from $y_5$ to $y_{14}$. Therefore, our final $X = [X_5^T, X_6^T, \ldots, X_{14}^T]^T$ is a $280{,}560 \times 3$ matrix and
$y = [y_5^T, y_6^T, \ldots, y_{14}^T]^T$ is a $280{,}560 \times 1$ vector.
$$
X_k =
\begin{bmatrix}
d_2^s(P^{1}_{NGS}, P^{2}_{NGS}) & \mathrm{Bias}(P^{1}_{NGS}) & \mathrm{Bias}(P^{2}_{NGS}) \\
d_2^s(P^{1}_{NGS}, P^{3}_{NGS}) & \mathrm{Bias}(P^{1}_{NGS}) & \mathrm{Bias}(P^{3}_{NGS}) \\
d_2^s(P^{1}_{NGS}, P^{4}_{NGS}) & \mathrm{Bias}(P^{1}_{NGS}) & \mathrm{Bias}(P^{4}_{NGS}) \\
\vdots & \vdots & \vdots \\
d_2^s(P^{1}_{NGS}, P^{168}_{NGS}) & \mathrm{Bias}(P^{1}_{NGS}) & \mathrm{Bias}(P^{168}_{NGS}) \\
d_2^s(P^{2}_{NGS}, P^{1}_{NGS}) & \mathrm{Bias}(P^{2}_{NGS}) & \mathrm{Bias}(P^{1}_{NGS}) \\
d_2^s(P^{2}_{NGS}, P^{3}_{NGS}) & \mathrm{Bias}(P^{2}_{NGS}) & \mathrm{Bias}(P^{3}_{NGS}) \\
d_2^s(P^{2}_{NGS}, P^{4}_{NGS}) & \mathrm{Bias}(P^{2}_{NGS}) & \mathrm{Bias}(P^{4}_{NGS}) \\
\vdots & \vdots & \vdots \\
d_2^s(P^{168}_{NGS}, P^{167}_{NGS}) & \mathrm{Bias}(P^{168}_{NGS}) & \mathrm{Bias}(P^{167}_{NGS})
\end{bmatrix},
\qquad
y_k =
\begin{bmatrix}
d_2^s(P^{1}_{G}, P^{2}_{G}) \\
d_2^s(P^{1}_{G}, P^{3}_{G}) \\
d_2^s(P^{1}_{G}, P^{4}_{G}) \\
\vdots \\
d_2^s(P^{1}_{G}, P^{168}_{G}) \\
d_2^s(P^{2}_{G}, P^{1}_{G}) \\
d_2^s(P^{2}_{G}, P^{3}_{G}) \\
d_2^s(P^{2}_{G}, P^{4}_{G}) \\
\vdots \\
d_2^s(P^{168}_{G}, P^{167}_{G})
\end{bmatrix}
$$
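The construction of the per-k matrices and their vertical concatenation can be sketched with NumPy as follows. This is a minimal illustration, not the code used in this work: the dissimilarity and bias values are random placeholders, and the function name `build_training_matrices` is hypothetical.

```python
import numpy as np


def build_training_matrices(d_ngs, d_genome, bias):
    """Return (X_k, y_k) over all ordered pairs (i, j) with i != j.

    d_ngs, d_genome: (n, n) pairwise dissimilarity matrices; bias: length-n vector.
    """
    n = len(bias)
    # Row-major order: (0,1), (0,2), ..., (0,n-1), (1,0), (1,2), ...
    i_idx, j_idx = np.nonzero(~np.eye(n, dtype=bool))
    X_k = np.column_stack([d_ngs[i_idx, j_idx], bias[i_idx], bias[j_idx]])
    y_k = d_genome[i_idx, j_idx]
    return X_k, y_k


rng = np.random.default_rng(0)
n_samples = 168
blocks = []
for k in range(5, 15):  # k-mer lengths 5 to 14
    # Placeholder values standing in for the real dissimilarities and biases.
    upper = np.triu(rng.uniform(0.0, 0.5, (n_samples, n_samples)), 1)
    d_genome = upper + upper.T
    bias = rng.uniform(0.0, 0.1, n_samples)
    d_ngs = d_genome + 0.5 * (bias[:, None] + bias[None, :])
    blocks.append(build_training_matrices(d_ngs, d_genome, bias))

# Stack the ten per-k blocks vertically.
X = np.vstack([X_k for X_k, _ in blocks])
y = np.concatenate([y_k for _, y_k in blocks])
print(X.shape, y.shape)
```

Because both (i, j) and (j, i) appear as rows, each sample's bias is seen in both feature positions during training.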
2. Training samples augmentation
A data augmentation technique based on prior knowledge was used to further increase the number of
training samples. If there is no bias in the NGS samples, then the alignment-free dissimilarity
based on NGS samples should be equal to the dissimilarity based on their genomes
($d(A_{NGS}, B_{NGS}) = d(A_G, B_G) \iff \mathrm{Bias}(A_{NGS}) = \mathrm{Bias}(B_{NGS}) = 0$). Therefore, we
defined a hyperparameter, the augmentation ratio $r$, randomly simulated $d_1$ to $d_m$
($d_i \sim U(0, 1)$), and concatenated $X_A$ and $y_A$, shown below, to our training samples $X$ and
$y$, respectively, to fit our model. The sizes of $X_A$ and $y_A$ were determined by the size of the
training samples and the augmentation ratio $r$, where $m = |X|\,r$.
$$
X_A =
\begin{bmatrix}
d_1 & 0 & 0 \\
d_2 & 0 & 0 \\
d_3 & 0 & 0 \\
\vdots & \vdots & \vdots \\
d_m & 0 & 0
\end{bmatrix},
\qquad
y_A =
\begin{bmatrix}
d_1 \\
d_2 \\
d_3 \\
\vdots \\
d_m
\end{bmatrix}
$$
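The augmentation step can be sketched as below. This is an illustrative NumPy version under the definitions above; the function name `augment` and the toy input sizes are assumptions, not taken from the original implementation.

```python
import numpy as np


def augment(X, y, r, rng):
    """Append m = |X| * r bias-free rows (d_i, 0, 0) with target d_i, d_i ~ U(0, 1)."""
    m = int(round(len(X) * r))
    d = rng.uniform(0.0, 1.0, size=m)
    X_A = np.column_stack([d, np.zeros(m), np.zeros(m)])
    y_A = d
    return np.vstack([X, X_A]), np.concatenate([y, y_A])


rng = np.random.default_rng(0)
X = rng.uniform(size=(10, 3))   # toy training features
y = rng.uniform(size=10)        # toy training targets
X_all, y_all = augment(X, y, r=2, rng=rng)
print(X_all.shape, y_all.shape)
```

With r = 2, a training set of 10 rows gains 20 synthetic rows in which the NGS dissimilarity equals the target and both biases are zero.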
3. Hyperparameter tuning and evaluation
A neural network regression model with ReLU activation (sklearn.neural_network.MLPRegressor)
was trained, and a grid search algorithm was implemented to find the optimal combination
of hyperparameters such as hidden layer sizes, regularization term and augmentation ratio.
The workflow is described below and also shown in Figure 4.7.
1. 28,056 (10%) samples were randomly selected as a held-out test set. The remaining
252,504 (90%) samples were used as a training set.
2. A given combination of hyperparameters was chosen. Steps 3-4 were repeated 10
times, and the average $R^2$ under this combination of hyperparameters was calculated
(10-fold cross-validation).
3. 10% of the samples from the training set were randomly chosen as a validation set; the
other 90% were first augmented as described above and then used to fit our
model.
4. The trained model was used to predict $d(A_G, B_G)$ for the validation set and $R^2$ was
calculated.
5. Steps 2-4 were repeated with different hyperparameters, and the combination of
hyperparameters with the highest average $R^2$ was chosen (hyperparameter tuning).
6. The model with the optimal combination of hyperparameters was trained on the whole training set,
and the final model was used to predict $d(A_G, B_G)$ for the held-out test set and $R^2$
was calculated (evaluation).

[Figure 4.7 diagram boxes: Training set; Test set; Training set after data augmentation; Model; Validation set; arrows numbered 1 to 4.]
Figure 4.7: Diagram of hyperparameter tuning and evaluation. 1. The training set is augmented.
2. The training set after data augmentation is used to fit the model. 3. The trained model
is used to predict $d(A_G, B_G)$ for the validation set. For each combination of hyperparameters,
we repeated steps 1-3 10 times to calculate the average $R^2$, and the combination of
hyperparameters with the highest average $R^2$ was chosen. 4. After hyperparameter tuning,
the final model is tested on the test set.
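The tuning and evaluation workflow above can be sketched as follows. To keep the example small and dependency-light, a ridge regression stands in for the MLP regressor of the text, the data are synthetic, and the grid holds (regularization, augmentation ratio) pairs only; these choices, the function names, and the decision to augment the full training set before the final fit are all assumptions made for illustration.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination R^2."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_ridge(X, y, alpha):
    """Stand-in regressor (NOT the MLP of the text): ridge regression with an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    w = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)
    return lambda X_new: np.column_stack([X_new, np.ones(len(X_new))]) @ w


def augment(X, y, ratio, rng):
    """Append m = |X| * ratio bias-free rows (d, 0, 0) with target d."""
    m = int(round(len(X) * ratio))
    d = rng.uniform(0.0, 1.0, size=m)
    return (np.vstack([X, np.column_stack([d, np.zeros(m), np.zeros(m)])]),
            np.concatenate([y, d]))


def tune_and_evaluate(X, y, grid, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    n_test = len(X) // 10
    test, train = perm[:n_test], perm[n_test:]               # step 1: 10% held-out test set
    best_score, best_params = -np.inf, None
    for alpha, ratio in grid:                                 # steps 2 and 5: loop over the grid
        scores = []
        for _ in range(n_repeats):
            shuffled = rng.permutation(train)
            n_val = len(shuffled) // 10
            val, fit_idx = shuffled[:n_val], shuffled[n_val:]  # step 3: 10% validation split
            X_fit, y_fit = augment(X[fit_idx], y[fit_idx], ratio, rng)
            predict = fit_ridge(X_fit, y_fit, alpha)
            scores.append(r2_score(y[val], predict(X[val])))   # step 4: validation R^2
        mean_score = float(np.mean(scores))
        if mean_score > best_score:
            best_score, best_params = mean_score, (alpha, ratio)
    alpha, ratio = best_params                                # step 6: refit and test
    X_fit, y_fit = augment(X[train], y[train], ratio, rng)
    predict = fit_ridge(X_fit, y_fit, alpha)
    return best_params, r2_score(y[test], predict(X[test]))


# Synthetic data: the target falls as either bias grows, plus a little noise.
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(0, 1, 2000),
                     rng.uniform(0, 0.2, 2000),
                     rng.uniform(0, 0.2, 2000)])
y = X[:, 0] - 0.5 * (X[:, 1] + X[:, 2]) + rng.normal(0, 0.01, 2000)
grid = [(1e-4, 0), (1e-4, 2), (1.0, 0), (1.0, 2)]
best_params, test_r2 = tune_and_evaluate(X, y, grid)
print(best_params, test_r2)
```

The validation rows are never augmented, so the averaged $R^2$ reflects performance on real (here, synthetic) pairs only; the test set is touched once, after the hyperparameters are fixed.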
Finally, the combination of hyperparameters with the highest cross-validation score was
chosen (one hidden layer with 2,000 neurons, regularization term 0.0001 and augmentation
ratio 2) and tested on the held-out test data, giving an $R^2$ value of 0.98 for $d_2^s$ and 0.99 for $d_2^*$.
The final models for $d_2^s$ and $d_2^*$ were then used to adjust the bias for the primate, mammalian,
vertebrate and white oak NGS datasets. It should be mentioned that although the predicted $d(A_G, B_G)$
and $d(B_G, A_G)$ were almost identical, as shown in Figure S21, we take the average of
$d(A_G, B_G)$ and $d(B_G, A_G)$ as the final predicted dissimilarity between A and B to strictly
satisfy the symmetry property.
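The averaging step can be illustrated with a small sketch. The predictor `toy_model` below is a hypothetical, deliberately asymmetric stand-in for a trained regression model, and the input NGS dissimilarity is assumed to be symmetric in the two samples.

```python
def symmetric_prediction(predict, d_ngs, bias_a, bias_b):
    """Average the model output over both orderings of the two samples."""
    forward = predict(d_ngs, bias_a, bias_b)    # predicted d(A_G, B_G)
    backward = predict(d_ngs, bias_b, bias_a)   # predicted d(B_G, A_G)
    return 0.5 * (forward + backward)


def toy_model(d_ngs, bias_first, bias_second):
    """Hypothetical stand-in model that treats the two bias inputs slightly differently."""
    return d_ngs - 0.6 * bias_first - 0.4 * bias_second


ab = symmetric_prediction(toy_model, 0.30, 0.10, 0.02)
ba = symmetric_prediction(toy_model, 0.30, 0.02, 0.10)
print(ab, ba)
```

Even though the stand-in model is not symmetric in its bias inputs, the averaged prediction is the same for (A, B) and (B, A).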
4.5.3 White oak continental origin prediction by k-NN and $d_2^*$
For each sequencing quantity (50, 100 and 300 Mbp), we first calculated the pairwise $d_2^*$
using K = 12 and M = 10 between each pair of samples in the dataset. Then the 92 samples
were randomly divided into a reference set and a query set. The number of samples in the
reference set was 91 (leave-one-out), 77, 60, 45, 30 or 15. For each sample in the
query set, we found its k nearest neighbors (k = 1-10), measured by $d_2^*$, in the reference set
and predicted its continental origin by a majority vote. For each reference size, we split the data
100 times, and the prediction accuracy averaged over the 100 splits is shown in Table S1.
We then randomly selected 30 samples from the 50 Mbp dataset, 31 samples from the 100
Mbp dataset and 31 samples from the 300 Mbp dataset to form a mixed dataset. The same prediction method
was used, and the accuracies with and without bias adjustment are shown in Table 4.1.
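The k-NN prediction from a precomputed dissimilarity matrix can be sketched as follows. The toy distances and labels are made up for illustration, the function names are hypothetical, and the tie-breaking rule (a tied vote goes to the label seen first among the nearest neighbors) is an assumption, since the text does not specify one.

```python
from collections import Counter

import numpy as np


def knn_predict(dist, labels, reference_idx, query_idx, k):
    """Predict a label for each query by majority vote among its k nearest references."""
    reference_idx = np.asarray(reference_idx)
    predictions = []
    for q in query_idx:
        order = np.argsort(dist[q, reference_idx], kind="stable")[:k]
        nearest = reference_idx[order]
        votes = Counter(labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


def leave_one_out_accuracy(dist, labels, k):
    """Use every other sample as reference for each query sample in turn."""
    n = len(labels)
    correct = 0
    for q in range(n):
        reference = [i for i in range(n) if i != q]
        correct += knn_predict(dist, labels, reference, [q], k)[0] == labels[q]
    return correct / n


# Toy data: nine samples on a line, three per continent.
coords = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2])
labels = ["North America"] * 3 + ["Europe"] * 3 + ["Asia"] * 3
dist = np.abs(coords[:, None] - coords[None, :])

accuracy = leave_one_out_accuracy(dist, labels, k=3)
far_query = knn_predict(dist, labels, [0, 1, 2, 3, 4, 5], [8], k=1)
print(accuracy, far_query)
```

Smaller reference sets can be simulated by passing a random subset of indices as `reference_idx` and the remaining indices as `query_idx`.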
Chapter 5
Conclusion and future work
The dissertation presents the wide applications of alignment-free methods for different
problems, including HGT detection in bacterial genomes and origin prediction for white
oaks. It shows that alignment-free methods can be applied not only to complete
genomes but also to NGS samples without assembly, regardless of the sequencing technology.
Out of all alignment-free methods, background adjusted dissimilarity measures such as $d_2^*$
and $d_2^s$ performed better than other methods in our studies.
In addition, we addressed two limitations of background adjusted alignment-free dissimilarity
measures. We first reimplemented the code for calculating $d_2^*$ and $d_2^s$ so that they can
now run as fast as other state-of-the-art alignment-free methods such as Mash and Skmer. We
also adjusted the bias of alignment-free methods based on sequencing data using a neural
network regression model.
In future work, background adjusted alignment-free dissimilarity measures could be further
improved by relating them to the real evolutionary distances.
Although $d_2^*$ and $d_2^s$ have been shown to be highly correlated with the evolutionary distances
estimated by alignment-based methods, this correlation is not linear, and it is thus necessary
to find a transformation that maps $d_2^*$ and $d_2^s$ to the corresponding evolutionary
distances.
Chapter 6
Supplementary materials
Supplementary materials for Chapters 2, 3 and 4 are attached.
6.1 Chapter 2 Supplementary materials
New Background Adjusted Alignment-free Dissimilarity
Measures Improve the Detection of Horizontal Gene Transfer:
Supplementary Material
Kujin Tang¹, Yang Young Lu¹, and Fengzhu Sun¹,²,*
¹ Molecular and Computational Biology Program, Department of Biological
Sciences, University of Southern California, CA, USA
² Centre for Computational Systems Biology, School of Mathematical Sciences,
Fudan University, Shanghai, China
* To whom correspondence should be addressed. Tel: +1 (213)-740-2413; Fax: +1 (213)-740-8631; Email:
fsun@usc.edu
Host Start End Donor Start End Length
E. coli 345278 380961 M. tuberculosis 3905301 3940984 35684
E. coli 388702 405082 S. pneumoniae 1899095 1915475 16381
E. coli 523074 536488 H. influenzae 1570632 1584046 13415
E. coli 609097 644659 M. tuberculosis 56261 91823 35563
E. coli 880226 906410 S. pneumoniae 69014 95198 26185
E. coli 1102764 1114083 M. tuberculosis 1354709 1366028 11320
E. coli 1314037 1352941 H. pylori 125023 163927 38905
E. coli 1363730 1378984 H. pylori 499545 514799 15255
E. coli 1412972 1442804 M. tuberculosis 2659436 2689268 29833
E. coli 1563095 1595311 B. subtilis 2750746 2782962 32217
E. coli 2133106 2153111 B. subtilis 2472809 2492814 20006
E. coli 2418973 2431888 M. tuberculosis 3267094 3280009 12916
E. coli 2661049 2686710 B. subtilis 2471084 2496745 25662
E. coli 2739292 2762176 S. pneumoniae 3745 26629 22885
E. coli 2964060 2975795 M. tuberculosis 2700695 2712430 11736
E. coli 3049010 3058543 B. subtilis 2069861 2079394 9534
E. coli 3196491 3216327 B. subtilis 3287257 3307093 19837
E. coli 3865131 3873654 H. pylori 299584 308107 8524
E. coli 4148122 4188039 H. influenzae 1208124 1248041 39918
E. coli 4260138 4279066 H. influenzae 435547 454475 18929
E. coli 4536358 4549293 S. pneumoniae 1261210 1274145 12936
E. coli 4909481 4923174 H. influenzae 1336445 1350138 13694
Table S1: Detailed composition of one of the 10 E. coli artificial genomes. The second
and third columns are the start and end positions of each insertion in the E. coli artificial
genome. The fifth and sixth columns are the start and end positions of the fragment in the donor genome.
The seventh column is the length of the transferred fragment.
Donor Distance CVT(3) CVT(4) d2*(3,1) d2*(4,1) Ma(5) Eu(5) d2(5)
A. ferrooxidans 0.205 0.71±0.01 0.36±0.02 0.70±0.01 0.68±0.01 0.68±0.01 0.63±0.03 0.66±0.01
K. pneumoniae 0.223 0.74±0.01 0.46±0.02 0.72±0.02 0.69±0.01 0.72±0.01 0.68±0.01 0.69±0.01
B. goodwinii 0.237 0.76±0.01 0.57±0.03 0.74±0.01 0.70±0.01 0.71±0.01 0.67±0.01 0.68±0.01
E. hermannii 0.245 0.77±0.01 0.51±0.02 0.76±0.01 0.72±0.01 0.71±0.01 0.68±0.01 0.69±0.01
E. cloacae 0.258 0.77±0.01 0.47±0.02 0.77±0.01 0.73±0.01 0.75±0.01 0.71±0.01 0.74±0.01
E. vulneris 0.258 0.81±0.01 0.47±0.02 0.77±0.01 0.73±0.01 0.75±0.00 0.72±0.02 0.73±0.01
P. ananatis 0.271 0.79±0.01 0.56±0.04 0.77±0.01 0.73±0.01 0.75±0.01 0.71±0.01 0.74±0.02
S. typhimurium 0.273 0.79±0.01 0.55±0.02 0.78±0.01 0.73±0.01 0.75±0.01 0.71±0.01 0.73±0.02
E. coli 0.308 0.80±0.01 0.51±0.04 0.78±0.02 0.73±0.01 0.77±0.02 0.74±0.02 0.76±0.01
S. sonnei 0.315 0.82±0.02 0.54±0.02 0.78±0.01 0.74±0.01 0.77±0.01 0.74±0.01 0.77±0.01
X. axonopodis 0.324 0.78±0.02 0.54±0.03 0.71±0.02 0.67±0.01 0.75±0.02 0.75±0.02 0.68±0.02
E. albertii 0.332 0.81±0.02 0.55±0.02 0.78±0.01 0.74±0.01 0.78±0.01 0.75±0.01 0.77±0.01
E. fergusonii 0.334 0.82±0.02 0.50±0.01 0.79±0.01 0.74±0.01 0.78±0.00 0.75±0.01 0.77±0.01
P. aeruginosa 0.392 0.78±0.02 0.47±0.03 0.68±0.02 0.63±0.01 0.83±0.01 0.83±0.02 0.75±0.01
Y. pestis 0.414 0.87±0.01 0.70±0.03 0.85±0.01 0.83±0.01 0.86±0.02 0.85±0.01 0.85±0.02
V. parahaemolyticus 0.475 0.90±0.01 0.84±0.01 0.90±0.01 0.90±0.01 0.94±0.01 0.94±0.01 0.94±0.02
B. pseudomallei 0.502 0.84±0.01 0.52±0.02 0.82±0.02 0.77±0.01 0.88±0.02 0.90±0.03 0.79±0.02
P. luminescens 0.531 0.86±0.02 0.70±0.02 0.84±0.01 0.83±0.01 0.92±0.02 0.91±0.02 0.91±0.02
L. pneumophila 0.646 0.92±0.01 0.73±0.01 0.93±0.01 0.92±0.01 0.97±0.01 0.95±0.02 0.93±0.00
C. coli 0.877 0.97±0.01 0.87±0.01 0.95±0.01 0.95±0.01 0.97±0.00 0.96±0.00 0.90±0.01
Table S2: Performances of different methods over artificial genomes with B. abortus as host
genome and different donor genomes. Values in the second column are the Manhattan distance
between the donor genome and B. abortus based on tetranucleotide frequency. The third to
the ninth columns are the optimal F1-scores of different methods over different artificial
genomes.
Donor Distance CVT(3) CVT(4) d2*(3,1) d2*(4,1) Ma(5) Eu(5) d2(5)
E. vulneris 0.174 0.33±0.01 0.18±0.01 0.33±0.02 0.31±0.02 0.16±0.02 0.14±0.02 0.20±0.02
E. cloacae 0.223 0.23±0.01 0.29±0.01 0.26±0.01 0.30±0.01 0.17±0.01 0.16±0.01 0.25±0.01
E. hermannii 0.541 0.19±0.01 0.36±0.01 0.20±0.01 0.27±0.01 0.16±0.01 0.13±0.02 0.23±0.01
P. ananatis 0.880 0.22±0.01 0.45±0.01 0.22±0.01 0.32±0.02 0.18±0.02 0.16±0.02 0.28±0.02
B. goodwinii 0.121 0.32±0.02 0.43±0.02 0.32±0.02 0.36±0.02 0.22±0.02 0.20±0.02 0.29±0.02
A. ferrooxidans 0.228 0.64±0.01 0.52±0.01 0.60±0.01 0.58±0.01 0.32±0.01 0.26±0.02 0.37±0.01
S. typhimurium 0.256 0.26±0.02 0.31±0.01 0.27±0.02 0.33±0.02 0.24±0.02 0.21±0.02 0.31±0.02
B. abortus 0.140 0.76±0.01 0.73±0.02 0.74±0.01 0.74±0.01 0.41±0.01 0.28±0.07 0.44±0.01
E. coli 0.087 0.30±0.01 0.32±0.02 0.31±0.01 0.38±0.01 0.30±0.02 0.27±0.01 0.40±0.01
S. sonnei 0.624 0.30±0.02 0.34±0.01 0.32±0.02 0.39±0.02 0.30±0.02 0.27±0.02 0.41±0.02
E. fergusonii 0.348 0.34±0.02 0.33±0.01 0.35±0.01 0.41±0.01 0.35±0.02 0.32±0.03 0.46±0.02
E. albertii 0.183 0.37±0.02 0.34±0.02 0.37±0.02 0.43±0.02 0.34±0.01 0.31±0.01 0.45±0.02
X. axonopodis 0.231 0.82±0.01 0.68±0.01 0.75±0.01 0.69±0.01 0.58±0.02 0.53±0.03 0.47±0.01
Y. pestis 0.447 0.55±0.01 0.61±0.01 0.54±0.01 0.59±0.01 0.55±0.02 0.49±0.02 0.61±0.02
P. aeruginosa 0.316 0.84±0.01 0.56±0.01 0.79±0.01 0.67±0.01 0.60±0.02 0.56±0.01 0.47±0.00
V. parahaemolyticus 0.260 0.84±0.01 0.84±0.01 0.86±0.01 0.86±0.01 0.79±0.01 0.76±0.01 0.79±0.01
P. luminescens 0.340 0.62±0.01 0.63±0.01 0.60±0.01 0.64±0.01 0.74±0.03 0.73±0.03 0.75±0.03
B. pseudomallei 0.476 0.92±0.01 0.69±0.01 0.89±0.01 0.87±0.01 0.80±0.02 0.85±0.02 0.65±0.02
L. pneumophila 0.161 0.79±0.01 0.74±0.02 0.78±0.01 0.82±0.01 0.93±0.01 0.91±0.03 0.93±0.01
C. coli 0.168 0.97±0.00 0.90±0.01 0.97±0.01 0.96±0.01 0.97±0.00 0.96±0.00 0.97±0.00
Table S3: Performances of different methods over artificial genomes with K. pneumoniae
as host genome and different donor genomes. Values in the second column are the Manhattan
distance between the donor genome and K. pneumoniae based on tetranucleotide frequency.
The third to the ninth columns are the optimal F1-scores of different methods over different
artificial genomes.
6.2 Chapter 3 Supplementary materials
Alignment-free Genome Comparison Enables
Accurate Geographic Sourcing of White Oak
DNA:
Supplementary Material
Kujin Tang¹, Jie Ren¹, Richard Cronn²,*, David L. Erickson³, Brook G. Milligan⁴,
Meaghan Parker-Forney⁵, John L. Spouge⁶, Fengzhu Sun¹,⁷,*
Details on the definitions of six alignment-free
distance/dissimilarity measures between two genomes
based on NGS data
Given two NGS data sets $i$ and $j$ from different samples and a given word length
$k$, we first count the number of occurrences of all k-mers in all reads of sample
$i$ and sample $j$, respectively. The full set of k-mers of length $k$ is defined as
$\mathcal{A}^k$ where $\mathcal{A} = (A, T, C, G)$ for nucleotide sequences. For a given k-mer $w$, its
number of occurrences in data set $i$ is defined as $N_w^{(i)}$ and the frequency or the
relative abundance of this k-mer is defined as

$$f_w^{(i)} = \frac{N_w^{(i)}}{\sum_{w} N_w^{(i)}}.$$
In this study, we consider six distance/dissimilarity measures between two
samples based on NGS data. These include the traditional Manhattan, Euclidean,
and $d_2$ [1] distances between the frequencies of the word patterns.
The Manhattan distance (Ma) is defined as:

$$\mathrm{Ma} = \sum_{w \in \mathcal{A}^k} \left| f_w^{(i)} - f_w^{(j)} \right|.$$

The Euclidean distance (Eu) is defined as:

$$\mathrm{Eu} = \sqrt{\sum_{w \in \mathcal{A}^k} \left( \left| f_w^{(i)} - f_w^{(j)} \right| \right)^2}.$$

The $d_2$ distance is defined as:

$$d_2 = \frac{1}{2}\left(1 - \frac{\sum_{w \in \mathcal{A}^k} f_w^{(i)} f_w^{(j)}}{\sqrt{\sum_{w \in \mathcal{A}^k} \big(f_w^{(i)}\big)^2}\ \sqrt{\sum_{w \in \mathcal{A}^k} \big(f_w^{(j)}\big)^2}}\right).$$
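The three frequency-based measures can be computed directly from k-mer counts, as in the sketch below. This is a small illustrative implementation with hypothetical function names, not the CAFE code used in this study; the two single-read "samples" are toy inputs.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(reads, k):
    """Relative abundance of every k-mer observed in the reads of one sample."""
    counts = Counter(read[i:i + k] for read in reads for i in range(len(read) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def manhattan(f_i, f_j):
    words = set(f_i) | set(f_j)
    return sum(abs(f_i.get(w, 0.0) - f_j.get(w, 0.0)) for w in words)


def euclidean(f_i, f_j):
    words = set(f_i) | set(f_j)
    return sqrt(sum((f_i.get(w, 0.0) - f_j.get(w, 0.0)) ** 2 for w in words))


def d2(f_i, f_j):
    """0.5 * (1 - cosine similarity of the two k-mer frequency vectors)."""
    dot = sum(f * f_j.get(w, 0.0) for w, f in f_i.items())
    norm_i = sqrt(sum(f * f for f in f_i.values()))
    norm_j = sqrt(sum(f * f for f in f_j.values()))
    return 0.5 * (1.0 - dot / (norm_i * norm_j))


# Toy samples: two of the three 2-mers are shared.
f_i = kmer_frequencies(["ACGT"], k=2)   # AC, CG, GT
f_j = kmer_frequencies(["ACGA"], k=2)   # AC, CG, GA
ma, eu, d2_value = manhattan(f_i, f_j), euclidean(f_i, f_j), d2(f_i, f_j)
print(ma, eu, d2_value)
```

k-mers absent from a sample are treated as having frequency zero, so the sums effectively run over the whole of $\mathcal{A}^k$ without enumerating it.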
We also investigate three recently developed background adjusted dissimilarity
measures including CVTree [2], $d_2^*$ and $d_2^s$ [3, 4, 5, 6]. We model the background
DNA sequence of a sample using an m-th order Markov chain, where the order $m$
is estimated using the method developed for NGS short read data [4]. The
expected number of occurrences of word $w$, $EN_w^{(i)}$, can be calculated from the
stationary probability of the first m-mer $w[1:m]$ and the transition probabilities
from the n-th m-mer $w[n : n+m-1]$ to the (n+m)-th nucleotide $w[n+m]$:

$$EN_w^{(i)} \approx L^{(i)}\, \mu(w[1:m]) \prod_{n=1}^{k-m} \pi\big(w[n : n+m-1],\ w[n+m]\big)$$

where $L^{(i)}$ equals the sum of the lengths of all reads in the i-th data set
minus $(m-1)R$, where $R$ is the total number of reads, $\mu$ is the stationary
probability distribution, and $\pi$ is the transition probability distribution; both can
be estimated from the data. The difference between the number of occurrences of
k-mer $w$ and its expected number of occurrences is defined as
$\tilde{N}_w^{(i)} = N_w^{(i)} - EN_w^{(i)}$,
which we refer to as the background adjusted k-mer count. The CVTree, $d_2^*$ and
$d_2^s$ dissimilarity measures are defined as follows.
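The CVTree dissimilarity is defined as:

$$\mathrm{CVTree} = \frac{1}{2}\left(1 - \frac{\sum_{w \in \mathcal{A}^k} \hat{f}_w^{(i)} \hat{f}_w^{(j)}}{\sqrt{\sum_{w \in \mathcal{A}^k} \big(\hat{f}_w^{(i)}\big)^2}\ \sqrt{\sum_{w \in \mathcal{A}^k} \big(\hat{f}_w^{(j)}\big)^2}}\right),$$

where $\hat{f}_w^{(i)} = \tilde{N}_w^{(i)} / EN_w^{(i)}$. CVTree calculates $EN_w^{(i)}$ by assuming a (k − 2)-th order
Markov chain for genomic sequences. The $d_2^*$ dissimilarity is defined as:

$$d_2^* = \frac{1}{2}\left(1 - \frac{\sum_{w \in \mathcal{A}^k} \bar{f}_w^{(i)} \bar{f}_w^{(j)}}{\sqrt{\sum_{w \in \mathcal{A}^k} \big(\bar{f}_w^{(i)}\big)^2}\ \sqrt{\sum_{w \in \mathcal{A}^k} \big(\bar{f}_w^{(j)}\big)^2}}\right),$$

where $\bar{f}_w^{(i)} = \tilde{N}_w^{(i)} / \sqrt{EN_w^{(i)}}$. The $d_2^s$ dissimilarity is defined as:

$$d_2^s = \frac{1}{2}\left(1 - \frac{\sum_{w \in \mathcal{A}^k} \tilde{f}_w^{(i)} \tilde{f}_w^{(j)}}{\sqrt{\sum_{w \in \mathcal{A}^k} \big(\tilde{f}_w^{(i)}\big)^2}\ \sqrt{\sum_{w \in \mathcal{A}^k} \big(\tilde{f}_w^{(j)}\big)^2}}\right),$$

where

$$\tilde{f}_w^{(i)} = \frac{\tilde{N}_w^{(i)}}{\Big(\big(\tilde{N}_w^{(i)}\big)^2 + \big(\tilde{N}_w^{(j)}\big)^2\Big)^{1/4}} \quad \text{and} \quad \tilde{f}_w^{(j)} = \frac{\tilde{N}_w^{(j)}}{\Big(\big(\tilde{N}_w^{(i)}\big)^2 + \big(\tilde{N}_w^{(j)}\big)^2\Big)^{1/4}}.$$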
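Given observed and expected k-mer counts, the three background adjusted measures share the same cosine-type form and differ only in how the adjusted counts are normalized. The sketch below illustrates this with NumPy; it assumes the expected counts are supplied by the caller (the Markov model estimation is not shown), uses a hypothetical function name and toy count vectors, and treats words whose adjusted counts are both zero as contributing zero to $d_2^s$.

```python
import numpy as np


def _half_one_minus_cosine(u, v):
    return 0.5 * (1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def background_adjusted_dissimilarities(n_i, en_i, n_j, en_j):
    """CVTree, d2* and d2S from observed counts (n) and expected counts (en)."""
    nt_i, nt_j = n_i - en_i, n_j - en_j                 # background adjusted counts
    cvtree = _half_one_minus_cosine(nt_i / en_i, nt_j / en_j)
    d2_star = _half_one_minus_cosine(nt_i / np.sqrt(en_i), nt_j / np.sqrt(en_j))
    denom = (nt_i ** 2 + nt_j ** 2) ** 0.25
    safe = np.where(denom > 0, denom, 1.0)              # avoid 0/0 when both adjusted counts are 0
    d2_s = _half_one_minus_cosine(nt_i / safe, nt_j / safe)
    return {"CVTree": cvtree, "d2_star": d2_star, "d2_s": d2_s}


# Toy counts over four words with equal expected counts.
en = np.full(4, 10.0)
n_i = np.array([12.0, 8.0, 10.0, 10.0])
n_j = np.array([8.0, 12.0, 10.0, 10.0])

same = background_adjusted_dissimilarities(n_i, en, n_i, en)
opposite = background_adjusted_dissimilarities(n_i, en, n_j, en)
print(same, opposite)
```

In the toy example, a sample compared with itself gives values of (numerically) zero, while two samples whose deviations from the background point in opposite directions give the maximum value of one.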
The estimated order of the Markov chains for the white oak trees based
on the NGS data using the method in [4] is 10. Therefore, we used k = 12
and Markov order 10 to calculate $d_2^*$, $d_2^s$, and CVTree between any pair of
the tree samples. For comparison, we also used k = 12 in the calculation of
Manhattan, Euclidean and $d_2$. All the calculations of the pairwise dissimilarity
values were carried out using the software package CAFE [7], a user-friendly
and efficient package for calculating 28 alignment-free sequence dissimilarity
measures. Other calculations and plots were carried out using R [8], a widely
used statistical package.
Supplementary Tables
Study | # of samples | Sequence platform | Genomic library preparation | Read length (bp) | NCBI BioProject or SRA accessions
White oak reference dataset | 92 | Illumina HiSeq2000 | Tru-Seq, total genomic DNA | Single-end, 101 bp | BioProject PRJNA269970
California Valley white oak | 9 | Illumina HiSeq2500 | Nextera long-insert Mate-pair, total genomic DNA | Paired-end, 250 bp | BioProject PRJNA308314
California Valley white oak | 2 | Illumina HiSeq2500 | Nextera Mate-pair short-insert (550 bp), one PCR-free, one PCR-enriched, total genomic DNA | Paired-end, 150 bp per end | BioProject PRJNA308314
Swiss Pedunculate white oak | 8 | Illumina HiSeq2000 | Nextera Mate-pair short-insert (400 bp), total genomic DNA | Paired-end, 100 bp | BioProject PRJNA327502
Swiss Pedunculate white oak | 22 | PacBio-SMRT | Nextera Mate-pair long-insert (3,000 bp), total genomic DNA | Single-end, 2,489-7,622 bp | BioProject PRJNA327502
RAD-Seq data from white oaks | 5 | Illumina HiSeq2500 | RAD-Seq with PstI selection; samples independent from the reference trees | Single-end, 91 bp | Q. bicolor, SRR5632514; Q. stellata, SRR5632513; Q. lobata, SRR5632586; Q. robur, SRR5632600; Q. dentata, SRR5632587
RAD-Seq data from white oaks | 2 | Illumina HiSeq2500 | RAD-Seq with PstI selection; samples identical to two reference trees | Single-end, 91 bp | Q. mongolica, SRR5284345; Q. petraea, SRR5284338
Table S1: Library construction, sequencing, and genome enrichment methods
used for all DNA libraries in this study.
References
[1] Torney DC, Burks C, Davison D, Sirotkin KM. Computation of d2: a
measure of sequence dissimilarity. In: Computers and DNA: the proceedings
of the Interface between Computation Science and Nucleic Acid Sequencing
Workshop, held December 12 to 16, 1988 in Santa Fe, New Mexico/edited
by George I. Bell, Thomas G. Marr. Redwood City, Calif.: Addison-Wesley
Pub. Co.; 1990.
[2] Qi J, Luo H, Hao B. CVTree: a phylogenetic tree reconstruction tool based
on whole genomes. Nucleic Acids Research. 2004;32(suppl 2):W45–W47.
[3] Song K, Ren J, Zhai Z, Liu X, Deng M, Sun F. Alignment-free sequence
comparison based on next-generation sequencing reads. Journal of
Computational Biology. 2013;20(2):64–79.
[4] Ren J, Song K, Deng M, Reinert G, Cannon CH, Sun F. Inference of
Markovian properties of molecular sequences from NGS data and applications to
comparative genomics. Bioinformatics. 2015;32(7):993–1000.
[5] Wan L, Reinert G, Sun F, Waterman MS. Alignment-free sequence
comparison (II): theoretical power of comparison statistics. Journal of
Computational Biology. 2010;17(11):1467–1490.
[6] Reinert G, Chew D, Sun F, Waterman MS. Alignment-free sequence
comparison (I): statistics and power. Journal of Computational Biology.
2009;16(12):1615–1634.
[7] Lu YY, Tang K, Ren J, Fuhrman JA, Waterman MS, Sun F. CAFE:
aCcelerated Alignment-FrEe sequence analysis. Nucleic Acids Research. 2017;p.
gkx351.
[8] Team RC, et al. R: A language and environment for statistical computing.
R Foundation for Statistical Computing, Vienna, Austria. 2013.
[Figure S1(a) panels: (a) PCoA of 50M samples by Eu; (b) PCoA of 100M samples by Eu; (c) PCoA of 300M samples by Eu. Axes: PCoA coordinate1 (horizontal), PCoA coordinate2 (vertical). Legend: North_America, Europe, Asia, Outliers_Russia, Outliers_Belarus.]
Figure S1(a). The two-dimensional principal coordinate (PCoA) plots of the
95 tree samples based on the Euclidean distance of the samples for different
sequence quantities of 50, 100 and 300 Mbp, respectively. Three outliers,
SRR2053124 [Q. robur], SRR2053125 [Q. robur], SRR2053082 [Q. dentata], were
identified. However, the other samples cluster together.
[Figure S1(b) panels: (a) PCoA of 50M samples by Ma; (b) PCoA of 100M samples by Ma; (c) PCoA of 300M samples by Ma. Axes: PCoA coordinate1 (horizontal), PCoA coordinate2 (vertical). Legend: North_America, Europe, Asia, Outliers_Russia, Outliers_Belarus.]
Figure S1(b). The two-dimensional principal coordinate (PCoA) plots of the
95 tree samples based on the Manhattan distance of the samples for different
sequence quantities of 50, 100 and 300 Mbp, respectively. Three outliers,
SRR2053124 [Q. robur], SRR2053125 [Q. robur], SRR2053082 [Q. dentata], were
identified. Several other outliers are also observed.
[Figure S1(c) panels: (a) PCoA of 50M samples by $d_2$; (b) PCoA of 100M samples by $d_2$; (c) PCoA of 300M samples by $d_2$. Axes: PCoA coordinate1 (horizontal), PCoA coordinate2 (vertical). Legend: North_America, Europe, Asia, Outliers_Russia, Outliers_Belarus.]
Figure S1(c). The two-dimensional principal coordinate (PCoA) plots of the 95
tree samples based on the $d_2$ dissimilarity of the samples for different sequence
quantities of 50, 100 and 300 Mbp, respectively. Three outliers, SRR2053124
[Q. robur], SRR2053125 [Q. robur], SRR2053082 [Q. dentata], were identified.
Several other outliers are also observed.
[Figure S1(d) panels: (a) PCoA of 50M samples by CVTree; (b) PCoA of 100M samples by CVTree; (c) PCoA of 300M samples by CVTree. Axes: PCoA coordinate1 (horizontal), PCoA coordinate2 (vertical). Legend: North_America, Europe, Asia, Outliers_Russia, Outliers_Belarus.]
Figure S1(d). The two-dimensional principal coordinate (PCoA) plots of the
95 tree samples based on the CVTree dissimilarity of the samples for different
sequence quantities of 50, 100 and 300 Mbp, respectively. Three outliers,
SRR2053124 [Q. robur], SRR2053125 [Q. robur], SRR2053082 [Q. dentata], were
identified. At a sequence quantity of 50 Mbp, several other outliers are also
observed.
[Figure S1(e) panels: (a) PCoA of 50M samples by $d_2^s$; (b) PCoA of 100M samples by $d_2^s$; (c) PCoA of 300M samples by $d_2^s$. Axes: PCoA coordinate1 (horizontal), PCoA coordinate2 (vertical). Legend: North_America, Europe, Asia, Outliers_Russia, Outliers_Belarus.]
Figure S1(e). The two-dimensional principal coordinate (PCoA) plots of the 95
tree samples based on the $d_2^s$ dissimilarity of the samples for different sequence
quantities of 50, 100 and 300 Mbp, respectively. Three outliers, SRR2053124
[Q. robur], SRR2053125 [Q. robur], SRR2053082 [Q. dentata], were identified.
[Figure S1(f) panels: (a) PCoA of 50M samples by $d_2^*$; (b) PCoA of 100M samples by $d_2^*$; (c) PCoA of 300M samples by $d_2^*$. Axes: PCoA coordinate1 (horizontal), PCoA coordinate2 (vertical). Legend: North_America, Europe, Asia, Outliers_Russia, Outliers_Belarus.]
Figure S1(f). The two-dimensional principal coordinate (PCoA) plots of the 95
tree samples based on the $d_2^*$ dissimilarity of the samples for different sequence
quantities of 50, 100 and 300 Mbp, respectively. Three outliers, SRR2053124
[Q. robur], SRR2053125 [Q. robur], SRR2053082 [Q. dentata], were identified.
[Figure panels: (a) $d_2^*$; (b) $d_2$; (c) Euclidean; (d) $d_2^s$. Each panel is labeled with the 92 sample identifiers 2053033 to 2053131.]
2053078
2053079
2053081
2053083
2053084
2053085
2053086
2053087
2053088
2053089
2053090
2053091
2053092
2053093
2053094
2053095
2053096
2053097
2053098
2053099
2053100
2053101
2053102
2053103
2053104
2053105
2053106
2053107
2053108
2053109
2053110
2053111
2053112
2053113
2053114
2053115
2053116
2053117
2053118
2053119
2053120
2053121
2053122
2053126
2053127
2053128
2053129
2053130
2053131
(e) CVtree
2053033
2053034
2053035
2053036
2053037
2053038
2053039
2053040
2053041
2053042
2053043
2053044
2053045
2053046
2053047
2053048
2053049
2053050
2053051
2053052
2053053
2053054
2053055
2053056
2053057
2053058
2053059
2053061
2053062
2053063
2053064
2053065
2053067
2053068
2053069
2053070
2053071
2053072
2053073
2053074
2053075
2053076
2053077
2053078
2053079
2053081
2053083
2053084
2053085
2053086
2053087
2053088
2053089
2053090
2053091
2053092
2053093
2053094
2053095
2053096
2053097
2053098
2053099
2053100
2053101
2053102
2053103
2053104
2053105
2053106
2053107
2053108
2053109
2053110
2053111
2053112
2053113
2053114
2053115
2053116
2053117
2053118
2053119
2053120
2053121
2053122
2053126
2053127
2053128
2053129
2053130
2053131
NA AS EU
(f) Manhattan
Figure S2(a). The circular plots of 92 white oak tree samples based on the six
dissimilarity measures $d_2^*$, $d_2^S$, $d_2$, CVTree, Euclidean, and Manhattan, using
50 Mbp of next-generation sequencing data. Different sectors correspond to
different continents, with NA in red, EU in orange and AS in blue. Within
each sector, samples are sorted by their longitude, so that samples that are
geographically close are also close to each other in the figure. The most similar
tree samples to each sample are linked. The k-mer length is 12 and the Markov
order of the background sequence is 10 for $d_2^*$, $d_2^S$, and CVTree. The most similar
samples to each sample according to $d_2^*$ and $d_2^S$ are from the same
continent of origin.
[Figure S2(b): six circular plots, each labelled around the circle with the numeric sample identifiers 2053033 to 2053131 and divided into sectors NA, AS and EU. Panels: (a) $d_2^*$; (b) $d_2$; (c) Euclidean; (d) $d_2^S$; (e) CVTree; (f) Manhattan.]
Figure S2(b). The circular plots of 92 white oak tree samples based on the six
dissimilarity measures $d_2^*$, $d_2^S$, $d_2$, CVTree, Euclidean, and Manhattan, using
300 Mbp of next-generation sequencing data. Different sectors correspond to
different continents, with NA in red, EU in orange and AS in blue. Within
each sector, samples are sorted by their longitude, so that samples that are
geographically close are also close to each other in the figure. The most similar
tree samples to each sample are linked. The k-mer length is 12 and the Markov
order of the background sequence is 10 for $d_2^*$, $d_2^S$, and CVTree. The most
similar samples to each sample according to Manhattan, $d_2^*$ and $d_2^S$ are from the
same continent of origin.
[Figure S3(a): three scatter plots of PCoA coordinate 1, PCoA coordinate 2 and PCoA coordinate 3 (y-axes) against Longitude (x-axis, −100 to 150). Legend: North_America, Europe, Asia.]
Figure S3(a). The relationship between the first three principal coordinates
and the longitude of the tree samples, based on the $d_2^*$ dissimilarity values using a
sequencing quantity of 100 Mbp. The k-mer length is 12 and the Markov order
of the background sequence is 10. The first principal coordinate separates the
North American tree samples from the European and Asian tree samples, and the
third principal coordinate separates the European samples from the Asian samples.
The second principal coordinates of most of the Asian samples are larger than
those of the European samples. However, the second principal coordinate does not
separate them.
[Figure S3(b): three scatter plots of PCoA coordinate 1, PCoA coordinate 2 and PCoA coordinate 3 (y-axes) against Latitude (x-axis, 30 to 60). Legend: North_America, Europe, Asia.]
Figure S3(b). The relationship between the first three principal coordinates and the
latitude of the tree samples, based on the $d_2^*$ dissimilarity values using a sequencing
quantity of 100 Mbp. The k-mer length is 12 and the Markov order of the
background sequence is 10. Latitude is not associated with any of the
first three principal coordinates.
[Figure S4(a): box plots of $d_2^*$ dissimilarity (y-axis: Distance) by region pair (x-axis: Region), each box annotated with its number of pairs N and quartile values. Panels: (a) AS_AS, AS_EU, AS_NA (p = 0.089 and p < 1e−7); (b) EU_EU, AS_EU, NA_EU (p = 3.1e−6 and p < 1e−7); (c) NA_NA, NA_EU, AS_NA (p = 0.035 and p < 1e−7).]
Figure S4(a). Comparison of intra- and inter-continental $d_2^*$ dissimilarities with a
sequence quantity of 50 Mbp. The k-mer length is 12 and the Markov order
of the background sequence is 10. The p-values were calculated based on the
Wilcoxon-Mann-Whitney test statistic by permuting the continental labels
of the white oak tree samples $10^7$ times. The inter-continental $d_2^*$ dissimilarities
are significantly higher than the intra-continental $d_2^*$ dissimilarities.
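The label-permutation procedure described in this caption can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the thesis: the function names, the tie handling in the U statistic, the one-sided comparison, and the "+1" correction in the p-value are all choices made here, and `n_perm` is kept far below the $10^7$ permutations reported above.

```python
import random

def mann_whitney_u(xs, ys):
    """U statistic: number of (x, y) pairs with x > y, ties counted as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

def split_distances(dist, labels, group_a, group_b):
    """Return (inter, intra) for two different groups.

    inter: distances between one group_a sample and one group_b sample.
    intra: distances between two group_a samples.
    """
    inter, intra = [], []
    n = len(labels)
    for p in range(n):
        for q in range(p + 1, n):
            pair = {labels[p], labels[q]}
            if pair == {group_a, group_b}:
                inter.append(dist[p][q])
            elif pair == {group_a}:
                intra.append(dist[p][q])
    return inter, intra

def permutation_p_value(dist, labels, group_a, group_b, n_perm=1000, seed=0):
    """One-sided p-value for 'inter-group distances exceed intra-group distances'."""
    rng = random.Random(seed)
    observed = mann_whitney_u(*split_distances(dist, labels, group_a, group_b))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        # Permute the continental labels while keeping the distance matrix fixed.
        rng.shuffle(shuffled)
        inter, intra = split_distances(dist, shuffled, group_a, group_b)
        if mann_whitney_u(inter, intra) >= observed:
            hits += 1
    # Assumed convention: add one so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Because shuffling the labels preserves the group sizes, the permuted U statistics are directly comparable with the observed one.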
[Figure S4(b): box plots of $d_2^*$ dissimilarity (y-axis: Distance) by region pair (x-axis: Region), each box annotated with its number of pairs N and quartile values. Panels: (a) AS_AS, AS_EU, AS_NA; (b) EU_EU, AS_EU, NA_EU; (c) NA_NA, NA_EU, AS_NA. All annotated p-values are p < 1e−7.]
Figure S4(b). Comparison of intra- and inter-continental $d_2^*$ dissimilarities with a
sequence quantity of 300 Mbp. The k-mer length is 12 and the Markov order
of the background sequence is 10. The p-values were calculated based on the
Wilcoxon-Mann-Whitney test statistic by permuting the continental labels
of the white oak tree samples $10^7$ times. The inter-continental $d_2^*$ dissimilarities
are significantly higher than the intra-continental $d_2^*$ dissimilarities.
[Figure S5(a): three circular plots, panels (a), (b) and (c). Each plot shows the reference samples 2053033 to 2053131 in sectors NA, AS and EU, together with the independent samples: 3244044 to 3244054 in panel (a); 3860149 to 3884564 in panel (b); 5284338 to 5632600 in panel (c).]
Figure S5(a). The circular plots for independent samples sequenced using (a)
Illumina NGS of a California Valley Oak tree, (b) a mixture of short and long
reads from both Illumina and PacBio sequencing of the Pedunculate Oak
tree, and (c) RAD-seq of seven diverse tree samples. The $d_2^S$ dissimilarity
between each independent sample and each of the 92 reference samples was calculated,
and the two most similar reference samples are linked.
[Figure S5(b): three circular plots, panels (a), (b) and (c). Each plot shows the reference samples 2053033 to 2053131 in sectors NA, AS and EU, together with the independent samples: 3244044 to 3244054 in panel (a); 3860149 to 3884564 in panel (b); 5284338 to 5632600 in panel (c).]
Figure S5(b). The circular plots for independent samples sequenced using (a)
Illumina NGS of a California Valley Oak tree, (b) a mixture of short and long
reads from both Illumina and PacBio sequencing of the Pedunculate Oak
tree, and (c) RAD-seq of seven diverse tree samples. The Manhattan distance
between each independent sample and each of the 92 reference samples was
calculated, and the two most similar reference samples are linked.
6.3 Chapter 4 Supplementary materials
Afann: bias adjustment for alignment-free sequence comparison based on sequencing data using neural network regression: Supplementary Material
Kujin Tang¹, Jie Ren¹, and Fengzhu Sun∗¹
¹ Quantitative and Computational Biology, Department of Biological Sciences, University of Southern California, CA, USA
Appendix A: Alignment-free distance/dissimilarity measures
Given two genomic sequences or NGS samples i and j and a given word length k, we first
count the number of occurrences of all k-mers in sequence i and sequence j, respectively.
The full set of k-mers of length k is defined as $\mathcal{A}^k$, where $\mathcal{A} = (A, T, C, G)$ for nucleotide
sequences. For a given k-mer w, its number of occurrences in i is defined as $N_w^{(i)}$, and the frequency or
relative abundance of this k-mer is defined as

$$f_w^{(i)} = \frac{N_w^{(i)}}{\sum_{w' \in \mathcal{A}^k} N_{w'}^{(i)}}.$$
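As a minimal sketch of the counting step above, the following Python counts k-mers on a single strand and converts the counts to relative frequencies. The function names are hypothetical, and the handling of ambiguous bases (windows containing anything other than A, C, G, T are skipped) is an assumption of this sketch rather than something stated in the text.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def count_kmers(seq, k):
    """Count occurrences N_w of each k-mer w found in seq (single strand)."""
    counts = Counter()
    for start in range(len(seq) - k + 1):
        word = seq[start:start + k]
        # Assumption: skip windows that contain ambiguous bases such as N.
        if all(base in ALPHABET for base in word):
            counts[word] += 1
    return counts

def kmer_frequencies(counts, k):
    """Return f_w = N_w / sum of all N_w' for every one of the 4^k words."""
    total = sum(counts.values())
    words = ("".join(letters) for letters in product(ALPHABET, repeat=k))
    return {w: (counts[w] / total if total else 0.0) for w in words}

counts = count_kmers("ACGTACGT", 2)
freqs = kmer_frequencies(counts, 2)
```

Enumerating all $4^k$ words is only practical for small k; for the larger k used in the thesis, a sparse representation such as the `Counter` itself would be the more natural choice.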
Some dissimilarity measures such as $d_2^*$ and $d_2^s$ need an m-th order Markov model for
the background sequence. The expected number of occurrences of word w, $EN_w^{(i)}$, can be
calculated from the stationary probability of the first m-mer w[1 : m] and the transition
probabilities from the n-th m-mer w[n : n + m − 1] to the (n + m)-th nucleotide w[n + m]:

$$EN_w^{(i)} = (L^{(i)} - k + 1)\,\mu(w[1:m]) \prod_{n=1}^{k-m} \pi(w[n:n+m-1],\, w[n+m])$$

where $L^{(i)}$ is the length of sequence i, μ is the stationary probability and π is the transition
probability, both of which can be estimated from the sequence data. The difference between the
number of occurrences of k-mer w and its expected number of occurrences is defined as

$$\tilde{N}_w^{(i)} = N_w^{(i)} - EN_w^{(i)}.$$
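The expected-count formula can be illustrated with a short sketch. It assumes, purely for illustration, that μ is estimated by the observed m-mer frequency and that π is estimated as the ratio of (m+1)-mer counts; the text only says that both can be estimated from the sequence data, so these estimators and the function name `markov_expected_count` are assumptions. The sketch requires 1 ≤ m < k.

```python
from collections import Counter

def markov_expected_count(seq, word, m):
    """Expected occurrences EN_w of `word` in `seq` under an m-th order Markov model."""
    k = len(word)
    if not 1 <= m < k <= len(seq):
        raise ValueError("need 1 <= m < k <= len(seq)")
    m_counts = Counter(seq[i:i + m] for i in range(len(seq) - m + 1))
    m1_counts = Counter(seq[i:i + m + 1] for i in range(len(seq) - m))
    # Stationary probability mu(w[1:m]), estimated by the m-mer frequency.
    mu = m_counts[word[:m]] / sum(m_counts.values())
    expected = (len(seq) - k + 1) * mu
    for n in range(k - m):  # 0-based n corresponds to n = 1..k-m in the formula
        context = word[n:n + m]
        nxt = word[n + m]
        # Transition probability pi(context, nxt) from (m+1)-mer counts.
        denom = sum(m1_counts[context + base] for base in "ACGT")
        if denom == 0:
            return 0.0  # context never followed by a base in seq
        expected *= m1_counts[context + nxt] / denom
    return expected

seq = "ACGT" * 5
expected_acg = markov_expected_count(seq, "ACG", 1)
centred_acg = seq.count("ACG") - expected_acg  # N_w minus EN_w
```

In this toy sequence every A is followed by C and every C by G, so the first-order model predicts "ACG" almost as often as it actually occurs.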
∗ To whom correspondence should be addressed. Tel: +1 (213)-740-2413; Fax: +1 (213)-740-8631; Email: fsun@usc.edu
Manhattan
The Manhattan distance (Ma) is defined as:

$$Ma = \sum_{w \in \mathcal{A}^k} \left| f_w^{(i)} - f_w^{(j)} \right|$$
Euclidean
The Euclidean distance (Eu) is defined as:

$$Eu = \sqrt{\sum_{w \in \mathcal{A}^k} \left| f_w^{(i)} - f_w^{(j)} \right|^2}$$
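Both distances operate directly on the k-mer frequency vectors. A minimal sketch, assuming the frequencies are stored as dictionaries keyed by k-mer with missing words treated as frequency 0 (a representation chosen here for illustration):

```python
import math

def manhattan(f_i, f_j):
    """Ma: sum over all words of |f_w(i) - f_w(j)|."""
    words = set(f_i) | set(f_j)
    return sum(abs(f_i.get(w, 0.0) - f_j.get(w, 0.0)) for w in words)

def euclidean(f_i, f_j):
    """Eu: square root of the sum over all words of |f_w(i) - f_w(j)|^2."""
    words = set(f_i) | set(f_j)
    return math.sqrt(sum((f_i.get(w, 0.0) - f_j.get(w, 0.0)) ** 2 for w in words))

# Toy frequency vectors for k = 2 (values are made up for the example).
f_i = {"AA": 0.5, "AC": 0.5}
f_j = {"AA": 0.25, "AC": 0.25, "CA": 0.5}
ma = manhattan(f_i, f_j)
eu = euclidean(f_i, f_j)
```

Neither measure uses a background model, which is what distinguishes them from CVTree, $d_2^*$ and $d_2^s$.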
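CVTree [1]
The CVTree dissimilarity is defined as:

$$CVTree = \frac{1}{2}\left(1 - \frac{\sum_{w \in \mathcal{A}^k} \hat{f}_w^{(i)} \hat{f}_w^{(j)}}{\sqrt{\sum_{w \in \mathcal{A}^k} (\hat{f}_w^{(i)})^2}\,\sqrt{\sum_{w \in \mathcal{A}^k} (\hat{f}_w^{(j)})^2}}\right)$$

where $\hat{f}_w^{(i)} = \tilde{N}_w^{(i)} / EN_w^{(i)}$. CVTree calculates $EN_w^{(i)}$ by assuming a (k − 2)-th order Markov chain
for genomic sequences.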
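The CVTree formula is one half of one minus the cosine between the two vectors of relative deviations. A minimal sketch, assuming the observed and expected counts are supplied as lists over the same ordering of k-mers; setting $\hat{f}_w$ to 0 when the expected count is 0 is an assumption made here to avoid division by zero, and the function names are hypothetical.

```python
import math

def half_one_minus_cosine(x, y):
    """Return (1 - cos(x, y)) / 2 for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 0.5 * (1.0 - dot / (norm_x * norm_y))

def relative_deviation(observed, expected):
    """f_hat_w = (N_w - EN_w) / EN_w, with 0 where EN_w is 0 (assumption)."""
    return [(n - e) / e if e > 0 else 0.0 for n, e in zip(observed, expected)]

def cvtree(obs_i, exp_i, obs_j, exp_j):
    """CVTree dissimilarity between samples i and j."""
    return half_one_minus_cosine(relative_deviation(obs_i, exp_i),
                                 relative_deviation(obs_j, exp_j))

# Toy counts over two words; the expected counts are made-up numbers.
same = cvtree([3, 1], [2, 2], [3, 1], [2, 2])
opposite = cvtree([3, 1], [2, 2], [1, 3], [2, 2])
```

The value ranges from 0 (deviation vectors point the same way) to 1 (they point in opposite directions).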
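$d_2^*$ [2]
The $d_2^*$ dissimilarity is defined as:

$$d_2^* = \frac{1}{2}\left(1 - \frac{\sum_{w \in \mathcal{A}^k} \bar{f}_w^{(i)} \bar{f}_w^{(j)}}{\sqrt{\sum_{w \in \mathcal{A}^k} (\bar{f}_w^{(i)})^2}\,\sqrt{\sum_{w \in \mathcal{A}^k} (\bar{f}_w^{(j)})^2}}\right)$$

where $\bar{f}_w^{(i)} = \tilde{N}_w^{(i)} / \sqrt{EN_w^{(i)}}$.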
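$d_2^s$ [2]
The $d_2^s$ dissimilarity is defined as:

$$d_2^s = \frac{1}{2}\left(1 - \frac{\sum_{w \in \mathcal{A}^k} \tilde{f}_w^{(i)} \tilde{f}_w^{(j)}}{\sqrt{\sum_{w \in \mathcal{A}^k} (\tilde{f}_w^{(i)})^2}\,\sqrt{\sum_{w \in \mathcal{A}^k} (\tilde{f}_w^{(j)})^2}}\right)$$

where

$$\tilde{f}_w^{(i)} = \frac{\tilde{N}_w^{(i)}}{\left((\tilde{N}_w^{(i)})^2 + (\tilde{N}_w^{(j)})^2\right)^{1/4}} \quad \text{and} \quad \tilde{f}_w^{(j)} = \frac{\tilde{N}_w^{(j)}}{\left((\tilde{N}_w^{(i)})^2 + (\tilde{N}_w^{(j)})^2\right)^{1/4}}.$$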
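The two background-adjusted measures differ only in how the centred counts $\tilde{N}_w$ are normalised before the cosine-type comparison. The following sketch assumes the centred counts and expected counts are given as lists over the same ordering of k-mers; the zero-denominator handling and the function names are assumptions made for this illustration.

```python
import math

def half_one_minus_cosine(x, y):
    """Return (1 - cos(x, y)) / 2 for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 0.5 * (1.0 - dot / (norm_x * norm_y))

def d2star(nt_i, exp_i, nt_j, exp_j):
    """d2*: each centred count is divided by the square root of its expected count."""
    f_i = [n / math.sqrt(e) if e > 0 else 0.0 for n, e in zip(nt_i, exp_i)]
    f_j = [n / math.sqrt(e) if e > 0 else 0.0 for n, e in zip(nt_j, exp_j)]
    return half_one_minus_cosine(f_i, f_j)

def d2s(nt_i, nt_j):
    """d2s: both centred counts are divided by (nt_i^2 + nt_j^2)^(1/4)."""
    f_i, f_j = [], []
    for a, b in zip(nt_i, nt_j):
        scale = (a * a + b * b) ** 0.25
        # Assumption: a word with both centred counts equal to 0 contributes 0.
        f_i.append(a / scale if scale > 0 else 0.0)
        f_j.append(b / scale if scale > 0 else 0.0)
    return half_one_minus_cosine(f_i, f_j)

# Toy centred counts (N - EN) over two words, with made-up expected counts.
star_same = d2star([1.0, -1.0], [4.0, 4.0], [1.0, -1.0], [4.0, 4.0])
s_opposite = d2s([1.0, -1.0], [-1.0, 1.0])
```

Note that $d_2^s$ needs the expected counts only through the centred counts, whereas $d_2^*$ uses them a second time in the normalisation.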
[Figure S1: scatter plot of pairwise genome $d_2^s$ (y-axis) against pairwise evolutionary distance (x-axis) for K = 14, M = 12. SPC: 0.979.]
Figure S1: Relationship between pairwise $d_2^s$ using K = 14 and evolutionary distances
among 21 primates. X-axis is the pairwise primate evolutionary distances estimated by the
alignment-based method in [3] and Y-axis is the pairwise $d_2^s$ calculated based on primate
genomes using K = 14 and M = 12. The Spearman correlation coefficient (SPC) is 0.979.
[Figure S2: nine scatter plots of pairwise genome $d_2^s$ (y-axis) against pairwise evolutionary distance (x-axis). Panels: (a) K = 5, M = 3, SPC: 0.576; (b) K = 6, M = 4, SPC: 0.717; (c) K = 7, M = 5, SPC: 0.760; (d) K = 8, M = 6, SPC: 0.805; (e) K = 9, M = 7, SPC: 0.856; (f) K = 10, M = 8, SPC: 0.900; (g) K = 11, M = 9, SPC: 0.935; (h) K = 12, M = 10, SPC: 0.954; (i) K = 13, M = 11, SPC: 0.969.]
Figure S2: Relationship between pairwise $d_2^s$ using K = 5 to K = 13 and evolutionary
distances among 21 primates. X-axis is the pairwise primate evolutionary distances
estimated by the alignment-based method in [3] and Y-axis is the pairwise $d_2^s$ calculated based on
primate genomes using K = 5 to K = 13 and M = K − 2. The corresponding Spearman
correlation coefficient (SPC) for each K is shown on the subplot.
[Figure S3: scatter plot of pairwise genome $d_2^*$ (y-axis) against pairwise evolutionary distance (x-axis) for K = 14, M = 12. SPC: 0.970.]
Figure S3: Relationship between pairwise $d_2^*$ using K = 14 and evolutionary distances
among 21 primates. X-axis is the pairwise primate evolutionary distances estimated by the
alignment-based method in [3] and Y-axis is the pairwise $d_2^*$ calculated based on primate
genomes using K = 14 and M = 12. The Spearman correlation coefficient (SPC) is 0.970.
[Figure S4: nine scatter plots of pairwise genome $d_2^*$ (y-axis) against pairwise evolutionary distance (x-axis). Panels: (a) K = 5, M = 3, SPC: 0.551; (b) K = 6, M = 4, SPC: 0.728; (c) K = 7, M = 5, SPC: 0.749; (d) K = 8, M = 6, SPC: 0.789; (e) K = 9, M = 7, SPC: 0.816; (f) K = 10, M = 8, SPC: 0.856; (g) K = 11, M = 9, SPC: 0.897; (h) K = 12, M = 10, SPC: 0.934; (i) K = 13, M = 11, SPC: 0.959.]
Figure S4: Relationship between pairwise $d_2^*$ using K = 5 to K = 13 and evolutionary
distances among 21 primates. X-axis is the pairwise primate evolutionary distances
estimated by the alignment-based method in [3] and Y-axis is the pairwise $d_2^*$ calculated based on
primate genomes using K = 5 to K = 13 and M = K − 2. The corresponding Spearman
correlation coefficient (SPC) for each K is shown on the subplot.
[Figure S5: nine scatter plots of NGS $s_2^s$ (y-axis) against genome $s_2^s$ (x-axis). Panels: (a) K = 5, RMSE: 0.028, SPC: 0.969; (b) K = 6, RMSE: 0.034, SPC: 0.984; (c) K = 7, RMSE: 0.038, SPC: 0.984; (d) K = 8, RMSE: 0.047, SPC: 0.983; (e) K = 9, RMSE: 0.054, SPC: 0.976; (f) K = 10, RMSE: 0.071, SPC: 0.961; (g) K = 11, RMSE: 0.113, SPC: 0.923; (h) K = 12, RMSE: 0.180, SPC: 0.880; (i) K = 13, RMSE: 0.241, SPC: 0.830. Legend: 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M, 15 M, Mix.]
Figure S5: Relationship between pairwise $s_2^s$ estimated by primate genomes using K = 5
to K = 13, M = K − 2 and NGS samples of different numbers of reads without bias
adjustment. X-axis is the pairwise $s_2^s$ estimated by genomes and Y-axis is the pairwise $s_2^s$
estimated based on mixed NGS samples. (a)-(i) show the relationship between $s_2^s$ estimated
based on mixed NGS samples using different K and M and $s_2^s$ estimated based on primate genomes.
NGS samples of different numbers of reads are colored accordingly. ‘Mix’ means the two NGS
samples have different numbers of reads (e.g., between 1 M and 5 M or between 7 M and 11
M) and is colored in grey. The root mean squared error (RMSE) and Spearman correlation
coefficient (SPC) between pairwise $s_2^s$ estimated based on NGS samples and genomes are
shown on each subplot.
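The two agreement metrics reported on each subplot can be computed as in the sketch below. It uses only the standard library; the average-rank treatment of ties in the Spearman coefficient is a standard convention assumed here, since the text does not say how ties were handled, and the function names are hypothetical.

```python
import math

def rmse(xs, ys):
    """Root mean squared error between paired values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Toy values: NGS estimates shifted by a constant bias of 0.1.
genome_values = [0.1, 0.2, 0.3, 0.4]
ngs_values = [0.2, 0.3, 0.4, 0.5]
```

The toy data illustrate why both metrics are reported: a constant bias leaves the rank correlation perfect while the RMSE is non-zero.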
[Figure S6: nine scatter plots of NGS $s_2^*$ (y-axis) against genome $s_2^*$ (x-axis). Panels: (a) 1 M reads, RMSE: 0.415, SPC: 0.960; (b) 3 M reads, RMSE: 0.335, SPC: 0.975; (c) 5 M reads, RMSE: 0.284, SPC: 0.986; (d) 7 M reads, RMSE: 0.246, SPC: 0.988; (e) 9 M reads, RMSE: 0.216, SPC: 0.994; (f) 11 M reads, RMSE: 0.193, SPC: 0.994; (g) 13 M reads, RMSE: 0.174, SPC: 0.995; (h) 15 M reads, RMSE: 0.158, SPC: 0.996; (i) Mix, RMSE: 0.270, SPC: 0.838. Legend: 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M, 15 M, Mix.]
Figure S6: Relationship between pairwise $s_2^*$ estimated by primate genomes and NGS
samples of different numbers of reads using K = 14 and M = 12, without bias adjustment.
X-axis is the pairwise $s_2^*$ estimated by genomes and Y-axis is the pairwise $s_2^*$ estimated
based on NGS samples. (a)-(h) show the relationship between $s_2^*$ estimated based on
primate genomes and $s_2^*$ estimated based on NGS samples of only 1 M, 3 M, 5 M, 7 M, 9 M,
11 M, 13 M or 15 M reads, respectively. (i) shows pairwise $s_2^*$ estimated based on mixed
NGS samples. NGS samples of different numbers of reads are colored accordingly. ‘Mix’
means the two NGS samples have different numbers of reads (e.g., between 1 M and 5 M or
between 7 M and 11 M) and is colored in grey. The root mean squared error (RMSE)
and Spearman correlation coefficient (SPC) between pairwise $s_2^*$ estimated based on NGS
samples and genomes are shown on each subplot.
[Figure S7: nine scatter plots of NGS $s_2^*$ (y-axis) against genome $s_2^*$ (x-axis). Panels: (a) K = 5, RMSE: 0.011, SPC: 0.990; (b) K = 6, RMSE: 0.015, SPC: 0.996; (c) K = 7, RMSE: 0.015, SPC: 0.996; (d) K = 8, RMSE: 0.022, SPC: 0.993; (e) K = 9, RMSE: 0.026, SPC: 0.993; (f) K = 10, RMSE: 0.032, SPC: 0.992; (g) K = 11, RMSE: 0.054, SPC: 0.976; (h) K = 12, RMSE: 0.126, SPC: 0.917; (i) K = 13, RMSE: 0.229, SPC: 0.856. Legend: 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M, 15 M, Mix.]
Figure S7: Relationship between pairwise $s_2^*$ estimated by primate genomes using K = 5
to K = 13, M = K − 2 and NGS samples of different numbers of reads without bias
adjustment. X-axis is the pairwise $s_2^*$ estimated by genomes and Y-axis is the pairwise $s_2^*$
estimated based on mixed NGS samples. (a)-(i) show the relationship between $s_2^*$ estimated
based on mixed NGS samples using different K and M and $s_2^*$ estimated based on primate genomes.
NGS samples of different numbers of reads are colored accordingly. ‘Mix’ means the two NGS
samples have different numbers of reads (e.g., between 1 M and 5 M or between 7 M and 11
M) and is colored in grey. The root mean squared error (RMSE) and Spearman correlation
coefficient (SPC) between pairwise $s_2^*$ estimated based on NGS samples and genomes are
shown on each subplot.
[Figure S8: two scatter plots against genome $s_2^s$ (x-axis). Panels: (a) Before bias adjustment, y-axis NGS $s_2^s$, RMSE: 0.298, SPC: 0.753; (b) After bias adjustment, y-axis Adjusted NGS $s_2^s$, RMSE: 0.023, SPC: 0.986. Legend: 1 M & 1 M, 1 M & 5 M, 1 M & 15 M, 5 M & 5 M, 5 M & 15 M, 15 M & 15 M.]
Figure S8: Relationship between pairwise $s_2^s$ estimated using K = 14 and M = 12 based
on 21 primate genomes and NGS samples of 1 M, 5 M and 15 M reads (samples of other
sequencing depths are not shown in this figure, for a less crowded visualization). (a)
Relationship before bias adjustment. (b) Relationship after bias adjustment for NGS $s_2^s$. ‘1
M & 1 M’ represents the $s_2^s$ between two NGS samples of 1 M reads and ‘1 M & 15 M’
represents the $s_2^s$ between one NGS sample of 1 M reads and another NGS sample of 15
M reads. The root mean squared error (RMSE) decreased and the Spearman correlation
coefficient (SPC) between pairwise genome $s_2^s$ and NGS $s_2^s$ increased after bias
adjustment.
[Figure S9: nine scatter plots of pairwise genome Mash distance (y-axis) against pairwise evolutionary distance (x-axis). Panels: (a) K = 14, s = $10^3$, SPC: 0.669; (b) K = 14, s = $10^5$, SPC: 0.847; (c) K = 14, s = $10^7$, SPC: 0.851; (d) K = 21, s = $10^3$, SPC: 0.969; (e) K = 21, s = $10^5$, SPC: 0.983; (f) K = 21, s = $10^7$, SPC: 0.984; (g) K = 31, s = $10^3$, SPC: 0.963; (h) K = 31, s = $10^5$, SPC: 0.976; (i) K = 31, s = $10^7$, SPC: 0.978.]
Figure S9: Relationship between pairwise Mash distances using K = 14, K = 21, K = 31
and sketch size s = $10^3$, s = $10^5$, s = $10^7$ and evolutionary distances among 21 primates.
X-axis is the pairwise primate evolutionary distances estimated by the alignment-based method
in [3] and Y-axis is the pairwise Mash distances calculated based on primate genomes. The
corresponding Spearman correlation coefficient (SPC) for each combination of K and s
is shown on the subplot. Mash distances with K = 21 and s = $10^7$ and the evolutionary
distances have the highest Spearman correlation coefficient, 0.984.
[Figure S10: nine scatter plots of NGS Mash similarity (y-axis) against genome Mash similarity (x-axis). Panels: (a) 1 M reads, RMSE: 0.088, SPC: 0.943; (b) 3 M reads, RMSE: 0.066, SPC: 0.971; (c) 5 M reads, RMSE: 0.053, SPC: 0.989; (d) 7 M reads, RMSE: 0.045, SPC: 0.992; (e) 9 M reads, RMSE: 0.038, SPC: 0.994; (f) 11 M reads, RMSE: 0.033, SPC: 0.994; (g) 13 M reads, RMSE: 0.029, SPC: 0.995; (h) 15 M reads, RMSE: 0.026, SPC: 0.995; (i) Mix, RMSE: 0.059, SPC: 0.860. Legend: 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M, 15 M, Mix.]
Figure S10: Relationship between pairwise Mash similarity estimated by primate genomes
using K = 21 and sketch size s = 10
7
and NGS samples of different numbers of reads.
X-axis is the pairwise Mash similarity estimated by genomes and Y-axis is the pairwise
Mash similarity estimated based on NGS samples. (a)-(h) show the relationship between
Mash similarity estimated based on primate genomes and Mash similarity estimated based
on NGS samples of only 1 M, 3M, 5 M, 7M, 9 M, 11 M, 13 M or 15 M reads, respectively.
(i) shows pairwise Mash similarity estimated based on mixed NGS samples. NGS samples
of different numbers of reads are colored accordingly. ‘Mix’ means two NGS samples
have different numbers of reads (e.g between 1 M and 5 M or between 7 M and 11 M)
and is colored in grey. The root mean squared error (RMSE) and Spearman correlation
coefficients (SPC) between pairwise Mash similarity estimated based on NGS samples and
genomes are shown on each subplot.
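RMSE and SPC measure different kinds of agreement: RMSE penalizes any absolute deviation of the NGS-based value from the genome-based value, whereas a rank correlation is unaffected by a systematic shift that preserves ordering. The sketch below computes RMSE for two lists of pairwise similarities. The function name and the toy numbers are illustrative assumptions, not values from the thesis.

```python
from math import sqrt

def rmse(reference, estimate):
    """Root mean squared error between two equal-length lists of pairwise values."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return sqrt(sum(squared) / len(squared))

# Toy values (illustrative only): NGS estimates shifted down by a constant 0.05.
genome_similarity = [0.90, 0.80, 0.70, 0.60]
ngs_similarity = [s - 0.05 for s in genome_similarity]
offset_error = rmse(genome_similarity, ngs_similarity)  # close to 0.05
```

In this toy case the ordering of the pairs is fully preserved, so a rank correlation would be 1 while the RMSE still reports the bias.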
[Figure S11 plot: nine scatter panels. X-axis: genome s_2^s; Y-axis: adjusted NGS s_2^s.
Legend: 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M, 15 M, Mix.]
(a) K = 5: RMSE 0.027, SPC 0.969
(b) K = 6: RMSE 0.031, SPC 0.984
(c) K = 7: RMSE 0.033, SPC 0.985
(d) K = 8: RMSE 0.037, SPC 0.987
(e) K = 9: RMSE 0.034, SPC 0.986
(f) K = 10: RMSE 0.030, SPC 0.987
(g) K = 11: RMSE 0.028, SPC 0.986
(h) K = 12: RMSE 0.028, SPC 0.986
(i) K = 13: RMSE 0.023, SPC 0.988
Figure S11: Relationship between pairwise s_2^s estimated from primate genomes and from
NGS samples of different numbers of reads with bias adjustment, using K = 5 to K = 13 and
M = K − 2. The X-axis is the pairwise s_2^s estimated from genomes and the Y-axis is the
pairwise s_2^s estimated from mixed NGS samples after bias adjustment. (a)-(i) show the
relationship between s_2^s estimated from primate genomes and the adjusted s_2^s based on
mixed NGS samples for the different values of K and M. NGS samples of different numbers of
reads are colored accordingly. 'Mix' means the two NGS samples have different numbers of
reads (e.g., between 1 M and 5 M or between 7 M and 11 M) and is colored in grey. The root
mean squared error (RMSE) and Spearman correlation coefficient (SPC) between the pairwise
adjusted s_2^s based on NGS samples and on genomes are shown on each subplot.
[Figure S12 plot: nine scatter panels. X-axis: genome s_2^*; Y-axis: adjusted NGS s_2^*.
Legend: 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M, 15 M, Mix.]
(a) 1 M reads: RMSE 0.039, SPC 0.987
(b) 3 M reads: RMSE 0.023, SPC 0.995
(c) 5 M reads: RMSE 0.014, SPC 0.997
(d) 7 M reads: RMSE 0.013, SPC 0.997
(e) 9 M reads: RMSE 0.006, SPC 0.999
(f) 11 M reads: RMSE 0.007, SPC 0.999
(g) 13 M reads: RMSE 0.010, SPC 0.999
(h) 15 M reads: RMSE 0.010, SPC 0.999
(i) Mix: RMSE 0.016, SPC 0.994
Figure S12: Relationship between pairwise s_2^* estimated from primate genomes and from
NGS samples of different numbers of reads with bias adjustment, using K = 14 and M = 12.
The X-axis is the pairwise s_2^* estimated from genomes and the Y-axis is the pairwise s_2^*
estimated from NGS samples after bias adjustment. (a)-(h) show the relationship between
s_2^* estimated from primate genomes and the adjusted s_2^* based on NGS samples of only
1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M or 15 M reads, respectively. (i) shows the pairwise
adjusted s_2^* based on mixed NGS samples. NGS samples of different numbers of reads are
colored accordingly. 'Mix' means the two NGS samples have different numbers of reads (e.g.,
between 1 M and 5 M or between 7 M and 11 M) and is colored in grey. The root mean squared
error (RMSE) and Spearman correlation coefficient (SPC) between pairwise s_2^* estimated
from NGS samples and from genomes are shown on each subplot.
[Figure S13 plot: nine scatter panels. X-axis: genome s_2^*; Y-axis: adjusted NGS s_2^*.
Legend: 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M, 15 M, Mix.]
(a) K = 5: RMSE 0.010, SPC 0.990
(b) K = 6: RMSE 0.015, SPC 0.996
(c) K = 7: RMSE 0.015, SPC 0.996
(d) K = 8: RMSE 0.020, SPC 0.993
(e) K = 9: RMSE 0.022, SPC 0.995
(f) K = 10: RMSE 0.021, SPC 0.996
(g) K = 11: RMSE 0.017, SPC 0.998
(h) K = 12: RMSE 0.016, SPC 0.997
(i) K = 13: RMSE 0.017, SPC 0.996
Figure S13: Relationship between pairwise s_2^* estimated from primate genomes and from
NGS samples of different numbers of reads with bias adjustment, using K = 5 to K = 13 and
M = K − 2. The X-axis is the pairwise s_2^* estimated from genomes and the Y-axis is the
pairwise s_2^* estimated from mixed NGS samples after bias adjustment. (a)-(i) show the
relationship between the adjusted s_2^* based on mixed NGS samples and s_2^* estimated from
primate genomes for the different values of K and M. NGS samples of different numbers of
reads are colored accordingly. 'Mix' means the two NGS samples have different numbers of
reads (e.g., between 1 M and 5 M or between 7 M and 11 M) and is colored in grey. The root
mean squared error (RMSE) and Spearman correlation coefficient (SPC) between the pairwise
adjusted s_2^* based on NGS samples and on genomes are shown on each subplot.
[Figure S14 plot: nine scatter panels. X-axis: genome Skmer similarity; Y-axis: NGS Skmer
similarity. Legend: 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M, 15 M, Mix.]
(a) 1 M reads: RMSE 0.078, SPC 0.943
(b) 3 M reads: RMSE 0.024, SPC 0.874
(c) 5 M reads: RMSE 0.019, SPC 0.988
(d) 7 M reads: RMSE 0.015, SPC 0.989
(e) 9 M reads: RMSE 0.013, SPC 0.993
(f) 11 M reads: RMSE 0.011, SPC 0.994
(g) 13 M reads: RMSE 0.009, SPC 0.995
(h) 15 M reads: RMSE 0.008, SPC 0.995
(i) Mix: RMSE 0.031, SPC 0.766
Figure S14: Relationship between pairwise Skmer similarity estimated from primate genomes
and from NGS samples of different numbers of reads, using K = 21 and sketch size s = 10^7.
The X-axis is the pairwise Skmer similarity estimated from genomes and the Y-axis is the
pairwise Skmer similarity estimated from NGS samples. (a)-(h) show the relationship between
the genome-based Skmer similarity and the Skmer similarity estimated from NGS samples of
only 1 M, 3 M, 5 M, 7 M, 9 M, 11 M, 13 M or 15 M reads, respectively. (i) shows the pairwise
Skmer similarity estimated from mixed NGS samples. NGS samples of different numbers of
reads are colored accordingly. 'Mix' means the two NGS samples have different numbers of
reads (e.g., between 1 M and 5 M or between 7 M and 11 M) and is colored in grey. The root
mean squared error (RMSE) and Spearman correlation coefficient (SPC) between pairwise
Skmer similarity estimated from NGS samples and from genomes are shown on each subplot.
[Figure S15 plot: one scatter panel, K = 14, M = 12, SPC: 0.927. X-axis: pairwise
evolutionary distance; Y-axis: pairwise genome d_2^s (transformed).]
Figure S15: Relationship between pairwise d_2^s using K = 14 and evolutionary distances
among 28 mammals. The X-axis is the pairwise mammalian evolutionary distance estimated by
the alignment-based method in [4]. The Y-axis is the pairwise d_2^s calculated from mammalian
genomes using K = 14 and M = 12, transformed by (log(1 − 2 × d_2^s))^2 for better
visualization. The Spearman correlation coefficient (SPC) is 0.927.
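The transformation in the caption stretches values of d_2^s that crowd together just below 0.5. A minimal sketch follows, under two assumptions that the caption does not state: the logarithm is natural, and every plotted d_2^s is below 0.5 so that the logarithm is defined. The function name is illustrative.

```python
from math import log

def transform_d2s(d):
    """Map a dissimilarity d in [0, 0.5) to (log(1 - 2*d))**2 (natural log assumed)."""
    if not 0.0 <= d < 0.5:
        raise ValueError("transformation is only defined here for 0 <= d < 0.5")
    return log(1.0 - 2.0 * d) ** 2

# Toy dissimilarities (illustrative only), increasingly close to 0.5.
values = [0.10, 0.30, 0.45, 0.49]
transformed = [transform_d2s(d) for d in values]
```

On [0, 0.5) the mapping is strictly increasing, so it changes the spacing of the points on the Y-axis but not their ranks, and a rank correlation such as SPC is the same before and after the transformation.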
[Figure S16 plot: nine scatter panels. X-axis: pairwise evolutionary distance; Y-axis:
pairwise genome Mash distance.]
(a) K = 14, s = 10^3: SPC 0.849
(b) K = 14, s = 10^5: SPC 0.856
(c) K = 14, s = 10^7: SPC 0.853
(d) K = 21, s = 10^3: SPC 0.768
(e) K = 21, s = 10^5: SPC 0.917
(f) K = 21, s = 10^7: SPC 0.917
(g) K = 31, s = 10^3: SPC 0.875
(h) K = 31, s = 10^5: SPC 0.935
(i) K = 31, s = 10^7: SPC 0.943
Figure S16: Relationship between pairwise Mash distances and evolutionary distances among
28 mammals, using K = 14, K = 21, K = 31 and sketch sizes s = 10^3, s = 10^5, s = 10^7.
The X-axis is the pairwise mammalian evolutionary distance estimated by the alignment-based
method in [4] and the Y-axis is the pairwise Mash distance calculated from mammalian
genomes. The Spearman correlation coefficient (SPC) for each combination of K and s is
shown on the corresponding subplot. Mash distances with K = 31 and s = 10^7 have the
highest Spearman correlation with the evolutionary distances (0.943).
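For orientation on the roles of K and s: Mash estimates the Jaccard index j between the k-mer sets of two sequences from bottom-s MinHash sketches and converts it to a distance D = −(1/K) ln(2j / (1 + j)), as described by Ondov et al. The sketch below illustrates the idea on toy sequences. It is a simplification rather than the Mash implementation: it hashes with SHA-1 instead of MurmurHash3, ignores reverse complements, and all function names are made up for illustration.

```python
import hashlib
from math import log

def kmer_hashes(seq, k):
    """Hash every distinct k-mer of seq to an integer (SHA-1 used for simplicity)."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {int(hashlib.sha1(km.encode()).hexdigest(), 16) for km in kmers}

def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest hash values of the k-mer set."""
    return sorted(kmer_hashes(seq, k))[:s]

def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from the s smallest hashes of the merged sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)

def mash_distance(j, k):
    """Mash distance from a Jaccard estimate j; taken to be 1 when j == 0."""
    if j == 0:
        return 1.0
    return -log(2.0 * j / (1.0 + j)) / k

# Toy example: the 2-mer sets {AC, CG, GT} and {AC, CG, GA} share 2 of 4 distinct 2-mers.
j = jaccard_estimate(sketch("ACGT", 2, 10), sketch("ACGA", 2, 10), 10)  # 0.5
d = mash_distance(j, 2)
```

With a sketch size at least as large as the number of distinct k-mers, as in the toy example, the estimate equals the exact Jaccard index; smaller sketches trade accuracy for speed and memory.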
[Figure S17 plot: two scatter panels. Legend: 1 M, 5 M, 15 M, Mix.]
(a) Mash: X-axis genome Mash similarity, Y-axis NGS Mash similarity; RMSE 0.040, SPC 0.789
(b) Skmer: X-axis genome Skmer similarity, Y-axis NGS Skmer similarity; RMSE 0.035, SPC 0.688
Figure S17: Relationship between pairwise Mash and Skmer similarity estimated using
K = 31 and sketch size s = 10^7 based on 28 mammalian genomes and NGS samples of
different numbers of reads. (a) Relationship for Mash similarity. (b) Relationship for Skmer
similarity.
[Figure S18 plot: two scatter panels. Legend: 1 M, 5 M, 15 M, Mix.]
(a) Before bias adjustment: X-axis genome s_2^s, Y-axis NGS s_2^s; RMSE 0.043, SPC 0.701
(b) After bias adjustment: X-axis genome s_2^s, Y-axis adjusted NGS s_2^s; RMSE 0.009, SPC 0.935
Figure S18: Relationship between pairwise s_2^s estimated using K = 14 and M = 12 based
on 67 vertebrate genomes and NGS samples of different numbers of reads. (a) Relationship
before bias adjustment. (b) Relationship after bias adjustment of the NGS s_2^s. After bias
adjustment, the root mean squared error decreased from 0.043 to 0.009 and the Spearman
correlation coefficient between pairwise genome s_2^s and NGS s_2^s increased from 0.701
to 0.935.
[Figure S19 plot: two scatter panels. Legend: 1 M, 5 M, 15 M, Mix.]
(a) Mash: X-axis genome Mash similarity, Y-axis NGS Mash similarity; RMSE 0.036, SPC 0.747
(b) Skmer: X-axis genome Skmer similarity, Y-axis NGS Skmer similarity; RMSE 0.037, SPC 0.735
Figure S19: Relationship between pairwise Mash and Skmer similarity estimated using
K = 31 and sketch size s = 10^7 based on 67 vertebrate genomes and NGS samples of
different numbers of reads. (a) Relationship for Mash similarity. (b) Relationship for Skmer
similarity.
[Figure S20 plot: six scatter panels. X-axis: pairwise evolutionary distance; Y-axis:
pairwise alignment-free dissimilarity.]
(a) d_2^s, K = 14, M = 12: SPC 0.979
(b) d_2^*, K = 14, M = 12: SPC 0.970
(c) CVTree, K = 14, M = 12: SPC 0.955
(d) Mash, K = 21, s = 10^7: SPC 0.984
(e) Skmer, K = 21, s = 10^7: SPC 0.984
(f) FFP, K = 16: SPC 0.708
Figure S20: Relationship between different pairwise alignment-free dissimilarities and
evolutionary distances among 21 primates: (a) d_2^s with K = 14 and M = 12. (b) d_2^* with
K = 14 and M = 12. (c) CVTree with K = 14 and M = 12. (d) Mash with K = 21 and
s = 10^7. (e) Skmer with K = 21 and s = 10^7. (f) FFP with K = 16. The X-axis is the pairwise
primate evolutionary distance estimated by the alignment-based method in [3] and the Y-axis
is the pairwise alignment-free dissimilarity. The Spearman correlation coefficients (SPC) are
shown on each subplot.
[Figure S21 plot: one scatter panel. X-axis: adjusted NGS d_2^s(A, B); Y-axis: adjusted NGS
d_2^s(B, A). RMSE: 0.005, SPC: 0.999.]
Figure S21: Relationship between pairwise adjusted d_2^s(A, B) and adjusted d_2^s(B, A) based
on the 21 primates dataset. For any pair of NGS samples A and B, the X-axis is the adjusted
d_2^s(A, B) and the Y-axis is the adjusted d_2^s(B, A). The Spearman correlation coefficient
is 0.999 and the RMSE is 0.005, which indicates that the adjustment produced by our model is
nearly symmetric in A and B.
Query size Reference size k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10
Samples of 50 Mbp
1 91 1.00 1.00 1.00 1.00 1.00 1.00 0.93 0.97 0.88 0.94
17 75 1.00 1.00 0.99 0.99 0.97 0.97 0.95 0.96 0.94 0.96
32 60 0.99 0.99 0.97 0.97 0.94 0.95 0.93 0.95 0.93 0.95
47 45 0.98 0.98 0.95 0.96 0.93 0.94 0.93 0.95 0.91 0.93
62 30 0.95 0.95 0.92 0.93 0.90 0.93 0.90 0.92 0.89 0.91
77 15 0.89 0.89 0.85 0.87 0.81 0.82 0.79 0.77 0.73 0.67
Samples of 100 Mbp
1 91 1.00 1.00 1.00 1.00 1.00 1.00 0.98 1.00 0.96 1.00
17 75 1.00 1.00 0.99 0.99 0.98 0.98 0.97 0.99 0.97 0.99
32 60 1.00 1.00 0.98 0.99 0.97 0.98 0.96 0.97 0.95 0.96
47 45 0.99 0.99 0.97 0.98 0.96 0.97 0.95 0.97 0.95 0.97
62 30 0.98 0.98 0.95 0.96 0.94 0.96 0.93 0.95 0.90 0.91
77 15 0.93 0.93 0.90 0.91 0.86 0.84 0.81 0.78 0.70 0.67
Samples of 300 Mbp
1 91 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
17 75 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 1.00
32 60 1.00 1.00 0.99 1.00 0.99 1.00 0.98 0.99 0.98 0.99
47 45 1.00 1.00 0.99 0.99 0.98 0.99 0.98 0.99 0.97 0.98
62 30 0.99 0.99 0.97 0.98 0.96 0.97 0.94 0.95 0.91 0.92
77 15 0.96 0.96 0.92 0.93 0.86 0.86 0.81 0.79 0.74 0.71
Table S1: Prediction accuracy using k-NN on the 92 white oak dataset based on d_2^* for
different sequence quantities, query sizes, reference sizes and numbers of neighbors k.
For each query size and reference size, the dataset was randomly split 100 times and the
prediction accuracy was averaged over the 100 splits.
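The evaluation protocol in the caption (repeated random query/reference splits, k-NN on a dissimilarity matrix, accuracy averaged over splits) can be sketched as follows. This is an illustration under stated assumptions, not the code used for the table: the function names, the fixed random seed and in particular the tie-breaking rule (ties in the vote go to the label of the closest tied neighbor) are assumptions.

```python
import random
from collections import Counter

def knn_predict(dist_row, ref_idx, labels, k):
    """Predict a label from the k reference samples closest to the query.

    dist_row[j] is the dissimilarity between the query and sample j.
    Ties in the vote are broken in favor of the closest neighbor (assumed rule).
    """
    nearest = sorted(ref_idx, key=lambda j: dist_row[j])[:k]
    votes = Counter(labels[j] for j in nearest)
    top = max(votes.values())
    for j in nearest:  # nearest is ordered by increasing dissimilarity
        if votes[labels[j]] == top:
            return labels[j]

def split_accuracy(dist, labels, n_query, k, n_splits=100, seed=0):
    """Average k-NN accuracy over random splits into query and reference sets."""
    rng = random.Random(seed)
    n = len(labels)
    total = 0.0
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        query, ref = idx[:n_query], idx[n_query:]
        correct = sum(knn_predict(dist[q], ref, labels, k) == labels[q] for q in query)
        total += correct / n_query
    return total / n_splits
```

Under this tie-breaking rule k = 2 always returns the same prediction as k = 1, which would be consistent with the identical k = 1 and k = 2 columns in Table S1.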
Query size Reference size k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10
K = 12, s = 10^3
1 91 0.44 0.44 0.40 0.45 0.49 0.47 0.53 0.55 0.53 0.58
17 75 0.47 0.47 0.41 0.48 0.47 0.50 0.52 0.52 0.52 0.53
32 60 0.45 0.45 0.44 0.48 0.47 0.49 0.50 0.51 0.52 0.53
47 45 0.43 0.43 0.44 0.46 0.46 0.48 0.49 0.50 0.49 0.50
62 30 0.42 0.42 0.45 0.47 0.46 0.47 0.47 0.47 0.46 0.46
77 15 0.41 0.41 0.42 0.43 0.43 0.42 0.42 0.43 0.43 0.43
K = 12, s = 10^5
1 91 0.60 0.60 0.47 0.45 0.51 0.47 0.48 0.56 0.66 0.67
17 75 0.62 0.62 0.52 0.51 0.55 0.56 0.58 0.63 0.64 0.61
32 60 0.59 0.59 0.56 0.56 0.58 0.60 0.60 0.60 0.60 0.59
47 45 0.59 0.59 0.57 0.58 0.57 0.59 0.58 0.58 0.58 0.57
62 30 0.55 0.55 0.55 0.56 0.54 0.55 0.54 0.53 0.53 0.52
77 15 0.52 0.52 0.49 0.50 0.48 0.48 0.47 0.47 0.46 0.45
K = 12, s = 10^7
1 91 0.54 0.54 0.58 0.55 0.61 0.60 0.58 0.75 0.74 0.61
17 75 0.62 0.62 0.57 0.55 0.58 0.62 0.57 0.63 0.62 0.58
32 60 0.59 0.59 0.57 0.57 0.59 0.60 0.58 0.59 0.58 0.56
47 45 0.59 0.59 0.59 0.58 0.58 0.58 0.56 0.58 0.57 0.56
62 30 0.56 0.56 0.56 0.55 0.55 0.54 0.54 0.54 0.52 0.51
77 15 0.54 0.54 0.52 0.51 0.48 0.48 0.47 0.47 0.46 0.46
K = 21, s = 10^3
1 91 0.63 0.63 0.57 0.63 0.57 0.58 0.54 0.60 0.56 0.59
17 75 0.63 0.63 0.61 0.63 0.61 0.62 0.58 0.61 0.58 0.62
32 60 0.63 0.63 0.62 0.63 0.62 0.61 0.60 0.61 0.60 0.62
47 45 0.60 0.60 0.59 0.60 0.58 0.59 0.59 0.61 0.60 0.62
62 30 0.59 0.59 0.58 0.58 0.57 0.59 0.58 0.58 0.59 0.60
77 15 0.53 0.53 0.53 0.55 0.54 0.54 0.54 0.53 0.53 0.53
K = 21, s = 10^5
1 91 0.82 0.82 0.72 0.72 0.53 0.55 0.46 0.48 0.48 0.48
17 75 0.83 0.83 0.68 0.68 0.57 0.58 0.51 0.53 0.52 0.54
32 60 0.79 0.79 0.64 0.65 0.55 0.57 0.53 0.56 0.55 0.58
47 45 0.75 0.75 0.59 0.61 0.54 0.57 0.54 0.59 0.58 0.61
62 30 0.65 0.65 0.54 0.58 0.54 0.59 0.58 0.61 0.62 0.64
77 15 0.59 0.59 0.55 0.57 0.56 0.58 0.57 0.56 0.54 0.52
K = 21, s = 10^7
1 91 0.92 0.92 0.73 0.75 0.53 0.54 0.48 0.48 0.43 0.43
17 75 0.85 0.85 0.61 0.63 0.49 0.52 0.47 0.48 0.47 0.50
32 60 0.80 0.80 0.58 0.61 0.51 0.54 0.50 0.54 0.53 0.57
47 45 0.74 0.74 0.55 0.58 0.51 0.55 0.55 0.59 0.59 0.62
62 30 0.68 0.68 0.52 0.57 0.52 0.58 0.57 0.62 0.61 0.64
77 15 0.58 0.58 0.55 0.59 0.57 0.58 0.58 0.57 0.55 0.55
K = 31, s = 10^3
1 91 0.66 0.66 0.64 0.67 0.65 0.62 0.62 0.62 0.58 0.58
17 75 0.66 0.66 0.68 0.67 0.66 0.65 0.62 0.62 0.59 0.59
32 60 0.66 0.66 0.67 0.65 0.63 0.63 0.60 0.61 0.59 0.60
47 45 0.63 0.63 0.62 0.62 0.59 0.59 0.57 0.57 0.56 0.56
62 30 0.61 0.61 0.58 0.59 0.57 0.57 0.56 0.57 0.55 0.56
77 15 0.57 0.57 0.55 0.56 0.53 0.54 0.52 0.52 0.52 0.49
K = 31, s = 10^5
1 91 0.98 0.98 0.82 0.85 0.77 0.77 0.62 0.60 0.59 0.59
17 75 0.92 0.92 0.76 0.78 0.69 0.70 0.62 0.63 0.61 0.61
32 60 0.86 0.86 0.72 0.73 0.64 0.64 0.60 0.61 0.60 0.61
47 45 0.82 0.82 0.68 0.68 0.61 0.63 0.60 0.62 0.61 0.64
62 30 0.75 0.75 0.65 0.67 0.62 0.65 0.63 0.67 0.65 0.67
77 15 0.63 0.63 0.57 0.60 0.58 0.60 0.57 0.56 0.55 0.54
K = 31, s = 10^7
1 91 0.96 0.96 0.78 0.84 0.67 0.72 0.53 0.52 0.53 0.52
17 75 0.93 0.93 0.77 0.80 0.66 0.69 0.60 0.61 0.60 0.61
32 60 0.88 0.88 0.68 0.72 0.61 0.63 0.57 0.59 0.57 0.59
47 45 0.83 0.83 0.67 0.69 0.61 0.63 0.60 0.63 0.62 0.65
62 30 0.75 0.75 0.63 0.64 0.60 0.62 0.60 0.63 0.63 0.65
77 15 0.64 0.64 0.58 0.60 0.59 0.61 0.59 0.60 0.59 0.57
Table S2: Prediction accuracy using k-NN on 92 white oak dataset of mixed sequence
quantity based on Mash using different kmer lengths K and sketch sizes s for different
query sizes, reference sizes and different numbers of neighbors k used. For each query size
and reference size, the dataset was randomly split 100 times and an average prediction
accuracy was calculated over 100 splits.
Query size Reference size k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10
K = 12, s = 10^3
1 91 0.44 0.44 0.44 0.44 0.44 0.44 0.44 0.44 0.44 0.44
17 75 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35
32 60 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36
47 45 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36
62 30 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.35 0.35
77 15 0.36 0.36 0.35 0.36 0.34 0.35 0.34 0.35 0.35 0.36
K = 12, s = 10^5
1 91 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39
17 75 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37
32 60 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37
47 45 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35
62 30 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36
77 15 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.37
K = 12, s = 10^7
1 91 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38
17 75 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35
32 60 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37
47 45 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35 0.35
62 30 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36
77 15 0.36 0.36 0.36 0.36 0.36 0.36 0.35 0.36 0.36 0.36
K = 21, s = 10^3
1 91 0.77 0.77 0.66 0.72 0.70 0.74 0.69 0.67 0.57 0.58
17 75 0.67 0.67 0.61 0.64 0.61 0.60 0.57 0.57 0.55 0.55
32 60 0.65 0.65 0.61 0.61 0.58 0.58 0.57 0.57 0.57 0.56
47 45 0.62 0.62 0.59 0.61 0.57 0.57 0.56 0.55 0.54 0.53
62 30 0.58 0.58 0.55 0.54 0.53 0.52 0.51 0.50 0.51 0.51
77 15 0.51 0.51 0.48 0.47 0.48 0.51 0.51 0.52 0.54 0.53
K = 21, s = 10^5
1 91 0.87 0.87 0.88 0.83 0.67 0.66 0.56 0.61 0.59 0.48
17 75 0.86 0.86 0.84 0.81 0.72 0.70 0.60 0.60 0.55 0.51
32 60 0.83 0.83 0.78 0.76 0.68 0.64 0.57 0.55 0.49 0.46
47 45 0.79 0.79 0.68 0.63 0.57 0.53 0.47 0.45 0.44 0.43
62 30 0.69 0.69 0.58 0.53 0.48 0.45 0.42 0.45 0.48 0.49
77 15 0.55 0.55 0.45 0.45 0.46 0.51 0.51 0.53 0.52 0.54
K = 21, s = 10^7
1 91 0.94 0.94 0.87 0.84 0.74 0.74 0.72 0.75 0.65 0.51
17 75 0.88 0.88 0.83 0.81 0.73 0.71 0.64 0.61 0.52 0.47
32 60 0.85 0.85 0.79 0.75 0.68 0.64 0.54 0.51 0.46 0.43
47 45 0.80 0.80 0.70 0.66 0.59 0.53 0.45 0.43 0.41 0.43
62 30 0.74 0.74 0.60 0.52 0.49 0.46 0.45 0.46 0.50 0.53
77 15 0.58 0.58 0.46 0.45 0.48 0.52 0.52 0.53 0.54 0.56
K = 31, s = 10^3
1 91 0.85 0.85 0.90 0.91 0.88 0.92 0.77 0.79 0.79 0.77
17 75 0.78 0.78 0.83 0.84 0.78 0.78 0.74 0.75 0.74 0.74
32 60 0.77 0.77 0.79 0.80 0.73 0.74 0.71 0.71 0.70 0.69
47 45 0.74 0.74 0.72 0.73 0.69 0.70 0.68 0.68 0.66 0.65
62 30 0.69 0.69 0.66 0.66 0.62 0.61 0.57 0.55 0.53 0.53
77 15 0.59 0.59 0.54 0.51 0.51 0.49 0.49 0.51 0.52 0.52
K = 31, s = 10^5
1 91 1.00 1.00 1.00 1.00 0.97 0.97 0.95 0.95 0.93 0.86
17 75 0.99 0.99 0.96 0.95 0.92 0.91 0.87 0.84 0.80 0.78
32 60 0.98 0.98 0.94 0.93 0.88 0.86 0.81 0.78 0.75 0.72
47 45 0.96 0.96 0.91 0.89 0.82 0.78 0.72 0.70 0.67 0.65
62 30 0.88 0.88 0.78 0.73 0.67 0.62 0.58 0.58 0.55 0.53
77 15 0.71 0.71 0.57 0.53 0.53 0.51 0.50 0.52 0.53 0.55
K = 31, s = 10^7
1 91 1.00 1.00 1.00 0.98 0.95 0.95 0.93 0.93 0.91 0.91
17 75 1.00 1.00 0.98 0.97 0.94 0.94 0.90 0.89 0.85 0.82
32 60 0.98 0.98 0.94 0.93 0.88 0.86 0.81 0.80 0.75 0.72
47 45 0.97 0.97 0.90 0.88 0.82 0.79 0.74 0.71 0.68 0.64
62 30 0.90 0.90 0.79 0.75 0.69 0.66 0.58 0.58 0.57 0.55
77 15 0.73 0.73 0.60 0.55 0.56 0.54 0.53 0.56 0.57 0.56
Table S3: Prediction accuracy using k-NN on 92 white oak dataset of mixed sequence
quantity based on Skmer using different kmer lengths K and sketch sizes s for different
query sizes, reference sizes and different numbers of neighbors k used. For each query size
and reference size, the dataset was randomly split 100 times and an average prediction
accuracy was calculated over 100 splits.
Method             Counting (min)  Calculation (min)  Total time (min)  Memory (GB)
Cafe-d_2^s         2397.3          4023.8             6421.1            28.9
Afann-d_2^s        78.4            27.3               105.7             6.4
Cafe-d_2^*         2397.3          4039.2             6436.5            28.9
Afann-d_2^*        78.4            13.3               91.7              4.1
Afann-d_2^*-fast   78.4            0.9                79.3              45.1
Cafe-CVTree        2397.3          3921.8             6319.1            28.9
Afann-CVTree       78.4            14.5               92.9              4.1
Afann-CVTree-fast  78.4            1.1                79.5              45.1
Mash_min           63.5            0.1                63.6              0.6
Mash_opt           100.3           1.5                101.8             4.4
Skmer_min          NA              NA                 69.4              0.6
Skmer_opt          NA              NA                 121.6             1.1
FFP                94.9            0.1                95.0              0.08
Table S4: K-mer counting time, dissimilarity calculation time, total time and memory usage
of Cafe and Afann when calculating pairwise d_2^s, d_2^* and CVTree using K = 14 and
M = 12 on a dataset of 21 primate genomes. Afann-d_2^*-fast and Afann-CVTree-fast denote
the fast mode of d_2^* and CVTree supported in Afann. Running time and memory usage of
Mash, Skmer and FFP are also included. Mash_min and Skmer_min used K = 14 and s = 10^3,
which require the minimum computing power. Mash_opt and Skmer_opt used K = 21 and
s = 10^7, which gave the best performance among the combinations of k-mer lengths and
sketch sizes shown in Figure S9. FFP used K = 16.
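As a worked reading of the table, the ratio of total running times gives the speedup of Afann (standard mode) over Cafe for each measure. The short calculation below uses only the total times listed in Table S4; the dictionary keys and the helper name are illustrative.

```python
# Total running times in minutes, copied from Table S4 (standard Afann mode).
total_minutes = {
    "d2s": {"Cafe": 6421.1, "Afann": 105.7},
    "d2star": {"Cafe": 6436.5, "Afann": 91.7},
    "CVTree": {"Cafe": 6319.1, "Afann": 92.9},
}

def speedup(times):
    """Ratio of Cafe total time to Afann total time."""
    return times["Cafe"] / times["Afann"]

speedups = {measure: speedup(times) for measure, times in total_minutes.items()}
for measure, ratio in speedups.items():
    print(f"{measure}: about {ratio:.1f}x faster with Afann")
```

On these numbers Afann is roughly 60 to 70 times faster than Cafe in total time, while also using less memory in standard mode (6.4 GB or 4.1 GB against 28.9 GB); the fast mode reduces calculation time further at the cost of 45.1 GB of memory.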
Species Assembly accession number Total sequence length (Mb)
Daubentonia madagascariensis GCA_000241425.1 2855.37
Nasalis larvatus GCA_000772465.1 3011.97
Eulemur macaco GCA_001262655.1 2119.88
Homo sapiens GCF_000001405.39 3099.73
Pongo abelii GCF_002880775.1 3441.24
Callithrix jacchus GCF_000004665.1 2914.96
Nomascus leucogenys GCF_000146795.2 2962.06
Gorilla gorilla gorilla GCF_000151905.2 3029.54
Carlito syrichta GCF_000164805.1 3453.86
Otolemur garnettii GCF_000181295.1 2519.72
Saimiri boliviensis boliviensis GCF_000235385.1 2608.59
Pan paniscus GCF_000258655.2 3286.64
Papio anubis GCF_000264685.3 2948.40
Macaca fascicularis GCF_000364345.1 2946.84
Galeopterus variegatus GCF_000696425.1 3187.66
Macaca mulatta GCF_003339765.1 3236.21
Colobus angolensis palliatus GCF_000951035.1 2970.12
Mandrillus leucophaeus GCF_000951045.1 3061.99
Aotus nancymaae GCF_000952055.2 2861.68
Macaca nemestrina GCF_000956065.1 2948.70
Chlorocebus sabaeus GCF_000409795.2 2789.66
Table S5: Species names, assembly accession numbers and total sequence lengths of 21
primate genomes with known pairwise evolutionary distances estimated by the alignment-based
method in [3].
Species Assembly accession number Total sequence length (Mb)
Homo sapiens GCF_000001405.39 3099.73
Pan troglodytes GCF_002880755.1 3309.56
Macaca mulatta GCF_003339765.1 2969.97
Otolemur garnettii GCF_000181295.1 2519.72
Tupaia belangeri GCA_000181375.1 2137.23
Rattus norvegicus GCF_000001895.5 2909.70
Mus musculus GCF_000001635.26 2730.86
Cavia porcellus GCF_000151735.1 2723.22
Oryctolagus cuniculus GCF_000003625.3 2737.46
Sorex araneus GCF_000181275.1 2423.16
Erinaceus europaeus GCF_000296755.1 2715.72
Canis lupus familiaris GCF_000002285.3 2410.98
Felis catus GCF_000181335.3 2455.54
Equus caballus GCF_002863925.1 2474.91
Bos taurus GCF_002263795.1 2983.31
Dasypus novemcinctus GCF_000208655.1 3631.52
Loxodonta africana GCF_000001905.1 3196.74
Echinops telfairi GCF_000313985.1 2947.02
Monodelphis domestica GCF_000002295.2 3598.44
Ornithorhynchus anatinus GCF_004115215.1 1995.61
Gallus gallus GCF_000002315.6 1046.93
Anolis carolinensis GCF_000090745.1 1799.14
Xenopus tropicalis GCF_000004195.3 1511.72
Tetraodon nigroviridis GCA_000180735.1 342.40
Takifugu rubripes GCF_901000725.2 391.47
Gasterosteus aculeatus GCA_006229165.1 467.45
Oryzias latipes GCF_002234675.1 734.06
Danio rerio GCF_000002035.6 1412.46
Table S6: Species names, assembly accession numbers and total sequence lengths of 28
mammalian genomes with known pairwise evolutionary distances estimated by the
alignment-based method in [4].
Run accession number Number of bases (Mbp) Continental origin
SRR2053033 685.23 North America
SRR2053034 917.57 North America
SRR2053035 1268.47 North America
SRR2053036 694.58 North America
SRR2053037 1024.22 North America
SRR2053038 869.04 North America
SRR2053039 626.5 North America
SRR2053040 803.69 North America
SRR2053041 616.33 North America
SRR2053042 1079 North America
SRR2053043 906.57 North America
SRR2053044 742.13 North America
SRR2053045 790.02 North America
SRR2053046 781.9 North America
SRR2053047 885.39 North America
SRR2053048 491.75 North America
SRR2053049 883.87 North America
SRR2053050 915.02 North America
SRR2053051 944.57 North America
SRR2053052 642.98 North America
SRR2053053 894.36 North America
SRR2053054 643.77 North America
SRR2053055 1493.05 North America
SRR2053056 638.61 North America
SRR2053057 554.35 North America
SRR2053058 717.96 North America
SRR2053059 688.58 North America
SRR2053061 857.74 North America
SRR2053062 767 North America
SRR2053063 845.86 North America
SRR2053064 927.41 North America
SRR2053065 390.22 North America
SRR2053067 1170.13 North America
SRR2053068 1209.42 North America
SRR2053069 955.16 North America
SRR2053070 845.2 North America
SRR2053071 524.75 North America
SRR2053072 1245.94 Asia
SRR2053073 1688.66 Europe
SRR2053074 959.47 North America
SRR2053075 1260.95 North America
SRR2053076 934.44 North America
SRR2053077 1764.53 Europe
SRR2053078 818.77 North America
SRR2053079 586.82 North America
SRR2053081 651.53 Asia
SRR2053083 732.85 Asia
SRR2053084 715.3 Asia
SRR2053085 867.69 Asia
SRR2053086 908.06 Asia
SRR2053087 892.59 Asia
SRR2053088 859.19 Asia
SRR2053089 463.45 Asia
SRR2053090 549.38 Asia
SRR2053091 1170.22 Asia
SRR2053092 612.1 Asia
SRR2053093 1134.81 Asia
SRR2053094 410.8 Asia
SRR2053095 593.49 Asia
SRR2053096 542.02 Asia
SRR2053097 482.64 Asia
SRR2053098 671.66 Asia
SRR2053099 1414.04 Asia
SRR2053100 378.52 Asia
SRR2053101 638.39 Asia
SRR2053102 637.86 Asia
SRR2053103 1539.57 Asia
SRR2053104 1048.44 Asia
SRR2053105 840.4 Asia
SRR2053106 519.54 Asia
SRR2053107 475.99 Asia
SRR2053108 434.93 Asia
SRR2053109 533.56 Europe
SRR2053110 1275.52 Europe
SRR2053111 858.59 Europe
SRR2053112 828.15 Europe
SRR2053113 388.95 Europe
SRR2053114 1048.06 Europe
SRR2053115 1851.59 Europe
SRR2053116 643.04 Europe
SRR2053117 676.9 Europe
SRR2053118 783.79 Europe
SRR2053119 699.55 Asia
SRR2053120 768.01 Asia
SRR2053121 823.06 Asia
SRR2053122 664.5 Asia
SRR2053126 1032.69 Europe
SRR2053127 643.54 Europe
SRR2053128 1203.51 Europe
SRR2053129 1278.05 North America
SRR2053130 1849.31 Asia
SRR2053131 1581 Europe
Table S7: Run accession numbers, number of bases and continental origins of 92 white
oak NGS samples from NCBI BioProject PRJNA269970. Continental origins are defined
based on samples’ coordinates according to [5].
Species Class Assembly accession Total sequence length (Mb)
Cyprinus carpio Fish GCF_000951615.1 1713.66
Carassius auratus Fish GCF_003368295.1 1820.64
Cyprinodon nevadensis Fish GCA_000776015.1 1011.85
Seriola lalandi Fish GCF_002814215.1 732.51
Channa argus Fish GCA_004786185.1 644.13
Astyanax mexicanus Fish GCF_000372685.2 1335.24
Neolamprologus brichardi Fish GCF_000239395.1 847.91
Latimeria chalumnae Fish GCF_000225785.1 2860.59
Oryzias latipes Fish GCF_002234675.1 734.06
Oncorhynchus mykiss Fish GCF_002163495.1 2179.00
Poecilia reticulata Fish GCF_000633615.1 731.62
Xiphophorus maculatus Fish GCF_002775205.1 704.32
Nothobranchius furzeri Fish GCF_001465895.1 1242.52
Maylandia zebra Fish GCF_000238955.4 957.49
Salmo salar Fish GCF_000233375.1 2966.89
Pyxicephalus adspersus Amphibian GCA_004786255.1 1563.37
Nanorana parkeri Amphibian GCF_000935625.1 2053.87
Rana catesbeiana Amphibian GCA_002284835.2 6250.35
Rhinella marina Amphibian GCA_900303285.1 2551.76
Rhinatrema bivittatum Amphibian GCF_901001135.1 5319.24
Xenopus laevis Amphibian GCF_001663975.1 2718.43
Xenopus tropicalis Amphibian GCF_000004195.3 1440.40
Terrapene carolina Reptile GCF_002925995.2 2571.27
Crotalus viridis Reptile GCA_003400415.2 1340.20
Malaclemys terrapin Reptile GCA_001728815.2 2439.75
Vipera berus Reptile GCA_000800605.1 1532.39
Chrysemys picta Reptile GCF_000241765.3 2365.77
Dermochelys coriacea Reptile GCA_006547105.1 2154.17
Varanus komodoensis Reptile GCA_004798865.1 1507.95
Hydrophis melanocephalus Reptile GCA_004320005.1 1402.64
Emydocephalus ijimae Reptile GCA_004319985.1 1625.20
Hydrophis hardwickii Reptile GCA_004023765.1 1296.39
Salvator merianae Reptile GCA_003586115.2 2068.17
Lacerta viridis Reptile GCA_900245905.1 1439.84
Gopherus agassizii Reptile GCA_002896415.1 2184.97
Laticauda colubrina Reptile GCA_004320045.1 2024.69
Crocodylus porosus Reptile GCF_001723895.1 2049.54
Phylloscopus trochiloides Bird GCA_001655095.1 1003.33
Aquila chrysaetos Bird GCF_000766835.1 1192.74
Hirundo rustica Bird GCA_003692655.1 1213.74
Limosa lapponica Bird GCA_002844005.1 1034.77
Saxicola maurus Bird GCA_900205225.1 1020.37
Strix occidentalis Bird GCA_002372975.1 1255.54
Himantopus himantopus Bird GCA_003993805.1 1116.81
Patagioenas fasciata Bird GCA_002029285.1 1089.15
Lagopus muta Bird GCA_004320205.1 1002.56
Tympanuchus cupido Bird GCA_001870855.1 983.78
Lonchura striata Bird GCF_002197715.1 1060.17
Zosterops lateralis Bird GCA_001281735.1 1036.00
Buceros rhinoceros Bird GCF_000710305.1 1065.78
Chlamydotis undulata Bird GCA_003400225.1 1307.64
Phoenicopterus ruber Bird GCA_000687265.1 1132.18
Canis lupus Mammal GCF_003254725.1 2439.83
Sus scrofa Mammal GCF_000003025.6 2501.91
Mus musculus Mammal GCF_000001635.26 2730.86
Odocoileus hemionus Mammal GCA_004115125.1 2343.70
Rhinolophus ferrumequinum Mammal GCA_004115265.2 2075.79
Eulemur fulvus Mammal GCA_004027275.1 2748.89
Panthera tigris Mammal GCF_000464555.1 2391.08
Enhydra lutris Mammal GCF_002288905.1 2455.28
Homo sapiens Mammal GCF_000001405.39 3099.71
Solenodon paradoxus Mammal GCA_004363575.1 2109.88
Cebus capucinus Mammal GCF_001604975.1 2717.70
Marmota marmota Mammal GCF_001458135.1 2510.59
Neophocaena asiaeorientalis Mammal GCF_003031525.1 2284.63
Macaca fuscata Mammal GCA_003118495.1 2930.71
Ursus arctos Mammal GCF_003584765.1 2328.66
Table S8: Species names, classes, assembly accession numbers and total sequence lengths
of 67 vertebrate genomes downloaded from NCBI.
References
[1] Ji Qi, Hong Luo, and Bailin Hao. CVTree: a phylogenetic tree reconstruction tool
based on whole genomes. Nucleic Acids Research, 32(suppl 2):W45–W47, 2004.
[2] Gesine Reinert, David Chew, Fengzhu Sun, and Michael S Waterman. Alignment-
free sequence comparison (i): statistics and power. Journal of Computational Biology,
16(12):1615–1634, 2009.
[3] Polina Perelman, Warren E Johnson, Christian Roos, Hector N Seuánez, Julie E Horvath,
Miguel AM Moreira, Bailey Kessing, Joan Pontius, Melody Roelke, Yves Rumpler,
et al. A molecular phylogeny of living primates. PLOS Genetics, 7(3):e1001342, 2011.
[4] Webb Miller, Kate Rosenbloom, Ross C Hardison, Minmei Hou, James Taylor, Brian
Raney, Richard Burhans, David C King, Robert Baertsch, Daniel Blankenberg, et al.
28-way vertebrate alignment and conservation track in the UCSC genome browser.
Genome Research, 17(12):1797–1808, 2007.
[5] Kujin Tang, Jie Ren, Richard Cronn, David L Erickson, Brook G Milligan, Meaghan
Parker-Forney, John L Spouge, and Fengzhu Sun. Alignment-free genome comparison
enables accurate geographic sourcing of white oak DNA. BMC Genomics, 19(1):896,
2018.
Bibliography
[1] Morgan GI Langille, William WL Hsiao, and Fiona SL Brinkman. Evaluation of ge-
nomic island predictors using a comparative genomics approach. BMC Bioinformatics,
9(1):1, 2008.
[2] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng
Zhang, Webb Miller, and David J Lipman. Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Research, 25(17):3389–3402,
1997.
[3] William R Pearson and David J Lipman. Improved tools for biological sequence
comparison. Proceedings of the National Academy of Sciences, 85(8):2444–2448, 1988.
[4] Julie D Thompson, Desmond G Higgins, and Toby J Gibson. CLUSTAL W: improving
the sensitivity of progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice. Nucleic Acids Research,
22(22):4673–4680, 1994.
[5] Yingnan Cong, Yao-ban Chan, and Mark A Ragan. A novel alignment-free method
for detection of lateral genetic transfer based on TF-IDF. Scientific Reports, 6(30308),
2016.
[6] Ji Qi, Hong Luo, and Bailin Hao. CVTree: a phylogenetic tree reconstruction tool
based on whole genomes. Nucleic Acids Research, 32(suppl 2):W45–W47, 2004.
[7] Gesine Reinert, David Chew, Fengzhu Sun, and Michael S Waterman. Alignment-free
sequence comparison (i): statistics and power. Journal of Computational Biology,
16(12):1615–1634, 2009.
[8] Gregory E Sims, Se-Ran Jun, Guohong A Wu, and Sung-Hou Kim. Alignment-free
genome comparison with feature frequency profiles (FFP) and optimal resolutions.
Proceedings of the National Academy of Sciences, 106(8):2677–2682, 2009.
[9] Huan Fan, Anthony R Ives, Yann Surget-Groba, and Charles H Cannon. An as-
sembly and alignment-free method of phylogeny reconstruction from next-generation
sequencing data. BMC Genomics, 16(1):522, 2015.
[10] Brian D Ondov, Todd J Treangen, Páll Melsted, Adam B Mallonee, Nicholas H
Bergman, Sergey Koren, and Adam M Phillippy. Mash: fast genome and metagenome
distance estimation using MinHash. Genome Biology, 17(1):132, 2016.
[11] Andrzej Zielezinski, Susana Vinga, Jonas Almeida, and Wojciech M Karlowski.
Alignment-free sequence comparison: benefits, applications, and tools. Genome Biology,
18(1):186, 2017.
[12] Jie Ren, Xin Bai, Yang Young Lu, Kujin Tang, Ying Wang, Gesine Reinert, and
Fengzhu Sun. Alignment-free sequence analysis and applications. arXiv preprint
arXiv:1803.09727, 2018.
[13] Shahab Sarmashghi, Kristine Bohmann, M Thomas P Gilbert, Vineet Bafna, and
Siavash Mirarab. Skmer: assembly-free and alignment-free sample identification using
genome skims. Genome Biology, 20(1):34, 2019.
[14] Andrzej Zielezinski, Hani Z Girgis, Guillaume Bernard, Chris-André Leimeister, Kujin
Tang, Thomas Dencker, Anna K Lau, Sophie Röhling, JaeJin Choi, Michael S Waterman,
et al. Benchmarking of alignment-free sequence comparison methods. Genome
Biology, 20(1):144, 2019.
[15] Bai Jiang, Kai Song, Jie Ren, Minghua Deng, Fengzhu Sun, and Xuegong Zhang.
Comparison of metagenomic samples using sequence signatures. BMC Genomics,
13(1):730, 2012.
[16] Kai Song, Jie Ren, Zhiyuan Zhai, Xuemei Liu, Minghua Deng, and Fengzhu Sun.
Alignment-free sequence comparison based on next-generation sequencing reads.
Journal of Computational Biology, 20(2):64–79, 2013.
[17] Kujin Tang, Jie Ren, Richard Cronn, David L Erickson, Brook G Milligan, Meaghan
Parker-Forney, John L Spouge, and Fengzhu Sun. Alignment-free genome comparison
enables accurate geographic sourcing of white oak DNA. BMC Genomics, 19(1):896,
2018.
[18] George E Fox, Linda J Magrum, William E Balch, Ralph S Wolfe, and Carl R Woese.
Classification of methanogenic bacteria by 16S ribosomal RNA characterization.
Proceedings of the National Academy of Sciences of the United States of America, 74(10):4537,
1977.
[19] GE Fox, E Stackebrandt, RB Hespell, J Gibson, J Maniloff, TA Dyer,
RS Wolfe, WE Balch, RS Tanner, LJ Magrum, et al. The phylogeny of prokaryotes.
Science, 209(4455):457–463, 1980.
[20] Carl R Woese. Bacterial evolution. Microbiological reviews, 51(2):221, 1987.
[21] David C Torney, Christian Burks, Daniel Davison, and Karl M Sirotkin. Computation
of d2: A measure of sequence dissimilarity. In Computers and DNA, pages 109–125.
Routledge, 2018.
[22] B Edwin Blaisdell. Markov chain analysis finds a significant influence of neighboring
bases on the occurrence of a base in eucaryotic nuclear DNA sequences both
protein-coding and noncoding. Journal of Molecular Evolution, 21(3):278–288, 1985.
[23] B Edwin Blaisdell. A measure of the similarity of sets of sequences not requiring
sequence alignment. Proceedings of the National Academy of Sciences, 83(14):5155–5159,
1986.
[24] Samuel Karlin and Jan Mrázek. Compositional differences within and between eukaryotic
genomes. Proceedings of the National Academy of Sciences, 94(19):10227–10232,
1997.
[25] Lin Wan, Gesine Reinert, Fengzhu Sun, and Michael S Waterman. Alignment-free
sequence comparison (ii): theoretical power of comparison statistics. Journal of
Computational Biology, 17(11):1467–1490, 2010.
[26] Susana Vinga and Jonas Almeida. Alignment-free sequence comparison-a review.
Bioinformatics, 19(4):513–523, 2003.
[27] Susana Vinga. Information theory applications for biological sequence analysis.
Briefings in Bioinformatics, 15(3):376–389, 2013.
[28] Oliver Bonham-Carter, Joe Steele, and Dhundy Bastola. Alignment-free genetic
sequence comparisons: a review of recent approaches by word analysis. Briefings in
Bioinformatics, 15(6):890–905, 2013.
[29] Samuel Karlin and Chris Burge. Dinucleotide relative abundance extremes: a genomic
signature. Trends in Genetics, 11(7):283–290, 1995.
[30] David C Torney, Christian Burks, Daniel Davison, and Karl M Sirotkin. Computation
of d2: a measure of sequence dissimilarity. In Computers and DNA: the proceedings of
the Interface between Computation Science and Nucleic Acid Sequencing Workshop,
held December 12 to 16, 1988 in Santa Fe, New Mexico/edited by George I. Bell,
Thomas G. Marr. Redwood City, Calif.: Addison-Wesley Pub. Co., 1990.
[31] Kai Song, Jie Ren, Gesine Reinert, Minghua Deng, Michael S Waterman, and Fengzhu
Sun. New developments of alignment-free sequence comparison: measures, statistics
and next-generation sequencing. Briefings in Bioinformatics, 15(3):343–353, 2013.
[32] Yang Young Lu, Kujin Tang, Jie Ren, Jed A Fuhrman, Michael S Waterman, and
Fengzhu Sun. CAFE: aCcelerated Alignment-FrEe sequence analysis. Nucleic Acids
Research, 45(W1):W554–W559, 2017.
[33] Jie Ren, Kai Song, Minghua Deng, Gesine Reinert, Charles H Cannon, and Fengzhu
Sun. Inference of Markovian properties of molecular sequences from NGS data and
applications to comparative genomics. Bioinformatics, 32(7):993–1000, 2016.
[34] Nathan A Ahlgren, Jie Ren, Yang Young Lu, Jed A Fuhrman, and Fengzhu Sun.
Alignment-free oligonucleotide frequency dissimilarity measure improves prediction of
hosts from metagenomically-derived viral sequences. Nucleic Acids Research, 45(1):39–53,
2017.
[35] Weinan Liao, Jie Ren, Kun Wang, Shun Wang, Feng Zeng, Ying Wang, and Fengzhu
Sun. Alignment-free transcriptomic and metatranscriptomic comparison using
sequencing signatures with variable length Markov chains. Scientific Reports, 6, 2016.
[36] Csaba Pál, Balázs Papp, and Martin J Lercher. Adaptive evolution of bacterial
metabolic networks by horizontal gene transfer. Nature Genetics, 37(12):1372, 2005.
[37] Carlton Gyles and Patrick Boerlin. Horizontally transferred genetic elements and their
role in pathogenesis of bacterial disease. Veterinary Pathology, 51(2):328–340, 2014.
[38] Matt Ravenhall, Nives Škunca, Florent Lassalle, and Christophe Dessimoz. Inferring
horizontal gene transfer. PLoS Computational Biology, 11(5):e1004095, 2015.
[39] Bingxin Lu and Hon Wai Leong. Computational methods for predicting genomic
islands in microbial genomes. Computational and Structural Biotechnology Journal,
14:200–206, 2016.
[40] Samuel Karlin and Chris Burge. Dinucleotide relative abundance extremes: a genomic
signature. Trends in Genetics, 11(7):283–290, 1995.
[41] Yingnan Cong, Yao-ban Chan, Charles A Phillips, Michael A Langston, and Mark A
Ragan. Robust inference of genetic exchange communities from microbial genomes
using TF-IDF. Frontiers in Microbiology, 8(21), 2017.
[42] Yingnan Cong, Yao-ban Chan, and Mark A Ragan. Exploring lateral genetic transfer
among microbial genomes using TF-IDF. Scientific Reports, 6(29319), 2016.
[43] Samuel Karlin. Detecting anomalous gene clusters and pathogenicity islands in diverse
bacterial genomes. Trends in Microbiology, 9(7):335–343, 2001.
[44] Aristotelis Tsirigos and Isidore Rigoutsos. A new computational method for the
detection of horizontal gene transfer events. Nucleic Acids Research, 33(3):922–933,
2005.
[45] Jennifer Becq, Cécile Churlaud, and Patrick Deschavanne. A benchmark of parametric
methods for horizontal transfers detection. PLoS ONE, 5(4):e9989, 2010.
[46] Morgan GI Langille, William WL Hsiao, and Fiona SL Brinkman. Detecting genomic
islands using bioinformatics approaches. Nature Reviews. Microbiology, 8(5):373, 2010.
[47] Patrick J Keeling and Jeffrey D Palmer. Horizontal gene transfer in eukaryotic
evolution. Nature Reviews. Genetics, 9(8):605, 2008.
[48] Jeffrey G Lawrence and Howard Ochman. Amelioration of bacterial genomes: rates
of change and exchange. Journal of Molecular Evolution, 44(4):383–397, 1997.
[49] Christine Dufraigne, Bernard Fertil, Sylvain Lespinats, Alain Giron, and Patrick
Deschavanne. Detection and characterization of horizontal transfers in prokaryotes using
genomic signature. Nucleic Acids Research, 33(1):e6–e6, 2005.
[50] Issaac Rajan, Sarang Aravamuthan, and Sharmila S Mande. Identification of
compositionally distinct regions in genomes using the centroid method. Bioinformatics,
23(20):2672–2677, 2007.
[51] Santiago Garcia-Vallvé, Anton Romeu, and Jaume Palau. Horizontal gene transfer in
bacterial and archaeal complete genomes. Genome Research, 10(11):1719–1725, 2000.
[52] Samuel Karlin. Bacterial DNA strand compositional asymmetry. Trends in
Microbiology, 7(8):305–308, 1999.
[53] Pierre Nicolas, Laurent Bize, Florence Muri, Mark Hoebeke, François Rodolphe,
S Dusko Ehrlich, Bernard Prum, and Philippe Bessières. Mining Bacillus subtilis
chromosome heterogeneities using hidden Markov models. Nucleic Acids Research,
30(6):1418–1426, 2002.
[54] Javier Tamames and Andrés Moya. Estimating the extent of horizontal gene transfer
in metagenomic sequences. BMC Genomics, 9(1):136, 2008.
[55] Michele Ruta and Anthony J Venables. International trade in natural resources:
practice and policy. World Trade Organization, Economic Research and Statistics Division,
pages 1–8, 2012.
[56] C May. Transnational crime in the developing world. Global Financial Integrity, pages
59–92, 2017.
[57] Christian Nellemann, Rune Henriksen, Patricia Raxter, Neville Ash, Elizabeth Mrema,
et al. The environmental crime crisis: threats to sustainable development from illegal
exploitation and trade in wildlife and forest resources. United Nations Environment
Programme (UNEP), 2014.
[58] Eleanor E Dormontt, Markus Boner, Birgit Braun, Gerhard Breulmann, Bernd Degen,
Edgard Espinoza, Shelley Gardner, Phil Guillery, John C Hermanson, Gerald Koch,
et al. Forensic timber identification: It's time to integrate disciplines to combat illegal
logging. Biological Conservation, 191:790–798, 2015.
[59] R Bruce Hoadley. Identifying wood: accurate results with simple tools. Taunton Press,
1990.
[60] Peter Gasson. How precise can wood identification be? Wood anatomy's role in support
of the legal timber trade, especially CITES. IAWA Journal, 32(2):137–154, 2011.
[61] Elisabeth A Wheeler and Pieter Baas. Wood identification-a review. IAWA Journal,
19:241–264, 1998.
[62] Victor Deklerck, Kristen Finch, Peter Gasson, Jan Van den Bulcke, Joris Van Acker,
Hans Beeckman, and Edgard Espinoza. Comparison of species classification models of
mass spectrometry data: Kernel discriminant analysis vs. random forest; a case study
of afrormosia (Pericopsis elata (Harms) Meeuwen). Rapid Communications in Mass
Spectrometry, 2017.
[63] Pamela J McClure, Gabriela D Chavarria, and Edgard Espinoza. Metabolic chemotypes
of CITES protected Dalbergia timbers from Africa, Madagascar, and Asia. Rapid
Communications in Mass Spectrometry, 29(9):783–788, 2015.
[64] Lichao Jiao, Min Yu, Alex C Wiedenhoeft, Tuo He, Jianing Li, Bo Liu, Xiaomei Jiang,
and Yafang Yin. DNA barcode authentication and library development for the wood of
six commercial Pterocarpus species: the critical role of xylarium specimens. Scientific
Reports, 8(1):1945, 2018.
[65] Isabel Mafra, Isabel MPLVO Ferreira, and M Beatriz PP Oliveira. Food authentication
by PCR-based methods. European Food Research and Technology, 227(3):649–665, 2008.
[66] Edgard O Espinoza, Cady A Lancaster, Natasha M Kreitals, Masataka Hata, Robert B
Cody, and Robert A Blanchette. Distinguishing wild from cultivated agarwood
(Aquilaria spp.) using direct analysis in real time and time-of-flight mass spectrometry.
Rapid Communications in Mass Spectrometry, 28(3):281–289, 2014.
[67] Kristen Finch, Edgard Espinoza, F Andrew Jones, and Richard Cronn. Source
identification of western Oregon Douglas-fir wood cores using mass spectrometry and random
forest classification. Applications in Plant Sciences, 5(5), 2017.
[68] Hye Jin Kim, Yong Taek Seo, Sang-il Park, Se Hee Jeong, Min Kyoung Kim, and
Young Pyo Jang. DART–TOF–MS based metabolomics study for the discrimination
analysis of geographical origin of Angelica gigas roots collected from Korea and China.
Metabolomics, 11(1):64–70, 2015.
[69] Hilke Schroeder, Richard Cronn, Yulai Yanbaev, Tara Jennings, Malte Mader, Bernd
Degen, and Birgit Kersten. Development of molecular markers for determining
continental origin of wood from white oaks (quercus l. sect. quercus). PloS one,
11(6):e0158221, 2016.
[70] Jie Ren, Kai Song, Minghua Deng, Gesine Reinert, Charles H Cannon, and Fengzhu
Sun. Inference of Markovian properties of molecular sequences from NGS data and
applications to comparative genomics. Bioinformatics, 32(7):993–1000, 2015.
[71] William Luppold and Matthew Bumgardner. Changes in US hardwood lumber exports,
1990 to 2008. In Proceedings, 17th Central Hardwood Forest Conference, pages 570–578.
U.S. Department of Agriculture, Forest Service, Northern Research Station, April 5-7
2011.
[72] Lumber Liquidators Inc. pleads guilty to environmental crimes and agrees to pay more
than $13 million in fines, forfeiture and community service payments. 2015. [Online;
accessed 22-October-2015].
[73] Zuguang Gu, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. circlize
implements and enhances circular visualization in R. Bioinformatics, 30(19):2811–2812,
2014.
[74] Victoria L Sork, Sorel T Fitz-Gibbon, Daniela Puiu, Marc Crepeau, Paul F Gugger,
Rachel Sherman, Kristian Stevens, Charles H Langley, Matteo Pellegrini, and Steven L
Salzberg. First draft assembly and annotation of the genome of a California endemic
oak Quercus lobata Née (Fagaceae). G3: Genes, Genomes, Genetics, 6(11):3485–3495,
2016.
[75] Namrata Sarkar, Emanuel Schmid-Siegert, Christian Iseli, Sandra Calderon, Caro-
line Gouhier-Darimont, Jacqueline Chrast, Pietro Cattaneo, Frederic Schutz, Laurent
Farinelli, Marco Pagni, et al. Low rate of somatic mutations in a long-lived oak tree.
bioRxiv, page 149203, 2017.
[76] Michael R Miller, Joseph P Dunham, Angel Amores, William A Cresko, and Eric A
Johnson. Rapid and cost-effective polymorphism identification and genotyping using
restriction site associated DNA (RAD) markers. Genome Research, 17(2):240–248, 2007.
[77] John D McVay, Duncan Hauser, Andrew L Hipp, and Paul S Manos. Phylogenomics
reveals a complex evolutionary history of lobed-leaf white oaks in western North America.
Genome, 60(9):733–742, 2017.
[78] Kasey K Pham, Andrew L Hipp, Paul S Manos, and Richard C Cronn. A time and
a place for everything: phylogenetic history and geography as joint predictors of oak
plastome phylogeny. Genome, 60(9):720–732, 2017.
[79] Melanie Schirmer, Umer Z Ijaz, Rosalinda D'Amore, Neil Hall, William T Sloan, and
Christopher Quince. Insight into biases and sequencing errors for amplicon sequencing
with the Illumina MiSeq platform. Nucleic Acids Research, 43(6):e37–e37, 2015.
[80] Jonas Korlach and Pacific Biosciences. Understanding accuracy in SMRT® sequencing,
2013.
[81] Joe Parker, Andrew J Helmstetter, Dion Devey, Tim Wilkinson, and Alexander ST
Papadopulos. Field-based species identification of closely-related plants using real-time
nanopore sequencing. Scientific Reports, 7(1):8345, 2017.
[82] Andrew L Hipp. Should hybridization make us skeptical of the oak phylogeny.
International Oaks, 26:9–18, 2015.
[83] Rémy J Petit, Catherine Bodénès, Alexis Ducousso, Guy Roussel, and Antoine Kremer.
Hybridization as a mechanism of invasion in oaks. New Phytologist, 161(1):151–164,
2004.
[84] Andrew L Hipp, Paul S Manos, Antonio González-Rodríguez, Marlene Hahn, Matthew
Kaproth, John D McVay, Susana Valencia Avalos, and Jeannine Cavender-Bares.
Sympatric parallel diversification of major oak clades in the Americas and the origins of
Mexican species diversity. New Phytologist, 217(1):439–452, 2018.
[85] John D McVay, Andrew L Hipp, and Paul S Manos. A genetic legacy of introgression
confounds phylogeny and biogeography in oaks. Proc. R. Soc. B, 284(1854):20170300,
2017.
[86] Marco C Simeone, Roberta Piredda, Alessio Papini, Federico Vessella, and Bartolomeo
Schirone. Application of plastid and nuclear markers to DNA barcoding of
Euro-Mediterranean oaks (Quercus, Fagaceae): problems, prospects and phylogenetic
implications. Botanical Journal of the Linnean Society, 172(4):478–499, 2013.
[87] Shannon CK Straub, Matthew Parks, Kevin Weitemier, Mark Fishbein, Richard C
Cronn, and Aaron Liston. Navigating the tip of the genomic iceberg: Next-generation
sequencing for plant systematics. American Journal of Botany, 99(2):349–364, 2012.
[88] Nathan A Ahlgren, Jie Ren, Yang Young Lu, Jed A Fuhrman, and Fengzhu Sun.
Alignment-free oligonucleotide frequency dissimilarity measure improves prediction
of hosts from metagenomically-derived viral sequences. Nucleic Acids Research,
45(1):39–53, 2016.
[89] Kujin Tang, Yang Young Lu, and Fengzhu Sun. Background adjusted alignment-free
dissimilarity measures improve the detection of horizontal gene transfer. Frontiers in
Microbiology, 9:711, 2018.
[90] Polina Perelman, Warren E Johnson, Christian Roos, Hector N Seuánez, Julie E Horvath,
Miguel AM Moreira, Bailey Kessing, Joan Pontius, Melody Roelke, Yves Rumpler,
et al. A molecular phylogeny of living primates. PLoS Genetics, 7(3):e1001342,
2011.
[91] Webb Miller, Kate Rosenbloom, Ross C Hardison, Minmei Hou, James Taylor, Brian
Raney, Richard Burhans, David C King, Robert Baertsch, Daniel Blankenberg, et al.
28-way vertebrate alignment and conservation track in the UCSC genome browser.
Genome Research, 17(12):1797–1808, 2007.
[92] Tiee-Jian Wu, Ying-Hsueh Huang, and Lung-An Li. Optimal word sizes for dissimilarity
measures and estimation of the degree of dissimilarity between DNA sequences.
Bioinformatics, 21(22):4125–4132, 2005.
[93] Xin Bai, Kujin Tang, Jie Ren, Michael Waterman, and Fengzhu Sun. Optimal choice
of word length when comparing two Markov sequences using a χ²-statistic. BMC
Genomics, 18(6):732, 2017.
[94] Weichun Huang, Leping Li, Jason R Myers, and Gabor T Marth. ART: a next-generation
sequencing read simulator. Bioinformatics, 28(4):593–594, 2011.
[95] ID 269970 - BioProject - NCBI. USDA Forest Service.
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA269970, Accessed 13 Oct 2019.
Abstract
This dissertation presents applications of alignment-free methods to several problems, including horizontal gene transfer (HGT) detection in bacterial genomes and geographic origin prediction for white oaks. It shows that alignment-free methods can be applied not only to complete genomes but also to unassembled NGS samples, regardless of the sequencing technology. Among the alignment-free methods examined, background adjusted dissimilarity measures such as d2* and d2S perform better than the other methods in our studies. In addition, we address two limitations of background adjusted alignment-free dissimilarity measures. First, we reimplemented the code for calculating d2* and d2S so that it now runs as fast as other state-of-the-art alignment-free methods such as Mash and Skmer. Second, we corrected the bias that arises when alignment-free methods are applied to sequencing data, using a neural network regression model.
Asset Metadata
Creator: Tang, Kujin (author)
Core Title: Applications and improvements of background adjusted alignment-free dissimilarity measures
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Computational Biology and Bioinformatics
Publication Date: 02/14/2020
Defense Date: 01/17/2020
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: alignment-free methods, d ⃰₂, d2*, d2s, horizontal gene transfer, machine learning, next-generation sequencing, OAI-PMH Harvest
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Sun, Fengzhu (committee chair), Waterman, Michael S. (committee chair), Zhong, Jiang F. (committee chair)
Creator Email: geniustang624@gmail.com, kujin@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-268872
Unique identifier: UC11673993
Identifier: etd-TangKujin-8167.pdf (filename), usctheses-c89-268872 (legacy record id)
Legacy Identifier: etd-TangKujin-8167.pdf
Dmrecord: 268872
Document Type: Dissertation
Rights: Tang, Kujin
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA