Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Alignment-free sequence comparison methods and applications to comparative genomics
/
Alignment-free sequence comparison methods and applications to comparative genomics [pdf]
(USC Thesis Other)
Alignment-free sequence comparison methods and applications to comparative genomics [pdf]
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Alignment-free sequence comparison methods and
applications to comparative genomics
by
Jie Ren
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
Thesis Committee:
Fengzhu Sun
Michael Waterman
Jed Fuhrman
August, 2017
The important thing is not to stop questioning. Curiosity has its own reason for existing.
- Albert Einstein
Acknowledgement
First of all, I want to thank my advisor Dr. Fengzhu Sun for all the help during this journey of my
Ph.D. study. Without his guidance and support, I would not have discovered my research interests,
overcome the obstacles and finally get some nice research results that I am proud of. I am very
grateful for being one of his students.
I want to thank Dr. Michael Waterman for his contributions to developing theories of the
distribution of word count, which provide the foundation for my research. I appreciate the time
and efforts he spent on commenting on my work, helping with my writings, recognition of my
ideas and more importantly sharing his visions in research and life with me.
I want to thank Dr. Jed Fuhrman for introducing me to the wonderful world of the microbiome.
Applying computational methods to making real discoveries brought me a lot of excitement. I also
want to thank him for proposing significant scientific problems and giving helpful suggestions and
comments on the work.
I want to thank my collaborator Dr. Nathan Ahlgren for the great efforts he made to accomplish
the work, and for his professional guidance of the projects, and for his trust, patience, and the
friendship. I am grateful for having the fruitful collaboration with him.
I want to thank Drs. Minghua Deng and Mingping Qian for leading me to the field of compu-
tational biology. The three years’ training at Peking University provides a sound basis for my later
research.
I want to thank Dr. Gesine Reinert for the guidance and efforts she made on supervising me on
my first first-author paper, and for her support and understanding during the past years.
I want to thank other collaborators both within and outside of USC for jointly publishing find-
ings. They are Dr. Xuegong Zhang, Dr. Ying Wang, Weinan Liao, Dr. Kai Song, Dr. Lin Wan,
Dr. Bai Jiang, Mengge Zhang, Yang Lu, Kujin Tang, Xin Kang, Xin Bai, Dr. Zhiyuan Zhai, Dr.
Xuemei Liu and Dr. Charles Cannon.
I want to thank professors in CBB, Drs. Andrew Smith, Ting Chen, Remo Rohs, Jasmine
Zhou, Liang Chen, Frank Alber, and Peter Ralph, for imparting knowledge in different fields of
bioinformatics to me.
I want to thank other friends at USC for the fun we have: Xiaofei Wang, Han Li, Weili Wang,
Wangshu Zhang, Beibei Xin, Jianghan Qu, Wenzheng Li, Qingjiao Li, Long Pei, Ben Decato,
Rishvanth Prabakar, Lin Yang, Zifan Zhu, Fang Zhang, Meng Zhou, Maoqi Xu, Xiaojin Ji, Jinsen
Li, Tsu-Pei Chiu, Yuan Zhang, Longlong Chang, Hailong Cui, Xize Wang, Quan Yuan, Timothy
Daley, Egor Dolzhenko, Philip J Uren and other friends.
Last but most importantly, I want to thank my husband Chao Deng and my parents for their
continuous support on me, belief in me, and understanding of me regardless of ups and downs.
This research is funded in part by USC Provost Fellowship, USC WiSE program, NSF, and
NIH.
Other authors and contributors to the thesis
Drs. Fengzhu Sun, Nathan Ahlgren, Jed Fuhrman, Michael Waterman, Gesine Reinert, Minghua
Deng, Ying Wang, Kai Song, Weinan and Yang Lu.
3
Contents
1 Introduction 1
1.1 Alignment-based methods for genomic sequence comparison . . . . . . . . . . . . 3
1.2 Alignment-free sequence comparison methods . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Research problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Applications of alignment-free methods: studying viruses in metagenomic data . . 10
1.3.1 Research problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Predicting hosts for metagenomically derived viral sequences using alignment-
free methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Identifying viral sequences from metagenomic data using sequence signa-
tures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Inference of Markovian Properties of Molecular Sequences from NGS Data 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 Probabilistic modeling of a MC sequence and random sampling of the
reads using NGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2 Estimating the order of the MC based on NGS reads . . . . . . . . . . . . 23
2.2.3 Estimating the effective coveraged . . . . . . . . . . . . . . . . . . . . . 24
4
2.2.4 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.5 Applications to the study of relationships among organisms . . . . . . . . 25
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Summary of simulation results . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 The relationship among 28 vertebrate species . . . . . . . . . . . . . . . . 27
2.3.3 The relationship among 13 tropical tree species with unknown reference
genomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Studying virus-host interactions using alignment-free sequence comparison methods 34
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Methods and Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.1 Virus and prokaryotic host databases . . . . . . . . . . . . . . . . . . . . . 39
3.2.2 Oligonucleotide frequency measures and availability of VirHostMatcher
software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.3 Determination of predicted host taxonomy . . . . . . . . . . . . . . . . . . 45
3.2.4 Availability of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.1 Initial testing of different ONF based measures . . . . . . . . . . . . . . . 46
3.3.2 Accuracy of ONF measures to predict host taxonomy . . . . . . . . . . . . 47
3.3.3 Approaches for further increasing accuracy . . . . . . . . . . . . . . . . . 52
3.3.4 Comparison of our improved ONF method to previous virus prediction
studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.5 Differences in ONF virus-host similarity and host prediction among viral
groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.6 Testing our improved ONF methods on previous virus-host interaction studies 56
5
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4 Identifying viral sequences from assembled metagenomic data using using sequence
signatures (k-mers) 72
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2.1 The effects ofk-mer length and contig length . . . . . . . . . . . . . . . . 78
4.2.2 Evaluation of VirFinder with contigs subsampled from known virus and
host genomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2.3 Comparison of VirFinder and VirSorter performance . . . . . . . . . . . . 81
4.2.4 Sensitivity of VirFinder to mutations . . . . . . . . . . . . . . . . . . . . . 83
4.2.5 Virus prediction for sequences from different host domains and phyla . . . 83
4.2.6 Assessment of the identification of novel viruses . . . . . . . . . . . . . . 84
4.2.7 VirFinder’s performance on assembled contigs from simulated metage-
nomic samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2.8 Example application: Identification and analysis of viral communities in
human gut metagenomes from a liver cirrhosis study . . . . . . . . . . . . 89
4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5 Methods and Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.5.1 Viruses and prokaryotic host genomes used for training and validation . . . 102
4.5.2 Thek-mer based machine learning prediction model . . . . . . . . . . . . 103
4.5.3 Simulation studies on metagenomes . . . . . . . . . . . . . . . . . . . . . 105
4.5.4 VirSorter settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.5.5 Assembly and analysis of human gut metagenomic samples from liver cir-
rhosis study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6
4.5.6 Availability of data and materials . . . . . . . . . . . . . . . . . . . . . . 108
5 General discussion and future work 111
5.1 Developing a full suite of theories for understanding Markov chain models from
NGS short reads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.2 Modeling genomic sequences with a more flexible model than Markov chains . . . 112
5.3 Integrating all types of information to infer virus-host interactions . . . . . . . . . 113
5.4 Improving virus identification using deep learning and integration of multiple types
of information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Supplementary data 116
References 117
List of Figures
1.1 Alignment-based and alignment-free methods for comparison of two genomes us-
ing NGS short reads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Scientific questions in the study of viruses using metagenomic data . . . . . . . . . 16
2.1 Clustering of 13 tropical tree species usingd
2
under MC with various orders . . . . 33
3.1 Distributions of virus-host distances/dissimilarities and ROC curves for theEu and
d
2
measures fork-mer length 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2 Prediction accuracy using ONF with various distance/dissimilarity measures atk-
mer length 6 on our benchmark dataset . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 The dependence of host prediction accuracy on the length of the query viral se-
quence and with simulated error . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4 Approaches for increasing host prediction accuracy in application of thed
2
mea-
sure (k-mer length 6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.5 Comparison of host taxonomy prediction for 285 marine viruses when using all
host genomes or only marine host genomes using the measured
2
(k-mer length 6) . 69
3.6 Comparison of genus level host prediction on 820 complete RefSeq viruses and
2,699 complete bacterial host genomes using different types of methods . . . . . . 69
8
3.7 Comparison of host prediction accuracy using the similarity measuresd
2
andMa
(k-mer length 6) on our benchmark dataset or the Roux et al. 2015 dataset . . . . . 70
3.8 Differences in virus-host dissimilarities and prediction accuracy between the three
major groups within Caudoviruses . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1 The impact ofk-mer size and contig length on the performance of VirFinder . . . . 80
4.2 Performance of VirSorter and VirFinder virus prediction for contigs subsampled
from virus and prokaryotic genomes . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3 Differences in VirFinder’s performance for different groups of viruses and when
excluding particular viruses from the training of VirFinder . . . . . . . . . . . . . 85
4.4 Results for VirFinder predictions made on contigs assembled from simulated hu-
man gut metagenomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5 Evaluation of VirFinder (VF) and VirSorter (VS) predictions on contigs assembled
from simulated human gut metagenomes at 50% viral level . . . . . . . . . . . . . 88
4.6 ROC curves and AUROC scores for classifying healthy and liver cirrhosis based
on viral abundance profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.7 Two way hierarchical clustering of viral contig bins and human gut metagenomic
samples from a liver cirrhosis study . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.8 Box plots summarizing the percentage of healthy or cirrhosis patient samples hav-
ing each viral bin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
9
List of Tables
2.1 The SPCC between the true distance matrix and the dissimilarity matrix byd
2
-type
dissimilarity measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1 The t-test p-values for comparisons of distance/dissimilarities of 352 virus-host
pairs vs. 352 random virus-host pairs. . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Area under the curve (AUC) for receiver operating characteristic (ROC) curves for
352 virus-host pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3 Host prediction accuracies for 1427 viruses from among 32 000 prokaryotic
hosts for various ONF measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4 Host prediction accuracies for 820 viruses from 2,699 host genomes from reference
Edwards et al. 2015 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Host predictions for 18 SUP05 metagenomically-assembled viruses . . . . . . . . 59
4.1 Summary information for 15 viral contig bins associated with liver cirrhosis . . . . 93
4.2 The number of fragments generated from the virus genomes before and after 1/1/2014103
10
Chapter 1
Introduction
Genomic sequence comparison is an important problem in genomics. In the past few decades,
the number of sequences has increased exponentially from 606 sequences stored in GeneBank by
the end of the year 1982 to 2 10
8
at the beginning of 2017 [1]. With the rapid growth of
sequencing data for various genetic molecules (DNA, RNA, proteins, etc.) from different species,
there is a huge demand of computational methods and tools for annotating the data and making
useful inferences. Comparison of genomic sequences is such a fundamental component that it
is involved in many aspects of computational biology and bioinformatics. For example, sequence
comparison has been widely used to identify homologous genes, study the evolutionary relationship
among species, understand the evolutionary dynamics of regulatory regions, and predict functions
of protein sequences based on sequence similarities.
Alignment-based methods were first developed to identify regions of similarity between se-
quences, by aligning two sequences and finding matches and mismatches at each base pair (bp).
Though alignment-based methods are intensively used in all kinds of sequence comparisons, it is
realized that alignment-based methods are limited for several scenarios such as comparison of dis-
tantly related sequences and comparison of sets of short fragmented sequences. Alignment-free
1
methods, as an alternative, have been proposed and received attention in recent years. Alignment-
free methods characterize sequences using short nucleotide words and assess sequence similarity
based on the similarity of word usage.
The efficiency of alignment-free methods depends on the choice of specified model for com-
puting word probabilities. Markov chains (MC) are a commonly used model for characterizing the
dependency between the bases and modeling the genomic sequence. The order of a MC specifies
the number of previous positions that the distribution of the current position depends on. Choos-
ing a correct model is critical for alignment-free methods. For single genomic sequences, efficient
statistics are available to estimate the order of MCs. But for samples from next generation sequenc-
ing (NGS) which contains a set of short fragmented sequences randomly sampled from genomic
sequences, the existing methods cannot be directly applied and new methods are needed. Thus, in
Chapter 2, we study the word count distribution and develop methods to estimate the order of MCs
based on short fragmented sequences.
Alignment-free sequence comparison methods have been successfully applied to inferring evo-
lutionary relationship among species, identifying gene sequences, classification of regulatory re-
gions, and clustering microbial samples from different environments. Several similarity/dissimilarity
measures based on word frequencies have been proposed for solving different problems. The
choice of word length and the use of proper measure affect the performance of the alignment-free
methods. Since one size does not fit all, different problems need to be studied individually to find
the optimal choice of word length and similarity/dissimilarity measure.
One new application of alignment-free methods is to predict virus-host interactions based on
sequence similarity. Viruses have rarely share genes with hosts, thus sequence alignment cannot be
used in many cases. Euclidian distance between word counts of virus and host genomic sequences
is frequently used to compute the sequence dissimilarity. Many other measures have been proposed
but no study has comprehensively compared different measures for the virus-host interaction pre-
2
diction problem. In Chapter 3, we study various dissimilarity measures and various word lengths
for the problem of predicting virus-host interactions. We aim to find the best choice of the word
length and the dissimilarity measure that gives the highest prediction accuracy.
Besides the success of alignment-free methods for the unsupervised learning such as cluster-
ing, the approach of characterizing sequences using occurrences of short words can also be used
for supervised learning. Several methods have been developed for the classification problems such
as assignment of sequences from different bacteria and classification of viruses infecting different
bacteria hosts. Sequence classification using the word counts avoids gene-based similarity searches
and thus works well for short sequences with no or incomplete genes. In Chapter 4, we develop
a method for identifying virus sequences from metagenomic sequencing data where genomic se-
quences from different bacteria and viruses in the same environment are sequenced. The results
show that the method correctly recalls more viral sequences than the state-of-the-art gene-based
virus classification tool, especially for short (1000 bp) sequences.
1.1 Alignment-based methods for genomic sequence comparison
Alignment-based methods are the most classical and widely used methods for genomic sequence
comparison, generally including both global sequence alignment using the Needleman-Wunsch
algorithm [2] and local sequence alignment using the Smith-Waterman algorithm [3]. The BLAST
algorithm [4], a sophisticated heuristic for Smith-Waterman algorithm, performs homology search
tens of thousands of times daily in National Center for Biotechnology Information (NCBI) [5]
Sequence alignment is a colinear arrangement of sequences indicating the identities as well as
differences in the sequences. Dynamic programing is used to find the optimal alignment given a
particular scoring function, defining the award for matches and the penalties for mismatches and
indels.
Sequence alignment gives details that how sequences mutate and evolve from one to another.
3
The evolutionary relationship between sequences (or between species that the sequences come
from) can be inferred from the sequence alignment. However, it has been realized that alignment-
based methods are not optimal under several scenarios. First, alignment-based methods suffer
when comparing whole genomic sequences and sequences> 10 kb [6, 7]. The availability of whole
genome sequences demands methods for aligning two long sequences of millions of base pairs. But
the classical alignment methods cannot be straightly applied for whole genome alignment because
it is prohibitively time-consuming [8]. More importantly, the classical alignment methods cannot
handle large sized genome variation events like translocations, inversions and duplication, and
horizontal gene transfer. Therefore, several methods have been developed for rapidly comparing
genomic sequences using different techniques such as seed-and-extend approach [9, 10], anchor-
based approach [11, 12, 13, 14], and hybrid of local alignment and global alignment [7, 15, 16].
Methods for identifying genome rearrangement have also been proposed [17, 18]. GenomeMatcher
and VISTA [19, 20] are visualization tools that integrate comparison results from multiple tools.
Several papers [21, 22, 23] review and assess the different whole genome alignment methods.
However, regardless of many different methods, alignments of long sequences remains a significant
challenge, particularly for noncoding regions and distantly related species [23].
Second, the similarity between regulatory regions of homologous genes is not accurately as-
sessed by either global or local alignment. A regulatory region, such as cis-regulatory module
(CRM), is a stretch of DNA sequence regulating expression of some nearby genes [24]. CRM is
typically few hundred base pairs long, and it contains short multiple transcription factor binding
sites (TFBS) which are 4 to 30 base pairs long. CRMs of homologous genes usually share common
TFBS. The main issue for comparing CRMs or identifying new CRMs is that CRMs are usually
not conserved.
Third, the comparison between two samples of short reads is challenging. Next-generation
sequencing technologies (NGS), one of the most powerful sequencing technologies with high ef-
4
ficiency and low cost, generate millions of short sequence fragments also known as reads that are
only hundreds of bps long from the genomes in an sequencing experiment (Figure 1.1 A). These
short reads are often assembled together by overlaps to generate longer sequences called contigs.
Several de-novo assemblers have been developed [25, 26, 27, 28, 29], but it is still very challenging
to reconstruct the original complete genomic sequence by assembly, especially for repeat regions
longer than the read length. The reconstructed genomic sequences can then be compared using se-
quence alignment (Figure 1.1 B). However, the fact that the assembled contigs come from different
parts of genomes and are often disconnected makes a direct alignment between two samples not
possible.
1.2 Alignment-free sequence comparison methods
Alignment-free sequence comparison methods, an alternative to the alignment-based methods,
were first proposed by Blaisdell [30] in 1986. The methods have been developed at a fast pace dur-
ing the past two decades. Instead of comparing sequences at the base pair level, those alignment-
free methods characterize sequences using occurrences of different nucleotide words (short se-
quences of letters, also well known ask-mers, k-tuples, sequence signatures) (Figure 1.1 C). As
similar dialects share similar usage of words, alignment-free methods assess sequence similarity
based on the similarity between the occurrences of nucleotide words in sequences.
Unlike alignment-based methods, alignment-free methods do not require the assumption of
a homologous relationship (common ancestry) among similar sequences. The methods can be
used to infer the distantly related sequences without homologous genes [31]. Also, alignment-free
methods count word frequencies in the sequence regardless of coding regions, non-coding regions
and local arrangements. Therefore they are more robust than alignment-based methods especially
against genetic recombinations and horizontal gene transfer [32].
Furthermore, since word count patterns are generally stable across different genomic regions
5
or different chromosomes, alignment-free methods work well even with sequences coming from
different regions of the genomes. One successful application is comparing NGS sample that each
sample contains millions of short reads randomly sampled from different parts of genomes [33,
34, 35]. Moreover, alignment-free methods are relatively faster and use much less storage than
alignment-based methods, because the methods avoid the steps of sequence alignment and short
reads assembly that are time and memory consuming.
Other than the methods based on word frequencies, a few studies propose alignment-free meth-
ods based on different philosophies. Several studies compare sequences based on the average
longest common substring [36], longest common substrings with k mismatches [37], and shortest
unique substring [38]. In addition, information theory is also applied for comparing sequences.
Otu and Sayood [39] adopt the idea of data compression and define sequence distances as the the
number of steps required to generate the second sequence given the prior knowledge of the first
string. See [40, 41] for in-depth review. In this dissertation, we only emphasize on the studies of
alignment-free methods using word frequencies.
The alignment-free methods have been applied with great success for many sequence cluster-
ing and classification applications. For example, the phylogeny of hundreds of prokaryotic species
is studied using different alignment-free methods [35, 42, 32]. Jun et al. [42] compare the phy-
logenic trees constructed using alignment-free methods with that using sequence alignment. In
particular, the phylogeny of 38 species from the Escherichia coli/Shigella group [43, 44, 32] is
frequently used for comparing and assessing different methods. Since viruses evolve at high a
mutation rate, viruses do not have universally conserved genes that can be aligned for compari-
son. Therefore, another application of alignment-free methods is to build the phylogeny of 142
large dsDNA virus families based on whole proteome sequences [45]. Other than prokaryotes and
viruses, alignment-free methods are also used for inferring the evolutionary relationship of large
sized genomes. The evolutionary relationship of 21 primates and 28 mammalian species is studied
6
using alignment-free methods [46]. The pairwise dissimilarity between species computed using
alignment-free methods is consistent with the evolutionary time inferred based on sequence align-
ment of homologous genes. In addition, alignment-free methods can also be used to classify CRMs
from different mouse tissues [47].
Three steps in general are involved in the alignment-free sequence comparison methods: (a)
decomposing genomic sequences into nucleotide words using a fix-sized sliding window scanning
along the genomic sequence, (b) counting the occurrences of words in each genomic sequence or
short reads sample, and (c) comparing sequences using similarity/dissimilarity measures defined
based on word frequencies in sequences. The word frequencies can be represented by a frequency
vector. For a given word lengthk, thek-mer word frequency vector is a vector of length 4
k
and
each element is the number of occurrences of the correspondingk-mer word. Counting the word
frequency is generally fast. The most crucial component in alignment-free methods is to define
a proper similarity/dissimilarity measure based on the word frequencies that can truly reflect the
relationship between genomes. Since higher similarity corresponds lower dissimilarity, a similarity
measure can be easily converted into a dissimilarity. Thus in the subsequent paragraphs we use
dissimilarity measure to refer the different measures.
Different dissimilarity measures are developed using various principles. Vinga and Almeida
[48] and [49] provide comprehensive reviews on the methods. Briefly speaking, the measures can
be classified into two groups: measures that consider background word frequency and those that do
not. For measures that do not consider background word frequencies, the observed word frequency
vectors counted from sequences are directly used to compute the dissimilarity measure. The mea-
sures include but not limited to, Euclidian distance (Eu), Manhattan distance (Ma), Chebyshev
distance (Ch),d
2
[30], feature frequency profiles (FFP) [31], and Jensen-Shannon divergence (JS)
[50].
For measures that take background word frequency into account, dissimilarity between se-
7
quences is computed using the normalized word frequencies, where the expected word frequencies
estimated using a background model are removed from the observed word frequencies, in order to
eliminate the effect of the background averaged word counts and to enhance the signal of enriched
and depleted words. This group of measures includesd
2
,d
S
2
[51, 52],Hao [53, 35],Teeling [54],
EuF [55] and Willner [56], where different forms of sequence background models are incor-
porated. See 3.2.2 for details including the underlying mathematical equations of the measures.
Other than the two groups of measures, there is a different set of simple measures which are based
on presence/absence of different words, such as Jaccard and Hamming distances, regardless of the
number of occurrences of words.
Among the various alignment-free sequence comparison methods, the dissimilarity measures,
d
2
andd
S
2
[33, 49], and their variants [57, 58, 59] have shown promise in many applications, such as
recovering the phylogenetic relationship among the genomes of different species [33, 34, 60], and
clustering sequencing samples from different microbial communities with respect to the change of
environmental gradients [61]. These two measures are defined as follows. ConsiderS
(1)
andS
(2)
are two samples containing either genomic sequences or NGS short reads. For a fixed length of
k, a k-mer word w = w
1
w
2
:::w
k
is a substring of length k, where w
i
2A =fA;C;G;Tg.
Because a DNA sequence is double stranded, to remove the effect of the strand direction, a word w
and its reverse complimentary word w are treated together. LetN
(i)
w
be the number of occurrences
of the word w and its reverse complimentary word w in thei-th sample,i = 1; 2. We defineE
(i)
w
to be the expected number of occurrences of word based on a background model of either thei:i:d
model where the positions on a sequence are Independent and Identical Distributed, or a Markov
chain model where the distribution of a nucleotide position depends on the content of its neighbors.
E
(i)
w
=M
(i)
(p
(i)
w
+p
(i)
w
); whereM
(i)
is the total number ofk-mers in thei-th sample, andp
(i)
w
is
the probability of word w in thei-th samples under a specific model. ThenD
2
andD
S
2
are defined
as follows,
8
D
2
=
X
w2A
k
~
N
(1)
w
~
N
(2)
w
q
E
(1)
w
E
(2)
w
; andD
S
2
=
X
w2A
k
~
N
(1)
w
~
N
(2)
w
q
~
N
(1)
w
2
+
~
N
(2)
w
2
;
where
~
N
(i)
w
=N
(i)
w
E
(i)
w
,i = 1; 2. Further, the dissimilarity measuresd
2
andd
S
2
, ranging from 0
to 1, are defined as,
d
2
=
1
2
0
B
B
B
B
@
1
D
2
s
P
w2A
k
~
N
(1)
w
2
E
(1)
w
s
P
w2A
k
~
N
(2)
w
2
E
(2)
w
1
C
C
C
C
A
; and
d
S
2
=
1
2
0
B
B
B
B
B
B
@
1
D
S
2
v
u
u
t
P
w2A
k
~
N
(1)
w
2
r
~
N
(1)
w
2
+
~
N
(2)
w
2
v
u
u
t
P
w2A
k
~
N
(2)
w
2
r
~
N
(1)
w
2
+
~
N
(2)
w
2
1
C
C
C
C
C
C
A
:
It can be seen thatd
2
andd
S
2
require the knowledge about the approximate distribution of word
counts in the underlying sequences. While a model that assumes that all letters in the sequence are
equally likely is relatively straightforward to analyse, see [51], a Markov model for the underlying
sequences is more flexible. Markov chains have been widely used to model molecular sequences
[62] with many applications including the study of dependencies between bases [63], the enrich-
ment and depletion of certain word patterns [64], prediction of occurrences of long word patterns
from short patterns [65, 66], and the detection of signals in introns [67]. Narlikar et al. [50] studied
the effect of the order of the MC on several biological problems including phylogenetic analysis,
assignment of sequence fragments to different genomes in metagenomic studies, motif discovery,
and functional classification of promoters. These applications showed the importance of accurate
specification of the order of MCs.
9
1.2.1 Research problems
Reliable estimators for the order of the MC and the transition probability matrix based on the se-
quence data are crucial. For single complete sequences, efficient statistics are available to estimate
the order of MCs and the transition probability matrix for the sequences [68, 69, 70, 71]. As NGS
data do not provide a single complete sequence, inference methods on Markovian properties of
sequences based on single complete sequences cannot be directly used for NGS short read data.
Due to the additional randomness in the process of sampling short reads from genomic sequences,
the traditional statistics for estimating the order of MC and transition probability no longer follow
the same distributions.
In Chapter 2, we show that the traditional Chi-square statistic for estimating the order of MC
has an approximate gamma distribution, using the Lander-Waterman model for physical mapping.
We also derive a normal approximation for word counts. We propose several methods to estimate
the order of the MC based on NGS reads and evaluate those using simulations. We illustrate the
applications of our results by clustering genomic sequences of several vertebrate and tree species
based on NGS reads using alignment-free sequence dissimilarity measures. We find that the esti-
mated order of the MC has a considerable effect on the clustering results, and that the clustering
results that use a MC of the estimated order give a plausible clustering of the species.
1.3 Applications of alignment-free methods: studying viruses in metage-
nomic data
It is widely recognized that bacteria and archaea (prokaryotes) dominate biomass in many ecosys-
tems, control important global biogeochemical cycles, and significantly impact the health of hu-
mans, animals, and crops [72]. However, much less is known about the viruses that infect prokary-
otes. Viruses generally outnumber prokaryotes by 10 times, and it is believed that viruses are the
10
most abundant biological entities on the earth [73, 74]. Viruses can only replicate themselves by in-
fecting host cells, and viral infections can lead to lysis of host cells. Thus, viruses consequently can
indirectly impact the ecological processes by regulating and controlling the amount of prokaryotes.
The conventional approach to study viruses is to isolate viruses using cultured host strains in
the lab. Though the host of the virus can be readily informed, the approach is low-throughput
and only works for viruses invading culturable host strains. Metagenomic sequencing, which uses
next generation sequencing technology to recover genetic material of microbial organisms from
environment samples, is high-throughput for identifying bacteria, archaea, and viruses regardless
of culturability (Figure 1.2A). Increasing numbers of new viruses have been discovered using this
technology in various environments including human gut [75, 76, 77, 78, 79], ocean [80, 81, 82],
and soil [83, 84, 85]. For example, crAssphage, an highly abundant ubiquitous viruses in human
gut, is discovered computationally using cross-assembly of reads from human gut metaviromes
[75]. The biological functions and bacteria hosts are unclear due to the difficulty of culturing it in
the lab.
Given a metagenomic sample, short reads (100 bp) are first assembled to obtain longer contigs
(1000 bp). Since viral contigs are mixed with bacterial and archaeal host contigs, one problem
is to identify viruses in the metagenomically assembled contigs (Figure 1.2B). Next, two basic
but crucial components of understanding the biology and impact of viruses on their hosts are,
characterizing the diversity of the viral community — what species there are and how abundant
they are, and understanding virus-host interactions— what hosts different viruses infect (Figure
1.2C). Therefore, in this dissertation, we address the two important problems involved in studying
viruses using metagenomic data: (1) identification of viral sequences from metagenomic data, and
(2) prediction of host species of viruses.
11
1.3.1 Research problems
Predicting hosts for metagenomically derived viral sequences using alignment-free methods
For the problem of predicting virus-host interactions, a few computational approaches based on
gene homology searches have been developed in recent years. Gene homology searches between
virus and host genomes are useful when viruses carry host genes, virus genomes are integrated
into host genomes, or host genomes contain CRISPR spacers from viruses. CRISPR is short for
Clustered Regularly Interspaced Short Palindromic Repeats, a prokaryotic immune system as a
molecular memory of prior viral infections [86]. Several database have been developed for CRISPR
detection [87, 88, 89, 90, 91]. However, several facts limit the extensive use of homology searches:
not many viruses share genes or regions with hosts; only lysogenic viruses integrate viral genome
into host genomes; the database for CRISPR is incomplete.
It is observed that viruses and their hosts have similar word count patterns, and thus the host
of a virus can be predicted as the one with smallest sequence dissimilarity to the virus [55]. The
reason behind the observation is thought to be that viruses adopt the codons used by their hosts,
due to the dependency of viruses on the translational machinery of their hosts [92, 93, 94]. There-
fore, alignment-free based methods are good alternatives when the virus and host have no homol-
ogous genes. Euclidian or Manhattan distance between the observed word frequency vectors are
commonly used in assessing the dissimilarity between virus and host pairs in the existing studies
[95, 96]. Several more sophisticated measures using the normalized word frequencies, such asd
2
andd
S
2
, could potentially improve host prediction, but no studies have evaluated them on predict-
ing hosts of viruses. In addition, the 4-mer word, also known as the tetramer, is frequently used in
comparing genomes. There is no study assessing different measures and different word lengths for
the problem of predicting virus-host interaction.
In Chapter 3, we conduct a comprehensive evaluation of 11 alignment-free distance/dissimilarity
measures over various word lengths for predicting virus-host interactions. We assess the methods
12
using the largest benchmark dataset containing 1427 virus isolate genomes whose true hosts are
known and32 000 prokaryotic genomes as host candidates. The background-subtracting mea-
sure d
2
at k = 6 gives the highest host prediction accuracy (33%, genus level) with reasonable
computational times. Requiring a minimum dissimilarity score for making predictions (threshold-
ing) and taking the consensus of the 30 most similar hosts further improves accuracy. Using a pre-
vious dataset of 820 bacteriophage and 2699 bacterial genomes,d
2
host prediction accuracies with
thresholding and consensus methods (genus-level: 64%) exceeds previous Euclidian distance ONF
(32%) or homology-based (22-62%) methods. We applyd
2
to metagenomically assembled marine
SUP05 viruses and the human gut virus crAssphage.d
2
-based predictions overlap (i.e. some same,
some different) with the previously inferred hosts of these viruses. The extent of overlap improves
when only using host genomes or metagenomic contigs from the same habitat or samples as the
query viruses. We conclude that the d
2
method will greatly improve the characterization of the
novel, metagenomic viruses.
Identifying viral sequences from metagenomic data using sequence signatures
In Chapter 4, we address the problem of identifying viral sequences from metagenomic data. Exist-
ing tools for distinguishing prokaryotic virus and host contigs primarily use gene-based similarity
approaches. The state-of-the-art method, VirSorter [97], predicts viral sequences by integrating
multiple types of evidence including presence/absence of viral hallmark genes, enrichment of viral-
like genes, enrichment of uncharacterized genes, and depletion of Pfam affiliated genes. VirSorter
relies heavily on similarity searches of available viral databases, so it may miss novel viruses for
which their hallmark genes have not been characterized or are poorly represented in reference
databases. Moreover, it requires at least three predicted genes within a contig to make a prediction,
thereby excluding many shorter contigs and contigs from non-coding regions.
From the empirical data we observe that viruses and hosts have discernibly different word
13
usage. Following the same approach of characterizing sequences into word frequencies used in
the alignment-free methods, we develop VirFinder, the first word frequency based, machine learn-
ing method for virus contig identification that entirely avoids gene-based similarity searches. We
train the machine learning model using sequences from host and viral genomes sequenced before
1/1/2014 and evaluate the performance on sequences obtained after 1/1/2014. The results show
that VirFinder has significantly better rates of identifying true viral contigs (true positive rates,
TPRs) than VirSorter, when evaluated with either contigs subsampled from complete genomes or
assembled from a simulated human gut metagenome. For example, for contigs subsampled from
complete genomes, VirFinder has 78-, 2.4-, and 1.8-fold higher TPRs than VirSorter for 1000 bp,
3000 bp, and 5000 bp contigs, respectively, at the same false positive rates as VirSorter (0, 0.003,
and 0.006, respectively), thus VirFinder works considerably better for small contigs than VirSorter.
VirFinder furthermore identifies several recently sequenced virus genomes (after 1/1/2014) that
VirSorter does not and that have no nucleotide similarity to previously sequenced viruses, demon-
strating VirFinder’s potential advantage in identifying novel viral sequences. Finally we apply
VirFinder to a set of human gut metagenomes from healthy and liver cirrhosis patients and we
observes a higher viral diversity in healthy individuals than cirrhosis patients. We also identify
contig bins containing crAssphage-like contigs with higher abundance in healthy patients and a
putative Veillonella genus prophage associated with cirrhosis patients. This innovative word based
tool complements gene-based approaches and will significantly improve prokaryotic viral sequence
identification, especially for metagenomic-based studies of viral ecology.
14
Solution 1: Assembly + Alignment dynamic programming alignment assembly Solution 2: Alignment-free without assembly dissimilarity
measure extract signature Genome A Genome B Compare B A C Figure 1.1: Alignment-based and alignment-free methods for comparison of two genomes using next gener-
ation sequencing (NGS) short reads. (A) Two samples of NGS short reads are generated randomly from the
original genomes respectively. Since the original genomes are unknown, the comparison of two genomes is
based on the comparison of the two NGS short reads samples. (B) The conventional method for comparing
two genomic sequences based on short reads assembly and sequence alignment. The short reads are assem-
bled by overlaps to generate long, more useful sequence contigs. The assembled contigs are then compared
using sequence alignment. (C) The alignment-free sequence comparison methods based on characterizing
sequences using k-mer word frequencies. Alignment-free methods first count the occurrences of k-mer
words in each NGS samples individually, and compare two genomes using dissimilarity measures based on
the word frequencies.
15
environmental sample NGS short reads assembled contigs viruses bacteria/archaea (A) Metagenomic sequencing assembly analysis separation (B) Separating viral and host contigs (C) Predicting virus-host pairs diversity bacteria/archaea viruses viruses bacteria/archaea bacteria/archaea viruses Figure 1.2: Scientific questions in the study of viruses using metagenomic data. A sample of genetic
materials of microbial organisms including bacteria (blue), archaea (blue) and viruses (red) is collected from
a specific environment such as gut, ocean, and soil (A). Next generation sequencing technology is used to
sequence genomes from the sample and many short reads about hundreds of bps long are generated. Short
reads are assembled into longer contigs. Contigs can come from bacteria (blue), archaea (blue) and viruses
(red). The first scientific question of interest is to identify viruses from the set of assembled contigs (B).
To study the diversity of the community, the species compositions of viral community and bacteria/archaea
community are characterized respectively. To understand the impact of viruses on their hosts, the second
scientific question of interest is to predict what hosts different viruses infect (C).
16
Chapter 2
Inference of Markovian Properties of
Molecular Sequences from NGS Data
2.1 Introduction
The alignment-free sequence dissimilarity measures,d
2
andd
S
2
have shown promise in recovering
the relationship between genomes or metagenomes [61, 33, 49, 98]. The definitions of the two
measures can be found in the introduction, which require the knowledge about the approximate
distribution of word counts in the underlying sequences. Markov chains (MC) have been widely
used to model molecular sequences with many applications [62, 63, 64, 65, 66, 67], and [50] found
the order of MCs has significant impact on several biological problems including phylogenetic
analysis, assignment of sequence fragments to different genomes in metagnomic studies, motif
discovery, and functional classification of promoters.
Estimating the order of MC model given a long MC sequence [68, 69, 70, 71] and fitting a
MC model to molecular sequences [99, 100, 101, 102] have been well studied. Based on relatively
long molecular sequences, for a general finite state MC sequence of letters from a finite alphabet
17
A = f1; 2; ;Lg of size L, Hoel [68] showed, under the hypothesis that the long sequence
follows a (k2)-th order MC, that twice the log-likelihood ratio of the probability of the sequence
under a (k1)-th order MC versus that under the (k2)-th order MC model follows approximately
a
2
-distribution withdf
k
= (L 1)
2
L
k2
degrees of freedom under general conditions. He also
approximated the log-likelihood ratio by the Pearson-type statistic
S
k
=
X
w2A
k
(N
w
E
w
)
2
E
w
; (2.1)
that is also approximately
2
-distributed with the same degrees of freedom. Here, w =w
1
w
2
w
k
denotes a k-word formed of letters w
i
2 A,
w = w
2
w
k
, w
= w
1
w
2
w
k1
, and
w
= w
2
w
k1
;N
w
denotes the count of the word w in the sequence, andE
w
=
N
w
N
w
N
w
is the estimated expected count of w if the sequence is generated by a MC of orderk 2. Here
k 3; see also [69] for a detailed study, [70, 71] for an an excellent exposition of statistical issues
related to MCs, as well as [99, 100, 101, 102] for applications to sequence analysis.
The Chi-square statistic (2.1) and the log-likelihood ratio statistics can be used to test the order
of a MC, using allk-words w2A
k
. When a particular order of MC is rejected, we can identify
particular word patterns that are exceptional, through the approximate distribution of N
w
. The
approximate distributions ofN
w
in long sequences is well understood, see for example [99, 103,
101, 100]. In particular, suppose that the sequence follows a stationary (k 2)-th order MC and let
^
2
w
=E
w
1
N
w
N
w
1
N
w
N
w
:
For
Z
w
=
N
w
E
w
^
w
; (2.2)
Theorem 6.4.2 in [101] states, as sequence length goes to infinity, for all real valuesx, P(Z
w
18
x)! (x); where denotes the cumulative distribution function of a standard normal variable.
We also say thatZ
w
converges to the standard normal distributionN(0; 1) in distribution. This
asymptotic result can then be used to find exceptional words in long sequences.
Given an NGS short read sample, it is tempting to use the test statisticS
k
defined in (2.1) to
test the order of a MC by simply counting the number of the occurrences of words in short read
data. However, as the short reads from NGS data are sampled randomly from the genome, some
parts of the genome are possibly not sampled and some parts are possibly sampled extensively.
The sampling process introduces additional randomness to the statistic, and makesS
k
deviate from
its traditional
2
-distribution. Similarly, the approximate distribution ofZ
w
given in (2.2) will be
different from the standard normal distribution.
In this chapter, we study these approximate distributions, both theoretically and by simula-
tions. First we extend the statistics S
k
and Z
w
for a MC sequence to S
R
k
and Z
R
w
for the NGS
read data. Our underlying model for the distribution of reads along the genome is the potentially
inhomogeneous Lander-Waterman model for physical mapping [104]. We discover that for a set of
short reads sampled from a (k2)-th order MC sequence, the statisticS
R
k
follows approximately a
gamma distribution with shape parameterdf
k
=2 and scale parameter 2d, whered is a factor related
to the distribution of the reads along the genome. We also show that, with the same factord, the
distribution of the single word statisticZ
R
w
=
p
d tends to the standard normal distribution. Based
on the theoretical results, we introduce an estimator for the order of the MC using NGS data. For
practical purposes, we also give an estimator for the factord when the underlying reads sampling
distribution is unknown. To the best of our knowledge, this is the first study of the Markovian
properties of molecular sequences based on NGS read data.
To illustrate our theoretical results and our estimators, we first carry out a simulation study
based on transition probability matrices that are estimated from cis-regulatory module (CRM)
DNA sequences, and insert repeats. We simulate different read lengths, numbers of reads, in-
19
homogeneous sampling, as well as sequencing errors, and we include a regime where the sampling
rate depends on the GC content. If the GC bias is not very strong or the sequencing depth is not
very low, then the simulation results agree with our theoretical predictions despite the theoretical
assumptions being slightly violated.
Next we apply our methods to cluster 28 vertebrate species using our alignment-free dissimi-
larity measuresd
2
andd
S
2
under different MC models that are estimated from NGS read samples.
The estimated orders based on NGS data without assembly are found to be consistent with those
inferred directly from the long genome sequences. The clustering performs best when using MCs
around the estimated order. Applying the same analysis to 13 tropical tree species whose genomes
are unknown, based on their NGS read samples, the most plausible clustering is achieved when
using a MC model of order close to the one estimated from the NGS reads.
The chapter is organized as follows. The “Methods” section contains probabilistic models for
generating the MC sequence and sampling the short reads, as well as the theorem for the approx-
imate distributions of S
R
k
and Z
R
w
for NGS data. This theorem is used to derive our estimators
for the order of the MC and for the factord. In the “Results” section, we first provide extensive
simulation studies including a comparison of the theoretical approximate distributions and the sim-
ulated results for S
R
k
and Z
R
w
, the effect of inhomogeneous sampling and sequencing errors, the
efficiency of the estimator of the factord, and evaluations of the methods for estimating the MC
order. Second, we estimate the orders of the MCs for 28 vertebrate species based on the simulated
whole genome NGS samples. We then use our dissimilarity measures d
2
and d
S
2
to cluster the
NGS samples of the 28 species under different MC orders to see the effect on the performance
of the clustering. The applications show that our new methods are effective for the inference of
relationships among sequences based on NGS reads. Finally, we use our methods to study the
relationships among 13 tree species whose complete genomic sequences as well as phylogenetic
relationships are unknown. Our clustering results are consistent with the physical characteristics of
20
the tree species. The chapter concludes with some discussion of the study.
2.2 Methods
2.2.1 Probabilistic modeling of a MC sequence and random sampling of the reads
using NGS
In NGS, a large number of reads are randomly sampled from the genome. Hence two random
processes are involved in the generation of the short read data: the generation of the underlying
genome sequence and the random sampling of the reads.
We use anr-th order homogeneous ergodic MC to model the underlying genome sequence with
each letter taking values in a finite alphabet setA of sizeL. Since our study is based on genomic
sequences, L = 4. As in [104, 105, 106, 107, 108], we assume that the genome is continuous
and that the distribution of reads along the genome follows a potentially inhomogeneous Poisson
process with ratec(x) at positionx. Ifc(x) = c for allx, we refer to the sampling of the reads
as homogeneous. We assume that all sampled reads have the same length of bps. A total ofM
reads are independently sampled from the genome of lengthG bps.
We extend the statisticsS
k
andZ
w
in (2.1) and (2.2) to NGS short read data accordingly. Let
N
R
w
be the number of occurrences of thek-word w in the short read data, where the superscriptR
refers to the “read” data, and define
S
R
k
=
X
w2A
k
N
R
w
E
R
w
2
E
R
w
; (2.3)
Z
R
w
=
N
R
w
E
R
w
^
R
w
; (2.4)
21
where
E
R
w
=
N
R
w
N
R
w
N
R
w
and (^
R
w
)
2
=E
R
w
1
N
R
w
N
R
w
!
1
N
R
w
N
R
w
!
:
We have the following theorem on the approximate distributions ofS
R
k
andZ
R
w
; the proof is
given in the Supplementary Materials. Note that for each read we discard the last k 1 posi-
tions as they would lead to words of length less thank; the error made with this approximation is
asymptotically negligible whenk is small relative to.
Theorem 1. Assume that the genome follows a (k 2)-th order MC that assigns non-zero prob-
ability to everyk-word w. LetS
R
k
andZ
R
w
be defined as in (2.3) and (2.4), respectively. Suppose
that the genome of lengthG can be divided into (not necessarily contiguous) regions with constant
coverage r
i
for the i-th region, so that every base is covered exactly r
i
times, based on the first
k + 1 positions of the reads. LetG
i
be the length of thei-th region that changes withG in a
way such that lim
G!1
G
i
=G =f
i
> 0 for thei-th region,i = 1; 2; . Let
d =
P
i
r
2
i
f
i
P
i
r
i
f
i
: (2.5)
Then, asG!1,
a) For eachk-word w, in distribution,Z
R
w
=
p
d!N(0; 1).
b) The statisticS
R
k
=d has an approximate
2
-distribution withdf
k
= (L 1)
2
L
k2
degrees of
freedom; equivalently, the statisticS
R
k
has an approximate gamma distribution with shape
parameterdf
k
=2 and scale parameter 2d.
If theM reads are sampled homogeneously along the genome with coveragec based on the
firstk + 1 positions of the reads along the genome, i.e.c =
M(k+1)
Gk+1
, the Lander-Waterman
formula [104] shows that the fraction of genome coveredr
i
=i times isf
i
= exp(c)c
i
=i!. Under
22
this assumption, we obtain
d =
P
i
i
2
f
i
P
i
if
i
=
c
2
+c
c
=c + 1:
The results in Theorem 1 continue to hold when takingd =c + 1.
In the Lander-Waterman model for physical mapping [104], the factorc =
M
G
is the coverage
of the genome. Hence we refer to d from (2.5) as the effective coverage of the reads along the
genome based on the firstk + 1 positions of each read.
2.2.2 Estimating the order of the MC based on NGS reads
Based on Theorem 1, we can estimate the orderr of a MC sequence using NGS reads. First, we
test the null hypothesis that the sequence follows an independent identically distributed (i.i.d; MC
order = 0) model. For a test at significance level , if S
R
2
=d is higher than the 1 quantile
of the
2
-distribution withdf = (L 1)
2
degrees of freedom, the i.i.d hypothesis is rejected. If
this null hypothesis is rejected, then here we propose an estimator for the order of a MC; it is an
analog of a corresponding established estimator of MC orders based on long sequences that has
been shown to be effective. In the Supplementary Materials we present four related estimators as
well as estimators based on the AIC and BIC information criteria; the one presented here has the
best performance in simulation studies.
We assume that the word lengthk 2 and that the assumptions of Theorem 1 are satisfied.
Then, for k r + 2, S
R
k
=d has approximately a
2
-distribution with (L 1)
2
L
k2
degrees of
freedom. Ifk¡r + 2, thenS
R
k
=d will typically be larger than expected from this
2
-distribution.
Forkr + 2, the law of large numbers gives that
S
R
k+1
LS
R
k
! 1 forG!1; ifk¡r + 2 then the ratio
will be much larger than 1 in the limit. Therefore we can estimater as follows:
^ r
S
k
= argmin
k
(
S
R
k+1
S
R
k
)
1: (2.6)
23
In general, we want the value of min
k
S
R
k+1
S
R
k
to be very small, e.g, less than 0.01.
Using the law of large numbers it can be shown that under our assumptions this estimator is
consistent, in the sense that ^ r
S
k
tends tor in probability asG tends to infinity.
2.2.3 Estimating the effective coverage d
Often the effective coveraged is not known and we would like to estimate the effective coveraged
using NGS short read data. From Theorem 1, we can see that, under the general conditions stated
in the theorem, (Z
R
w
)
2
=d follows a
2
-distribution with one degree of freedom. Since the median
of the
2
-distribution with one degree of freedom is about 0.456, we can use the scaled median as
a robust estimator ford;
^
d = medianf(Z
R
w
)
2
; w2A
k
g=0:456: (2.7)
When we assume that the underlying long sequence follows a MC of order at mostm, we use
(m + 2)-words to estimated using (2.7).
Note that for an i.i.d. model sequence, the set of 2-words would not yield meaningful results as
there are only 16 different 2-words and the median based on 16 numbers is generally not reliable.
As an underlying genome sequence following anr-th order MC can also be seen as an (r + 1)-th,
(r+2)-th, . . . , and higher order MC sequence, we can usek-words with relatively largek (r+2)
to estimate the factord, if the maximum order of a MC is unknown beforehand.
2.2.4 Simulation study
For the simulation study, we first generate MCs of different orders. For realistic parameter values,
the transition probability matrices of the MCs are based on real cis-regulatory module (CRM) DNA
sequences in mouse forebrain from [109]. We use CRM sequences here because CRM sequences
are often used to study the effectiveness of alignment-free sequence dissimilarity measures [47, 49,
24
59]. To take into consideration that in real genomic sequences many repeat regions are present, we
insert repeats into the generated MCs. We simulate NGS data by sampling a varying number of
reads of different lengths from the MC, varying genome length as well as coverage.
We include homogeneous and inhomogeneous sampling of the reads as well as sequencing
errors. We also let the sampling rate of the reads depend on the GC content of the fragments based
on data from the current sequencing technologies [110]. We set the sequencing error rate at 10%,
which is relatively high compared to the true sequencing error rates in order to clearly distinguish
among the estimators with regards to their robustness to sequencing errors. When a sequencing
error occurs at a position, the nucleotide base is changed to one of the other three nucleotides with
equal probability.
Once the NGS reads are generated, we calculate the statistics S
R
k
and Z
R
w
for each word w,
the order estimator ^ r
S
k
and the estimator for effective coveraged based on (2.7); each procedure
is repeated 1000 times. In each repeated experiment, we let the order estimator choose the model
from 1st, 2nd, , 5th order MCs; we estimate the effective coveraged by (2.7), using 3-tuples for
a first order MC, and 4-tuples for a second order MC. The details are given in the Supplementary
Materials.
2.2.5 Applications to the study of relationships among organisms
We test our methods on real and simulated NGS data from 28 vertebrate species whose complete
genomic sequences are available and that are comprehensively studied in [111, 112]. We download
the genomes of the 28 vertebrate species from UCSC Genome Browser, and then use MetaSim
[113] to simulate reads from each of the 28 vertebrate species. In simulations the accuracy of the
order estimation increases with read coverage. To reflect a worst-case scenario, we set the read
coverage to be 1 as a lower bound for the performance although the current sequencing technology
can generate data with very high read coverage. We set MetaSim to generate reads of length 62bp
25
under the error rate that is estimated by Illumina in our simulations.
To estimate the order of MC based on the NGS sample for each of the 28 species, we apply
the order estimator ^ r
S
k
in (2.6); there is no sharp ratio transition found overk = 2; ; 14. Given
that real genomes consist of multiple types of regions (coding, non-coding and regulatory regions)
and each type may fit to different MC models, the result indicates that no suitable MC model
can adequately fit all the patterns in the genome. Instead, we fit the data with a MC model that
can explain the majority (say 80%) of the word patterns in the genome. Motivated by the normal
approximation of a particular word statistic in Theorem 1, we study the fraction ofk-words whose
occurrences can be explained using the statistic (Z
R
w
)
2
=d by comparison to a
2
-distribution with
one degree of freedom with type I error 0.01. We estimate the order of MC to be the smallestk 2
under which more than 80% ofk-words can be explained by the (k 2)-th order MC.
To cluster the organisms, we use the inferred MC models to estimate the expected number
of occurrences of word patterns and then study the relationships among the organisms using our
dissimilarity measuresd
2
andd
S
2
. We briefly present their definitions below, please see [33, 49] for
details. Then we apply a similar approach to study the relationships among 13 tree species with
NGS reads, for which neither the complete genome sequences nor their relationships are known.
To estimate the unknown effective coveraged usingk-words by (2.7), we letk to be relatively large
and use
^
d as the value at which the estimatedd stabilizes ask increases.
2.3 Results
2.3.1 Summary of simulation results
Due to page limitations, we summarize the simulation results here; details are given in the Sup-
plementary Materials. Our extensive simulations show that the simulated mean, standard deviation
and distributions ofS
R
k
andZ
R
w
are very close to their corresponding theoretical approximations
26
given by Theorem 1. Both the effective coverage and the MC order can be estimated accurately
under the parameter settings of the current sequencing technologies.
2.3.2 The relationship among 28 vertebrate species
Table S4 shows the estimated orders of MCs for a group of 28 vertebrate species that are studied
in [111, 112] based on simulated NGS short reads. For each of the 28 species, we compute the
fraction of thek-words that have (Z
R
w
)
2
=
^
d within the 99% of a
2
-distribution with one degree of
freedom, fork = 8; 9;:::; 14. Using 80% as a threshold, we estimate the order of MC for each
species to be the smallest k 2 under which the fraction of words that can be explained by the
(k 2)-th order MC is greater than the threshold.
Comparing our results with the results in [50], where the order of MCs for a selection of
vertebrate genomes was estimated by AIC and BIC criteria using whole genome sequences, we
find that the estimated order based on NGS read data are almost the same as that estimated based
on the whole genome sequences in [50]. Our proposed methods of estimating the order of MC
based on short reads of NGS data achieve the same accuracy as that based on whole genome
sequences.
For a given value ofk, we computed
2
andd
S
2
using anr-th order MC,r = 0 (i.i.d model);:::; (k
2) for each pair of species, yielding a 28 28 pairwise dissimilarity matrix under each MC model.
To evaluate the dissimilarity measures, we use the pairwise distance matrix obtained from Figure
S1 in [111] as the gold standard for the dissimilarity between each pair of the 28 species; the ma-
trix is given as Table S5 in the Supplementary Materials. Note that the estimated orders of the 28
species range from 7 to 11, and the average order is 10. To study the performance of the dissimi-
larity measures under different orders of MC, we choosek = 14 such that we can study the results
under the MC model with orders up to 12.
Table 2.1 shows Spearman’s rank correlation coefficient (SPCC) between the standard dis-
27
tance and the dissimilarity estimated by thed
2
-type measures under MC models of various orders;
higher SPCC indicates better performance. Both measures,d
2
andd
S
2
, achieve their best results of
SPCC=0.92 when using a MC of order 12. Note that using the naive dissimilarity measured
2
only
gives SPCC=0.08.
In general bothd
2
andd
S
2
obtain higher SPCC with the standard distance matrix as the order
of MC increases, except ford
S
2
at order 9. In particular, the measured
2
has negative correlation
coefficient with the standard distance under the i.i.d model. The SPCC becomes stable when the
order of the MC used for the analysis is close to 11, the maximum estimated MC orders over the
28 species. Hered
S
2
is less affected by the order of the MC thand
2
. When the appropriate order of
MC is used,d
2
andd
S
2
perform similarly and much better thand
2
.
d
2
-type order=0 order=5 order=9 order=10 order=11 order=12
d
2
-0.21 -0.16 0.85 0.89 0.90 0.92
d
S
2
0.86 0.87 0.85 0.88 0.90 0.92
Table 2.1: The Spearman’s rank correlation coefficient (SPCC) between the true distance matrix and the
dissimilarity matrix byd
2
-type dissimilarity measures under MC models with order 0 (i.i.d), 5, 9, 10, 11 and
12. The length of thek-tuple word is 14.
2.3.3 The relationship among 13 tropical tree species with unknown reference genomes
We also apply our method to the 13 tree species based on the NGS shotgun read data sets in [114].
The reference genome sequences for the 13 tree species are unknown. Our objective is to cluster
these tree species usingd
2
andd
S
2
with MCs for the sequences.
The estimated order of the MC for all the 13 tree species is 8. We use the dissimilarity measures
d
2
andd
S
2
under various orders of MC as the background model to cluster the 13 tree species from
their NGS reads. We choosek = 11 so that we explore the MC with order up to 9. We use the
Unweighted Pair Group Method with Arithmetic Mean (UPGMA) to cluster the tree species.
The 13 trees species can be generally classified into two groups: 5 tree species from Moraceae
28
and 8 tree species from Fagaceae. The two Moraceaes, Ficus altissima and Ficus microcarpa,
should cluster together because they are known to be closely related and are both large hemiepi-
phytic trees while the other three Moraceae species are small dioecious shrubs. Within the Fa-
gaceae group, the two Castanopsis species should cluster together, and the five Lithocarpus species
should also form a subgroup. Trigonobalanus doichangensis (Fagaceae) is an ancestral genus that
is very divergent from the rest of the family and has undergone considerable sequence evolution. It
should not group within the class of Castanopsis and Lithocarpus in Fagaceae.
Figure 2.1 shows the clustering results of the 13 tree species usingd
2
under MCs of order 0
(i.i.d), 4, 8 and 9. The trees are built based on all the reads. From the results we can see, under
the i.i.d model, Lithocarpus mixes with Castanopsis; T. doichangensis can not be separated from
the rest of Fagaceae, while under the MC of order greater than 4, T. doichangensis is successfully
separated from the rest of the Fagaceae. Moreover, within the Moraceae group, Ficus fistulas and
Ficus langkokensis form a subgroup under the i.i.d model, and they are separated under the MC
with order greater than 4. While F . langkokensis is the closest Maraceae to the Fagaceae under 4th
order MC, F . fistulosa becomes the closest species to the Fagaceaes under 8th and 9th order MCs.
In order to see whether the clustering of the tree species can be correctly inferred using only
a portion of the shotgun read data, we randomly sample 10% of the total read data for each tree
species to cluster them. To study the variation of the clusters due to random sampling of the reads,
we repeat the sampling process 30 times and calculate the frequencies of each internal branch of
the clustering using all the reads occurring among the 30 clusterings. The number on the branch
refers to the frequency of the branch occurring among the 30 clusterings based on random sampled
10% reads. Three branches of the tree under MC of order 9 have frequencies of occurrence less
than 30. When using the MC of a very high order, the clustering becomes unstable.
For the clustering results using d
S
2
, see Figure S7. Under MC with all four orders, the two
Castanopsis and the five Lithocarpus species are grouped separately, and F . altissima (Moraceae)
29
and F .microcarpa (Moraceae) are clustered together. Under the i.i.d model, T.doichangenesis (Fa-
gaceae) is successfully separated from Lithocarpus, but it is not the most remote species in the
Fagaceae group. When the MC order is greater than 4, T.doichangenesis (Fagaceae) gets sepa-
rated from the rest of the Fagaceaes. It can also be seen that when using the i.i.d model, or a MC
with order 8 or greater, some of the branches becomes unstable.
In general, the results show that the clustering becomes more accurate as the order of MC in-
creases using bothd
2
andd
S
2
. Under the i.i.d model, the clustering based ond
2
does not correctly
separate Castanopsis from Lithocarpus, while the clustering based on d
S
2
groups the two types
separately. With higher order MCs,d
2
successfully separates Castanopsis from Lithocarpus. The
general clustering structure among Lithocarpus, Castanopsis, Trigonobalanus and Ficus stays cor-
rect when order is greater than 4 for both measures. When using the MC with order higher than
the estimated order, the clustering is unstable and indeed the branch for L.Hancei (Fagaceae) is not
supported on the last tree when using only 10% of the data. With a large number of parameters to
estimate, 10% of the data does not suffice to capture the information in the data. The best clustering
is achieved under a MC of order 8 and 9.
2.4 Discussion
Next generation sequencing technologies provide large amount of data in the form of short reads.
Assembly of the millions of short reads to recover the long sequences is challenging, because the
relative short length of the reads makes it difficult to resolve the repeat regions, not all regions
may be covered, and assembly is time consuming. While multiple sequence alignment may be
prohibitive, we can use word-count based dissimilarity measures to cluster the underlying species.
These measures require an underlying probability model for the sequences; Markov chains are a
reasonable model for such sequences. While transition probabilities can be estimated directly from
count data, estimating the order of a MC with these data is not straightforward.
30
Methods for estimating the order of a MC of a long sequence have been developed since the
1950s, but estimating the order of a MC directly from a set of short reads without assembly has
not been studied yet. In this chapter, we develop two statistics S
R
k
and Z
R
w
and show that both
S
R
k
andZ
R
w
have surprisingly simple approximate distributions with only two parameters, one of
them depending on the order of the original long MC sequence, and the other one depending on
the distribution of the reads along the sequence. Intriguingly, one of these parameters isd =c + 1
under homogeneous sampling, wherec is the coverage of the reads along the genome based on the
firstk + 1 positions of each read.
Based on the properties ofS
R
k
andZ
R
w
, we develop an estimator for the order of a MC as well
as an estimator for the parameterd based on NGS data. Extensive simulation studies are carried
out to verify the theorem and evaluate the estimator.
Finally, we apply the estimation methods to two NGS data sets. Since the real genome se-
quences consist of coding, non-coding and various regulatory regions, single standard MC mod-
els do not fit the data well. Moreover, some enriched patterns, such as the motif sequences, are
widespread throughout the genomes and violate the simple MC model for the whole genome se-
quence. Hence studying the fraction of k-words whose occurrences can be explained using the
statistic (Z
R
w
)
2
=d by comparison to a
2
1
distribution is a more realistic way to determine the or-
der of the MC for a real genome sequence. The estimated orders are consistent with the orders
estimated directly from the full genome sequences using BIC methods.
Our primary motivation for this study is alignment-free genome comparison using NGS data.
Further, we cluster the 28 species based on the NGS data using MC models with various orders.
The results show that the clustering performs best and gives stable results when using a MC model
with order on and above the estimated order. In addition, we apply the same analysis to 13 tropical
tree species whose reference genomes are unknown; again the best clustering is achieved under a
MC with the order within the estimated range.
31
When the read length is short or the sequencing depth is low, the numbers of occurrences
of some k-words become small or even zero. Then the assumption of non-zero variance for all
word counts which underlies the gamma approximation for S
R
k
no longer holds and the gamma
approximation may not work well. In such a situation an exact test for the order of MCs in the
spirit of [115] could be very helpful. In this chapter we have only made a start on the Markov chain
modelling of NGS data. An exhaustive study of errors in the data, in the form of power studies,
could help to further understand the application range of our results. Finally, in this work we take
the estimation of the transition probabilities for granted, once the order of the MC is determined.
While the estimation of the transition probabilities of the MC model of a long sequence has been
studied by [116] and [117], it would be interesting to extend these methods to NGS data.
32
(a) k11, order0,d
2
(b) k11, order4,d
2
(c) k11, order8,d
2
(d) k11, order9,d
2
Figure 2.1: Clustering of the 13 tropical tree species usingd
2
under MC with order 0 (i.i.d), 4, 8 and 9.
The number on the branch refers to the frequency of the branch occurring among the 30 clusterings based
on random sampled 10% reads. The letter ‘F’ at the beginning of the names represents Fagaceae; similarly
the letter ‘M’ represents Maraceae. 33
Linked assets
Alignment-free sequence comparison methods and applications to comparative genomics
Conceptually similar
PDF
sup_chapter4
PDF
Ahlgren_NAR_virus-host_suppmaterial_D2_JR_NA
PDF
etd-RenJie-5618-2.pdf
PDF
etd-RenJie-5618-4.pdf
PDF
etd-RenJie-5618-3.pdf
PDF
etd-RenJie-5618-sup_chapter2.pdf
PDF
FigureS1
PDF
Big data analytics in metagenomics: integration, representation, management, and visualization
PDF
Deep learning in metagenomics: from metagenomic contigs sorting to phage-bacterial association prediction
PDF
Constructing metagenome-assembled genomes and mobile genetic element host interactions using metagenomic Hi-C
PDF
Applications and improvements of background adjusted alignment-free dissimilarity measures
PDF
Feature engineering and supervised learning on metagenomic sequence data
PDF
Whole genome bisulfite sequencing: analytical methods and biological insights
PDF
Predicting virus-host interactions using genomic data and applications in metagenomics
PDF
A generalization of the accumulation curve in species sampling and its applications to high‐throughput sequencing
PDF
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
PDF
Application of machine learning methods in genomic data analysis
PDF
Developing statistical and algorithmic methods for shotgun metagenomics and time series analysis
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Statistical and computational approaches for analyzing metagenomic sequences with reproducibility and reliability
Asset Metadata
Creator
Ren, Jie
(author)
Core Title
Alignment-free sequence comparison methods and applications to comparative genomics [pdf]
Tag
OAI-PMH Harvest
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC11256133
Unique identifier
UC11256133
Identifier
etd-RenJie-5618-1.pdf (filename)
Legacy Identifier
etd-RenJie-5618-1
Dmrecord
417432
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
alignment-free
comparative genomics
machine learning
Markov chain
metagenomics
next generation sequencing
virus-host interaction