Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Alignment-free sequence comparison methods and applications to comparative genomics
/
etd-RenJie-5618-sup_chapter2.pdf
(USC Thesis Other)
etd-RenJie-5618-sup_chapter2.pdf
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
BIOINFORMATICS
Vol. 00 no. 00 2015
Pages 1–12
InferenceofMarkovianPropertiesofMolecular
SequencesfromNGSDataandApplicationsto
ComparativeGenomics:
SupplementaryMaterials
Jie Ren
1
, Kai Song
2
, Minghua Deng
2
, Gesine Reinert
3
, Charles H. Cannon
4;5
and Fengzhu Sun
1;6
1
Molecular and Computational Biology Program, University of Southern California, Los Angeles,
California, USA;
2
School of Mathematical Sciences, Peking University, Beijing, China;
3
Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK;
4
Department of Biological Sciences, Texas Tech University, TX 79409-3131, USA;
5
Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Yunnan, China;
6
Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University,
Shanghai, China.
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
1 PROOFOFTHEOREM1
Here we assume that the conditions of Theorem 1 prevail. In order to
prove Theorem 1, we introduce some notations. LetNw(i) denote
the number of occurrences of w in the i-th region, i = 1; 2; ;
Ew(i) =
N
w
(i)N
w
(i)
N
w
(i)
be the estimated expected number of
occurrences ofw in thei-th region under the (k 2)-th order MC
model; andPw be the probability ofw assuming that the MC starts
from the stationary distribution. Similarly, letNw ,Ew , and ^
2
w
be
the observed, expected, and variance of the number of occurrences
of w along the long genome sequence. The same notations with
superscript “R” indicate the corresponding quantities based on the
short read data. We assume thatk is small compared to, and hence
the edge effects are small. Then we have the following lemma.
LEMMA 1. a) Foranyk-wordw,foranyregioni = 1;:::;B,
indistribution,
lim
G!1
N
R
w
(i)E
R
w
(i)
^
R
w
=N
0;
r
2
i
fi
P
j
rjfj
!
: (1)
b) Foranyk-wordw,inprobability,
lim
G!1
E
R
w
P
i
E
R
w
(i)
p
E
R
w
= 0: (2)
Proof of Lemma 1. To prove part a), note that under the null
model that the MC is (k2)-th order, Theorem 6.4.2 in Reinertetal.
to whom correspondence should be addressed
(2005) gives that, as sequence length goes to infinity, in distribution,
Nw(i)Ew(i)
^ w(i)
!N(0; 1):
Here we use that the Markov chain assigns non-zero probability to
everyk-wordw and hence the variancew would be non-zero. As
N
R
w
(j) =rjNw(j) the corresponding result forN
R
w
(j) follows; it
remains to identify the asymptotic variance. We have for any region
i
lim
G!1
Nw(i)
Gi
=Pw; (3)
which does not depend oni as the MC is ergodic, and then we have
approximately,
Ew(i) =Gi
P
w
P
w
P
w
and
E
R
w
=
X
i
riEw(i) =
X
i
riGi
P
w
P
w
P
w
; i = 1; 2;:::;:
Thus
lim
G!1
E
R
w
(i)
E
R
w
=
rifi
P
j
rjfj
;
lim
G!1
1N
R
w
(i)=N
R
w
(i)
1N
R
w
=N
R
w
= 1;
lim
G!1
1N
R
w
(i)=N
R
w
(i)
1N
R
w
=N
R
w
= 1:
From the above three equations, we have
lim
G!1
^
R
w
(i)
^
R
w
=
s
rifi
P
j
rjfj
:
c
Oxford University Press 2015. 1
Renetal.
Therefore
N
R
w
(i)E
R
w
(i)
^
R
w
=
riNw(i)riEw(i)
^
R
w
(i)
^
R
w
(i)
^
R
w
!N
0;
r
2
i
fi
P
j
rjfj
!
:
For Part b) of this lemma, note that
0N
R
w
B
X
j=1
N
R
w
(j) 2Mk
as the only differences in the counts occur due to not counting
occurrences at the boundaries of the regions and there areM reads.
As Nw PwG and as limG!1
Mk
p
G
! 0, the second assertion
follows.
From Lemma 1, we can easily show the first assertion in Theorem
1 by noting that
Z
R
w
=
N
R
w
E
R
w
^
R
w
=
X
i
N
R
w
(i)E
R
w
(i)
^
R
w
+
P
i
E
R
w
(i)E
R
w
^
R
w
:
The last summand tends to 0 by (2). WhenGi is large then thei-
th term is close to a normal distribution with mean 0 and variance
r
2
j
f
j
P
i
r
i
f
i
. Since the dependence between the segments is weak, we
can treat the terms in the first summand as independent. Part a) of
Theorem 1 is proved.
Now we prove the part b) of Theorem 1. Suppose there are only
two regions with coverageri and region lengthGi,i = 1; 2. With
(2),S
R
k
has approximately the same distribution as
S
R;
k
=
X
w
N
R
w
E
R
w
(1)E
R
w
(2)
2
E
R
w
=
X
w
X
i
N
R
w
(i)E
R
w
(i)
p
E
R
w
!
2
=
X
w
0
@
X
i
Nw(i)Ew(i)
p
Ew(i)
s
riE
R
w
(i)
E
R
w
1
A
2
=
X
w
X
i
Wi
Nw(i)Ew(i)
p
Ew(i)
!
2
;
where
Wi =
s
riE
R
w
(i)
E
R
w
s
r
2
i
fi
P
j
rjfj
;i = 1; 2:
Generally suppose that the reads come fromB regions with each
region having the same coverage. Letri be the coverage andGi be
the genomic length of thei-th region. Using the same idea as above,
we can approximateS
R
k
by
S
R;
k
=
X
w
B
X
i=1
Wi
Nw(i)Ew(i)
p
Ew(i)
!
2
: (4)
Note that from Section 6.6.1 in Reinertetal. (2005), the vector
~
N(i) :=
Nw(i)Ew(i)
p
Ew(i)
!
w2A
k
!N
0;
2
L
k
L
k
;
in distribution, where
2
L
k
L
k
is the covariance matrix with rank
df
k
. Hence, in distribution,
X
w2A
k
Nw(i)Ew(i)
p
Ew(i)
!
2
!
2
(df
k
)
;
where
2
(df
k
)
is a
2
random variable withdf
k
= (L 1)
2
L
k2
degrees of freedom. On the other hand, we can find a L
k
df
k
matrixM such that
2
L
k
L
k
= MM
T
: LetM
1
be the pseudo-
inverse ofM, then we have
M
1
~
N(i) =
0
@
Z1(i)
:::
Z
df
k
(i)
1
A
=Z(i)
such that approximately Z(i) N (0;I
df
k
): As
~
N(i) = MZ(i)
we obtain that
~
N(i)
T
~
N(i) =Z(i)
T
M
T
MZ(i):
Since the left hand side has approximately a
2
distribution with
df
k
degrees of freedom, the right hand side should be in distribution
close toZ(i)
t
Z(i). Thus approximately,M
T
M = I
df
k
df
k
: Also
note thatM does not depends oni because the correlation structure
of
~
N(i) is the same across the regions under the null model. Then
(4) forS
R;
k
can be written as
S
R;
k
=
B
X
i=1
Wi
~
N(i)
!
T
B
X
i=1
Wi
~
N(i)
!
=
B
X
i=1
WiZ(i)
!
T
M
T
M
B
X
i=1
WiZ(i)
!
=
B
X
i=1
WiZ(i)
!
T
B
X
i=1
WiZ(i)
!
=
df
k
X
k=1
B
X
i=1
WiZ
k
(i)
!
2
:
Since Z
k
(i) are all approximately i.i.dN (0; 1) random variables
and
P
B
i=1
W
2
i
P
B
i=1
r
2
i
f
i
P
j
r
j
f
j
, we obtain that, in distribution,
Y
k
=
B
X
i=1
WiZ
k
(i)!N
0;
B
X
i=1
W
2
i
!
and
Y
k
q
P
B
i=1
W
2
i
!N (0; 1):
Hence
S
R
k
=
B
X
i=1
W
2
i
!
df
k
X
k=1
0
@
Y
k
q
P
B
i=1
W
2
i
1
A
2
!
P
B
i=1
r
2
i
fi
P
B
j=1
rjfj
!
2
(df
k
):
2
InferenceofMarkovianPropertiesofSequencesandApplications
2 METHODS
2.1 Estimating the order of a Markov chain based on
Theorem 1
We assume thatk 2 and that the assumptions of Theorem 1 are
satisfied. Moreover we assume for now thatd is known; in practice,
the effective coverage d is replaced by the estimated value
^
d. In
addition to our estimator
^ rS
k
= argmin
k
S
R
k+1
S
R
k
1 (5)
we define four related estimators based on Theorem 1; they are all
analogs of corresponding established estimators for the order of a
Markov chain.
1. Instead of usingS
R
k
directly, we can calculate the p-value
p
k
=P
S
R
k
s
R
k
=P
S
R
k
=ds
R
k
=d
=P
2
df
k
s
R
k
=d
;
wheres
R
k
is the observed value ofS
R
k
based on the short read
data. We expect p
k
to be small for k < r + 2 while p
k
will not be that small for k r + 2. Therefore, we expect
log(p
k+1
)= log(p
k
) to be the smallest whenk = r + 1. Thus
we can also estimate the order of a MC by
^ rp
k
= argmin
k
log(p
k+1
)
log(p
k
)
1: (6)
2. For a given significance level , we check consecutively if
p
k+1
< and stop whenp
k+1
. We estimater by
^ r
h
= argmin
k
fp
k+1
g1; for a given significant level.
(7)
To avoid early stopping, we can also require that bothp
k
and
p
k+1
are larger than.
3. Ifk r + 2, thenZ
R
w
=
p
d is approximately standard normal
N(0; 1) and thus (Z
R
w
)
2
=d has approximately a
2
-distribution
with one degree of freedom. Ifk < r + 2, for somek-word
w, (Z
R
w
)
2
=d is generally larger than a
2
-distributed random
variable with one degree of freedom. Therefore we would
expect Z
R
max
(k) to be large when k < r + 2 and Z
R
max
(k)
to be relatively small forkr + 2. AsZ
R
max
is the maximum
value over L
k
variables, we should divide Z
R
max
(k) by L
k
.
Therefore, we estimate the order of the MCr by
^ rZ
k
= argmin
k
Z
R
max
(k + 1)
Z
R
max
(k)
1; (8)
whereZ
R
max
(k) = max
w;jwj=k
jZ
R
w
j:
4. Extending the method by Morvai and Weiss (2005) and Peres
and Shields (2005), for a set of short reads and a (k 1)-word
v =v1v
k1
, define
4
k1
(v) = max
a2A
N
R
va
N
R
va
N
R
v
N
R
v
;
then
4
k
= max
v2A
k1
f4
k1
(v)g = max
w2A
k
N
R
w
N
R
w
N
R
w
N
R
w
is the maximum difference between the number of occurrences
of ak-word and its estimated expectation under the (k 2)-th
order MC. Our Peres-Shields-type estimator is
^ rPS(x) = argmax
k
4
k
4
k+1
1: (9)
2.2 Estimating the order of the MC based on modified
AIC and BIC
Several methods based on the Akaike information criterion (AIC)
and the Bayesian information criterion (BIC) have been proposed to
estimate the order of MC based on long sequences (Narlikar et al.,
2013; Tong, 1975; Katz, 1981; Hurvich and Tsai, 1989; Zhaoetal.,
2001). The AIC and BIC for long sequences are defined by
AIC(k) =2 log(Sequence-Likelihood;M
k
) + 2jM
k
j; (10)
AICc(k) =AIC(k) +
2jM
k
j (jM
k
j + 1)
(jSj
k
jM
k
1)
; and (11)
BIC(k) =2 log(Sequence-Likelihood;M
k
) +jM
k
j logjSj
k
;
(12)
whereM
k
indicates the k-th order Markov chain, jM
k
j is the
number of parameters for thek-th order Markov model, i.e.jM
k
j =
(L 1)L
k
, andjSj
k
is the size of the data. For a long sequence of
lengthG, we have thatjSj
k
=Gk is the number of (k+1) tuples.
However, no formulas have been defined for AIC and BIC of
Markov models based on NGS data. In the following we modify the
definitions of AIC and BIC given in (10), (11) and (12) respectively
so that they are applicable to NGS data. For a NGS short read
sample, we define the log-pseudo-likelihood under the k-th order
MC model as,
log(pLH
R
k
) =
X
w2L
k+1
N
R
w
log
N
R
w
N
R
w
;
by replacingNw withN
R
w
in the log-likelihood of a long sequence.
We think of the factord as the effective coverage of the reads along
the genome. So the pseudo-likelihood of the NGS data under the
k-th order of MC model is approximately the likelihood of the
effectivele covered region along the genome to the power of d,
i.e. pLH
R
k
(likelihood of the covered genomic region)
d
. Thus,
the log-likelihood of the covered genomic region is approximately
log(pLH
R
k
)=d. Based on the idea of AIC for long sequences,
we minimize2 log(pLH
R
k
)=d + 2(L 1)L
k
with respect tok.
Therefore, we propose the AIC for k=1, 2, . . . , based on NGS data
as
AIC
R
(k) =2 log(pLH
R
k
) + 2d(L 1)L
k
:
To define BIC and AICc for NGS data, we also need to find
a suitable analog of the data sizejSj
k
in (11) and (12). A naive
substitute is the total effective read length M(k), or in other
words, the number of (k + 1)-tuples used to estimate the likelihood
under thek-th order MC model. However, the reads can overlap and
adjustments are needed. Using the effective coverage factord as a
normalization factor from the NGS to long sequence, the length of
the effective covered genomic region is
M(k)
d
. Then we define
3
Renetal.
AICc and BIC for the NGS read data as
AICc
R
(k) =AIC
R
(k) +
2(L 1)L
k
(L 1)L
k
+ 1
M(k)
d
(L 1)L
k
1
; and
BIC
R
(k) =2 log(pLH
R
k
) +d(L 1)L
k
log
M (k)
d
:
We define the estimators of the order of a Markov model based on
AIC, AICc and BIC for the NGS read data by
^ r
AIC
R = argmin
k
AIC
R
(k) (13)
^ r
AICc
R = argmin
k
AICc
R
(k) (14)
and ^ r
BIC
R = argmin
k
BIC
R
(k) (15)
In order to see the effect of the normalization by the effective
coverage factor d, we denote by ^ rAIC , ^ rAICc and ^ rBIC the
naive estimators, which are obtained by simply substituting the
log-sequence-likelihood and the size of the data by the log-pseudo-
likelihood and the number of (k+1)-tuples in the NGS read data
without normalization byd.
3 RESULTS
3.1 Simulation studies for the validation of the
theoretical results
For the simulation studies we first generate MCs of different orders.
For realistic parameter values, the transition probability matrices
of the MCs are based on real cis-regulatory module (CRM) DNA
sequences in mouse forebrain from Blow et al. (2010). We start
the sequence from the stationary distribution. In addition to the first
order MC model described above, we also simulate a second order
MC. Table S1 shows the transition probability of a) the first and b)
the second order MC matrices we used in the simulations.
We generate MCs with length of G = 10
5
and 2 10
5
. We
simulate NGS data by sampling a varying number of reads of
different lengths from the MC as follows. We use the Lander-
Waterman model for physical mapping (Lander and Waterman,
1988) to sample NGS reads homogeneously with read length =
100; 200; 300; 400; 500 and the number of readsM = 500; 1000.
The coverage c is calculated as c = M=G and the effective
coverage based on the first k + 1 positions is d = c
eff
+
1, where c
eff
= M( k + 1)=(G k + 1). Under each
combination of (G;M;), we calculate the values ofZ
R
w
for each
word w and S
R
k
based on the NGS read data. Then we repeat
the processes 2000 times to obtain the empirical distribution of
Z
R
w
and S
R
k
. Finally, we compare the mean and variance of S
R
k
with their corresponding theoretical approximations and test the
fit of the data to the theoretical approximate distribution using the
Kolmogorov-Smirnov (KS) test.
With the current NGS technologies, the reads are generally
not homogeneously sampled from the genome. In order to
see the effects of inhomogeneous sampling on the approximate
distributions of S
R
k
and Z
R
w
, similarly as in Song et al. (2013)
we implement the following simulation. We divide the long
sequences into 100 blocks, b1;:::;bt;:::;b100. For each block
bt, the sampling probability i for each position in this block is
(a) The transition probability matrix of the first order MC
A C G T
A 0.39 0.15 0.21 0.25
C 0.31 0.19 0.12 0.38
G 0.31 0.23 0.20 0.26
T 0.26 0.18 0.21 0.35
(b) The transition probability matrix of the second order MC
A C G T
AA 0.39 0.19 0.19 0.24
AC 0.25 0.26 0.21 0.27
AG 0.38 0.17 0.29 0.15
AT 0.30 0.24 0.22 0.24
CA 0.46 0.20 0.15 0.18
CC 0.26 0.40 0.11 0.22
CG 0.23 0.16 0.14 0.48
CT 0.29 0.20 0.15 0.37
GA 0.26 0.25 0.23 0.26
GC 0.25 0.17 0.33 0.25
GG 0.26 0.30 0.28 0.17
GT 0.29 0.12 0.22 0.37
TA 0.19 0.19 0.21 0.41
TC 0.12 0.29 0.24 0.35
TG 0.27 0.17 0.25 0.32
TT 0.11 0.18 0.24 0.47
Table S1. The transition probability matrices of a) the first and b) the second
order Markov chain in our simulation studies.
proportional to a random number which is drawn independently
from the gamma distribution (1; 20) (one number per block).
It has been observed that the probability that a fragment is
sequenced in NGS depends on its nucleotide content. Empirical
studies showed that the dependency on GC content is unimodal
and we use the empirical unimodal distribution on GC curve shown
in the software developed by Benjamini and Speed (2012) as an
example to generate the reads. The other parameters are the same as
before.
Sequencing error is another factor that reduces the data quality.
Currently, Illumina sequencing has an error rate about 0.1% and
454 sequencing has an error rate about 1% (Glenn, 2011). In order
to see the effect of sequencing errors on the distribution ofS
R
k
and
Z
R
w
, in each position of a read, we randomly replace the letter in that
position with one of the other three letters with equal probability
0.005; the different letter is drawn with equal probability. The
remaining simulation steps stay the same as before.
Next we evaluate the proposed estimators of the order of MCs,
^ rS
k
, ^ rp
k
, ^ r
h
, ^ rZ
k
and ^ rPS and the estimator for the effective
coverage d based on equation (7) in the main text, respectively.
For estimating the effective coveraged by equation (7) in the main
text, we using 3-tuples for a first order MC, and 4-tuples for a
second order MC. For evaluating the proposed estimators, we let
k range from 2 to 6, i.e. we choose the model from 1st, 2nd, ,
5th order MC. The performance of the estimator is measure by
the precision rate, i.e. the percentage of times in 1000 repeats the
estimator gives the true order. In order to see the effect of genome
length and sequencing depth on the estimators, we take the genome
4
InferenceofMarkovianPropertiesofSequencesandApplications
length 5000, 7500, 10000, 20000, 50000, while fixing the read
coverage to be 1; we take the read coverage 0.05, 0.1, 0.25, 0.5,
0.75, 1, while fixing the genome length to be 10
5
. We also set the
sequencing error rate at 10%, which is relatively high compared to
the true sequencing error rate in real sequencing, as to distinguish
clearly between the estimators with respect to their robustness to
sequencing errors. For the estimator ^ r
h
, we set = 0:05.
When the sequence length is short or the coverage is low, it is
possible that the numbers of occurrences of somek-words are close
to zero and the expected numbers of occurrences are very low; the
estimated expected number of occurrences may turn out to be 0 for
some k-words. In order to overcome the issue, we add a pseudo
count of 1 to the number of occurrence of all words, which is a
common procedure to avoid division by zero and is equivalent to
incorporating a flat prior during the parameter estimation in terms of
Bayesian statistics, see Narlikaretal. (2013); Strelioffetal. (2007).
3.1.1 SimulationresultsforTheorem1:Homogeneoussamplingof
the reads We use simulations to validate the theoretical results in
Theorem 1. In our simulation study, here we first generate sequences
with genome length of G = 1 10
5
and 2 10
5
bps under
the first order MC model as shown in Table S1(a). S
R
3
with their
corresponding theoretical approximations in Theorem 1 are given
in Table S2. It can be seen from the table that the approximate
mean and variance of S
R
3
are very close to their simulated values
except in the case of a large number (M = 1000) of very short
reads ( = 100) in a small genome (G = 10
5
bps).
(a) Genome lengthG = 1 10
5
M = 500 M = 1000
^ mean mean ^ var var p-value ^ mean mean ^ var var p-value
100 53.4 54.0 160.0 162.0 0.06 70.8 72.0 280.0 288.0 0.002
200 71.6 72.0 277.4 288.0 0.99 106.9 108.0 660.5 648.0 0.15
300 89.8 90.0 482.9 450.0 0.39 143.9 144.0 1139.5 1152.0 0.15
400 107.6 108.0 655.0 648.0 0.84 179.8 180.0 1901.3 1800.0 0.20
500 126.0 126.0 905.2 882.0 0.86 216.6 216.0 2675.3 2592.0 0.26
(b) Genome lengthG = 2 10
5
M = 500 M = 1000
^ mean mean ^ var var p-value ^ mean mean ^ var var p-value
100 44.6 45.0 110.9 112.5 0.12 53.7 54.0 142.7 162.0 0.61
200 53.5 54.0 159.6 162.0 0.13 71.8 72.0 278.4 288.0 0.64
300 62.7 63.0 222.5 220.5 0.95 89.0 90.0 440.1 450.0 0.41
400 71.7 72.0 286.0 288.0 0.51 107.4 108.0 645.7 648.0 0.26
500 80.9 81.0 357.5 364.5 0.69 124.7 126.0 821.3 882.0 0.99
Table S2. Comparison of mean and variance of S
R
3
with their
corresponding theoretical approximations under a first order MC model and
the fit of the data to the theoretical approximate distribution using the KS test.
The simulation process was repeated 2000 times for each combination of
(G;M;). The columns ^ mean and ^ var are the simulated mean and variance;
the columns mean and var are the theoretical mean and variance.
Moreover, Figure S1(a) shows typical Q-Q (Quantile-Quantile)
plots of S
R
3
=d versus the distribution of
2
36
, where the subscript
36 indicates the degree of freedom of the
2
distribution, under
different models of sampling reads with/without sequencing errors.
Similarly, Figure S2(a) show the Q-Q plots of Z
R
ACT
=
p
d versus
the standard normal distribution under different scenarios; here,
20 30 40 50 60 70
20 30 40 50 60 70
Homogeneous Sampling, without Error
Sample distribution
Theoretical distribution
(a) homogeneous, without error
20 30 40 50 60 70
20 30 40 50 60
(b):Inhomogeneous Sampling, without Error
Sample distribution
Theoretical distribution (b) inhomogeneous, without error
20 30 40 50 60 70
20 30 40 50 60 70
inhomogeneous sampling based on GC content in fragments, without error
Sample distribution
Theoretical distribution (c) inhomogeneous on GC content
20 30 40 50 60 70
10 20 30 40 50 60 70
(c):Homogeneous Sampling, with Error
Sample distribution
Theoretical distribution
(d) homogeneous, with error
Fig. S1. Q-Q plots of the 2000S
R
3
=d scores v.s. 2000 scores sampled from
a
2
36
distribution; (G;;M) = (10
5
; 200; 1000). a): homogeneous
sampling without error, b): inhomogeneous sampling without error, c):
inhomogeneous sampling with sampling rate depending on GC content, d):
homogeneous sampling with error.
the word under consideration is w = ACT . The theoretical
approximations work well in this situation.
Simulation studies under the higher order MC model and on
genomes inserted with repeated regions are carried out in a similar
fashion (data not shown). The same conclusion that the theoretical
approximate distributions of S
R
k
and Z
R
w
fit their simulated
distributions well holds.
3.1.2 SimulationresultsforTheorem1:Theeffectofinhomogeneous
sampling and of sequencing error We use Q-Q plots to show the
effects of inhomogeneous sampling and sequencing errors on the
distribution of S
R
k
and Z
R
w
for (G;;M) = (10
5
; 200; 1000).
Figure S1 (a, b, c, d) shows the Q-Q plots of the 2000S
R
3
=d scores
from (a) homogeneous sampling, (b) inhomogeneous sampling
with rate not depending on GC content, (c) inhomogeneous
sampling with rate depending on GC content, and (d) homogeneous
sampling with sequencing errors v.s. 2000 scores sampled from
2
36
distribution. The factor d is c
eff
+ 1 in homogeneous sampling;
and d is calculated from the exact distribution of the sampled
reads along the sequence in the inhomogeneous sampling situation.
Figure S2 gives the Q-Q plots forZ
R
ACT
=
p
d, showing the effect of
inhomogeneous sampling and sequencing error on the distribution
of Z
R
ACT
=
p
d. All Q-Q plots show a satisfactory fit, confirming
that the theoretical results from Theorem 1 hold even when the
assumptions are not necessarily satisfied.
3.1.3 Simulation results on estimating the order of MCs based
on simulated NGS reads Figure S3 shows the effects of sequence
5
Renetal.
−3 −2 −10 1 2 3
−3 −2 −10 1 2 3
(a):Homogeneous Sampling, without Error
Theoretical Quantiles
Sample Quantiles
(a) homogeneous, without error
−3 −2 −10 1 2 3
−3 −2 −10 1 2 3
(b):Inhomogeneous Sampling, without Error
Theoretical Quantiles
Sample Quantiles (b) inhomogeneous, without error
-3 -2 -1 0 1 2 3
-3 -2 -1 0 1 2 3 inhomogeneous sampling based on GC content in fragments, without error
Theoretical Quantiles
Sample Quantiles (c) inhomogeneous on GC content
−3 −2 −10 1 2 3
−3 −2 −10 1 2 3 4
(c):Homogeneous Sampling, with Error
Theoretical Quantiles
Sample Quantiles
(d) homogeneous, with error
Fig. S2. Q-Q plots of the 2000 Z
R
ACT
=
p
d scores of the word ACT
v.s. the 2000 scores sampled from the standard normal distribution;
(G;;M) = (10
5
; 200; 1000). a): homogeneous sampling without error,
b): inhomogeneous sampling without error, c): inhomogeneous sampling
with sampling rate depending on GC content, d): homogeneous sampling
with error.
length and read coverage on the precision of the estimators under
a first and second order MCs, with homogeneous sampling and
sequencing error rate 10%. It can be seen that all the five estimators
perform reasonably well. In particular, the precision rate of
estimators ^ rS
k
, ^ rp
k
, ^ rZ
k
and ^ rPS reach 100% when the genome
length is larger than 20000 bps and the read coverage is greater than
0.2. The estimator ^ r
h
performs slightly worse than the other four
estimators. Since the estimator ^ r
h
is based on hypothesis testing
with a given significant level, the precision rate is not able to reach
100%. It is also possible that nok from 2 to 6 satisfiesp
k1
<
and bothp
k
andp
k+1
are larger than such that the estimator ^ r
h
fails to give an estimation. We observe that the precision of ^ r
h
is
sensitive to the accuracy of the estimation of the effective coverage
d. In the simulation, if we take the underlying true value of d in
place of the estimated value of
^
d in the computation, the precision
rate of ^ r
h
goes up to above 90%.
For inhomogeneous sampling, Figure S4 shows the effects
of sequence length and read coverage on the precision of
the estimators. We can see that the precision rates under the
inhomogeneous sampling start at slightly lower values than those
under the homogeneous case. As the genome length and the read
coverage increase, the estimators perform as well as they perform
under the homogeneous sampling.
Figure S5 and S6 show the disappointing precision rates of the
AIC and BIC based estimators of the order of a MC under a
first and a second order MC, with homogeneous (Figure S5) and
inhomogeneous sampling (Figure S6). The sequencing error rate
● ● ● ● ● ●
10000 20000 30000 40000 50000
0 20 40 60 80 100
Genome Length
Precision(%)
● r s k
r p k
r h
r Z k
r PS
(c) MC1, Homogeneous Sampling, Coverage=1, Read Length=100
(a) a first order MC,c = 1, homogeneous
● ● ● ● ● ●
0.2 0.4 0.6 0.8 1.0
0 20 40 60 80 100
Coverage
Precision(%)
● rsk
rpk
rh
rZk
rPS
(d) MC1, Homogeneous Sampling, Genome Length=1e+05, Read Length=100
(b) a first order MC,G = 10
5
, homogeneous
●
●
● ● ● ●
10000 20000 30000 40000 50000
0 20 40 60 80 100
Genome Length
Precision(%)
● r s k
r p k
r h
r Z k
r PS
(c) MC2, Homogeneous Sampling, Coverage=1, Read Length=100
(c) a second order MC,c = 1, homogeneous
●
● ● ● ● ●
0.2 0.4 0.6 0.8 1.0
0 20 40 60 80 100
Coverage
Precision(%)
● rsk
rpk
rh
rZk
rPS
(d) MC2, Homogeneous Sampling, Genome Length=1e+05, Read Length=100
(d) a second order MC,G = 10
5
, homogeneous
Fig. S3. The precision rates of the estimators for the order of a MC under
a first and a second order MC, with homogeneous sampling and sequencing
error rate 10%. (a,b): The effects of genome length and read coverage to the
precision rates under a first order MC. (c,d): The effects of genome length
and read coverage on the precision rates under a second order MC.
6
InferenceofMarkovianPropertiesofSequencesandApplications
● ● ● ● ● ●
10000 20000 30000 40000 50000
0 20 40 60 80 100
Genome Length
Precision(%)
● r s k
r p k
r h
r Z k
r PS
(c) MC1, Inhomogeneous Sampling, Coverage=1, Read Length=100
(a) a first order MC,c = 1, inhomogeneous
● ● ● ● ● ●
0.2 0.4 0.6 0.8 1.0
0 20 40 60 80 100
Coverage
Precision(%)
● rsk
rpk
rh
rZk
rPS
(d) MC1, Inhomogeneous Sampling, Genome Length=1e+05, Read Length=100
(b) a first order MC,G = 10
5
, inhomogeneous
●
●
●
● ● ●
10000 20000 30000 40000 50000
0 20 40 60 80 100
Genome Length
Precision(%)
● r s k
r p k
r h
r Z k
r PS
(c) MC2, Inhomogeneous Sampling, Coverage=1, Read Length=100
(c) a secomd order MC,c = 1, inhomogeneous
●
● ● ● ● ●
0.2 0.4 0.6 0.8 1.0
0 20 40 60 80 100
Coverage
Precision(%)
● rsk
rpk
rh
rZk
rPS
(d) MC2, Inhomogeneous Sampling, Genome Length=1e+05, Read Length=100
(d) a second order MC,G = 10
5
, inhomogeneous
Fig. S4. The precision rates of the estimators for the order of a MC under a
first and a second order MC, with inhomogeneous sampling and sequencing
error rate 10%. (a,b): The effects of genome length and read coverage to the
precision rates under a first order MC. (c,d): The effects of genome length
and read coverage on the precision rates under a second order MC.
is 10% and read length is 100. It can be seen that the normalized
estimators ^ r
AIC
R, ^ r
AICc
R and ^ r
BIC
R (in solid lines) have better
performance than the their corresponding naive estimators ^ rAIC ,
^ rAICc and ^ rBIC (in dotted lines). The BIC based estimator, ^ r
BIC
R,
has generally the best performance among all the AIC and BIC
based statistics. However, the precision of ^ r
BIC
R is low compared
to the five estimators ^ rS
k
, ^ rp
k
, ^ r
h
, ^ rZ
k
, and ^ rPS . With the increase
of the genome length and read coverage, although ^ r
BIC
R is able
to reach a very high precision rate at some point, the precision
finally drops with further increase of the genome length and the read
coverage, and it fails to give a consistent estimation. The AIC and
AICc based estimators, ^ rAIC and ^ rAICc, do not differ much in the
precision rate.
3.1.4 Simulation results on estimating the effective coverage d
For inhomogeneous sampling of the reads, it is not clear how we
estimate the parameter d in Theorem 1 based purely on the naive
read coverage c. In equation (7) in the main text, we propose a
method to estimate d based on the statistics (Z
R
w
)
2
;w 2 A
k
.
We assess its accuracy using the average relative error, defined
by MRE =
1
T
P
T
i
j
^
d
i
dj
d
, where
^
di is the estimated value of
d in the i-th simulation out of the total T repeats, and by the
root-mean-relative-squared-error (RMRSR) defined as RMRSR =
1
d
q
P
T
i
(
^
did)
2
=T:
The values of d for the four combinations of (;M) and the
average of their estimations are given in the first and second rows
of Table S3. The MRE and RMRSR for the different combinations
of (;M) are given in the third and fourth rows of Table S3.
Table S3 shows the results for (a) the homogeneous and (b)
the inhomogeneous sampling of the reads using k = 3. In
general, although the estimation of the effective coveraged becomes
slightly inaccurate as the read coverage increases, the estimation is
reasonably good. Under the inhomogeneous sampling, the relative
errors are slightly higher than those under the homogeneous case,
which reflects the greater randomness the inhomogeneous sampling
brings into the model; while the MRE error seems unaffected by
the choice of d, the RMRSR shows a tendency to increase with
increasingd.
We only consider (;M) = (100; 2000) for inhomogeneous
sampling with sampling rate depending on GC content as before.
In this case, the value ofd is 2.08. The mean value of estimatedd
is 2.16, the MRE is 0.273, and the RMRSR is 0.358 indicating that
the value ofd can be accurately estimated.
(;M) = (100; 1000) (100; 2000) (500; 1000) (500; 2000)
(a) homogeneous sampling of reads
d 2 3 6 11
d 2.05 3.08 6.16 11.35
MRE 0.278 0.283 0.281 0.279
RMRSR 0.328 0.328 0.339 0.338
(b) inhomogeneous sampling of reads
d 2.80 4.60 9.56 18.15
d 2.85 4.92 10.03 18.86
MRE 0.283 0.281 0.281 0.292
RMRSR 0.321 0.348 0.353 0.359
Table S3. The estimation of the effective coveraged for a) the homogeneous
and b) inhomogeneous sampling of reads withG = 10
5
bps, = 100; 500
bps, andM = 1000; 2000;
d =
P
T
i=1
^
d
i
=T ;T =2000.
7
Renetal.
● ● ● ● ● ●
10000 20000 30000 40000 50000
0 20 40 60 80 100
Genome Length
Precision(%)
●
●
● ● ● ●
●
●
r AIC
r AIC R
r AICc
r AICc R
r BIC
r BIC R
(a) MC1, Homogeneous Sampling, Coverage=1, Read Length=100
(a) a first order MC,c = 1, homogeneous
● ● ● ● ● ●
0.2 0.4 0.6 0.8 1.0
0 20 40 60 80 100
Coverage
Precision(%)
●
● ● ● ● ●
●
●
rAIC
rAIC R
rAICc
rAICc R
rBIC
rBIC R
(b) MC1, Homogeneous Sampling, Genome Length=1e+05, Read Length=100
(b) a first order MC,G = 10
5
, homogeneous
● ● ●
● ● ●
10000 20000 30000 40000 50000
0 20 40 60 80 100
Genome Length
Precision(%)
● ● ●
●
●
●
●
●
r AIC
r AIC R
r AICc
r AICc R
r BIC
r BIC R
(c) MC2, Homogeneous Sampling, Coverage=1, Read Length=100
(c) a second order MC,c = 1, homogeneous
● ●
● ● ● ●
0.2 0.4 0.6 0.8 1.0
0 20 40 60 80 100
Coverage
Precision(%)
● ●
● ● ● ●
●
●
rAIC
rAIC R
rAICc
rAICc R
rBIC
rBIC R
(d) MC2, Homogeneous Sampling, Genome Length=1e+05, Read Length=100
(d) a second order MC,G = 10
5
, homogeneous
Fig. S5. The precision rates of the AIC and BIC based estimators for the
order of MC under a first and a second order MC, with homogeneous
sampling and sequencing error rate 10%. (a,b): The effects of genome length
and read coverage to the precision rates under a first order MC. (c,d): The
effects of genome length and read coverage on the precision rates under a
second order MC.
● ● ● ● ● ●
10000 20000 30000 40000 50000
0 20 40 60 80 100
Genome Length
Precision(%)
●
●
● ● ● ●
●
●
r AIC
r AIC R
r AICc
r AICc R
r BIC
r BIC R
(a) MC1, Inhomogeneous Sampling, Coverage=1, Read Length=100
(a) a first order MC,c = 1, inhomogeneous
● ● ● ● ● ●
0.2 0.4 0.6 0.8 1.0
0 20 40 60 80 100
Coverage
Precision(%)
●
● ● ● ● ●
●
●
rAIC
rAIC R
rAICc
rAICc R
rBIC
rBIC R
(b) MC1, Inhomogeneous Sampling, Genome Length=1e+05, Read Length=100
(b) a first order MC,G = 10
5
, inhomogeneous
● ●
●
● ● ●
10000 20000 30000 40000 50000
0 20 40 60 80 100
Genome Length
Precision(%)
● ● ●
●
●
●
●
●
r AIC
r AIC R
r AICc
r AICc R
r BIC
r BIC R
(c) MC2, Inhomogeneous Sampling, Coverage=1, Read Length=100
(c) a second order MC,c = 1, inhomogeneous
● ●
● ● ● ●
0.2 0.4 0.6 0.8 1.0
0 20 40 60 80 100
Coverage
Precision(%)
● ●
● ● ● ●
●
●
rAIC
rAIC R
rAICc
rAICc R
rBIC
rBIC R
(d) MC2, Inhomogeneous Sampling, Genome Length=1e+05, Read Length=100
(d) a second order MC,G = 10
5
, inhomogeneous
Fig. S6. The precision rates of the AIC and BIC based estimators for the
order of MC under a first and a second order MC, with inhomogeneous
sampling and sequencing error rate 10%. (a,b): The effects of genome length
and read coverage to the precision rates under a first order MC. (c,d): The
effects of genome length and read coverage on the precision rates under a
second order MC.
8
InferenceofMarkovianPropertiesofSequencesandApplications
3.2 The relationship among 28 vertebrate species
Table S4 shows the estimated orders of MC for the 28 genomes
of vertebrate species based on their NGS samples. For each of
the 28 species, we compute the fraction of the k-tuple words that
have (Z
R
w
)
2
=
^
d within the 99% of a
2
distribution with one degree
of freedom, for k = 8; 9;:::; 14. Using 80% as a threshold,
we estimate the orders of MC for each species to be the smallest
k 2 under which the fraction of words that can be explained
by the (k 2)-th order MC is greater than the threshold. The
last two columns of Table S4 are the AIC-predicted optimal and
BIC-predicted optimal orders obtained in Narlikaretal. (2013).
We cluster the NGS datasets of the 28 species listed in the Table
S4, usingd
2
andd
S
2
under MC models of varying order. Figure S1
in Miller et al. (2007) gives the phylogenetic tree of the 28 species
based on alignment methods, with branch lengths noted on it. We
obtain the distance matrix of the 28 species by computing pairwise
disstance between each pair of the 28 species from the phylogenetic
tree; the matrix is given as Table S5. We compare the dissimilarity
matrices usingd
2
andd
S
2
under MC models with different order to
the matrix Table S5 which we use as underlying truth.
3.3 The relationship among 13 tropical tree species
with unknown reference genomes
We also apply our method to the 13 tree species (8 Fagaceae and 5
Moraceae) based on the NGS shotgun read data sets in Cannonetal.
(2010). The reference genome sequences for the 13 tree species are
unknown. Figure 1 in the main text shows the clustering results
usingd
2
under MCs of order 0, 4, 8 and 9. For the clustering results
usingd
S
2
, see Figure S7.
REFERENCES
Benjamini, Y . and Speed, T. (2012). Summarizing and correcting the gc content bias in
high-throughput sequencing. NucleicAcidsResearch, 40(10), e72.
Blow, M. J., McCulley, D. J., Li, Z., Zhang, T., Akiyama, J. A., Holt, A., Plajzer-
Frick, I., Shoukry, M., Wright, C., Chen, F.,etal. (2010). Chip-seq identification of
weakly conserved heart enhancers. NatureGenetics, 42(9), 806–810.
Cannon, C. H., Kua, C. S., Zhang, D., and Harting, J. (2010). Assembly free
comparative genomics of short-read sequence data discovers the needles in the
haystack. MolecularEcology, 19(Suppl. 1), 146–160.
Glenn, T. C. (2011). Field guide to next-generation DNA sequencers. Molecular
EcologyResources, 11(5), 759–769.
Hurvich, C. M. and Tsai, C. L. (1989). Regression and time series model selection in
small samples. Biometrika, 76(2), 297–307.
Katz, R. W. (1981). On some criteria for estimating the order of a Markov chain.
Technometrics, 23(3), 243–249.
Lander, E. S. and Waterman, M. S. (1988). Genomic mapping by fingerprinting random
clones: a mathematical analysis. Genomics, 2(3), 231–239.
Miller, W., Rosenbloom, K., Hardison, R., Hou, M., Taylor, J., Raney, B., Burhans, R.,
King, D., Baertsch, R., Blankenberg, D.,etal. (2007). 28-way vertebrate alignment
and conservation track in the UCSC genome browser. Genome Research, 17(12),
1797–1808.
Morvai, G. and Weiss, B. (2005). Order estimation of Markov chains. Information
Theory,IEEETransactionson, 51(4), 1496–1497.
Narlikar, L., Mehta, N., Galande, S., and Arjunwadkar, M. (2013). One size does
not fit all: On how markov model order dictates performance of genomic sequence
analyses. NucleicAcidsResearch, 41(3), 1416–1424.
Peres, Y . and Shields, P. (2005). Two new Markov order estimators. arXiv preprint
math/0506080.
Reinert, G., Schbath, S., and Waterman, M. S. (2005). Statistics on words with
applications to biological sequences. Lothaire: Applied Combinatorics on Words,
J.BerstelandD.Perrin,eds., 105, 251–328.
Song, K., Ren, J., Zhai, Z., Liu, X., Deng, M., and Sun, F. (2013). Alignment-
free sequence comparison based on next-generation sequencing reads. Journal of
ComputationalBiology, 20(2), 64–79.
Strelioff, C. C., Crutchfield, J. P., and H¨ ubler, A. W. (2007). Inferring markov chains:
Bayesian estimation, model comparison, entropy rate, and out-of-class modeling.
PhysicalReviewE, 76(1), 011106.
Tong, H. (1975). Determination of the order of a Markov chain by Akaike’s information
criterion. JournalofAppliedProbability, 12, 488–497.
Zhao, L. C., Dorea, C. C. Y ., and Gonc ¸alves, C. R. (2001). On determination of the
order of a Markov chain. Statistical Inference for Stochastic Processes, 4(3), 273–
282.
9
Renetal.
word lengthk
genome common name 8 9 10 11 12 13 14 estimated order APO BPO
hg38 human 0.12 0.25 0.45 0.68 0.85 0.9 0.78 10 12 10
panTro4 chimp 0.14 0.28 0.49 0.71 0.87 0.91 0.78 10 12 10
rheMac3 macaque 0.16 0.32 0.55 0.76 0.9 0.92 0.78 10 12 10
otoGar3 bushbaby 0.18 0.37 0.58 0.78 0.9 0.92 0.77 10 na na
tupBel1 tree shrew 0.17 0.34 0.57 0.78 0.9 0.9 0.69 10 na na
rn5 rat 0.17 0.33 0.55 0.75 0.88 0.92 0.79 10 12 10
mm10 mouse 0.16 0.29 0.49 0.69 0.84 0.89 0.76 10 12 10
cavPor3 guinea pig 0.17 0.32 0.52 0.72 0.86 0.9 0.77 10 13 10
oryCun2 rabbit 0.12 0.25 0.45 0.7 0.87 0.91 0.78 10 na na
sorAra2 shrew 0.13 0.27 0.47 0.71 0.87 0.91 0.76 10 na na
eriEur2 hedgehog 0.12 0.24 0.44 0.66 0.83 0.86 0.70 10 na na
canFam3 dog 0.15 0.32 0.56 0.79 0.91 0.93 0.77 10 12 10
felCat5 cat 0.16 0.33 0.57 0.79 0.92 0.94 0.81 10 12 10
equCab2 horse 0.17 0.34 0.56 0.78 0.9 0.93 0.81 10 12 10
bosTau7 cow 0.12 0.24 0.43 0.67 0.84 0.89 0.76 10 12 10
dasNov3 armadillo 0.12 0.24 0.41 0.63 0.81 0.88 0.78 10 na na
loxAfr3 elephant 0.12 0.21 0.37 0.59 0.78 0.88 0.79 11 na na
echTel2 tenrec 0.16 0.3 0.49 0.7 0.86 0.92 0.83 10 na na
monDom5 opossum 0.14 0.26 0.44 0.64 0.81 0.86 0.73 10 13 11
ornAna1 platypus 0.13 0.29 0.51 0.75 0.89 0.91 0.69 10 12 10
galGal4 chicken 0.4 0.67 0.83 0.94 0.97 0.89 0.64 8 11 8
anoCar2 lizard 0.15 0.25 0.42 0.63 0.81 0.86 0.69 10 12 10
xenTro3 frog 0.17 0.3 0.49 0.71 0.86 0.88 0.67 10 12 10
tetNig2 tetraodon 0.54 0.78 0.91 0.97 0.98 0.82 0.40 8 na na
fr3 fugu 0.57 0.8 0.92 0.97 0.98 0.83 0.43 7 11 8
gasAcu1 stickleback 0.48 0.7 0.85 0.94 0.97 0.9 0.54 8 11 8
oryLat2 medaka 0.34 0.5 0.69 0.85 0.94 0.88 0.58 9 na na
danRer7 zebrafish 0.17 0.27 0.42 0.62 0.81 0.88 0.68 10 13 10
Table S4. Estimating the order of the MC for 28 species based on NGS read samples. The ’genome’ column is the scientific names shown in UCSC; ’estimated order’ is the minimum order of MCr
that can explain greater than 80% of the (r + 2)-tuple word at type I error 0.01; ’APO’ and ’BPO’ columns show the AIC-predicted optimal and BIC-predicted optimal orders obtained in Narlikar et al.
(2013); the middle columns with orderk = 8; 9;:::; 14 are the fraction of wordsw with lengthk that have (Z
R
w
)
2
=
^
d within the 99% of a
2
distribution with one degree of freedom. ’na’ means the
numbers are not provided in Narlikaretal. (2013).
10
InferenceofMarkovianPropertiesofSequencesandApplications
human chimp macaque bushbaby tree shrew rat mouse guinea pig rabbit shrew hedgehog dog cat horse cow armadillo elephant tenrec opossum platypus chicken lizard frog tetraodon fugu stickleback medaka zebrafish
human 0.00 0.02 0.07 0.26 0.29 0.46 0.46 0.38 0.36 0.48 0.45 0.35 0.35 0.30 0.35 0.33 0.35 0.48 0.72 0.98 1.10 1.21 1.54 1.85 1.89 1.82 2.01 1.83
chimp 0.00 0.07 0.26 0.29 0.46 0.46 0.38 0.36 0.48 0.45 0.35 0.35 0.30 0.35 0.33 0.35 0.48 0.72 0.98 1.10 1.21 1.54 1.85 1.89 1.82 2.01 1.83
macaque 0.00 0.25 0.28 0.45 0.45 0.37 0.35 0.47 0.44 0.34 0.34 0.29 0.34 0.32 0.34 0.47 0.71 0.97 1.09 1.20 1.53 1.84 1.88 1.81 2.00 1.82
bushbaby 0.00 0.33 0.50 0.50 0.42 0.40 0.52 0.49 0.39 0.39 0.34 0.39 0.37 0.39 0.52 0.76 1.02 1.14 1.25 1.58 1.89 1.93 1.86 2.05 1.87
tree shrew 0.00 0.49 0.49 0.41 0.39 0.51 0.48 0.38 0.38 0.33 0.38 0.36 0.38 0.51 0.75 1.01 1.13 1.24 1.57 1.88 1.92 1.85 2.04 1.86
rat 0.00 0.16 0.48 0.52 0.66 0.63 0.53 0.53 0.48 0.53 0.51 0.53 0.66 0.90 1.16 1.28 1.39 1.72 2.03 2.07 2.00 2.19 2.01
mouse 0.00 0.48 0.52 0.66 0.63 0.53 0.53 0.48 0.53 0.51 0.53 0.66 0.90 1.16 1.28 1.39 1.72 2.03 2.07 2.00 2.19 2.01
guinea pig 0.00 0.44 0.58 0.55 0.45 0.45 0.40 0.45 0.43 0.45 0.58 0.82 1.08 1.20 1.31 1.64 1.95 1.99 1.92 2.11 1.93
rabbit 0.00 0.56 0.53 0.43 0.43 0.38 0.43 0.41 0.43 0.56 0.80 1.06 1.18 1.29 1.62 1.93 1.97 1.90 2.09 1.91
shrew 0.00 0.47 0.47 0.47 0.42 0.47 0.49 0.51 0.64 0.88 1.14 1.26 1.37 1.70 2.01 2.05 1.98 2.17 1.99
hedgehog 0.00 0.44 0.44 0.39 0.44 0.46 0.48 0.61 0.85 1.11 1.23 1.34 1.67 1.98 2.02 1.95 2.14 1.96
dog 0.00 0.20 0.25 0.32 0.36 0.38 0.51 0.75 1.01 1.13 1.24 1.57 1.88 1.92 1.85 2.04 1.86
cat 0.00 0.25 0.32 0.36 0.38 0.51 0.75 1.01 1.13 1.24 1.57 1.88 1.92 1.85 2.04 1.86
horse 0.00 0.27 0.31 0.33 0.46 0.70 0.96 1.08 1.19 1.52 1.83 1.87 1.80 1.99 1.81
cow 0.00 0.36 0.38 0.51 0.75 1.01 1.13 1.24 1.57 1.88 1.92 1.85 2.04 1.86
armadillo 0.00 0.28 0.41 0.67 0.93 1.05 1.16 1.49 1.80 1.84 1.77 1.96 1.78
elephant 0.00 0.33 0.69 0.95 1.07 1.18 1.51 1.82 1.86 1.79 1.98 1.80
tenrec 0.00 0.82 1.08 1.20 1.31 1.64 1.95 1.99 1.92 2.11 1.93
opossum 0.00 0.90 1.02 1.13 1.46 1.77 1.81 1.74 1.93 1.75
platypus 0.00 1.10 1.21 1.54 1.85 1.89 1.82 2.01 1.83
chicken 0.00 0.91 1.42 1.73 1.77 1.70 1.89 1.71
lizard 0.00 1.53 1.84 1.88 1.81 2.00 1.82
frog 0.00 1.87 1.91 1.84 2.03 1.85
tetraodon 0.00 0.44 0.77 0.96 1.48
fugu 0.00 0.81 1.00 1.52
stickleback 0.00 0.81 1.45
medaka 0.00 1.64
zebrafish 0.00
Table S5. The pairwise distance matrix obtain from the Figure S1 in Milleretal. (2007). Figure S1 in Milleretal. (2007) shows the phylogenetic tree of the 28 species based on alignment methods with
branch lengths noted on it.
11
Renetal.
(a) k11, order0,d
S
2
(b) k11, order4,d
S
2
(c) k11, order8,d
S
2
(d) k11, order9,d
S
2
Fig. S7. The clustering of the 13 tropical tree species usingd
S
2
under MC
with order of 0(i.i.d), 4, 8 and 9. The number on the branch refers to the
frequency of the branch occurring among the 30 clusterings based on random
sampled 10% reads. The letter ‘F’ at the beginning of the names represents
Fagaceae; similarly the letter ‘M’ representsMaraceae.
12
Linked assets
Alignment-free sequence comparison methods and applications to comparative genomics
Conceptually similar
PDF
sup_chapter4
PDF
etd-RenJie-5618-3.pdf
PDF
Alignment-free sequence comparison methods and applications to comparative genomics [pdf]
PDF
etd-RenJie-5618-2.pdf
PDF
etd-RenJie-5618-4.pdf
PDF
Ahlgren_NAR_virus-host_suppmaterial_D2_JR_NA
PDF
FigureS1
PDF
Big data analytics in metagenomics: integration, representation, management, and visualization
PDF
Deep learning in metagenomics: from metagenomic contigs sorting to phage-bacterial association prediction
PDF
Constructing metagenome-assembled genomes and mobile genetic element host interactions using metagenomic Hi-C
PDF
Feature engineering and supervised learning on metagenomic sequence data
PDF
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
PDF
Applications and improvements of background adjusted alignment-free dissimilarity measures
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Statistical and computational approaches for analyzing metagenomic sequences with reproducibility and reliability
PDF
Developing statistical and algorithmic methods for shotgun metagenomics and time series analysis
PDF
Whole genome bisulfite sequencing: analytical methods and biological insights
PDF
Sharpening the edge of tools for microbial diversity analysis
PDF
Computational algorithms for studying human genetic variations -- structural variations and variable number tandem repeats
PDF
Application of machine learning methods in genomic data analysis
Asset Metadata
Tag
OAI-PMH Harvest
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC11256099
Unique identifier
UC11256099
Legacy Identifier
etd-RenJie-5618-sup_chapter2
Tags
alignment-free
comparative genomics
machine learning
Markov chain
metagenomics
next generation sequencing
virus-host interaction