The effects of dependence among sites in phylogeny reconstruction
THE EFFECTS OF DEPENDENCE AMONG SITES IN PHYLOGENY
RECONSTRUCTION
by
Yinsuo Feng
A Thesis Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(Statistics)
May 1995
UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007
This thesis, written by
Yinsuo Feng
under the direction of his Thesis Committee,
and approved by all its members, has been presented
to and accepted by the Dean of The
Graduate School, in partial fulfillment of the
requirements for the degree of
MASTER OF SCIENCE
Dean
Date: April 14, 1995
THESIS COMMITTEE
Chairman
Dedication
To my parents and my wife.
Acknowledgements
I would like to thank my academic advisor Dr. Simon Tavaré for his guidance on
this project. His patience, support, and effort have been invaluable.
I would also like to thank Dr. Michael Waterman and Dr. Ken Alexander for
lending their support on my thesis committee.
Contents
Dedication
Acknowledgements
1 Introduction
  1.1 Introduction
2 The Stochastic Model for Sequence Evolution
  2.1 The independent sites model
  2.2 Markov chain Monte Carlo (MCMC)
  2.3 A dependent sites model
3 The Simulation Method
  3.1 The Simulation Model
  3.2 The Simulation Method
4 Phylogenetic Inferences under the New Model
  4.1 The Simulation
  4.2 Simulation 1
  4.3 Simulation 2
  4.4 Simulation 3
  4.5 A Discussion of Ties
  4.6 Conclusion
Reference List
Chapter 1
Introduction
1.1 Introduction
A major objective of biological systematics is inferring phylogenies, or evolutionary
trees, from molecular data. A phylogeny is a graph showing the course of evolution
among a group of species. It is composed of nodes and branches, in which only
one branch connects any two adjacent nodes. The branches define the relationships
among the units in terms of descent and ancestry. Typically, the starting point for
inferring a phylogeny is a set of species, each of which has had representative members
measured for some observable characters, e.g. amino acid sequences of proteins and
nucleic acid sequences. Phylogenetic trees can be either rooted or unrooted. In a
rooted tree there exists a particular node, called the root, from which a unique path
leads to any other node. The length of each path corresponds to evolutionary time,
and the root is the common ancestor of all external nodes. We think of evolution
changing the characters down the rooted tree. An unrooted tree is a tree that only
specifies the relationships among the external nodes.
DNA sequences are made up of strings of letters from the 4-letter alphabet $\mathcal{A}$ =
{A, G, C, T}. We assume that our sequences are aligned, and that each is $s$ bases
(or sites) long. This thesis studies a class of stochastic processes that describes the
evolution of the present-day sequences from their common ancestor.
For a single site in the aligned sequences, the usual model of evolution along a
single branch in the tree is a Markov chain. The state space of the Markov chain is
$\mathcal{A}$, and the state at time $t$ down the branch is denoted by $X(t)$. $X(0)$ is the type at
the root of that branch. The homogeneous Markov assumption asserts that
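$$P_{ij}(t) = P(X(t+s) = j \mid X(s) = i)$$
is independent of $s \ge 0$ for $t \ge 0$. $P(t) = (P_{ij}(t))$ satisfies
(i) $P_{ij}(t) \ge 0$, $i, j \in \mathcal{A}$;
(ii) $\sum_{j \in \mathcal{A}} P_{ij}(t) = 1$, $i \in \mathcal{A}$, $t \ge 0$;
(iii) $P_{ik}(t+s) = \sum_{j \in \mathcal{A}} P_{ij}(t) P_{jk}(s)$, $t, s \ge 0$, $i, k \in \mathcal{A}$;
(iv) $\lim_{t \to 0+} P_{ij}(t) = \delta_{ij}$, where $\delta_{ij} = 1$ if $i = j$, and $= 0$ if $i \ne j$.
Let
$$Q = \lim_{h \to 0+} \frac{P(h) - I}{h}.$$
It follows that
$$P'(t) = P(t)Q = QP(t). \tag{1.1}$$
Solving (1.1) gives
$$P(t) = e^{Qt}, \quad t \ge 0.$$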
Let $\pi = (\pi_1, \ldots, \pi_4)$ be the stationary distribution of $Q$, satisfying
$$\pi P(t) = \pi, \quad \pi Q = 0.$$
It is convenient to think of the process evolving as follows: changes occur at the
points of a Poisson process of rate $\lambda$; if a mutation occurs, a site now labelled
$i$ changes to $j$ with probability $p_{ij}$. In this formulation, $Q = \lambda(P - I)$, where
$P = (p_{ij})$, and $\lambda$ and $P$ are not unique.
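The Poisson formulation suggests one way of evaluating $P(t) = e^{Qt}$ numerically: condition on the number of Poisson events in time $t$ and sum the correspondingly weighted powers of the jump matrix $P$. The following is a minimal sketch of that idea and is not part of the thesis; the function name `transition_matrix`, the truncation tolerance, and the example values of $\lambda$, $t$ and $P$ are assumptions made for illustration.

```python
import numpy as np

def transition_matrix(lam, P, t, tol=1e-12, max_terms=10_000):
    """Approximate exp(Qt) for Q = lam * (P - I) by the series
    sum_k Poisson(k; lam * t) * P^k, truncated when the unused
    Poisson mass drops below tol (tol and max_terms are illustrative)."""
    n = P.shape[0]
    weight = np.exp(-lam * t)   # Poisson probability of k = 0 events
    power = np.eye(n)           # P^0
    result = weight * power
    total = weight              # Poisson mass accumulated so far
    for k in range(1, max_terms + 1):
        if 1.0 - total < tol:
            break
        weight *= lam * t / k   # Poisson probability of k events
        power = power @ P       # P^k
        result += weight * power
        total += weight
    return result

# Hypothetical example: every change draws a new base from pi0.
pi0 = np.array([0.4, 0.3, 0.2, 0.1])
P = np.tile(pi0, (4, 1))
Pt = transition_matrix(lam=1.0, P=P, t=0.5)
```

For this particular $P$ all powers $P^k$, $k \ge 1$, equal $P$, so the result can be compared with the closed form $e^{-\lambda t} I + (1 - e^{-\lambda t}) P$.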
In this thesis we are particularly interested in 4-species trees. A rooted 4-species
binary tree has the following structure:
Figure 1 A rooted 4-species tree
The branch lengths are denoted by $t_1, \ldots, t_6$. Note that the tree is completely
specified by giving for each node $i$:
(i) The labels of the two offspring of node $i$ (defined as 0 if $i$ is an external node).
(ii) The distance from $i$ to its parent (defined as 0 for the ancestral node).
What is of interest is the probability of the observed set of letters at the nodes
1, 2, 3 and 4 for the unrooted tree corresponding to Figure 1. This can be computed
as follows. Assume a root at 7, as indicated in the figure. We assume that
evolution occurs independently along the branches separated by each node. Hence
the probability $P(i_1, i_2, i_3, i_4)$ of seeing letters $i_1$ in species 1, ..., $i_4$ in species 4 is
$$\sum_a \pi_a^0 \sum_{b,c} P_{ab}(t_5) P_{ac}(t_6) P_{b i_1}(t_1) P_{b i_2}(t_2) P_{c i_3}(t_3) P_{c i_4}(t_4) \tag{1.2}$$
where $\{\pi_a^0\}$ is the distribution of the root type at node 7. We want the probability
in (1.2) to be independent of the location of the root node 7.
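The sum in (1.2) factorizes over the two internal nodes, which makes it cheap to evaluate. The sketch below is an illustration rather than the thesis's own code: the function names are hypothetical, the letters are coded as integers 0 to 3, and the branch matrices come from an assumed simple model in which every change draws a new base from $\pi$, so that $P(t) = e^{-\lambda t} I + (1 - e^{-\lambda t})\Pi$ with all rows of $\Pi$ equal to $\pi$.

```python
import itertools
import numpy as np

def branch_matrix(pi, lam_t):
    """Transition matrix of the assumed simple model:
    exp(-lam*t) * I + (1 - exp(-lam*t)) * (matrix whose rows all equal pi)."""
    n = len(pi)
    decay = np.exp(-lam_t)
    return decay * np.eye(n) + (1.0 - decay) * np.tile(pi, (n, 1))

def pattern_probability(pattern, pi, P):
    """Evaluate (1.2). P = [P1, ..., P6] holds the matrices P(t_1), ..., P(t_6);
    pattern = (i1, i2, i3, i4) are the letters observed at species 1 to 4."""
    i1, i2, i3, i4 = pattern
    P1, P2, P3, P4, P5, P6 = P
    left = P5 @ (P1[:, i1] * P2[:, i2])    # sum over b, indexed by root type a
    right = P6 @ (P3[:, i3] * P4[:, i4])   # sum over c, indexed by root type a
    return float(np.sum(pi * left * right))  # sum over a, with pi^0 = pi

# Hypothetical base frequencies and branch lengths (in units of lam * t).
pi = np.array([0.4, 0.3, 0.2, 0.1])
P = [branch_matrix(pi, x) for x in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)]
total = sum(pattern_probability(p, pi, P)
            for p in itertools.product(range(4), repeat=4))
```

Because this assumed model is reversible with root distribution $\pi$, moving the root along the internal branch (changing $t_5$ and $t_6$ while keeping $t_5 + t_6$ fixed) should leave each pattern probability unchanged.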
If we assume that $\pi_a^0 = \pi_a$ for all $a$, and that $X(\cdot)$ is reversible, i.e.
$$\pi_i q_{ij} = \pi_j q_{ji},$$
then
$$\pi_i P_{ij}(t) = \pi_j P_{ji}(t)$$
for all $i, j$; $t \ge 0$, and equation (1.2) can be written
[equation illegible in the source; the text resumes at equation (2.1) of Section 2.1]
where $\otimes$ denotes the Kronecker product of matrices: $A \otimes B = (a_{ij} B)$.
Sometimes the stochastic processes $\{X_l(\cdot)\}$ are specified in a different way. We
may be given a rate $\lambda_l$ at which events occur in the $l$th process, and stochastic
matrices $P^{(l)}$ that determine the probabilities of moves when the events occur. This
corresponds to making ‘changes’ occur at the points of a Poisson process of rate $\lambda_l$.
The matrices $P^{(l)}$ need not have 0 elements down their diagonal. These models are
continuous-time Markov chains, with
$$Q^{(l)} = \lambda_l (P^{(l)} - I), \quad l = 1, \ldots, s. \tag{2.2}$$
Conversely, for a given rate matrix $Q^{(l)}$, we can construct $\lambda_l$ and $P^{(l)}$ by
uniformization (cf. Keilson, 1979). Let $\lambda = \max\{q_i^{(l)}, i \in \mathcal{A}\}$, where
$q_i^{(l)} = -q_{ii}^{(l)}$, and choose any $\lambda_l \ge \lambda$.
Then we can set
$$P^{(l)} = I + \lambda_l^{-1} Q^{(l)}.$$
Notice that this prescription for $P^{(l)}$ and $\lambda_l$ is not unique. For models specified by
(2.2), we see that (2.1) becomes
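$$\begin{aligned}
Q &= \sum_{l=1}^{s} I \otimes \cdots \otimes \lambda_l (P^{(l)} - I) \otimes \cdots \otimes I \\
&= \sum_{l=1}^{s} \lambda_l \, I \otimes \cdots \otimes (P^{(l)} - I) \otimes \cdots \otimes I \\
&= \sum_{l=1}^{s} \lambda_l \left[ I \otimes \cdots \otimes P^{(l)} \otimes I \otimes \cdots \otimes I - I \right] \\
&= \lambda \left[ \sum_{l=1}^{s} r_l \, I \otimes \cdots \otimes P^{(l)} \otimes I \otimes \cdots \otimes I - I \right] \\
&= \lambda (P - I),
\end{aligned} \tag{2.3}$$
where
$$\lambda = \sum_{l=1}^{s} \lambda_l, \quad r_l = \lambda_l / \lambda, \quad l = 1, \ldots, s,$$
and
$$P = \sum_{l=1}^{s} r_l \, I \otimes \cdots \otimes P^{(l)} \otimes I \otimes \cdots \otimes I.$$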
We can think of the $X$ process evolving as follows: Events occur at the points of a
Poisson process of rate $\lambda$. When an event occurs, it happens at site $j$ with probability
$r_j$, the change at site $j$ occurring according to transition matrix $P^{(j)}$. We remark
that if $\pi^{(l)}$ is the stationary distribution of $P^{(l)}$, then the stationary distribution of
$X$ is
$$\pi = \pi^{(1)} \otimes \cdots \otimes \pi^{(s)},$$
and that if $P^{(l)}$ (or $Q^{(l)}$) is reversible with respect to $\pi^{(l)}$, $l = 1, \ldots, s$, then $Q$ is
reversible with respect to $\pi$. This model arose originally in the population genetics
literature; see Griffiths and Tavaré (1994) for further details.
We now think of $X$ as a model for the evolution of a sequence, and we assume it
is reversible. In the special case that each $P^{(l)}$ is reversible, we have seen that $X$ is
reversible as well.
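The Kronecker-product construction can be checked numerically for a short sequence. The sketch below is illustrative only; the helper names `product_chain` and `embed` and the two-site example values are assumptions. It builds $P$ and $Q$ as in (2.3) with `numpy.kron`.

```python
import numpy as np

def product_chain(lams, Ps):
    """Return (lam, r, P, Q) for s independent sites as in (2.3).
    lams[l] is the event rate and Ps[l] the jump matrix of site l."""
    s = len(Ps)
    n = Ps[0].shape[0]
    lam = float(sum(lams))
    r = [x / lam for x in lams]

    def embed(mat, pos):
        # I (x) ... (x) mat (x) ... (x) I, with mat in position pos
        out = np.eye(1)
        for k in range(s):
            out = np.kron(out, mat if k == pos else np.eye(n))
        return out

    P = sum(r[l] * embed(Ps[l], l) for l in range(s))
    Q = sum(lams[l] * embed(Ps[l] - np.eye(n), l) for l in range(s))
    return lam, r, P, Q

# Hypothetical two-site example with different stationary distributions.
pi1 = np.array([0.4, 0.3, 0.2, 0.1])
pi2 = np.array([0.25, 0.25, 0.25, 0.25])
lam, r, P, Q = product_chain([1.0, 2.0],
                             [np.tile(pi1, (4, 1)), np.tile(pi2, (4, 1))])
pi = np.kron(pi1, pi2)   # candidate stationary distribution of X
```

In this example `Q` should agree with `lam * (P - I)` on the 16-state space, and `pi @ Q` should vanish, matching the remark about the product-form stationary distribution.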
2.2 Markov chain Monte Carlo (MCMC)
In this section, we review some results from the Markov chain Monte Carlo literature
that will be used in the sequel.
Let $T = (t_{ij})$ be the generator of a continuous-time Markov chain on a discrete
state space $\mathcal{X}$. Let $\{h_{ij} : i, j \in \mathcal{X}, i \ne j\}$ be any set of numbers satisfying $0 \le h_{ij} \le 1$.
We construct a Markov chain as follows:
Algorithm 1. Set $X(0)$ equal to a prescribed value. If we are currently at time
$t$, then
(i) Suppose that $X(t) = i$. Simulate an observation $W$ from an exponential
distribution with parameter $-t_{ii}$, and set $t \to t + W$.
(ii) Choose $j$ according to the distribution $\{p_{ij} = -t_{ij}/t_{ii}, j \ne i\}$.
(iii) Set $X(t) = j$ with probability $h_{ij}$, else set $X(t) = i$.
(iv) Go to step (i).
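It can be checked that $\{X(t), t \ge 0\}$ is a Markov chain on $\mathcal{X}$ with generator $Q$
determined by
$$q_{ij} = \begin{cases} t_{ij} h_{ij} & j \ne i \\ -\sum_{k \ne i} t_{ik} h_{ik} & j = i. \end{cases} \tag{2.4}$$
For this to be useful, we need a way to construct the $\{h_{ij}\}$ to produce a specified
stationary distribution $\pi$. Hastings (1970) suggests using
$$h_{ij} = \begin{cases} 0 & \text{if } t_{ij} = 0 \\ \min\left(1, \dfrac{\pi_j t_{ji}}{\pi_i t_{ij}}\right) & \text{if } t_{ij} > 0. \end{cases}$$
It is easy to check that if
$$t_{ij} = 0 \Leftrightarrow t_{ji} = 0 \tag{2.5}$$
then $Q$ is the generator of a reversible Markov chain $X(\cdot)$ on $\mathcal{X}$, with stationary
distribution $\pi$. Consider first the case $h_{ij} = 0$. This can occur only if $t_{ij} = t_{ji} = 0$,
in which case $\pi_i q_{ij} = \pi_j q_{ji} = 0$. If $0 < h_{ij} < 1$, then $\pi_j t_{ji} < \pi_i t_{ij}$, so that $h_{ji} = 1$. Therefore
$$\pi_i q_{ij} = \pi_i t_{ij} h_{ij} = \pi_i t_{ij} \frac{\pi_j t_{ji}}{\pi_i t_{ij}} = \pi_j t_{ji} = \pi_j q_{ji}.$$
Similarly, if $h_{ij} = 1$ we see that $\pi_i q_{ij} = \pi_j q_{ji}$, so that
$$\pi_i q_{ij} = \pi_j q_{ji} \quad \text{for all } i \ne j,$$
and $Q$ is indeed reversible with respect to $\pi$.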
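The reversibility argument can be checked numerically on a small example. The following sketch is an illustration under stated assumptions, not the thesis's code: the function name `hastings_generator`, the 3-state candidate generator `T` and the target `pi` are all hypothetical. It builds $Q$ from (2.4) with the Hastings choice of $h_{ij}$.

```python
import numpy as np

def hastings_generator(T, pi):
    """Generator Q of (2.4), using h_ij = min(1, pi_j t_ji / (pi_i t_ij))
    when t_ij > 0 and h_ij = 0 otherwise."""
    n = T.shape[0]
    Q = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and T[i, j] > 0:
                h = min(1.0, pi[j] * T[j, i] / (pi[i] * T[i, j]))
                Q[i, j] = T[i, j] * h
        Q[i, i] = -Q[i].sum()   # diagonal entry makes the row sum to zero
    return Q

# Hypothetical candidate generator (rows sum to zero) and target distribution.
T = np.array([[-3.0, 1.0, 2.0],
              [2.0, -3.0, 1.0],
              [1.0, 1.0, -2.0]])
pi = np.array([0.5, 0.3, 0.2])
Q = hastings_generator(T, pi)
flux = pi[:, None] * Q      # flux[i, j] = pi_i q_ij
```

Detailed balance corresponds to `flux` being a symmetric matrix, which in turn implies `pi @ Q` is zero.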
For Markov models specified in terms of $\lambda$ and $P$ (as in (2.3)), the natural
method is to apply the acceptance-rejection step to the jump chain $P$ itself. To
this end, suppose that $M = (m_{ij})$ is a transition matrix on $\mathcal{X}$ and define, for
$0 \le h_{ij} \le 1$ for all $i, j \in \mathcal{X}$ and $h_{ii} = 1$ for all $i \in \mathcal{X}$,
$$p_{ij} = \begin{cases} m_{ij} h_{ij} & j \ne i \\ 1 - \sum_{k \ne i} m_{ik} h_{ik} & j = i. \end{cases} \tag{2.6}$$
If $M$ satisfies
$$m_{ij} = 0 \Leftrightarrow m_{ji} = 0,$$
and
$$h_{ij} = \begin{cases} 0 & \text{if } m_{ij} = 0 \\ \min\left(1, \dfrac{\pi_j m_{ji}}{\pi_i m_{ij}}\right) & \text{if } m_{ij} > 0 \end{cases} \tag{2.7}$$
then $P$ is reversible, with stationary measure $\pi$. Hence so too is the $X$ process with
generator $Q$ given by $Q = \lambda(P - I)$. The simulation algorithm is given by
Algorithm 2. To simulate an observation with the distribution of $X(t)$, starting
from $X(0) = i$,
(i) Generate a Poisson random variable $K$ with mean $\lambda t$.
(ii) Given $K = k$, generate $k$ steps of the discrete process with transition matrix
$P$. This may be done by generating $j$ according to $\{m_{ij}, j \in \mathcal{X}\}$, calculating
$h_{ij}$, and accepting $j$ as the next step with probability $h_{ij}$, or $i$ as the next step
with probability $1 - h_{ij}$.
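The two steps of Algorithm 2 translate almost directly into code. The sketch below is a minimal illustration on a small state space, assuming the proposal matrix `M` satisfies the symmetry-of-zeros condition above; the function name `simulate_state`, the 3-state example and the seed are hypothetical choices, not taken from the thesis.

```python
import numpy as np

def simulate_state(i, t, lam, M, pi, rng):
    """One draw from the distribution of X(t) given X(0) = i (Algorithm 2)."""
    n = M.shape[0]
    k = rng.poisson(lam * t)              # step (i): number of events
    for _ in range(k):                    # step (ii): k steps of the jump chain
        j = rng.choice(n, p=M[i])         # candidate drawn from row i of M
        if j == i:
            continue                      # proposing the current state changes nothing
        h = min(1.0, pi[j] * M[j, i] / (pi[i] * M[i, j]))   # acceptance (2.7)
        if rng.random() < h:
            i = j                         # accept; otherwise stay at i
    return int(i)

# Hypothetical example: uniform proposals on 3 states, non-uniform target.
rng = np.random.default_rng(0)
M = np.full((3, 3), 1.0 / 3.0)
pi = np.array([0.5, 0.3, 0.2])
draws = [simulate_state(0, 15.0, 1.0, M, pi, rng) for _ in range(2000)]
freq = np.bincount(draws, minlength=3) / len(draws)
```

For a long branch such as the one used here, the empirical frequencies `freq` should be close to the target `pi`; for $t = 0$ the function simply returns the starting state.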
2.3 A dependent sites model
It is well known that many regions of DNA do not appear to behave as a sequence of
independent sites, but rather as low-order non-homogeneous Markovian sequences
(Borodovsky et al., 1986; Tavaré and Song, 1989; Watterson, 1992). We want to
construct a reversible Markov model for $X$ which reflects this aspect of the structure
along each sequence. We will apply the MCMC methods of the last section to
produce such a model.
Suppose that we want the process $X(\cdot)$ on $\mathcal{X} = \mathcal{A}^s$ to be reversible, with stationary
distribution $\pi$. The general model is defined by starting with a candidate
$T$-matrix in the form (2.1), and applying Algorithm 1. The biological motivation
for this model is based on a putative mutation mechanism that produces a ‘candidate
mutation’ (according to $T$), and then allows the mutation to be ‘repaired’ with
probability $1 - h_{ij}$, or accepted with probability $h_{ij}$. We can also use
the analogous model defined in terms of the jump chain in Algorithm 2. Either of
these processes is too general to be useful for our purposes, and we specialize to the
following case.
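We assume that $M$ is determined by the model in (2.3): a candidate sequence
$\mathbf{j} = (j_1, \ldots, j_s)$ is produced by choosing a site $l$ according to probability distribution
$\{r_l\}$, and mutating according to $P^{(l)}$. We assume $P^{(l)} = P = (p(i,j))$ for each $l$.
This produces a new sequence $\mathbf{j}$ that differs from the original sequence $\mathbf{i} = (i_1, \ldots, i_s)$
in just a single coordinate. The decision about whether the new mutation is accepted
is determined by $h(\mathbf{i}, \mathbf{j})$ defined in (2.7). This, in turn, requires some input about
the form of $\pi(\mathbf{i})$. In this thesis, we assume that $\pi$ is a Markov measure; that is,
$$\pi(\mathbf{i}) = \pi_0(i_1) r(i_1, i_2) \cdots r(i_{s-1}, i_s), \tag{2.8}$$
where $\{\pi_0(i), i \in \mathcal{A}\}$ is a probability distribution, and $R = \{r(i,j), i, j \in \mathcal{A}\}$ is a
strictly positive transition matrix. It then follows that if $\mathbf{i}$ and $\mathbf{j}$ differ in exactly
one place, then
$$\frac{\pi(\mathbf{j}) m(\mathbf{j}, \mathbf{i})}{\pi(\mathbf{i}) m(\mathbf{i}, \mathbf{j})} = \frac{\pi(\mathbf{j}) r_l \, p(j_l, i_l)}{\pi(\mathbf{i}) r_l \, p(i_l, j_l)}$$
if $\mathbf{i}$ and $\mathbf{j}$ differ in the $l$th coordinate. Substituting from (2.8), we see that
$$\frac{\pi(\mathbf{j}) m(\mathbf{j}, \mathbf{i})}{\pi(\mathbf{i}) m(\mathbf{i}, \mathbf{j})} = \begin{cases}
\dfrac{p(j_1, i_1) \pi_0(j_1) r(j_1, i_2)}{p(i_1, j_1) \pi_0(i_1) r(i_1, i_2)} & l = 1 \\[2ex]
\dfrac{p(j_l, i_l) r(i_{l-1}, j_l) r(j_l, i_{l+1})}{p(i_l, j_l) r(i_{l-1}, i_l) r(i_l, i_{l+1})} & 2 \le l \le s - 1 \\[2ex]
\dfrac{p(j_s, i_s) r(i_{s-1}, j_s)}{p(i_s, j_s) r(i_{s-1}, i_s)} & l = s
\end{cases} \tag{2.9}$$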
Furthermore, if we assume that $\pi_0$ is the stationary distribution of $R$ then the
distribution at each site marginally is also $\pi_0$. It is also convenient to assume $\pi_0$ is
the stationary distribution of $P$. In this case, if in addition
$$r(i, j) = \pi_0(j) \quad \text{for all } i, \tag{2.10}$$
so that $\pi$ corresponds to independent sites each distributed like $\pi_0$, then
$$\frac{\pi(\mathbf{j}) m(\mathbf{j}, \mathbf{i})}{\pi(\mathbf{i}) m(\mathbf{i}, \mathbf{j})} = \frac{\pi_0(j_l) p(j_l, i_l)}{\pi_0(i_l) p(i_l, j_l)}, \quad 1 \le l \le s \tag{2.11}$$
if $\mathbf{i}$ and $\mathbf{j}$ differ in the $l$th coordinate. Hence if $P$ is reversible (as we have assumed
for our models) then $h(\mathbf{i}, \mathbf{j}) = 1$ for such $\mathbf{i}$ and $\mathbf{j}$ (and $= 0$ for other $\mathbf{i}$ and $\mathbf{j}$, $\mathbf{i} \ne \mathbf{j}$).
This means that the original model based on (2.3) with $\pi^{(l)} = \pi_0$, $P^{(l)} = P$ is a
special case of the new mutation model. This observation will be used to calibrate
the results in the next sections.
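The acceptance probability for a single-site candidate mutation can be computed directly from the Markov measure (2.8), since the site-selection probability $r_l$ cancels in the ratio. The sketch below is illustrative: the function names, the integer coding of bases and the example matrices are assumptions, and it recomputes the full measure rather than using the local form (2.9).

```python
import numpy as np

def log_markov_measure(seq, pi0, R):
    """log pi(i) for the Markov measure (2.8); seq is a list of base indices."""
    logp = np.log(pi0[seq[0]])
    for a, b in zip(seq[:-1], seq[1:]):
        logp += np.log(R[a, b])
    return logp

def acceptance(seq, site, new_base, pi0, R, P):
    """h(i, j) from (2.7) when j differs from i = seq only at position `site`.
    Assumes the entries of P and R that are used are strictly positive."""
    old_base = seq[site]
    if new_base == old_base:
        return 1.0
    cand = list(seq)
    cand[site] = new_base
    log_ratio = (log_markov_measure(cand, pi0, R) + np.log(P[new_base, old_base])
                 - log_markov_measure(seq, pi0, R) - np.log(P[old_base, new_base]))
    return float(min(1.0, np.exp(log_ratio)))

# Hypothetical example values.
pi0 = np.array([0.4, 0.3, 0.2, 0.1])
P = np.tile(pi0, (4, 1))                # reversible: p(i, j) = pi0(j)
R_indep = P                             # r(i, j) = pi0(j), condition (2.10)
R_dep = 0.5 * np.eye(4) + 0.5 * P       # neighbouring sites positively dependent
h_indep = acceptance([0, 0, 0], 1, 1, pi0, R_indep, P)
h_dep = acceptance([0, 0, 0], 1, 1, pi0, R_dep, P)
```

With `R_indep` every single-site candidate is accepted, which is the calibration case described above; with `R_dep` a mutation that breaks a run of identical bases is accepted only with probability 4/49 in this example.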
Chapter 3
The Simulation Method
3.1 The Simulation Model
As indicated in Section 2.3, let $\pi_0$ be the stationary distribution of $R$ and also the
stationary distribution of $P$. The simplest model is the following:
$$R = P = \begin{pmatrix} \pi_0 \\ \pi_0 \\ \pi_0 \\ \pi_0 \end{pmatrix} \tag{3.1}$$
and it is easy to check that
$$\pi_0(i) p(i, j) = \pi_0(j) p(j, i),$$
i.e. $P$ is reversible, and hence $h(\mathbf{i}, \mathbf{j})$, defined in (2.7), satisfies $h(\mathbf{i}, \mathbf{j}) = 1$ if $\mathbf{i}$ and
$\mathbf{j}$ differ in a single coordinate (and $= 0$ for other $\mathbf{i}$ and $\mathbf{j}$, $\mathbf{i} \ne \mathbf{j}$).
We generalize this model, at the same time keeping $\pi_0 = \pi(R) = \pi(P)$, by
setting
$$R = \alpha I + (1 - \alpha) P. \tag{3.2}$$
It then follows that
$$\begin{aligned}
\pi_0 R &= \pi_0 (\alpha I + (1 - \alpha) P) \\
&= \alpha \pi_0 + (1 - \alpha) \pi_0 P \\
&= \alpha \pi_0 + (1 - \alpha) \pi_0 \\
&= \pi_0.
\end{aligned}$$
Notice that $\alpha$ is an indicator of the level of dependence: when $\alpha = 0$, $R$ is
an independent trials process, and when $\alpha \to 1$, the sequences start from a random
letter $a$ in $\mathcal{A}$, and are identical to $a$ for the rest of the sequence. Furthermore, the
distance between $P$ and $R$ in matrix norm is:
$$\begin{aligned}
\|R - P\| &= \|\alpha I + (1 - \alpha) P - P\| \\
&= \alpha \|I - P\| \\
&= \alpha \sum_i \sum_j |\delta_{ij} - p_{ij}| \\
&= \alpha \sum_i (1 - \pi_0(i)) + \alpha \sum_i \sum_{j \ne i} \pi_0(j) \\
&= \alpha \sum_i (1 - \pi_0(i)) + \alpha \sum_i (1 - \pi_0(i)) \\
&= 2\alpha \sum_i (1 - \pi_0(i)) \\
&= 2\alpha (4 - 1) \\
&= 6\alpha.
\end{aligned}$$
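Both properties of $R$ are easy to confirm numerically for any base composition. The following is a small sketch, with a hypothetical helper `dependence_matrix` and an arbitrarily chosen $\alpha$, that builds $R = \alpha I + (1 - \alpha)P$ and evaluates the entrywise norm used in the derivation.

```python
import numpy as np

def dependence_matrix(pi0, alpha):
    """Return (R, P) with every row of P equal to pi0, as in (3.1),
    and R = alpha * I + (1 - alpha) * P, as in (3.2)."""
    n = len(pi0)
    P = np.tile(pi0, (n, 1))
    R = alpha * np.eye(n) + (1.0 - alpha) * P
    return R, P

pi0 = np.array([0.28, 0.31, 0.26, 0.15])   # base composition of the second simulation
alpha = 0.3                                # illustrative dependence level
R, P = dependence_matrix(pi0, alpha)
norm = np.abs(R - P).sum()                 # sum of absolute entries of R - P
```

Here `pi0 @ R` should reproduce `pi0`, and `norm` should equal `6 * alpha` regardless of the base composition.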
3.2 The Simulation Method
For 4 species, there are 3 unrooted trees to choose from:
Figure 3.1.1 Three kinds of 4-species trees
We simulated observations from the model with dependent sites determined by $P$
defined in (3.1), and $R$ as in (3.2), for a variety of branch lengths in the phylogeny
in Tree 1 (the ‘true’ tree), and a variety of values of $\alpha$ and $\pi_0$. For each
simulated data set, we used the independent-sites likelihood method described in
Section 2.2 to choose the best tree (that is, the tree that maximizes the likelihood),
and recorded the number of times the correct tree was chosen. The key parameters
involved in the simulations are:
(i) $\alpha$, indicator of level of dependence;
(ii) $P$, the stochastic matrix;
(iii) The relative branch lengths $\{l_i\}$. This is one of the most important determinants
of the performance of tree-making methods (Huelsenbeck and Hillis, 1993).
The percentage of times the right tree is chosen (Per) is a function of $\alpha$, $R$, and
$\{l_i\}$, i.e. Per = $f(\alpha, R, \{l_i\})$.
Following Huelsenbeck and Hillis, we chose the following tree types, called the
two-branch-length trees. They have the property that the two branch lengths labeled $l_2$
are equal, and the three lengths labeled $l_3$ are equal.
To choose the branch lengths $l_2$ and $l_3$, we proceeded as follows. Imagine the
simplest model of sequence evolution in which changes occur at rate $\lambda$ at each site,
and at each change, the base $i$ currently at a site changes to $j$ with probability $\pi_j$.
It follows that the chance that two sites separated by time $t$ are different is
$$p_t = (1 - \exp(-\lambda t)) \sum_j \pi_j (1 - \pi_j).$$
The number of sites that differ is therefore (under the independent site model) a
binomial random variable with parameters $s$ and $p_t$. The expected number of sites
that differ is $s p_t$, and the expected fraction that differ is of course $p_t$. To observe a
fraction of different sites equal to $p$, we solve $p_t = p$ for $t$, i.e.
$$p = (1 - \exp(-\lambda t)) F$$
where $F = \sum_j \pi_j (1 - \pi_j)$. This gives
$$\lambda t = -\log\left(1 - \frac{p}{F}\right). \tag{3.3}$$
For the case of uniform base composition, we get $F = \frac{3}{4}$, and so
$$\lambda t = -\log\left(1 - \frac{4p}{3}\right). \tag{3.4}$$
If $p = 0.70$, $T = \lambda t = 2.71$, and if $p = 0.10$, $T = 0.14$. Hence to get an overall fraction
of 70% difference per unit time, we need a rate of $\lambda = 2.71$. Hence if our sequence
is of length $s = 250$, then the overall rate is $250 \times 2.71 \approx 678$ for a fraction $p = 0.70$,
and is $250 \times 0.14 \approx 35$ for a fraction $p = 0.10$. For the simulations, we took the short
branch to be 20, the long branch 700, corresponding to $p = 0.06$ and $p = 0.70$. The
trees are shown in Figure 3.1.3.
For the second simulation $F = 0.74$, and $F = 0.70$ for the third simulation. We
used the same branch lengths as the first one. They correspond to $p = 0.69$, $p = 0.057$
and $p = 0.66$, $p = 0.054$ respectively.
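The branch-length arithmetic above can be reproduced in a few lines. This is a sketch for checking the quoted numbers only; the function names are hypothetical, and it assumes (as the text implies) that the branch lengths 20 and 700 are overall rates for a sequence of $s = 250$ sites.

```python
import math

def difference_factor(pi):
    """F = sum_j pi_j * (1 - pi_j)."""
    return sum(p * (1.0 - p) for p in pi)

def lam_t_for_fraction(p, F):
    """Equation (3.3): lambda * t = -log(1 - p / F)."""
    return -math.log(1.0 - p / F)

def fraction_for_branch(branch, s, F):
    """Expected fraction of differing sites when the whole-sequence
    branch length `branch` is spread over s sites."""
    return F * (1.0 - math.exp(-branch / s))

F1 = difference_factor([0.25, 0.25, 0.25, 0.25])   # first simulation
F2 = difference_factor([0.28, 0.31, 0.26, 0.15])   # second simulation
F3 = difference_factor([0.4, 0.3, 0.2, 0.1])       # third simulation
long_branch = fraction_for_branch(700, 250, F1)
short_branch = fraction_for_branch(20, 250, F1)
```

Under these assumptions `F1` is 0.75, `F2` and `F3` round to 0.74 and 0.70, and the long and short branches give fractions of about 0.70 and 0.06, in line with the values quoted in the text.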
Figure 3.1.2 Two-branch-length tree
In particular, the following four types of trees are explored in our simulation.
Figure 3.1.3 Four types of trees used in the simulations (panels: $l_2 = 700$, $l_3 = 700$; $l_2 = 700$, $l_3 = 20$; $l_2 = 20$, $l_3 = 700$; $l_2 = l_3 = 20$)
Chapter 4
Phylogenetic Inferences under the New Model
4.1 The Simulation
We performed a simulation experiment as follows. We chose branch lengths $l_2$, $l_3$,
$\pi_0$ (and so $P$), and the dependence level $\alpha$. We simulated 500 trees using Tree 1 as
the true phylogeny. Each tree produced a set of sequences at the tips. For each set
of sequences, we used Felsenstein’s (1981) maximum likelihood method to estimate
the branch lengths of each of the 3 tree topologies, and the corresponding likelihood
of each tree. For each simulation, we recorded which tree was chosen as the “best”
one.
The particular version of the Felsenstein method we used is as follows. The base
frequencies $\pi_0$ are estimated from the frequencies among the simulated sequences
at the tips of the tree. When this is done, we estimate the 5 branch lengths in
the unrooted tree by maximising the likelihood, under the assumption that the true
model has $P$ given by (1.3), and independent identically distributed sites. This
allows us to assess the effects of dependence among sites when using standard tree
reconstruction methods.
4.2 Simulation 1
The first simulation was run under the following conditions:
$\pi_0 = (0.25, 0.25, 0.25, 0.25)$,
$P = (\pi_0, \pi_0, \pi_0, \pi_0)'$ and $R = \alpha I + (1 - \alpha)P$.
The results are as follows:
Table 4.1.1 : Percentage vs $\alpha$
α    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Per  0.3    0.383  0.63   0.802  0.942  0.976  0.986  0.95   0.95   0.868
Figure 4.1.1 Percentage of time the right tree is picked (Per plotted against alpha)
The figure above is for the first type of tree, $l_2 = 700$, $l_3 = 700$
Table 4.1.2 : Percentage vs $\alpha$, $\pi_0 = (0.25, 0.25, 0.25, 0.25)$
α    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Per  0.3    0.36   0.45   0.425  0.285  0.365  0.3    0.145  0.155  0.166
Figure 4.1.2 Percentage of time the right tree is picked (Per plotted against alpha)
The figure above is for the second type of tree, $l_2 = 700$, $l_3 = 20$
Table 4.1.3 : Percentage vs $\alpha$, $\pi_0 = (0.25, 0.25, 0.25, 0.25)$
α    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Per  0.715  0.93   0.99   1.0    1.0    1.0    1.0    1.0    1.0    0.964
Figure 4.1.3 Percentage of time the right tree is picked (Per plotted against alpha)
The figure above is for the third type of tree, $l_2 = 20$, $l_3 = 700$
Table 4.1.4 : Percentage vs $\alpha$, $\pi_0 = (0.25, 0.25, 0.25, 0.25)$
α    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Per  1.0    1.0    1.0    1.0    0.994  0.988  0.953  0.93   0.81   0.524
Figure 4.1.4 Percentage of time the right tree is picked (Per plotted against alpha)
The figure above is for the fourth type of tree, $l_2 = 20$, $l_3 = 20$
4.3 Simulation 2
The second simulation was run under the following conditions: $\pi_0 = (0.28, 0.31, 0.26, 0.15)$,
$P = (\pi_0, \pi_0, \pi_0, \pi_0)'$ and $R = \alpha I + (1 - \alpha)P$. The results are as follows:
Table 4.2.1 : Percentage vs $\alpha$, $\pi_0 = (0.28, 0.31, 0.26, 0.15)$
α    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Per  0.372  0.42   0.582  0.825  0.925  0.978  0.99   0.968  0.944  0.834
Figure 4.2.1 Percentage of time the right tree is picked (Per plotted against alpha)
The figure above is for the first type of tree, $l_2 = 700$, $l_3 = 700$
Table 4.2.2 : Percentage vs $\alpha$, $\pi_0 = (0.28, 0.31, 0.26, 0.15)$
α    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Per  0.36   0.453  0.404  0.361  0.337  0.288  0.129  0.179  0.156  0.185
Figure 4.2.2 Percentage of time the right tree is picked (Per plotted against alpha)
The figure above is for the second type of tree, $l_2 = 700$, $l_3 = 20$
Table 4.2.3 : Percentage vs $\alpha$, $\pi_0 = (0.28, 0.31, 0.26, 0.15)$
α    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Per  0.751  0.901  0.963  0.996  1.0    1.0    1.0    1.0    0.994  0.964
Figure 4.2.3 Percentage of time the right tree is picked (Per plotted against alpha)
The figure above is for the third type of tree, $l_2 = 20$, $l_3 = 700$
Table 4.2.4 : Percentage vs $\alpha$, $\pi_0 = (0.28, 0.31, 0.26, 0.15)$
α    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Per  1.0    1.0    1.0    1.0    0.996  0.985  0.958  0.938  0.866  0.653
Figure 4.2.4 Percentage of time the right tree is picked (Per plotted against alpha)
The figure above is for the fourth type of tree, $l_2 = 20$, $l_3 = 20$
4.4 Simulation 3
The third simulation was run under the following conditions: $\pi_0 = (0.4, 0.3, 0.2, 0.1)$,
$P = (\pi_0, \pi_0, \pi_0, \pi_0)'$ and $R = \alpha I + (1 - \alpha)P$. The results are as follows:
Table 4.3.1 : Percentage vs $\alpha$, $\pi_0 = (0.4, 0.3, 0.2, 0.1)$
α    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Per  0.354  0.384  0.553  0.828  0.966  0.982  0.99   0.988  0.954  0.834
Figure 4.3.1 Percentage of time the right tree is picked (Per plotted against alpha)
The figure above is for the first type of tree, $l_2 = 700$, $l_3 = 700$
Table 4.3.2 : Percentage vs $\alpha$, $\pi_0 = (0.4, 0.3, 0.2, 0.1)$
α    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Per  0.308  0.448  0.432  0.348  0.418  0.272  0.268  0.212  0.23   0.176
Figure 4.3.2 Percentage of time the right tree is picked (Per plotted against alpha)
The figure above is for the second type of tree, $l_2 = 700$, $l_3 = 20$
Table 4.3.3 : Percentage vs $\alpha$, $\pi_0 = (0.4, 0.3, 0.2, 0.1)$
α    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Per  0.715  0.906  1.0    1.0    1.0    1.0    1.0    1.0    0.996  0.94
Figure 4.3.3 Percentage of time the right tree is picked (Per plotted against alpha)
The figure above is for the third type of tree, $l_2 = 20$, $l_3 = 700$
Table 4.3.4 : Percentage vs $\alpha$, $\pi_0 = (0.4, 0.3, 0.2, 0.1)$
α    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
Per  1.0    1.0    1.0    0.996  0.994  0.994  0.954  0.894  0.756  0.444
Figure 4.3.4 Percentage of time the right tree is picked (Per plotted against alpha)
The figure above is for the fourth type of tree, $l_2 = 20$, $l_3 = 20$
4.5 A Discussion of Ties
As stated in Section 3.2, for phylogenetic inference, there are 3 trees to choose from.
The tree which maximizes the likelihood is chosen. If the “right” tree (tree 1)
is chosen, the result is a “win”; if tree 2 or tree 3 is chosen, then the result is a
“loss”. How to deal with ties is therefore an issue. We dealt with it as follows:
(i) First we look at our result without considering ties.
(ii) Second we set Per equal to the sum of the percentage of wins, $P(\text{win})$, and half
of the percentage of ties, $P(\text{tie})$, i.e. Per = $P(\text{win}) + \frac{1}{2} P(\text{tie})$.
(iii) Then we compare the results from (i) and (ii) to see if they tell different stories.
We find they tell the same story. Ties are therefore a second-order effect that does not
materially affect our conclusions. We use method (i) throughout this thesis.
The detailed comparison with and without ties is on the following pages. We only
present the comparison for Simulation 1 (Section 4.2).
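The two scoring rules can be written down compactly. The sketch below is an illustration only: the thesis does not say how a tie was detected, so the tolerance `tol`, the function name `score_replicates`, and the convention that a tie means the true tree shares the maximum likelihood with at least one other tree are all assumptions.

```python
def score_replicates(log_likelihoods, true_tree=0, tol=1e-9):
    """Return (Per_i, Per_ii) from a list of per-replicate likelihood triples.
    Per_i  = P(win); Per_ii = P(win) + 0.5 * P(tie)."""
    wins = ties = 0
    for ll in log_likelihoods:
        best = max(ll)
        # indices of all trees whose likelihood is within tol of the maximum
        winners = [k for k, v in enumerate(ll) if best - v <= tol]
        if winners == [true_tree]:
            wins += 1          # the true tree is the unique maximizer
        elif true_tree in winners:
            ties += 1          # the true tree shares the maximum
    n = len(log_likelihoods)
    return wins / n, (wins + 0.5 * ties) / n

# Hypothetical log-likelihoods for trees 1, 2, 3 in four replicates.
example = [(-10.0, -12.0, -13.0),   # win
           (-10.0, -10.0, -11.0),   # tie
           (-12.0, -10.0, -11.0),   # loss
           (-9.0, -9.0, -9.0)]      # tie
per_i, per_ii = score_replicates(example)
```

In this made-up example there is one win and two ties in four replicates, so `per_i` is 0.25 and `per_ii` is 0.5.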
Figure 4.4.1 Comparison of Per(i) and Per(ii) (two panels, Per(ii) and Per(i), each plotted against alpha)
The figure above is for the first type of tree, $l_2 = 700$, $l_3 = 700$. Per(i) = $P(\text{win})$,
Per(ii) = $P(\text{win}) + \frac{1}{2} P(\text{tie})$.
Figure 4.4.2 Comparison of Per(i) and Per(ii) (two panels, Per(ii) and Per(i), each plotted against alpha)
The figure above is for the second type of tree, $l_2 = 700$, $l_3 = 20$. Per(i) = $P(\text{win})$,
Per(ii) = $P(\text{win}) + \frac{1}{2} P(\text{tie})$.
Figure 4.4.3 Comparison of Per(i) and Per(ii) (two panels, Per(ii) and Per(i), each plotted against alpha)
The figure above is for the third type of tree, $l_2 = 20$, $l_3 = 700$. Per(i) = $P(\text{win})$,
Per(ii) = $P(\text{win}) + \frac{1}{2} P(\text{tie})$.
Figure 4.4.4 Comparison of Per(i) and Per(ii) (two panels, Per(ii) and Per(i), each plotted against alpha)
The figure above is for the fourth type of tree, $l_2 = 20$, $l_3 = 20$. Per(i) = $P(\text{win})$,
Per(ii) = $P(\text{win}) + \frac{1}{2} P(\text{tie})$.
From these comparisons of Per(i) and Per(ii), we conclude that
there is no big difference between the two ways of dealing with ties. We
chose Per(i) as our measure of the percentage of times the right tree is chosen. There are other
possible ways to deal with ties that we leave for future research.
4.6 Conclusion
From the simulation results in the previous sections, we observe the following:
(i) The methods that assume independence perform even better for small values
of the dependence parameter $\alpha$ than they do under independence ($\alpha = 0$).
(ii) As the dependence parameter increases towards 1, the methods begin to do
less well.
(iii) For the second type of tree, which corresponds to $l_2 = 700$, $l_3 = 20$, the
estimation methods do even worse than randomly picking a tree. This is an
example of Felsenstein’s observation that “long branches attract”.
The percentage of time the right tree is picked, Per, seems to be a continuous function
of $\alpha$. This is supported by the smooth curves we obtained from the simulations.
One important feature of the methods involves the real mutation rate, which is
defined as
$$\text{mutation rate} = \frac{\text{number of real mutations}}{\text{total length of the tree branches}}.$$
The following picture shows the relationship between mutation rate and $\alpha$.
Figure 4.6.1 Mutation rate vs. $\alpha$ (mutation rate plotted against alpha, one curve per tree type: $l_2 = 20$, $l_3 = 20$; $l_2 = 20$, $l_3 = 700$; $l_2 = 700$, $l_3 = 20$; $l_2 = 700$, $l_3 = 700$)
We see that the mutation rate is essentially a linear function of $\alpha$. When $\alpha$ is small,
the mutation rate is big; when $\alpha$ approaches 1, the mutation rate approaches 0. This can
explain, in general terms, the three observations.
(i) The estimation methods do better at the beginning, because when $\alpha$ is small,
the fact that the number of real mutations is smaller than expected means
that the tips at the end of long branches are less saturated (i.e. less close to
independent) than they appear. Hence there is more information to recover
the correct tree.
(ii) The estimation methods do worse as $\alpha \to 1$, because when the mutation rate
is too small, it is as if the branch lengths of the tree were shrunk, and when
branches are too short, there is not enough information for the method to pick
out the correct tree. This is equivalent to the effective length of the sequences
being much shorter than it appears.
(iii) For the second type of tree, there are two long branches and three short
branches. Because of the effect that “long branches attract”, the methods
do even worse than choosing at random.
Reference List
[1] Borodovsky, M.Y., Sprizhitsky, Y., Golovanov, E., and Alexandrov, A. (1986).
Statistical Patterns in the Primary Structures of Functional Regions in the
Genome of E. coli: II Nonuniform Markov Models. Mol. Biol., 20:1024-1033.
[2] Felsenstein, J. (1981). Evolutionary Trees from DNA Sequences: A Maximum
Likelihood Approach. J. Mol. Evol. 17:368-376.
[3] Griffiths, R.C. and Tavaré, S. (1994). Computational Methods for the Coalescent.
IMA conference proceedings, in press.
[4] Han, G.Y. (1993). Maximum Likelihood Methods for Estimating Evolutionary
Parameters from DNA Sequence Data. USC thesis, Mathematics Department.
[5] Hastings, W.K. (1970). Monte Carlo Sampling Methods using Markov chains
and their applications. Biometrika, 57:97-109.
[6] Huelsenbeck, J.P. and Hillis, D.M. (1993). Success of Phylogenetic Methods in
the Four-Taxon Case. Syst. Biol. 42:247-264.
[7] Keilson, J. (1979). Markov Chain Models - Rarity and Exponentiality. Applied
Mathematical Sciences Series, Volume 28. Springer Verlag, New York.
[8] Kuhner, M.K. and Felsenstein, J. (1994). A Simulation Comparison of Phylogeny
Algorithms under Equal and Unequal Evolutionary Rates. Mol. Biol.
Evol. 11:459-468.
[9] Tavaré, S. (1986). Some probabilistic and statistical problems in the analysis
of DNA sequences. In “Lectures on Mathematics in the Life Sciences”. Amer.
Math. Soc. 17:57-86.
[10] Tavaré, S. and Song, B. (1989). Codon Preference and Primary Sequence Structure
in Protein Coding Regions. Bull. Math. Biol., 51:95-115.
[11] Watterson, G.A. (1992). A Stochastic Analysis of Three Viral Sequences. Mol.
Biol. Evol. 9:666-677.
Asset Metadata
Creator: Feng, Yinsuo (author)
Core Title: The effects of dependence among sites in phylogeny reconstruction
School: Graduate School
Degree: Master of Science
Degree Program: Statistics
Degree Conferral Date: 1995-05
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: biology, botany; biology, general; biology, genetics; biology, molecular; biology, zoology; OAI-PMH Harvest; statistics
Language: English
Contributor: Digitized by ProQuest (provenance)
Advisor: Tavare, Simon (committee chair), Alexander, Kenneth A. (committee member), Waterman, Michael (committee member)
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c18-8211
Unique identifier: UC11356813
Identifier: 1378413.pdf (filename), usctheses-c18-8211 (legacy record id)
Legacy Identifier: 1378413-0.pdf
Dmrecord: 8211
Document Type: Thesis
Rights: Feng, Yinsuo
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA