Inferring Gene Flow across Spatial Landscapes
Erik Lundgren
A Dissertation Presented to the
Faculty of the USC Graduate School
Computational Biology and Bioinformatics
University of Southern California
Doctor of Philosophy (Ph.D.)
Degree Conferral: December 2018
Acknowledgements
I would like to thank my advisor, Peter Ralph, without whom this project could not have been done. He is
always able to provide mathematical insight and has been helpful and motivating throughout my time as a
Ph.D. student. Additionally, I would like to thank the rest of my dissertation committee, Remo Rohs, Paul
Marjoram, and Matt Dean, for comments and suggestions for the project. I would like to thank my parents
and sisters for their constant love and support throughout my life. I would also like to thank my girlfriend
for always encouraging shared adventures. I would like to thank my lab mate, Josh Schiffman, for the
frequent discussions of science and philosophy. Finally, I would like to thank the rest of my Computational
Biology and Bioinformatics cohort for the hangouts and study sessions.
Contents

1 Introduction
2 Statement of Problems
3 Methods
  3.1 The Relationship of Coalescence Times and Movement Rates
    3.1.1 Existence and Uniqueness of Solutions
    3.1.2 The Simplest Example
  3.2 Noisy Coalescence Times
    3.2.1 Quadratic Approximation to the Likelihood
  3.3 Incomplete Observations
  3.4 Comparing Coalescence Time and Commute Time Methods
  3.5 MCMC Validation
4 Results
  4.1 Inference with Noisy Data
  4.2 Effect of Coalescence Rate on Inference
  4.3 Inference of a Barrier
  4.4 Comparison of Commute and Coalescence Time Inferences
5 Application of Inference to Simulated Populations in SLiM
  5.1 Continuous Landscape with Uniform Migration
  5.2 Continuous Landscape with a Barrier
  5.3 Continuous Landscape with Biased Migration
6 Application of Inference to Mojave Desert Tortoises
7 Discussion
8 References
1 Introduction
We are often interested in knowing the history of how populations evolve over time and space; however, observation of population sizes and movements is difficult in modern populations and impossible for those long in the past. While it is sometimes possible to directly measure migration through tracking, observing long distance trends over many generations is much less feasible. Therefore, we would like a way to infer this information based on the current state of the population. In addition to providing information about how new mutations spread through the population, this would also be applicable to modeling the spread of disease and useful for learning strategies to retain genetic diversity in threatened species. One relevant type of data that we could obtain is the current locations and genotypes of a subset of the individuals in the population. From this, we can calculate the genetic distance between individuals, which provides a good estimate of the average time to a common ancestor along the genome, or the average coalescence time. Coalescence is the process in which genes or regions of the genome from two or more individuals, if traced back in time, eventually share a common origin [Wakeley, 2005]. While we cannot know exactly when this happens, if we assume the rate of mutation per nucleotide per generation is relatively constant, we can use the number of differences between two copies of a gene to estimate how much time has passed since their common ancestor (Figure 1). With enough loci, we can calculate an estimate for the average coalescence time across the genome [Hudson, 2007].
Before we get too far into specifics, let us take a step back and talk about populations spread out over space. Typically, we think of populations effectively living in continuous two dimensional space. If we look at the model from a forwards time perspective, then at some defined starting point, the population is distributed over the habitat. Depending on the specifics of the model, the number of individuals in each location may be fixed or allowed to vary over time. Individuals can die, migrate to other locations, or reproduce with other nearby individuals. As they do so, family trees will branch and/or die out, with mutations occurring along the way. From the perspective of coalescence, we instead follow the lineages back in time; that is, for a given gene from an individual sampled in the present, where on the landscape were the ancestors from which that gene was inherited. Instead of looking at the forwards time migration rates, we would be looking at the reverse time gene flow rates, i.e., how frequently genes present in individuals from one area are inherited from individuals previously in another area. However, using the reverse time movement of lineages presents a mathematical problem: thanks to the Central Limit Theorem, the natural model of the movement of these lineages is Brownian motion, but two Brownian motions will never meet in two or higher dimensions [Shiga, 1980] [Shiga, 1988] [Barton et al., 2002]. One strategy to get around this issue (which we will be using in this document) is to use a model of space that is finite and discrete. In this case, lineages will always eventually meet as long as all states communicate with (that is, are reachable from) each other [Wakeley, 2005].
Figure 1: An illustration of the process of coalescence. If we sample a group of individuals in the present and trace the lineages for a region of the genome back in time, they will eventually share a common origin. Mutations are shown as the stars on the phylogenetic tree. The longer the coalescence time, the more mutations there are likely to be. In this illustration, the genetic distance between B and D is 2 while the genetic distance between B and E is 4.

How to best model the evolution of populations on spatial landscapes is an old but unsolved question in population genetics, dating back to work by Wright [1943], which first coined the term "Isolation by Distance". In this paper, Wright modeled the population as being broken into discrete demes. There were also early models in continuous space, such as Malécot's 1948 book, The Mathematics of Heredity [Malécot, G., 1948]. However, it is not trivial to formulate a model that is useful and consistent. Felsenstein [1975] demonstrated that some of Malécot's assumptions are incompatible with one another. There have been a number of attempts to resolve this and other issues, including the method of "Isolation by Resistance" by McRae [2006], and the associated program Circuitscape, which compared geographic distance to electrical resistance; the EEMS method (Estimating Effective Migration Surfaces) by Petkova et al. [2016], another resistance based method in which the habitat is broken up into regions of faster or slower migration rates; and the spatial Λ-Fleming-Viot process by Barton et al. [2013], where coalescence happens during "events" of localized extinction and recolonization.
Even though the population dynamics in forwards time are Markov, the dynamics of lineages traced back in time (the coalescent process) may not be. For example, if fecundity is variable and population density is low, two lineages near each other are more likely to share a recent common origin, so knowing how one lineage moves gives you information about how the other lineage is likely to move. This effect becomes negligible as population density increases [Barton et al., 2002], so it should not be an issue if the offspring of any one parent are typically interspersed with the offspring of many others. The methods I develop here model the coalescent process as Markov, which should be a good approximation for many systems.
Since we can model the movement of two lineages back in time as independent non-uniform random walks, we can use the estimate of the average amount of time to coalescence to estimate the parameters of the random walk. This is an example of what is known as an inverse problem: a problem where, instead of starting with a model and calculating the result, we are given the result and seek to find the parameters of the model. Problems of this type can be very difficult to solve in practice since they are often ill-conditioned; that is, small changes in the input can result in large changes in the inferred parameters [Petrov and Sizikov, 2005].
2 Statement of Problems
We would like to use genetic data to infer population parameters by modeling genetic distance as coalescence times of a random walk, in which movement of the walk tells us about population migration (or rather, the reverse time gene flow rates) and coalescence rates tell us about population sizes. To do this, we break the problem into several smaller, better-defined mathematical questions. First, if we have a Markov chain with unknown movement rates for which we are given all pairwise average times of coalescence for each pair of starting locations, can we infer the migration parameters of the Markov chain? With real data, we will never have exact values for the average coalescence times, so we must be prepared to deal with noise. Therefore, the next question we wish to solve is the same as above except that we are not given the exact mean coalescence times, but rather noisy estimates of these. The next modification we will make is that, since in practice we may not have observations from every location, some of the pairwise coalescence times are unobserved. Fortunately, allowing for missing values requires only a small modification to the method; however, it unsurprisingly decreases the accuracy of the estimates, as is discussed later. Finally, a related question is that of the relationship between commute time (the time to go from i to j, then back to i) and coalescence time. We are interested in this for two reasons: commute time is often used in practice (as "resistance distance"), and coalescence time is much more computationally expensive to compute, so if inference based on commute time gives good results, it would be applicable to much larger models.
Now we can formally define these questions. First, we will define H, the matrix of mean coalescence times.

Definition 1. Let X and Y be independent copies of a continuous-time, finite-state Markov chain on state space S with generator G, and let γ be a non-negative function on S. Suppose that coalescence occurs at rate γ(x) when X_s = Y_s = x, and let τ be their coalescence time, so that

P\{\tau > t\} = E\left[ \exp\left( -\int_0^t \gamma(X_s)\, \mathbf{1}_{X_s = Y_s}\, ds \right) \right],

where 1 is the indicator function. Then the matrix of pairwise mean coalescence times, H, is defined as

H_{xy} = E[\tau \mid X_0 = x,\, Y_0 = y] \quad \text{for each } x, y.
Now we are ready to more formally write each problem.
Problem 1. Suppose we are given, without error, a mean coalescence time matrix H from an unknown continuous-time Markov chain on a known state space S with n states. Then, find all sets of parameters (G, γ) where G is the generator matrix of a continuous-time Markov chain and γ is a coalescence rate function such that the coalescence time matrix is H.
Problem 2. Suppose we are given a mean coalescence time matrix Ĥ where Ĥ_ij = H_ij + ε_ij, where ε is a multivariate normal distribution with mean 0 and covariance matrix Σ, the ε_ij are independent for i ≤ j, and ε_ij = ε_ji. Then, find an estimate of the parameters (G, γ) along with a measure of uncertainty of the estimate.
Problem 3. Suppose we are given an incomplete mean coalescence time matrix Ĥ defined the same as above except that the only given values are Ĥ_ab where a and b are from some subset of S. Then, find an estimate of the parameters (G, γ) along with a measure of uncertainty of the estimate.
We are also interested in the commute time based approximation of coalescence time (H̃), which is defined later in equation 7 in section 3.4:
Problem 4. Suppose we are given a mean coalescence time matrix as in one of the above problems. Then, find an estimate of the parameters (G, q) along with a measure of uncertainty of the estimate such that the resulting commute time approximation H̃ is as close to the given coalescence time matrix as possible. Then, compare these results to those obtained by solving one of the above problems.
In practice, allowing all off-diagonal elements of the generator matrix to be model parameters often results in far too many free parameters, as discussed later in section 3.1.1. Additionally, it is mechanistically reasonable to allow only nearest neighbor migration, with the rest of the transition rates assumed to be zero. This leads us to the next modification, which we will be using throughout the document.
Problem Modification 1. Suppose that a given subset of the values of G are known to be 0. Then, solve one of the above problems.
Another modification that reduces the number of parameters that need to be found is to assume that the coalescence rate is the same in all locations. That is, there is only one parameter to find for the coalescence rate instead of one for each location.
Problem Modification 2. Suppose that the function γ is constant over S. Then, solve one of the above problems.
3 Methods
3.1 The Relationship of Coalescence Times and Movement Rates
In order to approach the problem of solving for the Markov chain parameters given the coalescent times, it is helpful to see how to solve the problem the other way around (that is, to solve for the coalescent times given the Markov chain). The following techniques are standard, but are included for completeness [Kemeny and Snell, 1976]. First, we will talk about hitting times more generally. The hitting time from state i to state j is the amount of time it takes for a particle moving randomly as a Markov chain to first arrive in state j given that it was in state i at time t = 0. Consider a continuous-time, finite-state Markov chain on state space S = {1, 2, ..., n} with rates of movement G_ij from state i to state j for i ≠ j. Now let G_ii = −∑_{j : j ≠ i} G_ij, such that each row sums to 0. Then the matrix G is the generator matrix characterizing the movement of one particle. That is, if a particle is in state i, the amount of time it remains in state i is equal to an exponential random variable with rate parameter −G_ii (the larger −G_ii is, the shorter it is likely to stay in i), and the probability that it moves to state j is equal to G_ij/(−G_ii). Let h_j be the vector of expected hitting times to state j from all of the other states, with h_j(i) being the expected hitting time from i to j. Assume without loss of generality that j is state n (the final state). Since we are only interested in the time before reaching j, we can consider j to be an absorbing state. If we assume that j is the only absorbing state and all states can reach j, then

h_j(i) = E[T_j \mid X_0 = i] = \int_0^\infty P\{X_t \neq j \mid X_0 = i\}\, dt,

where T_j is the hitting time to j.
In order to find the probabilities that the chain is in each state (that is, the transition probability matrix) as a function of time, we can exponentiate the generating matrix: P(t) = e^{tG}, then P_k{X_t = l} = P(t)_{kl}. Since we are assuming j is an absorbing state, we can consider the sub-Markov chain with generator matrix G_j, which is formed by deleting the j-th row and column of G. Some rows of G_j sum to a number less than zero, meaning that the particle can transition out of a state without entering another state (since state j is not a state in the sub-Markov chain). Since (e^{tG})_{kl} = (e^{tG_j})_{kl} for j ∉ {k, l}, we can consider G_j instead of G. This is advantageous because, while G is never invertible, G_j is invertible as long as all states are reachable from each other [Kemeny and Snell, 1976]. If we define δ_i to be the vector that is all zeros except 1 at i, and 1 to be the vector of all ones, we can write the above equation as

h_j(i) = \delta_i^T \int_0^\infty e^{t G_j} \mathbf{1}\, dt = \delta_i^T (-G_j)^{-1} \mathbf{1}.

Rearranging and writing the full vector gives us

G_j h_j = -\mathbf{1}. \qquad (1)
This equation can be solved for h_j using any of a variety of linear algebra techniques without explicitly computing the above matrix exponentiation.
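As a concrete illustration (a minimal numerical sketch in Python, not code from this dissertation; the rates below are made up), equation (1) can be solved with a standard linear solver after deleting the row and column for the target state:

    import numpy as np

    # Illustrative 3-state generator: off-diagonal entries are movement rates,
    # and each diagonal entry makes its row sum to zero.
    rates = np.array([[0.0, 1.0, 0.5],
                      [0.3, 0.0, 0.7],
                      [0.2, 0.4, 0.0]])
    G = rates.copy()
    np.fill_diagonal(G, -rates.sum(axis=1))

    j = 2                                    # target state (0-indexed)
    keep = [i for i in range(G.shape[0]) if i != j]
    G_j = G[np.ix_(keep, keep)]              # delete row and column j

    # Equation (1): G_j h_j = -1, so the expected hitting times from i != j are:
    h_j = np.linalg.solve(G_j, -np.ones(len(keep)))
    print(dict(zip(keep, h_j)))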
If we want to describe the movement of two independent particles (or two lineages on a landscape), we have to use a Markov chain that describes the movement on the product space, S × S. This results in the product chain, where each particle moves independently according to G. Since this is in continuous time, only one particle may move at a time, so the generator matrix for the product chain, which we will call G^⊗, is given by

G^{\otimes} = G \otimes I + I \otimes G
= \begin{bmatrix} G_{11} I & G_{12} I & \cdots & G_{1n} I \\ G_{21} I & G_{22} I & \cdots & G_{2n} I \\ \vdots & \vdots & \ddots & \vdots \\ G_{n1} I & G_{n2} I & \cdots & G_{nn} I \end{bmatrix}
+ \begin{bmatrix} G & 0 & \cdots & 0 \\ 0 & G & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & G \end{bmatrix}
= \begin{bmatrix} G_{11} I + G & G_{12} I & \cdots & G_{1n} I \\ G_{21} I & G_{22} I + G & \cdots & G_{2n} I \\ \vdots & \vdots & \ddots & \vdots \\ G_{n1} I & G_{n2} I & \cdots & G_{nn} I + G \end{bmatrix}, \qquad (2)
where I is the n × n identity matrix and ⊗ denotes the Kronecker product. The state of the product chain where the first particle is at a and the second particle is at b is encoded in this matrix as state (a − 1)n + b, where n is the number of states in the one particle system. We will write this as ab, meaning "particle 1 in state a and particle 2 in state b" rather than the number a multiplied by b. When the two particles are in the same state, say, state a, coalescence occurs at rate γ(a) ≥ 0; otherwise they cannot coalesce. We can record these coalescence rates in the diagonal matrix Γ, with the values on the diagonal being γ, which will be denoted diag(γ). More specifically, the elements are defined by

\operatorname{diag}(\gamma)_{ab} = \Gamma_{ab} = \begin{cases} \gamma(a) & \text{if } a = b \\ 0 & \text{otherwise.} \end{cases}

It is useful to unstack this matrix into a vector by placing each column underneath the one before it, defined by the operation vec(). Now we can modify the generator matrix of the product chain to account for coalescence by adding a state indicating that the lineages have coalesced. We will call this the dagger (†) state. Let

\tilde{G}^{\otimes} = \begin{bmatrix} G^{\otimes} - \operatorname{diag}(\operatorname{vec}(\Gamma)) & \operatorname{vec}(\Gamma) \\ 0 & 0 \end{bmatrix}.
Now, using H the same as in Definition 1, we can write the coalescence time in the same way we did for general hitting times. For a two particle Markov chain X^⊗ starting in state x = ab, we have

H(x) = E_x[\tau] = \int_0^\infty P_x\{X^{\otimes}_t \neq \dagger\}\, dt
= \delta_x^T \int_0^\infty \left( e^{t \tilde{G}^{\otimes}} \right)_{\cdot,\, \neq \dagger} \mathbf{1}\, dt
= \delta_x^T \int_0^\infty e^{t \left( G^{\otimes} - \operatorname{diag}(\operatorname{vec}(\Gamma)) \right)}\, dt\; \mathbf{1}
= -\delta_x^T \left( G^{\otimes} - \operatorname{diag}(\operatorname{vec}(\Gamma)) \right)^{-1} \mathbf{1}.

Using Equation 1, we can write H as a vector and rearrange to write

\left( G^{\otimes} - \operatorname{diag}(\operatorname{vec}(\Gamma)) \right) \operatorname{vec}(H) = -\mathbf{1}. \qquad (3)
In this document, the solution for H given G and γ will be called H(G, γ). This can be rewritten using the single-chain generator matrix as follows. First, note that equation 3 is

\sum_j G^{\otimes}_{kj} H_j - \gamma_k H_k = -1 \quad \text{for all } k.

Letting k denote the state where the two chains are in states a and b, and j denote the state where the two chains are in states c and d, this is

\sum_{c,d} G^{\otimes}_{(ab),(cd)} H_{cd} - \Gamma_{ab} H_{ab} = -1 \quad \text{for all } a, b.

By equation (2),

G^{\otimes}_{(ab),(cd)} = G_{ac}\, \delta_{bd} + \delta_{ac}\, G_{bd},

where δ_ab is 1 if a = b and 0 otherwise. Substituting this into the previous equation gives us

\sum_{c,d} \left( G_{ac}\, \delta_{bd} + \delta_{ac}\, G_{bd} \right) H_{cd} - \Gamma_{ab} H_{ab} = -1
\sum_{c} G_{ac} H_{cb} + \sum_{d} G_{bd} H_{ad} - \Gamma_{ab} H_{ab} = -1.

So, back in matrix form, we have

G H + H G^T - \Gamma \circ H = -\mathbf{1}\mathbf{1}^T, \quad \text{where } \circ \text{ is the element-wise product.} \qquad (4)
Note that H and Γ are now also written as matrices, with Γ being diagonal. If we want to solve this equation for G, we also must satisfy the condition that G is a generator matrix (i.e., the off-diagonal elements are greater than or equal to zero, and the rows sum to zero).
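The forward computation H(G, γ) can be written in a few lines; the sketch below (illustrative Python with my own variable names, not the dissertation's implementation) assembles the product-chain generator of equation (2) with Kronecker products and solves equation (3):

    import numpy as np

    def coalescence_times(G, gamma):
        """Mean coalescence time matrix H(G, gamma) from equation (3).
        G: n x n single-particle generator; gamma: length-n coalescence rates."""
        n = G.shape[0]
        I = np.eye(n)
        G_prod = np.kron(G, I) + np.kron(I, G)      # equation (2)
        # diag(vec(Gamma)): rate gamma[a] only in product state (a, a),
        # which sits at index a*n + a under this ordering.
        coal = np.zeros(n * n)
        coal[np.arange(n) * (n + 1)] = gamma
        # Equation (3): (G_prod - diag(vec(Gamma))) vec(H) = -1.
        vec_H = np.linalg.solve(G_prod - np.diag(coal), -np.ones(n * n))
        return vec_H.reshape(n, n)

    # Two-state chain from Figure 2 with illustrative rates.
    G = np.array([[-0.2, 0.2],
                  [ 0.9, -0.9]])
    print(coalescence_times(G, np.array([1.3, 1.3])))

The returned matrix is symmetric, as a valid H must be.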
To solve Problem 1, we need only solve equation (4). While this form is good for looking at the overall structure of the problem, in order to solve it using linear algebra software, we would like it to be of the form Ag = −1, where g is a vector containing the unknown values of G and γ. Since equation 4 is linear in the entries of G and Γ, it is possible to rearrange it to this form. Note that since the left side of (4) is symmetric, we will only have n(n + 1)/2 independent equations (that is, one for every entry on or above the diagonal).
3.1.1 Existence and Uniqueness of Solutions
Now that we have the basic equation we wish to solve (equation 3 or 4, for values of G and Γ), we can talk about the existence and uniqueness of solutions to this equation. As mentioned before, an H that is not symmetric cannot have a meaningful solution, since that would require the expected coalescence time of two particles to be different depending on how the particles were labeled. A solution where H has negative values may be possible; however, it is not of interest since it is not meaningful in terms of coalescence times. There are likely many other constraints on the possible values of H that are much harder to identify.

If the matrix H does come from a valid Markov chain, we would also like to know whether there is a unique solution. Suppose we have a matrix Z such that V = ZH is skew-symmetric; that is, ZH = −(ZH)^T. Then, if we already have one solution to equation (4), we can add ZH + (ZH)^T (which is zero) to one side of the equation to obtain another solution to (4), as follows:
T
=GH +HG
T
+ZH + (ZH)
T
H
11
T
=GH +ZH +HG
T
+HZ
T
H
11
T
= (G +Z)H +H(G +Z)
T
H:
Therefore, if G is a solution, then G + Z is also a solution. For the non-uniqueness to be of much interest, we would also need G + Z to be a generator matrix.

This implies that in many situations, the solution to (4) may not be unique. In practice, however, we would often assume that movement directly between distant locations would not be possible (i.e., many G_ij's would be zero). Adding such constraints can significantly reduce the number of parameters we have to find, reducing the problem of nonidentifiability. For example, if the state space is a two dimensional grid with n spaces and we know that the only transitions that occur are nearest neighbor migrations, we can reduce the number of movement parameters from n² to about 4n, with the exact number of parameters depending on the specific resolution of the grid. Since the number of equations would be n(n + 1)/2, we would typically have more equations than unknowns, resulting in a unique solution.
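To make the counting concrete (an illustrative count, consistent with the grids used later in this document): an r × c grid with nearest neighbor migration has 2(r(c − 1) + c(r − 1)) directed movement parameters. A 3 × 2 grid therefore has 14 movement parameters against n(n + 1)/2 = 21 equations, while a 20 × 20 grid has 1,520 movement parameters (roughly 4n for n = 400) against 80,200 equations.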
Figure 2: A simple two state Markov chain for a single particle, with rates of movement between the two states.
3.1.2 The Simplest Example
Consider a Markov chain with two states, where the rate of movement from state 1 to state 2 is G_12 and the rate of movement from state 2 to state 1 is G_21 (see Figure 2). The generator matrix for this chain is

G = \begin{bmatrix} -G_{12} & G_{12} \\ G_{21} & -G_{21} \end{bmatrix}.

Using equation 2, we can find G^⊗, the generator matrix for the product chain, where we keep track of two particles:

G^{\otimes} = \begin{bmatrix} -2G_{12} & G_{12} & G_{12} & 0 \\ G_{21} & -G_{12} - G_{21} & 0 & G_{12} \\ G_{21} & 0 & -G_{12} - G_{21} & G_{12} \\ 0 & G_{21} & G_{21} & -2G_{21} \end{bmatrix}.
Now, in order to account for coalescence, we can add one more state, labeled † in Figure 3, and a corresponding row and column, which gives us

\tilde{G}^{\otimes} = \begin{bmatrix} -2G_{12} - \gamma(1) & G_{12} & G_{12} & 0 & \gamma(1) \\ G_{21} & -G_{12} - G_{21} & 0 & G_{12} & 0 \\ G_{21} & 0 & -G_{12} - G_{21} & G_{12} & 0 \\ 0 & G_{21} & G_{21} & -2G_{21} - \gamma(2) & \gamma(2) \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}.
These movement rates are depicted in Figure 3. Multiplying out either equation 3 or 4, we get the following system of equations:

-H_{11} G_{12} + H_{21} G_{12} - H_{11} G_{12} + H_{12} G_{12} - H_{11} \gamma(1) = -1
-H_{12} G_{12} + H_{22} G_{12} + H_{11} G_{21} - H_{12} G_{21} = -1
H_{11} G_{21} - H_{21} G_{21} - H_{21} G_{12} + H_{22} G_{12} = -1
H_{12} G_{21} - H_{22} G_{21} + H_{21} G_{21} - H_{22} G_{21} - H_{22} \gamma(2) = -1.
Figure 3: The product chain of the Markov chain in Figure 2 with coalescence added.
Note that since H is symmetric, the second and third equations are the same, so we have three equations for the four unknowns, G_12, G_21, γ(1), and γ(2). By grouping terms together, factoring back into matrix form, and writing H_ji instead of H_ij when i > j, we have

\begin{bmatrix} -2H_{11} + 2H_{12} & 0 & -H_{11} & 0 \\ -H_{12} + H_{22} & H_{11} - H_{12} & 0 & 0 \\ 0 & 2H_{12} - 2H_{22} & 0 & -H_{22} \end{bmatrix} \begin{bmatrix} G_{12} \\ G_{21} \\ \gamma(1) \\ \gamma(2) \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \\ -1 \end{bmatrix}.
This is small enough to be solved symbolically. If H_11 ≠ H_12, we can write the solution with γ(1) as the free variable:

G_{12} = \frac{1}{2(H_{11} - H_{12})} - \frac{H_{11}}{2(H_{11} - H_{12})}\, \gamma(1)

G_{21} = \frac{-2(H_{11} - H_{12}) + (H_{12} - H_{22})}{2(H_{11} - H_{12})^2} - \frac{H_{11}(H_{12} - H_{22})}{2(H_{11} - H_{12})^2}\, \gamma(1)

\gamma(2) = \frac{(H_{11} - 2H_{12} + H_{22})^2}{H_{22}(H_{11} - H_{12})^2} - \frac{H_{11}(H_{12} - H_{22})^2}{H_{22}(H_{11} - H_{12})^2}\, \gamma(1).
This implies that γ(2) will always become smaller as γ(1) increases, as long as H_12 is not equal to H_22. This makes sense because, if it has an effect, increasing a rate of coalescence could only make the expected coalescence times shorter, so in order to keep them the same, the rate of coalescence somewhere else must be decreased. This also means that, in order for γ(2) to be a value other than 0, we must have H_11 − 2H_12 + H_22 ≠ 0. In general, we can say that if there is a value or range of values of γ(1) for which the other movement and coalescence parameters are non-negative, then a solution exists.
Figure 4: The green shaded region shows the valid range of the parameter values for the two state Markov chain given that H_11 = 1, H_12 = 2, and H_22 = 1.5. Each line shows the other parameters as a function of the free parameter γ(1).

Now we can plug in some real numbers to see the result. Let H_11 = 1, H_12 = 2, and H_22 = 1.5 in arbitrary time units. In this case, the other parameters will be non-negative when 1 ≤ γ(1) ≤ 5. This can be seen in Figure 4, where the range of valid values of γ(1) is shaded in green. If we suppose γ(1) = γ(2) (Problem Modification 2), then there is a unique solution, with γ(1) = γ(2) = 1.29, G_12 = 0.14, and G_21 = 0.93. Note that if we choose values of H for which there is no solution, this means that H is not a valid average coalescence time matrix.
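These numbers can be checked with a quick numerical sketch (Python, illustrative only): imposing γ(1) = γ(2) merges the two coalescence columns of the 3 × 4 system above into one, leaving a 3 × 3 system that can be solved directly.

    import numpy as np

    H11, H12, H22 = 1.0, 2.0, 1.5
    # Unknowns ordered (G12, G21, gamma); the gamma(1) and gamma(2) columns of
    # the 3 x 4 system above are merged because the two rates are set equal.
    A = np.array([[-2 * H11 + 2 * H12, 0.0,               -H11],
                  [-H12 + H22,         H11 - H12,          0.0],
                  [0.0,                2 * H12 - 2 * H22, -H22]])
    G12, G21, gamma = np.linalg.solve(A, -np.ones(3))
    print(round(G12, 2), round(G21, 2), round(gamma, 2))   # 0.14 0.93 1.29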
In order to do inference in this case, we need to remove one degree of freedom; for example, by requiring the two coalescence rates to be equal. On larger grids with nearest neighbor migration, non-identifiability will be less of an issue as long as we have full, exact data.
3.2 Noisy Coalescence Times
With real world data, our average coalescence times will never be perfectly accurate, so it makes more sense to solve the problem acknowledging this fact. As in Problem 2, suppose that Ĥ_ij = H_ij + ε_ij, where ε is normally distributed with mean 0 and covariance Σ. For now, we will make a few assumptions about Σ. First, that all ε_ij are independent, identically distributed N(0, σ²) for i ≤ j, and ε_ij = ε_ji. Additionally, we will assume that H_ij is large compared to σ, such that the probability that a value of Ĥ is ever negative is negligible. Writing f(x) as the probability density function of the multivariate normal distribution ε, the likelihood function for the pair (G, γ) is

L(G, \gamma; \hat{H}) = f(H(G, \gamma) - \hat{H}) = f\left( -\left( G^{\otimes} - \operatorname{diag}(\operatorname{vec}(\Gamma)) \right)^{-1} \mathbf{1} - \hat{H} \right),

recalling that H(G, γ) is the hitting time matrix for a given pair (G, γ) defined by equation 3. Since we can solve Problem 1 if Ĥ is a valid coalescence time matrix, we can then find the maximum likelihood estimate (MLE) of G; however, this is in practice unlikely to happen, and the likelihood function is still useful for providing information about the uncertainty of G.
Putting a prior distribution on G and γ could help to determine the error bounds on G; however, we would have to make sure that the chosen priors do not greatly influence the answer. Assuming this is the case, we can use Markov chain Monte Carlo (MCMC) to sample from the posterior distribution of G and γ [Brooks et al., 2011]. For example, suppose that we know which non-diagonal entries of G are non-zero (Problem Modification 1). We say that i is adjacent to j if i ≠ j and G_ij is one of the values known to be non-zero. Then we could let the prior probability density for the movement parameters be that of an exponential random variable with rate parameter λ_G. That is, the prior likelihood of G would be

L_{\lambda_G}(G) = \begin{cases} \lambda_G^{m_G} \prod_{i \sim j} e^{-\lambda_G G_{ij}} & \text{for } G_{ij} \geq 0 \ \forall\ i \neq j \\ 0 & \text{otherwise,} \end{cases}

where i ∼ j means that i and j are adjacent and m_G is the number of positive values of G. Similarly, we can put a prior density on γ:

L_{\lambda_\gamma}(\gamma) = \begin{cases} \lambda_\gamma^{n} \prod_{i=1}^{n} e^{-\lambda_\gamma \gamma_i} & \text{for } \gamma_i \geq 0 \ \forall\ i \\ 0 & \text{otherwise.} \end{cases}
Recall that n is the number of states in the single particle Markov chain. If we assume that all ε_ij are IID with ε_ij ∼ N(0, σ²) for i ≤ j, we can write the likelihood for Ĥ:

L(\hat{H}; G, \gamma) = \frac{1}{(2\pi\sigma^2)^{m/2}} \prod_{i \leq j} \exp\left\{ -\frac{\left( H_{ij}(G, \gamma) - \hat{H}_{ij} \right)^2}{2\sigma^2} \right\},
where m is the number of nonmissing values on or above the diagonal in Ĥ. Then the posterior distribution is

L(\hat{H}; G, \gamma)\, L_{\lambda_G}(G)\, L_{\lambda_\gamma}(\gamma) = \frac{\lambda_G^{m_G} \lambda_\gamma^{n}}{(2\pi\sigma^2)^{m/2}} \exp\left\{ -\lambda_G \sum_{i \sim j} G_{ij} - \lambda_\gamma \sum_{i=1}^{n} \gamma_i - \frac{1}{2\sigma^2} \sum_{i \leq j} \left( H_{ij}(G, \gamma) - \hat{H}_{ij} \right)^2 \right\} \qquad (5)

with the same support over G and γ as in their individual priors. To sample from this distribution, we will use the Metropolis-Hastings (MH) algorithm [Brooks et al., 2011]. This provides a method to solve Problems 2 and 3.
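A minimal random-walk Metropolis-Hastings sampler for the log of (5) might look like the following sketch (Python; log_posterior is assumed to evaluate the log of the unnormalized posterior (5) for a vector of free parameters and to return -inf outside the support; names and tuning values are illustrative, not the dissertation's implementation):

    import numpy as np

    def metropolis_hastings(log_posterior, theta0, n_iter=100_000,
                            step_sd=0.05, seed=None):
        """Random-walk MH over the free parameters (entries of G and gamma)."""
        rng = np.random.default_rng(seed)
        theta = np.asarray(theta0, dtype=float)
        logp = log_posterior(theta)
        samples = np.empty((n_iter, theta.size))
        for i in range(n_iter):
            proposal = theta + rng.normal(scale=step_sd, size=theta.size)
            logp_new = log_posterior(proposal)
            # Accept with probability min(1, posterior ratio).
            if np.log(rng.uniform()) < logp_new - logp:
                theta, logp = proposal, logp_new
            samples[i] = theta
        return samples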
3.2.1 Quadratic Approximation to the Likelihood
We also find the first and second derivatives of the log-likelihood function analytically. This was originally done in order to explore the possibility of using the Hamiltonian Monte Carlo (HMC) [Betancourt, 2017] method as an alternative to Metropolis-Hastings; however, in practice, we found that in our case the overall mixing time was longer with HMC than MH, due to the increased computation time per iteration outweighing the decreased number of iterations needed. Nevertheless, although they are not directly used in my analysis, the first and second derivatives of the log-likelihood can be used to find the quadratic approximation of the log-likelihood surface at a given point, telling us the directions in parameter space in which the log-likelihood is most sensitive to change.

First, let us define ℓ = log{L(Ĥ; G, γ) L_{λ_G}(G) L_{λ_γ}(γ)}, ℓ_Ĥ = log{L(Ĥ; G, γ)}, ℓ_G = log{L_{λ_G}(G)}, and ℓ_γ = log{L_{λ_γ}(γ)}. We can see that

\frac{\partial}{\partial g_i} \ell = \frac{\partial}{\partial g_i} \ell_{\hat{H}} + \frac{\partial}{\partial g_i} \ell_{G} + \frac{\partial}{\partial g_i} \ell_{\gamma}

for all g_i, where g_i is a model parameter (a value of G or γ). Since the second two terms are trivial to compute, we will focus on the first. Using the chain rule, we see that

\frac{\partial}{\partial g_i} \ell_{\hat{H}} = \sum_{ab} \frac{d}{dH_{ab}} \ell_{\hat{H}}\; \frac{\partial}{\partial g_i} H_{ab},
where \frac{d}{dH_{ab}} \ell_{\hat{H}} = -\frac{1}{\sigma^2}\left( H(G, \gamma)_{ab} - \hat{H}_{ab} \right). We can find ∂H/∂g_i by differentiating Equation 3, then solving for ∂H/∂g_i:

\frac{\partial}{\partial g_i} \left[ \left( G^{\otimes} - \operatorname{diag}(\operatorname{vec}(\Gamma)) \right) \operatorname{vec}(H) \right] = \frac{\partial}{\partial g_i} (-\mathbf{1})

\left[ \frac{\partial}{\partial g_i} \left( G^{\otimes} - \operatorname{diag}(\operatorname{vec}(\Gamma)) \right) \right] \operatorname{vec}(H) + \left( G^{\otimes} - \operatorname{diag}(\operatorname{vec}(\Gamma)) \right) \frac{\partial}{\partial g_i} \operatorname{vec}(H) = 0

\left( G^{\otimes} - \operatorname{diag}(\operatorname{vec}(\Gamma)) \right) \frac{\partial}{\partial g_i} \operatorname{vec}(H) = -\left[ \frac{\partial}{\partial g_i} \left( G^{\otimes} - \operatorname{diag}(\operatorname{vec}(\Gamma)) \right) \right] \operatorname{vec}(H). \qquad (6)

This equation can be solved for ∂vec(H)/∂g_i given values of G, Γ, and H using standard linear algebra.
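In code, each such derivative costs one extra linear solve against the same matrix. A sketch (Python, illustrative; the derivative of the product-chain generator with respect to a single movement rate G_kl is assembled by hand) is:

    import numpy as np

    def d_vecH_d_Gkl(G, gamma, k, l):
        """Derivative of vec(H) with respect to the movement rate G[k, l] (k != l),
        from equation (6). Product state (a, b) is indexed as a*n + b."""
        n = G.shape[0]
        I = np.eye(n)
        coal = np.zeros(n * n)
        coal[np.arange(n) * (n + 1)] = gamma
        M = np.kron(G, I) + np.kron(I, G) - np.diag(coal)   # G_prod - diag(vec(Gamma))
        vec_H = np.linalg.solve(M, -np.ones(n * n))          # equation (3)
        # dG/dG_kl has +1 at (k, l) and -1 at (k, k), so rows still sum to zero;
        # the coalescence term does not depend on G.
        E = np.zeros((n, n))
        E[k, l], E[k, k] = 1.0, -1.0
        dM = np.kron(E, I) + np.kron(I, E)
        # Equation (6): M * d vec(H)/dG_kl = -dM * vec(H).
        return np.linalg.solve(M, -dM @ vec_H)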
If we wish to compute the second derivative as well, we can proceed in a similar manner as before:

\frac{\partial^2}{\partial g_i\, \partial g_j} \ell = \frac{\partial^2}{\partial g_i\, \partial g_j} \ell_{\hat{H}} + \frac{\partial^2}{\partial g_i\, \partial g_j} \ell_{G} + \frac{\partial^2}{\partial g_i\, \partial g_j} \ell_{\gamma}.

Again, the right two terms are trivial to compute (and will evaluate to 0 with exponential priors on G and γ). Using the chain and product rules, we have

\frac{\partial^2}{\partial g_i\, \partial g_j} \ell_{\hat{H}} = \sum_{ab} \left[ \frac{\partial}{\partial g_j}\left( \frac{d}{dH_{ab}} \ell_{\hat{H}} \right) \frac{\partial}{\partial g_i} H_{ab} + \frac{d}{dH_{ab}} \ell_{\hat{H}}\; \frac{\partial^2}{\partial g_i\, \partial g_j} H_{ab} \right],
where we can use the chain rule to see that \frac{\partial}{\partial g_j} \frac{d}{dH_{ab}} \ell_{\hat{H}} = \frac{d^2}{dH_{ab}^2} \ell_{\hat{H}}\; \frac{\partial}{\partial g_j} H_{ab}. In this case, we could also write \frac{\partial}{\partial g_j} \frac{d}{dH_{ab}} \ell_{\hat{H}} = \frac{\partial}{\partial g_j}\left( -\frac{1}{\sigma^2}(H_{ab} - \hat{H}_{ab}) \right) = -\frac{1}{\sigma^2} \frac{\partial}{\partial g_j} H_{ab}, which gives us

\frac{\partial^2}{\partial g_i\, \partial g_j} \ell_{\hat{H}} = \sum_{ab} \left[ -\frac{1}{\sigma^2} \frac{\partial}{\partial g_i} H_{ab}\; \frac{\partial}{\partial g_j} H_{ab} - \frac{1}{\sigma^2} \left( H_{ab} - \hat{H}_{ab} \right) \frac{\partial^2}{\partial g_i\, \partial g_j} H_{ab} \right].
In order to find \frac{\partial^2}{\partial g_i\, \partial g_j} H, we can differentiate (6) with respect to g_j and solve. Writing M = G^{\otimes} - \operatorname{diag}(\operatorname{vec}(\Gamma)) for brevity,

\frac{\partial}{\partial g_j} \left[ M\, \frac{\partial}{\partial g_i} \operatorname{vec}(H) \right] = \frac{\partial}{\partial g_j} \left[ -\frac{\partial M}{\partial g_i}\, \operatorname{vec}(H) \right]

\frac{\partial M}{\partial g_j}\, \frac{\partial}{\partial g_i} \operatorname{vec}(H) + M\, \frac{\partial^2}{\partial g_i\, \partial g_j} \operatorname{vec}(H) = -\frac{\partial^2 M}{\partial g_i\, \partial g_j}\, \operatorname{vec}(H) - \frac{\partial M}{\partial g_i}\, \frac{\partial}{\partial g_j} \operatorname{vec}(H)

M\, \frac{\partial^2}{\partial g_i\, \partial g_j} \operatorname{vec}(H) = -\frac{\partial^2 M}{\partial g_i\, \partial g_j}\, \operatorname{vec}(H) - \frac{\partial M}{\partial g_i}\, \frac{\partial}{\partial g_j} \operatorname{vec}(H) - \frac{\partial M}{\partial g_j}\, \frac{\partial}{\partial g_i} \operatorname{vec}(H).

Since \frac{\partial M}{\partial g_i} is a matrix of constants, \frac{\partial^2 M}{\partial g_i\, \partial g_j} evaluates to 0, leaving us with

M\, \frac{\partial^2}{\partial g_i\, \partial g_j} \operatorname{vec}(H) = -\frac{\partial M}{\partial g_i}\, \frac{\partial}{\partial g_j} \operatorname{vec}(H) - \frac{\partial M}{\partial g_j}\, \frac{\partial}{\partial g_i} \operatorname{vec}(H).

Note that the matrix M on the left hand side of this equation is the same as in equations 6 and 3.
3.3 Incomplete Observations
With real world data, unless the number of states is small or we have a large amount of data, our pairwise average coalescent time matrix will have a large number of unobserved values. For example, if our landscape is a 20 × 20 grid and we take 100 samples, each from a different grid square, we would have only 100(101)/2 out of 400(401)/2 of the unique values for H. Note that we are including the self-comparisons in this example; if the organisms were haploid or inbreeding were substantial, our H matrix would have only 99(100)/2 usable values.

In order to find G and γ using only a subset of the values of H, we apply the MCMC method above, leaving the unknown values unconstrained. Essentially, we can still use Equation 5, but the rightmost sum will be only over observed entries of H.
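Concretely, only the Gaussian term in (5) changes: the sum of squares runs over observed entries of Ĥ. A minimal sketch (Python, illustrative) of that masked sum is:

    import numpy as np

    def masked_sq_error(H_model, H_hat, observed):
        """Sum of squared residuals over observed entries on or above the
        diagonal; missing comparisons simply contribute nothing."""
        mask = np.triu(np.ones_like(H_model, dtype=bool)) & observed
        return np.sum((H_model[mask] - H_hat[mask]) ** 2)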
3.4 Comparing Coalescence Time and Commute Time Methods
Another approach in spatial population genetics is to use commute time (also known as resistance distance) instead of coalescence time. Commute time between two locations i and j is the expected time to hit j from i plus the expected time to hit i from j; that is, R_ij = h_j(i) + h_i(j). In order to be more comparable to coalescence time, we must also account for local diversity. This is because coalescence time accounts for both the first meeting time and the eventual coalescence, while commute time is more analogous to the first meeting time. For local diversity, we will also have another set of model parameters, q, where q_i approximates the observed coalescence time in location i. We can now approximate coalescence time using commute time as done in Petkova et al. [2016]:

H_{ij} \approx \tilde{H}_{ij} = R_{ij}/4 + q_i/2 + q_j/2. \qquad (7)

To gain an intuitive feel for why resistance distance is divided by 4, consider how, if the movement rates are symmetric, a single particle moving from one state to another is similar to (but not the same as) the first meeting time of two particles, one moving forward in time from the first location and the second one moving backward in time from the other, so it is similar to twice the first meeting time. Since the commute time is the sum of the hitting times in each direction, it is similar to 4 times the first meeting time. For a conceptual picture of the differences between the coalescence time and commute time, see Figure 5. The average of the local diversity estimates (q) is then added to the first meeting time estimate to account for the eventual coalescence after meeting. In this case, these values are parameters in the model to be inferred instead of within-deme coalescence rates (γ). Since the values of q are typically fairly easy to infer and requiring them all to be the same can significantly decrease the quality of the fit, we will not be using an analogue of Problem Modification 2 (where we would only have one parameter for q). Commute time inference has the advantage of only needing to use the base Markov chain instead of the product chain, but it also leads to incorrect inferences in certain circumstances. Using the base Markov chain instead of the product chain is advantageous because, as the grid size increases, it is much more computationally efficient to work with a Markov chain with n states than n² states, since the matrices would be much smaller.

Figure 5: Illustration of the conceptual differences between coalescence and commute time between states 1 and 3 for a continuous time Markov chain with four linearly connected states. Coalescence time (left) is the time for two independently moving particles to meet and coalesce (it is possible for them to be in the same location without coalescing). Commute time is the time for a single particle starting in one state to travel to another state and then back to the original state. Note how the time axis is scaled to be faster in the figure showing commute time than in the one for coalescence time in order for them to fit in the same space.
For a simple example where the coalescence times and commute time approximations are very different, consider a system of three islands in a line, where lineages currently in the outer islands are much more likely to come from the central island than the other way around. In this case, lineages are likely to quickly move to the center and coalesce regardless of starting position, so coalescence time will be relatively short between all individuals. Commute time between the two outer locations, on the other hand, will be much longer since it requires the lineage to leave the center. If we plug real numbers into this example, say, a reverse time gene flow rate of 1 into the center and 0.1 out of it with γ = 1 everywhere, then H_13 = H_31 = 2.90 but H̃_13 = H̃_31 = 7.94. However, the fact that H̃ as predicted from the correct parameters may be far from H does not mean that the parameters inferred using resistance distance methods will necessarily be very wrong. In principle, it still might be that the parameters for which H̃ is closest to H are close to the real parameters, at least in some situations. This will be investigated further in section 4.4.
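This three-island example can be reproduced numerically (a sketch in Python, illustrative only): build the generator for the chain in Figure 6, compute H from equation (3) and the single-particle hitting times from equation (1), and form the approximation of equation (7), here taking q_i to be the within-island coalescence time H_ii, which is what q is meant to approximate.

    import numpy as np

    # Figure 6: three islands in a line, reverse-time gene flow at rate 1 into
    # the center island and 0.1 out of it, coalescence rate 1 everywhere.
    G = np.array([[-1.0,  1.0,  0.0],
                  [ 0.1, -0.2,  0.1],
                  [ 0.0,  1.0, -1.0]])
    gamma = np.ones(3)
    n = 3
    I = np.eye(n)

    # Coalescence times H from equation (3).
    coal = np.zeros(n * n)
    coal[np.arange(n) * (n + 1)] = gamma
    M = np.kron(G, I) + np.kron(I, G) - np.diag(coal)
    H = np.linalg.solve(M, -np.ones(n * n)).reshape(n, n)

    # Hitting times from equation (1), then commute times R_ij = h_j(i) + h_i(j).
    h = np.zeros((n, n))
    for j in range(n):
        keep = [i for i in range(n) if i != j]
        h[keep, j] = np.linalg.solve(G[np.ix_(keep, keep)], -np.ones(n - 1))
    R = h + h.T

    # Equation (7) with q_i = H_ii.
    q = np.diag(H)
    H_tilde = R / 4 + q[:, None] / 2 + q[None, :] / 2
    print(round(H[0, 2], 2), round(H_tilde[0, 2], 2))   # about 2.90 versus 7.94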
Figure 6: A simple example where the coalescence time and commute time based approximation are very different. In this case, commute time between the two outer locations will be much longer than coalescence time.

One might question whether such a landscape could exist in nature. After all, if the forwards time migration rates are asymmetric, it would seem like this could result in some locations having far greater population density than other locations. However, because real habitats have a finite carrying capacity and some populations have more individuals born per generation than could survive to adulthood, it is possible for there to be relatively constant population density over time. For example, if individuals tend to have higher fecundity in a particular region, it could act as a "source" for the surrounding areas, resulting in reverse time gene flow biased into that region.
3.5 MCMC Validation
The two MCMC methods used in this analysis are random walk Metropolis-Hastings (MH) [Brooks et al., 2011] and Hamiltonian (Hybrid) Monte Carlo (HMC) [Betancourt, 2017]. For the MH algorithm, a random walk is used to generate proposals for the new location in parameter space. If the new location is of higher likelihood, the move is accepted. If it is of lower likelihood, it is accepted with probability equal to the ratio of the new likelihood divided by the old likelihood. In HMC, the log-likelihood is used as a (negative) potential energy surface, and proposals are generated by modeling the Hamiltonian dynamics of a particle, with momentum drawn from some symmetric random variable, moving on that surface. If it is modeled perfectly, the total energy at the end of each step will be the same as at the beginning of that step, but in practice, energy may increase or decrease. Moves are accepted with probability depending on the total change in energy, with moves automatically being accepted if energy stays constant or decreases and moves being accepted with decreasing probability as energy increases. In both methods, since the chain might start in a region of very low probability density, a period of burn-in is used prior to the beginning of collection of samples for the posterior distribution, in order to allow the MCMC to converge as well as to tune hyperparameters (standard deviations of the proposal distribution for MH and step size for HMC). In order to speed up the process of burn-in, a period of pre-burn-in is used before burn-in, which has a less restrictive likelihood function (larger σ), allowing the MCMC to move more quickly through state space when it is very far from maximum likelihood.
Before applying the method to the case where data is noisy or missing, we should see how it performs with exact data. We will start with a particular set of migration rates between nearest neighbors on a 2 × 3 grid with a constant coalescence rate, using MH with 10^6 iterations and letting σ in the likelihood function be 2% of the average of the values in the coalescence time matrix. Figure 7 shows the trace for parameter 9, with the true value shown as a horizontal line. Looking at Figure 7, we can see that it quickly finds the correct region, although there is a fairly large posterior uncertainty. Results are similar when using HMC with 8 × 10^4 iterations (see Figure 8). If we add noise with variance equal to σ² to H, the true value is still well within the bulk of the posterior distribution (Figure 9). When we only have data from five out of the six locations but that data is exact (Figure 10), or when the data is both noisy and incomplete (Figure 11), the results look similar.

Figure 7: The trace of the MCMC exploring the posterior distribution of parameter 9 for MH with σ² = (mean(H)/50)² when the coalescence time matrix, H, is known fully and exactly.
Note, however, that this is with a fairly high coalescence rate, γ, of about 1.7. If we set γ to 0.2, then the values of H are closer together (relatively), and the problem becomes more difficult (Figure 12). This is due to the fact that if there is a slower rate of coalescence, lineages will move around the landscape more before coalescing, so some of the information is harder to recover. We will discuss this more in section 4.2.
4 Results
We will now examine the performance of the inference methods under various conditions. We test the effect of varying levels of noise in the input, faster or slower coalescence rates in the true model, and assuming that the coalescence rate is (an unknown) constant everywhere. I will also compare both symmetric and asymmetric 3 × 2 graphs as well as coalescence time to commute time based inference. Although some graphs are symmetric (G_ij = G_ji for all i, j), this is not assumed to be the case, so all inference methods have separate parameters for each direction. We also consider the problem of inferring a barrier on a 5 × 3 grid and compare coalescence time and commute time methods on 4 × 4 grids with various degrees of migration bias.
Figure 8: The trace of the MCMC exploring the posterior distribution of parameter 9 for HMC with σ² = (mean(H)/50)² when the coalescence time matrix, H, is known fully and exactly.

Figure 9: The trace of the MCMC exploring the posterior distribution of parameter 9 for MH with σ² = (mean(H)/50)² when the coalescence time matrix, H, has noise with variance equal to σ² added to the true values, but all values are present.

Figure 10: The trace of the MCMC exploring the posterior distribution of parameter 9 for MH with σ² = (mean(H)/50)² when the coalescence time matrix, H, has exact values, but the values for the second row and column are completely unknown.

Figure 11: The trace of the MCMC exploring the posterior distribution of parameter 9 for MH with σ² = (mean(H)/50)² when the coalescence time matrix, H, has noise with variance equal to σ² added to the true values and the values in the second row and column are missing.

Figure 12: This figure shows the time series of values of movement parameter 9 for MH with σ² = (mean(H)/50)² when the coalescence time matrix, H, is known fully and exactly, but with a low rate of coalescence.
4.1 Inference with Noisy Data
Unsurprisingly, the amount of noise in the input values, Ĥ, has an effect on how well we can infer the model parameters. To assess this, we run the MCMC inference method on 25 random symmetric and 25 asymmetric 3 × 2 graphs. For each random graph, movement parameters were generated by rounding independent exponential random variables up to the nearest one tenth. The coalescence rate was fixed at 1 for all locations, and the inference is performed both with and without making the assumption that we know the coalescence rate is the same everywhere (Problem Modification 2). For 3 × 2 graphs, there are 21 equations and 15 unknowns (14 movement parameters and 1 coalescence parameter) assuming all coalescence rates are equal, while without this assumption, there are 21 equations and 20 unknowns (14 movement parameters and 6 coalescence parameters). For the analysis in the next two subsections, we will focus on the inference of the movement parameters. The noise-free coalescence times, H, are then calculated using equation 3. Independent (but symmetric) Gaussian noise is then added at four different noise levels. For each grid, the standard deviation of the noise was 1/1000, 1/200, 1/100, and 1/50 of the mean value of H for that grid. We assume that we can estimate the amount of noise reasonably well, so the true variance of the noise in the model is used in the likelihood function used for inference. MCMC inference is done on each graph and noise level using both the coalescence time and commute time based methods. For each of the graphs in this section and the next, inference is done using 10^5 iterations of MCMC after 2 × 10^4 iterations of pre-burn-in (with a less strict likelihood function) and 8 × 10^4 iterations of burn-in.
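For concreteness, the random draws described above can be generated along these lines (a Python sketch; the exact random-number setup used for the dissertation's experiments is not shown here, so names and the seed are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    # Movement rates: independent exponential draws (mean 1), rounded up to the
    # nearest tenth, one per directed nearest-neighbor edge of the 3 x 2 grid.
    n_edges = 14
    g_true = np.ceil(rng.exponential(scale=1.0, size=n_edges) * 10) / 10

    def add_noise(H, relative_sd, rng):
        """Symmetric Gaussian noise with sd equal to a fraction of mean(H),
        e.g. relative_sd in {1/1000, 1/200, 1/100, 1/50}."""
        noise = rng.normal(scale=relative_sd * H.mean(), size=H.shape)
        noise = np.triu(noise) + np.triu(noise, 1).T   # keep H_hat symmetric
        return H + noise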
We compute the mean of the absolute error between the true values under the model and the posterior medians for each grid, noise level, and inference type. Let ĝ and γ̂ be the posterior medians of the MCMC inference of g and γ, respectively. Boxplots of these mean absolute errors in ĝ are shown in Figure 13. Here, we see that when using coalescence time inference with a single coalescence parameter, inference is quite accurate, but the error increases with noise. Commute time inference has an often substantially larger error than coalescence time inference. The size of this error does not increase with higher noise and actually seems to decrease slightly. This is likely due to there being a larger difference between H̃ and H than between Ĥ and H, so the additional noise is a relatively minor factor. We also see that allowing separate coalescence parameters for each location (that is, not using Problem Modification 2) makes the problem substantially more difficult. Errors for coalescence time inference are still smaller than those for commute time inference but are much larger than the errors obtained in the constant coalescence rate case, even with small amounts of noise. Since we always use separate q parameters when using commute time inference, the plots for "Multiple Coalescence Parameters" are replicates in the case of commute time inference. Generally, coalescence time based inference performs about equally as well in the asymmetric case as in the symmetric case, while commute time based inference does slightly better in the symmetric case. In all cases, increasing the noise (and therefore the standard deviation in the likelihood function) increased the interquartile range of the sample posterior distributions, which is shown in Figure 14.
4.2 Eect of Coalescence Rate on Inference
As suggested by the results above, the coalescence rate in the underlying model can have a strong effect on how well we can estimate the model parameters. This makes intuitive sense because the slower coalescence is compared to movement, the more the population will homogenize before coalescence is likely to occur, making it more effectively panmictic. We perform the same type of analysis as before on 25 symmetric and 25 asymmetric graphs with a noise level of 1/200 and coalescence rates of γ = 10, γ = 1, and γ = 0.1. We assume that we have data from all locations. As expected, slowing the coalescence rate in the model decreases the accuracy with which we can estimate the movement parameters using coalescence time inference (see Figure 15). Again, coalescence time based inference is more accurate than commute time based inference except in the case where there is a separate coalescence parameter for each location and the coalescence rate is very slow, where results can be very wrong, with some having mean errors larger than the average parameter value. The reason for this is likely that the problem becomes ill conditioned when there are so many parameters and the population is close to panmictic. Slowing the coalescence rate also increases the size of the interquartile ranges (shown in Figure 16), for the same reason as the decreased accuracy.
Figure 13: Boxplots of the mean absolute error, defined to be the absolute value difference between the true value and posterior median, averaged across movement parameters g for each of the 25 graphs in each situation. Each box is labeled with the relative amount of noise added to H to make Ĥ. The top left shows the results for symmetric graphs when there is a single coalescence parameter for all locations, the top right for symmetric graphs when there is a separate coalescence parameter for each location, the bottom left for asymmetric graphs when there is a single coalescence parameter for all locations, and the bottom right for asymmetric graphs when there is a separate coalescence parameter for each location.
Figure 14: Boxplots of the mean interquartile ranges of the posterior distributions of g for each of the 25 graphs in each situation. Each box is labeled with the relative amount of noise added to H to make Ĥ. Subplot locations for each situation are the same as in Figure 13.
Figure 15: Boxplots of the mean absolute error, defined to be the absolute value difference between the true value and posterior median, averaged across movement parameters g for each of the 25 graphs in each situation. Each box is labeled with the coalescence rate in the underlying model for each situation. Subplot locations for each situation are the same as in Figure 13.
Figure 16: Boxplots of the mean interquartile ranges of the posterior distributions of g for each of the 25 graphs in each situation. Each box is labeled with the coalescence rate in the underlying model for each situation. Subplot locations for each situation are the same as in Figure 13.
4.3 Inference of a Barrier
We now test the method's ability to locate barriers to migration on a 5 × 3 grid with asymmetric gene flow. There are impassable boundaries on the bottom left and top right of the landscape, but the overall graph remains fairly well connected (see Figure 17). Each value of g in the true model is determined randomly as before. The coalescence rates are set to 1 everywhere, and for the first case, we assume that we know the rate is the same everywhere, although the specific value is not known. Gaussian noise with standard deviation equal to 1/1000 of the mean of H is added to form Ĥ, and inference is performed using all values of Ĥ. All inference in this section is done using 5 × 10^6 iterations of MCMC after 4 × 10^6 iterations of burn-in and 1 × 10^6 iterations of pre-burn-in. In this case, the mean absolute error of ĝ was 0.019, while the error of γ̂ was 0.0028. This analysis is repeated in three more situations, all with noise set to have standard deviation 1/100 of the mean of H. In the first case, there are no missing values of Ĥ; in the second case, values from rows and columns corresponding to populations 2, 5, 7, and 15 are missing from Ĥ. The third case is the same as the second except that there is a coalescence parameter for each location on the grid instead of one parameter for all locations. With no missing values and noise of 1/100, the mean absolute error of ĝ was 0.17 and the error of γ̂ was 0.0066. With the four missing values, the mean absolute error of ĝ was 0.37 and the error of γ̂ was 0.033. With missing values and multiple coalescence parameters, the mean absolute error of ĝ was 0.52 and the mean absolute error of γ̂ was 0.42. Boxplots of the posterior distributions of g can be seen in Figure 18. We can see that the inference method does very well in the case with noise of 1/1000. With each modification, the problem becomes less well defined, so it is not surprising that the error increases as well. Notably, we are still able to identify the barrier except when there is data missing from both locations on either side of the barrier, as is the case for movement parameters g_5 and g_17 with locations 2 and 7. Additionally, inference of the coalescence rate is very accurate when there is only one coalescence rate parameter, but is much harder with multiple parameters, which in turn makes the movement parameters harder to estimate.
4.4 Comparison of Commute and Coalescence Time Inferences
In order to further illustrate the differences between commute time and coalescence time inference, we test both methods on four different 4 × 4 graphs (shown in Figure 19): one where movement parameters are all equal to 1 (which we will call the uniform case), one where movement parameters are symmetric but randomly generated, one asymmetric graph where all movement parameters are randomly generated, and one asymmetric graph where there is a bias in moving towards one direction in the graph. For the symmetric and unbiased asymmetric graphs, movement parameters are chosen randomly as before. In the biased asymmetric graph, all parameters for moving left or down are 2 while all parameters for moving right or up are 0.5. Differences between true mean coalescence times (H) and mean commute times (H̃) are shown for each graph in Figure 20.
Figure 17: Plot of the grid structure for the 5 × 3 grid with barriers. Values for the non-zero movement rates were chosen by rounding independent draws of an exponential random variable with mean 1 up to the nearest tenth. [The grid demes are numbered 1-15 in three rows of five; the movement rates shown are g1=0.5, g2=0.8, g3=0.2, g4=0.1, g5=0, g6=1.2, g7=0.9, g8=0, g9=0.6, g10=4.3, g11=1, g12=0.4, g13=0.6, g14=1.7, g15=2.1, g16=2.4, g17=0, g18=2.4, g19=1.5, g20=0.9, g21=0, g22=0.6, g23=3.8, g24=0, g25=3.4, g26=2.7, g27=0.3, g28=0, g29=1.8, g30=2.5, g31=1.8, g32=0.3, g33=2.5, g34=0.4, g35=0.5, g36=0.2, g37=0, g38=0.3, g39=0.6, g40=0, g41=0.1, g42=3.1, g43=0.2, g44=0.4.]
[Figure 18 appears here: four panels of boxplots ("5x3 Graph with Barriers" with Noise 1/1000; Noise 1/100; Noise 1/100, 4 Missing Values; and Noise 1/100, 4 Missing Values, Multiple Coalescence Parameters), each plotting Parameter Value (rate) against Parameter Index.]
Figure 18: Posterior distributions for values of g for the 5×3 graph with barriers for each of the analysis cases. See Figure 17 for movement parameter locations.
The differences are fairly small in the uniform and symmetric cases, moderate in the asymmetric case, and extreme in the biased asymmetric case. Inference is done using Ĥ with a noise level of 1/200 of the mean value of H for each grid. All trials with each inference method were done using 4 × 10^6 iterations of MCMC after 1 × 10^5 iterations of pre-burn-in and 1 × 10^6 iterations of burn-in. Posterior distributions for each graph and inference method are shown in Figure 21. The coalescence time inference was substantially more accurate in all cases. The commute time based and coalescence time based inference methods had mean absolute errors in ĝ of 0.30 and 0.12 for the uniform graph, 0.49 and 0.053 for the symmetric graph, 0.87 and 0.035 for the asymmetric graph, and 3.27 and 0.053 for the biased asymmetric graph, respectively. As expected from the differences between H and H̃, the coalescence time inference performs very well in all four test cases, while the commute time inference performs fairly well in the uniform and symmetric cases, poorly in the asymmetric case, and very poorly in the biased asymmetric case. The behavior of the biased asymmetric graph supports our previous observations about situations similar to the three state system seen in Figure 6, where commute times are much longer than coalescence times. In this case, the commute time inference predicts that many movement parameters are an order of magnitude larger than they are in truth.
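For reference, the biased asymmetric test graph is fully determined by the description above, and the error summaries can be recomputed from posterior samples. The sketch below (Python; names are illustrative, grid orientation with location 1 at the top left follows Figure 19, and the use of posterior medians as point estimates is an assumption based on how the figures summarize g) builds the directed rates for a 4×4 grid and computes a mean absolute error.

    import numpy as np

    def biased_asymmetric_rates(nrow=4, ncol=4, with_bias=2.0, against_bias=0.5):
        """Movement rates g[(i, j)] for nearest-neighbor moves on an nrow x ncol grid.

        Moves left or down get rate with_bias; moves right or up get against_bias,
        matching the biased asymmetric graph described in the text.
        """
        def idx(r, c):
            return r * ncol + c
        g = {}
        for r in range(nrow):
            for c in range(ncol):
                if c > 0:
                    g[(idx(r, c), idx(r, c - 1))] = with_bias      # left
                if c < ncol - 1:
                    g[(idx(r, c), idx(r, c + 1))] = against_bias   # right
                if r < nrow - 1:
                    g[(idx(r, c), idx(r + 1, c))] = with_bias      # down (larger row index)
                if r > 0:
                    g[(idx(r, c), idx(r - 1, c))] = against_bias   # up
        return g

    def mean_abs_error(posterior_samples, g_true):
        """Mean absolute error of posterior medians against the true movement rates."""
        g_hat = np.median(posterior_samples, axis=0)   # one column per movement parameter
        return np.mean(np.abs(g_hat - np.asarray(g_true)))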
5 Application of Inference to Simulated Populations in SLiM
In order to validate the methods in situations that are more similar to those that we would see with actual populations, we performed simulations using SLiM (Selection on Linked Mutations) [Haller and Messer, 2017]. We used SLiM to simulate discrete generations based on the Wright-Fisher model. Simulations were performed in three different situations: uniform migration on a continuous square landscape, uniform migration on a rectangular landscape with barriers to migration, and biased migration on a square landscape. The simulations performed in this section were of populations of 10,000 diploid hermaphroditic individuals. Mates were selected from nearby individuals using a Gaussian kernel with standard deviation σ_d, truncated at 3σ_d. In each simulation, σ_d was chosen such that there would be an average of 0.25 individuals per σ_d^2 (2.25 per (3σ_d)^2), which gives σ_d = 0.005 for a square landscape with one unit of area. Offspring are dispersed away from the mother according to the same truncated Gaussian kernel. If the location of the offspring falls outside of the habitable area, it is redrawn until it is in a valid location. In order to prevent excessive clumpiness of the population density distribution [Felsenstein, 1975], we also model local competition. To do this, we use the same truncated Gaussian kernel, with the strength of the interaction chosen by visual inspection of the density distribution. Specifically, the new fitness of each individual after accounting for local competition is equal to exp(−totalStrength/40), where totalStrength is the sum of the strengths of interactions with all individuals within the cutoff distance according to the Gaussian kernel, resulting in reduced fitness for individuals in dense regions. Individuals are chosen to reproduce in proportion to their fitness until the next generation is formed. The resulting clumpiness is similar to what is seen in Figure 23.
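The kernel and density-regulation choices above amount to a few lines of arithmetic. The sketch below (Python, as an illustrative stand-in for the actual SLiM/Eidos configuration, which is not reproduced here) computes σ_d from the target density of 0.25 individuals per σ_d^2, draws a dispersal location with redraws until it falls inside the habitable area, and applies the competition penalty; the radial form of the 3σ_d truncation and the negative sign in the exponent are assumptions consistent with the description above.

    import numpy as np

    rng = np.random.default_rng(1)

    def sigma_d(n_individuals=10_000, area=1.0, per_sigma2=0.25):
        """Kernel sd chosen so that on average per_sigma2 individuals fall in sigma_d^2."""
        density = n_individuals / area
        return np.sqrt(per_sigma2 / density)     # 0.005 for 10,000 individuals on a unit square

    def disperse(mother_xy, sd, in_bounds, max_tries=1000):
        """Truncated-Gaussian dispersal (cut at 3*sd), redrawn until the offspring is habitable."""
        for _ in range(max_tries):
            step = rng.normal(0.0, sd, size=2)
            if np.linalg.norm(step) <= 3 * sd:   # truncate the kernel at 3 sd (radial, an assumption)
                xy = np.asarray(mother_xy) + step
                if in_bounds(xy):
                    return xy
        raise RuntimeError("no habitable offspring location found")

    def competition_fitness(total_strength):
        """Fitness after local competition; the /40 scaling follows the text and the
        negative sign is inferred from fitness being reduced in dense regions."""
        return np.exp(-total_strength / 40.0)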
[Figure 19 appears here: four panels ("4x4 Uniform Graph", "4x4 Symmetric Graph", "4x4 Asymmetric Graph", "4x4 Biased Asymmetric Graph"), each showing the 16 grid locations and the values of the 48 movement rates g1–g48.]
Figure 19: Plots of the grid structure for the uniform graph, symmetric graph, asymmetric graph, and biased asymmetric graph, drawn using the igraph R package.
[Figure 20 appears here: four scatterplots of Commute Time + Diversity against Coalescence Time, one for each of the uniform, symmetric, asymmetric, and biased asymmetric graphs.]
Figure 20: Plots of the true values of H̃ vs H for the uniform graph, symmetric graph, asymmetric graph, and biased asymmetric graph, shown with the y = x line. See Figure 19 for overall graph structure.
[Figure 21 appears here: eight panels of boxplots of Parameter Value (rate) against Parameter Index, pairing commute time and coalescence time inference for each of the uniform, symmetric, asymmetric, and biased asymmetric 4x4 graphs.]
Figure 21: Boxplots of the posterior distributions for g for commute time and coalescence time based inference for the uniform graph, symmetric graph, asymmetric graph, and biased asymmetric graph. See Figure 19 for movement parameter locations.
In our simulations, individuals have 10 pairs of chromosomes with 200,000 base pairs each and a recombination rate of 10^-8 per base pair per generation. In the landscapes with barriers and with biased migration, the mutation rate was set to 10^-7. In order to speed up simulation in the case of the landscape with uniform migration, the mutation rate was lowered to 10^-8. This should not have a dramatic effect on the pattern of genetic relatedness between individuals aside from reducing the genetic distances for those simulations by a factor of ten.
In each location, individuals are sampled uniformly from those within the middle three quarters in each dimension of the grid square, as shown for one situation in Figure 23. This protocol was chosen as a compromise between two extremes: sampling all individuals from near the center of each location, which could result in a large number of close relatives being sampled and therefore an underestimate of the mean genetic distance within that location, and uniform sampling from the whole area of each grid square, which may result in many sampled individuals lying very close to other grid squares. An equal number of individuals is drawn from each location, but this need not be the case. We could even have some locations from which no individuals were drawn, although this lowers performance, as discussed in Section 4.3. Mean genetic distances for each pair of locations are calculated from the means of all of the pairwise genetic divergences between genomes in the two locations, excluding self comparisons. The standard error of the mean is calculated conservatively using the minimum number of independent samples, in this case the smaller of the two location sample sizes.
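One plausible reading of this summary step is sketched below (Python; names are illustrative, and D is assumed to be a matrix of pairwise genetic divergences between sampled genomes): the block of divergences between the two samples is averaged, self comparisons are excluded for within-location blocks, and the standard error divides by the square root of the smaller sample size as a conservative count of independent samples.

    import numpy as np

    def location_pair_summary(D, idx_a, idx_b):
        """Mean and conservative standard error of pairwise divergence between two locations."""
        block = D[np.ix_(idx_a, idx_b)].astype(float)
        if list(idx_a) == list(idx_b):
            np.fill_diagonal(block, np.nan)   # exclude self comparisons within a location
        mean = np.nanmean(block)
        n_eff = min(len(idx_a), len(idx_b))   # conservative number of independent samples
        se = np.nanstd(block, ddof=1) / np.sqrt(n_eff)
        return mean, se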
Before being passed to the MCMC inference method, genetic distance values are scaled so that the
resulting parameters are approximately of order one. This is accomplished by multiplying the mean genetic
distances by a constant such that the new mean of the mean genetic distances is equal to the number of
states in the discretization. For example, if we discretize a landscape into 16 squares, the mean of the mean
genetic distances after scaling will be 16.
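Expressed as code, the rescaling is a one-liner (a sketch with illustrative names):

    import numpy as np

    def rescale_distances(H_raw, n_states):
        """Scale mean genetic distances so their overall mean equals the number of states."""
        return H_raw * (n_states / np.mean(H_raw))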
For the 4×4 discretizations of the uniform and biased migration cases, as well as the barrier case, inference was performed using 4 × 10^6 iterations of MCMC after 1 × 10^6 iterations of pre-burn-in and 3 × 10^6 iterations of burn-in. For the 2×2 discretizations, 1 × 10^6 iterations were used, with 1 × 10^5 for pre-burn-in and 1 × 10^6 for burn-in.
5.1 Continuous Landscape with Uniform Migration
The case of a continuous landscape with uniform migration is the simplest situation, but it presents a difficult inference problem. Since there is no strong signal from any particular feature, all of the observed differences in parameter values are due to noise. Noise that affects the inferred values can arise from several different sources. The first is the difference between individual simulations: because the process is stochastic, each simulation can effectively have different gene flow rates due to recent population history. Another source of noise is the particular individuals chosen to create the mean genetic distance matrix. Third, the mean genetic distances will differ depending on which sections of the genome are used. Unless otherwise noted, we use all of the polymorphic sites in the simulated populations for this analysis. As shown in Figure 22, there is a substantially larger difference between different simulations than between non-overlapping groups of individuals from the same simulation.
Since these simulations were done on a square landscape, we can easily discretize it at different resolutions. Here, we apply the gene flow inference method at two different resolutions: a 2×2 grid and a 4×4 grid. 200 individuals are selected from each location in the 2×2 case and 50 individuals from each location in the 4×4 case, for a total of 800 individuals in each case.
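Assigning the continuously located individuals to grid cells is straightforward. The sketch below (Python; names are illustrative, and a unit-square landscape with coordinates in [0, 1] is assumed, as in the uniform and biased migration cases) bins (x, y) coordinates into a k×k discretization and subsamples the requested number of individuals from the middle three quarters of each cell, as described above.

    import numpy as np

    rng = np.random.default_rng(2)

    def assign_cells(xy, k):
        """Map (x, y) positions on the unit square to cell indices 0..k*k-1."""
        cols = np.minimum((xy[:, 0] * k).astype(int), k - 1)
        rows = np.minimum((xy[:, 1] * k).astype(int), k - 1)
        return rows * k + cols

    def sample_cell(xy, cell, n, k):
        """Sample n individuals from the middle three quarters (per dimension) of one cell."""
        r, c = divmod(cell, k)
        lo_x, lo_y = (c + 0.125) / k, (r + 0.125) / k
        hi_x, hi_y = (c + 0.875) / k, (r + 0.875) / k
        in_core = ((xy[:, 0] >= lo_x) & (xy[:, 0] <= hi_x)
                   & (xy[:, 1] >= lo_y) & (xy[:, 1] <= hi_y))
        candidates = np.flatnonzero(in_core)
        return rng.choice(candidates, size=n, replace=False)  # assumes at least n candidates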
5.2 Continuous Landscape with a Barrier
We now revisit the earlier situation where a landscape has locations that are barriers to gene flow, as in Section 4.3. The landscape is designed to be a continuous version of that situation. Since the distribution of offspring locations is truncated at three standard deviations away from the mother, a sufficiently thick uninhabitable area can completely stop migration directly across it. As in Section 4.3, though, the landscape is still relatively well connected. The layout of the landscape can be seen in Figure 23, with the red bars marking the uninhabitable regions that serve as barriers to migration. Unlike the case of a square landscape with no barriers, there is no natural way to discretize it more coarsely with grid squares of equal size and shape, so a 5×3 grid is the only discretization used in this section. We calculate mean genetic distances using 50 individuals from each grid location, giving a total of 750 individuals.
The gene flow rates across the barriers are reliably estimated to be very small, but there is considerable variation in the other values. Note, however, that this analysis requires that we guess a reasonable place to put the divisions between the locations. For instance, we may wonder whether a river or other geographical feature restricts gene flow. We would then discretize the landscape such that individuals on opposite sides of the potential obstruction are in separate grid spaces.
5.3 Continuous Landscape with Biased Migration
The third situation that we discuss is one where migration is biased. We use a square landscape as in the first case, but the distribution of offspring locations is not centered around the parent. Instead, its mean is one tenth of a standard deviation (or 0.05% of the edge length of the landscape) up and to the right of the parent's location. This results in reverse time gene flow down and to the left, as in the biased asymmetric graph in Section 4.4.
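This bias amounts to shifting the mean of the dispersal kernel. A minimal sketch (Python; extending the illustrative dispersal helper from the start of Section 5, with σ_d = 0.005 as above; whether the 3σ_d truncation applies before or after the mean shift is not specified, so here it is applied to the zero-mean draw):

    import numpy as np

    rng = np.random.default_rng(3)

    def biased_step(sd=0.005, bias_frac=0.1):
        """Dispersal step whose mean is 0.1*sd up and to the right of the mother."""
        mean = np.array([bias_frac * sd, bias_frac * sd])   # +x (right) and +y (up)
        step = rng.normal(0.0, sd, size=2)
        while np.linalg.norm(step) > 3 * sd:                 # same 3*sd truncation as before
            step = rng.normal(0.0, sd, size=2)
        return mean + step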
[Figure 22 appears here: a scatterplot of Alternative H Values against H Values from Sim 1, Set 1, with points for Sim 1 Set 2, Sim 2, and Sim 3 ("Comparison of Different Genetic Distance Matrices").]
Figure 22: Comparison of the values of the genetic distance matrix H for a 4×4 discretization of a square landscape with uniform migration, calculated in various situations. All values are plotted against the corresponding values from the H matrix calculated using the first set of individuals from the first simulation, along with the y = x line. Black circles are from a second (non-overlapping) set of individuals from the first simulation, red triangles are from the second simulation, and blue plusses are from the third simulation. Mean absolute relative differences between the new values of H for each situation and the original values from the first set of individuals from the first simulation are 1.8% for the second set of individuals from the first simulation, 13.2% for the second simulation, and 7.3% for the third simulation.
[Figure 23 appears here: a map of individual locations ("Individual Locations: Landscape with Barriers 1"), with Position (x) from 0 to 5 and Position (y) from 0 to 3.]
Figure 23: Locations of all and of selected individuals on the 5×3 landscape with barriers in the first replicate. Locations of individuals used to compute the mean genetic distance matrix are shown as bold points, whereas the locations of other individuals are shown as smaller points. The red bars show the locations of uninhabitable regions of the landscape. They are sufficiently thick that no migration can occur directly across them.
[Figure 24 appears here: "Graph Structure: Landscape with Barriers 1", showing the 5×3 grid of locations 1–15 and the inferred median values of g1–g44.]
Figure 24: Medians of the posterior distributions of the inferred values of g for the first replicate of the 5×3 landscape with barriers. The gene flow rates across the barriers are small, but there is no clear pattern among the rest of the values.
[Figure 25 appears here: boxplots of Parameter Value (rate) against Parameter Index for g1–g44 ("Posterior Distributions: Landscape with Barriers 1").]
Figure 25: Boxplots of the posterior distributions for g for the first replicate of the 5×3 landscape with barriers. Boxes for gene flow rates on edges that cross the barrier are colored in red.
In terms of population genetics, this is very similar to the case where a population is expanding into new territory, since in both cases offspring tend to be further from the edge of the expansion than their parents. The difference is that, in our case, the population is constrained to be a bounded distance away from the edge of the "expansion". In both situations, individuals at the edge may have disproportionately more offspring, leading to low genetic diversity and strong drift. This results in substantially more noise in the isolation by distance plots, as seen in Figure 26. The horizontal lines in that figure likely come from comparing genomes of individuals from two families that have recently increased in size and geographical spread, resulting in many pairs of genomes with similar genetic distance but varying geographic distance.
As in Section 5.1, we discretize the landscape into both a 2×2 grid and a 4×4 grid for inference. Looking at the medians of the inferred posteriors of the gene flow rates in the 2×2 case (Figure 27), it is immediately clear how strong the bias against reverse time gene flow out of the lower left corner is. That is, hardly any individuals in the lower left area have ancestors from other regions of the landscape. This is to be expected: even though the single generation migration bias is small, having a descendant in a location against the bias becomes less and less likely as the number of generations increases. In the 4×4 discretization, the bottom left also shows very strongly biased gene flow (Figure 28). Most (but not all) of the other pairs also show bias in the expected direction. Boxplots of the posterior distributions are shown in Figure 29.
6 Application of Inference to Mojave Desert Tortoises
We also apply the method to a population of Mojave desert tortoises (Gopherus agassizii) living in California. Location data and low coverage whole genome sequencing data were collected from 271 individuals [Shaffer et al., 2017]. Pairwise genetic distances between all individuals were obtained from the same study. The landscape was discretized into 13 regions based on watersheds and the locations of individuals. The mean coalescence time matrix and standard deviations were calculated as before. Since the regions vary substantially in size, and may vary substantially in population density, we do not assume that coalescence rates are the same everywhere, which gives 13 values of γ. Since allowing only nearest neighbor migration gives 50 values of g, we have a total of 63 unknown parameters. Because there are 13(14)/2 = 91 unique values of the mean coalescence time matrix, we have more equations than unknowns.
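As a quick sanity check on this parameter counting (a sketch; the 50 directed edges are taken from the text rather than rederived from the watershed map):

    n_regions = 13
    n_edges = 50                                        # directed nearest-neighbor gene flow rates
    n_unknowns = n_regions + n_edges                    # 13 coalescence rates + 50 gene flow rates = 63
    n_observations = n_regions * (n_regions + 1) // 2   # unique entries of symmetric H, incl. diagonal = 91
    assert n_unknowns == 63 and n_observations == 91
    assert n_observations > n_unknowns                  # more equations than unknowns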
Posterior distributions of gene flow and coalescence rate parameters were inferred using 4 × 10^7 iterations of MCMC after 1 × 10^6 iterations of pre-burn-in and 3 × 10^6 iterations of burn-in. The landscape and associated graph are shown in Figure 30. Boxplots of the posterior distributions for the gene flow rates can be seen in Figure 31, and for the coalescence rates in Figure 32. There is substantial variation in the gene flow rates but not in the coalescence rates, with the exception of region 4, which seems to have a relatively slow coalescence rate. This is not too surprising given its large area and central location.
Figure 26: Comparison of the isolation by distance plots for the square landscape with uniform migration and the square landscape with biased migration. Faster coalescence in the biased migration case results in more noticeable horizontal lines, indicative of families that recently increased in size and geographical spread.
[Figure 27 appears here: "2x2 Grid (Biased Migration)", showing locations 1–4 and the inferred median values of g1–g8.]
Figure 27: Medians of the posterior distributions of the inferred values of g for the first replicate of the 2×2 discretization of the square landscape with biased migration. The reverse time gene flow rates out of the bottom left section are very small, indicating that very few individuals living in the bottom left of the landscape have ancestors from other regions of the habitat, consistent with migration biased up and to the right.
[Figure 28 appears here: "Graph Structure: Biased Migration 1", showing the 4×4 grid of locations 1–16 and the inferred median values of g1–g48.]
Figure 28: Medians of the posterior distributions of the inferred values of g for the first replicate of the 4×4 discretization of the square landscape with biased migration. As in the 2×2 case, the reverse time gene flow rates out of the bottom left section are very small, indicating that very few individuals living in that area of the landscape have ancestors from other regions of the graph. Most, but not all, of the other gene flow rates are biased in the direction we would expect.
[Figure 29 appears here: boxplots of Parameter Value (rate) against Parameter Index for g1–g48 ("Posterior Distributions: Biased Migration 1").]
Figure 29: Boxplots of the posterior distributions for g for the first replicate of the 4×4 discretization of the square landscape with biased migration. Black boxes are gene flow rates down and to the left, with each green box being the corresponding gene flow rate up and to the right.
In general, the results are reasonable. For example, we observe relatively low gene flow between areas with mountains on the border, such as between regions 1 and 4, 1 and 9, 4 and 12, 4 and 5, and 9 and 10. On the other hand, there is an open connection between regions 9 and 13, and we infer high rates of migration between the two regions. Overall, the model is a better fit than the resistance distance method of Shaffer et al. [2017] (Figure 33).
The strong asymmetries, such as those present between regions 9 and 11 or 9 and 12, are probably the result of either source-sink dynamics or historical processes. The presence of source-sink dynamics would be of great interest for tortoise conservation efforts; however, it is too early in this analysis to use these numbers for that purpose. For example, we would want to check for the effects of the sample configuration (e.g., are gene flow rates between regions 2 and 3 low simply because individuals sampled in region 3 are geographically far from region 2?) as well as perform a sensitivity analysis, potentially including simulations on the landscape.
7 Discussion
With these methods, we provide a framework for inferring demographic parameters of populations living on a habitat spread over a two dimensional landscape, using present day location and genetic sequencing information. We are able to infer reverse time gene flow and coalescence rates from the matrix of mean coalescence times in cases where the matrix is noise-free, noisy, or incomplete, or when it is constructed after discretization of a simulated or real landscape. To do this, we assume that the reverse time movements of lineages are Markov and that the population is near stationarity. Typically, there is more error and uncertainty in the posterior distributions of the parameters when there is more noise, a slower coalescence rate, more missing values, or multiple parameters for the coalescence rate. This allows us to study the evolutionary history of populations and learn what factors led to their current states, including effective population sizes and how likely it is for individuals in one location to have ancestors from certain other locations.
While the primary focus is coalescence time based inference, we also compare the results with those of resistance distance (i.e., commute time) based inference. Coalescence based inference typically has lower error than commute time based inference, along with the benefit that coalescence rates can also be inferred. The commute time plus diversity estimate, H̃, provides a good estimate of the coalescence times, H, when gene flow rates are symmetric (G_ij = G_ji), but can be very wrong when they are asymmetric, with extreme discrepancies when there is an overall bias in the direction of the asymmetry across the landscape (Figure 20). When inference is performed on discrete landscapes with varying degrees of migrational asymmetry, coalescence time based inference is more accurate than commute time inference in all tested cases (Figure 21). Although performance of the commute time based method is reasonable when gene flow rates in the underlying model are symmetric, it becomes worse as the gene flow rates become more asymmetric, to the point where there is little correlation at all with the true values when there is biased asymmetric migration.
Figure 30: Left: Locations of the sampled tortoises. Right: Sample maximum likelihood estimates from the
posterior distributions of the inferred values of g.
[Figure 31 appears here: boxplots of Parameter Value (rate) against Parameter Index for g1–g50 ("Posterior Distributions: Tortoise Gene Flow Rates").]
Figure 31: Boxplots of the posterior distributions for g for the Mojave desert tortoise data set on the thirteen region discretization of the landscape.
[Figure 32 appears here: boxplots of Parameter Value (rate) against Parameter Index for gam1–gam13 ("Posterior Distributions: Tortoise Coalescence Rates").]
Figure 32: Boxplots of the posterior distributions for γ for the Mojave desert tortoise data set on the thirteen region discretization of the landscape.
[Figure 33 appears here: a scatterplot of Modeled Mean Coalescence Times against Observed Mean Coalescence Times, both ranging from about 0.0046 to 0.0058.]
Figure 33: Mean coalescence times calculated from the sample maximum likelihood model parameters plotted against the observed mean coalescence times calculated from the data, measured as mean pairwise genetic divergence. The mean absolute relative difference between the two is 0.4%.
On the other hand, commute time inference is much less computationally intensive, so much finer resolution spatial systems can be used. If the gene flow rates are symmetric, the higher resolution allowed by commute time inference may outweigh the loss in accuracy. The implication of these results for the body of literature using resistance distance methods is that one may need to use caution when interpreting results. For example, inferring high gene flow rates between two areas using resistance distance could potentially mean that those areas were actually being fed by a common source, as in Figure 6 or the biased asymmetric case in Section 4.4. Therefore, it may be worthwhile to apply a coarse resolution coalescence time based method alongside a finer resolution resistance distance method in order to check for evidence of asymmetric gene flow patterns.
We also apply the method to data from continuous space simulations. While there is substantial variation between the inferred values for different simulations with the same underlying demographic parameters, we are able to reliably find evidence of biased asymmetric migration (when it is present) and verify the existence of a barrier to migration, even when the landscape is otherwise relatively well connected. Further work could involve more precise study of the relationship between reverse time gene flow rates and forwards time migration rates.
One difficulty when using data from continuous space landscapes, especially data from an actual population rather than a simulation, is the issue of how to discretize the landscape when there are no obvious natural boundaries. If there are no hard barriers to migration, it is not particularly important where the exact boundaries lie (see Section 5.3). On the other hand, if the landscape does include a barrier to migration, we would want to avoid using a region that includes individuals from both sides of the barrier. One potential way to combat this would be to try multiple boundary locations and check whether any of the regions have unusually high within-location divergence (which could arise from improper discretization). Alternatively, we could initially discretize the landscape at a higher resolution, then group together adjacent areas with low genetic divergence between them, as sketched below. Another strategy would be to sample individuals from many relatively small locations so that there would be a relatively small chance that any of the clusters would have a migrational barrier within them. The drawback of this last approach is that it could result in an underestimation of the within-region genetic divergences if many close relatives are sampled.
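A minimal sketch of that grouping heuristic, assuming a fine discretization with an adjacency structure and a matrix of between-cell mean divergences (Python; all names are hypothetical, and this illustrates the suggested strategy rather than an analysis performed in the thesis):

    import numpy as np

    def merge_low_divergence(neighbors, div, threshold):
        """Greedily merge adjacent fine-scale cells whose mean divergence is below a threshold.

        neighbors -- dict: cell index -> iterable of adjacent cell indices
        div       -- symmetric matrix of mean between-cell genetic divergences
        threshold -- merge adjacent pairs with divergence below this value
        Returns a label per cell; cells sharing a label form one coarse region.
        """
        n = div.shape[0]
        label = list(range(n))                 # union-find style parent pointers

        def find(i):
            while label[i] != i:
                label[i] = label[label[i]]     # path halving
                i = label[i]
            return i

        for i in range(n):
            for j in neighbors[i]:
                if i < j and div[i, j] < threshold:
                    label[find(i)] = find(j)   # merge the two groups

        return np.array([find(i) for i in range(n)])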
This analysis provides distinct information compared to methods such as Principal Component Analysis (PCA) [Patterson et al., 2006, McVean, 2009] and Isolation with Migration (IM) [Hey and Nielsen, 2007]. While PCA can be used to identify population structure, it cannot estimate gene flow rates. IM is able to infer posterior distributions for migration rates and population sizes, but it also infers the actual genealogies. This is both an advantage and a disadvantage: in theory, it is better since it uses and provides more information, but since performing MCMC over the genealogies is very time consuming, it is only feasible when using a relatively small number of loci. Therefore, that method and the one presented here have different use cases.
Aside from learning about the demographic states of populations, another potential application of this work is in population viability analysis (PVA). Using the information gained from solving the reverse time problem, we can design a forwards time simulation in order to predict how a population will behave in the future, including which factors may be the most important to the survival of the population. One challenge in this situation is that the inference method assumes that the population has been stable for a relatively long period of time. Since the populations on which we would be most interested in performing PVA would most likely have recently impacted environments, unless the organisms in question had very short generation times, it may be difficult to determine the effect of recent changes to the habitat. In this case, it would be beneficial to run multiple simulations with various modifications based on disruptions of the landscape.
8 References
N. H. Barton, A. M. Etheridge, and A. Véber. Modelling evolution in a spatial continuum. Journal of Statistical Mechanics: Theory and Experiment, 2013(01):P01002, 2013. URL http://stacks.iop.org/1742-5468/2013/i=01/a=P01002.
Nick H. Barton, Frantz Depaulis, and Alison M. Etheridge. Neutral evolution in spatially continuous populations. Theoretical Population Biology, 61(1):31–48, February 2002. URL http://dx.doi.org/10.1006/tpbi.2001.1557.
Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo, 2017. URL http://arxiv.org/abs/1701.02434. arXiv:1701.02434.
Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.
Joseph Felsenstein. A pain in the torus: Some difficulties with models of isolation by distance. The American Naturalist, 109(967):359–368, 1975. ISSN 0003-0147. URL http://www.jstor.org/stable/2459700.
Benjamin C. Haller and Philipp W. Messer. SLiM 2: Flexible, interactive forward genetic simulations. Molecular Biology and Evolution, 34(1):230–240, 2017. doi: 10.1093/molbev/msw211. URL http://dx.doi.org/10.1093/molbev/msw211.
J. Hey and R. Nielsen. Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc Natl Acad Sci U S A, 104(8):2785–2790, February 2007. doi: 10.1073/pnas.0611164104. URL https://www.ncbi.nlm.nih.gov/pubmed/17301231.
Richard R. Hudson. The variance of coalescent time estimates from DNA sequences. Journal of Molecular Evolution, 64(6):702–705, 2007. ISSN 0022-2844. doi: 10.1007/s00239-006-0261-1. URL http://dx.doi.org/10.1007/s00239-006-0261-1.
John G. Kemeny and J. Laurie Snell. Finite Markov Chains. Springer-Verlag, New York-Heidelberg, 1976. Reprinting of the 1960 original, Undergraduate Texts in Mathematics.
G. Malécot. Les mathématiques de l'hérédité. Masson, Paris, 1948.
B. H. McRae. Isolation by resistance. Evolution, 60(8):1551–1561, August 2006. URL http://www.ncbi.nlm.nih.gov/pubmed/17017056.
Gil McVean. A genealogical interpretation of principal components analysis. PLoS Genet, 5(10):e1000686, October 2009. doi: 10.1371/journal.pgen.1000686.
Nick Patterson, Alkes L. Price, and David Reich. Population structure and eigenanalysis. PLoS Genet, 2(12):e190, December 2006. doi: 10.1371/journal.pgen.0020190.
D. Petkova, J. Novembre, and M. Stephens. Visualizing spatial population structure with estimated effective migration surfaces. Nat Genet, 48(1):94–100, January 2016. doi: 10.1038/ng.3464. URL https://www.ncbi.nlm.nih.gov/pubmed/26642242.
Y. P. Petrov and V. S. Sizikov. Well-posed, Ill-posed, and Intermediate Problems with Applications, volume 49. Walter de Gruyter, 2005.
H. Bradley Shaffer, Evan McCartney-Melstad, Peter L. Ralph, Gideon Bradburd, Erik Lundgren, Jannet Vu, Bridgette Hagerty, Fran Sandmeier, Chava Weitzman, and C. Richard Tracy. Desert tortoises in the genomic age: Population genetics and the landscape. bioRxiv, 2017. doi: 10.1101/195743. URL https://www.biorxiv.org/content/early/2017/09/29/195743.
Tokuzo Shiga. An interacting system in population genetics. J. Math. Kyoto Univ., 20(2):213–242, 1980. ISSN 0023-608X. URL http://projecteuclid.org/euclid.kjm/1250522276.
Tokuzo Shiga. Stepping stone models in population genetics and population dynamics. In Stochastic Processes in Physics and Engineering (Bielefeld, 1986), volume 42 of Math. Appl., pages 345–355. Reidel, Dordrecht, 1988.
John Wakeley. Coalescent Theory: An Introduction. Roberts and Company, Greenwood Village, CO, 2005. URL http://www.coalescentheory.com/.
S. Wright. Isolation by distance. Genetics, 28(2):114–138, March 1943. URL http://www.genetics.org/cgi/reprint/28/2/114.
Abstract
Genetic divergence is related to geographic distance, but the exact relationship is not straightforward, and is the result of how lineages move across the landscape. Our goal is to infer these movement rates from pairwise genetic divergences at known locations. If we approximate the habitat as a graph, there are two main models to explain patterns of genetic divergence on spatial landscapes: resistance distance and coalescence time. Coalescence time is mechanistically closer to reality and allows the inference of asymmetric migration rates, but resistance distance is more computationally feasible for large graphs and therefore used by popular programs such as EEMS and Circuitscape. Resistance distance is equivalent to coalescence time when the graph is isotropic, but in general the two measures are typically different. We investigate when the approximation of coalescent time with resistance distance breaks down. We also provide an MCMC method to infer migration and coalescence rate parameters using coalescence times and find that using commute time can produce dramatically misleading results in the case of biased gene flow. We apply the method to data from the model, forwards time population simulations, and a sample of Mojave Desert Tortoises.