USING ARTIFICIAL NEURAL NETWORKS TO ESTIMATE
EVOLUTIONARY PARAMETERS
by
Chi-Chiang Lee
A Thesis Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(BIOSTATISTICS)
May 2010
Copyright 2010 Chi-Chiang Lee
DEDICATION
To my family, for their support, understanding and encouragement
ACKNOWLEDGEMENTS
I would like to thank Dr. Paul Marjoram, my advisor and thesis committee chair; without his guidance and support, this thesis would have been impossible. I would also like to thank the committee co-chair, Dr. Stanley Azen, and committee member, Dr. Kimberly Siegmund, for their invaluable reviews of and feedback on this thesis.
TABLE OF CONTENTS
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Introduction
  Background
  Estimation of Genetic Variations: Recombination and Mutation
  The Basic Concept of Artificial Neural Networks
  The Biological Background of Artificial Neural Networks
  The Basic Artificial Neurons
  The Artificial Neurons with Layers
  The Artificial Neural Networks with Parallel Computing
  The Architectures of Artificial Neural Networks
  The Applications of Artificial Neural Networks
Methods
  Simulation of SNP Data with Coalescent Model
  Artificial Neural Network Package
  Format of Simulated SNP Data
  Summary Statistics to Capture Genetic Variations
Results
  Discrete Mutation Rate Prediction
  Continuous Mutation Rate Prediction
  Discrete and Continuous Recombination Rate Predictions
Discussion
References
vi
LIST OF TABLES
Table 1: Predicting Mutation Rate by Artificial Neural Network Model and Watterson's Method Using θ = {1, 2, 5, 10}, Per 1000 Base Pairs
Table 2: Predicting Mutation Rate by Artificial Neural Network Model and Watterson's Method Using a Continuous Range of θ Values (Per 1000 Base Pairs)
Table 3: Predicting Recombination Rate by Artificial Neural Network Model Using the Summary Statistics S4 to S10 When Sample Recombination Rates Were Either ρ = 0 or ρ = 10, Per 1000 Base Pairs
Table 4: Predicting Recombination Rate by Artificial Neural Network Model Using the Summary Statistics S4 to S10 When ρ Was Drawn Randomly from the Interval [0, 10] (Per 1000 Base Pairs)
vii
LIST OF FIGURES
Figure 1: A Simple Example of an Artificial Neural Network
Figure 2: An Example of the Collective Input (Yj) Function
Figure 3: Output Data of Artificial Neural Network (ANN) Predicted Mutation Rates Corresponding to the 3rd Row of Table 2 (Training Range = 0-10, Test Range = 2.5-7.5)
Figure 4: Output Data of Watterson's Estimated Mutation Rates Corresponding to the 3rd Row of Table 2 (Training Range = 0-10, Test Range = 2.5-7.5)
ABSTRACT
The rapid growth in the amount of molecular genetic data being collected will,
in many cases, require the development of new analytic methods for the analysis of
that data. In this thesis, we explore the feasibility of using machine learning
algorithms, in particular artificial neural networks, to estimate two evolutionary
parameters of great interest: mutation and recombination rates. We show that this is
possible, and that the performance of such methods depends crucially upon the
existence of good summary statistics appropriate for the given parameter, as well as
the format in which the data itself is represented.
INTRODUCTION
Background
We are currently in the midst of a rapid expansion in both the quantity of
molecular data that is being collected, and the quality (and capacity) of the
computational hardware that is available to analyze this data. For example, the
recent development of SNP-chip hardware allows over 1 million Single
Nucleotide Polymorphisms (SNPs) to be interrogated on a single chip, allowing
for efficient undertaking of genome-wide association studies and/or inference
regarding evolutionary demographics (e.g. Craig et al. 2005, Shriver et al. 2004,
The International HapMap Consortium 2007). While there are a variety of
approaches that allow the successful analysis of molecular data, these approaches
will often be incapable of analyzing data of such high dimension. At the same
time, there is a range of machine-learning algorithms that might well prove
successful in this context. In this thesis, we explore this issue.
By far the most popular model used in the context of evolutionary inference
for sequence or SNP data is the coalescent, which provides a mathematical
description of the underlying genealogy of a sample. This model was introduced
by Kingman (1982). In its original form it was specified in a context in which the
evolutionary history of the region was neutral (i.e. no selection), there was no
recombination, and the associated population of interest was unstructured. As
such, the genealogy was tree-like and the generic application was to mtDNA data
(see, for example, Marjoram and Donnelly 1994). The model was subsequently
generalized to allow for, among other things, the effects of recombination,
selection, population size-change and structure. Useful reviews can be found in
(Tavaré 1984, Hudson 1990, Nordborg 2001).
Estimation of Genetic Variations: Recombination and Mutation
We focus our interest on estimation of arguably the two most crucial
parameters in most models of DNA evolution: θ, the rate at which the region of
interest mutates, and ρ, the rate at which the region recombines. It transpires that the
mutation/recombination rate is confounded with the population size (see, e.g.,
Tavaré 1984), so these parameters are defined as θ=4Nu, where N is the
(effective) population size and u is the mutation rate per sequence per generation,
and ρ=4Nc, where c is the recombination rate per sequence per generation. (The
factor of 4 in these expressions is simply for mathematical convenience.)
Our reason for investigating these parameters is two-fold. Firstly, they are
two of the most important parameters in this context. Secondly, they demonstrate
differing behaviors regarding the use of summary statistics. θ is considerably
easier to estimate, and there is a simple, good estimator of θ (Watterson's
estimate, θW; Watterson 1975) that is a straightforward function of the summary
statistic S, where S is the number of mutations observed in a sample of data. The
estimate is defined as θW = S / ( Σ_{i=1}^{n-1} 1/i ), where n is the sample size. However, with ρ
the situation is not so straightforward, and there is no equally good estimate of ρ
that is based on summary statistics (although, for recent progress on estimating ρ
see Hudson 2001, Li and Stephens 2003, Myers et al. 2005, for example). Thus,
by studying these two parameters we will be able to see how performance of our
method might depend upon the existence of good one-dimensional summary
statistics of the data.
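Since Watterson's estimate serves as our benchmark throughout, it is worth noting how little computation it requires. A minimal sketch in R (the language used for our analyses; the function name here is ours, for illustration):

    # Watterson's estimate: S divided by the (n-1)th harmonic number,
    # where S is the number of segregating sites (mutations) observed
    # and n is the number of sampled sequences.
    watterson_theta <- function(S, n) {
      S / sum(1 / (1:(n - 1)))
    }

    watterson_theta(S = 10, n = 20)  # roughly 2.82 for 10 SNPs in 20 sequences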
The Basic Concept of Artificial Neural Networks
Artificial neural networks became popular in the field of biotechnology
when researchers attempted to predict the outcomes of biological and chemical
activities by modeling their structures. The general rationale of artificial neural
networks is to mimic the way the human brain processes information. Starting
with a raw data set, we select a layer structure and learning algorithm to form an
artificial neural network. An artificial neural network contains several layers: the
first layer consists of input units, the middle layers consist of hidden nodes, and
the last layer consists of output units. Hidden nodes are artificial neurons, the
basic unit of computation. There are many inputs to the first layer and one output
from the last layer (Figure 1).
Use of an artificial neural network requires dividing the raw data into two
sets: training and test. A training set is used to fit and optimize the model. The test
set is then used to generate estimates of the performance of the artificial neural
network. An artificial neural network can be used to perform classification and
prediction. We can also perform a sensitivity analysis on input variables to
understand their importance when predicting classification.
Figure 1: A simple example of artificial neural networks.
The Biological Background of Artificial Neural Networks
An artificial neural network is a mathematical model of computational
science for simulating the behavior of biological neural networks. It consists of a
group of artificial neurons that are connected reciprocally to form a dynamic
computational network. This adaptive system keeps improving its
structure based on input data that streams through the computational network
during the training process. In other words, an artificial neural network may be
considered as a black box that can receive x-variable inputs and transform them
into y-variable outputs. This black box is where artificial neurons are simulated to
mimic the behavior of living neural cells to find mathematical patterns in data.
Therefore, understanding the basic biological functions of human neurons should
help us to comprehend the theory of artificial neural networks.
There are roughly 10^10 neurons in the human nervous system, which
consists of the Central Nervous System and the Peripheral Nervous System. The
Central Nervous System, which consists of human brain and spinal cord, receives
and processes signals from the Peripheral Nervous System. In addition, each
neuron consists of three basic parts: a cell body, also termed soma, with a nucleus;
one or more dendrites that receive incoming signals and send them to the soma;
and an axon that carries information away from the soma and transmits signals to the
synapses of other neurons.
These synapses are the connection areas between two neurons, and they also
act as the barriers through which signals from neighboring neurons must pass.
The synaptic strength plays an important role in determining the amount of signal
entering the neuron cell: it represents the degree of the barrier, and the amount
of signal change depends on it.
The Basic Artificial Neurons
In artificial neurons, each signal, xi, has an associated weight, wi. Each neuron
can receive many signals simultaneously, since there are a large number of
dendrites and axons in the neuron. In order to simulate the behavior of neural
cells, the following function is used to represent the collective input (Yj) of the
neuron (Figure 2):

Yj = w1 x1 + w2 x2 + … + wi xi + … + wn-1 xn-1 + wn xn
Figure 2: An example of the collective input (Yj) function.
For example, there are three sets of signals, which have intensities (0.9, 0.8,
-0.7), (6.0, 5.0, 4.1) and (-2.0, 5.0, 4.0), flowing through a neuron that has three
synapses with weights 1.0, -2.0 and 3.0. The calculations for the three collective
inputs of the neuron are as follows:
Y1 = (1.0 x 0.9) + ((-2.0) x 0.8) + (3.0 x (-0.7)) = -2.8
Y2 = (1.0 x 6.0) + ((-2.0) x 5.0) + (3.0 x 4.1) = 8.3
Y3 = (1.0 x (-2.0)) + ((-2.0) x 5.0) + (3.0 x 4.0) = 0.0
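Each collective input is simply the dot product of the weight vector with a signal vector, as this small R check (variable names ours) confirms:

    w <- c(1.0, -2.0, 3.0)               # the three synaptic weights
    signals <- rbind(c(0.9, 0.8, -0.7),  # one signal set per row
                     c(6.0, 5.0, 4.1),
                     c(-2.0, 5.0, 4.0))

    Y <- signals %*% w   # collective inputs: -2.8, 8.3, 0.0
    sign(Y)              # the signs (-1, 1, 0) classify the three signals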
Based on the values of these three collective inputs, their signs can be used to
classify the three different signals easily. Hence, the weight vector (1.0, -2.0, 3.0)
acts as the decision maker that separates the signals into different categories.
Generally, the weight vector starts from an initial guess and is then improved
iteratively by a machine learning process, changing slightly at each step of the
learning procedure. In essence, an artificial neural network is a machine learning
method designed to obtain reliable weight schemes. The most difficult task is to
acquire weight schemes appropriate for a given problem, especially when the data
patterns are very complicated, as is common in real-world applications.
The Artificial Neurons with Layers
In order to achieve solutions for complicated real-world applications, this
single-neuron approach has to be integrated with the network concept, in which
neurons are interconnected. Hence, a single neuron is generalized to a group of
neurons, which is defined as a layer. In a layer, each neuron has a different weight
vector. All of the neurons receive the same input signals for processing at the same
time, and then produce as many independent output signals as there are neurons
in the layer. The individual output signals of one layer can be used as an input
vector for another layer of neurons. Moreover, the output signals of one layer are
only dispatched to the next layer when all neurons of the current layer have
finished their tasks.
The Artificial Neural Networks with Parallel Computing
Compared with other approaches, the major advantage of the artificial
neural network method is that layers can be implemented as parallel computing
processes. The basic concept of parallel computing is the use of multiple
computational resources at the same time to solve a specific problem. Since each
neuron processes information independently, a layer is a natural application for
parallel computing. In principle, an artificial neural network mimics the way
the human brain processes information, and it is devised as a group of
interconnected artificial neurons computing in parallel. In theory, using more
computational resources for one problem should reduce the time needed to complete
the task, and possibly the cost. If there are enough processors available, the
computations in one layer can be performed simultaneously. Therefore, instead of
executing in a sequential manner, the artificial neural network method gains
computing power and efficiency through parallel computing.
The Architectures of Artificial Neural Networks
There are three kinds of connections between two layers of neurons: full,
partial or random connection. Full connection means that every neuron in the first
layer is connected to all the neurons in the second layer. Partial connection means
that each neuron in the first layer is connected to some of the neurons in the
second layer. Random connection means that each neuron in the first layer is
randomly connected to some of the neurons in the second layer. Full connection
is the most commonly used scheme in artificial neural networks. In addition, there
are two types of network architecture: single-layer and multilayer. A single-layer
architecture is composed of a group of artificial neurons in parallel; each neuron
usually receives the same input signal concurrently and generates one output
signal. In a multilayer architecture, three layers of neurons are often implemented
to operate sequentially: a layer of input units, a layer of hidden neurons, and a
layer of output nodes. The input layer only distributes the input signals to the
hidden layer; it is not responsible for modifying the signals. The layers between
the input units and output nodes are called hidden layers because they are hidden
from the outside world. The output layer produces the final output signals.
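To make this architecture concrete, the following R sketch traces one forward pass through a small, fully connected network with one hidden layer. The random weights and the logistic activation are illustrative choices, not the fitted network used later in this thesis.

    logistic <- function(z) 1 / (1 + exp(-z))

    forward_pass <- function(x, W_hidden, W_output) {
      # Every hidden neuron receives all inputs (full connection) and
      # computes a weighted sum, passed through the activation function.
      hidden <- logistic(W_hidden %*% x)
      # The output node computes a weighted sum of the hidden signals.
      as.numeric(W_output %*% hidden)
    }

    set.seed(1)
    W_hidden <- matrix(rnorm(3 * 2), nrow = 3)  # 3 hidden neurons, 2 inputs
    W_output <- matrix(rnorm(3), nrow = 1)      # 1 output node
    forward_pass(c(0.5, -1.2), W_hidden, W_output)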
The Applications of Artificial Neural Networks
In the real world, artificial neural networks have broad applications in many
industries, including biomedical fields. Since artificial neural networks are adept
at identifying patterns in data, they are capable of recognizing diseases. In fact,
artificial neural networks have been applied to recognizing disease patterns in
various medical imaging technologies, e.g. magnetic resonance imaging (MRI), x-ray
methods, computed axial tomography (CAT), ultrasound scans, etc. Artificial neural
networks are trained to recognize disease patterns through the use of examples,
which form the training data. In order to achieve good performance, however,
careful selection of training data is critical; it is important to provide training
data that characterize all the variations of the diseases. Furthermore, their
capability of learning from training data makes artificial neural networks unique
and adaptable: they do not require any further information, such as a specific
algorithm for identifying the disease patterns, to perform these tasks. For a more
detailed discussion of artificial neural networks see Gurney
(1997), as well as Zupan and Gasteiger (1999), for example.
METHODS
Simulation of SNP Data with Coalescent Model
Our application of artificial neural networks is to simulated SNP data for a
supposed chromosomal region of interest. In humans, the mutation and
recombination rates, θ and ρ, are approximately 1 per 1000 base pairs. In order to
apply the artificial neural network model we begin by simulating a large set of
training data. We do this by using the ms software of Richard Hudson (Hudson
2002), which fully implements the coalescent model under a variety of
complicating factors, including recombination. We generated samples under the
basic neutral model which assumes constant population size with no population
structure or selection, and an infinite-sites mutation model (intuitively speaking
the latter means that all mutations occur at unique positions along the
chromosome, see Tavaré 1984 for example). An infinite sites model was used
because Watterson’s estimate, which we will use as a benchmark, is known to
perform best under this scenario. Despite the obviously simplistic nature of such a
model, it has been shown to perform well in a variety of contexts (see Nordborg
2001, Schaffner et al. 2005, for example).
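For example, one set of training data could be generated by calls of the following form (ms takes the sample size, the number of replicates, θ via -t and, when needed, ρ and the number of sites via -r; the R wrapper, output file names, and parameter values here are illustrative, and the ms binary is assumed to be on the search path):

    # 1000 replicate data sets of 20 haplotypes with theta = 5 per region
    system("ms 20 1000 -t 5.0 > ms_theta5.txt")
    # the same, adding recombination at rate rho = 10 across 1000 sites
    system("ms 20 1000 -t 5.0 -r 10.0 1000 > ms_theta5_rho10.txt")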
Artificial Neural Network Package
A variety of artificial neural network software exists. We used the R
package ‘neural’. We explored a range of possible values for the number of
hidden nodes and layers of the artificial neural network in order to get the best
prediction model. Results presented in this thesis use one hidden layer.
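We do not reproduce the 'neural' package calls here; as a sketch of the general workflow, the following uses the more widely available nnet package instead (one hidden layer, as in our analyses; the toy data, the choice of 5 hidden nodes, and all variable names are illustrative):

    library(nnet)
    set.seed(2)

    # Toy stand-ins for the real inputs: one row of summary statistics per
    # simulated data set, and the parameter value used to simulate it.
    x_train <- matrix(runif(1000), ncol = 1)
    y_train <- 10 * x_train[, 1] + rnorm(1000, sd = 0.5)

    # One hidden layer; linout = TRUE gives a linear output unit,
    # appropriate for predicting a continuous parameter.
    fit <- nnet(x_train, y_train, size = 5, linout = TRUE,
                maxit = 500, trace = FALSE)

    x_test <- matrix(runif(200), ncol = 1)
    pred <- predict(fit, x_test)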
Format of Simulated SNP Data
Data simulated for a given individual by the coalescent model is represented
as a sequence of 0’s and 1’s, along with the locations (along the chromosome) at
which those mutations occur; the information on location is not used. The
length of the sequence is the number of mutations that appear in the final sample.
At any given position in the sequence, a 0 denotes that the SNP is wild-type (i.e.
non-mutant), while a 1 indicates the type is that of the new mutant. Thus, a
sample of 4 sequences (known as haplotypes), containing 10 mutations, might be
represented as:
1000110101
1100100010
0011001010
0010001101
along with a set of locations such as 0.1, 0.14, 0.23, 0.28, 0.35,… for example
(where the chromosomal region of interest is re-scaled such that 0 denotes the
left-hand end and 1 denotes the right-hand end).
When fitting the artificial neural network we tried a variety of ways of
representing the data. Firstly, we input the raw haplotype information. An issue
with this is that when using the artificial neural network each data set must have
the same dimension. Independent replicates of the evolutionary process, such as
those used to train and test the artificial neural network, will contain a number of
mutations that differs randomly from replicate to replicate, since only SNPs with
variations are included. We chose to circumvent this difficulty by appending a
sequence of 0’s to the haplotypes until each haplotype had length 100. This
number was chosen such that it was larger than the number of mutations that
appeared in any single data set we generated. An example of 4 chromosomes
containing 10 mutations is as follows:
1000110101000000………………….000
1100100010000000………………….000
0011001010000000………………….000
0010001101000000………………….000
Thus, the data is now of fixed dimension. It is not clear whether an artificial
neural network would work best if these haplotypes were represented as a set of
numbers or characters, so we explored both approaches.
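A minimal R sketch of the padding step (the function name is ours):

    # Pad each haplotype (a string of 0s and 1s) with trailing 0s so that
    # every haplotype, and hence every data set, has the same dimension.
    pad_haplotypes <- function(haps, width = 100) {
      sapply(haps, function(h) paste0(h, strrep("0", width - nchar(h))))
    }

    haps <- c("1000110101", "1100100010", "0011001010", "0010001101")
    nchar(pad_haplotypes(haps))  # all 100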
Summary Statistics to Capture Genetic Variations
Since we might reasonably expect such data to be hard to interpret, we also
tried summarizing the data in a number of ways and then inputting these
summaries when fitting the artificial neural network model. Given our wish to
estimate the recombination rate, several of the statistics we employed represent
one-dimensional summaries of the amount of linkage disequilibrium (LD) in the
data. LD is a measure of dependency between the patterns of mutation observed at
pairs of SNPs and is known to be strongly influenced by the degree of
recombination in the data (e.g. Nordborg and Tavaré, 2002). The statistics we
used were five popular measures of LD, denoted by S4, S5, S6, S7, and S8 below.
We also used two measures, S9 and S10, which are based on the number of
recombination events that could be inferred from the patterns of SNPs observed. A
brief explanation of these statistics follows shortly.
In order to introduce the summary statistics used here, we first introduce
some notation. Suppose we are considering a pair of SNPs, i and j. Let pmn denote
the proportion of haplotypes that are of type m at SNP i and type n at SNP j
(m, n = 0 or 1). Let p1● be the proportion of haplotypes that are of type 1 at SNP i
(and similarly for p●1 at SNP j). Finally, we say that a pair of SNPs satisfies
condition G if we observe each of the four types 00, 01, 10, and 11 on at least one
haplotype for those SNPs. (This is the so-called 4-gamete test; e.g. Balding et al.
2001.)
The specific data summaries we use are as follows:
S = the number of SNPs in the data (i.e. the number of columns containing at least one 1),
S1 = a list of positions for the mutations appearing in that sample (again, with an appended sequence of locations [set to 0], so that the sequences were of fixed length),
S2 = numerical representation of haplotypes, as "01110",
S3 = character representation of haplotypes, as "zero one one one zero",
S4 = the mean value of Δ across all pairs of SNPs,
S5 = the mean value of D' across all pairs of SNPs,
S6 = the mean value of p00 p11 - p10 p01 across all pairs of SNPs,
S7 = the mean value of p11 - p1● p●1 across all pairs of SNPs,
S8 = the mean value of r² across all pairs of SNPs,
S9 = the proportion of all pairs of SNPs that satisfy condition G,
S10 = the proportion of adjacent pairs of SNPs that satisfy condition G.
Δ, D' and r² are standard measures of LD (see Balding et al. 2001, for example,
for detailed definitions). S is the statistic upon which Watterson's mutation rate
estimate is based.
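As an illustration of how such summaries are computed, here is a minimal R sketch (function names ours) of S6 and S9 for a 0/1 haplotype matrix H, with one haplotype per row and one SNP per column:

    # S6: mean of p00*p11 - p10*p01 over all pairs of SNPs.
    stat_S6 <- function(H) {
      pairs <- combn(ncol(H), 2)
      mean(apply(pairs, 2, function(ij) {
        a <- H[, ij[1]]; b <- H[, ij[2]]
        p00 <- mean(a == 0 & b == 0); p11 <- mean(a == 1 & b == 1)
        p10 <- mean(a == 1 & b == 0); p01 <- mean(a == 0 & b == 1)
        p00 * p11 - p10 * p01
      }))
    }

    # S9: proportion of all pairs of SNPs satisfying condition G, i.e. all
    # four types 00, 01, 10 and 11 observed (the 4-gamete test).
    stat_S9 <- function(H) {
      pairs <- combn(ncol(H), 2)
      mean(apply(pairs, 2, function(ij) {
        length(unique(paste(H[, ij[1]], H[, ij[2]]))) == 4
      }))
    }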
RESULTS
Discrete Mutation Rate Prediction
In order to benchmark our results we begin with a simple scenario in which
we assume that the mutation rate for the data of interest is known to take one of
several fixed values, θ1, θ2, …. The training data contains 1000 data sets and the
test data 500. Each data set contains 20 haplotypes with the same mutation rate;
these 20 haplotypes are assumed to come from 20 independent individuals. In Table 1, we
show results of such an analysis for a variety of possible mutation rates. In this
analysis, the input to the artificial neural network model was S only (i.e. the
number of SNPs present). Column 1 indicates the range of fixed θ values
considered for the results in each row. If there are two different values of θ, then
each value of θ is used for roughly 500 replicates in the training set and 250
replicates in the test set. We measure performance by taking the estimate of
mutation, M, that is output by the artificial neural network and assigning the
estimated mutation rate to be the θi value that is nearest to M. We then calculate
the "Binning Error Rate" as the proportion of test data sets whose mutation rate is
incorrectly predicted. We compared this to the error under Watterson's method,
where the predicted mutation rate was again taken to be the θi closest to the
estimate resulting from Watterson's method. Unlike the measure "Absolute Error
Mean", which can provide a good indication of accuracy, the measure "Binning
Error Rate" is a clear way to see whether the method is working or not.
Therefore, the measure “Binning Error Rate” is applied in the simple setting to
see if the method works, and the measure “Absolute Error Mean” is used in the
complex setting to see how the method performs.
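In R, the binning step amounts to snapping each prediction to the nearest θi and counting mistakes; a small sketch (function and variable names ours):

    # Assign each predicted value M to the nearest theta in the fixed set,
    # then report the proportion of test data sets binned incorrectly.
    binning_error <- function(M, truth, thetas) {
      binned <- sapply(M, function(m) thetas[which.min(abs(thetas - m))])
      mean(binned != truth)
    }

    # Hypothetical example with theta in {1, 5}:
    M     <- c(1.3, 4.2, 6.0, 3.4)  # predictions for four test data sets
    truth <- c(1, 5, 5, 1)
    binning_error(M, truth, thetas = c(1, 5))  # 0.25: the last is mis-binned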
Table 1: Predicting mutation rate by artificial neural network model and Watterson's method using θ = {1, 2, 5, 10}, per 1000 base pairs

Fixed θ for Training and Test Sets | Replicates in Training and Test Sets* | Binning Error Rate | Watterson's Binning Error Rate
{1, 2} | 1000, 500 | 0.330 | 0.330
{1, 5} | 1000, 500 | 0.088 | 0.134
{1, 10} | 1000, 500 | 0.048 | 0.040
{2, 5} | 1000, 500 | 0.226 | 0.474
{1, 2, 5, 10} | 2000, 1000 | 0.476 | 0.399

*There are roughly 500 and 250 replicates for each fixed θ in the training set and test set, respectively.
The binning error rates are roughly inversely related to the difference between
the two fixed θ values: the worst binning error rate comes from the two closest
fixed θ values, {1, 2}, and the best from the two most distant values, {1, 10}. The
reason is that when two fixed θ values are close, the data sets they generate look
similar, so it is easier to bin a test data set incorrectly. It is also striking to
note that the estimate resulting from the artificial neural network model appears
to be comparable to that of Watterson's method (which is itself known to perform
very well) in this scenario. S is known to be an
almost sufficient statistic for θ in this context, so these results demonstrate that an
artificial neural network model performs well in a setting such as this (i.e. when a
good summary statistic is available).
Continuous Mutation Rate Prediction
To reflect a situation in which we might not know which of the summary
statistics carried information regarding the parameter of interest, we then
considered an example similar to that above, in which we presented the algorithm
with all summary statistics given in “Methods”, but in which the mutation rate
was again known to come from a fixed set of possible values (results not shown).
Once again the artificial neural network model slightly out-performed Watterson’s
estimate. This demonstrates that such an algorithm is able to select the ‘good’
statistics from the bad and still successfully infer θ (statistics S4 through S10
carry virtually no information regarding θ).
A more challenging and realistic scenario is when θ may come from a
continuous range of values. In Table 2, we show results of an analysis, in which θ
is known to lie somewhere between 0 and 10. We then consider a range of
examples in which the artificial neural network model is trained on data simulated
with a range of θ that is greater than that used in the test set. This reflects a
situation in which we are not entirely certain about the likely range of θ. Once
again we performed a range of analyses including all summary statistics, with
similar results, but for reasons of space only present results for the case in which
the input data is S only. Since the input summary statistic is S only, the artificial
neural network predictor tends to give the same estimate for any given value of S.
This kind of behavior is observed in Figure 3. We now
measure performance in terms of the absolute error mean, the mean value of
|θk - Mk|, where Mk is the value predicted for test value θk (k = 1, …, 1000).
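In R this is a one-line computation (function and variable names ours), applied to the ANN predictions and to Watterson's estimates alike:

    # Mean absolute error between the true test values and the predictions.
    abs_error_mean <- function(theta, M) mean(abs(theta - M))
    # e.g. abs_error_mean(theta_test, pred)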
These results are appealing in that we again out-perform Watterson’s
estimate in some cases. However, it is important to note that, in some sense, we
are conditioning on the knowledge that θ lies between 0 and 10, whereas
Watterson's estimate does not use that information. As is shown in Figures 3 and 4,
Watterson's estimate will often predict a θ value outside the range [0, 10],
whereas this is not true for our artificial neural network predictor. It was in an
attempt to mitigate this effect that we used training data with a wider range of θ
values than is found in the test data. (Note that the y-axis scales differ between
Figures 3 and 4.)
Table 2: Predicting mutation rate by artificial neural network model and Watterson's method using a continuous range of θ values (per 1000 base pairs)

θ Range for Training Sets | θ Range for Test Sets** | Replicates in Training and Test Sets* | Absolute Error Mean | Watterson's Absolute Error Mean
0-10 | 0-5 | 10000, 1000 | 1.1320 | 0.5826
0-10 | 0-5 | 10000, 1000 | 1.5501 | 1.3697
0-10 | 2.5-7.5 | 10000, 1000 | 0.5510 | 0.7743
0-10 | 2.5-7.5 | 10000, 1000 | 1.3145 | 1.9135
0-10 | 5-10 | 10000, 1000 | 1.4281 | 2.8885
0-10 | 5-10 | 10000, 1000 | 1.4421 | 2.8617

*The θ values are randomly distributed in the specified range.
**Each range of θ has two different test sets.
Figure 3: Output data of Artificial Neural Network (ANN) predicted mutation rates corresponding to the 3rd row of Table 2 (Training range = 0-10, Test range = 2.5-7.5).

Figure 4: Output data of Watterson's estimated mutation rates corresponding to the 3rd row of Table 2 (Training range = 0-10, Test range = 2.5-7.5).
The results we have shown so far illustrate that we can successfully use an
artificial neural network to estimate an evolutionary parameter, such as the
mutation rate, that is associated with a good summary statistic (and is, therefore,
reasonably easy to estimate).
Discrete and Continuous Recombination Rate Predictions
We now move on to consider the recombination rate, which is much harder
to estimate partly because of the lack of such an almost-sufficient summary
statistic. It is intuitively plausible that mutation rates should be easier to estimate
than recombination rates since mutations are directly observable (as SNPs, for
example), whereas recombinations are not. Only a small portion of the
recombination events that have actually occurred in the evolutionary history of a
sample can be inferred from the observed haplotypes (see Myers and Griffiths 2003,
for example). Analyses in which we estimated ρ having input only the haplotype
data (i.e. S2 or S3) performed poorly (results not shown). Thus, we considered
analyses in which summary statistics were used as input. Example results are
presented in Table 3 for a case in which data sets were simulated using
recombination rates of either ρ=0 or ρ=10. Estimates of ρ were then binned in the
manner described previously for mutation rate. We see that the method performs
reasonably well in this case. Note that there is no natural analogue here of
Watterson's mutation rate estimate, due to the greater inherent complexity of
estimating the recombination rate.
Table 3: Predicting recombination rate by artificial neural network model using the summary statistics S4 to S10 when sample recombination rates were either ρ = 0 or ρ = 10, per 1000 base pairs

Input Statistics for Artificial Neural Network | Fixed ρ | Replicates in Training and Test Sets* | Binning Error Rate
S9 | {0, 10} | 1000, 1000 | 0.082
S10 | {0, 10} | 1000, 1000 | 0.101
S9, S10 | {0, 10} | 1000, 1000 | 0.100
S4, S5, S6, S7, S8, S9, S10 | {0, 10} | 1000, 1000 | 0.105
S4, S5, S6, S7, S8, S9, S10 | {0, 10} | 10000, 1000 | 0.101

*There are roughly half of the total replicates for each fixed ρ in the training set and in the test set.
Given these benchmark results we then considered a more realistic example
in which ρ was drawn randomly from the interval [0, 10]. We used a training set
of 10000 samples, while the test set contained 1000 samples. Results are shown in
Table 4, where we see that we do somewhat worse than when estimating the
mutation rate.
Table 4: Predicting recombination rate by artificial neural network model using the summary statistics S4 to S10 when ρ was drawn randomly from the interval [0, 10] (per 1000 base pairs)

Input Statistics for Neural Network | ρ Range | Replicates in Training and Test Sets* | Absolute Error Mean
S4 | 0-10 | 10000, 1000 | 2.4742
S5 | 0-10 | 10000, 1000 | 2.4913
S6 | 0-10 | 10000, 1000 | 2.4935
S7 | 0-10 | 10000, 1000 | 2.5046
S8 | 0-10 | 10000, 1000 | 2.3435
S9 | 0-10 | 10000, 1000 | 2.1155
S10 | 0-10 | 10000, 1000 | 2.0227

*The ρ values are randomly distributed in the specified range.
DISCUSSION
The rapid growth in complexity and quantity of the molecular data currently
being collected suggests that new analysis methods will need to be developed.
Many existing techniques are likely to prove intractable in this new context. To
some extent the analysis of such data is likely to shift away from explicit
likelihood-based methods and towards traditional statistics and data-mining. One
promising approach is the use of machine learning algorithms. For example, Zhang
and Horvath (2006) have applied genetic algorithms to map functional mutations.
In this thesis, we have shown a first application of machine learning
algorithms to the estimation of evolutionary parameters. In some situations the
investigator will have a good intuition with regards to what functions of the data
(in the form of summary statistics) carry a strong signal for the parameters of
interest. In those cases, one might reasonably expect an artificial neural network
to perform quite well (as shown in the analyses in this thesis). However, it will
frequently be the case that no such intuition is present. In such situations one
would like to input the entire data set (i.e. the haplotypes), or a large set of
potentially useful statistics, into the algorithm and have the artificial neural
network capture appropriate functions of this data during the training process. The
examples shown here demonstrate that this will be a difficult process for
molecular data. It is entirely non-obvious how we should best represent haplotype
data so that traditional artificial neural network models, for example, can
successfully interpret it and extract the signal from the noise. Unsurprisingly
perhaps, the naïve approach taken in this thesis (concatenating the raw haplotype
data) performs badly. Thus, greater thought will need to be given to this issue.
It is also important to remember that the performance of methods such as
those we present here is likely to be affected by the degree of closeness with
which the model we use to generate the training data successfully approximates
the processes by which the actual, observed data was generated. In our context,
there is a large body of literature that demonstrates that the coalescent model can
be used to generate data that mimics that which we see in reality (see Schaffner et
al. 2005). However, even given this fact, the value of other ‘noise’ parameters
(such as population growth rate) will influence the performance of the model. The
effects of such parameters might be integrated over within a Bayesian framework, for
example, but the degree to which results are robust to the effects of the values of
other parameters remains to be explored.
In summary, the results shown here demonstrate that artificial neural
networks represent a promising approach for the analysis of complex, multi-
dimensional modern molecular genetic datasets. While the issue of choice of
summary statistics, viewed as compact ways of representing the data, remains
key, our results suggest that artificial neural networks can perform well in this
context even when the set of statistics under consideration is large and contains
many statistics that carry little signal regarding the parameters of interest.
REFERENCES
Balding DJ, Bishop MJ, and Cannings C, 2001. “Handbook of Statistical
Genetics”. John Wiley & Sons, Inc., New York.
Craig DW, Stephan DA, 2005. “Applications of whole-genome high-density SNP
genotyping”. Expert Rev Mol Diagn. 5:159-70.
Gurney K, 1997. “An Introduction to Neural Networks”. UCL Press.
Hudson RR, 1990. "Gene genealogies and the coalescent process". In Oxford
Surveys in Evolutionary Biology, eds. Futuyma D and Antonovics J. 7:1-44.
Hudson RR, 2001. “Two-locus sampling distributions and their application”.
Genetics, 159:1805-1817.
Hudson, RR, 2002. “Generating samples under a Wright-Fisher neutral model”.
Bioinformatics, 18:337-8.
Kingman JFC, 1982. "The coalescent". Stoch. Proc. Applns., 13:235-248.
The International HapMap Consortium, 2005. "A haplotype map of the human
genome". Nature, 437:1299-1320.
The International HapMap Consortium, 2007. "A second generation human
haplotype map of over 3.1 million SNPs". Nature, 449:851-861.
Li N, and Stephens, M, 2003. “Modelling Linkage Disequilibrium, and
identifying recombination hotspots using SNP data”. Genetics, 165:2213-2233.
Marjoram P, and Donnelly P, 1994. "Pairwise comparison of mitochondrial
DNA sequences in subdivided populations and implications for early human
evolution". Genetics, 136:673-683.
Myers SR, and Griffiths RC, 2003. “Bounds on the Minimum Number of
Recombination Events in a Sample History”. Genetics, 163:375-394.
Myers S, Bottolo L, Freeman C, McVean G, and Donnelly P, 2005. “A fine-scale
map of recombination rates and hotspots across the human genome”. Science,
310:321-324.
Nordborg M, 2001. “Coalescent theory”. In Handbook of Statistical Genetics, eds.
Balding DJ, Bishop MJ, and Cannings C. pages 179-208, John Wiley & Sons,
Inc., New York.
Nordborg M, and Tavaré S, 2002. "Linkage disequilibrium: what history has to
tell us”. Trends in Genetics, 18:83-90.
Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, and Altshuler D, 2005.
“Calibrating a coalescent simulation of human genome sequence variation”.
Genome Research, 15:1576-1583.
Shriver M, Kennedy G, Parra E, Lawson H, Sonpar V, Huang J, Akey J, and
Jones K, 2004. “The genomic distribution of population substructure in four
populations using 8,525 autosomal SNPs". Human Genomics, 1(4):274-286.
Tavaré S, 1984. “Line-of-descent and genealogical processes, and their
applications in population genetics models”. Theor. Popn. Biol., 26:119-164.
Watterson GA, 1975. “On the number of segregating sites in genetical models
without Recombination”. Theor. Popn. Biol., 7:256-276.
Zhang B and Horvath S, 2006. “Ridge regression based hybrid genetic algorithms
for multi-locus quantitative trait mapping”. Int. J. Bioinformatics Research and
Applications, 1: 261-272.
Zupan J and Gasteiger J, 1999. “Neural Networks in Chemistry and Drug
Design”. Wiley-VS.