Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Application of Bayesian methods in association mapping
(USC Thesis Other)
Application of Bayesian methods in association mapping
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
APPLICATION OF BAYESIAN METHODS IN ASSOCIATION MAPPING
by
Keyan Zhao
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(STATISTICS)
May 2007
Copyright 2007 Keyan Zhao
Dedication
to my parents
ii
Acknowledgments
I would like to use this opportunity to express my special thanks to John Molitor and
Paul Marjoram. The work is mainly done in collaborating with them. I thank John
for the patience in explaining his spatial statistical ideas and the time spent in helping
me coding in C++. I thank Paul for bringing me to the field of statistical genetics, the
marriage of statistics and genetics. I would also like to thank Fengzhu Sun, my thesis
advisor, for his support in making this thesis possible. Furthermore, I am indebted to the
discussions and valuable comments from my thesis committee members Larry Goldstein
and Jasmine Zhou.
K.Z.
Los Angeles, California
March 2007.
iii
Table of Contents
Dedication ii
Acknowledgments iii
List of Figures vi
List of Tables vii
Abstract viii
Introduction 1
LD mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Unphased genotype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Chapter 1: Methods 5
1.1 Bayesian spatial statistical model . . . . . . . . . . . . . . . . . . . 5
1.1.1 Model description . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2 Estimation methods . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Extensions for unphased diploid data . . . . . . . . . . . . . . . . . 9
1.3 Interpretation of output . . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 2: Applications 14
2.4 Cystic Fibrosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 GAW14 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5.1 Data description . . . . . . . . . . . . . . . . . . . . . . . 16
2.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Chapter 3: Discussion 22
References 24
Appendices 26
iv
A MCMC algorithm 27
A.1 General scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
A.2 Reversible Jump MCMC . . . . . . . . . . . . . . . . . . . . . . . 27
v
List of Figures
1.1 Emperical 95% Confidence Interval for the risk termδ associated with
each individual. Each individual is shown as a 95% Confidence Inter-
val vertical bar with big dots in the middle representing mean. Top
figures includes all individuals in the sample and bottom figure only
shows those significant individuals that don’t overlap zero. . . . . . 12
2.2 Posterior distribution for functional mutation site in CF dataset. Red
line represents the true position. . . . . . . . . . . . . . . . . . . . 15
2.3 Graphical representation of the genetic model used in the simulation.
D1-D4 are disease-causing loci. D5 and D6 influence disease expres-
sion if the disease genotype is present. P1-P3 are different phenotypes
caused by the disease loci to which they are connected by the lines.
The “a” and “b” after the phenotype designation indicate identical
phenotypes but caused by different genotypes. D5 changes pheno-
type P2a into P1 when allele 1 is present. D6 changes the penetrance
of P2b when allele 1 is present. . . . . . . . . . . . . . . . . . . . 17
2.4 Posterior distribution for functional mutation in first 6 replicates for
phenotype e. The figure shows the posterior distribution for the loca-
tion of the functional mutation related to trait e in a phase-unknown
analysis of each of the first 6 replicates of the packet 153 data. . . . 19
vi
List of Tables
2.1 Power study for GAW data . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Inferred functional locus across 100 replicates . . . . . . . . . . . . 21
vii
Abstract
There is great interest in the use of computationally intensive methods for fine mapping
of marker data. This thesis proposed a method to analyze diploid phase-unknown data
using ideas derived from Bayesian spatial statistics. We use spatial clustering of hap-
lotypes as a low-dimensional surrogate for the unobserved genealogy underlying a set
of genotype data. In doing so we hope to avoid the computational complexity inher-
ent in explicitly modelling details of the ancestry of the sample, while at the same time
capturing the key correlations induced by that ancestry at a much lower computational
cost.
We benchmark our methods using both the Cystic Fibrosis dataset and the simulated
GAW 14 data. For the GAW 14 dataset, we used 100 replicates of 4 phenotypes to
indicate the power of our method. When a functional mutation relating to a trait is actu-
ally present, we find evidence for that mutation in 97 out of 100 replicates, on average.
Results show that our method has the ability to accurately infer the location of functional
mutations from unphased genotype data. This shows the promise of integrated approach
in fine mapping this kind of genetic data.
viii
Introduction
LD mapping
One of the key questions in genetics and genomics is to understand the genetic basis
of common human diseases such as asthma, diabetes and schizophrenia. The last few
year has seen a rapid expansion of human genetic data. The finish of the sequencing of
the human genome in 2001 provided a genetic blueprint for the human genetic analy-
sis [Lan01, Ven01]. Yet there are differences between individuals in the genetic level.
The most predominant are the nucleotide polymorphisms. To find and characterize
these nucleotide polymorphisms in human populations, HapMap project [hap03, Con05]
was initiated and has thus far found over millions of SNPs (single nucleotide polymor-
phisms) in a sample of 270 individuals from 4 different ethic origins. These data pro-
vides us great opportunities to identify genes and causal disease mutations involved in
human diseases.
Observed patterns of these polymorphism data reflect the ancestral history of the
sample and the effects of mutation on that ancestry. Furthermore, the presence of
functional mutations on individuals induces correlations in phenotypes of interest and
implies increased similarity among those individuals in the regions surrounding those
mutations. This effect is degraded by the action of recombination over time, the exis-
tence of multiple mutations which may effect that phenotype, and the existence of other
1
factors which are likely to influence the phenotype of interest (such as population struc-
ture for example). Nonetheless, the pattern of non-random association between loci,
known as linkage disequilibrium (LD), has been used as the successful basis for a vari-
ety of methods that map functional mutations.
Over the last few years, association mapping has emerged as an important tool for
locating disease mutations by detecting associations between the incidence of a genetic
polymorphism and that of a disease.The key of association mapping relies on LD, which
makes it also called LD mapping. The basic idea behind LD mapping is straightforward:
one simply looks for genetic polymorphisms (genome-wide or in a candidate region) that
are statistically associated with the phenotype in samples of unrelated individuals. The
word unrelated here simply means that the individuals are not offspring of a controlled
cross or members of known pedigrees, because all individuals are of course related at
some distance. LD mapping would not work otherwise: the reason genetic markers may
be expected to be associated with phenotypic differences in the general population is
that individuals with similar phenotypes often share causative alleles that are identical
by descent, alleles which are embedded in conserved ancestral haplotypes.
However, a fundamental issue arising with the use of LD is that it is highly variable in
nature. This variability reflects the randomness inherent in the underlying evolutionary
processes that result in the data we see in the present day. Factors such as stochastic
drift, recombination and selection all strongly influence the pattern of LD that finally
results. This pattern itself is but a snapshot of a pattern that is evolving quickly over
time. This high level of variability means that it is not uncommon, in fact it is usual, to
see nearby loci with completely different patterns of LD, as well as a high degree of LD
between loci that are far apart even though the intermediate loci show little LD. Thus,
it is highly likely that some degree of modelling will be required in order to extract the
underlying signal from the superficial noise. As we have discussed above, the marker
2
data resulting from a sample is the result of the action of evolutionary forces such as
recombination and mutation over the ancestral history of the sample.
In principle , this ancestral history can be modeled by parametric models based on
coalescent theory [JFC82]. To explicitly model the whole process, it is extremely diffi-
cult to derive maximum-likelihood estimates for the parameters. Fortunately, Bayesian
methods that use Markov Chain Monte Carlo (MCMC) techniques provide a power-
ful way. MCMC techniques iteratively update a parameter’s value based on the current
estimate of all other parameters. Likelihoods that are hard to estimate jointly can be han-
dled easily by examining one parameter at a time, conditional on the other parameters
in the model. Although the methods utilizing or approximating coalescent models have
been proved to be powerful in many applications(e.g. [GT98, MWB00a, MWB02]), the
complexity of such models in the presence of these complicating factors is enormous,
which brings in the burden of computational efforts. Thus these methods are usually
limited to small samples and small genomic regions. There are also other methods
[MMT03a, MMT03b, DZC
+
04], using an approach more abstract in nature, instead of
modeling the coalescent process, they utilized the genetic consequences of the coales-
cent tree, and hope that they would capture most of the ancestral information necessary
for the inference of diseases in a way as simply as possible. The ideas are borrowed from
spatial statistics to produce the clustering of the data based on the shared observed geno-
type and similar phenotype. The latter type of analyses are much less complex in nature
than the former type. Thus, while they might lose some information through the approx-
imation to the underlying genealogy history, they gained efficiency in terms of compu-
tational time and are therefore likely to be able to handle larger datasets. Although both
types of methods have cons and pros, this thesis is mainly focused on the more abstract
models.
3
Unphased genotype
Current high-throughput genotyping technology produce only phase-unknown diploid
genotype data. Thus most of the association mapping methods apply an 2-stage
approach. First, haplotype phase is inferred from the phase-unknown genotypes using
softwares available such asPHASE [SST
+
01, SSD01, SD03]. Then association is done
on the inferred haplotypes. However, there are uncertainties in the process of phase
inference. Without allowing for this uncertainty in the method may leads to problems.
Falsely assigned haplotypes would never have the chance to be corrected in the method.
Thus, we regard an integrated approach, in which phase is estimated (or mixed over) as
part of the overall analysis, to be preferably. Conceptually at least, many of the algo-
rithms mentioned above could incorporate phase uncertainty by adding a layer to the
MCMC scheme in which phase is estimated, and mixed over, as part of the analysis in
an integrated way. In essence, this allows phase to be treated as a nuisance parameter. A
recent study by Morris et al. [MWB04] has shown that a two-stage approach of inferring
haplotypes followed by a haplotype-based analysis can be very inefficient for fine map-
ping, compared with an analysis based directly on the genotypes. See [LNL03, CCC04]
for more discussion of related issues.
In this thesis, we extend the methods of Molitor et al. [MMT03b] to contexts in
which we are presented with diploid, rather than haploid data, and explore phase assig-
nent as an integrated part of the analysis. We employ a Markov chain Monte Carlo
[MCMC] algorithm in which haplotypes are reconstructed from the genotype data as
part of the analysis, exploring the space of all likely haplotypes that are consistent with
the genotype data as an explicit part of the fine mapping analysis. In this case, the
uncertainty in the identification of haplotypes is explicitly included within the analysis,
thus leading to more realistic estimates of certainty in the posterior distribution for the
location of any functional mutations.
4
Chapter 1: Methods
1.1 Bayesian spatial statistical model
1.1.1 Model description
The key idea behind our method is the spatial statistical assumption: “similar” hap-
lotypes are likely to carry a common disease-causing variant and hence have “simil-
iar” risk to the disease. Similar ideas are employed in [MWB02, MWB00b], in which
regions (haplotypes in this case) often display some kind of spatial dependence structure
and regions of higher risk often cluster together.
One way of using this idea is utilizing spatial smoothing techniques to perform fine
mapping, as illustrated in [TSC
+
03, MMT03a]. In these 2 papers, the authors imposed
a kind of dependency structure on the haplotype effects so that similar haplotypes are
induced to similar disease risks and a conditional autoregressive (CAR) prior is used.
A matrix of weights was used to represent how close one haplotype is to another, with
higher values indicating higher similarity. Then conditionally, the prior for each hap-
lotype risk is expressed as a univariate normal distribution centered on the weighted
average of all the haplotype risks.
Another approach is clustering haplotypes directly based on the haplotype similarity.
There would be cluster “centers” serve as seed, and all haplotypes assigned to one of the
5
clusters based on the distances to the “centers”. Each cluster would have a unique dis-
ease risk value and all haplotypes assigned to this cluster would share the same risk. This
idea originated from [V or08] by V oronoi in 1908. Actually, the idea was widely used
[OBS92] in statistics. Even in the area of disease mapping, there were some research on
the spatial mapping of disease rates for small areas (e.g. [KHR00, DH01] ).
Our methods is an extension of this type clustering methods in [MMT03b], so I will
introduce the key points in details as in the original paper.
Suppose we have diploid data with I individuals with phenotypes y
i
,i = 1,...,I
and haplotypes h
ij
, j = 1 or 2, where h
ij
∈ 1,...,H with H denoting the number of
haplotypes. Here we just simply added the effects of 2 haplotypes to get the estimate
of an individuals’ risk. This is called an additive genetic model, which was also what
was focused on in the original paper[MMT03b]. An extension to estimate the genetic
model ( dominant , recessive or additive model) will be introduced in the later part of
the method section. If the phenotypes are continuous, we can write the model as:
y
i
=
2
X
j=1
γ c
h
ij
+
i
. (1.1)
where C
h
ij
∈ 1,...,C , with C
h
ij
denotes the cluster where haplotype h
ij
belongs to
and withC denotes the total number of clusters. Theγ c
represents the risk for haplotype
centerc, which has a normal distributionγ c
∼ N(α,σ 2
γ ).
For case-control study, we would have binary phenotype, then the model can be
written as:
Probit[Pr(y
i
= 1)] =
2
X
j=1
γ c
h
ij
. (1.2)
6
For case-control, we can reasonably set α = 0 and σ 2
γ ) = 1, which is equivalent to
setting a uniform prior Unif(0,1) on the probit probabilities of p
i
= Pr(y
i
= 1). For
datasets with only one haplotype per phenotype, the model can be further simplified as:
Probit[Pr(y
i
= 1)] = γ c
h
i
. (1.3)
whereγ c
∼ N(0,1).
To allow the full conditional distributions to follow standard forms, we use the con-
vention [GRS96] in transforming equation (1.3) into a normal model :
y
∗ i
= γ c
h
i
+
i
. (1.4)
Where∼ N(0,1). In this way, we are basically modeling the binary phenotypey
i
as an underlying continuous phenotypey
∗ i
.
1.1.2 Estimation methods
In the algorithm, the first step is to randomly define a set of haplotype “centers”T with
one center t
c
∈ T for each cluster c. There is also another parameter associated with
each cluster c, the mutation position x
c
. The cluster center haplotypes can be viewed
as the ancestral or most common haplotypes carrying the disease mutation at x
c
. After
assigning the center haplotypes for the clusters, we then deterministically allocate all
the haplotypes in the sample to the cluster with closest distance to its center haplotype.
One important factor that affects the performance of the method is the similarity matrix:
the measurement of how close each pair of haplotypes are. An intuitive way is based
on haplotype sharing. The similarity Lht
c
between haplotype h and center t
c
can be
calculated as the shared length identical by state(IBS) between the two haplotypes atx
c
.
We will give an improved similarity in the later part of this chapter.
7
We fit the model using standard MCMC estimation procedures.Essentially, each
parameter is updated conditional on the current estimate of other parameters. At each
stage, a parameter’s distribution conditional on other parameters is evaluated. See
[GRS96] for more detailed descriptions on the theory.
To allow the number of clusters to be random instead of arbitrarily fixed, we employ
reversible jump methods[Gre95]. We employ a simpler method proposed by [DH01],
a method in which, whenever the number of clusters changes, we integrate out the
values of g in equation (1.4). For the parameters, we define y = (y
∗ 1
,...,y
∗ I
) ;
γ = (γ ∗ 1
,...,γ ∗ I
);H denote the space of unique haplotypes in the data set andT denote
the current set of haplotype centers. We now introduce the conditional distribution for-
mula
f(y
∗ |T,H) =
Z
∞
−∞
...
Z
∞
−∞
f(y
∗ | γ, T,H)f(γ )d
γ 1
... d
γ C
. (1.5)
Given the distribution of y
∗ follows equation (1.4), f(y
∗ | γ, T,H) is a normal
distritution with variance of 1. Thus, it can be written as:
f(y
∗ | γ, T,H) =
I
X
i=1
1
√
π e
− (y
∗ i
− y
∗ i
)
2
/2
. (1.6)
wherey
∗ i
is the expected mean ofy
∗ i
based on the cluser assignment for individuali. For
haploid with only one chromosome, it would be γ c
i
, for diploid with 2 chromsomes, it
can be expressed as a variable model between additive model and dorminant model, as
explained later in equation (1.10).
When one proposes the addition of a new cluster, one would normally have to pro-
pose a new cluster riskγ c
as well. However, following [DH01], we propose a new cluster
based on construction of likelihoods obtained after integrating out the values of γ . At
the end of each cycle of the Gibbs sampler, we then generate the current set of values
8
forγ , using the full conditional distribution of eachγ c
based on the non-integrated like-
lihood, derived from equation (1.4). Once these parameters are updated, the values of
y are updated as described above. Considering the extra parameter of mutation site x
c
,
cluster assignment c
h
and number of clustersC, the joint distribution of all parameters
corresponding to the model formulation can be written as:
f(y
∗ | y,γ,c
h
)f(γ |C)f(c
h
|C,T,H,x)f(T )f(x). (1.7)
The MCMC procedure is based on this joint distribution equation (1.7). See
Appendix for more details on the MCMC procedure on estimating parameters.
1.2 Extensions for unphased diploid data
We assume we have binary marker data at J loci for I diploid individuals, consist-
ing of phenotypes y
i
, i = 1,...,I, and genotypes g
i
= {g
i1
,g
i2
,...,g
iJ
} where
g
ij
= 0,1, or 2, represents the number of copies of the less-frequent allele at locus j
for individuali. We employ extensions to the model of [MMT03b] in which haplotypes
are clustered according to ideas borrowed from spatial statistics.
At any given step of the MCMC algorithm, the set of genotype data will be resolved
into a set of2I haplotypes. Between iterations, we alter the way genotypes are resolved
into haplotypes by taking a random subset of the loci for each genotype and reversing the
way those loci are resolved for that genotype (so that alleles that were previously on the
first of its two haplotypes will now be on the second haplotype and vice-versa). We let
h
ik
, k = 1,2 denote the two haplotypes into which genotypeg
i
has been resolved. The
set of all current haplotypes is then clustered according to a similarity measure based on
shared-length IBS. Each cluster,c, is determined by a cluster centerh
c
, which is a haplo-
type, andx
c
, the location along the haplotype of a putative disease-associated mutation.
9
Note that x
c
is free to take different values for different clusters at any given iteration.
For a given set of cluster center haplotypes, we cluster each of our current sample haplo-
types by calculating the shared lengths IBS atx
c
between each of the sample haplotypes
and each of the haplotypes corresponding to a cluster center. LetL
mn
denote the shared
length between sample haplotype m and center haplotype n. To further improve the
similarity matrix, in a somewhat ad hoc attempt to avoid biases introduced by unequal
marker spacing, we defineL
mn
as a ratio of observed and expected shared length atx
c
:
L
mn
=
L
mn
E[L
mn
]
. (1.8)
where the expected shared length is calculated empirically from the observed data. As
per the methods in Molitor et al. [MMT03a] we assign haplotypes to clusters in a deter-
ministic manner, dependent upon shared length. Specifically, haplotype m is assigned
to the cluster n for which the value of L
mn
is highest. In principle, one could use any
other similarity metric, but clearly the power of the method will be adversely affected
by using a metric that does not capture local haplotype similarity in an efficient way.
Each cluster has an associated parameterγ c
which is used to define the expected trait
value for haplotypes assigned to that cluster. We use this as the basis of a probabilistic
assessment of the ability of the current clustering to explain the observed phenotypes.
We letc
h
ij
denote the cluster to whichh
ij
is assigned and write
y
i
= α +δ (γ c
h
i1
,γ c
h
i2
,φ )+
i
, (1.9)
whereα represents an intercept term, φ is a variable that takes values between 0 and 1,
i
∼ N(0,σ 2
) is an error term, and
δ (γ 1
,γ 2
,φ ) = (1− φ )min{γ 1
,γ 2
}+φ max{γ 1
,γ 2
}. (1.10)
10
We mix over φ as part of the MCMC algorithm. φ can be thought of as indicating
whether a recessive (φ = 0), dominant (φ = 1), or intermediate model is appropriate.
When φ = 0, the fitted risk for an individual is determined by the minimum of the
two haplotype risks and will therefore only take a high value if both haplotypes contain
the functional mutation (and therefore have a high risk themselves). When φ = 1,
individual risk will be high if either haplotype risk is high, which reflects a dominant
scenario. When φ = 0.5, for example, we have an additive model. For the binary
phenotypes, we use a probit link in the above model.
The MCMC algorithm explores the parameter space corresponding to the model,
including the number of clusters, cluster parameters and centers, assignment of geno-
types to haplotypes and imputed values for any missing marker information.
1.3 Interpretation of output
An approach such as ours provides a full clustering of the data, as well as assignment
of risks (i.e. γ c
’s) to haplotypes, and locations of putative functional mutations at each
iteration. It is unclear how best to exploit this output. We take the following approach.
For each diploid individual we construct an empiric 95% ‘confidence interval’ (CI) for
the risk termδ associated with that individual in equation (1.9). We do this by collating
theδ values associated with that individual across all iterations (after the usual MCMC
burn-in period) and constructing the smallest interval that contains the middle 95% of
those values. We label a data set as showing evidence for the presence of a functional
mutation if there is any individual for which the 95% CI forδ does not overlap 0.
An example is shown in Figure 1.1. For those data sets that show evidence for
the presence of a functional mutation (according to the definition above) we go on to
construct a posterior distribution for the location of that functional mutation. We do
11
!0.4 !0.2 0.0 0.2 0.4
Confidence Intervals
Haplotype Effects
!0.4 !0.2 0.0 0.2 0.4
Confidence Intervals
Significant Haplotypes
Figure 1.1: Emperical 95% Confidence Interval for the risk termδ associated with each
individual. Each individual is shown as a 95% Confidence Interval vertical bar with big
dots in the middle representing mean. Top figures includes all individuals in the sample
and bottom figure only shows those significant individuals that don’t overlap zero.
12
this by recording the location x
c
associated with each cluster at each iteration of the
algorithm. Each time a givenx
c
value is observed, we add a weight ofw
c
to the posterior
distribution for location at x
c
, where w
c
is defined as the probability that a N(γ c
,1)
random variable takes a value greater than 0. This weight is suggested by our use of a
probit link function, where P(y
i
= 1) = Φ( α +γ c
), where Φ( · ) is a standard normal
cdf. Thus, locations corresponding to clusters with a high risk parameter are given high
weight, whereas those corresponding to clusters with low risk are given low weight.
13
Chapter 2: Applications
2.4 Cystic Fibrosis
The cystic fibrosis (CF) dataset [KRB
+
89] was widely used as a benchmark dataset to
test association fine mapping methods. This dataset has 23 RFLP markers. It consists
of 92 haplotypes in the control group, and 94 in the disease group. Arbitrarily setting
the first location to 0, and then adding up intermarker distances to obtain the locations
for other markers, the marker locations range from 0.001.73 cM. It is known that one
founder mutation, DF508; is located between markers 17 and 18, approximately 0.88cM
away from the leftmost marker, and accounts for 67% of disease chromosomes. We
analyzed the dataset using the haplotype model instead of diploid model. In the MCMC
process, a burn-in of 20,000 iterations and 100,000 iterations are carried out to estimate
the posterior distributions.
As shown in Figure 2.2, the posterior distribution of disease mutation is correctly
pinpointed.
14
0.0 0.5 1.0 1.5
0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35
Posterior distribution for Mutation
Location (cM)
Figure 2.2: Posterior distribution for functional mutation site in CF dataset. Red line
represents the true position.
15
2.5 GA W14 Data
2.5.1 Data description
The Genetic Analysis Workshops (GAWs) are a collaborative effort among genetic epi-
demiologists to evaluate and compare statistical genetic methods. By providing some
real genetic data or simulated data that mimic the real data, they serve as a forum for
statisticians, epidemiologists, geneticists, and other scientists interested in these fields
to introduce novel statistical methods and to evaluate and compare novel and exist-
ing methods. The main theme of GAW14 is the comparision of microsatellite and
single-nucleotide polymorphism (SNP) markers for genome-wide scans and the statisti-
cal methods that can best exploit the information provided in such scans for linkage and
association studies.[BWAdA
+
05] The simulated data in this conference is designed to
have similarities to the real dataset. It is widely accepted that complex disease have both
genetic and environmental causes. The disease simulated for GAW14 was Kofendrerd
Personality Disorder (KPD). It is a psychiatric syndrome characterized by an appar-
ent overwhelming concern, or even obsession, with the meaning of the patient’s inner
emotions and world view but at the same time subsuming the emotions of others into
the self. The diagnosis of this disease is ambiguous. There are multiple measurement
used to define this disease. To reflect the complex genetic interaction and diagnostic
ambiguity of KPD, the genetic model is summarized in Figure 2.3 from [GZS
+
05].
There were four geographically diverse “groups” in this dataset. We only focused on
one of them: the group from the county of Aipotu. Since there are too many phenotypes
in the dataset, we also only focused on 4 phenotypes used in the thesis. (e. Humor
impairment; f. Fascination with automobiles; g. Aversion to walking; h. Aversion to
walking ) The genomic region we are testing is around disease locus D2. Because our
16
Figure 2.3: Graphical representation of the genetic model used in the simulation. D1-
D4 are disease-causing loci. D5 and D6 influence disease expression if the disease
genotype is present. P1-P3 are different phenotypes caused by the disease loci to which
they are connected by the lines. The “a” and “b” after the phenotype designation indicate
identical phenotypes but caused by different genotypes. D5 changes phenotype P2a into
P1 when allele 1 is present. D6 changes the penetrance of P2b when allele 1 is present.
algorithm is mainly for fine-mapping. we only analyzed the dataset containing 19 SNPs
around the causal D2 locus (packet 153).
2.5.2 Results
The results we present here are an analysis of traits e, f, g and h for the Aipotu population
using the packet 153 data. In an attempt to estimate the likely power of our method
we analyzed 100 replicates for each of these traits. We argue below that, due to the
17
simulation method employed, the signal for the functional mutation for traits e, f and h
should be found in this packet, several SNPs in from the righthand end. As an example
of the output obtained from our method, in Figure 2.4 we give outputs from a phase-
unknown analysis for the location of the functional mutation related to trait e in the first
six replicates. Output for other replicates, and for analyses of traits f and h, are similar.
Note that, in general, no individuals are found to be significant when analyzing trait g.
This indicates that no evidence for a mutation related to trait g is indicated.
We note that our method makes no use of pedigree information when inferring phase.
It treats the data as if it were a random sample from the population of interest. We
chose to use all the data in each replicate, ignoring the pedigree information. As such,
it is interesting to note that even when applied in an environment for which it is not
explicitly designed, the algorithm appears to perform well and to be robust to departures
from assumptions about the sampling scheme.
We summarize our results across all replicates as follows. For each phenotype we
collect the analyses of the 100 replicates and record how often at least one genotype is
found to be significant in the analysis. This is used as an indication of evidence that a
functional mutation is present in the packet. In table 1 we summarize how often a signif-
icant genotype was found for each of the phenotypes of interest across all 100 replicates.
This gives us an indication of the power of our method. We see that our method finds
evidence of a functional mutation in almost all of the simulated data sets for which a
functional mutation is present. Interestingly, for phenotype g, we find evidence of a
significant genotype effect in 14 out of 100 replicates. This provides an estimate of the
false-positive rate for our method. This number appears reasonable in light of the failure
to allow for familial correlations. We construct a 95% confidence interval for the mean
genotype risk for each individual, but there are many (heavily correlated) individuals,
so our overall false-positive rate should be at least 5%.
18
Figure 2.4: Posterior distribution for functional mutation in first 6 replicates for phe-
notype e. The figure shows the posterior distribution for the location of the functional
mutation related to trait e in a phase-unknown analysis of each of the first 6 replicates
of the packet 153 data.
Given that there is at least one significant individual, we inspect the posterior distri-
bution for the location of the functional mutation for that replicate and record the marker
that has highest posterior mass. We then collect this information for all 100 replicates for
that phenotype. This gives us an indication of the accuracy of our method when a func-
tional mutation is indicated. These summaries are shown in Table 2. We note that the
location of the functional mutation is typically inferred with great accuracy. We argue
below why the signal should be found a few loci in from the righthand end of the region,
19
Phenotype e f g h
Number of replicates with evidence of functional mutation 96 98 14 98
Table 2.1: Power study for GAW data
around locus 16, rather than at the end itself. In the final column we report results for an
analysis of trait e in which we pre-process the data using the PedPhase program to infer
haplotype phase information and then use a version of our program which assumes the
inferred phases are true (and therefore does not explore other possible decompositions
of genotypes into haplotypes). Such an analysis finds evidence for a functional muta-
tion in 97 of the 100 replicates, but we note that the location of the functional mutation
appears to be inferred with less accuracy. Given the family-based nature of the simula-
tion study, we can assume that PedPhase will infer haplotypes with a reasonable degree
of accuracy. Therefore these results give some preliminary indication that the loss of
signal when using our method on unphased genotype data is minimal. In contexts in
which data represents a random sample from a population, or a case-control study, one
might reasonably expect programs such as PedPhase to infer haplotypes with a much
lower degree of accuracy, and that our more integrated analysis will therefore continue
to have superior performance to approaches that directly analyze inferred haplotypes.
When there is no association between a trait and the packet under consideration, i.e.
for trait g, we note that there appears to be a tendency to suggest a location towards
the righthand end of the packet. When we repeatedly re-analyzed these data sets after
randomly permuting the phenotypes we found no tendency for the analysis to suggest
functional mutations in these (or any other particular) locations. This suggests that these
(relatively few) spurious signals might be a consequence of the particular way in which
this trait was simulated.
20
Phenotype e f g h e (inferred phase)
locus 16 81 78 7 75 70
locus 17 9 9 1 11 19
locus 18 3 7 4 6 2
other loci 3 4 2 6 6
Table 2.2: Inferred functional locus across 100 replicates
21
Chapter 3: Discussion
We presented a Bayesian methods for analyzing diploid phase-unknown data, by extend-
ing the Bayesian spatial methods in [MMT03b]. We have demonstrated the potential
of more ‘heuristic’ approaches to fine mapping. Such approaches aim to capture the
correlations induced by the unobserved genealogy of a sample without incurring the
computational burden that a full coalescent-based model would imply. For example,
the analyses in the GAW14 dataset take of the order of 3.5 hours each. In doing so we
make it possible to analyze much larger data sets than are traditionally possible using
coalescent-based methods. In particular, one might hope to use methods such as ours to
perform a genome-wide scan. We will investigate this issue in future work.
We regard the results in this thesis as highly encouraging and feel that heuristic
methods such as ours offer the potential to be applicable to much larger data sets than
more intensive, model-based models. The results also suggest that loss of power due
to lack of phase information is minimal when using our method. With reference to
the particular GAW data sets analyzed in this paper, we note that given the alphabetical
sorting used to generate LD at locus D2, LD with the traits associated with this locus (i.e.
e, f and h) is expected to be highest somewhere towards the left-hand end of the sorted
region (SNP locus B03T3056) and non-existent at the right-hand end of the region. (i.e.
SNP locus B03T3068, where the functional mutation was notionally ‘placed’.) Thus, a
22
reasonable analysis method should expect to find a signal around B03T3056 rather than
at B03T3068. This intuition is supported by the results in this and other papers.
However, the phase imputation procedure in the method is still naive and further cor-
poration of probabilities of inferred haplotypes from available software such asPHASE
would increase the speed and accuracy. The work related to GAW14 dataset was pub-
lished in [MZM05].
23
References
[BWAdA
+
05] Joan Bailey-Wilson, Laura Almasy, Mariza de Andrade, Julia Bailey,
Heike Bickeboller, Heather Cordell, E Warwick Daw, Lynn Goldin,
Ellen Goode, Courtney Gray-McGuire, Wayne Hening, Gail Jarvik,
Brion Maher, Nancy Mendell, Andrew Paterson, John Rice, Glen Sat-
ten, Brian Suarez, Veronica Vieland, Marsha Wilcox, Heping Zhang,
Andreas Ziegler, and Jean MacCluer. Genetic analysis workshop
14: microsatellite and single-nucleotide polymorphism marker loci for
genome-wide scans. BMC Genetics, 6(Suppl 1):S1, 2005.
[CCC04] David Clayton, Juliet Chapman, and Jason Cooper. Use of unphased
multilocus genotype data in indirect association studies. Genet Epi-
demiol, 27(4):415–428, Dec 2004.
[Con05] International HapMap Consortium. A haplotype map of the human
genome. Nature, 437(7063):1299–1320, Oct 2005.
[DH01] D. G. T. Denison and C. C. Holmes. Bayesian partitioning for estimating
disease risk. Biometrics, 57:143–149, 2001.
[DZC
+
04] Caroline Durrant, Krina T Zondervan, Lon R Cardon, Sarah Hunt, Panos
Deloukas, and Andrew P Morris. Linkage disequilibrium mapping via
cladistic analysis of single-nucleotide polymorphism haplotypes. Am J
Hum Genet, 75(1):35–43, Jul 2004.
[Gre95] P. Green. Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination. Biometrika, 82:711–732, 1995.
[GRS96] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov
Chain Monte Carlo in Practice. Chapman and Hall, 1996.
[GT98] J. Graham and E. A. Thompson. Disequilibrium likelihoods for fine-
scale mapping of a rare allele. Am. J. Hum. Genet., 63:1517–1530, 1998.
24
[GZS
+
05] David Greenberg, Junying Zhang, Dvora Shmulewitz, Lisa Strug,
Regina Zimmerman, Veena Singh, and Sudhir Marathe. Construc-
tion of the model for the genetic analysis workshop 14 simulated data:
genotype-phenotype relationships, gene interaction, linkage, associa-
tion, disequilibrium, and ascertainment effects for a complex phenotype.
BMC Genetics, 6(Suppl 1):S3, 2005.
[hap03] The international hapmap project. Nature, 426:789–796, 2003.
[JFC82] Kingman JFC. The coalescent. Stoch. Proc. Applns, 13:235–248, 1982.
[KHR00] B. Knorr-Held and L. Rasser. Bayesian detection of clusters and discon-
tinuities in disease maps. Biometrics, 56:13–21, 2000.
[KRB
+
89] B. Kerem, J. M. Rommens, J. A. Buchanan, D. Markiewicz, T. K. Cox,
A. Chakravarti, M. Buchwald, and L. C. Tsui. Identification of the cystic
fibrosis gene: Genetic analysis. Science, 245:1073–1080, 1989.
[Lan01] Initial sequencing and analysis of the human genome. Nature,
409(6822):860–921, Feb 2001.
[LNL03] Xin Lu, Tianhua Niu, and Jun S Liu. Haplotype information and linkage
disequilibrium mapping for single nucleotide polymorphisms. Genome
Res, 13(9):2112–2117, Sep 2003.
[MMT03a] John Molitor, Paul Marjoram, and Duncan Thomas. Application of
Bayesian spatial statistical methods to analysis of haplotypes effects and
gene mapping. Genet Epidemiol, 25(2):95–105, Sep 2003.
[MMT03b] John Molitor, Paul Marjoram, and Duncan Thomas. Fine-scale map-
ping of disease genes with multiple mutations via spatial clustering tech-
niques. Am J Hum Genet, 73(6):1368–1384, Dec 2003.
[MWB00a] A. P. Morris, J. C. Whittaker, and D. J. Balding. Bayesian fine-scale
mapping of disease loci, by hidden Markov models. Am. J. Hum. Genet.,
67:155–169, 2000.
[MWB00b] A P Morris, J C Whittaker, and D J Balding. Bayesian fine-scale map-
ping of disease loci, by hidden Markov models. Am J Hum Genet,
67(1):155–169, Jul 2000.
[MWB02] A P Morris, J C Whittaker, and D J Balding. Fine-scale mapping of
disease loci via shattered coalescent modeling of genealogies. Am J
Hum Genet, 70(3):686–707, Mar 2002.
25
[MWB04] A P Morris, J C Whittaker, and D J Balding. Little loss of informa-
tion due to unknown phase for fine-scale linkage-disequilibrium map-
ping with single-nucleotide-polymorphism genotype data. Am J Hum
Genet, 74(5):945–953, May 2004.
[MZM05] John Molitor, Keyan Zhao, and Paul Marjoram. Fine mapping - 19th
century style. BMC Genet, 6 Suppl 1:S63, Dec 2005.
[OBS92] A. Okabe, B. Boots, and K. Sugihara. Spatial Tesselations: Concepts
and Applications of Voronoi Diagrams. John Wiley & Sons, New York,
1992.
[SD03] M. Stephens and P. Donnelly. A comparison of Bayesian methods for
haplotype reconstruction from population genotype data. Am. J. Hum.
Genet., 73:1162–1169, 2003.
[SSD01] M. Stephens, N. J. Smith, and P. Donnelly. A new statistical method
for haplotype reconstruction from population data. Am. J. Hum. Genet.,
68:978–989, 2001.
[SST
+
01] J. C. Stephens, J. A. Schneider, D. A. Tanguay, J. Choi, T. Acharya, S. E.
Stanley, R. Jiang, C. J. Messer, A. Chew, J. H. Han, J. Duan, J. L. Carr,
M. S. Lee, B. Koshy, A. M. Kumar, G. Zhang, W. R. Newell, A. Winde-
muth, C. Xu, T. S. Kalbfleisch, S. L. Shaner, K. Arnold, V . Schulz, C. M.
Drysdale, K. Nandabalan, R. S. Judson, G. Ruano, and G. F. V ovis. Hap-
lotype variation and linkage disequilibrium in 313 human genes. Sci-
ence, 293:489–493, 2001.
[TSC
+
03] D. Thomas, D. O. Stram, D. Conti, J. Molitor, and P. Marjoram.
Bayesian spatial modeling of haplotype associations. Hum. Hered.,
2003. (In press).
[Ven01] The sequence of the human genome. Science, 291(5507):1304–1351,
Feb 2001.
[V or08] M. G. V oronoi. Nouvelles applications des param` etres continus ` a la
th´ eorie des formes quadratiques. J. Reine Angew. Math., 134:198–287,
1908.
26
Appendix A
MCMC algorithm
A.1 General scheme
MCMC is a method for calculating expectations in high-dimentional distributionsf(X).
Monte Carlo integration evaluatesE[f(X)] by drawing samplesX
t
,t = 1,...,n from a
bayesian prior distributionπ (.) and then approximating
E[f(X)]≈ 1
n
n
X
t=1
f(X
g
). (A.1)
And the Markov Chain X
0
,...,X
t
can be constructed through Metropolis-Hastings
algorithm. Each parameter in the model described in equation (1.7) is updated, condi-
tional on current values of all other paramters in the model. Let L(.) denotes the joint
likelihood, and f(.) denotes a density such as a prior. Then for any parameter μ in the
model, the update ofμ would be as following. A new valueμ new
is generated from the
prior distribution f(.) as a proposal for a move. The proposal value is accepted with
probabilityR, which is defined according to Metropolis-Hasting ratio:
R = min(1,
L(μ new
)f(μ old
)
L(μ old
)f(μ new
)
). (A.2)
A.2 Reversible Jump MCMC
We make the prior assumption that all haplotype center configurations are equally likely
and that the number, of clustersC is constrained to lie within [C
min
,C
max
]. By using an
27
uninformative prior structure on the number of centers and by proposing and deleting
centers with equal probability, we effectively cancel out our center proposal and prior
distributions when Metropolis-Hastings updates are made. Using equation 1.7, we can
propose a new cluster (Birth Move) by proposing a new center and labeling the set
of center values with the new addition as new while the current set is denoted as old.
We then use the following Metropolis-Hastings ratio to decide whether to accept the
proposal:
R = min(1,
f(y
∗ |T
new
,H)
f(y
∗ |T
old
,H)
). (A.3)
For all the parameters(γ 1
,γ 2
,x
c
,y
∗ ,γ c
,φ ) in the model, each update is based on the
standard Metropolis-Hastings ratio based on equation (1.7).
28
Abstract (if available)
Abstract
There is great interest in the use of computationally intensive methods for fine mapping of marker data. This thesis proposed a method to analyze diploid phase-unknown data using ideas derived from Bayesian spatial statistics. We use spatial clustering of haplotypes as a low-dimensional surrogate for the unobserved genealogy underlying a set of genotype data. In doing so we hope to avoid the computational complexity inherent in explicitly modelling details of the ancestry of the sample, while at the same time capturing the key correlations induced by that ancestry at a much lower computational cost.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Association mapping in Arabidopsis thaliana
PDF
Probabilistic methods and randomized algorithms
PDF
New analysis methods for microarray data
PDF
Genomic mapping: a statistical and algorithmic analysis of the optical mapping system
PDF
Bayesian methods for autonomous learning systems
PDF
Copula methods and dependence concepts
PDF
A correction to the functional enrichment test used in computational biology
PDF
Genome-wide association study of factors influencing gene expression variation and pleiotropy
PDF
The effects of sample size on haplotype block partition, tag SNP selection and power of genetic association studies
PDF
Geometric bounds for Markov Chain and brief applications in Monte Carlo methods
PDF
Population modeling and Bayesian estimation for the deconvolution of blood alcohol concentration from transdermal alcohol biosensor data
PDF
Integrative analysis of gene expression and phenotype data
PDF
Between genes and phenotypes: an integrative network-based Monte Carlo method for the prediction of human-gene phenotype associations
PDF
From least squares to Bayesian methods: refining parameter estimation in the Lotka-Volterra model
PDF
Exchangeable pairs in Stein's method of distributional approximation
PDF
Stein's method via approximate zero biasing and positive association with applications to combinatorial central limit theorem and statistical physics
PDF
Symmetric and trimmed solutions of simple linear regression
PDF
Finite sample bounds in group sequential analysis via Stein's method
PDF
Monte Carlo methods of forward backward stochastic differential equations in high dimensions
PDF
Statistical analysis of microarray data and functional genomics of yeast ageing
Asset Metadata
Creator
Zhao, Keyan (author)
Core Title
Application of Bayesian methods in association mapping
School
College of Letters, Arts and Sciences
Degree
Master of Science
Degree Program
Statistics
Publication Date
04/23/2009
Defense Date
04/01/2007
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
association mapping,Bayesian methods,OAI-PMH Harvest
Language
English
Advisor
Sun, Fengzhu (
committee chair
), Goldstein, Larry M. (
committee member
), Zhou, Jasmine (
committee member
)
Creator Email
kzhao@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m454
Unique identifier
UC1261824
Identifier
etd-Zhao-20070423 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-484151 (legacy record id),usctheses-m454 (legacy record id)
Legacy Identifier
etd-Zhao-20070423.pdf
Dmrecord
484151
Document Type
Thesis
Rights
Zhao, Keyan
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu
Tags
association mapping
Bayesian methods