THE EFFECTS OF SAMPLE SIZE ON HAPLOTYPE BLOCK PARTITION, TAG SNP
SELECTION AND POWER OF GENETIC ASSOCIATION STUDIES
by
Li Wang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(STATISTICS)
December 2008
Copyright 2008 Li Wang
Acknowledgements
I would like to thank my advisor, Dr. Fengzhu Sun, for his great guidance during this
work. I also want to thank Dr. Larry Goldstein for his suggestions and help in writing this
thesis. His very instructive lectures were essential for me in building a solid statistical
background. I owe great thanks to the other members of my committee, Dr. Liang
Chen and Dr. Ting Chen.
This work was supported, in part, by NIH P50 HG 002790 (FS) and the NIMH Intramural
Research Program (FJM, TGS). The NIMH Genetics Initiative DNA samples were
provided by the Rutgers Cell Repository (Jay Tischfield, Principal Investigator). These
samples were collected in 4 projects that participated in the NIMH Bipolar Disorder
Genetics Initiative. From 1991-98, the Principal Investigators and Co-Investigators were:
Indiana University, Indianapolis, IN, U01 MH46282, John Nurnberger, Marvin Miller,
and Elizabeth Bowman; Washington University, St. Louis, MO, U01 MH46280, Theodore
Reich, Allison Goate, and John Rice; Johns Hopkins University, Baltimore, MD U01
MH46274, J. Raymond DePaulo, Jr., Sylvia Simpson, and Colin Stine; the NIMH
Intramural Research Program, Clinical Neurogenetics Branch, Bethesda, MD, Elliot
Gershon, Diane Kazuba, and Elizabeth Maxwell. Collection of the Johns Hopkins
University/Dana Foundation family sample was supported by grants from the National
Institute of Mental Health, The Charles A. Dana Foundation, and the Ted and Vada
Stanley Foundation; we thank J. Raymond DePaulo for sharing the DNA samples and
pedigree data. CEPH and Coriell Human Diversity panel samples were purchased from
Coriell Institute for Medical Research, Camden, NJ. We thank Silvia Buervenich, Chunyu
Liu, and James Potash for assistance with SNP selection, Luana Galver of Illumina, Inc.
for assistance with SNP assay design and genotyping project management, and CJM
Steele for data management.
Table of Contents
Acknowledgements
List of Figures
Abstract
Chapter I Introduction
Chapter II Materials and Methods
  2.1 Regions Studied
    Table 1. Characteristics of the three regions of interest
  2.2 Samples
  2.3 SNP Selection and Genotyping
  2.4 Haplotype block partition and tSNP selection
  2.5 Average numbers of blocks and tSNPs
  2.6 Average differences between the numbers of blocks and tSNPs in non-overlapping samples
  2.7 Distance
  2.8 Random block partitioning
  2.9 Selecting tSNPs
  2.10 Power Estimation
Chapter III Results
  3.1 Overall LD patterns
  3.2 The effect of sample size on haplotype block partitioning
  3.3 Power to detect association using tSNPs
Chapter IV Discussion
References
List of Figures
Figure 1. Linkage disequilibrium patterns in each of the 3 regions studied
Figure 2. The average number of blocks (blue) and tSNPs (red) selected by HapBlock using different sample sizes
Figure 3. The average difference, between 2 independent samples of the same size, in the ratio of the number of blocks to the average number of blocks (blue) and in the ratio of the number of tSNPs to the average number of tSNPs (red)
Figure 4. The average difference, between two independent samples of the same size, in the block partitions based on HapBlock (red), and in random partitions with the same distribution of the number of blocks (black)
Figure 5. Power to detect association using tSNPs selected by HapBlock versus power using evenly-spaced SNPs, with different allele-frequency thresholds
Figure 6. Power to detect association in a 2-locus haplotype analysis at a Type I error rate of 5%
Figure 7. Power to detect association in a 2-locus haplotype analysis using tSNPs selected by different criteria
Abstract
Recent studies have shown that linkage disequilibrium (LD) varies significantly across
the human genome, with interspersed regions of high and low LD. In regions of high LD,
a fraction of the single nucleotide polymorphisms (SNPs), called tag SNPs, can be used to
capture most of the haplotype information. We study the reliability of haplotype block
partitioning and tag SNP selection, and the power of association studies with tag
SNPs, based on actual data from a large sample of 834 Caucasians genotyped with 97
SNPs in 3 regions with distinct LD patterns. We first assess the effect of sample size on
haplotype block partitioning and tag SNP selection and show that 40-50 unrelated
Caucasians are sufficient for reliable partitioning of haplotype blocks and selection of tag
SNPs. We then compare the power of association studies using tag SNPs selected by
differing criteria.
Chapter I Introduction
Linkage disequilibrium (LD) refers to the nonrandom association of alleles at different
loci along a genome. Recent studies have shown heterogeneous LD patterns along the
human genome, with regions of high LD (haplotype blocks) interspersed with regions of
low LD [1,2,3,4]. In regions of high LD, it is possible to use a fraction of the single
nucleotide polymorphisms (SNPs), called tag SNPs (tSNPs), to capture most of the haplotype
information. The objective of selecting tSNPs is to reduce genotyping effort in
association studies without a significant loss of power. However, different measures of
haplotype information result in different sets of tSNPs [5,6,7,8,9]. In addition, there is no
widely accepted definition of haplotype blocks, and differing methods of block
partitioning yield different results [1,3,4,10,11,12]. Haplotype block-free methods of selecting
tSNPs have also been developed [5,13,14].
Other factors affect haplotype structure and tSNPs. The effects of population
structure and history, marker allele frequency, and marker density have been studied
using both simulated and real data sets [15,12], while the effect of sample size has been
studied using only simulated data [16,12]. Large samples of real data have not been available.
For example, only 50 individuals were selected for each ethnic group in the HapMap
project. It is impossible to study the effect of varying sample size in such a small number
of individuals.
In this paper, we study 834 individuals genotyped with 97 SNPs in 3 regions with
distinct LD patterns, using HapBlock [17], a software package based on Zhang et
al. [10,11,12,17,18,19]. With this large sample size, we could address the following questions:
1) How do the average number of blocks and the average number of tSNPs change
as the sample size increases?
2) Block partitions and the resulting tSNPs can differ even in samples of the same
size. What is the difference in the numbers of haplotype blocks identified, and of
tSNPs selected, between two non-overlapping samples of the same size?
3) We define a distance measure between two block partitions. What is the average
distance between the block partitions in two non-overlapping samples of the same
size?
The above questions have to be studied first as they relate to the stability and the
reliability of the resulting haplotype blocks and the chosen tSNPs. Of course, the ultimate
goal of tSNP selection is to maximize the power and efficiency of association studies.
Significant questions remain as to how much power is lost when tSNPs are used. In a
simulation study, Zhai et al. [12,17] compared the power to detect association on the basis of
tSNPs selected by HapBlock with the power using evenly-spaced SNPs, in single-marker
analyses. They showed that there is often no advantage in selecting tSNPs based on
haplotype blocks because selections of SNPs based only on allele frequency and spacing
have equal power.
The results of that study prompted us to consider more carefully the usefulness and
limitations of tSNPs for association studies. Although HapBlock does not specifically
address the issue of optimal criteria for tSNP selection, we reasoned that our large data
set could provide answers to some additional questions concerning power:
4) How does the power to detect association using tSNPs compare with power to
detect association using the same number of evenly-spaced SNPs? What is the
impact of analyzing SNPs as two-locus haplotypes?
5) How does the power to detect association vary across differing criteria for tSNP
selection?
This paper is organized as follows. First we outline the algorithm used in HapBlock
to partition blocks and select tSNPs. We describe the methods we used to calculate the
average number of blocks, number of tSNPs, and distance between two block partitions.
We also describe the methods for simulating trait values and calculating power to detect
association. In Results, we present the number of blocks and tSNPs, and the distance
between 2 block partitions, as a function of sample size, in each of 2 non-overlapping
samples. We also compare the power of association using tSNPs selected by differing
criteria. The paper concludes with discussions of the implications and limitations of this
study.
Chapter II Materials and Methods
2.1 Regions Studied
Genotype data were generated in the course of ongoing genetic association studies of
bipolar disorder. The characteristics of the 3 regions are given in Table 1. SNPs in Region
1 were selected to cover the entire region at a mean spacing of 1 SNP/50 kb, while SNPs
in Regions 2 and 3 were selected to cover only known genes in these regions at a mean
spacing of 1 SNP/10 kb. In the course of genotyping, some SNPs were dropped (see
below) resulting in somewhat higher mean spacing between markers. Figure 1 shows the
distinct LD patterns across the three regions, as portrayed by GOLD [20]. The figure also
shows the spacing between consecutive markers.
                                     Region I      Region II     Region III
Chromosome                           13            22            22
Begin position                       96,996,809    33,161,303    32,148,278
End position                         99,521,400    33,398,803    32,768,569
Physical length (kb)                 2525          238           620
Number of markers                    42            42            66
Average spacing (kb)                 60            5.7           9.4
Coefficient of variation in spacing  0.64          0.64          1.4
Max spacing (kb)                     185           21            92
Min spacing (kb)                     22            0.3           0.4
Table 1. Characteristics of the three regions of interest
2.2 Samples
A total of 834 unrelated subjects were selected from the NIMH Genetics Initiative
Bipolar Disorder sample [21], the Johns Hopkins University/Dana Foundation family
sample [22], 90 unrelated founders in the Centre d’Etude du Polymorphisme Humain (CEPH)
Utah and French pedigrees, and 200 unrelated, apparently healthy, self-declared
Caucasians from the Coriell Human Diversity panel. The DNA samples of this panel are
labeled NA17201-17300 and NA1801-18100. All samples were coded without personal
identifiers. All subjects gave written informed consent.
2.3 SNP Selection and Genotyping
SNPs were selected from existing sources (dbSNP, Celera) on the basis of location.
When allele frequency in Caucasians was available, SNPs with minor allele frequency
<10% were rejected. Inter-marker LD was not considered in SNP selection.
SNPs were genotyped by Illumina, Inc. using their BeadArray platform, which has
very high accuracy [23]. Overall, 97% of markers produced genotypes of sufficient quality
for analysis. The missing data rate was 0.02%. Significant deviation from
Hardy-Weinberg equilibrium was observed for 2 SNPs; these were dropped prior to
analysis.
2.4 Haplotype block partition and tSNP selection
Haplotype block partition and tSNP selection were carried out using HapBlock [17], a
program that uses dynamic programming algorithms to find the minimum number of
tSNPs across the region of interest. HapBlock defines tSNPs as the minimal set of SNPs
that can explain a given fraction of all haplotype information in each block. We defined
blocks as in Patil et al. [14], where at least α% of observed or inferred haplotypes must have
frequency of at least β. The definition of tSNPs is also similar to that of Patil et al. [14], i.e.,
the minimum set of SNPs that can uniquely distinguish α% of all the haplotypes. For this
study, we set α = 80 and β = 0.05.
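The block and tSNP definitions above can be illustrated with a small sketch. This is not HapBlock's implementation: the function names are hypothetical, haplotypes are assumed to be phased strings of 0/1 alleles, a block is read as a segment in which haplotypes with frequency at least β together cover at least a fraction α of the sample, and the tSNP search is a brute-force simplification that distinguishes all common haplotypes.

```python
from collections import Counter
from itertools import combinations


def common_haplotypes(haplotypes, beta=0.05):
    """Return {haplotype: frequency} for haplotypes with frequency >= beta."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    return {h: c / n for h, c in counts.items() if c / n >= beta}


def is_block(haplotypes, alpha=0.80, beta=0.05):
    """Assumed reading of the block criterion: common haplotypes
    (frequency >= beta) cover at least a fraction alpha of the sample."""
    return sum(common_haplotypes(haplotypes, beta).values()) >= alpha


def min_tag_snps(haplotypes, beta=0.05):
    """Smallest set of SNP positions that tells all common haplotypes apart.
    Brute force over subsets, so only suitable for short segments."""
    common = list(common_haplotypes(haplotypes, beta))
    n_snps = len(haplotypes[0])
    for k in range(n_snps + 1):
        for subset in combinations(range(n_snps), k):
            projected = {tuple(h[i] for i in subset) for h in common}
            if len(projected) == len(common):
                return subset
    return tuple(range(n_snps))


# Toy segment: three SNPs, ten phased haplotypes.
haps = ["000"] * 5 + ["011"] * 3 + ["110"] * 2
print(is_block(haps), min_tag_snps(haps))
```

In the toy segment no single SNP separates the three haplotypes, so two tSNPs are needed.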
2.5 Average numbers of blocks and tSNPs
For a fixed sample size, n, we:
1) Randomly chose n individuals from the data set,
2) Obtained the block partition and tSNPs in these individuals using HapBlock,
3) Repeated steps 1) and 2) 200 times (the quantities of interest tended to change little
after 200 simulations – data not shown),
4) Calculated the mean number of blocks and the mean number of tSNPs across all 200
experiments.
We then varied n and repeated steps 1) to 4) to gauge the effect of sample size on the
average numbers of blocks and tSNPs. All sampling in this study was carried out
without replacement.
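The resampling procedure in steps 1) to 4) can be sketched as follows. The callable `partition_fn` is a hypothetical stand-in for a HapBlock run that returns the number of blocks and the number of tSNPs for a sample; it is an assumption made for illustration and not part of the software.

```python
import random
from statistics import mean


def average_counts(individuals, n, partition_fn, n_rep=200, seed=0):
    """Mean number of blocks and tSNPs over n_rep random samples of size n."""
    rng = random.Random(seed)
    blocks, tags = [], []
    for _ in range(n_rep):
        sample = rng.sample(individuals, n)      # drawn without replacement
        n_blocks, n_tags = partition_fn(sample)  # stand-in for a HapBlock run
        blocks.append(n_blocks)
        tags.append(n_tags)
    return mean(blocks), mean(tags)
```

Calling this for a range of values of n gives the kind of curve summarized in Figure 2.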
2.6 Average differences between the numbers of blocks and tSNPs in
non-overlapping samples
Here we used a different strategy, since the samples derived above may overlap substantially
(especially when the sample size is relatively large), and it is not appropriate to
calculate variances based on such overlapping samples. For a given sample size n, we:
1) Randomly chose n individuals from the 834 individuals and obtained the number of
blocks (b1) using HapBlock,
2) Randomly chose n individuals from among the remaining (834-n) individuals and
obtained the number of blocks (b2) using HapBlock,
3) Calculated the absolute value of the difference, Δb=|b1-b2|,
4) Repeated steps 1) to 3) 200 times to obtain the average value for Δb.
The average difference in the numbers of tSNPs obtained from non-overlapping samples
was obtained in a similar fashion.
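A minimal sketch of this non-overlapping design is given below. Drawing 2n individuals without replacement and splitting them in half is equivalent to drawing n individuals and then n more from the remainder. The callable `count_fn` is a hypothetical placeholder for a HapBlock run that returns the number of blocks (or tSNPs) in a sample.

```python
import random


def mean_abs_difference(individuals, n, count_fn, n_rep=200, seed=0):
    """Average |b1 - b2| over n_rep pairs of disjoint samples of size n."""
    if 2 * n > len(individuals):
        raise ValueError("two disjoint samples of size n do not fit in the data")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        drawn = rng.sample(individuals, 2 * n)  # without replacement
        first, second = drawn[:n], drawn[n:]    # disjoint by construction
        total += abs(count_fn(first) - count_fn(second))
    return total / n_rep
```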
2.7 Distance
Several measures of the distance between two block partitions have been defined [24].
Here we used a measure that was used and studied in Liu et al. [24]. In brief:
1) For a block partition B, we define a s*s matrix B = ( b(i,j) ) where s is the number of
SNPs and b(i, j) = 1 if the i-th SNP and the j-th SNP are in the same block and b(i, j)
= 0, otherwise.
2) For two block partitions B1 and B2, let the corresponding matrixes be B1 = ( b1(i,j) )
and B2 = ( b2(i,j) ), respectively. The distance between the two block partitions is
defined by
$$d(B_1, B_2) = \frac{2}{s(s-1)} \sum_{i<j} I\big(b_1(i,j) \neq b_2(i,j)\big),$$
where I(·) is the indicator function.
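The distance can be computed directly from its definition. In the sketch below a block partition is represented, as an assumption of this example, by a list giving the block index of each SNP, so that two SNPs are in the same block exactly when their labels are equal.

```python
from itertools import combinations


def partition_distance(labels1, labels2):
    """Fraction of SNP pairs (i < j) on which two block partitions disagree
    about whether the two SNPs fall in the same block."""
    s = len(labels1)
    if s != len(labels2):
        raise ValueError("both partitions must cover the same SNPs")
    if s < 2:
        return 0.0
    mismatches = sum(
        (labels1[i] == labels1[j]) != (labels2[i] == labels2[j])
        for i, j in combinations(range(s), 2)
    )
    return 2.0 * mismatches / (s * (s - 1))


print(partition_distance([0, 0, 1, 1], [0, 1, 1, 1]))  # 3 of 6 pairs differ: 0.5
```

The distance is 0 for identical partitions and 1 when the two partitions disagree on every pair of SNPs.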
2.8 Random block partitioning
We generated block partitions at random in order to compare the results to those
generated by HapBlock. For a given sample size n, we obtained an empirical distribution
of the number of blocks using HapBlock. Then we:
1) Chose a random number n_b based on this empirical distribution. This was used for the
number of blocks, and we randomly assigned n_b – 1 break points to obtain one
random block partition, B1,
2) Repeated step 1) to obtain another block partition B2,
3) Calculated the distance between the two random block partitions d(B1, B2), as
discussed above,
4) Repeated steps 1) to 3) 200 times to obtain the average distance between two random
block partitions.
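One way to generate such random partitions is sketched below. The text says only that break points were assigned at random; placing them uniformly, without replacement, among the s − 1 gaps between adjacent SNPs is an assumption of this example, as are the function names.

```python
import random


def draw_block_count(empirical_counts, rng):
    """Draw n_b from the empirical distribution, given as a list of the
    block counts observed in repeated HapBlock runs."""
    return rng.choice(empirical_counts)


def random_partition(s, n_blocks, rng):
    """Split s ordered SNPs into n_blocks contiguous blocks by choosing
    n_blocks - 1 distinct break points among the s - 1 gaps."""
    if not 1 <= n_blocks <= s:
        raise ValueError("need 1 <= n_blocks <= s")
    breaks = set(rng.sample(range(1, s), n_blocks - 1))
    labels, block = [], 0
    for i in range(s):
        if i in breaks:      # a new block starts at SNP i
            block += 1
        labels.append(block)
    return labels


rng = random.Random(1)
n_b = draw_block_count([8, 9, 9, 10], rng)
print(random_partition(42, n_b, rng))
```

The returned list gives a block index for each SNP, so two random partitions can be compared pair by pair as in Section 2.7.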
2.9 Selecting tSNPs
Zhai et al. [12,17] showed through simulation that the power to detect association using
tSNPs selected by HapBlock is similar to the power using evenly distributed SNPs, at
least in individual marker analysis with known phase for each individual. We set out to
test this finding in real genotype data, analyzed as 2-marker haplotypes with uncertain
phase.
The markers in our data sets were not evenly spaced, with large variation in spacing
between consecutive markers (Figure 1), so it was not possible to select exactly evenly
spaced markers. Instead, we used the following algorithm to select a set of approximately
evenly-spaced markers for analysis. Let m be the number of markers to be selected with
minor allele frequency above a certain threshold, t (= 0.05, 0.1, 0.15, 0.2, 0.25), and let d_0 be
the average distance between consecutive markers. Let G be the set of SNPs under study
and S be the set of markers that have already been selected. We define the distance between
a SNP “s” and a set of SNPs S as the minimum physical distance from s to any SNP in S,
that is,
$$d(s, S) = \min_{s' \in S} d(s, s').$$
At the beginning, S = ∅ (the empty set), and we let d(s, ∅) = ∞.
a) Randomly choose a SNP “s” from G\S with minor allele frequency at least t and
d(s, S) ≥ d_0. Add “s” to S.
b) Repeat step a) until no such SNPs satisfying conditions in a) exist, or at most
m/3 SNPs are chosen.
c) Choose a SNP “s” with minor allele frequency at least t that maximizes d(s, S).
Add “s” to the set S.
d) Repeat step c) until a total of m SNPs are selected.
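Steps a) to d) can be sketched as follows. The function and argument names are hypothetical, positions are assumed to be physical coordinates in a common unit, and "at most m/3" is read here as the integer part of m/3.

```python
import math
import random


def select_spaced(positions, mafs, m, t, d0, rng):
    """Pick up to m approximately evenly-spaced SNPs with MAF >= t.
    Returns the sorted indices of the selected SNPs."""
    eligible = [i for i in range(len(positions)) if mafs[i] >= t]
    chosen = []

    def dist(i):
        # d(s, S): distance to the nearest selected SNP; infinite if S is empty.
        return min((abs(positions[i] - positions[j]) for j in chosen),
                   default=math.inf)

    # Steps a)-b): random picks at least d0 away from every selected SNP.
    while len(chosen) < m // 3:
        candidates = [i for i in eligible if i not in chosen and dist(i) >= d0]
        if not candidates:
            break
        chosen.append(rng.choice(candidates))

    # Steps c)-d): repeatedly add the SNP farthest from the selected set.
    while len(chosen) < m:
        candidates = [i for i in eligible if i not in chosen]
        if not candidates:
            break
        chosen.append(max(candidates, key=dist))
    return sorted(chosen)
```

The first phase introduces randomness while keeping markers apart; the second phase fills the largest remaining gaps until m markers are selected or no eligible SNP remains.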
We also assessed power to detect association using other criteria for selecting tSNPs.
We selected tSNPs based upon each of the 5 criteria outlined in Zhang et al. [17]. Briefly:
a) Common haplotypes [4] - The minimum set of SNPs that can uniquely distinguish at
least a certain fraction of all haplotypes.
b) Haplotype diversity [6] - The minimum set of SNPs that can account for a certain
fraction of overall haplotype diversity.
c) Haplotype entropy [8] - The minimum set of SNPs that can reduce overall haplotype
entropy by a certain fraction.
d) Haplotype determination coefficient [9] - The minimum set of SNPs with haplotype
prediction strength exceeding a certain threshold.
e) LD measure r² [5] - For a subset of SNPs S and a given SNP “s” not in that subset,
let
$$p(s, S) = \max_{s' \in S} r^2(s, s')$$
be the maximum value of r² between the SNP “s” and the SNPs in S. The minimum set of
SNPs S such that
$$\min_{s \notin S} p(s, S)$$
exceeds a certain threshold is considered the set of tSNPs.
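Criterion e) can be made concrete with a small sketch. It assumes phased haplotypes coded as 0/1 alleles, with one list of alleles per SNP, and finds the smallest qualifying set by brute force; the function names are hypothetical and the search is only practical for a handful of SNPs.

```python
from itertools import combinations


def r_squared(x, y):
    """r^2 between two biallelic SNPs, from 0/1 alleles on the same haplotypes."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(a * b for a, b in zip(x, y)) / n
    d = pxy - px * py                      # the LD coefficient D
    denom = px * (1 - px) * py * (1 - py)
    return d * d / denom if denom > 0 else 0.0


def worst_coverage(alleles, tags):
    """min over untagged SNPs s of p(s, S) = max over tags of r^2(s, tag)."""
    others = [s for s in range(len(alleles)) if s not in tags]
    return min(
        (max(r_squared(alleles[s], alleles[t]) for t in tags) for s in others),
        default=1.0,                       # every SNP is itself a tag
    )


def min_r2_tag_set(alleles, threshold):
    """Smallest tag set whose worst coverage exceeds the threshold."""
    n = len(alleles)
    for k in range(1, n + 1):
        for tags in combinations(range(n), k):
            if worst_coverage(alleles, tags) > threshold:
                return tags
    return tuple(range(n))


snps = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
print(min_r2_tag_set(snps, 0.8))
```

In the toy data the first two SNPs are in complete LD, so one of them can stand in for the other, while the third SNP must be tagged by itself.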
2.10 Power Estimation
To avoid assumptions about the degree of LD between marker and trait locus, we let
each SNP serve as a candidate trait locus and assessed the power to detect association in a
case-control design with 200 cases and 200 controls. We let the minor allele D be the
high-risk allele, and the other allele was denoted d. Throughout the study, we assumed the
following penetrances: p(DD) = 0.75, p(Dd) = 0.5, and p(dd) = 0.25.
For selection of tSNPs under each of the above criteria, we randomly chose 40
individuals from among the 200 cases and 200 controls. From the study on the effect of
sample size on block partition and tSNP selection, we know that the selection of tSNPs
does not change significantly when the sample size grows above 40. The candidate trait
locus was excluded from the SNPs eligible for selection as a tSNP. The reason for this
exclusion is that only a very small fraction of all SNPs are used as tSNPs, so the
probability that a tSNP is the true trait locus is small, although not zero.
We analyzed the selected tSNPs in the case-control data both as individual markers
and as two-locus haplotypes (sliding window). The statistics used in this study are the
same as in [12] and are not presented here. We repeated the procedure 1000 times to obtain
the approximate power. The power is approximated by the fraction of times association is
detected at a type I error rate of 5% with Bonferroni correction for multiple testing.
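Two pieces of this procedure lend themselves to a short illustration: the genotype distribution implied by the penetrance model, and the Bonferroni-corrected power estimate. The sketch below is not the procedure used in this study. It assumes Hardy-Weinberg proportions at the trait locus and takes per-replicate p-values as given, since the test statistics of [12] are not reproduced here; all names are hypothetical.

```python
def case_control_genotype_freqs(q, penetrance=(0.25, 0.5, 0.75)):
    """Genotype frequencies (dd, Dd, DD) among cases and controls.
    q is the frequency of the high-risk allele D; penetrance is indexed by
    the number of D alleles. Assumes Hardy-Weinberg proportions."""
    geno = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]
    prevalence = sum(g * f for g, f in zip(geno, penetrance))
    cases = [g * f / prevalence for g, f in zip(geno, penetrance)]
    controls = [g * (1 - f) / (1 - prevalence) for g, f in zip(geno, penetrance)]
    return cases, controls


def estimated_power(p_values_per_replicate, alpha=0.05):
    """Fraction of replicates in which at least one test is significant
    after Bonferroni correction for the number of tests in that replicate."""
    hits = sum(
        1 for p_values in p_values_per_replicate
        if min(p_values) <= alpha / len(p_values)
    )
    return hits / len(p_values_per_replicate)


cases, controls = case_control_genotype_freqs(0.5)
print(cases, controls)
print(estimated_power([[0.001, 0.5], [0.04, 0.9], [0.02, 0.03]]))
```

With q = 0.5 the DD genotype is three times as common in cases as in controls, which is the signal the association tests try to detect through nearby tSNPs.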
Chapter III Results
3.1 Overall LD patterns
[Figure 1: three panels, Region I, Region II, and Region III.]
Figure 1. Linkage disequilibrium patterns in each of the 3 regions studied. D’ is depicted
in a pseudo-color image generated by GOLD (Abecasis and Cookson 2000), with
individual markers listed along the x- and y-axes, in map order. The spacing between
markers is proportional to the physical distance between them.
Figure 1 shows the LD patterns in each of the 3 regions studied. The three regions
show distinct LD patterns. Region I has essentially no high LD regions due to the large
spacing between markers. Region II has relatively high LD throughout, with no clear
blocks of higher LD. Region III has both high (from SNP 16 to SNP 29, a total of 14
SNPs) and low LD intervals (from SNP 30 to SNP 66, a total of 37 markers).
3.2 The effect of sample size on haplotype block partitioning
Figure 2 shows the average number of blocks called and the average number of tSNPs
selected for different sample sizes. Similar to earlier simulations [12], both the average
number of blocks and the average number of tSNPs increase as the sample size increases,
up to about 40 individuals. When the sample size reaches 40, both the number of blocks
and the number of tSNPs stabilize, reaching values that are within 2% of their values
when the sample size is 200. Although the total numbers of SNPs in regions I and II are
the same, the numbers of blocks are different (17 for region I and 9 for region II).
Similarly, the ratio of the total number of SNPs to the total number of blocks is smaller in
region I (44/17 = 2.6) than in regions II (44/9 = 4.9) or III (66/11.5 = 5.7). Both
observations are consistent with the different levels of LD in these regions.
To assess variation in the number of blocks called in different samples, we calculated
the ratio of the difference in the numbers of blocks called to the average number of
blocks called across 2 non-overlapping samples of the same size (Figure 3). More
precisely, let b_1 and b_2 be the numbers of blocks for two non-overlapping samples. This
ratio is defined as
$$r = \frac{2\,|b_1 - b_2|}{b_1 + b_2}.$$
r gives the relative variation of the number of blocks for 2 non-overlapping samples
of the same size. The ratio decreases as the sample size increases, indicating that the
variation in the number of blocks is small when the sample size is large. Similar results
were observed for the number of tSNPs (Figure 3).
Next we assessed the differences in the resulting haplotype blocks between the 2
non-overlapping samples (Figure 4). This difference decreases as the sample size
increases. Compared to random block partitions with the same distribution of the number
of blocks, block partitions called in the 2 non-overlapping samples were similar in region
I, but different in regions II and III.
[Figure 2: three panels (Region I, Region II, Region III); x-axis: sample size (0-200); y-axes: block number and tag SNP number.]
Figure 2. The average number of blocks (blue) and tSNPs (red) selected by HapBlock
using different sample sizes. Both quantities rise as sample size grows to about 40
individuals, then stabilize.
[Figure 3: three panels (Region I, Region II, Region III); x-axis: sample size (0-200); y-axes: the ratio of difference of BN over BN and the ratio of difference of TN over TN (0-0.4).]
Figure 3. The average difference, between 2 independent samples of the same size, in the
ratio of the number of blocks to the average number of blocks (blue) and in the ratio of
the number of tSNPs to the average number of tSNPs (red). BN = number of blocks, TN
= number of tSNPs.
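[Figure 4: three panels (Region I, Region II, Region III); x-axis: sample size (0-200); y-axis: difference of block partition; series: RandomPartition and ProgramPartition.]
Figure 4. The average difference, between two independent samples of the same size, in
the block partitions based on HapBlock (red), and in random partitions with the same
distribution of the number of blocks (black).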
3.3 Power to detect association using tSNPs
We first compared the power to detect association using tSNPs selected by HapBlock
with the power to detect association using the same number of SNPs evenly spaced
across the region. For all 3 regions, individual marker analysis resulted in lower
power than 2-marker haplotype analysis (Figure 5). Therefore, we consider only
2-marker haplotype analyses in what follows.
The power to detect association using tSNPs selected by HapBlock depends on local
patterns of LD. In region I, where the LD is generally low, the power using all SNPs
(69%) is higher than that using tSNPs (62%), or the same number of evenly-spaced SNPs
(62%), and the power using tSNPs is very close to that using evenly-spaced SNPs. Thus
in this region tSNPs selected by HapBlock offer no power advantage over the same
number of evenly-spaced SNPs. In contrast, in region II, where LD is generally high, the
power using tSNPs (81%) is always higher than the power using evenly-spaced SNPs
(≤78%). In region III, where both high and low LD areas exist, tSNPs again had a power
advantage over evenly-spaced SNPs, but it was modest. Because the power to detect
association also depends on the trait allele frequency, we also examined power over a
range of minor allele frequencies. The same pattern held (Figure 6).
Finally, we compared the power to detect association using tSNPs selected
according to each of the 5 criteria discussed in Methods (Figure 7). In region I, power
was similar for all tSNPs, regardless of the criteria used to select them. In regions II and
III, the first 3 criteria gave very similar power for the same number of tSNPs, but SNPs
selected based on the haplotype determination coefficient were less powerful. In the 2
regions of high LD, SNPs selected according to the LD measure r² did not perform as
well as SNPs selected by the other criteria.
[Figure 5: panels for Region I, Region II, and Region III, each under individual marker analysis and haplotype analysis; x-axis: SNP selection methods (All SNP, Tag SNP, Even(0.05), Even(0.1), Even(0.15), Even(0.20), Even(0.25)); y-axis: power (0.50-0.85).]
Figure 5. Power to detect association using tSNPs selected by HapBlock versus power
using evenly-spaced SNPs, with different allele-frequency thresholds.
[Figure 6: three panels (Region I, Region II, Region III); x-axis: minor allele frequency (bins 0-0.1, 0.1-0.2, 0.2-0.3, 0.3-0.4, 0.4-0.5); y-axis: power (0-1); series: All, Tag, Even0.05, Even0.1, Even0.15, Even0.2, Even0.25.]
Figure 6. Power to detect association in a 2-locus haplotype analysis at a Type I error
rate of 5%. Candidate blocks are defined by consecutive SNPs such that at least 80% of
haplotypes have frequency of at least 5%. tSNPs are defined as the smallest set of SNPs
that can distinguish 80% of the haplotypes.
[Figure 7: three panels (Region I, Region II, Region III); x-axis: tag SNP number; y-axis: power (0.50-0.84); series: criteria 1 to 5.]
Figure 7. Power to detect association in a 2-locus haplotype analysis using tSNPs
selected by different criteria. 1 - Common haplotypes, 2 - Haplotype diversity, 3 -
Haplotype entropy, 4 - Haplotype determination coefficient, and 5 - r². Type I error rate =
5%.
Chapter IV Discussion
Tag SNPs offer the promise of reducing genotyping effort while preserving power to
detect association. Several methods of tSNP selection have been developed, and the
resulting tSNPs have been studied in simulated and real data sets [15,25,26]. Ke et al. [25]
showed that tSNPs selected in one population can capture a substantial portion of the
haplotype diversity in another population, but preservation of haplotype diversity does
not guarantee preservation of power for association studies [27]. Sample size is a key factor,
but the effect of sample size on haplotype block partitioning and tSNP selection has not,
to our knowledge, been studied in real data.
In this paper, we have addressed this problem by use of a dataset containing over
80,000 high-quality genotypes. This study differs from previous studies in several
significant ways. First we have data from a large number of individuals, so we can
examine sample size effects directly. Second, since we use real data, we avoid the special
problems that can arise when data is generated by simulation. For example, it is well
known that data generated from simulations of the coalescent process may not accurately
reflect real data. Real data captures the full complexity present in the real world. Third,
we use genotype data instead of haplotype data. Genotype data are the most common
form of data available, thus our study may better reflect the most likely scenarios
encountered in practice. Our data were derived from largely Caucasian samples, but if other
populations differ mainly in patterns of LD, as recent data suggest [28,29], our results should
also apply in those populations.
Our results show that sample size has an important impact on haplotype block
partitioning and tSNP selection, but only when samples are relatively small. We showed
that both the number of blocks and the number of tSNPs increase with sample size, but
only until sample size reaches about 40-50 individuals, whereupon the values stabilize.
Furthermore, the average difference between the haplotype block partitions in 2
non-overlapping samples of the same size was very small for sample sizes over 40. These
results largely confirm, in real data, the results of previous simulation studies [12].
We also compared the power to detect association using tSNPs selected based on
different criteria. We confirmed the results of Zhai et al. [12,17] that, for individual marker
analysis, the power to detect association with tSNPs can be lower than, similar to, or
higher than, the power to detect association with the same number of evenly-spaced
SNPs. However, we note that in our analyses, methods based on 2-locus haplotypes were
usually more powerful than individual markers. Thus the results of Zhai et al. [12,17] may
not be relevant to real studies, unless the trait locus is one of the markers tested. In that
situation, individual marker analysis can indeed be more powerful.
Our results show that, when local LD was relatively high, the power to detect
association with tSNPs in haplotype analyses was higher than the power with the same
number of evenly-spaced SNPs. On the other hand, tSNPs and evenly-spaced SNPs had
similar power when local LD was relatively low. This rule also held across a range of
trait allele frequencies. Thus it appears that when LD is high, tSNPs are more powerful,
but when LD is low, they confer little or no advantage. This seems intuitive, since tSNPs
depend on association with nearby SNPs. If there is little local LD, then each SNP varies
independently, and no SNP can represent others.
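To make the intuition concrete, the pairwise LD measure r² can be computed from haplotype frequencies as D² / (p_A(1 - p_A) p_B(1 - p_B)), where D = p_AB - p_A p_B. The sketch below assumes phased haplotypes with alleles coded 0/1 at two SNPs; the study itself worked from unphased genotype data, so this is an illustration of the measure rather than of the analysis pipeline, and the function name is hypothetical.

```python
def r_squared(haplotypes):
    """Pairwise r^2 for two biallelic SNPs.

    haplotypes: list of (a, b) tuples, alleles coded 0/1 (assumed phased).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # A monomorphic SNP leaves r^2 undefined; 0.0 is returned by convention here.
        return 0.0
    d = p_ab - p_a * p_b
    return d * d / denom
```

When two SNPs always occur together, r² is 1 and either SNP can stand in for the other; when they vary independently, r² is 0 and neither carries information about the other.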
Finally, we compared the power to detect association with tSNPs selected according
to various criteria. We considered selections based on common haplotypes, haplotype
diversity, haplotype entropy, the haplotype determination coefficient, and the LD
measure r². In Region I, where LD was generally low, power was similarly low
regardless of how tSNPs were chosen. However, in Regions II and III, where LD was
relatively high, power was higher for tSNPs selected by one of the first 4 criteria, but
somewhat lower for tSNPs selected based on the haplotype determination coefficient, or
based on r². While tSNPs selected based on r² appeared to have the lowest power in all
regions, this was due to the need to set a low r² threshold of 0.2 in order to obtain a
number of tSNPs that was comparable to that selected according to the other criteria. In
practice, most studies use an r² threshold over 0.8. From our data, we cannot comment on
the power of this approach, but others have shown that it can be quite powerful [5].
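The dependence of the number of tSNPs on the r² threshold can be illustrated with a simple greedy tagging sketch. This is a simplified stand-in under stated assumptions, not the algorithm implemented in HapBlock or in any of the cited methods: it takes a precomputed symmetric matrix of pairwise r² values, and the function name `greedy_tags` is hypothetical.

```python
def greedy_tags(r2, threshold):
    """Greedily pick tag SNPs so that every SNP has r^2 >= threshold with some tag.

    r2: symmetric matrix (list of lists) of pairwise r^2 values.
    Returns the indices of the chosen tags in the order they were picked.
    """
    untagged = set(range(len(r2)))
    tags = []
    while untagged:
        def covered(i):
            # SNPs still untagged that candidate i would capture (including itself).
            return {j for j in untagged if j == i or r2[i][j] >= threshold}

        # Candidates are limited to untagged SNPs; ties go to the lowest index.
        best = max(sorted(untagged), key=lambda i: len(covered(i)))
        tags.append(best)
        untagged -= covered(best)
    return tags
```

With a toy matrix, lowering the threshold reduces the number of tags, at the cost of weaker correlation between each tag and the SNPs it represents:

```python
r2 = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.25],
    [0.1, 0.1, 0.25, 1.0],
]
print(greedy_tags(r2, 0.8))  # three tags
print(greedy_tags(r2, 0.2))  # two tags
```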
In this study we concentrated on one software package for haplotype block partition
and tSNP selection, HapBlock. Even within HapBlock, we fixed the definition of
candidate blocks. Many definitions of candidate blocks are available, and the
performance of the resulting tSNPs was not studied here. However, we expect that the
trends we observed with HapBlock would be similar for most of the commonly-used
methods of analysis. We note that block-free tSNP selection algorithms are being
developed [5,13,14]. The comparative performance of these methods will need further study.
Our results have several implications for the experimental design of SNP association
studies. First, the decision whether to use tSNPs at all depends on an assessment of the
degree of LD in the region under study. Fortunately, this can be accomplished through
high-density genotyping of only about 50 individuals, and these data can be used to derive
a set of haplotype blocks and tSNPs that will be representative of the larger population. It
remains unclear which criteria are best for selecting tSNPs, but our results indicate
that most of the commonly-used criteria perform similarly. Finally, it is usually
advantageous to analyze data as 2-locus haplotypes, but SNPs that have a strong prior
probability of directly influencing the trait of interest deserve analysis also as single loci.
Further work is needed before we can give confident recommendations as to the ideal
populations to study.
References
20 Abecasis GR, Cookson WO: GOLD--graphical overview of linkage disequilibrium. Bioinformatics 2000; 16: 182-183.
5 Carlson CS, Eberle MA, Rieder MJ, et al: Selecting a maximally informative set of single-nucleotide polymorphisms for association analysis using linkage disequilibrium. Am J Hum Genet 2004; 74: 106-120.
1 Daly MJ, Rioux JD, Schaffner SF, et al: High-resolution haplotype structure in the human genome. Nat Genet 2001; 29: 229-232.
2 Dawson E, Abecasis GR, Bumpstead S, et al: A first-generation linkage disequilibrium map of human chromosome 22. Nature 2002; 418: 544-548.
3 Gabriel SB, Schaffner SF, Nguyen H, et al: The structure of haplotype blocks in the human genome. Science 2002; 296: 2225-2229.
23 Gunderson KL, Kruglyak S, Graige MS, et al: Decoding randomly ordered DNA arrays. Genome Res 2004; 14: 870-877.
29 Gonzalez-Neira A, Calafell F, Navarro A, et al: Geographic stratification of linkage disequilibrium: A worldwide population study in a region of chromosome 22. Hum Genomics 2004; 1: 399-409.
13 Halldorsson BV, Bafna V, Lippert R, et al: Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies. Genome Res 2004; 14: 1633-1640.
6 Johnson GC, Esposito L, Barratt BJ, et al: Haplotype tagging for the identification of common disease genes. Nat Genet 2001; 29: 233-237.
25 Ke XY, Durrant C, Morris AP, et al: Efficiency and consistency of haplotype tagging of dense SNP maps in multiple samples. Hum Mol Genet 2004; 13: 2557-2565.
7 Ke XY, Cardon LR: Efficient selective screening of haplotype tag SNPs. Bioinformatics 2003; 19: 287-288.
24 Liu NJ, Sawyer SL, Mukherjee N, et al: Haplotype Block Structures Show Significant Variation Among Populations. Genet Epidemiol 2004; 27: 385-400.
14 Meng Z, Zaykin DV, Xu CF, Wagner M, Ehm MG: Selection of genetic markers for association analyses using linkage disequilibrium and haplotypes. Am J Hum Genet 2003; 73: 115-130.
8 Nothnagel M, Furst R, Rohde K: Entropy as a measure for linkage disequilibrium over multilocus haplotype blocks. Hum Hered 2002; 54: 186-198.
21 Nurnberger JI, DePaulo JR, Gershon ES, et al: Genomic survey of bipolar illness in the NIMH genetics initiative pedigrees: A preliminary report. Am J Med Genet 1997; 74: 227-237.
4 Patil N, Berno AJ, Hinds DA, et al: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 2001; 294: 1719-1723.
28 Rosenberg NA, Pritchard JK, Weber JL, et al: Genetic Structure of Human Populations. Science 2002; 298: 2381-2385.
9 Stram DO, Haiman CA, Hirschhorn JN, et al: Choosing haplotype-tagging SNPs based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the multiethnic cohort study. Hum Hered 2003; 55: 27-36.
15 Schulze TG, Zhang K, Chen YS, Akula N, Sun FZ, McMahon FJ: Defining haplotype blocks and tag single-nucleotide polymorphisms in the human genome. Hum Mol Genet 2004; 13: 335-342.
22 Simpson SG, Folstein SE, Meyers DA, McMahon FJ, Brusco DM, DePaulo JR: Bipolar II: the most common bipolar phenotype. Am J Psych 1993; 150: 901-903.
16 Thompson D, Stram D, Goldgar D, Witte JS: Haplotype Tagging Single Nucleotide Polymorphisms and Association Studies. Hum Hered 2003; 56: 48-55.
26 Weale ME, Depondt C, Macdonald SJ, et al: Selection and Evaluation of Tagging SNPs in the Neuronal-Sodium-Channel Gene SCN1A: Implications for Linkage Disequilibrium Gene Mapping. Am J Hum Genet 2003; 73: 551-565.
10 Zhang K, Jin L: HaploBlockFinder: haplotype block analysis. Bioinformatics 2003; 19: 1300-1301.
11 Zhang K, Deng M, Chen T, Waterman MS, Sun FZ: A dynamic programming algorithm for haplotype block partitioning. Proc Natl Acad Sci USA 2002; 99: 7335-7339.
12 Zhang K, Qin Z, Ting C, Waterman MS, Liu JS, Sun FZ: Haplotype Block Partitioning and Tag SNP selection using genotype data and their applications to association studies. Genome Res 2004; 14: 908-916.
17 Zhang K, Qin Z, Ting C, Liu JS, Waterman MS, Sun FZ: HapBlock: Haplotype Block Partitioning and Tag SNP Selection Software Using a Set of Dynamic Programming Algorithms. Bioinformatics 2005; 21: 131-134.
18 Zhang K, Calabrese P, Nordborg M, Sun FZ: Haplotype block structure and its applications in association studies: power and study design. Am J Hum Genet 2002; 71: 1386-1394.
19 Zhang K, Sun FZ, Waterman MS, Chen T: Dynamic programming algorithms for haplotype block partitioning: applications to human chromosome 21 haplotype data. Am J Hum Genet 2003; 73: 63-73.
27 Zhai WW, Todd MJ, Nielsen R: Is haplotype block identification useful for association mapping studies? Genet Epidemiol 2004; 27: 80-83.
Asset Metadata
Creator
Wang, Li (author)
Core Title
The effects of sample size on haplotype block partition, tag SNP selection and power of genetic association studies
Contributor
Electronically uploaded by the author (provenance)
School
College of Letters, Arts and Sciences
Degree
Master of Science
Degree Program
Statistics
Publication Date
12/09/2008
Defense Date
10/25/2008
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
association study,haplotype,OAI-PMH Harvest,sample size,tag SNP
Language
English
Advisor
Goldstein, Larry (committee chair), Chen, Liang (committee member), Chen, Ting (committee member)
Creator Email
tujiaojiao1981@gmail.com,wang7@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m1897
Unique identifier
UC1501086
Identifier
etd-Wang-2505 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-141590 (legacy record id),usctheses-m1897 (legacy record id)
Legacy Identifier
etd-Wang-2505.pdf
Dmrecord
141590
Document Type
Thesis
Rights
Wang, Li
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
uscdl@usc.edu
Abstract
Recent studies have shown that linkage disequilibrium (LD) varies significantly across the human genome, with interspersed regions of high and low LD. In regions of high LD, a fraction of the single nucleotide polymorphisms (SNPs), called tag SNPs, can be used to capture most of the haplotype information. We study the reliability of haplotype block partitioning and tag SNP selection, and the power of association studies with tag SNPs, based on actual data from a large sample of 834 Caucasians genotyped with 97 SNPs in 3 regions with distinct LD patterns. We first assess the effect of sample size on haplotype block partitioning and tag SNP selection and show that 40-50 unrelated Caucasians are sufficient for reliable partitioning of haplotype blocks and selection of tag SNPs. We then compare the power of association studies using tag SNPs selected by differing criteria.