TWO-STAGE GENOTYPING DESIGN AND POPULATION STRATIFICATION IN
CASE-CONTROL ASSOCIATION STUDIES
by
Hansong Wang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(STATISTICAL GENETICS AND GENETIC EPIDEMIOLOGY)
May 2007
Copyright 2007 Hansong Wang
Acknowledgements
I am greatly indebted to my advisor, Dr. Daniel O. Stram, for guiding me
through the process of developing this dissertation. Dr. Stram put an enormous
amount of time and effort into helping me outline the topics, work out technical
difficulties, and improve the quality of my manuscript. Besides his insights, I benefited
most from his patience and warm encouragement.
I would like to thank Dr. James Gauderman and Dr. Magnus Nordborg for their
generosity in serving on my committee and providing helpful suggestions when needed.
Thanks also to the faculty members in this department who helped me at various
stages of my study: Dr. Richard Watanabe, the recently retired Dr. Harland Sather, Dr.
Stanley Azen and Dr. Kiros Berhane.
A special thanks to my friend Huiyan Ma for sharing her experience and pushing
me through the qualifying exam.
Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Introduction
The Multiethnic Cohort Study (MEC)
Chapter 1 Optimal two-stage genotyping design for genome-wide association scans
    Methods
    Results
    Discussion
Chapter 2 Follow-up issues with the methods in Chapter 1
    One-stage sample size calculation — compare to Quanto
    Optimizing two-stage designs using Quanto’s method
    Controlling for FDR
Chapter 3 Assessing population substructure in the current MEC data
    Using GLMM to correct for population stratification
    Population Stratification in the MEC study
References
List of Tables

Table 1. Minimum expected cost for a one-to-one unmatched study and various allele frequencies, p, and odds ratios, ψ, and the set of parameters when the minimum is reached, for m = 500,000, D = 1, t1 = $0.002, t2 = $0.035, α = 1×10⁻⁷ (1-sided) and 1−β = 0.90.

Table 2. Parameters for a two-stage one-to-one unmatched study such that the expected cost is within 1.1-fold of the minimum and the sample size is the least, for various p and ψ, with m = 500,000, D = 1, t1 = $0.002, t2 = $0.035, α = 1×10⁻⁷ (1-sided) and 1−β = 0.90.

Table 3. The dependence of optimal two-stage design parameters on the cost ratio t2/t1 and the number of genotyped SNPs in stage I, m, for power 1−β = 0.90, f = 0.5, p = 0.2 and ψ = 1.5. The overall type I error rate α = 0.05/m (2-sided).

Table 4. The dependence of optimal two-stage design parameters on the cost ratio t2/t1 and the number of genotyped SNPs in stage I, m, for power 1−β = 0.80, f = 0.5, p = 0.2 and ψ = 1.5. The overall type I error rate α = 0.05/m (2-sided).

Table 5. Minimum expected cost for a one-to-one unmatched study and various p and ψ, and the set of parameters when the minimum is reached, for m = 500,000, D = 1, t1 = $0.002, t2 = $0.035, 1−β = 0.90 and α = 5×10⁻⁸ (1-sided; see text for detail). Rs² is 0.6 in stage I and 1.0 in stage II.

Table 2.1 The dependence of optimal two-stage design parameters on the cost ratio t2/t1 and the number of genotyped SNPs in stage I, m, for power 1−β = 0.90, f = 0.5, p = 0.2 and ψ = 1.5. The overall type I error rate is 0.05/m (2-sided).

Table 2.2 The dependence of optimal two-stage design parameters on the cost ratio t2/t1 and the number of genotyped SNPs in stage I, m, for power 1−β = 0.80, f = 0.5, p = 0.2 and ψ = 1.5. The overall type I error rate is 0.05/m (2-sided).

Table 2.3 Optimal and nearly optimal two-stage design parameters for m = 500,000, overall power 1−β = 0.90, t2/t1 = 20 and various D and FDR levels. Causal allele frequency is 0.2 and odds ratio is 1.4 with log-additive penetrance.

Table 2.4 Optimal and nearly optimal two-stage design parameters for m = 500,000, overall power 1−β = 0.50, t2/t1 = 20 and various D and FDR levels. Causal allele frequency is 0.2 and odds ratio is 1.4 with log-additive penetrance.

Table 3.1 Rejection rate of the original and the modified EigenSTRAT methods, n = 400, for 1000 markers.

Table 3.2 Parameter estimates and empirical power based on analyses of 1000 causal SNPs with β = log(1.2) = 0.182 per allele (model A3). Sample size is 800. Power is evaluated at α level 0.01.

Table 3.3 Parameter estimates and empirical power from analyses of 1000 copies of causal SNPs with β = log(1.25) = 0.223 per allele (model B2), for different combinations of ancestral allele frequencies, n = 800, at α level 0.05.

Table 3.4 Empirical quantiles of p-values obtained from association tests of breast cancer and various subsets of SNPs, separately within each ethnic group.

Table 3.5 Empirical quantiles of p-values for pairwise correlations among 55 unlinked SNPs, averaged over 60 random SNP sets, for controls only and for all subjects, within each ethnic group.

Table 3.6 Empirical quantiles of p-values for pairwise correlations among highly-differentiated (HD) unlinked SNPs, for controls only and for all subjects, within each ethnic group.
List of Figures

Figure 1. The proportion, n1/(n1 + n2), of sample size allocated to stage I in the optimal design, plotted as a function of the cost ratio t2/t1 and the number of SNPs, m, genotyped in stage I, for a study with 90% power to detect an effect with type I error equal to 0.05/m.

Figure 2. The significance threshold α1 for promoting a SNP from stage I to stage II in the optimal design, plotted as a function of the cost ratio t2/t1 and the number of SNPs, m, genotyped in stage I, for a study with 90% power to detect an effect with type I error equal to 0.05/m.

Figure 2.1 Percent difference between sample sizes n and nq, calculated from (1.1) and Quanto, respectively. Power 1−β = 0.90. Dots above zero indicate n < nq and dots below zero indicate n > nq. P0 refers to the baseline disease risk; α is the type I error rate (one-sided).

Figure 2.2 Empirical power for sample sizes calculated from Quanto (panel a) and (1.1) (panel b), where α (one-sided) = 0.05, P0 = 0.0001 and power = 0.90.

Figure 2.3 Percent difference between n, nt and nq, calculated from (1.1), (2.1) and Quanto, respectively, with P0 = 0.0001 and power = 0.90. The dashed line represents (1 − n/nq)×100% and the solid line represents (1 − nt/nq)×100%.

Figure 2.4 Relative reduction in minimum two-stage genotyping cost and sample size of optimal two-stage designs when FDR is applied, compared to when Bonferroni correction is used, with the number of true positives D = 1. “B” stands for Bonferroni correction.

Figure 3.1 Plot of interpersonal correlation K1ij based on 200, 1000, 10000 and 100000 markers (from left to right) simulated under model A1 for within-continent structure. The K1ij (i ≠ j) were grouped into six categories along the horizontal axis, depending on whether the two subjects come from the same population (within-population) or different populations (between-population). The boxes show the 25th and 75th percentiles and the most extreme values are the 1st and 99th percentiles.

Figure 3.2 Distribution of the Kij (i ≠ j) for the four K matrices, based on 10000 markers simulated under model A1 for within-continent structure. The Kij (i ≠ j) were grouped into six categories as in Figure 3.1.

Figure 3.3 Cumulative distribution of p-values of association for 1000 random SNPs simulated under model A1 (within-continent), using different analysis methods. The first 3 eigenvectors of K1 were used in logistic regression for the EigenSTRAT method (denoted “eigen(K1)” in the plot).

Figure 3.4 Cumulative distribution of p-values for SNPs simulated under model A1, using GLMM with K1, K2, K3 and K4 as the correlation matrix. Results from K1, K2 and K4 overlap in most places.

Figure 3.5 Cumulative distribution of p-values from EigenSTRAT analyses, adjusting for the first one or two eigenvectors of the four K matrices. The four curves in each graph, corresponding to the four matrices, overlap with each other in most situations.

Figure 3.6 Scree plots of ordered eigenvalues of the four K matrices, n = 800, m = 1000, model A1.

Figure 3.7 Distribution of differences in allele frequency between the CEPH and Yoruba samples from the HapMap project, for the selected 1636 markers (see text for details).

Figure 3.8 Cumulative p-value distribution from the unadjusted, GC-adjusted, GLMM and EigenSTRAT tests, based on model B1. GLMM (using K3) and EigenSTRAT (adjusting for the first eigenvector of K3) results overlap with each other.

Figure 3.9 Cumulative distribution of p-values from EigenSTRAT analyses of model B1, adjusting for the first eigenvector of K1, K3 and K4 (K2 results are similar to K1).

Figure 3.10 Cumulative distribution of the p-values of association between breast cancer and all SNPs, separately for each ethnic group.

Figure 3.11 Cumulative distribution of the p-values of association between breast cancer and the subset of SNPs with inter-marker distance 20kb, separately for each ethnic group.
Abstract
The dissertation focuses on two problems identified from the ongoing Multiethnic
Cohort (MEC) study, i.e. two-stage genotyping design and population stratification.
Three chapters follow a general introduction to relevant concepts in genetic
epidemiological studies. Chapters one and two contain two published papers on optimal
two-stage genotyping design, using Bonferroni correction and false discovery rate (FDR)
correction, respectively, for the overall type I error rate. The significance of this part is
that a procedure was designed to search for the most cost-effective two-stage studies
based on the current availability of high-throughput and very high-throughput
genotyping platforms (fixed SNP array) and without considering any constraints on
total sample size or available resources. The results provide guidance for two-stage
designs under a wide range of assumptions about the per-genotype cost ratio and total
number of markers. Chapter three describes the performance of the generalized linear
mixed model (GLMM) in controlling for population stratification with simulated
structured case-control data. Compared to Genomic Control (GC) and the relatively new
EigenSTRAT methods, GLMM does not offer much advantage. To assess the likely
significance of population stratification in the planned or future MEC association scans,
Chapter 3 also examines evidence of population stratification within four ethnic groups
(African Americans, Japanese, Latinos and Whites) represented in the newly available
MEC breast cancer data consisting of 1400 SNPs. The results indicate that moderate
structure exists in African Americans and Latinos and that the GC approach produced
acceptable empirical type I error rates in most cases.
Introduction
Common diseases such as cardiovascular disease, cancer, obesity, diabetes,
psychiatric illnesses and inflammatory diseases are often caused by a combination of
multiple genetic and environmental factors. Although epidemiological studies have
successfully discovered various environmental risk factors for common diseases, the
roles played by genetic factors remain largely unknown. It is expected that identifying
these genetic causes will provide fundamental insights into pathogenesis, diagnosis,
prevention and treatment of human disease (The International HapMap Consortium
2003).
It has been estimated that 99.9% of any two genomes (except monozygotic
twins) are roughly identical, and only 0.1% contains information that accounts for
heritable variation among individuals, including susceptibility to disease; but that still
leaves millions of differences among the 3.2 billion base pairs in the human genome
(Kruglyak and Nickerson 2001). Future developments in genotyping technology may
eventually allow us to examine all genetic variations for contribution to human diseases.
In the meantime, approaches for gene mapping which utilize only a limited number of
markers to make inferences on possible disease gene locations have achieved important
successes. These methods generally fall into two categories: linkage analysis and
association analysis.
Linkage analysis typically scans for large pieces of gene segments shared among
affected family members, using microsatellite markers that are a few mega bases (Mb)
apart. Linkage studies have been most successful in identifying rare, highly penetrant
variants for “mendelian” disorders, in which variation in a single gene is almost both
necessary and sufficient to cause disease (reviewed in The International HapMap
Consortium 2005). In contrast, for complex diseases that result from interactions of
numerous environmental and genetic factors, linkage studies have been less successful,
due to various reasons, including limited mapping resolution (usually exceeding 10
Mb), inadequate study power or low heritability of most complex traits (reviewed in
Hirschhorn and Daly 2005).
Association studies search for markers with different allele frequencies between
a set of affected subjects (cases) and unaffected controls. Significant association of a
marker allele and disease outcome is considered to be evidence of physical linkage
between the marker and a causal variant. Association analysis can be more powerful
than linkage analysis in detecting low penetrant variants (relative risk between 1.3 to 2)
(Risch 2000; Botstein and Risch 2003; Hirschhorn and Daly 2005). For example, the
Pro12Ala variant of the peroxisome proliferator-activated receptor γ (PPAR γ) gene was
shown to affect the risk of type 2 diabetes — the more common Pro allele (~85%) is
associated with a 1.25-fold increase in disease risk (Altshuler, Hirschhorn et al. 2000).
Although this effect was validated by meta-analysis of over 3,000 individuals from
association studies, it would only be detectable by linkage studies of over one million
affected sib-pairs (Altshuler, Hirschhorn et al. 2000; Hirschhorn and Daly 2005).
Moreover, since 90% of sequence variation among individuals is due to the most
common variations (Kruglyak and Nickerson 2001), it is reasonable to postulate that
susceptibility alleles to common late onset diseases are common variants that are not
under strong negative selection (CDCV hypothesis). Therefore association studies that
examine the most abundant polymorphisms, the single-nucleotide polymorphisms
(SNPs), are likely to uncover most causal variants for complex diseases. Since the SNPs
are distributed on average a few kilo bases (kb) apart in the human genome, such
association studies provide finer mapping resolution (~ a few kb) than linkage studies
(~10 Mb). In fact, we can find examples that are compatible with the CDCV hypothesis,
including associations of the Pro12Ala variant of PPAR γ with type II diabetes
(Altshuler, Hirschhorn et al. 2000), the Tyr402His variant of complement factor H with
age-related macular degeneration (Edwards, Ritter et al. 2005; Haines, Hauser et al.
2005; Klein, Zeiss et al. 2005), and SNPs in the 8q24 region with prostate cancer
(Amundadottir, Sulem et al. 2006) (also see Haiman et al Nature Genetics in press).
Even if the causal variant is relatively uncommon, association analysis may still
succeed, because most of the causal sequence variants originally arose from single
historical mutation events, and are therefore associated with nearby variants that were
present on the ancestral chromosome on which the mutation took place (The
International HapMap Consortium 2003). The specific array of markers in the
neighborhood of the causal variant can then serve as an indicator for detecting
association between the “causal” genomic region and the disease — this is the so-called
“indirect” association method (Collins, Guyer et al. 1997; The International HapMap
Consortium 2003).
Up until recently, candidate-gene studies have been the only affordable
association design and have identified many genetic risk factors (Hirschhorn and Daly
2005). However, this method can be laborious because it involves completely
resequencing/testing the putative functional parts of candidate genes, which are selected
on the basis of a previous genetic hypothesis or linkage peak (The International
HapMap Consortium 2003). If the prior hypotheses do not appropriately reflect the
underlying biological mechanism, they will fail to find the true positives. Even if
the prior hypothesis is broad (e.g., involving all genes in the DNA damage response
pathway), this approach would at its best discover a portion of all causal variants
(Hirschhorn and Daly 2005). In contrast, the genome-wide association approach, in
which most of the genome is surveyed for association with disease without relying on
any prior assumptions, is a more efficient design (Hirschhorn and Daly 2005; Thomas,
Haile et al. 2005). Needless to say, the genome-wide approach is also more expensive
than the candidate-gene approach due to the enormous amount of genotyping required.
But with the rapid developments in high-throughput genotyping technology and our
ever-growing knowledge of variation in the human genome, the genome-wide approach
has finally become a realistic option. Early success of this approach has been seen in
the detection of association between the Tyr402His variant of human complement
factor H and age-related macular degeneration (Klein, Zeiss et al. 2005). The following
sections briefly review a few key concepts involved in genome-wide association
studies.
Linkage Disequilibrium (LD). The phenomenon that particular alleles at nearby
sites co-occur on the same chromosome more frequently than expected by chance is
called LD. LD is of great importance to gene mapping. As discussed earlier, with the
“indirect” association approach (applicable both to candidate-gene and genome-wide
association studies), disease variants can be detected through neighboring sites that are
in LD with them.
There are various measurements of (pairwise) LD in the literature (Hedrick
1987; Devlin and Risch 1995), and two of them, D′ and r², are the most important.
Consider two SNPs, A/a and B/b, with population frequencies pA/pa and pB/pb. The
corresponding haplotype (haploid genotype: a closely linked set of markers on the same
chromosome that tend to segregate together) frequencies are pAB, pAb, paB and pab. A
2×2 table can be set up as follows.

           A      a
    B     pAB    paB    pB
    b     pAb    pab    pb
          pA     pa
One LD measure is DAB = pAB − pA·pB, the difference between the observed and
expected frequency of the AB haplotype. To eliminate the dependence of DAB on allele
frequencies, define D′ = DAB/Dmax (Hedrick 1987), where

    Dmax = min(pa·pB, pA·pb)  if DAB > 0,
    Dmax = min(pa·pb, pA·pB)  if DAB ≤ 0.

Because the sign is arbitrary, the absolute value |D′| is often used. D′ ranges from zero
to one, and we say that LD is “complete” if D′ = 1, which happens when only two or
three of the four possible haplotypes are present in the population. If there are k and
l alleles at the two sites, then a multi-allelic version of D′ can also be defined as

    D′ = Σi Σj pi qj |D′ij|   (i = 1, …, k; j = 1, …, l),

where pi and qj denote the corresponding allele frequencies at the two loci (Hedrick 1987).
Another LD measure is the square of the correlation coefficient between the two
loci under consideration, which can be computed as

    r² = DAB² / (pA·pa·pB·pb)

(Franklin and Lewontin 1970). Although r² explicitly depends on allele frequencies, it
has some appealing properties that make it more useful. First, multiplying r² by the
sample size yields the χ² value from which the tail probability of the observed LD can
be calculated (Pritchard and Przeworski 2001). Second, the sample size required in
“indirect” association studies is inversely proportional to r². Suppose that SNP1 is
involved in disease susceptibility, but we genotype a nearby SNP2 in cases and controls.
Then, to achieve the same power to detect association at SNP2 as we would have at
SNP1, the sample size needs to be increased by a factor of approximately 1/r²
(Pritchard and Przeworski 2001; Wall and Pritchard 2003). For two biallelic loci, r²
can reach one only when exactly 2 of the 4 possible haplotypes exist.
In practice, there is typically a sample of chromosomes from the population.
Estimates of D′ and r² are then usually obtained by plugging the sample estimates of
pA, pB and pAB into the above formulas. The sampling distribution of D′ and r²
depends on both sample size and allele frequencies. In particular, D′ is biased upward,
with the bias inversely related to sample size (Teare, Dunning et al. 2002; Weiss and
Clark 2002). Simulation studies have also shown that the observed D′ is highly inflated
when based on low allele frequencies (e.g., minor allele frequencies of 0.05 at both
variants); the bias is at its worst at true D = 0 (Teare, Dunning et al. 2002).
Other than physical linkage, LD between two loci can be influenced by natural
selection or demographic factors such as inbreeding, population structure and
bottlenecks (Pritchard and Przeworski 2001). For example, it is consistently observed
that LD in non-African populations extends over longer distances than in Africans,
which might reflect a population bottleneck at the time when modern humans first left
Africa (Wall and Pritchard 2003). For details on how these and other factors affect LD,
refer to (Pritchard and Przeworski 2001; Nordborg and Tavare 2002; Weiss and Clark
2002).
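To make the two definitions concrete, here is a small sketch in Python that computes D′ and r² directly from the four haplotype frequencies defined above. The function name and argument layout are illustrative, not from the dissertation:

```python
# Illustrative sketch: pairwise LD measures for two biallelic SNPs,
# computed from the four haplotype frequencies p_AB, p_Ab, p_aB, p_ab.

def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    # Allele frequencies are the margins of the 2x2 haplotype table.
    p_A = p_AB + p_Ab
    p_a = p_aB + p_ab
    p_B = p_AB + p_aB
    p_b = p_Ab + p_ab
    # Raw disequilibrium: observed minus expected AB frequency.
    D = p_AB - p_A * p_B
    # Normalize by the largest |D| compatible with the allele frequencies.
    if D > 0:
        d_max = min(p_a * p_B, p_A * p_b)
    else:
        d_max = min(p_a * p_b, p_A * p_B)
    d_prime = abs(D / d_max) if d_max > 0 else 0.0
    # Squared correlation between the two loci.
    r2 = D**2 / (p_A * p_a * p_B * p_b)
    return d_prime, r2
```

For instance, if only the AB, aB and ab haplotypes are present (pAB = 0.1, paB = 0.4, pab = 0.5), then D′ = 1 while r² = 1/9 ≈ 0.11: LD is "complete" in the D′ sense, yet an indirect study genotyping the second locus would need roughly 1/r² = 9 times the sample size, illustrating why the two measures answer different questions.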
LD block and tagSNPs. Between 2001 and 2002, a number of studies reported a “block-
like” LD structure in individual regions of human chromosomes (Daly, Rioux et al.
2001; Patil, Berno et al. 2001; Gabriel, Schaffner et al. 2002). Within each LD block
(a.k.a. haplotype block), there is limited haplotype diversity and little evidence of
recombination, so that a few SNPs can capture most of the genetic variation within the block
— such SNPs are called tagSNPs. Further, one study found blocks of LD in
recombination coldspots to be separated by experimentally determined recombination
hotspots (Jeffreys, Kauppi et al. 2001; Wall and Pritchard 2003). These findings directly
motivated the development of the HapMap project, which will be discussed later.
As for the block structure, it is worth noting that blocks are defined "based on
the genetic information content and not on knowledge of how this information
originated or why it exists. As such, blocks do not have absolute boundaries and may be
defined in different ways, depending on the specific application” (Patil, Berno et al.
2001). For example, some methods define blocks based on limited haplotype diversity
(Daly, Rioux et al. 2001; Patil, Berno et al. 2001; Dawson, Abecasis et al. 2002), some
make use of a pair-wise LD measure (e.g., D′) to identify the transition zones in which
there is extensive historical recombination (Gabriel, Schaffner et al. 2002; Wang, Akey
et al. 2002; Phillips, Lawrence et al. 2003; Wall and Pritchard 2003) and some are based
on statistical model selection criteria (Anderson and Novembre 2003; Greenspan and
Geiger 2004). For block-defining methods that are based on D′, the blocks defined
would be affected by sample size, allele frequencies and population history, as the
sample estimate of D′ is affected by these factors. For a list of other factors that affect
the observed block structure, refer to (Wall and Pritchard 2003).
The most immediate relevance of the LD block structure to association studies is
the possible savings on genotyping cost. Because a small fraction of SNPs (tagSNPs)
can serve as proxies for other SNPs in their neighborhood, genotyping only the tagSNPs in
association studies is expected to reduce genotyping cost significantly without
sacrificing much study power, although this opinion is not universally held
(Terwilliger and Hiekkalinna 2006). The issue of selecting a maximally informative set
of tagSNPs has also been studied extensively, resulting in various available methods
and software: some current methods are conditional on the defined LD block structure
(Zhang, Calabrese et al. 2002; Stram, Haiman et al. 2003), others are not (Meng, Zaykin
et al. 2003; Weale, Depondt et al. 2003; Carlson, Eberle et al. 2004); some make use of
pair-wise LD measure (Gabriel, Schaffner et al. 2002; Carlson, Eberle et al. 2004),
some use multi-marker statistics (Stram, Haiman et al. 2003; de Bakker, Yelensky et al.
2005); there are also methods that assign block boundaries and select tagSNPs
simultaneously by trying to maximize haplotype predictability (Patil, Berno et al. 2001;
Zhang, Deng et al. 2002). Although different, these tagging methods all share the goal
of exploiting redundancy among SNPs and maximizing efficiency in the lab while
minimizing loss of information. As to relative efficiency, multi-marker based methods
seem to allow for greater coverage with a fixed number of SNPs (The International
HapMap Consortium 2005). For a software review on tagSNP selection, refer to
(Stram 2005).

As an example to show the benefit of tagging, consider one simple method that uses
the LD measure r² as the criterion: SNPs are selected for genotyping until all common
SNPs (minor allele frequency >5%) are highly correlated with one or more of the
tagSNPs (Carlson, Eberle et al. 2004). One study based on the Phase I HapMap data
showed that if r² = 1 is used, the reduction in the number of genotypes required is 46%
in the African samples and 65% in the CEPH and the combined Chinese and Japanese
samples. Since all common sites are perfectly captured, relative power remains at 100%
compared with genotyping and testing all common causal alleles directly (de Bakker,
Yelensky et al. 2005).

The International HapMap Project. The International HapMap Project was launched
in 2002 as an effort to systematically identify genetic similarities and catalogue LD
patterns of common variations in human beings, and to provide information needed as a
guide for genetic studies of any size. At the completion of Phase I (2005), the HapMap
project had genotyped more than one million common SNPs in 269 samples with
ancestry from parts of Africa, Asia and Europe (CEPH). The project features nearly
complete genotyping of all genetic variation in ten 500kb regions as part of the
ENCODE (Encyclopedia of DNA Elements) project. Inclusion of parent-offspring trios
enabled highly accurate phasing of long-range haplotypes for each individual.

The Phase I HapMap data illustrate the following points: most of the human genome
falls into long segments of strong LD that contain many SNPs and yet display limited
haplotype diversity — more than 80% of CEPH and Chinese + Japanese chromosomes
and more than 67% of African sample chromosomes are spanned by blocks, using two
block-defining methods; blocks are not simply artifacts of low marker density; although
LD block boundaries often co-localize with recombination hotspots, this tendency is
variable; a typical SNP is highly correlated with many of its neighbors, so the most
informative ones can serve as proxies for many of their neighbors in genetic studies —
in the CEPH samples, on average, one in five common SNPs has 20 or more perfect
proxies and three in five have 5 or more, while only one in five has no perfect proxies,
where a perfect proxy for a SNP means that the pairwise r² between the proxy and the
locus of interest equals one; LD patterns are not the same across the four ethnic groups,
so the set of tagSNPs may not transfer across populations. The project’s Phase II is now
genotyping an additional 4.6 million SNPs in each of the HapMap samples. (The
International HapMap Consortium 2005)

Microarray-based high-throughput genotyping availability. For association mapping,
the knowledge of LD patterns and the tools for selecting tagSNPs only help to reduce
the amount of required genotyping without losing a lot of power. What is crucial to the
realization of genome-wide association studies is the efficiency of accurately
genotyping a large number (several hundred thousand) of specific SNPs among the 3
billion bases that constitute the human genome (Syvanen 2005). Recent progress in the
development of microarray systems for highly-multiplexed genotyping has provided the
necessary platform (Syvanen 2005), allowing genotyping of both a fixed set of SNPs
across the genome (fixed SNP arrays) and a customized set of SNPs from any
chromosomal region of interest (custom-designed arrays).

The GeneChip Mapping Sets of Affymetrix and the Sentrix BeadChips by Illumina
support the genotyping of pre-selected panels of SNPs (fixed SNP arrays). Affymetrix
GeneChip arrays include the 10K, the 100K and the 500K arrays (2 × 250K SNP chips),
with the 500K array released in October of 2005. SNPs included on the Affymetrix
products have been selected primarily on the basis of technical quality and thus
represent a quasi-random set of SNPs (Pe'er, de Bakker et al. 2006). With the 500K
array, the median physical distance between SNPs is 2.5 kb and the average distance is
5.8 kb. The average heterozygosity of these SNPs is 0.30 and the average minor allele
frequency is 21%. Eighty-five percent of the human genome is within 10 kb of a SNP.
The HumanHap300 BeadChip from Illumina was released in January 2006 and supports
the genotyping of over 300K SNPs on a single beadchip. SNPs on the HumanHap300
BeadChip were selected using a pair-wise correlation-based algorithm applied to
genotype data of HapMap Phase I SNPs in the CEPH panel (Pe'er, de Bakker et al.
2006), specifically to provide high coverage (r² ≥ 0.7) of the vast majority of common
SNPs. The HumanHap240S BeadChip (Illumina), currently available, contains an
additional 240,000 tagSNPs derived from the Phase I+II HapMap data and provides
tagSNP coverage in regions of lower LD. Recently the HumanHap550 BeadChip
(Illumina) was also introduced; it contains over 550K SNPs on a single microarray and
provides the same genomic coverage as the HumanHap300 and HumanHap240S
BeadChips combined.

To investigate the proportion of common variation captured by the fixed sets of SNPs
on these arrays, one study compared the Affymetrix 500K array and the HumanHap300
BeadChip with reference to the 3.9 million SNPs genotyped in Phase II of HapMap
(Pe'er, de Bakker et al. 2006). The percentage of common SNPs (minor allele frequency
≥5%) in the Phase II data that could be predicted with r² ≥ 0.8 by individual SNPs on
the Affymetrix 500K array is 63% for the CEPH (whites) and East-Asian samples and
44% for the African samples. The corresponding predictability of the HumanHap300
(Illumina) array is 67% for the CEPH samples, 53% for Asians and 36% for Africans
(Pe'er, de Bakker et al. 2006).

Both the Molecular Inversion Probe (MIP) and the GoldenGate assays support
genotyping of several thousand customer-selected SNPs on one microarray (Gunderson,
Steemers et al. 2005). Affymetrix is currently developing a 50K SNP array with the
MIP assay that will specifically add SNPs of high interest (such as coding SNPs or
SNPs in conserved non-coding regions) in regions of the genome that are not already
well captured by the 500K fixed array. The 50K array complements the 500K product
in that supplemental SNPs can be targeted for inclusion to provide better coverage for
genome-wide association studies. The GoldenGate assay (Illumina) is designed for
genotyping as many as 1,536 custom SNPs per sample in a 96-well format, with the
potential to multiplex to 10,000 (Gunderson, Steemers et al. 2005). More recently,
Illumina has marketed the 12K BeadChip (using the Infinium Assay), which supports
genotyping of 12,000 user-defined SNPs. Each 12K custom beadchip has 6 independent
arrays for the analysis of 6 samples at a time.
Despite the many advances in genotyping technology (including a rapid drop in
genotyping price), genome-wide association studies are still expensive. As an example,
the current price for genotyping a sample with the Affymetrix 500K+50K array is
$600-900, depending on the size of the order. A genome-wide study of 3,000 subjects
would spend about 2 to 3 million dollars just for genotyping. In order to further improve
efficiency, many groups now use the multi-stage design (Hirschhorn and Daly 2005;
Thomas, Haile et al. 2005), in which the fixed SNP array is used in the first stage to
genotype all the markers on a subset of individuals and a custom-designed microarray
is used in later stages to follow up the most promising markers.
Other issues that arise in the context of genome-wide association studies include
the problem of multiple comparison adjustment, dealing with genotyping errors, gene
and environment interaction, population stratification, etc.; for a complete review, refer
to (Hirschhorn and Daly 2005; Thomas, Haile et al. 2005; Wang, Barratt et al. 2005).
The Multiethnic Cohort Study (MEC)
The MEC is a population-based cohort study designed to provide prospective
data on cancer as it relates to diet and other exposures hypothesized to alter cancer risk.
The MEC is composed of over 215,000 men and women primarily recruited from five
racial or ethnic groups: African Americans, Native Hawaiians, Japanese, Latinos and
Whites. The cohort was assembled in 1993-1999 by mailing a self-administered
questionnaire to persons aged 45-75 years who were identified through the driver's
license files of the state of Hawaii and the county of Los Angeles, California. To
maximize the diversity of exposures, the cohort spans all socioeconomic levels.
Detailed information on demographics, medical and reproductive histories, medication
use (including hormonal replacement therapy), family history of various cancers,
physical activity and quantitative food consumption was obtained from the baseline
and the year-five follow-up questionnaires. All primary incident cancer cases are
identified through the corresponding cancer surveillance registries (Kolonel, Henderson
et al. 2000).
Analysis of the incident cancers in the MEC showed that not all cases are
"explainable" by established environmental risk factors, suggesting that genetic factors
may be involved in disease processes. Thirty candidate genes in the pathways of steroid
hormones/growth factors were sequenced and the LD patterns were characterized in a
multi-ethnic panel of 350 controls (70 from each ethnic group). The nested case-control
analyses for 20 candidate genes have been completed; preliminary results indicated
possible associations of a long-range haplotype with breast cancer (Haiman, Stram et al.
2003) and the insulin-like growth factor 1 (IGF1) locus with prostate cancer (Cheng,
Stram et al. 2006), for example. Very recently the MEC has contributed important new
data to our knowledge of the relationship between genetic variation in region 8q24 and
risk of prostate cancer in all 5 ethnic groups (Haiman et al., Nature Genetics, 2007, in
press). Genome-wide association studies are now being planned to search
comprehensively for all possible common variants related to colorectal cancer and other
cancers using the MEC data. Two issues arising from this plan are addressed in this
dissertation: minimizing the genotyping cost of a two-stage genome-wide association
study and properly accounting for population structure in the MEC data. A brief
introduction to these two issues follows.
Two-stage design. With the two-stage approach, all SNPs are genotyped in the
first stage in a fraction of the samples, and a liberal significance threshold is used to
identify a subset of SNPs with putative associations. In later stages, these putative
associations are re-tested in a larger (sometimes separate) set of subjects. Besides
improved efficiency, the two-stage approach also has the advantage of reducing the
chance of false positive associations that result from genotyping artifacts. Potential
data quality issues associated with the current highly multiplexed genotyping
technology, such as heterozygote dropout or differential call/error rates between cases
and controls, may create false positives that are rare in general, but abundant among
positive associations detected in a genome-wide study. Since these artifacts are usually
technology-specific, genotyping the SNPs selected from the first stage on a different
platform in later stages, which is appropriate because of the difference in the genotyping
sizes of the two stages, may greatly alleviate this practical problem. Although previous
papers have discussed the benefits of the two-stage genotyping approach, their results
are not directly relevant to the current situation of designing genome-wide scans, given
that these papers were published more than two years ago when the candidate-gene
study was the only available association design.
In Chapter 1, a procedure is documented for finding the minimum genotyping
cost of a two-stage case-control association study based on a range of realistic cost
ratios for the fixed SNP chip (in stage I) and the custom-designed multiplexed
genotyping platform (in stage II). Bonferroni correction is applied to adjust for multiple
comparisons.
In Chapter 2, several follow-up issues are addressed: (1) whether the one-stage
sample size calculation in Chapter 1 works well, compared to others; (2) incorporating
the method that gives the most accurate empirical power, which is also used by the
program Quanto, into the two-stage design procedure; (3) examining how a more
liberal type I error rate (given by the False Discovery Rate adjustment) could be used to
save even more on the two-stage genotyping cost and sample size, if it is believed that
multiple true causal variants are responsible for disease traits.
Population stratification. One criticism faced by population-based case-control
association studies is population stratification, i.e. "spurious associations" that are
merely consequences of cryptic population substructure. Although family-based
association studies can provide protection from population stratification, case-control
studies are sometimes preferred because of their simple design. The impact of
population stratification on case-control association studies has been under debate for a
long time, partly due to a lack of real data to directly address this issue. As an indirect
result of the debate, various methods have been developed to control for hidden
substructure if it is of concern in case-control studies.
The two prevailing methods for correcting population stratification are
structured association (Pritchard, Stephens et al. 2000) and genomic control (Devlin and
Roeder 1999), both of which rely on genotypes of a set of unlinked "null" markers for
inference on the underlying population structure. Recently, one paper described how a
mixed model is useful in modeling both the cryptic subpopulation relatedness and the
inbreeding random effect for quantitative traits (Yu, Pressoir et al. 2006). The
counterpart of the linear mixed model approach in case-control studies is the
generalized linear mixed model (GLMM). In Chapter 3, I describe the performance of
the GLMM in controlling for stratification in different settings of a structured study
population, compared with other simple methods. The useful approaches were applied
to the MEC breast cancer data to examine the possible impact of population
stratification on future MEC studies.
Chapter 1
Optimal two-stage genotyping design for genome-wide
association scans
As discussed earlier, two-stage genotyping design has been shown to reduce
genotyping cost while preserving study power. Previous papers are not directly relevant
to the current situation of designing genome-wide scans for three reasons. First, in
genome-wide scans the total number of SNPs being considered far exceeds those
considered in earlier papers and the number of SNPs has an impact upon the optimal
sample sizes in the two stages and the threshold for significance in stage I. Second, the
per-genotype costs in the first and the second stage are dramatically different in a
genome-wide scan. The fixed SNP array used in the first stage is relatively cheaper
because a fixed set of SNPs are given on the SNP chip so that the costs of chip
development can be distributed over the very large number of SNP chips sold by the
developers, but the second stage genotyping requires specialty assays to be developed
focusing only upon the (at least partly random) set of SNPs that passed the significance
level of the first stage. Recently cost ratios appear to be in the range of 15 to 20,
comparing per-genotype costs using the fixed SNP array in the first stage versus the
specially designed genotyping platforms in the second stage. While the costs of both the
stage I and stage II genotyping technologies are rapidly dropping, it seems likely that a
large cost differential will remain for the foreseeable future. Third, based on the one
early release of the Affymetrix 500K array, it is clear that the choice of SNPs on the
chip has not as yet been optimized as tag SNPs. Therefore it seems appealing to
consider improving upon the set of SNPs used in the first stage by genotyping in the
second stage a full set of tag SNPs for all regions containing any SNPs that pass the
initial significance criteria. In the previous work, there has been no consideration of
attempting to expand the marker set at promising regions in the second stage.
In this chapter, we at first ignore issue (3) and consider the design of a genome-
wide case-control association study based upon a specific cost ratio that has lately been
widely discussed for SNP chips and high-throughput genotyping. We derive optimal
designs and minimum costs for studies that strictly control the type I error rate (using
1 × 10⁻⁷ as the criterion for significance) while giving 90 percent power to detect the
effect of a causal SNP. When, as discussed in (Satagopan and Elston 2003), the
available sample size as well as monetary resources is under constraint, the optimal
design may be impractical in that it may use more subjects than are actually available.
To partially meet this additional limitation on study design we present the set of
parameters that achieve the desired overall power and type I error rate but give the
minimum sample size among all designs that cost less than 110% of the minimum;
we call such designs "nearly optimal". We then extend these initial findings and give
approximate optimal and "nearly optimal" designs for a wide range of cost ratios of the
two stages and of the total number of SNPs being genotyped in stage I.
In the last portion, we take up the issue of increasing second-stage marker
density in promising regions. In this part, we treat the SNPs genotyped in the HapMap
study as comprising a target set of SNPs for which inference about disease risk is to be
performed, and consider genotyping in stage II all SNPs (or their perfect surrogates)
from among the target set in the regions in which a hit from stage I occurred. We
quantify the improvement in predicting variants in these regions by the R_s² statistic
(Stram 2004), which is the squared correlation between an unmeasured SNP and the
best prediction possible given the available surrogates for that SNP that are measured.
We give a simplified example of this design for a specific cost ratio and number of
associations (1 million) to be considered.
Methods
Measured SNPs association test
We first assume that the only SNPs of interest are those that actually appear on
the SNP chip, so that no imputation of other common but unmeasured (HapMap) SNPs
is performed and no increase in marker density is made in stage II. Here we assume a
population-based case-control study is used for the genome scan, i.e. all cases and
controls are unrelated. In the first stage, the full set of m markers (SNPs) is genotyped
on n1 subjects. A stage I test of association is performed on all markers at a significance
level α1. The significant markers are then genotyped in an additional n2 subjects, and
the stage II statistics based on all (n1 + n2) subjects are evaluated at a significance level
α2. Generally α2 is slightly larger than the overall type I error, α, reflecting the
necessity that a marker must be significant in both stages to be significant overall.
Let p denote the population allele frequency and assume Hardy-Weinberg
equilibrium. Here the allele of interest is the one that increases disease risk, not
necessarily the minor allele. Let δ denote the number of risk alleles a subject carries
(allele dosage), δ = 0, 1 or 2. We assume a multiplicative genetic risk model, and let ψ
be the increased relative risk (odds ratio, OR) of carrying each additional copy of the
disease allele, i.e.,

ψ = [Pr(δ = i | Y = 1) / Pr(δ = i − 1 | Y = 1)] / [Pr(δ = i | Y = 0) / Pr(δ = i − 1 | Y = 0)],

where Y = 1 refers to the status of being affected and i = 1, 2. Under a rare disease
assumption, the distribution of allele dosage for a control subject, δ_{j0}, and for a case
subject, δ_{i1}, can be expressed as functions of p and ψ with Bayes' formula.
Specifically, δ_{j0} ~ Binom(2, p), with variance σ0² = 2p(1 − p). Let f denote the
fraction of cases in a case-control sample of n subjects. The test statistic S′ is the
difference in average risk allele dosage between cases and controls, i.e.,

S′ = (1/(nf)) Σ_{i=1}^{nf} δ_{i1} − (1/(n(1 − f))) Σ_{j=1}^{n(1−f)} δ_{j0}.

Asymptotically, S′ is normally distributed with mean μ = p_d − 2p and variance

σ² = σ_d²/(nf) + σ0²/(n(1 − f)),

where the expected value of δ_{i1} is denoted as p_d and its variance as σ_d². Again, we
derive the forms of μ and σ² under the assumption of a rare disease. Specifically,

μ = 2p(1 − p)(ψ − 1) / (1 − p + ψp),

σ² = 2p(1 − p)[(1 − f)ψ + f(1 − p + ψp)²] / [nf(1 − f)(1 − p + ψp)²].

Consider a one-sided test at both stages, i.e. H0: ψ = 1 vs. HA: ψ > 1. Then under
H0, we have μ = 0 and σ² = 2p(1 − p) / (nf(1 − f)). Note that the value of σ² depends
upon which hypothesis, H0 or HA, is true. Let V0 and VA be the terms such that the
variance of S′ under H0 and HA is V0/n and VA/n, respectively. Let α′ and 1 − β′ be
the usually defined type I error and power, and let Z_q be the qth percentile point of a
standard normal distribution. Then

nμ² = (Z_{1−α′} √V0 + Z_{1−β′} √VA)².   (1.1)
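The moments above and formula (1.1) assemble into a compact one-stage sample size
calculation. The following is an illustrative sketch in Python (the dissertation's own
computations used R; the function names here are mine, not from the source):

```python
from statistics import NormalDist

def moments(p, psi, f):
    """Mean and scaled variances of S' under the multiplicative model,
    using the rare-disease approximation. Returns (mu, V0, VA), where
    Var(S') is V0/n under H0 and VA/n under HA."""
    p_case = psi * p / (1 - p + psi * p)   # risk-allele frequency in cases
    mu = 2 * (p_case - p)                  # mean dosage difference, p_d - 2p
    var0 = 2 * p * (1 - p)                 # dosage variance in controls (HWE)
    var_d = 2 * p_case * (1 - p_case)      # dosage variance in cases
    return mu, var0 / f + var0 / (1 - f), var_d / f + var0 / (1 - f)

def one_stage_n(p, psi, f, alpha, power):
    """Solve n*mu^2 = (Z_{1-alpha}*sqrt(V0) + Z_{power}*sqrt(VA))^2 for n."""
    z = NormalDist().inv_cdf
    mu, V0, VA = moments(p, psi, f)
    return (z(1 - alpha) * V0 ** 0.5 + z(power) * VA ** 0.5) ** 2 / mu ** 2
```

For p = 0.2, ψ = 1.5, f = 0.5, α = 10⁻⁷ (one-sided) and 90 percent power this gives
roughly 2,600 subjects, in line with the magnitudes appearing in the tables below.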
In the two-stage design, let f1 and f2 denote the case fractions among the n1 and
n2 subjects, respectively. For the n1 and n2 subjects,

S1 = (1/(n1 f1)) Σ_{i=1}^{n1 f1} δ_{i1} − (1/(n1(1 − f1))) Σ_{j=1}^{n1(1−f1)} δ_{j0},

S2 = (1/(n2 f2)) Σ_{i=1}^{n2 f2} δ_{i1} − (1/(n2(1 − f2))) Σ_{j=1}^{n2(1−f2)} δ_{j0},

and for all subjects combined,

S = Σ_{i=1}^{n1 f1 + n2 f2} δ_{i1} / (n1 f1 + n2 f2)
    − Σ_{j=1}^{n1(1−f1) + n2(1−f2)} δ_{j0} / (n1(1 − f1) + n2(1 − f2)).

Asymptotically, S1, S2 and S are normally distributed with mean μ and variances
σ_{S1}², σ_{S2}², σ_S² determined by n1, n2, f1, f2, p and ψ. Furthermore, the covariance
between the first-stage statistic S1 and the second-stage statistic S is

Cov(S1, S) = σ_d²/(n1 f1 + n2 f2) + σ0²/(n1(1 − f1) + n2(1 − f2)).

The vector (S1, S) at a marker location is then approximately normally distributed with
a mean and variance matrix that can both be calculated. A two-stage study that achieves
the nominal power level 1 − β and type I error rate α (one-sided) should satisfy the
following two equations:
1 − β = Pr(S1 > c1, S > c2 | HA) = ∫_{c1}^{∞} ∫_{c2}^{∞} Φ_{S1,S | ψ>1}(u, v) du dv,   (1)

α = Pr(S1 > c1, S > c2 | H0) = ∫_{c1}^{∞} ∫_{c2}^{∞} Φ_{S1,S | ψ=1}(u, v) du dv,   (2)

where c1 and c2 are the threshold values of the two stages, respectively, and Φ_{S1,S}
denotes the joint (bivariate normal) density of (S1, S). Here for simplicity we are
considering only one-sided tests. When f1 = f2,

Pr(S1 > c1, S > c2) = Pr(S1 > c1, S2 > (1 + n1/n2) c2 − S1 n1/n2).

Since S1 and S2 are independent, (1) and (2) can be easily calculated using tools
available in the statistical programming language R, to evaluate the
bivariate normal integral.
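With f1 = f2 the joint probability reduces to a one-dimensional integral over S1, which
is easy to evaluate numerically. A minimal sketch (in Python rather than the R used
here; the function and its argument names are illustrative):

```python
from statistics import NormalDist

def two_stage_prob(c1, c2, n1, n2, mu, V):
    """Pr(S1 > c1, S > c2) when f1 = f2, using independence of S1 and S2.

    S1 ~ N(mu, V/n1) and S2 ~ N(mu, V/n2); the pooled statistic S satisfies
    {S > c2} = {S2 > (1 + n1/n2)*c2 - S1*n1/n2} given S1, so the joint
    probability is a one-dimensional integral over the S1 density.
    """
    s1 = NormalDist(mu, (V / n1) ** 0.5)
    s2 = NormalDist(mu, (V / n2) ** 0.5)
    r = n1 / n2
    hi = max(c1, mu) + 10 * s1.stdev   # effective upper limit of integration
    steps = 20000
    h = (hi - c1) / steps
    total = 0.0
    for i in range(steps):             # midpoint rule over u = S1
        u = c1 + (i + 0.5) * h
        total += s1.pdf(u) * (1.0 - s2.cdf((1 + r) * c2 - u * r)) * h
    return total
```

Sending c2 far below zero recovers Pr(S1 > c1), which provides a convenient
numerical check of the integration.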
Among the m markers, suppose D SNPs are true causal variants or in complete
LD with the causal variants. Let t1 and t2 be the per-genotype costs in stage I and stage
II, respectively. The expected genotyping cost of the overall study is t1 n1 m + t2 n2 m2,
where the number of markers, m2, genotyped in stage II has expectation
[(m − D) α1 + D(1 − β1)]. Our goal is to find the minimum expected cost and the
corresponding parameters n1, c1, n2, c2 (thus α1, β1, α2, β2), subject to the two
constraints (1) and (2). As discussed above, we assume that the total number of SNPs
m = 500,000. With m fixed, the expected cost is a function of the ratio t2/t1. For the
given values of m, D, t1, t2, 1 − β, the Bonferroni-corrected α, and for each combination
of p and odds ratio ψ, we performed a grid search on α1 and 1 − β1 over the region
α1 ∈ (0, 1) and (1 − β1) ∈ (1 − β, 1). We used a built-in R optimizing function to find
n2 and c2 that minimize the following condition:

[Pr(S1 > c1, S > c2 | ψ > 1) − (1 − β)]² + [Pr(S1 > c1, S > c2 | ψ = 1) − α]².   (3)

We only accepted solutions such that the value of (3) is less than 10⁻¹⁰. In other words,
for given values of α1 and 1 − β1 (thus n1 and c1), we searched for the n2 and c2 that
minimize (3) and make the empirical power and type I error rate less than 0.001% away
from the nominal levels 1 − β and α. Each set of parameters α1, 1 − β1, n2 and c2
determines an expected cost. The expected cost is then compared over values of α1 and
1 − β1 within the region α1 ∈ (0, 1) and (1 − β1) ∈ (1 − β, 1), and the minimum is
returned.
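The expected cost of any candidate design is simple to evaluate directly. A minimal
sketch (the function name is mine; the default per-genotype costs t1 = $0.002 and
t2 = $0.035 match the illustrative values used in the Results):

```python
def expected_cost(n1, n2, alpha1, beta1, m=500_000, D=1, t1=0.002, t2=0.035):
    """Expected genotyping cost t1*n1*m + t2*n2*m2 of a two-stage design.

    m2, the expected number of SNPs promoted to stage II, is
    (m - D)*alpha1 + D*(1 - beta1).
    """
    m2 = (m - D) * alpha1 + D * (1 - beta1)
    return t1 * n1 * m + t2 * n2 * m2
```

Plugging in the design of Table 1 for p = 0.2 and ψ = 1.5 (n1 = 1,004, n2 = 2,314,
α1 = 0.00374, 1 − β1 = 0.907) reproduces the tabulated total cost of about $1,156
thousand.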
One question is whether the α1, β1, α2 and β2 that achieve the minimum
expected cost vary with the genetic effect size μ, which is determined by p and ψ.
Written in terms of α1, β1, α2 and β2, the expected cost is

(V0/μ²) { (t1 m − t2 m2)(Z_{1−α1} + Z_{1−β1} √(VA/V0))²
          + t2 m2 (Z_{1−α2} + Z_{1−β2} √(VA/V0))² }.   (4)

Using α1 = Pr(S1 > c1 | H0) and α2 = Pr(S > c2 | H0), the constraints (1) and (2) can be
written as functions of α1, β1, α2, β2 and VA/V0. Although the set of parameters that
minimize (4) does not depend on μ, they depend on p and ψ through V0 and VA, except
when VA/V0 ≈ 1, because then minimizing (4) is independent of μ or V0 and (1) and (2)
depend only on α1, β1, α2 and β2. Otherwise, it is not straightforward to see how the
solutions would change; thus we present results for various values of p and ψ.
For two-sided tests, the overall power function is still equation (1), while the
overall type I error becomes

α = Pr(|S1| > c1, |S| > c2 | H0)
  = 2 [Pr(S1 > c1, S > c2 | ψ = 1) + Pr(S1 < −c1, S > c2 | ψ = 1)].

The second probability term was ignored because calculations show that it is much
smaller than the first term. Other procedures are similar to those for one-sided tests.
Increasing marker density in stage II
Assuming that the SNPs on the fixed array used in stage I have all been genotyped in
the HapMap samples, we consider treating these SNPs as tag SNPs and using them to
predict each HapMap SNP on the basis of (restricted) combinations of the total set of
markers on the array. This makes possible an approximate, although admittedly
imperfect, first-stage analysis of the relationship between all common HapMap SNPs
and the disease of interest. For the purpose of this paper, we assume that for each
measured or predicted SNP that passes the stage I significance threshold, the original
SNP and 5 additional SNPs in the surrounding region will be genotyped in order to
define the pattern of LD around a potential causative allele in stage II. Basically we are
assuming that each HapMap SNP will have a surrogate formed from the SNPs on the
fixed array. All commercial SNP arrays will undoubtedly be applied to the HapMap
samples precisely for the purpose of seeing how well the SNP array tags all HapMap
SNPs. This involves forming an allelic surrogate for each HapMap SNP composed
either of single-SNP surrogates or multiallelic (haplotype) predictors. Details of how
optimal haplotype predictors for SNPs are formed are given in (Stram 2004). Each of
these predictors can then be computed in stage I of the study and used to test the
additional unmeasured SNPs. For each region of the human genome one can a priori
decide upon the additional SNPs that will need to be genotyped in order to make sure
that all the target HapMap SNPs have perfect surrogates in stage II genotyping, for
regions containing SNPs that pass the stage I threshold.
One complication involved in designing a study for which the analysis will be
based upon the best predictions of unmeasured SNPs given the measured SNPs is the
need to estimate a reasonable correction to nominal significance values to account for
the expansion of multiple testing. For the purpose of comparing designs, we choose to
use a nominal significance value of α = 5 × 10⁻⁸, which corresponds to an assumption
of a total of one million tests being performed, preserving a study-wise false positive
rate of 5 percent. This number of tests appears to be close to the effective number of
tests that would be produced if all HapMap SNPs had been genotyped directly in a
panel of White or Asian subjects (Pe'er, de Bakker et al. 2006). Our use of
α = 5 × 10⁻⁸ is conservative because, even if this were an appropriate correction if all
non-redundant HapMap SNPs were to be genotyped and tested, the predicted allele
dosages are in general more strongly correlated than are the true SNP dosages being
predicted. This concern applies, however, to stage I only, since we assume that all
common HapMap SNPs are genotyped or perfectly tagged in the second stage.
For a measured SNP, hypothesis testing is the same as described in the previous
section. For unmeasured SNPs, the test is based on E(δ | G), the expected dosage given
the genotypes G of the measured SNPs (Stram 2004). The test statistic for an
unmeasured marker is

S_E = (1/(nf)) Σ_{i=1}^{nf} δ_{iE1} − (1/(n(1 − f))) Σ_{j=1}^{n(1−f)} δ_{jE0},

where f is the case fraction among the n subjects and δ_{iE1} and δ_{jE0} refer to
E(δ | G) for an affected subject i and an unaffected subject j, respectively. Using
general arguments common in the measurement error literature (Tosteson and Ware
1990; Carroll, Ruppert et al. 1995), we can correct for the effect on power of
measurement error (in estimating the allele dosage) by treating 1/R_s² as a sample size
inflator, with this approximation holding more
precisely for small associations than for large ones.
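As a sketch of this correction (illustrative code, not from the dissertation), R_s² can be
computed as the squared correlation between true and predicted dosages, and the
required sample size inflated accordingly:

```python
def r_squared(true_dosage, predicted_dosage):
    """Squared Pearson correlation between an unmeasured SNP's dosage and
    its best prediction from measured tag SNPs (the R_s^2 of Stram 2004)."""
    n = len(true_dosage)
    mx = sum(true_dosage) / n
    my = sum(predicted_dosage) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(true_dosage, predicted_dosage))
    sxx = sum((x - mx) ** 2 for x in true_dosage)
    syy = sum((y - my) ** 2 for y in predicted_dosage)
    return sxy ** 2 / (sxx * syy)

def inflated_n(n_required, rs2):
    """Treat 1/R_s^2 as a sample-size inflation factor for testing E(delta|G)."""
    return n_required / rs2
```

For example, a SNP tagged with R_s² = 0.8 requires 1/0.8 = 1.25 times the sample
size needed for a directly genotyped SNP of the same effect size.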
Results
As an illustrative example, Table 1 gives details of the optimal design having 90
percent power to detect an effect on susceptibility at the Bonferroni type I error rate
10⁻⁷ for a one-sided test, for several values of the variant allele frequency, p, and odds
ratio, ψ, when a fixed array of 500,000 SNPs with a cost of one thousand dollars per
subject (0.2 cents per genotype) is used in stage I and where genotyping costs are 3.5
cents per genotype in stage II (cost ratio 17.5). We note immediately from Table 1 that,
as in all power calculations, the optimized cost and sample sizes depend greatly on p
and ψ. However, the fraction of total participants assigned to stage I,
k1 = n1/(n1 + n2), the significance level, α1, used to promote SNPs from stage I to
stage II, the power, 1 − β1, of this promotion under the alternative hypothesis, as well
as the significance level to be used at the conclusion of the study, α2, depend very little
on the assumed value of either p or ψ.
Table 1. Minimum expected cost for a one-to-one unmatched study and various allele frequencies, p, and
odds ratios, ψ, and the set of parameters at which the minimum is reached, for m = 500,000, D = 1,
t1 = $0.002, t2 = $0.035 and α = 1 × 10⁻⁷ (1-sided), 1 − β = 0.90.
p   ψ   α1   1 − β1   α2 (× 10⁻⁷)   n1   n2   Total cost ($ thousands)   n1/(n1 + n2)
0.1 1.35 0.00370 0.907 1.6 3,238 7,490 3,724 0.302
1.5 0.00366 0.907 1.6 1,662 3,824 1,906 0.303
1.75 0.00362 0.907 1.6 792 1,818 907 0.303
2 0.00365 0.907 1.6 476 1,090 545 0.304
0.2 1.35 0.00373 0.907 1.7 1,920 4,458 2,211 0.301
1.5 0.00374 0.907 1.6 1,004 2,314 1,156 0.303
1.75 0.00372 0.907 1.7 494 1,146 568 0.301
2 0.00372 0.907 1.6 306 704 351 0.303
0.4 1.35 0.00376 0.907 1.6 1,420 3,298 1,637 0.301
1.5 0.00376 0.907 1.7 772 1,794 889 0.301
1.75 0.00379 0.907 1.7 402 936 463 0.300
2 0.00376 0.907 1.7 262 610 302 0.300
0.6 1.35 0.00378 0.907 1.6 1,568 3,648 1,809 0.301
1.5 0.00382 0.907 1.6 880 2,054 1,017 0.300
1.75 0.00382 0.907 1.7 482 1,126 557 0.300
2 0.00377 0.907 1.7 328 766 378 0.300
0.8 1.35 0.00383 0.907 1.6 2,582 6,030 2,987 0.300
1.5 0.00382 0.907 1.7 1,498 3,516 1,733 0.299
1.75 0.00386 0.907 1.7 854 2,020 991 0.297
2 0.00386 0.907 1.7 600 1,418 697 0.298
0.9 1.35 0.00379 0.907 1.7 4,814 11,272 5,562 0.299
1.5 0.00383 0.907 1.7 2,828 6,652 3,274 0.298
1.75 0.00386 0.907 1.7 1,648 3,884 1,910 0.298
2 0.00388 0.907 1.7 1,176 2,770 1,365 0.298
In Methods we show that the optimal parameters k1 through α2 do depend
formally on the ratio of the variance, VA, of S under the alternative versus the variance,
V0, under the null, and that this ratio is a function of both p and ψ. However, much
additional calculation indicates that this near independence of k1 through α2 of p or ψ
holds quite generally despite the formal dependency of these parameters on the ratio of
variances, VA/V0, which varies over a range of 0.76 to 1.40 for the choices of p and ψ
given. On the other hand, these four design parameters, k1, α1, 1 − β1, and α2, do
depend importantly upon the cost ratio t2/t1, the desired power of the study, 1 − β, and
the total number of SNPs considered in stage I.
With the values of the number of SNPs in stage I, overall power, cost ratio, and
Bonferroni-adjusted one-sided p-value as in Table 1, we find that the optimal design,
for all values of the allele frequency p (0.05 to 0.95) and odds ratio ψ (1.35 to 2),
assigns approximately 2.2 times more subjects to stage II than to stage I (k1 = 0.31),
promotes approximately 0.4% of all SNPs to stage II, and has power of promotion just
slightly higher (0.907) than the overall power of the study (90 percent). Further
calculations show that the power (1 − β_N) of a one-stage study that uses the optimal
total number of subjects of the two-stage design and tests at the significance level α is
0.98, regardless of p and ψ. (We call 1 − β_N the "nominal power" of the study.) This
is important as an aid to designing a two-stage study using standard (one-stage)
software: putting in the Bonferroni alpha (10⁻⁷) and the nominal power 1 − β_N,
rather than the desired power of 90 percent, we obtain the optimal value of n1 + n2 for
any p or ψ. If we compare the cost of an optimal two-stage design to that of a one-stage
study, both of which achieve power 1 − β when tested at significance level α, the cost
of the two-stage design is about 42-45% that of the one-stage design.
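This shortcut can be sketched as follows (a Python illustration of plugging the nominal
power into a one-stage sample size formula; any standard one-stage program could
play the same role, and the function name is mine):

```python
from statistics import NormalDist

def one_stage_n(p, psi, f, alpha, power):
    """One-stage sample size under the multiplicative model (rare disease)."""
    z = NormalDist().inv_cdf
    p_case = psi * p / (1 - p + psi * p)   # risk-allele frequency in cases
    mu = 2 * (p_case - p)
    var0 = 2 * p * (1 - p)
    var_d = 2 * p_case * (1 - p_case)
    V0 = var0 / f + var0 / (1 - f)
    VA = var_d / f + var0 / (1 - f)
    return (z(1 - alpha) * V0 ** 0.5 + z(power) * VA ** 0.5) ** 2 / mu ** 2

# Plug the Bonferroni alpha and the "nominal power" 0.98 into the one-stage
# formula to obtain the optimal two-stage total n1 + n2 directly.
total_n = one_stage_n(p=0.2, psi=1.5, f=0.5, alpha=1e-7, power=0.98)
```

For p = 0.2 and ψ = 1.5 this gives about 3,290 subjects, close to the optimal
n1 + n2 = 1,004 + 2,314 = 3,318 in Table 1; the optimal allocation then assigns
k1 ≈ 0.30 of them to stage I.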
Table 2 presents a few examples of the set of parameters that minimize the total
sample size among all two-stage designs whose expected cost falls within 110% of the
minimum, for a one-to-one unmatched study and one true causal gene that is in
complete LD with a measured marker. Again for illustration purposes we consider
studies of 500,000 markers with power of 90 percent, a cost ratio of 17.5 and a
Bonferroni-adjusted type I error level of 10⁻⁷ for any variant. For these designs the
critical value α1 in the first stage is approximately 0.005, and the power of stage I is
now approximately 0.95. The significance level α2 at which the second-stage statistic
is rejected lies just above 10⁻⁷, and 1 − β_N is about 0.93, with minor fluctuations (due
undoubtedly in part to numerical errors in the integrations and optimization described
in the appendix). Thus, compared to the parameters at the minimum cost, the
significance level and power of stage I are both higher and the significance level and
power of stage II are both lower, resulting in an increased sample size in stage I and a
decreased sample size in stage II.
Table 2. Parameters for a two-stage one-to-one unmatched study such that the expected cost is within 1.1-
fold of the minimum and the sample size is the least, for various p and ψ, m = 500,000, D = 1,
t1 = $0.002, t2 = $0.035 and α = 1 × 10⁻⁷ (1-sided), 1 − β = 0.90.
p   ψ   α1   1 − β1   α2 (× 10⁻⁷)   n1   n2   Total cost ($ thousands)   n1/(n1 + n2)
0.2 1.35 0.00510 0.951 1.2 2,158 3,080 2,432 0.412
1.5 0.00517 0.951 1.1 1,126 1,606 1,271 0.412
1.75 0.00525 0.951 1.1 554 786 625 0.413
2 0.00520 0.950 1.1 344 484 387 0.415
0.4 1.35 0.00527 0.952 1.2 1,588 2,314 1,800 0.407
1.5 0.00531 0.952 1.2 862 1,258 978 0.407
1.75 0.00525 0.952 1.2 450 656 510 0.407
2 0.00529 0.952 1.2 292 428 332 0.406
0.6 1.35 0.00530 0.953 1.2 1,750 2,582 1,991 0.404
1.5 0.00518 0.953 1.2 986 1,452 1,119 0.405
1.75 0.00530 0.953 1.2 538 800 613 0.402
2 0.00531 0.953 1.2 364 546 416 0.400
0.8 1.35 0.00523 0.953 1.2 2,892 4,292 3,286 0.403
1.5 0.00519 0.953 1.2 1,678 2,500 1,906 0.402
1.75 0.00519 0.954 1.2 960 1,442 1,091 0.400
2 0.00517 0.954 1.2 674 1,020 767 0.398
Tables 3 and 4 present optimal and "nearly optimal" two-stage designs with
90% and 80% power to detect a causal variant, respectively, for a fixed choice of
p = 0.2 and ψ = 1.5, but where the cost ratio t2/t1 and the number of measured SNPs m
in stage I vary over ranges relevant to both genome-wide and less extensive studies. For
each value of m, a Bonferroni correction was applied, i.e., α = 0.05/m (2-sided).
Complete LD between the one causal variant and at least one of the markers is
assumed. For each choice, the fraction of subjects optimally allocated to stage I, the
significance level α1 and power 1 − β1 for moving from stage I to stage II, the cost
fraction of the optimal two-stage study relative to an equally powered one-stage study,
and the nominal power (1 − β_N) of a one-stage study that uses the sample size of the
optimal two-stage study, are shown.
Table 3. The dependence of the optimal two-stage design parameters upon the cost ratio t2/t1 and the
number of genotyped SNPs in stage I, m, for power 1 − β = 0.90, f = 0.5, p = 0.2 and ψ = 1.5. The
overall type I error rate is α = 0.05/m (2-sided).
Columns: t2/t1; m (× 10³); then for the optimal design: CF*, (1 − β_N)‡, n1/(n1 + n2), α1†, 1 − β1,
α2† (× 10⁻⁵); then for the nearly optimal design#: (1 − β_N)‡, n1/(n1 + n2), α1†, 1 − β1, α2† (× 10⁻⁵).
1 1 0.42 .976 .257 .091 .911 7.2 .924 .347 .133 .954 5.6
10 .37 .979 .224 .078 .910 .72 .929 .311 .097 .951 .57
30 .35 .983 .209 .070 .908 .24 .930 .296 .086 .950 .19
100 .33 .984 .200 .064 .908 .072 .932 .276 .083 .950 .057
500 .30 .986 .187 .056 .907 .015 .933 .253 .080 .949 .011
1,000 .30 .984 .187 .053 .908 .0072 .934 .255 .066 .948 .0057
5 1 .56 .971 .377 .0211 .910 8.0 .916 .518 .0301 .956 5.6
10 .48 .977 .324 .0166 .908 .80 .921 .447 .0221 .952 .57
30 .46 .979 .305 .0151 .908 .27 .923 .415 .0214 .952 .19
100 .43 .982 .286 .0136 .907 .080 .925 .392 .0186 .951 .057
500 .40 .984 .263 .0123 .906 .016 .928 .362 .0166 .950 .012
1,000 .39 .985 .256 .0115 .906 .0080 .928 .354 .0150 .949 .0058
10 1 .62 .966 .437 .0113 .911 8.2 .912 .589 .0169 .957 5.5
10 .53 .976 .369 .0084 .908 .84 .918 .502 .0124 .954 .57
30 .50 .978 .344 .0079 .908 .28 .920 .472 .0108 .952 .19
100 .47 .978 .325 .0073 .908 .082 .923 .443 .0095 .951 .057
500 .44 .984 .296 .0060 .906 .017 .925 .409 .0084 .951 .012
1,000 .42 .983 .290 .0059 .906 .0082 .926 .394 .0083 .950 .0058
15 1 .65 .965 .466 .0078 .910 8.4 .910 .635 .0114 .958 5.5
10 .56 .974 .392 .0062 .908 .85 .916 .534 .0089 .954 .56
30 .53 .976 .371 .0054 .908 .28 .919 .502 .0076 .953 .19
100 .50 .978 .348 .0049 .908 .084 .921 .471 .0067 .952 .057
500 .46 .981 .321 .0042 .907 .017 .923 .435 .0058 .951 .012
1,000 .45 .984 .305 .0040 .906 .0086 .924 .418 .0058 .951 .0058
20 1 .68 .964 .489 .0059 .910 8.6 .908 .667 .0089 .959 5.4
10 .58 .974 .411 .0046 .908 .87 .915 .563 .0064 .954 .56
30 .55 .976 .385 .0042 .908 .29 .917 .524 .0059 .953 .19
100 .51 .979 .359 .0037 .907 .087 .920 .490 .0052 .952 .057
500 .48 .981 .331 .0033 .906 .017 .922 .450 .0047 .951 .011
1,000 .46 .981 .322 .0032 .907 .0084 .923 .437 .0044 .951 .0058
100 1 .82 .943 .651 .0015 .914 8.1 .901 .856 .0024 .969 5.1
10 .69 .967 .515 .0011 .908 .93 .908 .694 .0017 .958 .54
30 .65 .970 .485 .0009 .908 .31 .910 .648 .0014 .956 .18
100 .61 .977 .441 .0008 .906 .098 .913 .605 .0012 .955 .056
500 .56 .981 .403 .0007 .905 .019 .916 .552 .0011 .954 .011
1,000 .55 .979 .397 .0007 .906 .0092 .917 .536 .0010 .953 .0057
* Cost fraction of an optimally designed two-stage study relative to a one-stage study. Both studies
achieve power 1- β when tested at significance level α.
#
Two-stage studies that cost 110% of the minimum expected cost but require the least sample size (n
1
+n
2
)
† αs = Pr(|Ss| > cs)H0, i.e., the significance level in stage s.
‡ Power of a one-stage study that uses the sample size of an optimal two-stage study and tests at α.
Table 4. The dependence of the optimal two-stage design parameters upon the cost ratio t2/t1 and the number of genotyped SNPs in stage I, m, for power 1−β = .80, f = .5, p = .2 and ψ = 1.5. The overall type I error rate α = .05/m (2-sided).

                           -------------- At minimum cost ---------------   ------- <10% above minimum cost# -------
t2/t1  m(×10^3)   CF*   (1-βN)‡  n1/(n1+n2)  α1†     1-β1   α2†(×10^-5)   (1-βN)‡  n1/(n1+n2)  α1†     1-β1   α2†(×10^-5)
1      1          .39   .950     .212        .088    .818   8.5           .852     .307        .120    .891   6.1
1      10         .33   .955     .191        .068    .817   .84           .861     .271        .088    .885   .62
1      30         .32   .959     .177        .066    .816   .28           .864     .246        .092    .885   .21
1      100        .30   .960     .174        .056    .816   .084          .867     .237        .078    .883   .062
1      500        .28   .959     .161        .055    .817   .016          .871     .214        .078    .882   .012
1      1,000      .27   .968     .152        .051    .813   .0085         .873     .212        .069    .880   .0063
5      1          .53   .934     .340        .0195   .818   9.3           .834     .476        .0289   .894   5.9
5      10         .45   .949     .286        .0156   .815   .95           .846     .400        .0226   .888   .61
5      30         .43   .956     .266        .0139   .813   .32           .850     .377        .0194   .886   .21
5      100        .40   .960     .249        .0126   .812   .096          .854     .350        .0181   .885   .062
5      500        .37   .965     .229        .0115   .811   .019          .859     .321        .0163   .883   .013
5      1,000      .36   .964     .227        .0103   .812   .0095         .861     .313        .0147   .882   .0063
10     1          .59   .925     .396        .0108   .819   9.5           .826     .553        .0155   .897   5.8
10     10         .51   .945     .329        .0082   .814   .99           .839     .463        .0116   .889   .61
10     30         .47   .947     .309        .0076   .815   .32           .844     .429        .0107   .888   .20
10     100        .44   .955     .287        .0067   .813   .099          .848     .403        .0091   .886   .062
10     500        .41   .960     .265        .0059   .812   .020          .853     .369        .0082   .884   .013
10     1,000      .40   .963     .254        .0056   .811   .010          .855     .358        .0076   .883   .0063
15     1          .63   .921     .429        .0076   .819   9.7           .821     .594        .0116   .900   5.7
15     10         .54   .940     .358        .0057   .815   1.0           .835     .499        .0080   .891   .60
15     30         .50   .950     .327        .0052   .813   .34           .840     .462        .0074   .889   .20
15     100        .47   .957     .305        .0044   .811   .10           .845     .433        .0062   .886   .062
15     500        .43   .958     .283        .0041   .812   .020          .850     .396        .0056   .885   .013
15     1,000      .42   .961     .273        .0039   .811   .010          .852     .381        .0054   .884   .0063
20     1          .66   .917     .452        .0060   .819   9.7           .818     .631        .0087   .901   5.7
20     10         .56   .938     .374        .0045   .815   1.0           .832     .520        .0066   .892   .60
20     30         .52   .949     .344        .0039   .812   .35           .837     .482        .0059   .890   .20
20     100        .49   .953     .320        .0036   .812   .10           .842     .451        .0050   .887   .061
20     500        .45   .961     .293        .0030   .810   .021          .848     .413        .0044   .886   .013
20     1,000      .43   .961     .287        .0028   .810   .010          .850     .399        .0042   .885   .0062
100    1          .81   .878     .619        .0015   .825   9.7           .802     .833        .0023   .918   5.7
100    10         .67   .927     .481        .0010   .813   1.0           .817     .665        .0016   .898   .60
100    30         .63   .932     .448        .0009   .813   .35           .823     .613        .0014   .895   .20
100    100        .59   .944     .408        .0008   .811   .10           .829     .568        .0012   .892   .061
100    500        .54   .946     .379        .0007   .812   .021          .835     .520        .0010   .889   .012
100    1,000      .52   .950     .362        .0007   .812   .010          .837     .503        .0009   .888   .0062
* Cost fraction of an optimally designed two-stage study relative to a one-stage study. Both studies achieve power 1−β when tested at significance level α.
# Two-stage studies that cost 110% of the minimum expected cost but require the least sample size (n1+n2).
† αs = Pr(|Ss| > cs)H0, i.e., the significance level in stage s.
‡ Power of a one-stage study that uses the sample size of an optimal two-stage study and tests at α.
Figure 1. The proportion, n1/(n1 + n2), of sample size allocated to stage I in the optimal design, plotted as a function of the cost ratio t2/t1 and the number of SNPs, m, genotyped in stage I, for a study with 90% power to detect an effect with type I error equal to .05/m. (x-axis: cost ratio, 1 to 20, with the number of markers, ×1,000, nested within each ratio; y-axis: proportion of sample size in stage I, 0 to 0.5.)
Figure 2. The significance threshold α1 for promoting a SNP from stage I to stage II in the optimal design, plotted as a function of the cost ratio t2/t1 and the number of SNPs, m, genotyped in stage I, for a study with 90% power to detect an effect with type I error equal to .05/m. (x-axis: cost ratio and number of markers, ×1,000; y-axis: significance threshold of stage I, log scale from 0.001 to 1.)
Figures 1 and 2 plot the relationships between k1, α1 and the cost ratio t2/t1 and the number of SNPs m, based on the results in Table 3, for the optimal study designs. It can be clearly seen that as the ratio of costs increases (holding the number of SNPs fixed), the fraction of subjects assigned to stage I increases and the p-value for promotion to stage II decreases, while as the number of SNPs increases (with fixed cost ratio), both the fraction of subjects assigned to stage I and the p-value for promotion decrease.
Table 5 shows the most cost-efficient design of a 1:1 case-control study assuming 1,000,000 tests, in which Rs² changes from .6 in stage I to 1.0 in stage II of the study, due to genotyping of 5 additional markers in stage II. Here we assume that in stage II all HapMap SNPs (or their perfect surrogates) are genotyped in the regions that pass the stage I criteria. Otherwise the assumptions made in the analysis are the same as given in Table 1. Comparing Table 5 to Table 1, we see that the optimal sizes of the first and second stages of the study are approximately equal, whereas in Table 1 the second stage was approximately 2.2 times larger than the first stage. In Table 5 the threshold, α1, for passing a SNP (or predicted SNP) from stage I to stage II is approximately .0005, nearly 8 times smaller than in Table 1. Both these changes reflect the considerably larger costs associated with genotyping in stage II in this scenario, where 6 SNPs are assumed genotyped for every (predicted or measured) SNP that passes the stage I criteria. However, the overall fraction of costs spent in stage I versus stage II does not differ very much between Table 1 (where 87 percent of costs are spent in stage I) and Table 5 (where 90 percent of costs are spent in stage I).
Table 5. Minimum expected cost for a one-to-one unmatched study, for various p and ψ, and the set of parameters at which the minimum is reached, for m = 500,000, D = 1, t1 = $.002, t2 = $.035, 1−β = .90 and α = 5×10^-8 (1-sided; see text for detail). Rs² is .6 in stage I and 1.0 in stage II.

p     ψ      α1       1-β1   α2(×10^-8)   n1      n2      Total cost ($ thousands)   n1/(n1+n2)
.1 1.35 .00050 .904 5.3 7,080 7,412 7,860 .489
1.5 .00050 .904 5.3 3,620 3,754 4,015 .491
1.75 .00048 .904 5.3 1,730 1,752 1,906 .497
2 .00053 .905 5.3 1,024 1,050 1,141 .494
.2 1.35 .00051 .904 5.3 4,200 4,450 4,676 .486
1.5 .00052 .904 5.3 2,188 2,316 2,440 .486
1.75 .00048 .904 5.3 1,086 1,118 1,198 .493
2 .00047 .904 5.3 672 686 740 .495
.4 1.35 .00051 .903 5.3 3,114 3,344 3,473 .482
1.5 .00048 .903 5.3 1,704 1,804 1,886 .486
1.75 .00047 .903 5.3 890 938 983 .487
2 .00054 .908 5.2 578 550 639 .512
.6 1.35 .00049 .903 5.3 3,464 3,728 3,849 .482
1.5 .00052 .903 5.3 1,934 2,118 2,167 .477
1.75 .00058 .910 5.2 1,064 1,010 1,188 .513
2 .00059 .908 5.2 718 706 807 .505
.8 1.35 .00051 .903 5.3 5,700 6,256 6,372 .477
1.5 .00048 .903 5.4 3,336 3,630 3,703 .479
1.75 .00049 .902 5.4 1,906 2,102 2,123 .476
2 .00049 .902 5.4 1,342 1,490 1,496 .474
.9 1.35 .00049 .903 5.3 10,676 11,658 11,879 .478
1.5 .00048 .902 5.4 6,310 6,902 7,007 .478
1.75 .00050 .902 5.4 3,668 4,092 4,098 .473
2 .00050 .902 5.4 2,624 2,948 2,934 .471
Discussion
We have expanded the optimization of two-stage genotyping in association studies from the candidate-gene scale to the genome-wide scale. The most immediate difference between these two problems is the per-genotype cost difference between the two stages. Currently, the cost of genotyping 500,000 SNPs using genome-wide fixed SNP arrays is widely quoted as approximately one thousand dollars per subject. This appears to be 15- to 20-fold lower, per genotype, than the costs associated with the best high-throughput genotyping methods for the specially designed assays to be used in stage II. Nevertheless, it is clear that despite the markedly higher costs per genotype in stage II, considerable cost savings can accrue when a second stage is used to follow up the most promising SNPs genotyped in stage I. The most promising SNPs may be defined as constituting approximately the one third of one percent most significant associations in the first stage (under a conservative assumption about the number of true positive associations to be discovered), when it is assumed that a fixed array of 500,000 SNPs is used in stage I and genotyping is not expanded in stage II (i.e., Tables 1-2).
The lack of any important dependency of the design parameters for two-stage studies upon the assumptions about p and ψ makes designing a two-stage study relatively simple, using Table 3 or Table 4 as a guide. In order to use standard sample size procedures intended for one-stage studies, such as Quanto (Gauderman 2002), to design a two-stage study, one simply computes the total sample size (n1 + n2) by inputting the desired significance level α, allele frequency p, and odds ratio ψ as usual, but replacing the desired power (1−β) with the nominal power (1−βN) of Table 3 or 4. The sample size allocated to stage I can then be calculated by applying the proportion n1/(n1 + n2). Since neither n1/(n1 + n2), nor α1, nor (1−βN) depends very much on p or ψ, this procedure gives appropriate approximate sample sizes for all p and ψ. The sheer number of genotypes being considered here also distinguishes our paper from earlier work on candidate genes, in which at most a few thousand markers were considered; this alone has some important effects on the resulting optimal design. For cost ratios or numbers of SNPs not shown in Tables 3 or 4, interpolation between the nearest values shown will give quite reasonable approximate solutions.
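As a worked example of this recipe, consider the following minimal sketch. The Quanto step is replaced here by the normal-approximation sample size formula of Chapter 2 (equation (2.1)) under the rare-disease, log-additive assumptions, and the design fractions are read from the t2/t1 = 20, m = 500,000 row of Table 3; the variable names are our own.

```python
from scipy.stats import norm

# Design inputs: p and psi as in Table 3; the stage I allocation and nominal
# power are read from the t2/t1 = 20, m = 500,000 row of Table 3.
p, psi, f = 0.2, 1.5, 0.5       # risk-allele frequency, per-allele OR, case fraction
m = 500_000
alpha = 0.05 / m                # Bonferroni-corrected overall level (2-sided)
power_nominal = 0.981           # (1 - beta_N) from Table 3
frac_stage1 = 0.331             # n1/(n1 + n2) from Table 3

# One-stage total sample size via the normal approximation (2.1):
q = p * psi / (1 - p + p * psi)          # risk-allele frequency in cases
mu = 2 * (q - p)                         # expected case-control dosage difference
VA = 2 * q * (1 - q) / f + 2 * p * (1 - p) / (1 - f)
n_total = (norm.ppf(1 - alpha / 2) + norm.ppf(power_nominal)) ** 2 * VA / mu ** 2

n1 = frac_stage1 * n_total               # genotyped on the full array in stage I
n2 = n_total - n1                        # genotyped on promoted SNPs in stage II
print(f"n_total = {n_total:.0f}, n1 = {n1:.0f}, n2 = {n2:.0f}")
```

A program such as Quanto can be substituted for the normal-approximation step without changing the allocation logic.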
This experiment also provides insight into why a two-stage design can be more cost-efficient than a one-stage design. First, a two-stage study only results in savings if it is carefully designed, i.e., by carefully selecting the stage I type I error rate, the sample size fraction allocated to stage I n1/(n1+n2), the total sample size, etc., as in Tables 3 and 4. With our procedure, a two-stage genotyping cost is returned for each pair (α1, 1−β1) for which both the power and the type I error are within .01% of the nominal level, and there are many two-stage designs that cost more than an equally powered one-stage design. Second, the per-genotype cost ratio t2/t1 is an important factor in determining the magnitude of the possible savings from the two-stage method. The per-subject genotyping cost ratio of stage II vs. stage I, α1(t2/t1), is the key to understanding this effect. When t2/t1 goes up, the cost of genotyping a subject in stage II is relatively higher, so to minimize cost the proportion of subjects allocated to stage I, n1/(n1+n2), is elevated (Figure 1) to allow more people to be genotyped in stage I. But the absolute per-subject cost of stage I is still higher than that of stage II, so allowing more subjects in stage I reduces the benefit of the two-stage design. Tables 3 and 4 already show that as t2/t1 goes up from 1 to 100, the benefit of an optimal two-stage design decreases, other assumptions being the same. When t2/t1 (the per-genotype ratio) is 1,000 or 10,000, the optimal two-stage design costs 68% or 82%, respectively, of an equally powerful one-stage design with m = 500,000; the corresponding cost of genotyping a subject in stage I is 11 and 9 times that of stage II, respectively. Therefore, as t2/t1 increases to some large enough value, which may never happen in practice, the optimal two-stage design would put all subjects into stage I, and we would get a simple one-stage design as a result.
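The cost bookkeeping behind this argument can be written out directly. In this minimal sketch the per-genotype costs are patterned on Table 5's assumptions ($.002 and $.035), while the sample sizes and thresholds are illustrative values in the spirit of Table 3, not quantities fixed by the text.

```python
# Illustrative per-genotype costs and design parameters (assumptions,
# patterned on Tables 3 and 5; not exact values from the text).
t1, t2 = 0.002, 0.035        # $/genotype: stage I fixed array vs. stage II assays
m = 500_000                  # SNPs genotyped in stage I
alpha1 = 0.0033              # fraction of SNPs promoted to stage II
n1, n2 = 1200, 2500          # stage I and stage II subjects (illustrative)
n_one = 3400                 # equally powered one-stage sample size (illustrative)

cost_stage1 = t1 * m * n1              # every stage I subject, every SNP
cost_stage2 = t2 * (alpha1 * m) * n2   # stage II subjects, promoted SNPs only
cost_fraction = (cost_stage1 + cost_stage2) / (t1 * m * n_one)
print(f"two-stage cost fraction: {cost_fraction:.2f}")

# Per-subject genotyping cost ratio, stage II vs. stage I: alpha1 * (t2/t1)
ratio = alpha1 * t2 / t1
print(f"per-subject cost ratio (stage II / stage I): {ratio:.4f}")
```

The printed ratio makes the point numerically: even with t2/t1 = 17.5, each stage II subject costs only a few percent of a stage I subject, because only the promoted fraction α1 of the m SNPs is genotyped.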
We have also described possible implications for study design of explicitly assuming that the SNPs genotyped in the first stage of the association study do not constitute the full set of variants to be evaluated in the scan. Using SNP prediction methods such as those described in other papers (Chapman, Cooper et al. 2003; Stram 2004), these unmeasured SNPs may be predicted on the basis of the measured SNPs and used for testing of associations. In such an analysis scheme, it is very tempting to consider increasing marker density in stage II in regions where a SNP has passed the stage I significance threshold, particularly since the fixed arrays do not yet fully tag all variation. We have assumed a study in which the first stage will capture most common SNPs in the HapMap with Rs² of approximately .6 or greater, and the second stage will capture all common HapMap SNPs, with Rs² of 1, in regions where a "hit" has been observed in stage I. Of course, this estimate of Rs² is at best only an average over the entire genome, and one would expect the first-stage power to be better in regions of high LD and worse in regions of low LD.
We have not directly asked whether this second design is cost-effective compared to a "pure" two-stage study in which no enrichment is performed. If sample size is truly unconstrained, then spending money on additional genotyping in stage II is not cost-efficient compared to adding more subjects to stage II. This follows because genotyping additional SNPs increases genotyping costs six-fold (in our calculation) in stage II but only increases the effective stage II sample size by a factor of 5/3; the same increase in genotyping costs could have increased the actual stage II sample size by a factor of 6. Nevertheless, while genotyping additional markers in stage II is not the optimal use of resources when the total number of cases and controls available is unconstrained, in most realistic settings the total sample size is constrained for a number of practical reasons. Thus, attempts to improve the prediction in stage II by more extensive genotyping around each SNP promoted from stage I may be a reasonable option, particularly since in large scans the total cost of the second part of the study is small compared to the first stage. In comparing Table 1 to Table 5 it is important to realize that the costs are so much larger in the latter table mainly because it is assumed there that the causal variant is not in perfect LD with a measured SNP, whereas in Table 1 the causal variant is assumed to be in perfect LD with a SNP on the array. A fairer comparison would be to inflate the costs in Table 1 by a factor of 5/3 to bring these assumptions more into line.
We used the conservative Bonferroni correction to control the experiment-wise type I error, assuming independence of all markers. In practice the independence assumption may be violated, so the cost reported here is for the worst-case scenario and should only be used as a guideline at the design stage. We argue, however, that the penalty paid for using the Bonferroni test in the main-effects analysis is not great. Even if the effective number of tests being performed (as assessed by permutation analysis) is only 250,000 rather than 500,000 in the main-effects analysis, Bonferroni adjustment gives only a modest reduction in power (2-3 percent for a study with 90 percent power, after adjusting for 500,000 vs. 250,000 tests). This penalty appears to be small enough that the Bonferroni test remains useful for design considerations. We believe that this is fully appropriate and corresponds to the view that each result from a whole-genome scan will ultimately need to be fully evaluated and confirmed before it can be generally accepted as real. For the case of highly correlated markers, Lin (Lin 2006) discussed a Monte Carlo procedure for assessing empirical power in two-stage studies, given sample sizes n1 and n2 and the critical value of stage I (analogous to c1 in our method). With his program, to design a two-stage study with minimum cost one would simulate the genotype data to be tested, choose sample sizes for both stages and the critical value of the first-stage test statistic, use his program to find the study power for these choices of parameters, and then search through possible sample sizes and stage I critical values for a study design that returns the minimum cost while also achieving the nominal power and type I error rate. This is helpful but could also be cumbersome at the design stage. Other approaches, which control the false discovery rate (Benjamini and Hochberg 1995), will be discussed in Chapter 2.
Skol et al. (Skol, Scott et al. 2006) considered situations in which the design effects μ of the causal gene differ between the two stages, which may happen in studies where the n2 subjects are sampled from a different population than the n1 subjects. They combined the test statistics from the n1 and the n2 subjects, using weights proportional to the square roots of n1/(n1 + n2) and n2/(n1 + n2), to do hypothesis testing in stage II. Our program can be modified to accommodate these situations of heterogeneity. However, when the design effects are believed to vary between stages, the optimal α1 and 1−β1 will depend on the risk allele frequency and odds ratio.
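The weighting scheme can be stated in two lines. In this minimal sketch, z1_new and z2_new denote the independent z-statistics computed from the n1 and the n2 subjects separately (our own notation, not from the cited paper's code):

```python
import math

def joint_statistic(z1_new, z2_new, n1, n2):
    """Combine independent stage I and stage II z-statistics with weights
    proportional to the square roots of the sample-size fractions; since
    w1^2 + w2^2 = 1, the combined statistic is N(0, 1) under the null
    whenever the two inputs are."""
    w1 = math.sqrt(n1 / (n1 + n2))
    w2 = math.sqrt(n2 / (n1 + n2))
    return w1 * z1_new + w2 * z2_new
```

Because the weights square to one, no further renormalization is needed before comparing the combined statistic to a standard normal threshold.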
Chapter 2
Follow-up issues with the methods in Chapter 1
One-stage sample size calculation: comparison with Quanto
The one-stage sample size calculated from equation (1.1),

n μ² = (Z_{1−α′} √V0 + Z_{1−β′} √VA)²,

is different from what is given by Quanto (Gauderman 2002), especially when the allele frequency p is different from 0.5. Denote the sample sizes given by equation (1.1) and by Quanto as n and nq, respectively. At p = 0.5, n ≈ nq; as |p − 0.5| increases, |n − nq| increases. One simplification made in (1.1) was the rare disease assumption in calculating μ and the variance terms (V0 and VA). This assumption is not the cause of the difference, and it is straightforward to incorporate the baseline risk (P0) in the calculation. Figure 2.1 plots the relative difference (1 − n/nq) × 100% as a function of the risk allele frequency p, for type I error α (one-sided) from 10^-7 to .05, power 1−β = .90, odds ratios ψ = 1.5 and 2, and P0 = .0001 and .01, after the rare disease assumption was corrected. Other assumptions are: a one-to-one unmatched case-control study, a logistic disease model and log-additive penetrance, so that ψ is the odds ratio for carrying each additional risk allele; these assumptions apply to all calculations in Chapter 2. The plots show that |n − nq| increases as α decreases and as the odds ratio ψ increases, independent of P0; when p < .5, n < nq, and when p > .5, n > nq. Note that the vertical scale is different for ψ = 1.5 and 2.
Figure 2.1 Percent difference between the sample sizes n and nq, calculated from (1.1) and Quanto, respectively. Power 1−β = .9. Dots above zero indicate n < nq and dots below zero indicate n > nq. P0 refers to the baseline disease risk; α is the type I error rate (one-sided).
To investigate the discrepancy between the two methods, simulations were performed to compare the empirical power of case-control studies with sample sizes n and nq, under the following assumptions: population prevalence of disease P0 = .0001, α (1-sided) = .05, power = .90, ψ = 1.2, 1.5 and 2, and risk allele frequency p in the range (0.1, 0.9). Under the same assumptions, the genotypes of n (nq) cases and controls were generated separately, based on the conditional distribution of genotypes given disease status. For each combination of ψ and p, one thousand replicates of case-control samples were generated. Each replicate was tested with likelihood-based trend tests (likelihood ratio, score and Wald tests) with 1 degree of freedom (df); with Pearson's approximate and Fisher's exact χ² tests for the 2×2 contingency table (testing for allelic association with disease); and with two-sample t-tests, assuming either equal variance between groups or unequal variance with Satterthwaite's¹ adjustment of the df, testing for a difference in average allele frequency between cases and controls. The empirical power was calculated as the proportion of the 1,000 replicates in which a statistically significant association was detected at the nominal significance level.
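A stripped-down version of this simulation might look like the following minimal sketch, which uses only a one-sided allelic z-test in place of the full battery of tests described above (the function name and the choice of test are our own):

```python
import numpy as np
from scipy.stats import norm

def empirical_power(n_cases, p, psi, alpha=0.05, n_rep=1000, seed=0):
    """Empirical power of a one-sided allelic z-test in a 1:1 case-control
    study. Allele counts are drawn given disease status under a rare-disease,
    log-additive model (so the case allele frequency is p*psi/(1-p+p*psi))."""
    rng = np.random.default_rng(seed)
    q = p * psi / (1 - p + p * psi)        # risk-allele frequency in cases
    z_crit = norm.ppf(1 - alpha)
    tot = 2 * n_cases                      # alleles per group
    hits = 0
    for _ in range(n_rep):
        a1 = rng.binomial(tot, q)          # risk alleles among cases
        a0 = rng.binomial(tot, p)          # risk alleles among controls
        pbar = (a1 + a0) / (2 * tot)       # pooled allele frequency
        z = (a1 / tot - a0 / tot) / np.sqrt(2 * pbar * (1 - pbar) / tot)
        hits += int(z > z_crit)
    return hits / n_rep

# e.g. empirical_power(290, p=0.2, psi=1.5) should land near 0.90
```

With 1,000 replicates, the Monte Carlo standard error of an estimated power near .90 is about .01, which is what limits how finely the two sample size methods can be distinguished.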
Figure 2.2 plots the empirical power against allele frequency for both methods, with four of the above tests. The three likelihood-based tests gave similar results, as did the two t-tests, so the only results shown are from the likelihood ratio test (LRT), Pearson's χ² and Fisher's exact χ² tests, and the t-test assuming equal variance. Fisher's exact test appears to be somewhat conservative; otherwise, the empirical power for nq (the left panel) appears to be correct in almost all other cases. In contrast, the empirical power for n (the right panel) is close to the nominal level only when the odds ratio is small (ψ = 1.2). When ψ = 1.5 or 2, the empirical power tends to deviate from .90 as |p − 0.5| gets larger. These results indicate that if there is a considerable difference between the minor and major allele frequencies and the genetic effect is relatively strong (ψ > 1.5 for carrying one allele), the sample size calculated from equation (1.1) is not appropriate if any of the above standard tests is used in the analysis. The accuracy of nq was also verified for the following two settings, with other parameters held the same: P0 = .0001 and α = 10^-7; and P0 = .01 and α = 10^-4.

¹ For a t-test comparing n1 and n2 independent observations, t = (x̄1 − x̄2)/√(w1 + w2), where w1 = s1²/n1, w2 = s2²/n2, and s1 and s2 are the sample standard deviations. Satterthwaite's adjusted d.f. = (w1 + w2)²/[w1²/(n1 − 1) + w2²/(n2 − 1)] (see the SAS online documentation).
Figure 2.2 Empirical power for sample sizes calculated from Quanto (panel a) and from equation (1.1) (panel b), where α (one-sided) = .05, P0 = .0001 and power = .9.
These results suggest that the method used in Quanto is more appropriate for calculating sample sizes of unmatched case-control studies, at least when the log-additive risk model is assumed to hold. Incorporating it into the two-stage design procedure should therefore produce more precise estimates.
Before doing so, one simple method is worth mentioning that gives sample sizes very close to Quanto's. The idea is to use a t-test based statistic, since the t-test performs as well as the LRT (Figure 2.2). For an observed sample of size n (cases plus controls) with case fraction f, let S′ denote the average difference in allele dosage between cases and controls, as in Chapter 1; then

t = S′ / √( s1²/(nf) + s0²/(n(1 − f)) ),

with E(S′)H0 = 0 and E(S′)HA = μ, where s1² and s0² are the sample estimates of the variances of the case and control dosages δi1 and δj0. When there are no observed data, the "expected" variances σd² and σ0² are used in place of s1² and s0², so that t = S′/σ, where

σ² = σd²/(nf) + σ0²/(n(1 − f))

is exactly the term VA/n in equation (1.1). When the sample size is large, the t distribution can be approximated by a standard normal distribution, and the formula for calculating n is

n μ²/VA = (Z_{1−α′} + Z_{1−β′})².    (2.1)
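Equations (1.1) and (2.1) are easy to evaluate side by side. The following is a minimal sketch under the rare-disease, log-additive assumptions of the text, with the dosage-variance expressions written out explicitly for a case fraction f rather than taken verbatim from Chapter 1:

```python
from scipy.stats import norm

def sample_sizes(p, psi, alpha, power, f=0.5):
    """Total sample sizes from equations (1.1) and (2.1) for the dosage-
    difference statistic S'; rare disease, log-additive model, one-sided
    alpha. Returns (n from (1.1), n from (2.1))."""
    q = p * psi / (1 - p + p * psi)                          # case allele freq
    mu = 2 * (q - p)                                         # E(S') under H_A
    V0 = 2 * p * (1 - p) / f + 2 * p * (1 - p) / (1 - f)     # n * Var(S'), H_0
    VA = 2 * q * (1 - q) / f + 2 * p * (1 - p) / (1 - f)     # n * Var(S'), H_A
    za, zb = norm.ppf(1 - alpha), norm.ppf(power)
    n_eq11 = (za * V0 ** 0.5 + zb * VA ** 0.5) ** 2 / mu ** 2   # equation (1.1)
    n_eq21 = (za + zb) ** 2 * VA / mu ** 2                      # equation (2.1)
    return n_eq11, n_eq21

# For p well below 0.5 (with psi > 1), V0 < VA, so (1.1) gives the smaller n;
# the ordering flips for large p, matching the sign pattern in Figure 2.1.
```

Running both formulas over a grid of p reproduces the qualitative behavior described below: the two agree near the frequency where V0/VA crosses 1 and diverge as α shrinks.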
The only difference between (2.1) and equation (1.1), n μ² = (Z_{1−α′} √V0 + Z_{1−β′} √VA)², is the factor √(V0/VA) attached to Z_{1−α} in equation (1.1).
Denote the sample sizes from (2.1) and (1.1) as nt and n, respectively. Figure 2.3 shows two examples comparing both of them to nq (Quanto). At most allele frequencies, nt is closer to nq than n is, except when p is around 0.5 (a similar trend for empirical power should follow). The percent difference between nt and nq becomes larger as ψ increases or α decreases, but remains ≤ 5% for ψ ≤ 2 and α ≥ 10^-7. A difference in empirical power caused by a 5% difference in sample size may not be detectable by simulation studies.
Figure 2.3 Percent difference between n, nt and nq, calculated from (1.1), (2.1) and Quanto, respectively, with P0 = .0001 and power = .9. The dashed line represents (1 − n/nq) × 100% and the solid line represents (1 − nt/nq) × 100%.
The closeness of nt to nq implies that in sample size calculations, the variance calculated under HA should be used as the variance of the test statistic under both H0 and HA, at least when the difference in allele frequencies is used as the test statistic. It also helps in understanding why |1 − n/nq| increases for larger odds ratios and more stringent α levels (see Figures 2.1 and 2.3). Consider nt to be an acceptably correct sample size (except at p around 0.5); then the only difference between n and the correct sample size is the factor √(V0/VA) attached to Z_{1−α} in equation (1.1). For simplicity, assume a rare disease and a one-to-one case-control study; then V0/VA can be expressed as 2(1 − p + pψ)² / [(1 − p + pψ)² + ψ]. When the odds ratio ψ is close to 1.0, V0/VA ≈ 1 and n ≈ nt, so n is roughly correct; as ψ (> 1) and p increase, |V0/VA − 1| increases, resulting in a greater difference between n and the correct sample size. For the same p and ψ, and hence constant V0/VA (with V0/VA ≠ 1.0), a more stringent α means a larger Z_{1−α} and more weight on the factor V0/VA. Therefore, whatever difference exists between n and the correct sample size is amplified by a stricter significance level α.
Sample size calculation for unmatched case-control studies has been discussed in a number of papers. One other available program, Power (Gail and Lubin 1990), is based on the score function. Let U0 denote the score function of one case-control unit, evaluated at H0. Power calculates E(U0) under HA and the variances σu0² and σu1² of U0 under H0 and HA; the number of cases is then

n = [ (Z_{1−α} σu0 + Z_{1−β} σu1) / E(U0) ]².

The advantage of this method is that one need not maximize the likelihood under HA. With 1−β = .90, Power produces slightly greater (about 2%) sample sizes than nq, for type I error from .00011 to .05, baseline risk P0 in (0.0001, 0.1) and ψ from 1.2 to 2. Power (V.3.0.0) only calculates sample sizes for type I error greater than 0.0001 (when α is set to 0.0001, the sample size from Power is about 60% greater than nq, which appears to be an error in Power).
Optimizing two-stage designs using Quanto’s method
Quanto calculates one-stage sample size for unmatched case-control data in the
following way (Gauderman 2002; Longmate 2002). Let Y and G denote the binary
disease status, the observed genotype and the observed causal allele dosage for all
50
subjects. With all notations the same as before, the prospective likelihood Pr(Y|G), from
n cases and nK controls is L( γ
0
, γ
1
) =
∏ ∏
=
+
=
+
+
+ +
nK
j
n
i
j i
i
e e
e
1 1
0 1 0 1 1 0
1 1 0
1
1
1
δ γ γ δ γ γ
δ γ γ
.
For an unmatched case-control study, the conditional distribution of genotypes
in cases and controls given disease status can be calculated with Bayes formula for
given baseline risk (thus γ
0
) and ψ (thus γ
1
). Given the “expected” distribution of
genotypes in cases and controls, one can treat them as real data and form LRT on H
0
:
γ
1
=0 vs. H
A
: γ
1
≠ 0. Denote the natural logarithm of maximum likelihood from one case-
control unit (one case and K controls) under H
0
and H
A
as and , respectively. Let Δ
= 2( – ), then n Δ is the non-centrality parameter of the statistic for n cases and nK
controls under H
A
(Gauderman 2002; Longmate 2002). The number of cases required
for a study with power 1- β when rejected at significance level α (2-sided) is n = (Z
1- α/2
+
Z
1- β
)
2
/ Δ, where Z
p
denotes the pth percentile of a standard normal distribution. This
means that if S is the test statistic based on one case-control unit (for example, the
normalized difference of allele frequency between case and control subjects), then nS,
the statistic based on n cases and nK controls, is approximately distributed as N(0, n)
and N(n
0
ˆ
l
2
1
χ
A
l
ˆ
A
l
ˆ
0
ˆ
l
Δ , n) under H
0
and H
A
, respectively. Note that the exact form of S is not
important.
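The expected-LRT calculation described above can be sketched as follows. This is a minimal re-implementation under the stated assumptions (HWE genotype frequencies, logistic log-additive penetrance), not Quanto itself; the function name and the use of a general-purpose optimizer are our own choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def cases_needed(p, psi, P0, alpha, power, K=1):
    """Number of cases via the expected-LRT noncentrality: fit the logistic
    model to the 'expected' genotype distributions given disease status and
    set n = (Z_{1-alpha/2} + Z_{1-beta})^2 / Delta."""
    g = np.array([0.0, 1.0, 2.0])
    pg = np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])   # HWE genotype freqs
    g0 = np.log(P0 / (1 - P0))                               # baseline log-odds
    risk = 1.0 / (1.0 + np.exp(-(g0 + np.log(psi) * g)))     # penetrances
    p_case = pg * risk / np.sum(pg * risk)                   # P(g | case)
    p_ctrl = pg * (1 - risk) / np.sum(pg * (1 - risk))       # P(g | control)

    def neg_ell(b0, b1):
        # minus the expected log-likelihood of one unit (1 case, K controls)
        eta = b0 + b1 * g
        case_term = np.sum(p_case * (-np.logaddexp(0.0, -eta)))
        ctrl_term = np.sum(p_ctrl * (-eta - np.logaddexp(0.0, -eta)))
        return -(case_term + K * ctrl_term)

    ellA = -minimize(lambda t: neg_ell(t[0], t[1]), [0.0, np.log(psi)]).fun
    ell0 = -minimize(lambda t: neg_ell(t[0], 0.0), [0.0]).fun
    delta = 2.0 * (ellA - ell0)              # per-unit noncentrality Delta
    return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2 / delta
```

Because retrospective sampling preserves the logistic form (only the intercept shifts), the two-parameter fit reproduces the "expected data" exactly, and Δ depends only on the genotype distributions, not on any simulated data.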
In a two-stage design, the stage I and stage II test statistics are based on n1 and (n1 + n2) case-control units, respectively. We obtain the distributions of n1S and (n1 + n2)S for stage I and stage II with the above method. For the combined study, the type I error rate is α = Pr(|n1S| > c1, |(n1 + n2)S| > c2)H0, which is approximated by 2 Pr(n1S > c1, (n1 + n2)S > c2)H0, and the power is 1−β = Pr(n1S > c1, (n1 + n2)S > c2)HA, where c1 and c2 are the critical values for the two stages satisfying α1 = Pr(|n1S| > c1)H0, 1−β1 = Pr(n1S > c1)HA, α2 = Pr(|(n1 + n2)S| > c2)H0 and 1−β2 = Pr((n1 + n2)S > c2)HA. As described in Chapter 1, we search over α1 and 1−β1 to find the optimal two-stage design for a given combination of per-genotype cost ratio and total number of markers. Note that with this method of calculating the one-stage sample size, the set of design parameters α1, 1−β1, α2 and 1−β2 is completely independent of the design effect Δ, which is a function of p and ψ. Specifically, Tables 2.1 and 2.2 give results analogous to Tables 3 and 4 in Chapter 1, but using the new method to calculate sample size. Because the parameters shown in Tables 3 and 4 do not involve absolute values of n1, n2 or total genotyping cost, the results did not change much, especially for realistic cost ratios (t2/t1 > 1).
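Because the combined statistic re-uses the stage I subjects, the standardized versions of n1S and (n1 + n2)S are jointly normal with correlation √(n1/(n1 + n2)), so the joint probabilities above reduce to bivariate normal rectangle probabilities. A minimal sketch of that calculation (our own helper function, not the procedure's actual code; z1 and z2 are the standardized critical values):

```python
import numpy as np
from scipy.stats import multivariate_normal

def two_stage_tail(n1, n2, z1, z2, delta):
    """Pr(Z1 > z1, Z2 > z2), where Z1 and Z2 are the standardized stage I and
    combined statistics sharing the n1 stage I subjects (correlation
    sqrt(n1/(n1+n2))). delta is the per-unit noncentrality; delta = 0 gives
    the corresponding upper-tail null probability."""
    n12 = n1 + n2
    rho = np.sqrt(n1 / n12)
    m1, m2 = np.sqrt(delta * n1), np.sqrt(delta * n12)   # means under H_A
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    # P(Z1 > z1, Z2 > z2) = P(-Z1 < m1 - z1, -Z2 < m2 - z2) by symmetry
    return mvn.cdf(np.array([m1 - z1, m2 - z2]))
```

Searching over (α1, 1−β1) then amounts to repeated calls of this function with delta = 0 (for the type I error constraint) and with the design's delta (for the power constraint).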
Table 2.1 The dependence of the optimal two-stage design parameters upon the cost ratio t2/t1 and the number of genotyped SNPs in stage I, m, for power 1−β = .90, f = 0.5, p = .2 and ψ = 1.5. The overall type I error rate is .05/m (2-sided).

                           -------------- At minimum cost --------------   ------- <10% above minimum cost# -------
t2/t1  m(×10^3)   CF*   (1-βN)‡  k1      α1†     1-β1   α2†(×10^-5)   (1-βN)‡  k1      α1†     1-β1   α2†(×10^-5)
1      1          .41   .974     .255    .095    .912   7.2           .923     .341    .131    .955   5.7
1      10         .36   .980     .220    .078    .910   .73           .928     .301    .101    .952   .57
1      30         .34   .983     .210    .066    .908   .25           .929     .281    .097    .951   .19
1      100        .32   .984     .195    .065    .908   .073          .930     .265    .090    .951   .057
1      500        .30   .982     .189    .058    .909   .014          .932     .247    .082    .951   .012
1      1,000      .29   .984     .183    .055    .908   .0072         .933     .242    .076    .950   .0058
5      1          .55   .968     .378    .0222   .911   7.8           .915     .511    .0301   .957   5.6
5      10         .48   .978     .319    .0170   .908   .81           .920     .437    .0234   .954   .57
5      30         .45   .980     .301    .0149   .907   .27           .922     .408    .0218   .953   .19
5      100        .42   .982     .282    .0141   .907   .080          .924     .388    .0181   .952   .058
5      500        .39   .984     .264    .0120   .907   .016          .926     .355    .0171   .951   .012
5      1,000      .38   .984     .258    .0114   .907   .0080         .927     .345    .0161   .951   .0058
10     1          .61   .967     .433    .0114   .910   8.2           .911     .583    .0166   .958   5.6
10     10         .53   .974     .367    .0091   .909   .83           .917     .497    .0123   .955   .57
10     30         .50   .975     .349    .0079   .909   .27           .919     .464    .0112   .954   .19
10     100        .47   .981     .319    .0071   .907   .085          .921     .437    .0096   .953   .058
10     500        .43   .983     .295    .0064   .906   .017          .924     .399    .0091   .952   .012
10     1,000      .42   .983     .289    .0061   .907   .0082         .925     .387    .0087   .952   .0058
15     1          .65   .966     .458    .0082   .910   8.5           .909     .625    .0120   .960   5.5
15     10         .56   .973     .395    .0059   .908   .85           .915     .534    .0081   .955   .57
15     30         .52   .975     .371    .0054   .908   .28           .918     .494    .0079   .954   .19
15     100        .49   .977     .347    .0049   .908   .084          .920     .463    .0070   .953   .058
15     500        .45   .981     .317    .0044   .907   .017          .922     .430    .0058   .952   .012
15     1,000      .44   .983     .305    .0041   .906   .0085         .923     .413    .0059   .952   .0058
20     1          .67   .961     .489    .0065   .911   8.3           .907     .656    .0093   .961   5.4
20     10         .58   .971     .417    .0046   .909   .85           .914     .558    .0063   .955   .57
20     30         .54   .977     .381    .0042   .907   .29           .916     .516    .0061   .955   .19
20     100        .51   .981     .352    .0038   .906   .088          .919     .487    .0051   .954   .058
20     500        .47   .982     .329    .0033   .906   .017          .921     .446    .0046   .953   .012
20     1,000      .46   .983     .320    .0030   .906   .0087         .922     .430    .0045   .952   .0058
100    1          .82   .941     .656    .0015   .915   7.9           .901     .844    .0025   .970   5.1
100    10         .69   .960     .527    .0012   .911   .87           .907     .688    .0017   .960   .55
100    30         .64   .968     .485    .0010   .909   .30           .910     .643    .0014   .957   .19
100    100        .60   .972     .449    .0009   .908   .091          .912     .595    .0013   .956   .056
100    500        .56   .978     .410    .0007   .906   .019          .915     .548    .0011   .955   .011
100    1,000      .54   .976     .402    .0007   .907   .0090         .916     .531    .0010   .954   .0057
* Cost fraction of an optimally designed two-stage study relative to a one-stage study. Both studies achieve power 1−β when tested at significance level α.
# Two-stage studies that cost 110% of the minimum expected cost but require the least sample size (n1+n2).
53
†
α
s
= Pr (|S
s
| > c
s
| ψ =1), i.e. significance level in stage s.
‡
Power of a one-stage study that uses the sample size of an optimal two-stage study and tests at α.
Table 2.2 The dependence of optimal two-stage design parameters upon the cost ratio t₂/t₁ and the number of genotyped SNPs in stage I, m, for power 1−β = .80, f = 0.5, p = .2 and ψ = 1.5. The overall type I error rate is .05/m (2-sided).

                 |              At minimum cost                 |        <10% above minimum cost#
t₂/t₁   m(×10³)  | CF*  (1−βN)‡  k₁    α₁†    1−β₁  α₂†(×10⁻⁵)  | (1−βN)‡  k₁    α₁†    1−β₁  α₂†(×10⁻⁵)
1       1        | .38  .945     .218  .084   .820  8.4         | .850     .307  .114   .892  6.1
1       10       | .33  .953     .190  .070   .818  .84         | .858     .263  .094   .888  .62
1       30       | .31  .955     .180  .065   .818  .28         | .862     .248  .085   .886  .21
1       100      | .29  .959     .170  .059   .816  .084        | .865     .232  .080   .885  .063
1       500      | .27  .961     .158  .055   .816  .017        | .869     .215  .072   .883  .013
1       1,000    | .26  .967     .152  .051   .813  .0085       | .871     .209  .069   .882  .0063
5       1        | .52  .934     .339  .0199  .818  9.3         | .832     .472  .0287  .896  6.0
5       10       | .45  .947     .286  .0160  .816  .94         | .843     .396  .0226  .891  .61
5       30       | .42  .955     .267  .0138  .813  .32         | .848     .373  .0194  .888  .21
5       100      | .39  .957     .250  .0130  .813  .095        | .852     .347  .0180  .887  .063
5       500      | .36  .962     .233  .0111  .812  .019        | .856     .318  .0163  .885  .013
5       1,000    | .35  .969     .219  .0108  .809  .0098       | .858     .309  .0153  .885  .0063
10      1        | .59  .926     .393  .0110  .818  9.6         | .824     .544  .0162  .899  5.9
10      10       | .50  .942     .332  .0084  .815  1.0         | .837     .463  .0110  .891  .61
10      30       | .47  .951     .307  .0074  .813  .33         | .841     .424  .0110  .890  .21
10      100      | .44  .955     .288  .0064  .812  .10         | .846     .396  .0098  .889  .0623
10      500      | .40  .959     .264  .0060  .812  .020        | .851     .363  .0086  .887  .0126
10      1,000    | .39  .959     .259  .0056  .812  .0097       | .853     .350  .0082  .886  .0063
15      1        | .62  .917     .430  .0079  .820  9.5         | .820     .596  .0107  .900  5.8
15      10       | .53  .937     .362  .0057  .816  .98         | .833     .496  .0080  .893  .61
15      30       | .50  .953     .325  .0050  .811  .35         | .838     .455  .0077  .891  .20
15      100      | .46  .951     .310  .0047  .813  .10         | .842     .426  .0067  .889  .062
15      500      | .43  .957     .286  .0040  .812  .020        | .847     .391  .0058  .887  .013
15      1,000    | .41  .953     .280  .0040  .814  .0097       | .849     .380  .0053  .887  .0063
20      1        | .65  .911     .459  .0061  .821  9.4         | .817     .626  .0086  .902  5.7
20      10       | .55  .939     .372  .0046  .814  1.0         | .830     .521  .0062  .894  .60
20      30       | .52  .946     .351  .0037  .813  .35         | .835     .482  .0057  .892  .20
20      100      | .48  .952     .325  .0034  .812  .10         | .840     .445  .0053  .890  .062
20      500      | .44  .959     .297  .0030  .811  .021        | .845     .412  .0043  .888  .013
20      1,000    | .43  .956     .291  .0029  .812  .010        | .847     .400  .0039  .887  .0063
100     1        | .80  .875     .618  .0016  .826  8.9         | .802     .825  .0023  .920  5.2
100     10       | .67  .921     .490  .0010  .815  1.1         | .815     .661  .0016  .901  .57
100     30       | .62  .937     .442  .0009  .811  .38         | .821     .609  .0014  .897  .20
100     100      | .58  .940     .415  .0008  .812  .11         | .826     .566  .0012  .894  .060
100     500      | .54  .950     .375  .0007  .810  .023        | .833     .523  .0009  .890  .012
100     1,000    | .52  .949     .363  .0007  .811  .011        | .835     .501  .0009  .890  .0062

* Cost fraction of an optimally designed two-stage study relative to a one-stage study. Both studies achieve power 1−β when tested at significance level α.
# Two-stage studies that cost 110% of the minimum expected cost but require the least sample size (n₁ + n₂).
† α_s = Pr(|S_s| > c_s | ψ = 1), i.e. the significance level in stage s.
‡ Power of a one-stage study that uses the sample size of an optimal two-stage study and tests at α.
Controlling for FDR
In genome-wide association studies, a few hundred thousand well-chosen SNPs
across the genome will be genotyped and tested for association with disease traits
(Hirschhorn and Daly 2005). This raises the need to control for false positive findings
that arise simply by chance. The traditional approach in multiple hypothesis testing to tackle this increased probability of making false discoveries has been to control the probability of making even one false discovery, i.e. to control the family-wise error rate (Benjamini, Drai et al. 2001). Methods for controlling the family-wise error rate impose an extremely small p-value threshold on a single-marker test to claim genome-wide
powerful enough to detect common variants with modest disease risk in complex
diseases, especially when the markers are correlated (Benjamini, Drai et al. 2001; van
den Oord and Sullivan 2003; Lin 2006). Some authors have proposed the use of
permutation tests to assess empirical genome-wide significance level (Hirschhorn and
Daly 2005). Various ways of improving efficiency in permutation tests have also been
developed (Dudbridge and Koeleman 2004; Nyholt 2004). However, most of these
methods are not easily applied to the design of a study, since their statistical properties
are dependent on empirical genome-wide LD structure.
Alternatively, under the assumption that there could be multiple true positive associations, one may consider controlling the False Discovery Rate (FDR) (Benjamini, Drai et al. 2001). The FDR is the expected proportion of false positive features among all of those called significant (Storey and Tibshirani 2003). When the number of tests is large, the FDR is often approximated by the proportion of false associations among all that are expected to be significant (Storey and Tibshirani 2003). The FDR is determined by three factors: the proportion of true causal variants among all the markers, p₀; the Proportion of True causal variants to be Discovered (PTD); and the significance level α for each single test. If we consider the PTD as the probability or power of detecting a single true causal gene (1−β) and p₀ as the probability of being a true variant, then the FDR is equivalent to the False Positive Report Probability (FPRP) (Wacholder, Chanock et al. 2004) and can be expressed as (1−p₀)α / [(1−p₀)α + p₀(1−β)]. It has been argued that controlling the FDR offers a good balance between false and true discoveries, and the approach has been applied in animal studies and in micro-array analyses where the expression levels of thousands of genes are measured to identify those that are differentially expressed (Benjamini, Drai et al. 2001; Storey and Tibshirani 2003; van den Oord and Sullivan 2003).
Van den Oord and Sullivan devised a procedure to find the minimum cost for a two-stage study while controlling the FDR (van den Oord and Sullivan 2003). However, they treated the second stage as a replicate set of the first, so that the statistics from the two stages are independent. Intuitively, pooling all subjects together at the end of the second stage provides better power to detect true causal genes, which was verified by a recent simulation study (Skol, Scott et al. 2006) and is implicit in the work on two-stage genotyping designs presented above. Moreover, as discussed in Chapter 1, with today's genotyping technology the per-genotype costs in the first and second stages are considerably different. In their original study and the corresponding software, van den Oord and Sullivan did not allow the per-genotype cost to vary between stages (van den Oord and Sullivan 2003).
Here we use a method similar to that in Chapter 1 to find the optimal design for given power and significance levels. We ask two questions: (1) when the Bonferroni correction is used to adjust for the overall type I error rate, does the number of true causal variants affect the choice of design parameters, namely α₁, 1−β₁, etc.? (2) how much can one save on genotyping cost and sample size with a relaxed overall type I error rate by applying the FDR method, compared with designs that strictly control the global false positive rate with the Bonferroni correction?
Assume all cases and controls are unrelated. For a two-stage design, the full set of m markers (SNPs) is considered; assume that there are D causal variants and that each one is in complete LD with one of the markers. We consider the power to detect the smallest design effect as the power of the study. Given a fixed total number of causal variants D, study power 1−β and the FDR, the α for a single-SNP test is calculated as FDR(1−β)p₀ / [(1−p₀)(1−FDR)], where p₀ = D/m. The expected number of false positive findings is calculated as FDR(1−β)D/(1−FDR), where we treat the proportion of true variants that one wishes to discover as if it were the power of the study, 1−β. For a study with m = 500,000, p₀ is typically small, so that α increases approximately linearly with the number of true causal genes D for given FDR and power. When the Bonferroni correction is used, the type I error rate is αB = 0.05/m.
The cost function of a two-stage design is (1+K){t₁n₁m + t₂n₂[(m − D)α₁ + D(1−β₁)]}, where t₁ and t₂ are the per-genotype costs in the two stages, n₁ is the number of cases in stage I, n₂ is the number of additional cases genotyped in stage II, and K is the control:case ratio. The method discussed in Section II is used to construct power functions and to find optimal and "nearly optimal" two-stage designs for specified power 1−β and type I error rate α. We assume m = 500,000, t₁ = $.018, t₂/t₁ = 20 and K = 1 throughout this chapter.
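The cost function can be transcribed directly; the sample sizes and stage-I rates below are illustrative values, not optimized design points.

```python
def two_stage_cost(n1, n2, m, D, alpha1, power1, t1, t2, K=1.0):
    """Expected genotyping cost of a two-stage design:
    (1+K) * { t1*n1*m + t2*n2*[(m - D)*alpha1 + D*power1] }.
    n1, n2 are case counts in stages I and II; only the markers passing
    stage I (null ones with probability alpha1, causal ones with
    probability power1 = 1-beta1) are genotyped in stage II; K is the
    control:case ratio."""
    stage1 = t1 * n1 * m
    stage2 = t2 * n2 * ((m - D) * alpha1 + D * power1)
    return (1 + K) * (stage1 + stage2)

# With n2 = 0 the expression reduces to the one-stage cost (1+K)*t1*n1*m.
cost = two_stage_cost(n1=1000, n2=1500, m=500_000, D=5,
                      alpha1=0.004, power1=0.91, t1=0.018, t2=0.36)
```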
Results and Discussion
The design parameters α₁, 1−β₁, α₂ and 1−β₂ that reach the minimum expected cost do not depend on μ (Wang, Thomas et al. 2006), but they do depend on the per-genotype cost ratio of stage II vs. stage I, the overall power 1−β and the significance level α, which is a function of FDR and D. For the purpose of our calculation, we fix the causal allele frequency p = .2 and the odds ratio ψ = 1.4, corresponding to a design effect μ of 0.14. This can be viewed as the effect least likely to be detected among all the causal ones. Because our goal is to see how much the design parameters α₁, 1−β₁, α₂ and 1−β₂ and the minimum expected cost change as a result of a more liberal type I error rate than that given by the Bonferroni correction, we only consider FDR, 1−β and p₀ values that are realistic and that also give α > .05/m. This generally occurs when D ≥ 2, 1−β ≥ .5 and FDR ≥ .05.
Figures 2.4 (a) and (b) plot, respectively, the relative reduction in minimum genotyping cost and sample size when D is in the range (5, 250) and FDR in (.05, .70), for optimal two-stage studies that achieve 90% and 50% power; in each case, total cost or sample size is compared to those when D = 1 and αB is applied (indicated by the letter "B" on the plot), within the same power level. As expected, genotyping cost and sample size both decrease as α increases with either FDR or D. Maximum savings occur at the extreme values of D = 250 and FDR = .70 when power is .50; at this point, it is also expected that up to 292 false positive findings will be picked up along with 125 true positive ones. The pattern and magnitude of the decrease in cost and sample size with FDR and D are similar for power levels .50 and .90, and are therefore most likely preserved for other power levels between .50 and .90. The magnitudes of the relative reductions in sample size and cost, however, do not agree with each other. For example, at 1−β = .90, FDR = .7 and D = 250, 17% savings in minimum genotyping cost ($1.5 M vs. $1.9 M) result from applying the FDR method, while the corresponding saving in sample size is about 57% (1184 vs. 2775). This is because the proportion of the sample allocated to the first stage, k₁ = n₁/(n₁ + n₂), is different in the two situations: k₁ is .33 when αB is applied but is .65 for FDR = .7 and D = 250. To understand this somewhat unexpected result, we turn to the per-subject cost ratio of stage II vs. stage I, which we denote CR; then CR = (t₂/t₁)[(1 − p₀)α₁ + (1 − β₁)p₀]. Comparing the optimal design parameters as the overall significance level α goes up, we note that the optimal α₁ increases for a relaxed α, while the optimal 1−β₁ stays about the same within the same power level; thus as D and p₀ = D/m increase, CR tends to increase. Because it is then relatively more expensive to genotype a subject in stage II, the procedure assigns more subjects to the first stage, i.e. increases k₁, in order to achieve the minimum cost. However, because the absolute per-subject cost in stage I is still much higher than that in stage II, putting more subjects in stage I causes the minimum cost to decrease less rapidly than the sample size does. This finding makes the FDR method especially attractive when sample size is a more stringent constraint than budget.
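The per-subject cost ratio CR can be computed directly; the α₁ and 1−β₁ values below illustrate the trend described (CR growing with p₀) and are not exact table entries.

```python
def per_subject_cost_ratio(t2_over_t1, alpha1, power1, p0):
    """CR = (t2/t1) * [(1 - p0)*alpha1 + power1*p0]: expected stage-II
    per-subject genotyping cost relative to stage I. Only markers passing
    stage I are genotyped in stage II, so a stage-II subject is cheap when
    alpha1 and p0 are both small."""
    return t2_over_t1 * ((1 - p0) * alpha1 + power1 * p0)

# CR grows with p0 = D/m (and with alpha1), which pushes the optimal
# design to place a larger fraction k1 of the subjects in stage I.
cr_small_D = per_subject_cost_ratio(20, 0.0033, 0.906, 1 / 500_000)
cr_large_D = per_subject_cost_ratio(20, 0.0074, 0.911, 250 / 500_000)
```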
Figure 2.4 Relative reduction in minimum two-stage genotyping cost and sample size of optimal two-
stage designs when FDR is applied, compared to when Bonferroni correction is used with the number of
true positives D =1. “B” stands for Bonferroni correction.
Tables 2.3 and 2.4 present the design parameters for optimal and nearly optimal two-stage studies when 90% and 50% of the true variants are expected to be discovered, i.e. 1−β = .9 and .5, respectively. The nearly optimal design is the smallest two-stage study among all those that cost less than 110% of the minimum, as in Chapter 1. In each case, we show the fraction of subjects optimally allocated to stage I, k₁; the significance level α₁ and power 1−β₁ for moving a marker from stage I to stage II; the cost fraction of the optimal two-stage study relative to an equally powered one-stage study; and the power (1−βN) of a one-stage study that uses the sample size of the optimal two-stage study. These are again intended to provide guidance for designing a study. With the FDR method, the overall α increases up to 1800-fold of αB, and both α₁ and 1−β₁ increase as α increases. Specifically, α₁ increases up to 3-fold compared with its value when αB is applied, whereas 1−β₁ changes less radically. In contrast, the significance level of the second stage, in which (n₁ + n₂) subjects are combined for hypothesis testing, is highly related to the overall α rate, consistent with the results in Chapter 1. Although using a more liberal significance level α results in cost and sample size savings, the benefit of using a two-stage design decreases as the overall α increases with FDR and D. Column 5 in Tables 2.3 and 2.4 displays the cost fraction of an optimal two-stage study relative to that of a one-stage study, both of which achieve the same power when tested at the same adjusted α level. For example, when 1−β = .90 and FDR = .5 (Table 2.3), for D = 5, α = 9.0 ×10⁻⁶, the cost fraction is 59%; for D = 100, α = 4.5 ×10⁻⁴, the cost fraction becomes 72%.
Table 2.3 Optimal and nearly optimal two-stage design parameters for m = 500,000, overall power 1−β = .90, t₂/t₁ = 20 and various D and FDR levels. The causal allele frequency is 0.2 and the odds ratio is 1.4 with log-additive penetrance.

                         |           Optimal two-stage design                       |           Nearly optimal design
D    FDR  α(×10⁻⁷)  FP²  | CF³   (1−βN)‡  n₁/(n₁+n₂)  α₁†(×10⁻³)  1−β₁  α₂†(×10⁻⁷)  | (1−βN)‡  n₁/(n₁+n₂)  α₁†(×10⁻³)  1−β₁  α₂†(×10⁻⁷)
1    B¹   1        .05   | .470  .982     .329        3.3         .906  1.7         | .921     .447        4.6         .953  1.2
5    B¹   1        .05   | .471  .982     .329        3.3         .906  1.7         | .921     .447        4.6         .953  1.2
5    .05  4.7      .24   | .506  .977     .359        3.8         .908  8.1         | .919     .483        5.3         .954  5.4
5    .1   10       .50   | .526  .977     .373        4.0         .908  17.2        | .917     .503        5.6         .954  11.5
5    .2   22.5     1.13  | .549  .975     .390        4.4         .908  39          | .916     .527        6.1         .955  25.6
5    .5   90       4.5   | .593  .970     .427        5.1         .909  154         | .913     .578        6.9         .956  101
10   B¹   1        .05   | .471  .982     .329        3.3         .906  1.7         | .921     .447        4.6         .953  1.2
10   .05  9.5      .47   | .525  .978     .368        4.0         .907  17          | .917     .505        5.3         .954  1.9
10   .1   20       1     | .546  .975     .389        4.2         .908  35          | .916     .523        6.1         .955  22.8
10   .2   45       2.25  | .571  .973     .407        4.7         .908  78          | .914     .551        6.5         .955  51
10   .5   180      9     | .619  .967     .451        5.5         .910  306         | .911     .605        7.6         .957  201
20   B¹   1        .05   | .471  .982     .329        3.3         .906  1.7         | .921     .449        4.4         .953  1.2
20   .05  18.9     .95   | .544  .977     .382        4.3         .907  33          | .916     .523        5.9         .955  21.6
20   .1   40       2     | .567  .974     .403        4.6         .908  70          | .915     .547        6.4         .955  45.4
20   .2   90       4.5   | .594  .971     .425        5.1         .909  156         | .913     .578        6.9         .956  101
20   .5   360      18    | .646  .965     .477        5.7         .91   617         | .909     .634        8.6         .958  398
50   B¹   1        .05   | .472  .982     .329        3.3         .906  1.7         | .921     .449        4.5         .953  1.2
50   .05  47.4     2.4   | .573  .974     .407        4.6         .908  82.8        | .914     .550        6.9         .956  53.5
50   .1   100      5     | .598  .971     .426        5.3         .909  172         | .912     .577        7.5         .957  112
50   .2   225      11.3  | .628  .966     .457        5.7         .910  383         | .910     .610        8.4         .958  250
50   .5   900      45    | .688  .963     .505        6.5         .909  1574        | .907     .677        1.4         .960  978
100  B¹   1        .05   | .473  .982     .329        3.3         .906  1.7         | .921     .447        4.7         .953  1.2
100  .05  94.8     4.7   | .598  .971     .429        5.0         .909  164         | .912     .577        7.4         .957  106
100  .1   200      10    | .625  .969     .448        5.6         .909  347         | .911     .612        7.6         .957  223
100  .2   450      22.5  | .658  .964     .482        6.2         .910  763         | .909     .647        8.8         .958  495
100  .5   1800     90    | .723  .958     .541        7.4         .911  3064        | .905     .724        11          .961  1929

¹ Using the Bonferroni correction to control the overall type I error rate α.
² Number of accepted false positive findings, calculated as FDR(1−β)D/(1−FDR).
³ Cost fraction of an optimally designed two-stage study relative to a one-stage study that achieves power 1−β when tested at significance level α.
‡ Power of a one-stage study that uses the sample size of an optimal two-stage study and tests at significance level α.
† α_s: significance level of stage s.
Table 2.4 Optimal and nearly optimal two-stage design parameters for m = 500,000, overall power 1−β = .50, t₂/t₁ = 20 and various D and FDR levels. The causal allele frequency is .2 and the odds ratio is 1.4 with log-additive penetrance.

                         |           Optimal two-stage design                       |           Nearly optimal design
D    FDR  α(×10⁻⁷)  FP²  | CF³   (1−βN)‡  n₁/(n₁+n₂)  α₁†(×10⁻³)  1−β₁  α₂†(×10⁻⁷)  | (1−βN)‡  n₁/(n₁+n₂)  α₁†(×10⁻³)  1−β₁  α₂†(×10⁻⁷)
1    B¹   1        .05   | .39   .846     .231        2.7         .521  3.3         | .615     .329        4.0         .635  1.6
5    B¹   1        .05   | .39   .846     .231        2.7         .521  3.3         | .615     .333        3.7         .634  1.6
5    .05  2.6      .13   | .41   .841     .243        2.9         .521  8.8         | .608     .349        4.3         .637  4.3
5    .1   5.6      .28   | .43   .829     .254        3.2         .522  18.3        | .600     .372        4.3         .638  8.9
5    .2   12.5     .63   | .45   .828     .266        3.3         .521  42.2        | .593     .391        4.8         .640  19.6
5    .5   50       2.5   | .49   .801     .298        3.8         .524  163         | .578     .438        5.4         .644  75.6
10   B¹   1        .05   | .39   .846     .231        2.7         .521  3.3         | .615     .333        3.7         .634  1.6
10   .05  5.3      .26   | .43   .831     .258        2.9         .521  17.5        | .601     .370        4.3         .638  8.4
10   .1   11.1     .56   | .45   .824     .266        3.3         .522  36.8        | .594     .389        4.7         .640  17.5
10   .2   25       1.25  | .47   .816     .280        3.6         .522  82.9        | .585     .412        5.2         .643  38.5
10   .5   100      5     | .52   .798     .308        4.3         .523  329         | .570     .463        6.0         .647  148
20   B¹   1        .05   | .39   .846     .231        2.7         .521  3.3         | .615     .333        3.7         .634  1.6
20   .05  10.5     .53   | .45   .817     .270        3.2         .523  34.2        | .594     .389        4.6         .640  16.6
20   .1   22.2     1.11  | .47   .817     .277        3.6         .522  73.6        | .587     .410        5.0         .642  34.4
20   .2   50       2.5   | .49   .806     .294        3.9         .523  165         | .578     .437        5.5         .645  75.4
20   .5   200      10    | .55   .774     .336        4.5         .526  631         | .561     .493        6.7         .652  287
50   B¹   1        .05   | .39   .846     .231        2.7         .521  3.3         | .615     .332        3.8         .635  1.6
50   .05  26.3     1.32  | .47   .812     .280        3.8         .523  85.9        | .584     .418        5.0         .643  4.4
50   .1   55.6     2.78  | .50   .803     .301        3.7         .523  184         | .576     .435        6.1         .648  83.1
50   .2   125      6.25  | .53   .796     .312        4.4         .523  414         | .566     .470        6.5         .651  181.6
50   .5   500      25    | .59   .748     .361        5.7         .53   1491        | .548     .535        8.2         .660  682
100  B¹   1        .05   | .39   .846     .231        2.7         .521  3.3         | .614     .334        3.7         .635  1.6
100  .05  52.6     2.63  | .50   .799     .296        4.1         .524  169         | .576     .444        5.3         .645  79.1
100  .1   111      5.56  | .53   .790     .311        4.5         .525  357         | .567     .469        6.1         .649  163
100  .2   250      12.5  | .56   .780     .335        4.7         .524  813         | .557     .512        6.4         .652  354
100  .5   1000     50    | .63   .724     .402        5.7         .531  2898        | .539     .587        8.2         .660  1321

¹ Using the Bonferroni correction to control the overall type I error rate α.
² Number of accepted false positive findings, calculated as FDR(1−β)D/(1−FDR).
³ Cost fraction of an optimally designed two-stage study relative to a one-stage study that achieves power 1−β when tested at significance level α.
‡ Power of a one-stage study that uses the sample size of an optimal two-stage study and tests at significance level α.
† α_s: significance level of stage s.
When the Bonferroni correction is applied at αB = 10⁻⁷, the optimal design parameters α₁, 1−β₁, α₂ and the total sample size n₁ + n₂ do not depend on D over the range (1, 250); but the minimum cost does increase slightly as D goes up, which is also evident in Figure 2.4 (indicated by the letter "B").
The FDR method would be useful if one believes that there are multiple true disease genes and that the penalty for making some false positive findings is not severe. Recently Kraft discussed similar calculations (Kraft 2006); our method differs from his in two respects. First, we consider the power to discover the SNP with the minimum design effect as the power of the study, i.e. the worst possible scenario, while Kraft explicitly requires the effects of all causal variants as input parameters, resulting in a different power level for each variant in calculating the FDR. Although this sounds more precise, it is not always possible to determine all the causal variants and the distribution of their effects beforehand. Second, he treated the second stage as a replicate set of the first stage, which, as discussed earlier, is not as powerful as pooling the two stages together.
Because markers in a genome-wide study are probably correlated, we recommend Lin's procedure (Lin 2006) to assess empirical significance in two-stage studies, with given sample sizes n₁ and n₂ and the critical value of stage I (analogous to c₁ in our method). A standard FDR procedure (see (Benjamini, Drai et al. 2001) for details) could then be carried out to determine whether a finding is noteworthy.
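A standard step-up FDR procedure of the kind described by Benjamini and colleagues can be sketched as follows; the p-values in the example are illustrative.

```python
def bh_reject(pvalues, q):
    """Benjamini-Hochberg step-up procedure: sort the m p-values, find the
    largest rank k with p_(k) <= k*q/m, and reject the k smallest p-values.
    Returns the indices (into the original list) of rejected hypotheses."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])

# With q = .05, the three smallest p-values below are declared noteworthy.
hits = bh_reject([0.001, 0.5, 0.02, 0.01], q=0.05)   # [0, 2, 3]
```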
Chapter 3
Assessing population substructure in the current MEC data
It is well known that case-control association studies may suffer from “spurious
associations” due to population stratification. When the population under study consists
of subpopulations that differ both in disease prevalence and marker allele frequencies
(Wacholder, Rothman et al. 2002), the proportions of cases and controls sampled from
each subpopulation will tend to differ, as will the allele or genotype frequencies
between cases and controls at any locus at which the subpopulations differ (Marchini,
Cardon et al. 2004). When this happens, subpopulation membership is a confounding
factor in the relationship between disease and the genotype of interest and if not
properly recognized, may result in either false positive association of a null marker or
masking of a true causal variant (Marchini, Cardon et al. 2004). Although the use of
family-based controls provides protection from population stratification, case-control
association studies remain an attractive design because of technical convenience and
potential cost efficiency (Devlin and Roeder 1999; Pritchard, Stephens et al. 2000).
Therefore it is still important to deal with stratification for case-control studies (Klein,
Zeiss et al. 2005; Clayton, Walker et al. 2005).
Several approaches have been developed to construct valid association tests in
the presence of population substructure. All these methods use the idea that if
population substructure affects allele frequency of the variant of interest, it should also
affect allele frequencies at other loci (Devlin and Roeder 1999; Pritchard, Stephens et
al. 2000; Satten, Flanders et al. 2001). Information from a number of random unlinked
markers can then be used to infer certain aspects of the underlying substructure.
Genomic control (GC) is the simplest method for controlling the inflation of type I error caused by population stratification. Devlin and Roeder showed (Devlin and Roeder 1999) that under certain conditions the deviation of the Armitage test statistic at a biallelic marker from the null χ² distribution due to substructure can be encapsulated by a multiplicative over-dispersion factor λ, which is estimated from the association statistics of a few dozen "null" markers under the assumption that λ is constant for all loci. Dividing the observed test statistic Y² at the candidate locus by λ approximately restores the validity of the null distribution. This method generally produces a valid type I error rate, but the variability in estimating λ becomes influential when the type I error rate is as small as would be required by a genome-wide study (Marchini, Cardon et al. 2004). More importantly, GC correction will produce wrong results if the assumption of a constant λ across all loci does not hold; for example, GC correction could be biased against true causal markers that do not show allele frequency differences among subpopulations.
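A minimal sketch of the GC correction, using Devlin and Roeder's estimator λ = (median null χ² statistic)/0.456, where 0.456 is approximately the median of a 1-df χ² distribution; the statistics below are illustrative.

```python
import numpy as np

def genomic_control(chisq_null, chisq_candidate):
    """Genomic control: estimate the over-dispersion factor lambda as the
    median of the null-marker chi-square statistics divided by 0.456, then
    deflate the candidate-locus statistic by lambda."""
    lam = np.median(chisq_null) / 0.456
    return chisq_candidate / lam, lam

# Null statistics with twice the expected dispersion give lambda ~ 2,
# halving the candidate statistic.
corrected, lam = genomic_control(np.array([0.2, 0.912, 4.0]), 10.0)
```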
With the Structured Association (SA) approaches, the characteristics of the underlying population are first inferred from random unlinked markers, and association tests are then performed conditional on the estimated population structure. Using the method proposed by (Pritchard, Stephens et al. 2000) as an example, the computer program STRUCTURE is first used to empirically determine the number of underlying populations, a choice that can sometimes be arbitrary (Pritchard, Stephens et al. 2000); the software STRAT is then applied to test the association between a candidate SNP and disease. SA approaches are known to be computationally demanding, and the amount of calculation involved can be prohibitive at the scale of genome-wide studies, even with available software. One appealing feature, however, is that estimated proportions of ancestry can be derived and used as covariates in regression analysis.
A mixed-model approach was claimed to be at least as efficient as both STRAT
and GC in modeling multi-level relatedness of quantitative traits (Yu, Pressoir et al.
2006), using a relative “kinship” coefficient matrix to model correlations among
subjects. As a natural extension from continuous trait to binary disease outcome, we
propose a Generalized Linear Mixed Model (GLMM) approach to test for association
between disease and candidate SNPs while controlling for population stratification in
case-control studies, because population stratification leads to reduction in
heterozygosity, similar to the effect of inbreeding.
While the GLMM approach was under testing, the method “EigenSTRAT” was
reported to be efficient in correcting for population stratification (Price, Patterson et al.
2006). In essence, the EigenSTRAT method calculates a normalized covariance matrix
among subjects and adjusts for the first few eigenvectors corresponding to the first few
largest eigenvalues in assessing the relationship between disease and genotype. The
most prominent feature of this method is that it works in situations where GC fails, i.e.
when the candidate SNP has a larger allele frequency difference across populations than
most random null markers.
In this chapter, the performance of the GLMM approach is first evaluated on simulated structured case-control data, and the best approaches are then applied to actual MEC data in an attempt to quantify the extent of population mixture within each ethnic group comprising the study.
Using GLMM to correct for population stratification
Methods
Let y denote the (n × 1) vector of the binary case-control status of n observations, and let μ = E(y). The logit link function g(t) = log(t/(1 − t)) models the relationship between μ and the set of explanatory variables by g(μ) = Xβ + γ. The vector β is of dimension (p × 1) for p fixed effects; in the following experiments p = 2, as only the candidate SNP is considered as an explanatory variable in the logistic regression. X is the (n × p) design matrix containing the observed covariate values; γ is the subject-specific random-effect vector (n × 1), with Var(γ) = σₖ²K representing background association among subjects, where K is the estimated "correlation" matrix. The SAS macro %glimmix was used to fit this model.
Four choices of the K matrix are available. Let p_il denote the allele frequency of subject i at a bi-allelic locus l, with p_il = 0, .5 or 1; then the estimated population allele frequency is p_l = (1/n_l) Σ_k p_kl, where the sum runs over the n_l subjects with non-missing genotypes at locus l. The choices for K are
1) K1: the "relative kinship" coefficient proposed by (Loiselle, Slork et al. 1995) and used in (Yu, Pressoir et al. 2006),

K1_ij = Σ_l [(p_il − p_l)(p_jl − p_l) + p_l(1 − p_l)/(2n_l − 1)] / Σ_l p_l(1 − p_l)

for subjects i and j. The term involving n_l is a sampling bias correction factor. K1_ij measures the probability of identity by state for two individuals i and j, relative to that for any two random subjects sampled from the population. It is different from the definition of the kinship coefficient estimated from pedigree structure (the probability that randomly selected alleles from relatives are identical by descent). A negative K1_ij is set to zero because it simply means that subjects i and j are less related than two random subjects.
2) K2: the "adjusted correlation" coefficient,

K2_ij = Σ_l (p_il − p_l)(p_jl − p_l) / √[Σ_l (p_il − p_l)² · Σ_l (p_jl − p_l)²],

computed over at least two loci. K2 is similar in form to Pearson's correlation coefficient. A negative K2_ij is set to zero.
3) K3: the proportion of alleles shared by two individuals across all m loci. Analogous to the distance matrix proposed by (Mountain and Cavalli-Sforza 1997), K3_ij = Σ_{l=1..m} s_ijl / (2m), where the similarity score s_ijl = 0, 1 or 2 if subjects i and j have 0, 1 or 2 alleles in common at marker locus l. For example, s_ijl = 1 for genotypes AA and Aa. K3_ij always lies between 0 and 1, but it does not take allele frequencies into account, i.e. it treats the sharing of rare alleles the same as the sharing of common alleles. For SNPs with a highly imbalanced minor/major allele frequency, K3_ij could be high just because two subjects share the more common allele.
4) K4: the covariance matrix of the "standardized" genotype score p′_il = (p_il − p_l)/√[p_l(1 − p_l)], as in (Price, Patterson et al. 2006):

K4_ij = Σ_l (p′_il − p̄′_i)(p′_jl − p̄′_j) / √[Σ_l (p′_il − p̄′_i)² · Σ_l (p′_jl − p̄′_j)²],

where p̄′_i and p̄′_j are the averages of the normalized scores p′_il and p′_jl across the m loci.
We first examined whether the GLMM approach corrected the type I error rate when no causal variant is present (H_0); if it did, the power of the GLMM approach in analyzing causal variants was estimated under H_A. In all the simulations, the GLMM approach was compared to the crude Armitage trend test and two other established approaches: the GC test, where the inflation factor λ is the median χ² test statistic divided by 0.456 (Devlin and Roeder 1999), and a modified EigenSTRAT method, where we derived the first few eigenvectors of the K matrices and treated them as covariates in logistic regression. The original EigenSTRAT method separately computes residuals from linear regressions of genotype and disease status on the first t eigenvectors, then calculates the squared correlation coefficient r² between the two residuals and examines (n − t − 1) r² against χ²_1, where n is the number of subjects. This is actually a test for significance of the partial correlation coefficient between genotype and the binary disease trait, adjusting for the first t eigenvectors in a linear regression model. We believe using the first t eigenvectors in logistic regression gives similar results, and an odds ratio estimate of the SNP effect is obtainable with logistic regression. We include comparisons between the original and the modified EigenSTRAT methods in our simulations under H_0 and H_A.
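The two comparison ingredients named above, the Armitage trend χ² for one SNP and the genomic-control inflation factor λ, can be sketched as follows. This is an illustrative pure-Python version (function names are ours; the dissertation's analyses were run in standard statistical software):

```python
import statistics

def armitage_trend(cases, controls):
    """Armitage trend chi-square (1 df) with scores 0, 1, 2.
    cases, controls: genotype counts [n0, n1, n2]."""
    s = [0, 1, 2]
    R = sum(cases)                                   # number of cases
    N = R + sum(controls)                            # total sample size
    t = [cases[g] + controls[g] for g in range(3)]   # combined genotype counts
    sr = sum(s[g] * cases[g] for g in range(3))
    st = sum(s[g] * t[g] for g in range(3))
    s2t = sum(s[g] ** 2 * t[g] for g in range(3))
    num = N * (N * sr - R * st) ** 2
    den = R * (N - R) * (N * s2t - st ** 2)
    return num / den

def gc_lambda(chi2_stats):
    """Genomic-control inflation factor: median chi-square statistic / 0.456,
    0.456 being the median of a 1-df chi-square distribution."""
    return statistics.median(chi2_stats) / 0.456
```

Under genomic control, each observed trend statistic would then be divided by λ before comparison to χ²_1.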
Simulations and Results
Model A. Mixture of three distinct populations. Unless otherwise noted,
genotypes of 200 cases and 200 controls were simulated as sampled from three distinct
populations. Disease prevalence was assumed to be two-fold higher in population 3 than
in the other two populations. The expected numbers of cases and controls from each
subpopulation are respectively, 50, 50, 100, and 67, 67, 66, or a multiple of these
numbers when other sample sizes were assumed. Model A contains three sub-models,
depending on the property of candidate SNPs.
Model A1. Analysis of non-causal random SNPs. Subpopulation allele frequencies of SNPs were assumed to follow the Balding-Nichols model (Balding and Nichols 1995): the allele frequency at locus l in population i, denoted q_il, was drawn from a beta distribution with parameters (1 − F_i) q_l / F_i and (1 − F_i)(1 − q_l) / F_i, where i = 1, 2, 3 and q_l, the ancestral allele frequency, is assumed to follow a uniform distribution on (0.1, 0.9). The subpopulation allele frequency q_il has mean q_l and variance F_i q_l (1 − q_l). F_i measures how different the ith subpopulation's allele frequencies are from their typical values, analogous to Wright's F_st (Nicholson, Smith et al. 2002; Marchini, Cardon et al. 2004). The F_i's were first set to .01, .02 and .04, respectively, for the three subpopulations to mimic a modest amount of structure, as when sampling from the same continent (Pritchard and Donnelly 2001). For a sample size of 400, sets of 200, 500, 1000, 10000 and 100000 SNPs were generated; for a sample size of 80, 1000 and 10000 random SNPs were generated. Next the F_i's were set to .23, .12 and .15 to simulate sampling from different continents, where the F_i's are estimates for Asian, African American and European American samples from another study (Marchini, Cardon et al. 2004). We simulated 504 and 8846 null SNPs with minor allele frequency greater than .01 for 800 subjects.
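The Balding-Nichols draw described above can be sketched in a few lines. A minimal illustration with assumed function name and default F_i values taken from the within-continent setting; Python's random.betavariate supplies the beta draws:

```python
import random

def simulate_subpop_freqs(m, F=(0.01, 0.02, 0.04), seed=1):
    """Draw m loci; for each, an ancestral q_l ~ U(0.1, 0.9) and one
    Balding-Nichols frequency q_il per subpopulation.
    Returns freqs[l][i] = q_il."""
    rng = random.Random(seed)
    freqs = []
    for _ in range(m):
        q = rng.uniform(0.1, 0.9)   # ancestral allele frequency q_l
        # Beta((1-F_i) q / F_i, (1-F_i)(1-q) / F_i): mean q, variance F_i q(1-q)
        row = [rng.betavariate((1 - Fi) * q / Fi, (1 - Fi) * (1 - q) / Fi)
               for Fi in F]
        freqs.append(row)
    return freqs
```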
Model A2. Highly-differentiated non-causal SNPs. The motivation behind this
model is that previous studies suggested substantial variation in allele frequency for a
SNP in the lactase (LCT) gene within the European populations, due to positive
selection (Bersaglieri, Sabeti et al. 2004). We simulated 1000 copies of a highly-
differentiated non-causal SNP locus, with subpopulation allele frequency .2, .5 and .8,
respectively, for sample sizes of 400 and 80. The K matrix and the over-dispersion
parameter λ of the GC approach were calculated using 1000 or 10000 random null
SNPs simulated under model A1 and without the highly-differentiated marker.
Model A3. Random causal SNPs with relative risk RR = 1.2 per allele (log-additive model). As in (Price, Patterson et al. 2006), genotypes 0, 1 and 2 in controls from the ith population at locus l were assigned with probabilities (1 − q_il)², 2 q_il (1 − q_il) and q_il², respectively, as in model A1. Genotypes of cases from the ith population were assigned according to the relative probabilities (1 − q_il)², 2 q_il (1 − q_il) RR and q_il² RR², scaled by (1 − q_il)² + 2 q_il (1 − q_il) RR + q_il² RR², where RR stands for the relative risk. One thousand copies of causal random SNPs were simulated in this way for sample sizes of 400 and 80. The background "correlation" K matrices were calculated using 10,000 random null markers simulated in model A1, without the causal SNP.
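The case-genotype assignment above can be sketched as follows. A minimal illustration (the helper name and the use of Python's random.choices are ours; the original analyses were not run this way):

```python
import random

def sample_case_genotype(q, RR, rng):
    """Draw one case genotype (0, 1 or 2 risk alleles) at a locus with
    allele frequency q, under a per-allele relative risk RR."""
    # relative weights (1-q)^2, 2q(1-q)RR, q^2 RR^2, then normalize
    w = [(1 - q) ** 2, 2 * q * (1 - q) * RR, (q * RR) ** 2]
    tot = sum(w)
    probs = [x / tot for x in w]
    return rng.choices([0, 1, 2], weights=probs)[0]
```

Control genotypes would use the same code with RR = 1, recovering Hardy-Weinberg proportions.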
Results of Model A. Figure 3.1 shows the distribution of the interpersonal correlation estimated by K1 (the off-diagonal elements), based on 200, 1000, 10000 and 100000 simulated SNPs under model A1, using 400 subjects with the within-continent structure. The estimates of "correlation" among individuals within each population became more and more accurate as more markers were included in the analysis. The medians of K1_ij from 100000 markers are .014, .019 and .017, respectively, for the three subpopulations. The median correlation among individuals from different populations is close to zero regardless of the number of SNPs involved, partly because negative K1_ij's were set to zero. This pattern holds for all the K matrices and also for the diagonal elements, the "correlation" of a subject with himself.
Figure 3.1 Plot of interpersonal correlation K1_ij based on 200, 1000, 10000 and 100000 markers (from left to right) simulated under model A1 for within-continent structure. K1_ij's (i ≠ j) were grouped into six categories along the horizontal axis, depending on whether the two subjects come from the same population (within-population) or different populations (between-population). The box shows the 25% and 75% percentiles and the most extreme values are the 1% and 99% percentiles.
Figure 3.2 shows the distribution of the interpersonal correlations for the four K matrices, based on model A1 with 10000 random SNPs and 400 subjects. The distribution of the K2_ij's is similar to that of the K1_ij's. The K3_ij's are much higher than the other K_ij's, and the range of the K4_ij's is wider than the others. This plot indicates that the K1 and K2 estimates represent the underlying relationship better than K3 and K4, in that the medians of the correlation for individuals from different subpopulations are close to zero, and that the K1 and K3 estimates have less variability than K2 and K4.
Figure 3.2 Distribution of K_ij's (i ≠ j) for the four K matrices, based on 10000 markers simulated under model A1 for within-continent structure. K_ij's (i ≠ j) were grouped into six categories as in Figure 3.1.
Under model A1, the median difference in allele frequency across subpopulations is about .07 (1st, 3rd quartile: .03, .11) for the within-continent and .20 (1st, 3rd quartile: .09, .33) for the across-continent simulations. The estimated over-dispersion factor λ for the within-continent structure (the first set of F_i's) is in the range of 1.20 to 1.35 for sample size 400 and 1.45 to 1.56 for sample size 800, consistent with theoretical results that λ generally increases with sample size. For the across-continent simulations (the second set of F_i's), λ shows more variability: λ is 2.6 and 4.3 for 504 and 8846 SNPs, respectively.
For the within-continent simulations with 400 subjects and 1000 markers, we checked the resulting p-values of association from both the original and our modified EigenSTRAT analyses. The difference in individual p-values between the two methods is less than .003, .004, .005 and .02 when adjusting for the first one, two, three and ten eigenvectors, respectively. The resulting rejection rates are almost identical between the two approaches (Table 3.1).
Table 3.1 Rejection rates of the original and the modified EigenSTRAT methods, n=400, 1000 markers.

Significance level   k*   Original   Modified
α = .01              1    .008       .008
                     2    .007       .007
                     3    .007       .006
                     10   .007       .007
α = .05              1    .057       .055
                     2    .039       .039
                     3    .041       .041
                     10   .045       .047

* k denotes the number of eigenvectors used in the regression models.
A GLMM analysis was performed on the simulations of 200, 500 and 1000 SNPs, but not on those with more SNPs, due to the length of time required for convergence (3 to 5 min. per SNP on a current machine when n = 400, and 30 min. to one hour when n = 800). The GLMM results for the within- and across-continent structures are similar.
Under model A1, the GLMM approach worked as well as GC and EigenSTRAT
in correcting for over-dispersion of the test statistics. Figure 3.3 displays an example of
the empirical distribution of the p-values of 1000 random SNPs for the within-continent
simulation (model A1). The p-values of association from the GLMM and EigenSTRAT
approaches are not significantly different from standard uniform distribution (p > .4,
Kolmogorov-Smirnov test, or KS test). GC adjustment was not perfect in this example
(p = .04, KS test) but good enough in general. The empirical type I error for the
unadjusted, GC-adjusted, population-adjusted, EigenSTRAT-adjusted (with three
eigenvectors from K1) and GLMM tests is .067, .041, .044, .042 and .042, respectively
at significance level .05. Analyses of 200, 500 markers or on 800 subjects yielded
similar comparisons.
Figure 3.3 Cumulative distribution of p-values of association for 1000 random SNPs simulated under
model A1 (within-continent), using different analyzing methods. The first 3 eigenvectors of K1 were
used in logistic regression for the EigenSTRAT method (denoted as “eigen(K1)” in plot).
Figure 3.4 Cumulative distribution of p-values for SNPs simulated under model A1, using GLMM with
K1, K2, K3 and K4 as correlation matrix. Results from K1, K2 and K4 are overlapped in most places.
Despite the observed differences in the distributions of the various K_ij's (shown in Figure 3.2), the four K matrices (especially K1, K2 and K4) gave similar results when applied in the GLMM model (Figure 3.4). When applied in the EigenSTRAT approach, the choice of K also did not make much difference (see below) provided enough markers and eigenvectors were included in the model; adjusting for 2 to 10 eigenvectors gave similar results under model A1.
Figure 3.5 includes a few Q-Q plots of p-values obtained from EigenSTRAT
analyses on 800 subjects, using the first one or two eigenvectors of the four correlation
matrices. For the within-continent structure, the four curves in each graph
corresponding to using K1, K2, K3 and K4 are almost indistinguishable except when
the first eigenvector is used with K calculated on 1000 SNPs. When the number of
markers is large (e.g. 10000), the first eigenvector is good enough to correct for over-
dispersion; specifically, the first eigenvector is significantly related to population
membership (R² > .75, p < .0001 for all choices of K). For the across-continent
structure, adjusting for the first eigenvector is clearly not enough even with 8846 SNPs;
but results from applying two eigenvectors are perfect (p-values > .43, KS test). This
shows that the number of eigenvectors required depends on the underlying structure ⎯
for the within-continent structure, population 1 and 2 with F values .01 and .02 are not
very different and treating them as one group does not have much adverse effect, but
will for the across-continent structure. This point was not obvious from the original
paper (Price, Patterson et al. 2006), since they only performed simulations of two
subpopulations ⎯ not surprisingly adjusting for the first eigenvector gave satisfactory
results in all cases. Of course in reality the across-continent structure can usually be
determined by the ethnicity variable.
Figure 3.5 Cumulative distribution of p-values from EigenSTRAT analyses, adjusting for the first one or two eigenvectors of the four K matrices. Rows of panels correspond to the within-continent structure with 1000 SNPs, the within-continent structure with 10000 SNPs, and the across-continent structure with 8846 SNPs; columns correspond to adjusting for one or two eigenvectors. The four curves in each graph, corresponding to the four matrices, overlap with each other in most situations.
It is not straightforward to determine the number of eigenvectors to include in the regression models without any information on substructure. In principal components analysis, one common way to choose a subset of principal components is a scree graph, in which the ordered eigenvalues are plotted against their ranks. An "elbow" point is determined such that the slopes are "steep" to the left and "not steep" to the right of this point; the principal components to the left of the "elbow" are then retained (Jollife 2002). This criterion, unfortunately, is not entirely helpful here. Take the analyses of 1000 SNPs for example: adjusting for the first two eigenvectors is good enough for all choices of the correlation matrices, but the scree plot (Figure 3.6) suggested the first four eigenvalues (eigenvectors) are important when using K1, K2 and K4, although it did indicate that two eigenvectors are enough for K3.
Figure 3.6 Scree plots of the ordered eigenvalues of the four K matrices (one panel each for K1, K2, K3 and K4), n=800, m=1000, model A1.
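The eigenvalues and eigenvectors underlying the scree plots and the EigenSTRAT adjustment can be extracted with a standard eigendecomposition; as a minimal self-contained sketch, a power iteration recovers the leading eigenpair of a symmetric K (the function name is ours, and real analyses would use a library routine rather than this loop):

```python
import math

def leading_eigen(K, iters=200):
    """Power iteration for the largest eigenvalue and its eigenvector
    of a symmetric matrix K with a dominant positive eigenvalue."""
    n = len(K)
    v = [1.0 / math.sqrt(n)] * n          # start from a normalized vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(K[r][c] * v[c] for c in range(n)) for r in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]          # renormalize each step
        lam = norm                         # converges to the top eigenvalue
    return lam, v
```

Repeating on the matrix after deflation (subtracting lam · v vᵀ) would yield subsequent eigenvectors for the scree graph.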
In situations of highly-differentiated alleles (model A2), only the EigenSTRAT method was efficient in correcting for stratification, and it worked only when the correlation matrix was based on a large enough number of random markers (e.g. 10000 SNPs). With sample size 400 and at significance level .05, the estimated type I error rate was .49 from the crude Armitage test, .03 from the population-adjusted test, .32 from the GC test (λ estimated from 1000 random SNPs simulated under model A1), .08 from EigenSTRAT using 1000 random SNPs to calculate K, and .04 from EigenSTRAT using 10000 random SNPs. The empirical type I error rate of GLMM (K1) based on 1000 random SNPs is .25. GLMM failed to converge after one hour on the first 20 SNPs with K matrices calculated from 10000 random SNPs, due to infinite likelihood; changing initial values or increasing the maximum number of iterations did not help, so analyses of more SNPs were aborted. Results from analyzing 800 subjects are consistent with the above.
For the analyses of 1000 random true causal SNPs (model A3), results from 400 or 800 subjects are similar. We compared the original and modified EigenSTRAT methods for n=400. The individual differences in the resulting p-values between the two methods are less than .003, .004, .003 and .005 when adjusting for the first one, two, three and ten eigenvectors, respectively. There is less than a .007 difference in empirical power at significance level .05. These results were replicated with data simulated under an odds ratio of 1.5 per allele.
Table 3.2 summarizes the parameter estimates and the empirical study power for the unadjusted analysis, the GC test, the population-membership-adjusted analysis and EigenSTRAT analyses with different K matrices and different numbers of eigenvectors. Empirical power was evaluated at significance level .01. The modified EigenSTRAT approach produced acceptable odds ratio estimates regardless of the K matrix used. The EigenSTRAT results are as good as if the population membership were known; the number (≤ 10) of eigenvectors used has little effect on the power, but the bias in the log odds ratio estimates increased slightly when 10 eigenvectors were used. GC in this case has lower power than the EigenSTRAT method. GLMM failed to converge for K1 after 2 hours on the first 20 SNPs, so no further testing was performed.
Table 3.2 Parameter estimates and empirical power based on analyses of 1000 causal SNPs with β = log(1.2) = .182 per allele (model A3). Sample size is 80. Power is evaluated at α level 0.01.

                                        β̂ (SD)        Power
Unadjusted test                         .181 (.183)   .224
Population-adjusted test                .188 (.173)   .210
GC test                                 ---           .160
EigenSTRAT
  K matrix   No. of eigenvectors
  K1, K2     1                          .186 (.171)   .207
             2                          .188 (.173)   .206
             10                         .192 (.179)   .206
  K3         1                          .185 (.172)   .205
             2                          .186 (.175)   .206
             10                         .191 (.177)   .210
  K4         1                          .185 (.170)   .205
             2                          .187 (.172)   .201
             10                         .192 (.177)   .205
Model B. An admixed population in which most individuals have a large
proportion of ancestry from a second, divergent subpopulation. Here 800 subjects (400
cases + 400 controls) from an African American population with 20% European
admixture were simulated. The ancestry z from Europeans is assumed to follow a beta
distribution with parameters (1, 4), with mean 0.2 and the central 80% lying between
0.03 and 0.45 (Pritchard, Stephens et al. 2000).
SNP allele frequencies from the CEPH and Yoruba panels of the HapMap
project (release 21a, 2007) were considered as the true allele frequencies of the
European and African populations. BioMart, a web-based application (see the HapMap
website) was used to select SNP loci with minor allele frequency ≥ .1. The resulting
1636 SNPs used in simulation are at least 1Mb apart and are common to both ethnic
groups.
Model B1. Simulation of random non-causal SNPs. For controls, the European
ancestry z was sampled from the beta distribution described above. For cases, rejection
sampling was used to simulate individual z as follows: assume the disease risk changes
linearly with z, i.e. is proportional to (1+7z)/8, draw z from the beta prior and accept it
with probability (1+7z)/8; keep drawing until a z value is accepted (Pritchard, Stephens
et al. 2000). At each of the 1636 loci, two alleles were drawn independently with
probability z from Europeans and 1 – z from African-Americans to construct genotypes.
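The rejection sampler described above for case ancestry can be sketched directly (an illustrative helper name; the draw follows the text exactly: z from Beta(1, 4), accepted with probability (1 + 7z)/8):

```python
import random

def sample_case_ancestry(rng):
    """Draw a case's European ancestry z: Beta(1, 4) prior, accepted with
    probability (1 + 7z)/8, repeating until acceptance."""
    while True:
        z = rng.betavariate(1, 4)
        if rng.random() < (1 + 7 * z) / 8:
            return z
```

The accepted draws are tilted toward higher z, reflecting the assumed linear increase of disease risk with European ancestry; control ancestry is drawn from the untilted Beta(1, 4).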
Model B2. Simulation of a causal SNP with relative risk RR = 1.25 per allele. The European ancestry z for a control subject was simulated from Beta(1, 4) as in model B1. For cases, rejection sampling was again implemented to simulate z. Using pW and pB to denote the ancestral allele frequency at the candidate locus for the European and African-American populations, and p_m for the allele frequency of the mixture population, we have p_m = z pW + (1 − z) pB. Assuming the baseline risk is proportional to (1 + 7z)/8 and the causal SNP increases disease risk by RR per allele, then

Pr(D = 1 | z) ∝ [(1 + 7z)/8] Σ_G RR_G Pr(G) = p_{D|z},

where RR_G is the relative risk associated with genotype G. Repeatedly drawing z from the beta prior and accepting it with probability p_{D|z} gives the proper distribution of z in cases. Genotypes at the 1636 loci were then simulated as background information to calculate K and the λ of the GC test, with the genotype at each locus determined by two random draws from the mixture distribution, as in model B1, for the given z. At the candidate locus, genotypes of controls were sampled from Binom(2, p_m); genotypes of cases were assigned according to (1 − p_m)², 2 p_m (1 − p_m) RR and p_m² RR², scaled by (1 − p_m)² + 2 p_m (1 − p_m) RR + p_m² RR². The values of (pW, pB) were set to (.2, .4), (.4, .5) and (.6, .2). One thousand copies of the candidate SNPs were generated.
Figure 3.7 Distribution of differences in allele frequency between the CEPH and Yoruba samples from
the HapMap project, for the selected 1636 markers (see text for details).
Results from model B. Figure 3.7 displays the distribution of the differences in allele frequency for the selected markers between the two ethnic groups; 77% of the SNPs show less than a 0.3 difference in allele frequency. The median proportion of European ancestry differs significantly between cases and controls (p < 0.05, KW test). Further independent simulations showed that λ for the GC test is in the range of 1.2 to 1.6.
Analysis of random non-causal SNPs (model B1) showed that all the approaches
were able to correct for stratification, although the GLMM approach was again slow to
converge ⎯ requiring 30 minutes to 1 hour for analyzing one SNP. Figure 3.8 displays
the Q-Q plots of p-values from applying GC, GLMM (with K3) and EigenSTRAT
using the first eigenvector of K3. The rejection rate at significance level .05 for the
analyses in Figure 3.8 is .042 for GC and .045 for GLMM and EigenSTRAT. It is also
observed that different K matrices did not make a lot of difference for the GLMM
approach. As for the EigenSTRAT method, the results from adjusting for 2 to 10 eigenvectors are similar for all K matrices. However, when only the first eigenvector was used in the regression, the Q-Q curve resulting from K4 is clearly much worse than the others (Figure 3.9).
Figure 3.8 Cumulative p-value distribution from the unadjusted, GC-adjusted, GLMM and EigenSTRAT
tests, based on model B1. GLMM (using K3) and EigenSTRAT (adjusting for the first eigenvector of K3)
results overlap with each other.
Figure 3.9 Cumulative distribution of p-values from EigenSTRAT analyses of model B1, adjusting for
the first eigenvector of K1, K3 and K4 (K2 results similar to K1).
Parameter estimates and empirical power from analyzing model B2 are shown in Table 3.3. We used the population-adjusted test, in which the true European ancestry z was treated as a covariate in a regression model, as the standard against which to measure the performance of the other tests. EigenSTRAT produced rejection rates and parameter estimates closest to those from the population-adjusted test when two to ten eigenvectors were included. When only one eigenvector was used, there is noticeable bias in the parameter estimates and the rejection rates are not ideal for all the matrices, with K4 the worst. The GLMM approach gave unbiased estimates of the log odds ratio only when the ancestral allele frequencies are similar (e.g. 0.4 and 0.5), and even then the power of the GLMM analysis is lower than that of the EigenSTRAT method when enough eigenvectors were included. When the ancestral allele frequencies differ by more than 0.2, the GLMM correction is inadequate, though better than GC.
Table 3.3 Parameter estimates and empirical power from analyses of 1000 copies of causal SNPs with β = log(1.25) = .223 per allele (model B2), for different combinations of ancestral allele frequencies, n = 800, at α level 0.05.

(pW, pB)     Test                        β̂      SD(β̂)   Rejection rate
(0.2, 0.4)   Unadjusted                 .178    .102     .400
             Population-adjusted        .224    .104     .565
             GLMM                       .191    .104     .438
             GC                         --      --       .334
             EigenSTRAT
               k = 1    K1              .214    .104     .534
                        K4              .179    .102     .401
               k = 2    K1              .218    .104     .544
                        K4              .219    .104     .547
               k = 10   K1              .220    .106     .542
                        K4              .222    .104     .554
(0.4, 0.5)   Unadjusted                 .199    .102     .509
             Population-adjusted        .223    .104     .578
             GLMM                       .222    .115     .548
             GC                         --      --       .340
             EigenSTRAT
               k = 1    K1              .218    .104     .564
                        K4              .199    .102     .507
               k = 2    K1              .222    .104     .576
                        K4              .222    .104     .575
               k = 10   K1              .224    .105     .580
                        K4              .224    .105     .574
(0.6, 0.2)   Unadjusted                 .328    .105     .865
             Population-adjusted        .223    .111     .522
             GLMM                       .275    .111     .704
             GC                         --      --       .775
             EigenSTRAT
               k = 1    K1              .237    .110     .585
                        K4              .334    .105     .874
               k = 2    K1              .226    .111     .535
                        K4              .226    .111     .536
               k = 10   K1              .228    .112     .535
                        K4              .227    .112     .535

Note: EigenSTRAT results from K2 and K3 are similar to K1. k stands for the number of eigenvectors included in the model.
Discussion
Whether population stratification is a severe threat to the validity of case-control studies has been under debate. Although simulations have shown that serious bias is unlikely to occur in well-designed studies, new methods targeting this problem continue to emerge (Devlin and Roeder 1999; Pritchard and Donnelly 2001; Price, Patterson et al. 2006). Here a generalized linear mixed model approach was proposed, motivated by the observation that population substructure results in loss of heterozygosity, similar to the effect of population inbreeding, and by the fact that modeling the level of relatedness was shown to generate valid test results in data with a certain amount of inbreeding (Yu, Pressoir et al. 2006). We investigated whether GLMM is as good as two other simple approaches, the GC and EigenSTRAT methods, at correcting for excessive false positives in case-control studies due to population substructure. Analyses of simulated case-control data sampled from discrete subpopulations and from an admixed population were performed, and the results are not promising. Under both the null and alternative hypotheses, GLMM can only correct for a modest effect of population stratification, and when it does, EigenSTRAT also gives accurate empirical type I error rates and is more powerful at identifying causal variation. Therefore GLMM does not offer any advantage over the existing methods.
A bigger problem with GLMM is that it sometimes did not converge. Even when it did converge, it was very slow using %glimmix. For 400 subjects, it usually took 2-5 minutes to analyze one locus on a Pentium 4 3.6 GHz machine with 1.0 GB memory; for 800 subjects, it took 30 minutes to 1 hour per locus. Possible reasons include: 1) the covariance matrix among subjects is large (n × n), and the memory required is a function of the dimension of the random-effect covariance matrix; 2) missing or imbalanced data, or an insufficient sample size to estimate the random-effect covariance matrix, suggested by the fact that the data set with 800 subjects converged more often than the one with 400 subjects (for model B); 3) many iterations are needed to reach convergence, which could indicate some degree of model misspecification.
In contrast, the EigenSTRAT method offers a substantial computational advantage over the GLMM approach. It corrects for stratification of highly-differentiated alleles across subpopulations where GC and GLMM are inefficient. It can be more powerful than the GC and STRAT approaches in analyzing true causal SNPs (Price, Patterson et al. 2006). Further, when modified to use logistic regression, it produced acceptable parameter estimates over a range of choices for the number of eigenvectors and for the K matrix.
One extreme situation relating the GLMM approach to the EigenSTRAT approach is the following: if we can obtain an estimate of the correlation matrix K that is block-diagonal, with each block corresponding to a subpopulation and the entries within a block all equal, then using GLMM is roughly equivalent to using the EigenSTRAT method. In this case, the eigenvectors of the K matrix are simply indicators of subpopulation membership. In our choice of K1, the interpersonal correlations of subjects within a population converge to fixed values and the correlations of subjects from different populations go to zero as the number of markers increases, but the diagonal elements typically converge to much higher values. Using the medians from different parts of the matrix as an estimate, an estimate of K1 is displayed below; the resulting matrix is clearly far from the ideal one.
Ideal K = blockdiag(J, J, J), where each J is a square block of 1's (one block per subpopulation) and all entries outside the blocks are 0;

K1 ≈ blockdiag(B1, B2, B3), where B1 has diagonal entries 36.3 and off-diagonal entries 1, B2 has diagonal entries 36.1 and off-diagonal entries 1.32, B3 has diagonal entries 34.7 and off-diagonal entries 1.22, and all entries outside the blocks are 0.
In summary, our results indicate the GLMM approach² is not a good solution for correcting population stratification in case-control studies, especially when simpler methods exist.

² Another way of using the mixed model approach to assess the relationship between genotypes and disease would be to use proc mixed (SAS) to test for differences in genotypes between cases and controls. We tried this on a few examples and stopped due to the same computational difficulty.
Population Stratification in the MEC study
In the MEC, genotypes of 1394 SNPs on 60 genes are available for the breast
cancer study containing 4 ethnic groups ⎯ Blacks, Whites, Japanese and Latinos. We
looked for evidence of population stratification separately within each ethnic group.
“Good” SNPs were selected within each group according to these criteria: at least 80%
of controls genotyped, the p-value from HWE test in controls not less than .01 and
minor allele frequency in controls ≥ 0.005. All analyses were based on these “good”
SNPs, unless otherwise noted. The resulting number of SNPs was 1219, 1036, 1190 and
1125 for the data of Blacks, Japanese, Latinos and Whites, respectively.
First, the association of breast cancer with each SNP in turn was assessed by logistic regression, and the resulting distribution of p-values was examined for deviation from the expected uniform distribution within each ethnic group. The Q-Q plots (Figure 3.10) and empirical quantiles (Table 3.4) both show that there are too many positive associations in the data for Whites, and a slight tendency toward more positives than expected is seen in the data for Latinos, while the p-values for African Americans and Japanese appear to be uniformly distributed. Adjusting for age or using a permutation test (10000 replicates) to obtain empirical p-values did not alter the result. Next, GC and EigenSTRAT were applied to the Whites and Latinos data to correct for over-dispersion, where λ for the GC test was estimated as the median of the test statistics divided by .456. Table 3.4 shows that the GC adjustment roughly restored the uniform distribution, but EigenSTRAT did not, for up to 10 eigenvectors in the model.
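The uniformity checks reported here rest on the one-sample Kolmogorov-Smirnov test against U(0, 1); as a minimal sketch, the D statistic can be computed directly (the function name is illustrative, and a statistical package would also supply the corresponding p-value):

```python
def ks_uniform(pvalues):
    """One-sample Kolmogorov-Smirnov D statistic of pvalues against U(0, 1):
    the largest gap between the empirical CDF and the uniform CDF."""
    x = sorted(pvalues)
    n = len(x)
    d = 0.0
    for i, v in enumerate(x):
        # ECDF jumps from i/n to (i+1)/n at v; check both sides of the jump
        d = max(d, (i + 1) / n - v, v - i / n)
    return d
```

A small D (relative to its null distribution) supports uniformity; an excess of small p-values inflates D.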
Figure 3.10 Cumulative distribution of the p-values of association between breast cancer and all SNPs,
separately for each ethnic group.
Table 3.4 Empirical quantiles of p-values obtained from tests of association between breast cancer and various subsets of SNPs, separately within each ethnic group.

                                          Empirical quantiles
Group      n (ca/co)       No. of SNPs    α = .01   α = .05   p-value^f
Blacks     879 (426/453)   1219^a         .018      .057      .58
                           240^e          --        .075      .25
Japanese   1119 (554/565)  1036^a         .013      .051      .13
                           232^e          --        .047      .83
Latinos    839 (420/419)   1190^a         .012      .054      .07
                           1190^b         .008      .040      .75
                           1190^c         .011      .054      .06
                           1190^d         .013      .057      .01
                           240^e          --        .038      .97
Whites     1100 (532/568)  1125^a         .015      .065      10^-5
                           1125^b         .006      .036      .13
                           1125^c         .016      .067      10^-5
                           1125^d         .013      .063      .001
                           233^e          --        .056      .55

a Unadjusted test on all SNPs.
b GC test on all SNPs with λ = median/.456.
c, d EigenSTRAT test on all SNPs, adjusting for one (c) or ten (d) eigenvectors.
e Unadjusted test on SNPs at least 20kb apart.
f P-value for deviation from the uniform distribution, from the KS test.
These results are somewhat unexpected, because African Americans, rather than Whites, are usually considered an admixed population in which stratification has an effect. Note, however, that these SNPs are tagSNPs, so a certain degree of correlation among them may exist, which may distort the sampling distribution of the tests. To reduce this effect, a subset of SNPs was selected such that the inter-marker distance is at least 20kb, resulting in about 230 SNPs for each ethnic group. The Q-Q plots (Figure 3.11) and the empirical quantiles (Table 3.4) from this subset of SNPs showed diminished over-dispersion in the Whites and Latinos data, compared to the analyses of all SNPs. Therefore, the pattern observed in Figure 3.10 for Whites could be owing to inter-marker correlation rather than population structure. One other point noted is that the p-value distribution in the African American data changed slightly in the analyses of the subset of SNPs (a little more over-dispersion is observed in Figure 3.11 than in Figure 3.10), but more data are required to examine this issue directly.
Figure 3.11 Cumulative distribution of the p-values of association between breast cancer and the subset of
SNPs with inter-marker distance 20kb, separately for each ethnic group.
To address indirectly whether hidden structure exists within each ethnic
group in the breast cancer data, another experiment was performed. Specifically, we
computed the pairwise correlations among physically unlinked SNPs, since population
structure leads to nonzero correlation among otherwise unlinked SNPs (Nordborg and
Tavare 2002). Another way to view this experiment is to assume that disease status is
completely determined by genotype. Each SNP is then considered an observed
disease outcome, and the association of this SNP (the "disease") with all other SNPs is
assessed via the correlation coefficients between SNPs. The resulting distribution of
p-values was then compared to the standard uniform distribution. One drawback of this
approach is that the correlation coefficients are not independent of each other, because
different pairs can share one of the two SNPs. To assess the effect of this sharing on the
distribution of the correlation coefficients, we simulated 100 and 60 independent SNPs
with allele frequencies uniformly distributed on (0.1, 0.9). No deviation from uniformity
(by KS test) was observed for the p-values over all pairs of unlinked SNPs, with
empirical quantiles .009 and .046 for the 100-SNP simulation and .008 and .054 for the
60-SNP simulation at α levels .01 and .05, respectively. The sharing of SNPs across
pairs therefore does not appreciably distort the null distribution, and the experiment is
reasonable.
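The null experiment just described can be reproduced in outline. The sketch below is illustrative rather than the original code: the sample size, the seed, and the large-sample normal approximation to the correlation test are all assumptions.

```python
# A hedged sketch of the check described above: simulate independent SNPs,
# correlate every pair, and verify that the p-values are close to uniform
# even though different pairs share SNPs. Sample sizes are illustrative.
import math
import random

random.seed(1)
n_subjects, n_snps = 500, 60

# Genotypes coded 0/1/2 with allele frequencies uniform on (0.1, 0.9).
geno = []
for _ in range(n_snps):
    f = random.uniform(0.1, 0.9)
    geno.append([(random.random() < f) + (random.random() < f)
                 for _ in range(n_subjects)])

# Center each SNP once so pairwise Pearson correlations are cheap.
cent, ssq = [], []
for g in geno:
    mu = sum(g) / n_subjects
    c = [v - mu for v in g]
    cent.append(c)
    ssq.append(sum(x * x for x in c))

# Two-sided p-value from the large-sample approximation
# z = r * sqrt(n - 1) ~ N(0, 1) under independence.
pvals = []
for i in range(n_snps):
    for j in range(i + 1, n_snps):
        dot = sum(a * b for a, b in zip(cent[i], cent[j]))
        r = dot / math.sqrt(ssq[i] * ssq[j])
        z = r * math.sqrt(n_subjects - 1)
        pvals.append(math.erfc(abs(z) / math.sqrt(2)))

# Empirical quantile at alpha = .05; ~.05 is expected absent structure.
prop_05 = sum(p < 0.05 for p in pvals) / len(pvals)
```

With independent SNPs, `prop_05` lands near the nominal .05 even though pairs share SNPs, which is the point of the check.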
To select physically unlinked SNPs, 55 genes were first chosen such that SNPs
from different genes are at least 2Mb apart. One SNP was then randomly selected from
each gene, and the pairwise correlations among the 55 SNPs were assessed by linear
regression. The empirical quantiles of the p-values at α levels .01 and .05 were averaged
over sixty sets of 55 randomly selected SNPs. Whenever too many false positives were
observed, adjustments using a GC-like test and EigenSTRAT were performed. For the
GC-like test, a parameter λ′ was estimated as the median of the Wald test statistics divided
by .456; individual Wald statistics were then divided by λ′ to obtain the adjusted p-values.
For EigenSTRAT, the K4 matrix was calculated from all SNPs, and the first 5 or 10
eigenvectors were used in the linear regression. The above experiments were performed
within each ethnic group, separately for unaffected subjects and for all subjects.
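The GC-like adjustment above amounts to a few lines of arithmetic. In this sketch the over-dispersed null statistics are simulated (the 1.5 inflation factor is an arbitrary choice), and the chi-square(1) survival function is written via erfc; only the λ′ = median/.456 rule comes from the text.

```python
# Sketch of the GC-like correction: deflate 1-df Wald statistics by
# lambda' = median / .456, then convert back to p-values.
import math
import random
import statistics

random.seed(2)
# Simulated null chi-square(1) statistics, inflated by a factor of 1.5
# to mimic the over-dispersion produced by population structure.
stats = [1.5 * random.gauss(0.0, 1.0) ** 2 for _ in range(2000)]

lam = statistics.median(stats) / 0.456        # the GC-like inflation factor
adjusted = [s / lam for s in stats]

# Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
pvals = [math.erfc(math.sqrt(s / 2.0)) for s in adjusted]
prop_05 = sum(p < 0.05 for p in pvals) / len(pvals)
```

The estimated λ′ comes out close to the true inflation factor of 1.5, and after deflation roughly 5% of the p-values fall below .05, as expected under the null.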
Table 3.5 displays the empirical quantiles of the pairwise correlations at
significance levels .01 and .05. The Japanese data did not show evidence of structure,
so no adjustments were made for this group. In the other groups, first, the over-dispersion
is always more severe when analyzing all subjects than when analyzing controls only,
which could be due to more structure among cases or to a certain degree of association
between breast cancer and some SNPs. Second, examining the results for unaffected
subjects only, there is suggestive evidence of population substructure in the Blacks,
Latinos and Whites data: the empirical quantiles at α levels .01 and .05 are .015 to .017
and .062 to .065, respectively. Third, the GC-like approach was able to correct the
excess of false positives in all cases. In contrast, EigenSTRAT was not equally effective,
even with 10 eigenvectors in the model.
Table 3.5 Empirical quantiles of p-values for pairwise correlations among 55 unlinked SNPs, averaged
over 60 random SNP sets, for controls only and for all subjects, within each ethnic group.

                      ----------- α = .01 -----------    ----------- α = .05 -----------
                                       EigenSTRAT                         EigenSTRAT
                      Unadjusted  GC   k = 5   k = 10    Unadjusted  GC   k = 5   k = 10
  Blacks    ca+co        .026    .014   .022    .022        .085    .056   .076    .075
            co           .017    .012   .014    .014        .065    .052   .059    .058
  Japanese  ca+co        .012     ---    ---     ---        .056     ---    ---     ---
            co           .011     ---    ---     ---        .053     ---    ---     ---
  Latinos   ca+co        .022    .013   .021    .020        .076    .055   .076    .074
            co           .016    .012   .014    .014        .062    .051   .058    .059
  Whites    ca+co        .016    .012   .015    .015        .065    .052   .061    .062
            co           .015    .012   .014    .015        .062    .054   .059    .060

Note: “ca” means breast cancer cases and “co” controls; “k” refers to the number of eigenvectors in the
model.
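For intuition about the eigenvector adjustment discussed above, the following sketch simulates genotypes from two subpopulations, forms a subject-by-subject covariance matrix from standardized SNPs, and extracts its leading eigenvector by power iteration. This is not the K4 computation used in the analyses; the data, frequencies and matrix construction are all assumptions.

```python
# Illustrative sketch: the leading eigenvector of a SNP-derived covariance
# matrix among subjects tracks subpopulation membership, which is why
# EigenSTRAT-style methods use eigenvectors as regression covariates.
import math
import random

random.seed(3)
n_per, m = 50, 40                  # 50 subjects per subpopulation, 40 SNPs
f1, f2 = 0.2, 0.7                  # strongly differentiated allele frequencies

# Genotypes 0/1/2: first 50 subjects from population 1, next 50 from population 2.
geno = []
for _ in range(m):
    col = [(random.random() < f1) + (random.random() < f1) for _ in range(n_per)]
    col += [(random.random() < f2) + (random.random() < f2) for _ in range(n_per)]
    geno.append(col)
n = 2 * n_per

# Standardize each SNP, then build the subject-by-subject covariance matrix.
Z = []
for col in geno:
    mu = sum(col) / n
    sd = math.sqrt(sum((g - mu) ** 2 for g in col) / n)
    Z.append([(g - mu) / sd for g in col])
K = [[sum(Z[s][i] * Z[s][j] for s in range(m)) / m for j in range(n)]
     for i in range(n)]

def top_eigenvector(M, iters=300):
    """Leading eigenvector by power iteration (sign is arbitrary)."""
    v = [random.gauss(0.0, 1.0) for _ in range(len(M))]
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in M]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

v = top_eigenvector(K)
```

The entries of the leading eigenvector separate the two subpopulations in sign, so including such eigenvectors as covariates can absorb structure; with only a modest number of markers, however, the eigenvectors are estimated noisily, consistent with EigenSTRAT's weak performance here.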
Another, even more extreme, experiment was to select unlinked “highly-differentiated”
(HD) SNPs and examine their pairwise correlation coefficients. HD SNPs
are SNPs that show a large difference d in allele frequency between two groups. The
minimum difference d was set to .3, except for the comparison between Whites and
Latinos, where the minimum was set to .2 to obtain enough markers. The six sets of
HD SNPs, contrasting each pair of ethnic groups, were further refined to ensure that
the inter-marker distance was at least 2Mb. The same analyses were performed as
before, and the empirical quantiles at α level .05 are summarized in Table 3.6.
Table 3.6 Empirical quantiles of p-values for pairwise correlations among highly-differentiated (HD)
unlinked SNPs, for controls only and for all subjects, within each ethnic group.

                                     Quantiles at α = .05
                                                       EigenSTRAT        Stepwise
     Subjects   HD set   No. of SNPs  Unadjusted  GC   k = 5  k = 10     selection
  B  ca+co      B vs W       35          .47     .01    .37     .35         .12
     ca+co      B vs J       47          .17     .07    .14     .14
     ca+co      B vs L       34          .43     .01    .33     .32
     co         B vs W       35          .28     .02    .20     .16         .11
     co         B vs J       47          .25     .04    .19     .16
     co         B vs L       34          .11     .06    .10     .09
  J  ca+co      J vs W       34          .08     .07    .08     .09
     ca+co      J vs L       19          .15     .05    .13     .13
     ca+co      B vs J       48          .07     .05    .07     .07
     co         J vs W       34          .08     .06    .08     .07
     co         J vs L       19          .13     .05    .14     .12
     co         B vs J       48          .05     .05    .06     .07
  L  ca+co      L vs W       16          .42     .01    .41     .38         .23
     ca+co      J vs L       18          .11     .05    .12     .12
     ca+co      B vs L       34          .16     .10    .12     .12
     co         L vs W       16          .25     .03    .08     .09         .13
     co         J vs L       18          .05     .05    .05     .06
     co         B vs L       34          .07     .06    .07     .07
  W  ca+co      B vs W       32          .16     .03    .10     .11
     ca+co      L vs W       16          .18     .07    .17     .17
     ca+co      J vs W       34          .07     .06    .07     .06
     co         B vs W       32          .13     .04    .07     .07
     co         L vs W       16          .13     .03    .09     .09
     co         J vs W       34          .06     .06    .06     .06

Note: “ca” means breast cancer cases and “co” controls; “k” refers to the number of eigenvectors in the
model. B, J, L and W stand for Blacks, Japanese, Latinos and Whites.
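The HD-SNP selection rule above (allele-frequency difference of at least d, inter-marker distance of at least 2Mb) can be sketched as a simple filter. The tuple layout (position, frequency in group 1, frequency in group 2) and the toy data are assumptions.

```python
# Hedged sketch of the HD-SNP filter: keep SNPs whose allele-frequency
# difference between two groups is at least d, then thin the survivors
# to an inter-marker distance of at least 2 Mb.
def select_hd_snps(snps, d=0.3, min_gap=2_000_000):
    """snps: (position, f1, f2) tuples sorted by genomic position."""
    hd = [s for s in snps if abs(s[1] - s[2]) >= d]
    kept, last = [], None
    for pos, f1, f2 in hd:
        if last is None or pos - last >= min_gap:
            kept.append((pos, f1, f2))
            last = pos
    return kept

# Toy example: two SNPs pass both filters; one passes the frequency filter
# but is dropped for being within 2 Mb of the previous kept SNP.
snps = [(100_000, 0.10, 0.55),    # HD, kept
        (900_000, 0.40, 0.45),    # not HD
        (1_500_000, 0.20, 0.60),  # HD but < 2 Mb from the first, dropped
        (4_000_000, 0.70, 0.30)]  # HD, kept
hd = select_hd_snps(snps)
```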
Again, there is more over-dispersion when analyzing all subjects than when
analyzing only controls. In general, the inter-marker correlations among HD SNPs show
more severe over-dispersion than those among a random set of unlinked SNPs, even in
the Japanese controls. This could be because HD SNPs reflect the same ancestral
differences and are therefore more related to each other than expected. One example is
the analysis of the African Americans vs. Whites HD set: in both the White and African
American controls, the empirical quantiles at significance level .05 are greater than .13.
Presumably this SNP set contains ancestry-informative markers distinguishing the
African American and European populations. The GC method showed variability in
correcting for over-dispersion, giving sometimes too much correction and sometimes too
little. The EigenSTRAT correction had some effect, but overall the empirical quantiles
from the corrected p-values remain large. Results from this experiment indicate that the
p-value distribution will be distorted if disease is caused by an HD SNP and only HD
SNPs are entered into the analysis; at this time, neither GC nor EigenSTRAT is a reliable
choice for adjustment, at least with the SNPs currently available in the breast cancer data.
Recently, Setakis et al. (Setakis, Stirnadel et al. 2006) reported that stepwise
logistic regression corrects for population stratification. It is thus of interest to examine
this procedure’s performance in the pairwise-correlation analyses of HD SNPs. Two HD
sets (those with entries in the stepwise-selection column of Table 3.6), Blacks vs. Whites
in the Blacks and Latinos vs. Whites in the Latinos, were reanalyzed with this method.
The procedure was as follows: for an HD SNP A, select a subset of the remaining HD
SNPs based on the Akaike information criterion (AIC) in a linear regression model (R
function step()); extract the p-values of the correlations between SNP A and the other
SNPs, with the selected SNPs as covariates in the regression model; repeat the above
steps for all HD SNPs; and obtain the quantiles at α level .05. The resulting empirical
quantiles (Table 3.6) are smaller than those from the unadjusted analyses. While the
stepwise procedure appeared to work better than either EigenSTRAT or GC, it was not
completely successful in reducing false-positive associations.
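The stepwise adjustment can be sketched as follows. The original analyses used R's step(), which considers both adding and dropping terms; the plain-Python sketch below does forward selection only, on simulated unstructured SNPs, purely to illustrate the AIC mechanics. All sizes and names are illustrative.

```python
# Forward stepwise selection of covariate SNPs by AIC in a linear model,
# mimicking (not reproducing) the R step() procedure described above.
import math
import random

random.seed(4)

def ols_rss(y, X):
    """Residual sum of squares via the normal equations (Gauss-Jordan)."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                A[r] = [a - A[r][col] * b for a, b in zip(A[r], A[col])]
    beta = [A[i][k] for i in range(k)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for yi, row in zip(y, X))

def aic(y, X):
    n = len(y)
    return n * math.log(ols_rss(y, X) / n) + 2 * len(X[0])

# Treat "SNP A" as the outcome and the other SNPs as candidate covariates.
n = 200
snps = [[float((random.random() < 0.4) + (random.random() < 0.4))
         for _ in range(n)] for _ in range(10)]
y, candidates = snps[0], list(range(1, 10))

design = [[1.0] for _ in range(n)]            # intercept-only model
best, chosen = aic(y, design), []
while candidates:
    score, c = min((aic(y, [row + [snps[j][i]] for i, row in enumerate(design)]), j)
                   for j in candidates)
    if score >= best:
        break                                  # no candidate lowers AIC
    best, chosen = score, chosen + [c]
    candidates.remove(c)
    design = [row + [snps[c][i]] for i, row in enumerate(design)]
```

The idea is that, under structure, the selected SNPs act as proxies for ancestry, so conditioning on them reduces the excess correlation.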
In conclusion, these experiments indicate that there is a modest amount of
population stratification in the data for the MEC African Americans, Latinos and
Whites, which is not observed in the MEC Japanese. In the Japanese data, population
stratification is a concern only when the analysis is restricted to HD SNPs and disease
is also caused by an HD SNP. The modest amount of substructure in the data for
African Americans, Latinos and Whites could be adjusted for by a GC-like approach,
but not by EigenSTRAT with the current number of markers. For future work,
genotyping more markers is highly recommended to further examine the existence of
subpopulations within each ethnic group.
References
Altshuler, D., J. N. Hirschhorn, et al. (2000). "The common PPARG Pro12Ala
polymorphism is associated with decreased risk of type 2 diabetes." Nat Genet
26(1): 76-8.
Amundadottir, L. T., P. Sulem, et al. (2006). "A common variant associated with
prostate cancer in European and African populations." Nat Genet 38(6): 652-
658.
Anderson, E. C. and J. Novembre (2003). "Finding haplotype block boundaries by using
the minimum-description-length principle." Am J Hum Genet 73(2): 336-54.
Balding, D. J. and R. A. Nichols (1995). "A method for quantifying differentiation
between populations at multi-allelic loci and its implications for investigating
identity and paternity." Genetica 96: 3-12.
Benjamini, Y., D. Drai, et al. (2001). "Controlling the false discovery rate in behavior
genetics research." Behav Brain Res 125: 279-84.
Bersaglieri, T., P. C. Sabeti, et al. (2004). "Genetic signatures of strong recent positive
selection at the lactase gene." Am J Hum Genet 74(6): 1111-1120.
Botstein, D. and N. Risch (2003). "Discovering genotypes underlying human
phenotypes: past successes for mendelian disease, future approaches for
complex disease." Nat Genet 33 (Suppl.): 228-37.
Carlson, C. S., M. A. Eberle, et al. (2004). "Selecting a maximally informative set of
single-nucleotide polymorphisms for association analyses using linkage
disequilibrium." Am J Hum Genet 74(1): 106-120.
Carroll, R. J., S. Wang, et al. (1995). "Prospective analysis of logistic case-control
studies." J Am Stat Assoc 90: 157-169.
Chapman, J. M., J. D. Cooper, et al. (2003). "Detecting disease associations due to
linkage disequilibrium using haplotype tags: a class of tests and the determinants
of statistical power." Hum Hered 56(1-3): 18-31.
Cheng, I., D. O. Stram, et al. (2006). "Common Genetic Variation in IGF1 and Prostate
Cancer Risk in the Multiethnic Cohort." J Natl Cancer Inst 98(2): 123-134.
Clayton, D. G., N. M. Walker, et al. (2005). "Population structure, differential bias and
genomic control in a large-scale, case-control association study." Nat Genet
37(11): 1243-1246.
Collins, F. S., M. S. Guyer, et al. (1997). "Variations on a theme: cataloging human
DNA sequence variants." Science 278: 1580-1581.
Daly, M. J., J. D. Rioux, et al. (2001). "High-resolution haplotype structure in the
human genome." Nat Genet 29(2): 229-232.
Dawson, E., G. R. Abecasis, et al. (2002). "A first-generation linkage disequilibrium
map of human chromosome 22." Nature. 418(6897): 544-548.
de Bakker, P. I. W., R. Yelensky, et al. (2005). "Efficiency and power in genetic
association studies." Nat Genet 37(11): 1217-1223.
Devlin, B. and N. Risch (1995). "A comparison of linkage disequilibrium measures for
fine-scale mapping." Genomics. 29(2): 311-322.
Devlin, B. and K. Roeder (1999). "Genomic control for association studies." Biometrics
55: 997-1004.
Dudbridge, F. and B. P. Koeleman (2004). "Efficient computation of significance levels
for multiple associations in large studies of correlated data, including
genomewide association studies." Am J Hum Genet 75: 424-435.
Edwards, A. O., R. Ritter, III, et al. (2005). "Complement Factor H Polymorphism and
Age-Related Macular Degeneration." Science 308(5720): 421-424.
Franklin, I. and R. C. Lewontin (1970). "Is the gene the unit of selection?" Genetics
65(4): 707-734.
Gabriel, S. B., S. F. Schaffner, et al. (2002). "The structure of haplotype blocks in the
human genome." Science 296(5576): 2225-2229.
Gail, M. H. and J. H. Lubin (1990). "On power and sample size for studying features of
the relative odds of disease." Am J Epidemiol 131: 552-566.
Gauderman, W. J. (2002). "Sample size requirements for matched case-control studies
of gene–environment interaction." Statistics in Medicine 21: 35-50.
Greenspan, G. and D. Geiger (2004). "High density linkage disequilibrium mapping
using models of haplotype block variation." Bioinformatics 20 Suppl. 1: i137-
144.
Gunderson, K. L., F. J. Steemers, et al. (2005). "A genome-wide scalable SNP
genotyping assay using microarray technology." Nat Genet 37(5): 549-554.
Haiman, C. A., D. O. Stram, et al. (2003). "A comprehensive haplotype analysis of
CYP19 and breast cancer risk: the Multiethnic Cohort." Hum Mol Genet 12(20):
2679-2692.
Haines, J. L., M. A. Hauser, et al. (2005). "Complement Factor H Variant Increases the
Risk of Age-Related Macular Degeneration." Science 308(5720): 419-421.
Hedrick, P. W. (1987). "Gametic disequilibrium measures: proceed with caution."
Genetics 117(2): 331-341.
Hirschhorn, J. N. and M. J. Daly (2005). "Genome-wide association studies for
common diseases and complex traits." Nat Rev Genet 6: 95-108.
Jeffreys, A. J., L. Kauppi, et al. (2001). "Intensely punctate meiotic recombination in
the class II region of the major histocompatibility complex." Nat Genet 29(2):
217-222.
Jolliffe, I. T. (2002). Principal Component Analysis. New York, Springer-Verlag.
Klein, R. J., C. Zeiss, et al. (2005). "Complement factor H polymorphism in age-related
macular degeneration." Science 308: 385-388.
Kolonel, L. N., B. E. Henderson, et al. (2000). "A multiethnic cohort in Hawaii and Los
Angeles: baseline characteristics." Am J Epidemiol 151: 346-357.
Kraft, P. (2006). "Efficient two-stage genome-wide association designs based on false
positive report probabilities." Pac Symp Biocomput 11: 523-534.
Kruglyak, L. and D. A. Nickerson (2001). "Variation is the spice of life." Nat Genet
27(3): 234-236.
Lin, D. Y. (2006). "Evaluating Statistical Significance in Two-Stage Genomewide
Association Studies." Am J Hum Genet 78: 505-509.
Loiselle, B. A., V. L. Sork, et al. (1995). "Spatial genetic structure of a tropical
understory shrub." Am J Bot 82: 1420-1425.
Longmate, J. A. (2002). "Complexity and power in case-control association studies."
Am J Hum Genet 68(5): 1229-37.
Marchini, J., L. R. Cardon, et al. (2004). "The effects of human population structure on
large genetic association studies." Nat Genet 36(5): 512-517.
Meng, Z., D. V. Zaykin, et al. (2003). "Selection of genetic markers for association
analyses, using linkage disequilibrium and haplotypes." Am J Hum Genet 73(1):
115-123.
Mountain, J. L. and L. L. Cavalli-Sforza (1997). "Multilocus Genotypes, a Tree of
Individuals, and Human Evolutionary History." Am J Hum Genet 61: 705-718.
Nicholson, G., A. V. Smith, et al. (2002). "Assessing population differentiation and
isolation from single-nucleotide polymorphism data." Journal of the Royal
Statistical Society: Series B (Statistical Methodology) 64(4): 695-715.
Nordborg, M. and S. Tavare (2002). "Linkage disequilibrium: what history has to tell
us." Trends in Genetics 18(2): 83-89.
Nyholt, D. R. (2004). "A simple correction for multiple testing for single-nucleotide
polymorphisms in linkage disequilibrium with each other." Am J Hum Genet
74: 765-769.
Patil, N., A. J. Berno, et al. (2001). "Blocks of limited haplotype diversity revealed by
high-resolution scanning of human chromosome 21." Science 294(5547): 1719-
1723.
Pe'er, I., P. I. W. de Bakker, et al. (2006). "Evaluating and improving power in whole-
genome association studies using fixed marker sets." Nat Genet 38: 663-667.
Phillips, M. S., R. Lawrence, et al. (2003). "Chromosome-wide distribution of
haplotype blocks and the role of recombination hot spots." Nat Genet 33(3):
382-387.
Price, A. L., N. J. Patterson, et al. (2006). "Principal components analysis corrects for
stratification in genome-wide association studies." Nat Genet 38(8): 904-909.
Pritchard, J. K. and P. Donnelly (2001). "Case–Control Studies of Association in
Structured or Admixed Populations." Theor Popul Biol 60: 227–237.
Pritchard, J. K. and M. Przeworski (2001). "Linkage disequilibrium in humans: models
and data." Am J Hum Genet 69(1): 1-14.
Pritchard, J. K., M. Stephens, et al. (2000). "Inference of population structure using
multilocus genotype data." Genetics 155(2): 945-959.
Pritchard, J. K., M. Stephens, et al. (2000). "Association mapping in structured
populations." Am J Hum Genet 67(1): 170-181.
Risch, N. J. (2000). "Searching for genetic determinants in the new millennium." Nature
405: 847-856.
Satagopan, J. M. and R. C. Elston (2003). "Optimal two-stage genotyping in
population-based association studies." Genet Epidemiol 25(2): 149-157.
Satten, G. A., W. D. Flanders, et al. (2001). "Accounting for unmeasured population
substructure in case-control studies of genetic association using a novel latent-
class model." Am J Hum Genet 68(2): 466-477.
Setakis, E., H. Stirnadel, et al. (2006). "Logistic regression protects against population
structure in genetic association studies." Genome Res 16(2): 290-296.
Skol, A. D., L. J. Scott, et al. (2006). "Joint analysis is more efficient than replication-
based analysis for two-stage genome-wide association studies." Nat Genet 38(2):
209-213.
Storey, J. D. and R. Tibshirani (2003). "Statistical significance for genomewide
studies." Proc Natl Acad Sci USA 100(16): 9440-9445.
Stram, D. O. (2004). "Tag SNP selection for association studies." Genet Epidemiol
27(4): 365-374.
Stram, D. O. (2005). "Software for tag single nucleotide polymorphism selection."
Human Genomics 2: 144-151.
Stram, D. O., C. A. Haiman, et al. (2003). "Choosing haplotype-tagging SNPS based on
unphased genotype data using a preliminary sample of unrelated subjects with
an example from the Multiethnic Cohort Study." Hum Hered 55(1): 27-36.
Syvanen, A. (2005). "Toward genome-wide SNP genotyping." Nat Genet 37: S5-S10.
Tang, H., T. Quertermous, et al. (2005). "Genetic structure, self-identified
race/ethnicity, and confounding in case-control association studies." Am J Hum
Genet 76: 268-275.
Teare, M. D., A. M. Dunning, et al. (2002). "Sampling distribution of summary linkage
disequilibrium measures." Ann Hum Genet 66(3): 223-233.
Terwilliger, J. D. and T. Hiekkalinna (2006). "An utter refutation of the 'Fundamental
Theorem of the HapMap'." Eur J Hum Genet 14(4): 426-437.
The International HapMap Consortium (2003). "The International HapMap Project."
Nature 426: 789-796.
The International HapMap Consortium (2005). "A haplotype map of the human
genome." Nature 437: 1299-1321.
Thomas, D. C., R. W. Haile, et al. (2005). "Recent developments in genomewide
association scans: a workshop summary and review." Am J Hum Genet. 77(3):
337-345.
van den Oord, E. J. C. G. and P. F. Sullivan (2003). "A framework for controlling false
discovery rates and minimizing the amount of genotyping in the search for
disease mutations." Hum Hered 56: 188-199.
Wacholder, S., S. Chanock, et al. (2004). "Assessing the probability that a positive
report is false: an approach for molecular epidemiology studies." J Natl Cancer
Inst 96(6): 434-442.
Wacholder, S., N. Rothman, et al. (2002). "Counterpoint: bias from population
stratification is not a major threat to the validity of conclusions from
epidemiological studies of common polymorphisms and cancer." CEBP 11: 512-
522.
Wall, J. D. and J. K. Pritchard (2003). "Haplotype blocks and linkage disequilibrium in
the human genome." Nat Rev Genet 4(8): 587-597.
Wang, H., D. C. Thomas, et al. (2006). "Optimal two-stage genotyping designs for
genome-wide association scans." Genet Epidemiol 30: 356-368.
Wang, N., J. M. Akey, et al. (2002). "Distribution of recombination crossovers and the
origin of haplotype blocks: the interplay of population history, recombination,
and mutation." Am J Hum Genet 71(5): 1227-1234.
Wang, W. Y., B. J. Barratt, et al. (2005). "Genome-wide association studies: theoretical
and practical concerns." Nat Rev Genet 6(2): 109-118.
Weale, M. E., C. Depondt, et al. (2003). "Selection and evaluation of tagging SNPs in
the neuronal-sodium-channel gene SCN1A: implications for linkage-
disequilibrium gene mapping." Am J Hum Genet 73(3): 551-565.
Weiss, K. M. and A. G. Clark (2002). "Linkage disequilibrium and the mapping of
complex human traits." Trends in Genetics 18(1): 19-24.
Yu, J., G. Pressoir, et al. (2006). "A unified mixed-model method for association
mapping that accounts for multiple levels of relatedness." Nat Genet 38(2): 203-
208.
Zhang, K., P. Calabrese, et al. (2002). "Haplotype block structure and its applications to
association studies: power and study designs." Am J Hum Genet 71(6): 1386-
1394.
Zhang, K., M. Deng, et al. (2002). "A dynamic programming algorithm for haplotype
block partitioning." Proc Natl Acad Sci USA 99: 7335-7339.
Abstract
The dissertation focuses on two problems identified from the ongoing Multiethnic Cohort (MEC) study: two-stage genotyping design and population stratification. Three chapters follow a general introduction to relevant concepts in genetic epidemiological studies. Chapters one and two contain two published papers on optimal two-stage genotyping design, using Bonferroni correction and false discovery rate (FDR) correction, respectively, for the overall type I error rate. The significance of this part is that a procedure was designed to search for the most cost-effective two-stage studies based on the current availability of high-throughput and very high-throughput genotyping platforms (fixed SNP arrays), without constraints on total sample size or available resources. The results provide guidance for two-stage designs under a wide range of assumptions on the per-genotype cost ratio and total number of markers. Chapter three describes the performance of the generalized linear mixed model (GLMM) in controlling for population stratification with simulated structured case-control data. Compared to Genomic Control (GC) and the relatively new EigenSTRAT method, GLMM does not offer much advantage. To assess the likely significance of population stratification in planned or future MEC association scans, Chapter three also examines evidence of population stratification within four ethnic groups (African Americans, Japanese, Latinos and Whites) represented in the newly available MEC breast cancer data consisting of 1400 SNPs. The results indicate that moderate structure exists in African Americans and Latinos and that the GC approach produced acceptable empirical type I error in most cases.
Asset Metadata
Creator: Wang, Hansong
Core Title: Two-stage genotyping design and population stratification in case-control association studies
School: Keck School of Medicine
Degree: Doctor of Philosophy
Degree Program: Biostatistics
Publication Date: 06/08/2007
Defense Date: 03/14/2007
Publisher: University of Southern California
Tag: association design, OAI-PMH Harvest, population stratification, two stage study
Language: English
Advisor: Stram, Daniel O. (committee chair); Gauderman, W. James (committee member); Nordborg, Magnus (committee member)
Creator Email: hansongw@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m523