EFFICIENT TWO-STEP TESTING APPROACHES FOR DETECTING
GENE-ENVIRONMENT INTERACTIONS IN GENOME-WIDE ASSOCIATION
STUDIES, WITH AN APPLICATION TO THE CHILDREN’S HEALTH STUDY
by
Cassandra Elizabeth Murcray
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)
December 2010
Copyright 2010 Cassandra Elizabeth Murcray
DEDICATION
For my family and my friends.
ACKNOWLEDGEMENTS
First I would like to thank my advisor, Professor William Gauderman, for his
guidance and support, and also the other members of my dissertation committee,
Professors David Conti and Fengzhu Sun, for their valuable contributions. I also owe
thanks to the other USC faculty members for helping me learn and giving me advice and
guidance, in particular Professors Juan Pablo Lewinger, Frank Gilliland, Kimberly
Siegmund, and Duncan Thomas.
I would also like to acknowledge my graduate student colleagues from whom I
have learned, found support and received helpful advice and suggestions.
This work was supported by the Southern California Environmental Health
Sciences Center (grant # P30ES007048) funded by the National Institute of
Environmental Health Sciences, the Children's Environmental Health Center (grant #s
P01ES009581, R826708-01 and RD831861-01) funded by the National Institute of
Environmental Health Sciences and the Environmental Protection Agency, the National
Institute of Environmental Health Sciences (grant #s T32ES013678 and P01ES011627),
and the National Heart, Lung and Blood Institute (grant # R01HL087680).
TABLE OF CONTENTS
Dedication ii
Acknowledgements iii
List of Tables v
List of Figures vi
Abstract viii
Introduction 1
Chapter I: Two-Step Method for Detecting Gene-Environment Interactions 7
Notation 7
Case-Control Test (CC) 7
Environment-Gene Two-Step Test (EG2) 8
Power Comparisons 9
Conclusions 14
Chapter II: Software Package to Calculate Power and Required Sample Size 16
Alternative Two-Step Approaches to Detect G×E Interactions in a GWAS 17
Disease-Gene Two-Step (DG2) 17
Hybrid Two-Step (H2) 18
Development of Design Software 19
Notation 19
Likelihood Theory 20
Relative Efficiency of the Various Designs 26
Results 27
Conclusions 34
Chapter III: Application to the Children’s Health Study 40
Exposure Data 41
Genotype Data 43
Statistical Analysis 43
Optimization of Hybrid Two-Step Analysis 44
Results 47
Conclusions 50
Summary 54
References 55
Appendix: Sample Power and Sample Size Code 59
LIST OF TABLES
Table I: Type I error and Power for Case-Control (CC) and Environment-Gene Two-Step (EG2) Test for Gene-Environment Interaction 12
Table II: Simulated and Empirical Power for Case-Control One-Step (CC) and Environment-Gene Two-Step (EG2) Tests for Gene-Environment Interaction 25
Table III: Sample Size (N) and Relative Efficiency (RE) Required to Achieve 80% Power to Detect a True Gene-Environment Interaction for a Collection of Testing Strategies across a Range of Parameter Settings 29
Table IV: Sample Size (N) Required and Relative Efficiency (RE) to Achieve 80% Power to Detect a True Gene-Environment Interaction for the Hybrid Two-Step Analysis for Various Allocations of α to the Association Two Step (ρ) 33
Table V: Sample size and school grade each year of the five CHS cohorts 40
Table VI: Descriptive Statistics for Environmental Exposures Considered for Testing G×E Interactions in the Children’s Health Study 42
Table VII: Most Significant SNPs Involved in G×E Interactions for each Environmental Exposure in the Children’s Health Study Using the Hybrid Two-Step Approach 52
LIST OF FIGURES
Figure I: Power for Case-Control (CC) and Environment-Gene Two-Step (EG2) Analyses for Increasing Levels of Interaction Effect Size ($R_{ge}$) 13
Figure II: Binomial distribution for S=10,000 markers and $\alpha_A$=0.05 24
Figure III: Sample Size Required to Achieve 80% Power for Tests of Gene-Environment Interaction in a Genome-wide Association Study by Interaction Effect Size for Binary Environmental Exposure 27
Figure IV: Sample Size Required to Achieve 80% Power for the Hybrid Two-Step Test (H2) of Gene-Environment Interaction in a Genome-wide Association Study by Step 1 Significance Thresholds for the Disease-Gene and Environment-Gene (EG2) Two-Step Tests for a Binary Exposure and no Genetic Main Effect ($R_g$ = 1.0) 31
Figure V: Relative Efficiency of the Hybrid Two-Step Test (H2) of Gene-Environment Interaction in a Genome-wide Association Study by ρ, the Allocation of the Experiment-wise Significance Level (α) to the Environment-Gene (EG2) Two-Step Test 32
Figure VI: Sample Size Required to Achieve 80% Power for Tests of Gene-Environment Interaction in a Genome-wide Association Study by Interaction Effect Size for a Continuous Environmental Exposure in the Absence of a Genetic Main Effect ($R_g$ = 1.0) 34
Figure VII: Power for Case-Control (CC), Environment-Gene Two-Step (EG2), Disease-Gene Two-Step (DG2), Hybrid Two-Step (H2), and Case-Only (CO) Analyses for Increasing Levels of Interaction Effect Size ($R_{ge}$) 38
Figure VIII: Power of the CC and EG2 Tests for G×E Interaction Across a Range of $\alpha_A$ for a Selection of Interaction Effect Sizes ($R_{ge}$) (a) and Minor Allele Frequencies ($q_A$) (b). The Optimal Settings of $\alpha_A$ are shown by Vertical Lines. 46
Figure IX: Power of the CC and DG2 Tests for G×E Interaction Across a Range of $\alpha_M$ for a Selection of Interaction Effect Sizes ($R_{ge}$) (a) and Minor Allele Frequencies ($q_A$) (b). The Optimal Settings of $\alpha_M$ are shown by Vertical Lines. 47
Figure X: Power of the Case-Control, Environment-Gene, Disease-Gene, and Hybrid Two-Step Tests to Detect G×in utero Tobacco Smoke Exposure Interaction in the Children’s Health Study Across a Range of $R_{ge}$ 48
Figure XI: Manhattan Plot of Log$_{10}$ P-Values from Step 1 of the Disease-Gene Scan (DG2) of 527,918 SNPs from the Children’s Health Study 49
Figure XII: Manhattan Plot of Log$_{10}$ P-Values from Step 1 of the Environment-Gene Scan of 527,918 SNPs for in utero Tobacco Smoke Exposure from the Children’s Health Study 50
Figure XIII: Manhattan Plot of Log$_{10}$ P-Values from Step 2 of the (a) Case-Control (CC), (b) Disease-Gene (DG2), (c) Environment-Gene (EG2) Two-Step Scan of 527,918 SNPs for in utero Tobacco Smoke Exposure from the Children’s Health Study 51
ABSTRACT
Many complex diseases (e.g. asthma, diabetes) are likely to be a result of the interplay of
genes and environmental exposures. The standard analysis in a genome-wide association
study (GWAS) scans for main effects and ignores the potentially useful information in
the available exposure data. This dissertation explores alternative approaches to detect
gene-environment interactions (G×E) in GWA studies. The first chapter explores a novel
approach aimed at prioritizing the large number of SNPs tested to highlight those most
likely to be involved in a G×E interaction. This approach screens all markers available in
a GWAS on a test that models the G-E association induced by an interaction in the
combined case-control sample. Power and Type I error of this approach are compared to
a traditional approach. In the second chapter of this dissertation, I explore alternative
two-step approaches through the development of a likelihood based software package
designed to compute power and sample size of a variety of approaches to detect G×E
interactions. In the final chapter, I demonstrate the use of this software package in the
optimization and analysis of a nested case-control sample from the Children’s Health
Study to investigate heterogeneity of genetic risk by subgroups defined by environmental
exposures on asthma susceptibility in children in southern California. I optimize this
procedure to efficiently scan for G×E interactions that affect asthma susceptibility for
binary (e.g. in utero tobacco smoke, close proximity to a major road) and continuous
(e.g. PM$_{2.5}$, ozone, traffic-related pollution NO$_x$) exposures.
INTRODUCTION
The advancement of high-throughput genotyping technology has introduced the unique
opportunity to study the role of genetic variability in the etiology of complex disease (e.g.
diabetes, cancer, asthma). Not only can investigators perform candidate gene and
pathway-based studies, they can now conduct genome-wide association studies to
potentially uncover novel genetic susceptibility loci. For all of these study designs, it is
important to recognize that most complex diseases are a result of the combined effects of
genes and environment. With the abundance of information available and the daunting
task of understanding both genetic and environmental effects on disease comes the
responsibility to develop novel methods capable of teasing apart true significant findings
from the expected large number of false positives.
Despite the belief that complex disease is a result of the interplay of both genes
and environmental exposures, little work has been done to investigate these types of
interactions in the current age of genome-wide association studies (GWAS). Most
published GWAS have reported a short list of significant single nucleotide
polymorphisms (SNPs) with additional follow-up of these markers to be conducted
(Hunter, Kraft et al. 2007; Saxena, Voight et al. 2007; Scott, Mohlke et al. 2007; Zeggini,
Weedon et al. 2007). This strategy could potentially miss many important SNPs specific
to a subgroup of the population defined by some exposure. In fact, interactions with
opposite effects in two different subgroups (crossing-interaction) will not show a
marginal genetic effect and therefore will not be identified by using standard approaches.
Traditional tests for gene-environment (G×E) interaction have been shown to
have poor power compared to marginal effect tests of the gene or environmental factor
alone (Hwang, Beaty et al. 1994; Garcia-Closas and Lubin 1999). This weakness of G×E
interaction analyses is even more pronounced in genome-wide association studies where
the correction for multiple testing currently requires p-values less than $10^{-8}$ to be declared
statistically significant at the genome-wide level. Due to the small power of tests for
interaction, it has been argued that even in the presence of a G×E interaction, an induced
marginal effect may be detectable by testing for a genetic effect alone (Clayton and
McKeigue 2001). On the other hand, it has been shown that exploiting the interaction
can often increase statistical power to detect important disease susceptibility loci
(Chatterjee, Kalaylioglu et al. 2005; Kraft, Yen et al. 2007).
The goal of any GWAS is to identify novel genetic variants involved in the
development of complex disease. Although most of these studies will perform a marginal
effect scan of all available SNPs before pursuing G×E or G×G interactions, it has been
shown that power can be gained by studying these types of risk jointly. Kraft et al (Kraft,
Yen et al. 2007) developed a 2-degree of freedom (DF) joint test of the genetic main
effect and the interaction parameter. They showed that even in the absence of non-zero
main effects, their 2 DF test could have more power than the test of the interaction
parameter alone. For genome-wide association studies with only a few environmental
factors of interest, this may be the ideal choice to identify important loci involved in
disease etiology. However, since we assume that pure marginal tests of each SNP will
always be the first analysis method for GWA studies, all tests that include the main effect
parameter for the gene will be replicated for each new environmental factor analyzed.
This redundant testing could impact the overall power of the study to identify new
susceptibility loci.
Many investigators have addressed the limited power of tests for G×E interaction
by developing sophisticated methods designed to increase power to detect associations
beyond marginal genetic effects. One important contribution was the case-only design,
where tests for interaction can be done without sampling any controls (Piegorsch,
Weinberg et al. 1994; Khoury and Flanders 1996). It was shown that under the
assumption of independence between gene and environment in the population, the G×E
interaction relative risk can be estimated by the odds ratio between gene and environment
in the cases alone (Piegorsch, Weinberg et al. 1994). The resulting estimate of the G×E
interaction can be much more precise than the corresponding estimate from a traditional
logistic regression model applied to cases and controls that does not exploit independence
between gene and environment in the population. However, if even only a small subset
of null single nucleotide polymorphisms has a detectable population level association
between gene and environment, one could generate several thousand false-positive results
using a case-only analysis of interaction given the overall number of SNPs being tested in
a GWAS. Although there are likely to be few genes that have a true association with
environment in the population, G-E associations can be induced through population
stratification. For this to occur, the distribution of the environmental exposure must vary
across population subgroups (e.g. by ethnic subgroup). If this occurs, then every SNP
whose minor allele frequency also varies by subgroup will appear to have G-E
dependence in the case sample being studied, thus yielding a flawed inference about G×E
interaction. It may be possible to adjust for population stratification or to study an
environmental factor that is known not to vary by population subgroups. In that situation,
the case-only analysis can be an efficient choice.
To adapt the potentially powerful case-only analysis to be robust to deviations
from the assumption of G-E independence, recent attention has been given to extensions
of the case-only analysis (Chatterjee, Kalaylioglu et al. 2005; Mukherjee, Ahn et al.
2008; Mukherjee and Chatterjee 2008; Li and Conti 2009). The goal of these methods is
to leverage the power of the case-only test with the unbiasedness of the traditional case-
control comparison analysis in order to improve power to detect G×E interaction. Li et al
(Li and Conti 2009) suggested using Bayesian model averaging (BMA) to average over the
powerful case-only analysis and the unbiased case-control test. They show increased
power for the BMA approach over the traditional test for G×E interaction. Their method,
however, is vulnerable to bias if the assumption of independence between gene and
environment is violated and proper priors are not used. Similarly, Mukherjee et al
(Mukherjee and Chatterjee 2008) developed an empirical Bayes-type shrinkage estimator
to balance bias and efficiency. They achieve smaller mean-squared error (MSE)
estimates and reduced bias using their proposed method under independence between
gene and environment as well as for modest departures from independence. Like Li et al
(Li and Conti 2009), they noted that for modest departures from gene-environment
independence, their method can still lead to a biased test.
Kooperberg et al (Kooperberg and Leblanc 2008) showed that power could be
gained by screening the markers available in a GWAS by genetic marginal effects in the
pursuit of G×G interactions. They demonstrated that by reducing the number of markers
formally tested for interaction to only those that were identified to have a detectable
marginal effect at a liberal screening threshold, they were able to achieve significant
power gains over a traditional approach of testing all SNP pairs. Here, I explore power of
this approach for G×E interactions and show that under some scenarios, screening on
marginal genetic effects is a powerful alternative to traditional methods.
Recently, Murcray et al (Murcray, Lewinger et al. 2009) developed a two-step
analysis framework designed to efficiently use all information in a case-control sample to
detect gene-environment interactions in a genome-wide association study. The method
was always more powerful than a traditional test of G×E interaction for the simulation
parameters examined, even in the presence of population level association between gene
and environment. Furthermore, there was no inflation in Type I error in the presence of
G-E association in the population. Here I will highlight important conclusions from that
paper, which suggest possible improvements to this two-step method. In addition, I show
that a hybrid approach that combines the Murcray et al (Murcray, Lewinger et al. 2009)
and Kooperberg et al (Kooperberg and Leblanc 2008) methods is a robust and powerful
choice across a wide range of true underlying population parameters.
I describe an empirical tool to calculate power and sample size using likelihood
theory that is quicker and easier to adapt to different model specifications than
simulations. Using this tool, I examine the potential optimization and relative efficiency
of the various two-step tests across a broad range of population parameters. This
software tool is available for investigators designing genetic association studies where
effect modification by environmental factors is a primary concern. Finally, I optimize the
proposed methods for an asthma GWAS conducted in the Children’s Health Study (CHS)
to scan for interactions between a genome-wide panel of SNPs and in utero tobacco
smoke, traffic, and air pollution exposures.
CHAPTER I: TWO-STEP METHOD FOR DETECTING G×E INTERACTION
Notation
Let D be an indicator of disease status, and assume we have a sample of cases
(D=1) and unrelated controls (D=0). Assume information is available for a binary
environmental exposure, with E as an indicator for exposure. Further assume that for
each individual we have genotyped S single nucleotide polymorphisms (SNPs) spanning
the genome, with $g_1, g_2, \ldots, g_S$ denoting the genotypes at the S loci. Letting
$G_1, G_2, \ldots, G_S$ denote some genetic coding (e.g. additive, dominant) for each
genotype, we consider a model for a given SNP of the form
\[ \mathrm{logit}\, P(D = 1 \mid g, e) = \beta_0 + \beta_g G + \beta_e E + \beta_{ge} G E \qquad (1) \]
Under a dominant coding of the genotype, for example, $\exp(\beta_g)$ is the odds ratio
(OR) comparing carriers of at least one risk allele (G=1) to non-carriers (G=0) in those
unexposed (E=0). Similarly, $\exp(\beta_e)$ is the odds ratio comparing risk in exposed
(E=1) to that in unexposed (E=0) individuals among non-carriers of the risk allele (G=0).
Lastly, $\exp(\beta_{ge})$ is the ratio of the genetic odds ratios comparing exposed to
unexposed subjects, i.e. $\exp(\beta_{ge}) = \mathrm{OR}_{G \mid E=1} / \mathrm{OR}_{G \mid E=0}$. If this ratio is equal to 1.0, or $\beta_{ge} = 0$, we say
that there is no interaction between genotype and the environmental exposure.
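The interpretation of the coefficients in model (1) can be checked numerically. The following is a minimal sketch in Python; the coefficient values and the function names are hypothetical and chosen only for illustration, not taken from the dissertation.

```python
import math

# Hypothetical coefficients for model (1), chosen only for illustration.
BETA_0 = math.log(0.05)
BETA_G = math.log(1.2)   # genetic main effect (log odds ratio)
BETA_E = math.log(1.5)   # environmental main effect (log odds ratio)
BETA_GE = math.log(2.0)  # interaction effect (log ratio of odds ratios)


def disease_odds(g, e):
    """Odds of disease under model (1) for genotype code g and exposure e."""
    return math.exp(BETA_0 + BETA_G * g + BETA_E * e + BETA_GE * g * e)


def genetic_odds_ratio(e):
    """Odds ratio for carriers (G=1) vs. non-carriers (G=0) within exposure stratum e."""
    return disease_odds(1, e) / disease_odds(0, e)


or_unexposed = genetic_odds_ratio(0)            # equals exp(beta_g)
or_exposed = genetic_odds_ratio(1)              # equals exp(beta_g + beta_ge)
interaction_ratio = or_exposed / or_unexposed   # equals exp(beta_ge)
```

With these assumed values the genetic odds ratio among the unexposed is exp(β_g) and the ratio of the two stratum-specific odds ratios recovers exp(β_ge), which is the quantity tested for interaction.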
Case-Control Test (CC)
In the context of a GWAS, a standard approach to test for gene-environment interaction
would be to perform a 1-df test of $H_0: \beta_{ge} = 0$ for each SNP based on the model in
equation (1). We assume a likelihood ratio test will be used to test this hypothesis. A
correction for multiple comparisons (e.g. Bonferroni, controlling the False Discovery
Rate (Benjamini and Hochberg 1995)) is required to achieve a desired genome-wide type
I error rate.
Environment-Gene Two-Step Test (EG2)
Murcray et al (Murcray, Lewinger et al. 2009) proposed an alternative two-step test to
scan for interactions that combines the power of the case-only test with the protection
from bias of the traditional case-control comparison analysis (CC). This approach uses a
screening step to prioritize all S markers to identify a subset that are most likely to be
involved in an interaction. The analysis consists of the following two steps:
Step 1, E-G screening test: For each of the S SNPs, perform a likelihood-ratio test of
association between G and E, based on the logistic model $\mathrm{logit}\, \Pr(E = 1 \mid g) = \alpha_0 + \beta_A G$.
This is the standard test that would be applied in a case-only analysis of G×E interaction
(Piegorsch, Weinberg et al. 1994; Khoury and Flanders 1996), although in this context
the test is applied to the combined sample of cases and controls. The subset of $s_A$ SNPs
that exceed a given significance threshold (i.e. with p-value < $\alpha_A$) for the test of
$H_0: \beta_A = 0$ are analyzed in Step 2.
Step 2, case-control test: The $s_A$ SNPs that pass Step 1 are tested in the traditional test of
gene-environment interaction, i.e. based on a likelihood-ratio test of $H_0: \beta_{ge} = 0$ derived
from the model in equation (1). Significance at this step is defined as having a p-value
less than $\alpha / s_A$, where $\alpha$ is the desired overall Type I error rate.
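The decision rule of the two steps can be sketched in a few lines of Python. This is an illustrative outline only: it assumes the per-SNP likelihood-ratio p-values have already been computed elsewhere, and the function names and toy p-values are hypothetical.

```python
def eg2_two_step(step1_pvalues, step2_pvalues, alpha=0.05, alpha_a=0.05):
    """Return indices of SNPs declared significant by the two-step rule.

    step1_pvalues[i]: p-value of the G-E screening test for SNP i
                      (combined case-control sample).
    step2_pvalues[i]: p-value of the case-control interaction test for SNP i.
    """
    if len(step1_pvalues) != len(step2_pvalues):
        raise ValueError("both p-value lists must cover the same SNPs")
    # Step 1: keep the s_A SNPs whose screening p-value is below alpha_A.
    passed = [i for i, p in enumerate(step1_pvalues) if p < alpha_a]
    if not passed:
        return []
    # Step 2: Bonferroni correction over the s_A SNPs that passed Step 1.
    threshold = alpha / len(passed)
    return [i for i in passed if step2_pvalues[i] < threshold]


def cc_one_step(step2_pvalues, alpha=0.05):
    """One-step case-control rule: Bonferroni correction over all S SNPs."""
    threshold = alpha / len(step2_pvalues)
    return [i for i, p in enumerate(step2_pvalues) if p < threshold]


# Toy p-values for four SNPs, invented for illustration.
step1 = [0.001, 0.5, 0.03, 0.8]
step2 = [0.02, 0.0001, 0.3, 0.9]
eg2_hits = eg2_two_step(step1, step2)
cc_hits = cc_one_step(step2)
```

In this toy input, SNP 0 passes the screen and then clears the more lenient Step 2 threshold (0.05/2), whereas the one-step rule applies 0.05/4 to every SNP. In practice the Step 2 test would only need to be run on the SNPs that pass Step 1.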
Like the case-only analysis, Step 1 of this procedure is sensitive to the assumption of
independence between gene and environment in the population from which the cases
were sampled. However, the Step 2 comparison of cases to controls is unbiased even
when this assumption fails to hold. Therefore, the overall EG2 procedure will provide a
valid test in the presence of population level association between genotype and exposure,
a claim verified by simulation (Murcray, Lewinger et al. 2009).
Given the reported power of a case-only analysis applied to only diseased
individuals (Piegorsch, Weinberg et al. 1994; Khoury and Flanders 1996), one may be
tempted to apply Step 1 to only diseased individuals, i.e. to perform a true case-only test
of interaction in Step 1 and use this to define the subset of $s_A$ markers to analyze in Step
2. However, since the case-only analysis and the traditional test of G×E interaction using
cases and controls are not independent, this approach produces a correlation between the
Step 1 and Step 2 test statistics and leads to an inflated Type I error rate for the overall
procedure. The screening test of G×E interaction applied to the entire sample of cases
and controls eliminates the correlation between tests in Steps 1 and 2 and, as we showed
formally and verified by simulation, preserves the overall Type I error rate (Murcray,
Lewinger et al. 2009).
Power Comparisons
In Murcray et al., simulations were used to study the power achieved by the EG2
testing framework compared to the traditional one-step method (CC). For each of 1,000
replicate data sets, we simulated a sample of $N_1$ = 500 cases and $N_0$ = 500 controls, each
with genotype information on a large number of markers (S = 10,000, 25,000, and
50,000). Although larger marker sets are likely to be used in practice, our chosen set
sizes are sufficient to demonstrate the relative power of the two-step method. Markers
were assumed to be independent loci distributed across the genome. For each replicate, a
single marker was chosen to be the true disease susceptibility locus (DSL), with
remaining markers assumed to have no association with disease. We considered a range
of minor allele frequencies ($q_A$) for the DSL, including 0.10, 0.20, and 0.30. For the
remaining null markers, we simulated a uniform distribution of allele frequencies
between 0.05 and 0.30. We also considered a range of values for the exposure prevalence
($p_E$), including 0.10, 0.25, and 0.50 and set the population disease prevalence ($p_0$) to 0.05.
Finally, we considered a range of possible values for the genetic and environmental main
effects ($R_g = \exp(\beta_g)$ and $R_e = \exp(\beta_e)$, respectively) as well as for the interaction effect
($R_{ge} = \exp(\beta_{ge})$).
As described above, the traditional case-only method of testing for gene-environment
interaction is based on the assumption of no population level association
between the gene and environment. We explored the sensitivity of the EG2 method to
population level association between gene and environment by introducing a parameter
$p_{ge}$, defined to be the probability that a given null marker is associated with the exposure
at the population level and is detectable at an $\alpha_A$ significance level. With $p_{ge}$ = 0.0, none
of the markers were assumed to be associated with E in the population. It is unlikely that
a large percentage of SNPs will be associated with a given environmental factor, but for
completeness we considered a wide range of values, ranging from 0.01 to 0.95. For each
simulated marker, we randomly decided whether it was associated with E in the
population based on probability $p_{ge}$. For any marker chosen to have an association,
genotypes were generated conditional on the assigned E and an assumed population
SNP-exposure odds ratio of 2.0.
For each parameter setting, we applied the traditional one-step CC G×E test and
our EG2 approach. We estimated the experiment-wise Type I error as the proportion of
1,000 replicates in which at least one of the null markers was found to be significant after
a Bonferroni correction for multiple comparisons. Power was calculated as the proportion
of replicates in which the DSL was detected at an overall significance level of 0.05, again
after a Bonferroni correction for multiple comparisons.
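The two summary measures defined above amount to simple proportions over replicates. A minimal Python sketch follows; the record layout (`null_hit`, `dsl_hit`) and the toy replicate outcomes are hypothetical and do not reproduce the dissertation's simulation results.

```python
def summarize_replicates(replicates):
    """Estimate experiment-wise Type I error and power from replicate outcomes.

    Each replicate is a dict with two booleans:
      'null_hit': at least one null marker was significant after correction.
      'dsl_hit':  the true DSL was significant after correction.
    """
    n = len(replicates)
    if n == 0:
        raise ValueError("at least one replicate is required")
    type1_error = sum(r["null_hit"] for r in replicates) / n
    power = sum(r["dsl_hit"] for r in replicates) / n
    return type1_error, power


# Toy outcomes for 1,000 replicates, invented for illustration only.
toy = [{"null_hit": i < 50, "dsl_hit": i % 3 == 0} for i in range(1000)]
type1_error, power = summarize_replicates(toy)
```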
Table I shows power and Type I error for both the traditional one-step Case-Control
(CC) method for testing for gene-environment interaction in a case-control
GWAS as well as the described Environment-Gene Two-Step (EG2) framework. For all
parameter settings simulated, the experiment-wise error rate was well approximated by
both methods. Using a pure case-only analysis to test for gene-environment interaction
results in an inflated Type I error rate under the simulated scenario of population level
dependence between gene and environment ($p_{ge} \neq 0$). However, the EG2 method is
unbiased even when a large proportion of markers are assumed to be associated in the
population ($p_{ge}$ = 0.95).
Across a range of interaction effect sizes ($R_{ge} = \exp(\beta_{ge})$), our EG2 method was
more powerful than the standard CC test for detecting an interaction (Figure I). For
example, when $R_{ge}$ = 3.0, power was 33.2 percent using the standard CC approach,
compared to 57.9 percent using the EG2 method. As we would expect, as the effect size
of the interaction gets larger both tests gain power. For a small interaction effect, both
tests have low power to detect a causal locus while at a sufficiently large effect size, the
two tests approach 100 percent power. The largest differences in power between the two
methods occurred when the interaction effect was of moderate magnitude, from $R_{ge}$ = 2.5
to 4.0. All the estimates of power in Figure I assumed a DSL allele frequency $q_A$ = 0.2,
exposure frequency $p_E$ = 0.5, no main effects ($R_g = R_e = 1$), no population level association
between G and E ($p_{ge}$ = 0), 10,000 SNP markers, and a screening test significance
threshold of $\alpha_A$ = 0.05 for the EG2 approach.
Table I. Type I error and Power for Case-Control (CC) and Environment-Gene Two-step
(EG2) Test for Gene-Environment Interaction
Figure I. Power for Case-Control (CC) and Environment-Gene Two-Step (EG2)
Analyses for Increasing Levels of Interaction Effect size ($R_{ge}$). All Other Parameter
Settings Remain Constant Under the ‘Base’ Model Specifications (S=10,000, number of
cases/controls=500/500, $q_A$=0.2, $p_E$=0.5, $R_g$=1, $R_e$=1, $p_{ge}$=0, $\alpha_A$=0.05).
The Environment-Gene Two-Step test was consistently more powerful than the
traditional one-step Case-Control test over a wide variety of parameter settings (Table I).
As expected, power for both tests was highest for common exposures and alleles. The
EG2 method was at least twice as powerful as the CC test when the exposure was rare or
when the disease allele was rare, although absolute power in these situations was low for
both procedures. Power for the EG2 test depended somewhat on the significance
threshold for Step 1 ($\alpha_A$). Specifically, relative to our base model with $\alpha_A$ = 0.05, a
smaller threshold value ($\alpha_A$ = 0.01) resulted in increased power to detect the DSL, while
we saw reduced power when we allowed more markers to move into Step 2 ($\alpha_A$ = 0.10).
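To make the role of $\alpha_A$ concrete, the following back-of-the-envelope calculation (an illustration, not a result from the simulations) approximates $s_A$ by its expected value under the null, $S \alpha_A$, for the base model with S = 10,000 and $\alpha$ = 0.05. The realized $s_A$ is random, so the Step 2 thresholds shown are approximate.
\[
\begin{array}{lll}
\alpha_A = 0.01: & s_A \approx 10{,}000 \times 0.01 = 100, & \alpha / s_A \approx 0.05 / 100 = 5 \times 10^{-4} \\
\alpha_A = 0.05: & s_A \approx 10{,}000 \times 0.05 = 500, & \alpha / s_A \approx 0.05 / 500 = 1 \times 10^{-4} \\
\alpha_A = 0.10: & s_A \approx 10{,}000 \times 0.10 = 1{,}000, & \alpha / s_A \approx 0.05 / 1{,}000 = 5 \times 10^{-5} \\
\mathrm{CC\ (no\ screen)}: & S = 10{,}000, & \alpha / S = 0.05 / 10{,}000 = 5 \times 10^{-6}
\end{array}
\]
A stricter screen yields a more lenient Step 2 threshold, but it also lowers the chance that the DSL itself passes Step 1; overall power reflects both effects.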
As expected, a population level association between markers and environment
($p_{ge}$ > 0) increased the number of markers that proceeded to Step 2. However, for the
range of values we considered plausible in a genome-wide scan ($p_{ge}$ = 0.01 or 0.05), there
was not an appreciable impact on power for the two-step (EG2) method. At more liberal
values for the proportion of markers with a population level association between gene
and environment ($p_{ge}$ = 0.30, 0.95), power for the EG2 method approached that for the CC
test. Specifically, when we assumed 95% of the markers were associated with the
environmental factor, power estimates for the CC and EG2 methods were identical (29.3
percent).
Conclusions
For genome-wide association studies in a case-control sample, we have shown in
Murcray et al. (2009) that the two-step testing approach provides a powerful alternative
for testing gene-environment interaction relative to a traditional one-step test. For the
parameter settings we examined, the Environment-Gene Two-Step (EG2) method was
always more powerful than the Case-Control (CC) method. Given its increased power
and ease of implementation, the EG2 test is an attractive alternative for identifying G×E
interactions in genome-wide association studies for complex diseases.
Current genome-wide association studies are conducted on large sample sizes in
order to have power to detect modest-sized effects at genome-wide significance after
correction for multiple testing. We considered a scenario with only 500 cases and 500
controls genotyped on 10,000 markers for our base model. With an increase in sample
size, power to detect a marker involved in a gene-environment interaction would increase
for both the CC and EG2 methods. An increase in number of markers could increase or
decrease power, depending on whether the increase in linkage disequilibrium between the
DSL and the markers offsets the penalty for a larger number of tests. However, we
would expect variations in sample size and number of markers to affect both the one- and
two-step approaches similarly and therefore not affect the relative comparison of power
for the two methods.
The additional power of the EG2 procedure comes from exploiting independent
information provided by oversampling of cases relative to their prevalence in the
population. In the presence of G×E interaction, this oversampling of cases induces an
association between G and E in the combined case-control sample. Although it would be
possible to develop an alternative one-step test based on a likelihood that incorporates
this additional information, such a test would not preserve the type I error in the presence
of population level G-E association. In the GWAS context, however, we can use the
additional information derived from the oversampling of cases in a screening step to
reduce the number of SNPs to be tested in the second step. When the power of the first-
step screening test is high, the chance that a true positive will be carried to the second
step is also high. At the same time, a large number of null SNPs will be eliminated by the
first-step screen. This reduces the multiple testing burden and results in our observed
gain in power to detect interaction at the causal locus.
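The induced G-E association described above can be illustrated with a small calculation. The Python sketch below rests on several stated assumptions that are not taken from the dissertation: dominant coding under Hardy-Weinberg equilibrium, G and E independent in the population, $p_0$ treated as the baseline risk in unexposed non-carriers, and a hypothetical function name. It computes the expected G-E odds ratio in a combined sample with equal numbers of cases and controls.

```python
def combined_sample_ge_odds_ratio(q_a, p_e, p0, r_g, r_e, r_ge, case_fraction=0.5):
    """Expected G-E odds ratio in a combined case-control sample.

    Assumptions (illustrative only): dominant genotype coding under
    Hardy-Weinberg equilibrium, G and E independent in the population,
    and p0 interpreted as the baseline risk for G=0, E=0.
    """
    carrier = 1.0 - (1.0 - q_a) ** 2          # P(G=1) under dominant coding
    p_gene = {0: 1.0 - carrier, 1: carrier}
    p_env = {0: 1.0 - p_e, 1: p_e}
    base_odds = p0 / (1.0 - p0)

    cases, controls = {}, {}
    for g in (0, 1):
        for e in (0, 1):
            odds = base_odds * (r_g ** g) * (r_e ** e) * (r_ge ** (g * e))
            risk = odds / (1.0 + odds)
            pop = p_gene[g] * p_env[e]        # population cell probability
            cases[(g, e)] = pop * risk
            controls[(g, e)] = pop * (1.0 - risk)

    case_total = sum(cases.values())
    control_total = sum(controls.values())
    # Mix the case and control cell distributions in the sampled proportions.
    mix = {
        cell: case_fraction * cases[cell] / case_total
        + (1.0 - case_fraction) * controls[cell] / control_total
        for cell in cases
    }
    return (mix[(1, 1)] * mix[(0, 0)]) / (mix[(1, 0)] * mix[(0, 1)])


# Base-model-like settings: q_A = 0.2, p_E = 0.5, p_0 = 0.05, no main effects.
null_or = combined_sample_ge_odds_ratio(0.2, 0.5, 0.05, 1.0, 1.0, 1.0)
induced_or = combined_sample_ge_odds_ratio(0.2, 0.5, 0.05, 1.0, 1.0, 3.0)
```

Under these assumptions the odds ratio is 1 when there is no interaction and exceeds 1 when $R_{ge}$ = 3, which is the signal that the Step 1 screen is designed to pick up.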
CHAPTER II: SOFTWARE PACKAGE TO CALCULATE POWER AND REQUIRED
SAMPLE SIZE
Previously, we investigated the behavior of the Environment-Gene Two-Step
(EG2) method under a wide variety of parameters by simulation. Although the
simulation framework was flexible enough to get a better understanding of the benefits
and possible pitfalls of the new method, it is not optimal for answering many of the
additional questions that can be asked. Using likelihood theory, we can optimize the
previously proposed two-step method by understanding the distribution of power as a
function of the Step 1 threshold (α_A), underlying population parameters (q_A, p_E, p_0, R_g,
R_e, R_ge), and number of markers (S). We can use this framework to fully explore the
properties of the EG2 method without relying on simulation. Additionally, extension of
the two-step approach to a test for interaction between genotype and a continuous
environmental exposure can be fully explored quickly and efficiently.
We have shown that implementing a two-step approach to screen for G×E
interactions in a GWAS can improve power to detect novel loci. However, our previous
work has done little to guide investigators to use this method. For study design
considerations, it is important to know the sample size required to detect these types of
effects, how best to optimize the analysis and get a realistic expectation of what types of
interaction effects can be detected in a GWAS. Using likelihood theory and efficiently
designed software, we can provide guidance on how to optimally design and analyze
GWAS data to discover novel genetic markers and regions involved in interactions. This
software is also easily adapted to compute power and sample size for alternative
approaches that may be used to analyze G×E interactions.
Alternative Two-Step Approaches to Detect G×E Interactions in a GWAS
We have shown the Environment-Gene Two-Step (EG2) analysis that screens on
the association between the environmental factor and each genetic marker is a powerful
alternative to the traditional approach to detect G×E in a case-control GWAS. Recently,
Kooperberg et al (Kooperberg and Leblanc 2008) proposed an alternative two-step
approach to detect gene-gene (G×G) interactions. They showed improved power to
detect G×G interactions by screening on marginal genetic effects for the individual
markers at a liberal screening p-value. This screening could either be restricted to testing
only those pairs that both pass a marginal test for genetic effect or those ‘significant’
markers at the marginal level could be tested against all available markers. Either
method would reduce the total number of interaction tests conducted, and thus reduce the
correction for multiple comparisons. This methodology can easily be extended to
investigate gene-environment interactions.
Disease-Gene Two-Step (DG2)
In this scenario, one would test for G×E interaction for only those markers that
have a marginal genetic effect at a liberal significance threshold, α_M. The formal analysis
structure would be as follows:
Step 1, screening test: Fit the logistic model logit(Pr(D = 1|G)) = β_0 + β_M G for all S
SNPs to test for a genetic marginal association. Test the hypothesis H_0: β_M = 0
with a 1 DF likelihood ratio test at a pre-specified significance level, α_M. A 2 DF test can
be used for co-dominantly coded genotypes.
Step 2, case-control test: For those s_M markers that pass Step 1, fit the full logistic
regression model, equation (1), and test the hypothesis H_0: β_ge = 0 at significance level
α/s_M, i.e. correcting only for the number of markers formally tested in Step 2.
Like the Environment-Gene Two-Step (EG2), the DG2 approach allows the
investigator the choice of significance level for Step 1, α_M. Kooperberg et al
(Kooperberg and Leblanc 2008) show that although Steps 1 and 2 are not independent,
this dependence is small enough that the DG2 maintains acceptable Type I error rates
across a wide range of scenarios by simulation for gene-gene interactions. This result
should also hold for G×E interactions.
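As a rough illustration of the screen-then-test structure shared by the DG2 and EG2, the following Python sketch applies a Step 1 filter and a Bonferroni-corrected Step 2 to precomputed p-values. The function name, its arguments, and the toy p-values are hypothetical and are not part of the software described in this chapter; the model fitting that produces the p-values is assumed to happen elsewhere.

```python
def two_step_scan(screen_p, interaction_p, alpha_screen, alpha=0.05):
    """Generic two-step scan on precomputed p-values (illustrative only).

    screen_p      -- Step 1 p-values, one per marker (marginal G test for DG2)
    interaction_p -- Step 2 G x E interaction p-values, one per marker
    alpha_screen  -- Step 1 threshold (alpha_M for DG2, alpha_A for EG2)
    alpha         -- desired experiment-wise Type I error rate
    """
    # Step 1: keep only markers whose screening p-value passes the threshold.
    passed = [i for i, p in enumerate(screen_p) if p < alpha_screen]
    if not passed:
        return passed, []
    # Step 2: Bonferroni correction over the markers that passed, not all S.
    threshold = alpha / len(passed)
    significant = [i for i in passed if interaction_p[i] < threshold]
    return passed, significant


# Toy data: four markers with made-up p-values.
screen_p = [0.001, 0.5, 0.02, 0.9]
interaction_p = [0.01, 1e-9, 0.04, 0.5]
passed, significant = two_step_scan(screen_p, interaction_p, alpha_screen=0.05)
print(passed, significant)
```

In this toy input the second marker has a very small interaction p-value but is never tested, because it fails the Step 1 screen; this is the price paid for the reduced multiple-testing correction in Step 2.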
Hybrid Two-Step (H2)
Finally, we explore the possibility of combining the latter method with the EG2
approach into a single scan by allocating a fraction of the total experiment-wise Type I
error rate to each scan. Specifically, any SNP that shows a G-E association in the
combined case-control sample at an α_A significance level in the EG2 Step 1, or a genetic
marginal effect at an α_M significance level in the DG2 Step 1, will be formally tested for
G×E interaction in Step 2. The allocation of the overall α is defined to be ρα for the
Environment-Gene Two-Step (EG2) and (1-ρ)α for the Disease-Gene Two-Step (DG2)
procedure, where ρ can take any value between 0 and 1. If a SNP passes the
screening step of the EG2 method but not Step 1 of the DG2, it would be formally tested
in equation (1) at ρα/s_A. Similarly, if a SNP passes the screening step of the DG2 method
but not Step 1 of the EG2, it would be formally tested in equation (1) at (1-ρ)α/s_M.
Lastly, if a SNP were to pass both the G-E and marginal G screening steps, it would be
tested at both significance levels. For this type of SNP, genome-wide significance would
be achieved if the test of interaction achieves the more liberal threshold for Step 2, i.e.
p < max(ρα/s_A, (1-ρ)α/s_M). Under the null hypothesis, this overlap will be negligible.
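The Step 2 decision rule of the hybrid approach can be sketched as a small Python helper. This is only an illustration of the rule as stated above; the function name and the example counts s_A = 100 and s_M = 1000 are hypothetical.

```python
import math


def hybrid_step2_threshold(passed_eg, passed_dg, s_a, s_m, rho=0.5, alpha=0.05):
    """Step 2 significance threshold for one SNP under the hybrid (H2) rule.

    passed_eg, passed_dg -- whether the SNP passed the EG2 / DG2 screen
    s_a, s_m             -- number of SNPs passing the EG2 / DG2 screen
    rho                  -- share of alpha allocated to the EG2 arm
    Returns None when the SNP passes neither screen (it is not tested).
    """
    thresholds = []
    if passed_eg:
        thresholds.append(rho * alpha / s_a)
    if passed_dg:
        thresholds.append((1.0 - rho) * alpha / s_m)
    # A SNP passing both screens is judged against the more liberal threshold.
    return max(thresholds) if thresholds else None


# Hypothetical screen sizes: 100 SNPs pass the EG2 screen, 1000 pass the DG2 screen.
eg_only = hybrid_step2_threshold(True, False, s_a=100, s_m=1000)
dg_only = hybrid_step2_threshold(False, True, s_a=100, s_m=1000)
both = hybrid_step2_threshold(True, True, s_a=100, s_m=1000)
neither = hybrid_step2_threshold(False, False, s_a=100, s_m=1000)
print(eg_only, dg_only, both, neither)
```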
Development of Design Software
The following sections outline the methods used to develop a general software
program that will allow users to compute power or sample size for G×E scans based on
any of the approaches described above.
Notation
Assume the same notation as was described previously. Specifically, let D be an
indicator of disease status, with D=1 being a case and D=0, a control. Assume E
represents an environmental factor of interest. For these comparisons, we consider the
situations where E is either binary or continuously distributed. For the binary case, E can
be either 1 or 0, indicating presence or absence of the environmental exposure. For a
binary environmental exposure, define p_E as the exposure prevalence, Pr(E=1). For a
continuous E, assume a log-normal distribution, such that ln(E) ~ N(µ = 0, σ² = 1).
Assume S independent single nucleotide polymorphisms (SNPs) have been
genotyped on N_1 cases and N_0 controls (N = N_1 + N_0). Assume there exists a single disease
susceptibility locus (DSL), G. Further assume G follows an additive disease model, with
increasing disease risk per allele, but all tests described can easily be extended to other
penetrance models (e.g. dominant, recessive, co-dominant). Assume G follows
Hardy-Weinberg equilibrium with a minor allele frequency of q_A. Also assume a true
underlying disease model of the form in equation (1). For the binary case, the parameters
in this model are the same as was described previously. For the scenario where E is
continuous, β_g represents the log genetic main effect per allele in the unexposed group
(E=0), β_e is the log environmental main effect per unit increase in exposure in the
non-carriers of the susceptible genotype (G=0), and β_ge is the log of the ratio of the genetic
odds ratios comparing individuals with exposures differing by a single unit of measure, i.e.
ln(OR_{g|E=e+1} / OR_{g|E=e}).
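The interpretation of the interaction parameter for a continuous exposure can be checked with a short derivation. This is a sketch that assumes equation (1) has the logistic form with linear predictor β_0 + β_g G + β_e E + β_ge GE, as implied by the likelihood given later in this chapter. Under that assumption the per-allele genetic odds ratio at exposure level e, and the ratio of such odds ratios one exposure unit apart, are

$$
\mathrm{OR}_{g \mid E=e} = e^{\beta_g + \beta_{ge} e},
\qquad
\frac{\mathrm{OR}_{g \mid E=e+1}}{\mathrm{OR}_{g \mid E=e}}
= \frac{e^{\beta_g + \beta_{ge}(e+1)}}{e^{\beta_g + \beta_{ge} e}}
= e^{\beta_{ge}}
$$

so the ratio does not depend on e, and the interaction odds ratio per unit of exposure is the exponential of the interaction coefficient.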
Likelihood Theory
To develop this likelihood framework, we assume that exposure and genotype
data are available on all cases and unaffected controls. The likelihood for the
unconditional logistic regression model including the interaction term has the form
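$$
L(\beta_0, \beta_g, \beta_e, \beta_{ge}) = \prod_{i=1}^{N_1} \frac{e^{\beta_0 + \beta_g G_i + \beta_e E_i + \beta_{ge} G_i E_i}}{1 + e^{\beta_0 + \beta_g G_i + \beta_e E_i + \beta_{ge} G_i E_i}} \cdot \prod_{j=1}^{N_1 \times K} \frac{1}{1 + e^{\beta_0 + \beta_g G_j + \beta_e E_j + \beta_{ge} G_j E_j}} \tag{2}
$$
where the first and second products are taken over the N_1 cases and N_1×K controls,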
respectively. Maximum likelihood estimates (MLEs) obtained from this model are
consistent estimators of the log odds ratio parameters from the logistic model
$$
\Pr(D = 1 \mid G, E) = \frac{e^{\gamma_0 + \gamma_g G + \gamma_e E + \gamma_{ge} G E}}{1 + e^{\gamma_0 + \gamma_g G + \gamma_e E + \gamma_{ge} G E}} \tag{3}
$$
In equation (3), the baseline probability of disease is exp(γ_0)/(1 + exp(γ_0)), and the
parameters exp(γ_g), exp(γ_e), and exp(γ_ge) are the genetic, environmental, and interaction
odds ratios, respectively.
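To make the structure of the case-control likelihood concrete, the following Python sketch evaluates the logarithm of the likelihood in equation (2) for a small data set. It is an illustration only: the function name and the toy (G, E) pairs are hypothetical, and no maximization is performed.

```python
import math


def log_likelihood(beta, cases, controls):
    """Log of the case-control likelihood in equation (2).

    beta            -- tuple (b0, bg, be, bge) of logistic coefficients
    cases, controls -- lists of (G, E) pairs, one pair per subject
    """
    b0, bg, be, bge = beta

    def eta(g, e):
        # Linear predictor including the G x E interaction term.
        return b0 + bg * g + be * e + bge * g * e

    ll = 0.0
    for g, e in cases:
        # Case contribution: log of e^eta / (1 + e^eta).
        ll += eta(g, e) - math.log1p(math.exp(eta(g, e)))
    for g, e in controls:
        # Control contribution: log of 1 / (1 + e^eta).
        ll -= math.log1p(math.exp(eta(g, e)))
    return ll


# Toy data: two cases and three controls with made-up genotypes and exposures.
cases = [(1, 1), (2, 0)]
controls = [(0, 0), (1, 0), (0, 1)]
ll_null = log_likelihood((0.0, 0.0, 0.0, 0.0), cases, controls)
ll_inter = log_likelihood((0.0, 0.0, 0.0, math.log(2.0)), cases, controls)
print(ll_null, ll_inter)
```

With all coefficients at zero every subject contributes ln(0.5), which gives a quick sanity check on the implementation.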
In a case-only design, the interaction between gene and environment is
approximated by the odds ratio between gene and environment in the cases alone.
Specifically, the likelihood for the association between gene and environment has the form
(4)
In equation (4), β_ge is a consistent estimator of the log relative risk ratio parameter, γ_ge,
from the log-linear model
$$
\Pr(D = 1 \mid G, E) = e^{\gamma_0 + \gamma_g G + \gamma_e E + \gamma_{ge} G E} \tag{5}
$$
(Yang, Khoury et al. 1997).
The Environment-Gene Two-Step analysis incorporates a modification of the
case-only analysis by including controls in the calculation of the first step parameter.
Specifically, the likelihood for the first step has the form
$$
L(\mu, \beta_A) = \prod_{i=1}^{N_1} \frac{e^{(\mu + \beta_A E_i) G_i}}{1 + e^{\mu + \beta_A E_i}} \cdot \prod_{j=1}^{N_1 \times K} \frac{e^{(\mu + \beta_A E_j) G_j}}{1 + e^{\mu + \beta_A E_j}} \tag{6}
$$
where the first and second products are taken over the N_1 cases and N_1×K controls,
respectively. The second step is exactly the same test as that used in the traditional test for
gene-environment interaction, as described in equation (2).
Finally, the marginal screening step for the DG2 approach
ignores information on the environmental factor to test for a genetic effect in all subjects.
This likelihood has the form
$$
L(\mu^{*}, \beta_M) = \prod_{i=1}^{N_1} \frac{e^{\mu^{*} + \beta_M G_i}}{1 + e^{\mu^{*} + \beta_M G_i}} \cdot \prod_{j=1}^{N_1 \times K} \frac{1}{1 + e^{\mu^{*} + \beta_M G_j}} \tag{7}
$$
where the products are taken over the N_1 cases and N_1×K controls, respectively. In
equation (7), exp(β_M) is the marginal genetic odds ratio. Like the EG2, the second step
involves the same test that is traditionally used to test for G×E interaction in a
case-control sample, as described in equation (2).
To calculate power for the two-step tests, we implement the approach used in
QUANTO (Gauderman and Morrison 2006). We assume the hypothesis of interest is H_0:
β_ge = 0, tested against the 2-sided alternative hypothesis H_1: β_ge ≠ 0. We let Λ = {γ_0, γ_g, γ_e,
γ_ge} denote the assumed parameter values for the logistic (equation 3) or log-linear model
(equation 5). The additional parameters, minor allele frequency (q_A), prevalence of
exposure (p_E), and baseline prevalence (p_0), are denoted as Ω = {q_A, p_E, p_0}. The following
steps are utilized to calculate sample size or power for a given set of model parameter
settings.
1. Maximize the expected log-likelihood L_1 = ln[L(β_g, β_e, β_ge)] with respect to
the distribution of the observable genotype and exposure data conditional
on the true parameters Λ and Ω.
2. Similarly, maximize the expected log-likelihood L_0 = ln[L(β_g, β_e)], i.e. the
expectation of the log-likelihood when β_ge is fixed at zero.
3. Define Δ = 2(L̂_1 − L̂_0). The value Δ is the likelihood ratio statistic for a
single sampling unit based on the expected maximum log-likelihoods
under the alternative and null hypotheses. For a given N_1 sample units,
N_1Δ is the non-centrality parameter of the chi-squared distribution under
the alternative hypothesis.
4. For a two-sided alternative hypothesis, the required number of sampling
units is computed as
N_1 = (z_{a/2} + z_b)² / Δ
where a is the significance level, 1-b is the desired power, and z_u denotes
the (1-u)th percentile of the standard normal distribution.
5. For a given N_1, power can be calculated as
1-b = Φ(√(N_1Δ) − z_{a/2}),
where Φ is the standard normal cumulative distribution function; this
follows from solving the expression in step 4 for b.
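Steps 4 and 5 can be sketched in a few lines of Python using only the standard library. The power expression used here, Φ(√(N_1Δ) − z_{a/2}), is obtained by solving the sample size formula in step 4 for b, and the value of Δ is a hypothetical placeholder; in practice Δ would come from steps 1 to 3. Function names are illustrative and do not correspond to the R functions described in this chapter.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def z(u):
    """Upper-tail quantile z_u: the point with probability u to its right."""
    return _STD_NORMAL.inv_cdf(1.0 - u)


def required_cases(delta, a=0.05, power=0.80):
    """Step 4: number of sampling units needed for a two-sided test."""
    b = 1.0 - power
    return (z(a / 2.0) + z(b)) ** 2 / delta


def power_for_cases(n1, delta, a=0.05):
    """Step 5: power for a given number of sampling units (step 4 inverted)."""
    return _STD_NORMAL.cdf(sqrt(n1 * delta) - z(a / 2.0))


delta = 0.01  # hypothetical per-unit likelihood ratio statistic
n1 = required_cases(delta)
achieved = power_for_cases(n1, delta)
print(round(n1, 1), round(achieved, 3))
```

Plugging the required sample size back into the power function returns the target power, which is a useful consistency check when adapting the sketch.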
For the EG2, since the screening and testing steps are statistically independent
(Murcray, Lewinger et al. 2009), power can be calculated as the product of the powers
of the first-step screening test and the second-step interaction test. Power for Step 1 is
calculated with the significance threshold set at a designated α_A. For Step 2, the significance
level will be defined as the desired experiment-wise Type I error rate divided by the number of
markers that pass Step 1, or α/s_A. Similarly, for the DG2, power is calculated as the
product of the powers from the two testing steps. Power for Step 1 is calculated with the
significance threshold set at α_M. For Step 2, the significance level will be defined as the
desired experiment-wise Type I error rate divided by the number of markers that pass
Step 1, or α/s_M. Finally, power for the Hybrid (H2) design is computed as the probability
that a true disease susceptibility locus is detected by either the EG2 or DG2 approaches.
Formally, this is represented by the following:
Pow(H2) = Pow(EG2) + Pow(DG2)
- Pow(EG2 Step 1) · Pow(DG2 Step 1) · Pow(Step 2 | α* = min(ρα/s_A, (1-ρ)α/s_M))
where α* is the corrected significance level for Step 2.
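The way the step-wise powers combine can be written out directly. The sketch below treats the individual step powers as given inputs (all numeric values are hypothetical) and simply applies the product rule for a two-step test and the inclusion-exclusion expression for the hybrid design.

```python
def two_step_power(step1_power, step2_power):
    """Power of a two-step test when the two steps are treated as independent."""
    return step1_power * step2_power


def hybrid_power(eg_step1, eg_step2, dg_step1, dg_step2, step2_strict):
    """Power of the hybrid (H2) design.

    eg_step2, dg_step2 -- Step 2 power at rho*alpha/s_A and (1-rho)*alpha/s_M
    step2_strict       -- Step 2 power at the stricter (minimum) of the two
                          thresholds, used for the overlap term
    """
    overlap = eg_step1 * dg_step1 * step2_strict
    return (two_step_power(eg_step1, eg_step2)
            + two_step_power(dg_step1, dg_step2)
            - overlap)


# Hypothetical step-wise powers for a single disease susceptibility locus.
pow_eg2 = two_step_power(0.9, 0.8)
pow_dg2 = two_step_power(0.5, 0.7)
pow_h2 = hybrid_power(0.9, 0.8, 0.5, 0.7, step2_strict=0.7)
print(pow_eg2, pow_dg2, pow_h2)
```

Because the stricter threshold gives the lower Step 2 power, step2_strict should equal the smaller of the two Step 2 powers, as it does in this example.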
Using this empirical likelihood method to determine power of the two-step
methods requires information about the number of markers that will pass Step 1 (i.e. s_A,
s_M). Under the assumption of independence between gene and environment for all
markers (S), the number of SNPs that will pass to Step 2 based solely on Type I error will
follow a Binom(S, α_A) distribution, where α_A is the Step 1 significance threshold for the
EG2 approach. Similarly, s_M ~ Binom(S, α_M). For the large S we expect to see in
genome-wide association studies, the number of markers we would expect to pass Step 1
can be well approximated by α_A S for the EG2 (α_M S for the DG2) since there will be a
large point-mass in the distribution, with small variance around this mass due to large S
(Figure II). However, it is not likely true that the assumption of independence between
gene and environment will hold across the large number of markers being tested. To
investigate the effect of additional detectable G-E associations passing to Step 2, we
incorporate a parameter (p_ge) to allow this value to be specified in the likelihood power
calculations of the EG2 and H2 methods.
Figure II. Binomial distribution for S=10,000 markers and α_A=0.05.
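The binomial approximation described above for the number of null markers passing Step 1 is easy to check numerically. The following sketch computes the mean and standard deviation of Binom(S, α) for the setting of Figure II; the function name is hypothetical.

```python
from math import sqrt


def step1_pass_count(S, alpha_screen):
    """Mean and standard deviation of the number of null markers passing Step 1,
    assuming the count follows a Binom(S, alpha_screen) distribution."""
    mean = S * alpha_screen
    sd = sqrt(S * alpha_screen * (1.0 - alpha_screen))
    return mean, sd


# Setting of Figure II: S = 10,000 markers screened at alpha_A = 0.05.
mean, sd = step1_pass_count(10_000, 0.05)
print(mean, round(sd, 2), round(sd / mean, 3))
```

The standard deviation is only a few percent of the mean in this setting, which is why replacing the random count with its expectation changes the Step 2 threshold very little.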
The likelihood methodology to calculate power of the two-step methods has been
coded in the software package R (Team 2009). In Table II, power for the EG2 from the
simulations (Chapter I) and analytic power calculations are compared. All analytic
values for power are close to those we reported from the simulations. This new tool
allows for a quick and efficient exploration of the properties of the proposed two-step
tests.
Table II. Simulated and Empirical Power for Case-Control One-Step (CC) and
Environment-Gene Two-Step (EG2) Tests for Gene-Environment Interaction
Since there does not exist a closed form solution to calculate the required sample
size for the two-step methods, these values are calculated iteratively. Specifically, for an
initial sample size value N_0, initial power (P_0) is calculated. Power at the next iteration is
computed for a value of N that is greater or less than N_0 depending on whether P_0 < P (or P_0 >
P), where P is the desired study power. The output sample size is the value N_k such
that P_k is sufficiently close to P (i.e. within a specified tolerance, (P-ε, P+ε)). Validity of
these sample size estimates was ensured by comparison to pre-existing software designed
for case-control studies (Gauderman and Morrison 2006).
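One way to carry out such an iterative search is bisection on the sample size. The text does not specify the update rule used by the software, so bisection is an assumption made here for illustration, as are the function names and the per-unit statistics (0.02 and 0.01) and thresholds in the demonstration power function.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def single_test_power(n, delta, a):
    """Two-sided power approximation for n sampling units at level a."""
    return _STD_NORMAL.cdf(sqrt(n * delta) - _STD_NORMAL.inv_cdf(1.0 - a / 2.0))


def find_sample_size(power_fn, target=0.80, eps=1e-3, n_low=1, n_high=1_000_000):
    """Bisection search for a sample size whose power is close to the target.

    Assumes power_fn is increasing in N. Returns the first N found with power
    within eps of the target, or else the smallest N reaching the target.
    """
    if power_fn(n_high) < target:
        raise ValueError("target power not reachable within the search range")
    while n_high - n_low > 1:
        n_mid = (n_low + n_high) // 2
        p_mid = power_fn(n_mid)
        if abs(p_mid - target) < eps:
            return n_mid
        if p_mid < target:
            n_low = n_mid   # too little power: search larger N
        else:
            n_high = n_mid  # too much power: search smaller N
    return n_high


def demo_two_step_power(n):
    # Hypothetical two-step power: Step 1 screened at 1e-4, Step 2 tested at
    # 0.05 / 100 (i.e. assuming 100 markers pass the screen).
    return single_test_power(n, 0.02, 1e-4) * single_test_power(n, 0.01, 0.05 / 100)


n_needed = find_sample_size(demo_two_step_power)
print(n_needed, round(demo_two_step_power(n_needed), 4))
```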
Relative Efficiency of the Various Designs
As mentioned above, the described software functions provide a computationally
efficient means of calculating power or sample size for any of the analytic approaches
described. In this section, we apply this software to compare the necessary sample size
required to achieve 80% power to detect a true interaction effect between gene and
environment across the various approaches. We report relative efficiency of each method
compared to the Case-Control (CC) test of interaction, defined as the ratio of required
sample sizes (RE = N_CC/N_1, where N_CC = number of cases required for the CC method
and N_1 = number of cases required for the comparison approach). A ratio greater than 1.0
indicates that the comparison approach makes more efficient use of the available data
than the CC approach, or equivalently that the comparison approach requires a lower
sample size to achieve the same power.
We define a ‘base’ model scenario and modify each parameter setting
individually to determine the effects of each parameter on sample size comparisons. For
the base model, we assume a binary exposure (E) with exposure prevalence (p_E) 0.3 and
that S = 1 million SNPs are genotyped. We assume that the DSL has minor allele
frequency (q_A) of 0.15 and an additive (0,1,2) coding of alleles in the disease model. We
assume a baseline disease prevalence of p_0 = 0.01, no genetic or environmental main
effects (i.e. R_g = R_e = 1), and an interaction effect size of R_ge = 2.0. For the ‘base’ model
scenario, we assume that no additional null markers have a detectable G-E association
that will pass Step 1 for the EG2 or EG2 portion of the H2 method (p_ge = 0.0).
Results
The Case-Only (CO) analysis was always the most efficient choice to test for
G×E interaction across a range of interaction effect sizes, R_ge (Figure III), which is not
surprising given past reports (Piegorsch, Weinberg et al. 1994; Khoury and Flanders
1996; Wang and Lee 2008; Li and Conti 2009). Although this approach is potentially
biased in practice, we include it in the comparison of relative efficiencies to show a lower
bound for sample size requirements for a GWAS. For small interaction effects (1.5-2.0),
there are noticeable differences in the required sample sizes to achieve 80% power for the
two-step approaches. To detect an R_ge = 1.5, the least efficient two-step strategy is to
screen on marginal effects with a required sample size of N_1 = 7,932 cases (and N_0 =
7,932 controls), only slightly more efficient than the standard CC approach. On the other
hand, the Environment-Gene Two-Step (EG2) approach requires 4,468 cases under the
same assumed parameter settings. As the interaction effect size gets larger, the relative
efficiencies (RE) of the methods remain relatively constant compared to the Case-Control
(CC) test (DG2 ≈ 1.85, H2 ≈ 1.95, EG2 ≈ 2.08, CO ≈ 2.72).
Figure III. Sample Size Required to Achieve 80% Power for Tests of Gene-Environment
Interaction in a Genome-wide Association Study by Interaction Effect Size for Binary
Environmental Exposure
The EG2 test was often more efficient relative to the CC test than the DG2 (Table
III). The RE of the latter method was more sensitive to assumptions about exposure
prevalence and minor allele frequency than the EG2 or Hybrid (H2) approaches.
Specifically, if the exposure is rare (p_E = 0.1), the RE of the DG2 test is 0.36, less
efficient than the traditional one-step scan (CC). On the other hand, the DG2 is the most
efficient of the two-step methods (RE = 2.0) for a more common exposure, p_E = 0.5.
Minor allele frequencies (q_A), exposure prevalence (p_E), and main effects (R_g, R_e) have
little effect on the RE of the EG2 and H2 approaches. However, the EG2 test is sensitive
to disease prevalence, with higher RE for rare diseases (RE=1.90 when p_0=0.01) than for
a more common disease (RE=1.37 when p_0=0.10). The relative efficiency of the EG2
decreased dramatically when a large number of markers had a G-E association in the
population (p_ge > 0). For example, if 10,000 markers (p_ge = 0.01) have a detectable G-E
association in Step 1, the RE of the EG2 approach decreases by 56%, from RE = 1.90 to
1.34. As the proportion of markers increases to 100%, the relative efficiency of the EG2
approaches 1.0, i.e. requiring equal sample size to that of the CC approach. The H2 has a
similar trend, with decreasing efficiency for increasing number of markers with a G-E
association; however, the decrease is more gradual (a decrease of 28% in RE for p_ge =
0.01). In general, the H2 is a robust approach that provides either the best or nearly the
best efficiency across a wide range of models.
Table III. Sample Size (N) and Relative Efficiency (RE) Required to Achieve 80%
Power to Detect a True Gene-Environment Interaction for a Collection of Testing
Strategies across a Range of Parameter Settings
Both the EG2 and DG2 scans can be optimized for a set of assumed population
parameters as a function of the Step 1 significance thresholds, α_A and α_M. The EG2 was
more efficient for stricter significance thresholds, α_A = 1.0E-05 for the base model
parameters (RE = 1.89) (Table III). Conversely, the DG2 analysis was more efficient for
a more liberal screening threshold, α_M = 1.0E-03 (RE = 1.41). There does exist a single
optimal choice for both α_A and α_M for a single set of parameter settings. However, as
many of these parameters are unknown at the time of study design, a robust choice can be
made across minor allele frequencies and penetrance models. For the base model, the H2
can be optimized across both significance thresholds with the required sample size across
choices for α_M being flatter than across a range of α_A near the optimal choice (Figure IV).
Specifically, for α_M ∈ (6.0E-04, 1.4E-03) and α_A ∈ (8.0E-06, 1.0E-06) the minimum
sample size required to achieve 80% power is 1,298 (RE = 2.04). Although there exists a
single optimal choice for α_M and α_A, there is only an 8% drop in relative efficiency for
the H2 approach when α_M=0.01 and α_A=1.0E-4, outside the optimal range (RE=1.87).
The software we have developed can be used to choose these significance thresholds
based on optimizing power or can calculate sample size given assumed significance
thresholds.
Figure IV. Sample Size Required to Achieve 80% Power for the Hybrid Two-Step Test
(H2) of Gene-Environment Interaction in a Genome-wide Association Study by Step 1
Significance Thresholds for the Disease-Gene and Environment-Gene (EG2) Two-Step
Tests for a Binary Exposure and no Genetic Main Effect (R_g = 1.0).
The H2 approach is robust to the choice of ρ, the allocation of the overall Type I
error rate to the EG2 method, with the relative efficiency remaining relatively flat across
a range of interaction effect sizes (Figure V). Generally, the H2 has the highest
efficiency when ρ ≥ 0.5, except when there exists a non-zero main effect (R_g > 1.0), a
common exposure (p_E = 0.5), or when more controls than cases are sampled from the
population (e.g. case:control ratio = 1:3) (Table IV). The H2 is often most powerful
when ρ=0.9 and is robust to the specific values of several population parameters,
including minor allele frequency and genetic main effect. Except when there is a sizeable
genetic main effect (R_g ≥ 1.3) or for a rare exposure (p_E = 0.1), the H2 approach was always
more powerful than the EG2 or DG2 alone for some choice of ρ.
Figure V. Relative Efficiency of the Hybrid Two-Step Test (H2) of Gene-Environment
Interaction in a Genome-wide Association Study by ρ, the Allocation of the
Experiment-wise Significance Level (α) to the Environment-Gene (EG2) Two-Step Test
All of the tests described can be applied to test for interaction between G and a
continuous environmental factor. In general, the relative efficiencies of the methods are
similar to the binary E situation. Under our base model parameters, the relative
efficiencies of the two-step methods are similar when the interaction effect size is of a
modest size (R_ge = 1.15) (Figure VI). For an interaction effect size of 1.3, the EG2, DG2
and H2 tests all converge to be approximately twice as efficient as the traditional CC test
(EG2 = 2.12, DG2 = 2.02, H2 = 1.95).
Table IV. Sample Size (N) Required and Relative Efficiency (RE) to Achieve 80%
Power to Detect a True Gene-Environment Interaction for the Hybrid Two-Step Analysis
for Various Allocations of α to the Association Two Step (ρ).
Figure VI. Sample Size Required to Achieve 80% Power for Tests of Gene-Environment
Interaction in a Genome-wide Association Study by Interaction Effect Size for a
Continuous Environmental Exposure in the Absence of a Genetic Main Effect (R_g = 1.0)
Conclusions
I have described an efficient tool based on likelihood theory to calculate power and
sample size requirements to detect G×E interactions in a GWAS. Using this resource, I
demonstrated that increased efficiency for testing G×E interactions can be achieved by
screening the large number of markers tested using G-E association or genetic marginal
screening steps. We also showed that a hybrid approach (H2) that combines the strengths
of both screening procedures is often a robust choice across a wide range of parameters.
The software tool can also be used to compute optimal power for the two-step tests as a
function of the Step 1 significance thresholds (α_A, α_M) for individual study designs and
parameters, and to explore the robustness of estimated power to deviations in model
parameters.
For a binary environmental factor, the Environment-Gene Two-Step was often
more efficient than both the Disease-Gene and Hybrid Two-Step approaches in the
absence of a marginal genetic effect. However, the EG2 is more sensitive to the number
of markers that have a population-level association with the environmental factor (either
real or induced by population stratification) than the DG2 or H2 approaches. As the
number of markers with detectable G-E associations increases, the relative efficiency of
the EG2 approaches 1.0 compared to the CC design. As with the case-only analysis,
adjustment for population structure using STRUCTURE (Pritchard, Stephens et al. 2000)
or EIGENSTRAT (Price, Patterson et al. 2006) could reduce the number of false
associations passing to Step 2. Although G-E associations do not affect the relative
efficiency of the DG2, the total number of markers that are passed through Step 1 is still a
potential concern. Admixture bias in marginal effect scans for GWAS has been widely
discussed (Pritchard and Rosenberg 1999; Sarasua, Collins et al. 2009; Wang 2009). If
proper correction is not implemented, the relative efficiency of the DG2 will likely show
decreases similar to those of the EG2 approach, given that many SNPs will pass Step 1 of the
DG2 screen without correction for population stratification bias.
The efficiency of the EG2 approach is sensitive to the population prevalence of
the disease being studied. This is because the additional power of the EG2 procedure
comes from exploiting independent information provided by over sampling of cases
relative to their prevalence in the population. When the population disease prevalence
becomes closer to the case ratio in the study sample, the ascertainment of cases becomes
less informative. Therefore, the EG2 method is less desirable for more common diseases,
such as asthma. The DG2 method is also sensitive to the baseline disease prevalence but
the relative efficiency of the DG2 approach converges to 1.0 more slowly than the EG2.
The H2 approach is the most robust choice to baseline disease prevalence and would be a
good choice for more common disease outcomes.
It is possible that there exists a SNP that is involved in a G×E interaction that also
is associated with the environmental factor of interest. For those SNPs with an
association and interaction in the same direction, the combined true and induced
associations will increase the power of the screening step for the EG2. However, when
the association and interaction effects are in opposite directions, the power of the EG2
approach can be poor, as the pooling of these types of effects will negate any power
induced by ascertainment of cases in the first step. Although the likelihood of this
phenomenon is low, it is a potential weakness of the EG2. The H2 approach would be a
good compromise for this scenario since this phenomenon would have little effect on the
Disease-Gene (DG2) screening approach.
Since it is not possible to know whether a true interaction will be accompanied by
marginal effects, it may be advantageous to use the Hybrid approach that has good power
to detect a wide variety of penetrance models. Since a marginal effect scan is likely to be
the first analysis applied to GWAS data, the screening p-values should be available
without any additional computational time or effort. The incorporation of the EG2 would
only require an additional scan of associations between G and E in the combined sample.
The Hybrid approach is often a compromise between the efficiencies of the Disease-Gene
and Environment-Gene Two-Steps, with required sample sizes often falling within the
range of the EG2 and DG2 methods. Though the H2 approach is not often the most
efficient method, it is nearly optimal across all models investigated. This is beneficial
when little is known a priori about the types of interactions that occur for complex
diseases.
We have developed a software program to compare sample size requirements for
GWAS. This tool can be used in the design of future studies and to optimize the analysis
using two-step approaches by choosing significance thresholds appropriate to specific
studies. To calculate power and required sample size, this software package requires
specification of estimable population parameters (i.e. p_E, R_e), assumptions about the
underlying disease mechanism (i.e. p_0, R_g, R_ge, q_A) and study design characteristics (i.e.
S, N_1, K). Using these inputs, the various program outputs available include power,
required sample size, optimal choices for screening thresholds (α_A, α_M), and allocation of
the experiment-wise Type I error rates (ρ) for the H2 approach. As an example,
Appendix A summarizes the output of a subset of the utilities of this software package.
For the base model parameters from Figure V and sample size required to achieve 80%
power for the Case-Control analysis (N_1 = 2,632), power for the Case-Control,
Case-Only, EG2, DG2, and H2 are found on lines 76, 90, 98, 112, 127 of the output,
respectively. An example of the optimization of the EG2 and DG2 methods as a function
of α_A and α_M is shown on lines 135-150. For the base model, the optimal choice of α_A
(6.61E-5, line 139) and α_M (1.65E-3, line 147) would produce optimal powers of 0.998
(line 142) and 0.971 (line 150) for the EG2 and DG2 respectively. Finally, to explore
power as a function of a single parameter setting, strings of parameters can be passed to
the function. For a range of interaction effect sizes (R_ge), power can be calculated for all
analysis methods in five lines of code (lines 154-170). These outputs are easily
integrated in plotting algorithms in R (Team 2009) (Figure VII) or exported for use in
other software programs (e.g. Excel, SAS).
Figure VII. Power for Case-Control (CC), Environment-Gene Two-Step (EG2),
Disease-Gene Two-Step (DG2), Hybrid Two-Step (H2), and Case-Only (CO) Analyses for
Increasing Levels of Interaction Effect Size (R_ge). All Other Parameter Settings Remain
Constant Under the ‘Base’ Model Specifications (S=1,000,000, number of
cases/controls=2632/2632, q_A=0.15, p_E=0.3, R_g = R_e = 1, p_ge=0, α_A=α_M=0.0001).
Two-step approaches that screen the large number of markers available in a
GWAS are powerful, efficient alternatives to the traditional CC approach to test G×E
interactions. The Hybrid Two-Step test (H2) that combines the G-E association and D-G
marginal screening approaches does not require the assumption of G-E independence in
the population that is so crucial to the performance of the Case-Only analysis. In
addition, this analysis framework is a robust method that provides significant
improvements in efficiency compared to the standard one-step case-control test. For
current and future GWAS studies, the Hybrid Two-Step method is an efficient choice for
detecting G×E interactions across a wide range of complex disease outcomes and
underlying disease mechanisms.
CHAPTER III: APPLICATION TO THE CHILDREN’S HEALTH STUDY
The Children’s Health Study (CHS) is a prospective cohort study that has enrolled
over 11,000 school children in southern California. The study has enrolled students at 3
separate times to 5 different cohorts. Table V summarizes the sample sizes and
follow-up of each of these cohorts. Cohorts A-D were enrolled in 12 southern California
communities to study the effects of ambient air pollution on children’s respiratory health.
Cohort E children were selected from 13 communities in southern California, including 9
of the original communities and 4 new communities.
Table V. Sample size and school grade each year of the five CHS cohorts
The two-step methods described above rely on a priori knowledge of factors that
might be expected to modify the effect of genotype on disease risk. The Children’s Health
Study (CHS) has shown evidence to suggest that both regional air quality and proximity
to traffic contribute to risk of asthma, reduced lung function growth and/or other
respiratory outcomes (McConnell, Berhane et al. 1999; McConnell, Berhane et al. 2002;
Gauderman, Avol et al. 2004; Wenten, Berhane et al. 2005; McConnell, Berhane et al.
2006; Gauderman, Vora et al. 2007; Wang, Salam et al. 2008). It has also been reported
that exposure to maternal tobacco smoke in utero increases risk of respiratory disease
outcomes (Gilliland, Berhane et al. 2000; Li, Gilliland et al. 2000; Gilliland, Li et al.
2001; Gilliland, Berhane et al. 2003; Wenten, Li et al. 2009). For the genome-wide
association study being conducted in this cohort, a scan of marginal genetic effects alone,
ignoring the potential modification of genetic effects by air pollutants and
personal exposures, might lead investigators to miss genetic variants that are important
determinants of complex respiratory diseases. In fact, in the CHS, Wang et al. (Wang,
Salam et al. 2008) showed a statistically significant interaction between maternal
smoking in utero and the Arg16Gly SNP in the ADRB2 gene that increases risk for
wheeze outcomes. However, they did not find a statistically significant marginal effect
for this locus.
Exposure Data
A detailed questionnaire was administered to all study subjects at baseline. This
questionnaire included items about health history (including early life events, personal
and family history of wheezing, asthma, bronchitis, pneumonia, and other respiratory
conditions and symptoms), residential history, housing characteristics, history of
exposure to tobacco smoke including in utero and second-hand smoke exposure, and
allergen sources such as pets and pests. Asthmatic cases were identified as children with
reported doctor-diagnosed asthma at any time during study follow-up. In utero exposure
to maternal smoking was assessed by responses to the baseline questionnaire question
“Did your child’s biological mother smoke while she was pregnant with your child?” For
each study subject, exposure to traffic-related pollutants was characterized by two types
of measures – proximity of the child’s residence to the nearest freeway or to the nearest
major non-freeway road, and model-based estimates of traffic-related air pollution at the
residence, derived from dispersion models that incorporated distance to roadways,
vehicle counts, vehicle emission rates, and meteorological conditions (Benson 1989). All
traffic variables were centered on their respective community means to focus inference
on local, within-community variation in exposures.
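Community-mean centering of this kind can be sketched in a line of base R. The data frame, its column names, and its values below are hypothetical and serve only to show the operation.

```r
# Hypothetical traffic exposure data: two communities, made-up values.
traffic <- data.frame(
  community   = c("A", "A", "A", "B", "B"),
  freeway_nox = c(10, 14, 18, 30, 34)
)

# ave() returns each subject's community mean, so the difference is the
# within-community deviation used for inference on local variation.
traffic$freeway_nox_centered <-
  traffic$freeway_nox - ave(traffic$freeway_nox, traffic$community)
```

After centering, each community has mean zero, so between-community differences in average exposure no longer contribute to the estimated effect.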
Air-pollution monitoring stations were established in each of the 12 study
communities and provided continuous monitoring data from 1994 to 2003. Each station
measured average hourly concentrations of ozone, nitrogen dioxide, and particulate
matter with aerodynamic diameter less than 10 µm (PM10). Stations also collected 2-week
integrated filter samples for measuring acid vapor and PM2.5 mass and chemistry.
We calculated yearly averages on the basis of 24 h (PM10, nitrogen dioxide) or 2-week
(PM2.5, elemental carbon, organic carbon, acid) average concentrations. For ozone, we
calculated the yearly average of the 10 a.m. to 6 p.m. (8 h daytime) average. The
distribution and correlation structure of these pollutants across communities, and their
effect on lung-function development, have been previously reported (Gauderman,
McConnell et al. 2000; Gauderman, Gilliland et al. 2002; Gauderman, Avol et al. 2004).
Table VI summarizes the distributions of the exposures considered for this application.
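The ozone averaging described above can be sketched as a two-stage mean: first the 8 h daytime average for each day, then the average of those daily values. The hourly data below are synthetic, and treating hours 10 through 17 (hour-beginning labels) as the 10 a.m. to 6 p.m. window is an assumption of this example.

```r
# Synthetic hourly ozone record for three days (values are made up):
# higher during the daytime window, lower at night, shifted by day.
hourly <- expand.grid(hour = 0:23, day = 1:3)
hourly$o3 <- ifelse(hourly$hour >= 10 & hourly$hour < 18, 60, 20) + hourly$day

# Stage 1: 8 h daytime (10 a.m. to 6 p.m.) average for each day.
daytime  <- subset(hourly, hour >= 10 & hour < 18)
daily_8h <- tapply(daytime$o3, daytime$day, mean)

# Stage 2: average of the daily daytime averages (a full year in practice).
period_mean <- mean(daily_8h)
```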
Table VI. Descriptive Statistics for Environmental Exposures Considered for Testing
G×E Interactions in the Children’s Health Study.
Genotype Data
Genome-wide data have been collected on 1,248 doctor-diagnosed asthmatics and
1,593 control subjects. Subjects were sampled from the large population of Hispanic
(HW) and non-Hispanic white (NHW) children enrolled in the CHS. Study samples were
genotyped at the USC Epigenome Center using Illumina HumanHap550, HumanHap550-
Duo or Human610-Quad BeadChip microarrays. SNPs were excluded from analysis if
they were annotated as “Intensity Only” on the Human610-Quad (S=28,369), had a call
rate < 90% (S=21,595), departed from Hardy-Weinberg equilibrium (p < 9E-8) in
controls (S=6,013), or had a minor allele frequency (MAF) < 0.01 (S=16,668). The
HumanHap550, HumanHap550-Duo and Human610-Quad respectively contain 366, 366
and 418 SNPs that overlap with a candidate gene study containing a large number of the
subjects in this study (N = 2,905). The average concordance rate between matching
subjects for the overlapping SNPs was > 99.69% for >99% of the samples having a call
rate >90%. Subjects with poor concordance with genotypes from the candidate gene
study were excluded (N=19). In addition to a scan for marginal effects, a goal in this
GWAS is to incorporate the high quality environmental exposure data to detect new loci
involved in the etiology of respiratory outcomes. After applying these quality control
measures, a total of N=2838 subjects and S=528,995 SNPs were available for analysis.
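The SNP-level exclusion criteria listed above can be expressed as a single logical filter. The sketch below applies the stated thresholds to a small hypothetical per-SNP summary table; the column names and values are assumptions made for illustration and do not come from the CHS data.

```r
# Hypothetical per-SNP summary statistics (five made-up SNPs).
snps <- data.frame(
  snp            = c("snp1", "snp2", "snp3", "snp4", "snp5"),
  intensity_only = c(FALSE, TRUE, FALSE, FALSE, FALSE),
  call_rate      = c(0.99, 0.98, 0.85, 0.97, 0.995),
  hwe_p_controls = c(0.40, 0.20, 0.30, 1e-9, 0.65),
  maf            = c(0.20, 0.10, 0.30, 0.25, 0.005),
  stringsAsFactors = FALSE
)

# Keep a SNP only if it passes every quality control criterion in the text:
# not "Intensity Only", call rate >= 90%, HWE p >= 9E-8 in controls, MAF >= 0.01.
keep <- with(snps,
             !intensity_only &
               call_rate >= 0.90 &
               hwe_p_controls >= 9e-8 &
               maf >= 0.01)
passed <- snps[keep, ]
```

In this toy table only snp1 survives; each of the other four SNPs fails exactly one criterion.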
Statistical Analysis
The Hybrid Two-Step (H2) analysis, described previously, was used to test for
gene-environment interaction effects for in utero tobacco smoke exposure, exposure to
ambient PM2.5 and ozone (O3), and four traffic-related metrics. The traffic-related
exposures considered were distance to freeway, distance to non-freeway major road, and
model-based estimates of pollution from freeways and non-freeways. Residential
distance to non-freeway major road was dichotomized as < 75 m and > 75 m, based on
results of previous studies showing markedly increased exposure and risk of asthma
within 75 m of large roadways (Zhu, Hinds et al. 2002; Gilbert, Goldberg et al. 2005).
Residential distance to freeway was dichotomized as < 500 m and > 500 m. We also
scanned for interactions with two non-environmental factors, specifically gender and
GSTM1 null genotype.
As the Children’s Health Study is composed of an admixed population of
Hispanic and non-Hispanic white children from southern California, adjustments for
population structure were included in the screening and testing steps of the Hybrid
analysis. These adjustments included an indicator for self-reported Hispanic ethnicity
based on the baseline questionnaire as well as Q-factors based on ancestry informative
markers (AIMs) describing population ancestry computed using the program
STRUCTURE (Pritchard, Stephens et al. 2000). These Q-factors reflected the
proportional ancestry for each study subject from four populations: Caucasian, African,
Asian, and Native American. In addition to adjusting for these Q-factors, the marginal
genetic model and all Step 2 interaction test models included adjustments for gender, age
at baseline, cohort, and community of residence.
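For a single SNP, the adjusted screening and testing models described above might look like the following sketch. Everything here is an assumption made for illustration: the data are simulated with no true effects, the variable names are hypothetical, logistic regression is used for all three models, and only three of the four ancestry proportions are entered because the four sum to one.

```r
set.seed(2010)
n <- 1000

# Simulated data for one SNP (no true effects; values are not CHS data).
dat <- data.frame(
  D = rbinom(n, 1, 0.44),          # case status
  G = rbinom(n, 2, 0.15),          # additive coding: 0, 1, 2 minor alleles
  E = rbinom(n, 1, 0.18),          # binary exposure (e.g. in utero smoke)
  hispanic = rbinom(n, 1, 0.5),    # self-reported ethnicity indicator
  Q1 = runif(n), Q2 = runif(n), Q3 = runif(n),  # placeholder Q-factors
  male = rbinom(n, 1, 0.5),
  age = rnorm(n, mean = 10, sd = 1),
  cohort = factor(sample(LETTERS[1:5], n, replace = TRUE)),
  community = factor(sample(paste0("C", 1:12), n, replace = TRUE))
)

structure_adj <- "hispanic + Q1 + Q2 + Q3"         # population structure
other_adj     <- "male + age + cohort + community" # additional covariates

fit <- function(model) glm(as.formula(model), family = binomial, data = dat)

# Step 1 screens: G-E association (EG2) and marginal D-G association (DG2).
eg_screen <- fit(paste("E ~ G +", structure_adj))
dg_screen <- fit(paste("D ~ G +", structure_adj, "+", other_adj))

# Step 2: formal test of the G x E interaction term.
step2 <- fit(paste("D ~ G * E +", structure_adj, "+", other_adj))

p_value <- function(model, term) coef(summary(model))[term, "Pr(>|z|)"]
p_eg <- p_value(eg_screen, "G")
p_dg <- p_value(dg_screen, "G")
p_ge <- p_value(step2, "G:E")
```

In a genome-wide scan these three fits would be repeated for every SNP, with p_eg and p_dg compared against the Step 1 thresholds and p_ge evaluated only for SNPs that pass.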
Optimization of Hybrid Two-Step Analysis
For each exposure considered for gene-environment interactions in this study, we
used the software described above to optimize the H2 approach as a function of the
screening thresholds and allocation of the overall Type I error rate, i.e. α_A, α_M, and ρ. In
order to optimize the analysis for these parameters, assumptions needed to be made about
the population from which these subjects were sampled, as well as the underlying disease
mechanism. For this application, we assume that the prevalence of asthma in southern
California children is 15% (p_0=0.15), all SNPs are coded additively (0, 1, 2 minor alleles)
in each statistical model, and that there is no genetic main effect of a disease
susceptibility locus. The number of markers, number of cases, and the control:case ratio
(K) were all fixed by study design, such that S = 550,000, N_1 = 1,248, and K = 1.28 (total
N=2,838). For each exposure, we use prior study results to estimate exposure prevalence
(or distribution for continuous E) and marginal environmental effects. As an example of
the Hybrid Two-Step approach, I will walk through the optimization and analysis for
exposure to maternal smoking in utero. Following this demonstration, I will summarize
results for the collection of potentially important environmental factors considered for this
study.
Based on the full cohort data from the Children’s Health Study, approximately
17.9% of subjects from this population were exposed to maternal tobacco smoke in utero.
Based on previous work, the marginal environment effect of this exposure is estimated to
be approximately R_e=1.1 (Gilliland, Li et al. 2002; Wang, Salam et al. 2008). For the
purposes of optimization, we assume a zero main effect of maternal tobacco smoke
exposure (i.e. R_e=1.0). Using this information, and the assumptions described above, we
can estimate the power of the two-step approach as a function of the screening thresholds
for the remaining unknown parameters, i.e. minor allele frequency (q_A) and true
interaction effect size (R_ge). For many of the assumed population parameters, allocation
of the overall Type I error rate that would give preference to those SNPs highlighted by
the EG2 approach is a more efficient choice, i.e. ρ≥0.5 (Table IV). Therefore, we will
assume a value for ρ of 0.9. In effect, then, the EG2 SNPs will be tested at an
experimentwise α=0.045 significance level and the DG2 SNPs at α=0.005. Given this
allocation, we can optimize each of the component analyses independently. Across a
range of R_ge and q_A, α_A = 0.025 is a robust choice for the EG2 analysis (Figure VIII).
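The allocation of the overall Type I error rate between the two component analyses is simple arithmetic, sketched below for a few values of ρ. The overall level of 0.05 follows from the two shares quoted in the text (0.045 + 0.005); the other values of ρ are included only for comparison.

```r
alpha <- 0.05               # overall experimentwise Type I error rate
rho   <- c(0.5, 0.7, 0.9)   # share of alpha given to the EG2 SNPs

allocation <- data.frame(
  rho       = rho,
  alpha_EG2 = rho * alpha,        # level for SNPs passing the EG2 screen
  alpha_DG2 = (1 - rho) * alpha   # level for SNPs passing only the DG2 screen
)
```

With ρ = 0.9 the last row reproduces the levels used in this application, 0.045 for EG2 and 0.005 for DG2, and every row sums to the overall 0.05.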
Figure VIII. Power of the CC and EG2 Tests for G×E Interaction Across a Range of α_A
for a Selection of Interaction Effect Sizes (R_ge) (a) and Minor Allele Frequencies (q_A) (b).
The Optimal Settings of α_A are shown by Vertical Lines.
Figure VIII also highlights an important point, specifically that we will have low power
to detect interaction effects smaller than R_ge = 2.25, or interactions of even larger effects for rarer
disease SNPs (q_A < 0.1). Across a range of R_ge and q_A, α_M = 0.03 is a robust choice for
the DG2 analysis (Figure IX). For these settings, we have 80% power
to detect an interaction effect size of R_ge = 2.5 using the Hybrid analysis compared to 2.8
for the traditional CC analysis (Figure X). The Hybrid scan is also more powerful than
doing solely the EG2 or DG2 scans under these model assumptions.
Figure IX. Power of the CC and DG2 Tests for G×E Interaction Across a Range of α_M
for a Selection of Interaction Effect Sizes (R_ge) (a) and Minor Allele Frequencies (q_A) (b).
The Optimal Settings of α_M are shown by Vertical Lines.
Figure X. Power of the Case-Control, Environment-Gene, Disease-Gene, and Hybrid
Two-Step Tests to Detect G×in utero Tobacco Smoke Exposure Interaction in the
Children’s Health Study Across a Range of R_ge. Two-Step Power Optimized Assuming
N = 2838, M=527,000, p_E = 0.17, p_0=0.15, q_A = 0.15, R_g=R_e=1.0 (α_A=0.025, α_M=0.03
and ρ=0.9)
Results
Figure XI summarizes the results for all S = 527,918 SNPs scanned for marginal effects
with the grey line indicating genome-wide significance corrected for multiple testing
using a Bonferroni correction. One SNP, rs10119122, on chromosome 9 had a
statistically significant marginal effect at the genome-wide level (p = 7.7E-9). The black
line in Figure XI indicates the threshold required to pass the screening step for the DG2
approach, at α_M = 0.03. At this significance level, 3.2% of the total markers were passed
to Step 2 (s_M = 16,760). Similarly, Figure XII shows the results for the screening scan for
the EG2 approach at the significance level α_A = 0.025 (black line). Approximately 2.6%
of the markers tested (S = 527,918, s_A = 13,753) passed the screening threshold for the
EG2 approach. The overlap between the two screening tests was minimal, with s = 464
SNPs passing both the EG2 and DG2 screen. These SNPs will be tested at the more
liberal (EG2) Step 2 threshold.
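The screening counts above can be checked with a few lines of arithmetic. The sketch below uses only the counts and thresholds reported in this section. The "expected" values assume that every SNP is null and that the two screening tests are independent, which is an assumption of this illustration, and the Bonferroni level uses an overall α of 0.05.

```r
S       <- 527918   # SNPs scanned
alpha_M <- 0.03     # DG2 screening threshold
alpha_A <- 0.025    # EG2 screening threshold
s_M     <- 16760    # SNPs passing the DG2 screen
s_A     <- 13753    # SNPs passing the EG2 screen
s_both  <- 464      # SNPs passing both screens

# Observed percentages of markers passed to Step 2.
pct_M <- round(100 * s_M / S, 1)
pct_A <- round(100 * s_A / S, 1)

# Counts expected if all SNPs were null and the screens were independent.
expected_M    <- alpha_M * S
expected_A    <- alpha_A * S
expected_both <- alpha_M * alpha_A * S

# Bonferroni genome-wide level for the marginal scan, and the top marginal hit.
bonferroni   <- 0.05 / S
marginal_hit <- 7.7e-9 < bonferroni
```

The observed percentages match the 3.2% and 2.6% quoted in the text, and the observed overlap of 464 SNPs is of the same order as the roughly 396 expected under these assumptions.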
Figure XI. Manhattan Plot of Log10 P-Values from Step 1 of the Disease-Gene Scan
(DG2) of 527,918 SNPs from the Children’s Health Study. The Grey Line Indicates the
Bonferroni Corrected Genome-wide Statistical Significance for all SNPs and the Black
Line Shows the Significance Level Required to Pass Step 1 for the DG2 Analysis (α_M =
0.03).
Markers that pass either of the screening steps were formally tested for interaction
at the corrected significance threshold α = 0.005/(s_M - 464) = 3.0E-7 for the DG2 SNPs
and α = 0.045/s_A = 3.3E-6 for the EG2 SNPs. Neither the traditional Case-Control (CC)
approach nor the proposed Hybrid Two-Step (H2) identified any statistically significant
SNPs involved in a G×E interaction with in utero tobacco smoke after correction for
multiple testing (Figure XIII). The H2 method does highlight a potentially important
SNP through the EG2 method (rs807532) that is close to genome-wide statistical
significance (p = 3.52E-6).
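The Step 2 decision rule described in this section can be sketched as a small function. The name hybrid_step2 and its arguments are hypothetical and are not part of the software described in this dissertation; the rule itself follows the text: SNPs passing the EG2 screen (including those passing both screens) share α·ρ, and SNPs passing only the DG2 screen share α·(1-ρ).

```r
# Hypothetical helper: flag SNPs significant in Step 2 of the hybrid analysis.
hybrid_step2 <- function(p_int, pass_EG2, pass_DG2, alpha = 0.05, rho = 0.9) {
  dg2_only <- pass_DG2 & !pass_EG2
  thr <- rep(NA_real_, length(p_int))           # NA = not tested in Step 2
  thr[dg2_only] <- alpha * (1 - rho) / sum(dg2_only)
  thr[pass_EG2] <- alpha * rho / sum(pass_EG2)  # overlap uses the EG2 level
  !is.na(thr) & p_int < thr
}

# Toy example with four SNPs (made-up p-values).
toy <- hybrid_step2(
  p_int    = c(1e-4, 0.02, 0.01, 0.30),
  pass_EG2 = c(TRUE, TRUE, FALSE, FALSE),
  pass_DG2 = c(TRUE, FALSE, TRUE, FALSE)
)

# Thresholds implied by the counts reported in this section.
thr_EG2 <- 0.05 * 0.9 / 13753                # about 3.3E-6
thr_DG2 <- 0.05 * (1 - 0.9) / (16760 - 464)  # about 3.07E-7
top_snp_significant <- 3.52e-6 < thr_EG2     # rs807532: just misses
```

Evaluated this way, the DG2 expression 0.005/(s_M - 464) comes to roughly 3.07E-7, slightly above the 3.0E-7 quoted in the text (which is what 0.005/s_M gives); the difference does not change any conclusion here. The top EG2 SNP at p = 3.52E-6 falls just short of the EG2 threshold.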
Figure XII. Manhattan Plot of Log10 P-Values from Step 1 of the Environment-Gene
Scan of 527,918 SNPs for in utero Tobacco Smoke Exposure from the Children’s Health
Study. The Black Line Shows the Significance Level Required to be Tested in Step 2 for
the EG2 Analysis (α_A = 0.025).
This optimization process was repeated for each of the additional seven
environmental exposures and two non-environmental factors (gender and GSTM1)
considered for this study. Although the Hybrid Two-Step approach reduced the multiple
testing penalty for the formal test of interaction (Step 2), no statistically significant
interactions were identified for any of these additional factors (Table VII). The
traditional Case-Control test also failed to identify any statistically significant G×E
interactions for these factors after correcting for the large number of tests (data not
shown).
Conclusions
I demonstrated the application of the Hybrid Two-Step analysis approach to the
Children’s Health Study to identify gene-environment interactions. Although this new
approach did not highlight any SNPs that achieved genome-wide significance, this
application to the CHS has highlighted some important considerations for testing G×E in
a genome-wide association study. Specifically, this work has given greater emphasis to
the problem of low power to detect these types of interactions in large-scale genetic
studies.
Figure XIII. Manhattan Plot of Log10 P-Values from Step 2 of the (a) Case-Control (CC),
(b) Disease-Gene (DG2), (c) Environment-Gene (EG2) Two-Step Scan of 527,918 SNPs
for in utero Tobacco Smoke Exposure from the Children’s Health Study
Table VII. Most Significant SNPs Involved in G×E Interactions for each Environmental
Exposure in the Children’s Health Study Using the Hybrid Two-Step Approach.
For some underlying disease models, it is possible that a true interacting disease
SNP failed to pass either the genetic marginal (DG2) or the environment-gene association
(EG2) screening steps. This type of SNP would therefore not formally be tested for
interaction in a staged analysis design. However, as we showed in Chapters I and II of
this dissertation, the substantial gain in power of our two-step methods relative to the
traditional CC approach outweighs the risk of missing these effects. For our analysis in
the CHS, neither the CC nor the H2 approaches identified statistically significant
interactions. It is possible that the true interaction effect sizes are not detectable with the
sample size available. We demonstrated that the relative efficiency of the Hybrid
approach is sensitive to disease prevalence. For prevalent asthma in children, the relative
efficiency compared to the CC approach is low.
For current or future GWA studies, collaboration between investigators will be important
to obtain the large number of subjects required to detect these types of
effects. Although no SNP reached GWA significance in the Children’s Health Study,
the availability of replication samples both in the CHS and in other groups may justify in silico
replication analysis to determine whether some of the top hits from this analysis warrant
further investigation.
SUMMARY
Genome-wide association studies offer the opportunity to more fully understand the
etiology of complex disease. Although complex disease is likely to be more complicated
than can be defined by simple two-way interaction models, the development of powerful
tools that incorporate the joint effects of genes and environment is an important step to
understanding disease outcomes. The focus of my dissertation was on the development
and application of one class of efficient two-step tests designed to detect novel loci
involved in G×E interactions. I demonstrated that greater power could be achieved by
efficiently screening a set of genetic markers from a GWAS to highlight those SNPs most
likely to be involved in an interaction. The published paper on this topic, together with invited
commentaries and a rejoinder, is included in Appendix B. In addition, I extended this
methodology to include a parallel two-step scan that follows up SNPs with evidence of
marginal genetic effects with a hybrid two-step design. I have developed an efficient
software package in R (R Development Core Team 2009) designed to help researchers optimize
analysis design using these methods. Using this software, I demonstrated that the
proposed Hybrid Two-Step approach was substantially more powerful than the traditional
method used in case-control studies across a wide range of model parameters. A draft of the
manuscript summarizing this work is attached in Appendix C. Finally, I applied the H2
method to several environmental factors to investigate G×E interactions in the etiology of
childhood asthma in the Children’s Health Study.
REFERENCES
Benjamini, Y. and Y. Hochberg (1995). "Controlling the false discovery rate: a practical
and powerful approach to multiple testing." J Royal Statist Soc B 57: 289-300.
Benson, P. (1989). CALINE4 - A dispersion model for predicting air pollution
concentrations near roadways. California Department of Transportation, Sacramento.
Chatterjee, N., Z. Kalaylioglu, et al. (2005). "Exploiting gene-environment independence
in family-based case-control studies: increased power for detecting associations,
interactions and joint effects." Genet Epidemiol 28(2): 138-156.
Clayton, D. and P. M. McKeigue (2001). "Epidemiological methods for studying genes
and environmental factors in complex diseases." Lancet 358(9290): 1356-1360.
Garcia-Closas, M. and J. H. Lubin (1999). "Power and sample size calculations in case-
control studies of gene-environment interactions: comments on different
approaches." Am J Epidemiol 149(8): 689-692.
Gauderman, W. and J. Morrison (2006). QUANTO 1.1: A computer program for power
and sample size calculations for genetic-epidemiology studies,
http://hydra.usc.edu/gxe.
Gauderman, W. J., E. Avol, et al. (2004). "The effect of air pollution on lung
development from 10 to 18 years of age." N Engl J Med 351(11): 1057-1067.
Gauderman, W. J., G. F. Gilliland, et al. (2002). "Association between air pollution and
lung function growth in southern California children: results from a second
cohort." Am J Respir Crit Care Med 166(1): 76-84.
Gauderman, W. J., R. McConnell, et al. (2000). "Association between air pollution and
lung function growth in southern California children." Am J Respir Crit Care Med
162(4 Pt 1): 1383-1390.
Gauderman, W. J., H. Vora, et al. (2007). "Effect of exposure to traffic on lung
development from 10 to 18 years of age: a cohort study." Lancet 369(9561): 571-
577.
Gilbert, N. L., M. S. Goldberg, et al. (2005). "Assessing spatial variability of ambient
nitrogen dioxide in Montreal, Canada, with a land-use regression model." J Air
Waste Manag Assoc 55(8): 1059-1063.
Gilliland, F. D., K. Berhane, et al. (2003). "Environmental tobacco smoke and
absenteeism related to respiratory illness in schoolchildren." Am J Epidemiol
157(10): 861-869.
Gilliland, F. D., K. Berhane, et al. (2000). "Maternal smoking during pregnancy,
environmental tobacco smoke exposure and childhood lung function." Thorax
55(4): 271-276.
Gilliland, F. D., Y. F. Li, et al. (2002). "Effects of glutathione S-transferase M1, maternal
smoking during pregnancy, and environmental tobacco smoke on asthma and
wheezing in children." Am J Respir Crit Care Med 166(4): 457-463.
Gilliland, F. D., Y. F. Li, et al. (2001). "Effects of maternal smoking during pregnancy
and environmental tobacco smoke on asthma and wheezing in children." Am J
Respir Crit Care Med 163(2): 429-436.
Hunter, D. J., P. Kraft, et al. (2007). "A genome-wide association study identifies alleles
in FGFR2 associated with risk of sporadic postmenopausal breast cancer." Nat
Genet 39(7): 870-874.
Hwang, S. J., T. H. Beaty, et al. (1994). "Minimum sample size estimation to detect gene-
environment interaction in case-control designs." Am J Epidemiol 140(11): 1029-
1037.
Khoury, M. J. and W. D. Flanders (1996). "Nontraditional epidemiologic approaches in
the analysis of gene-environment interaction: case-control studies with no
controls!" Am J Epidemiol 144(3): 207-213.
Kooperberg, C. and M. Leblanc (2008). "Increasing the power of identifying gene x gene
interactions in genome-wide association studies." Genet Epidemiol 32(3): 255-
263.
Kraft, P., Y. C. Yen, et al. (2007). "Exploiting gene-environment interaction to detect
genetic associations." Hum Hered 63(2): 111-119.
Li, D. and D. V. Conti (2009). "Detecting gene-environment interactions using a
combined case-only and case-control approach." Am J Epidemiol 169(4): 497-
504.
Li, Y. F., F. D. Gilliland, et al. (2000). "Effects of in utero and environmental tobacco
smoke exposure on lung function in boys and girls with and without asthma." Am
J Respir Crit Care Med 162(6): 2097-2104.
McConnell, R., K. Berhane, et al. (2002). "Asthma in exercising children exposed to
ozone: a cohort study." Lancet 359(9304): 386-391.
McConnell, R., K. Berhane, et al. (1999). "Air pollution and bronchitic symptoms in
Southern California children with asthma." Environ Health Perspect 107(9): 757-
760.
McConnell, R., K. Berhane, et al. (2006). "Traffic, susceptibility, and childhood asthma."
Environ Health Perspect 114(5): 766-772.
Mukherjee, B., J. Ahn, et al. (2008). "Tests for gene-environment interaction from case-
control data: a novel study of type I error, power and designs." Genet Epidemiol
32(7): 615-626.
Mukherjee, B. and N. Chatterjee (2008). "Exploiting gene-environment independence for
analysis of case-control studies: an empirical Bayes-type shrinkage estimator to
trade-off between bias and efficiency." Biometrics 64(3): 685-694.
Murcray, C. E., J. P. Lewinger, et al. (2009). "Gene-environment interaction in genome-
wide association studies." Am J Epidemiol 169(2): 219-226.
Piegorsch, W. W., C. R. Weinberg, et al. (1994). "Non-hierarchical logistic models and
case-only designs for assessing susceptibility in population-based case-control
studies." Stat Med 13(2): 153-162.
Price, A. L., N. J. Patterson, et al. (2006). "Principal components analysis corrects for
stratification in genome-wide association studies." Nat Genet 38(8): 904-909.
Pritchard, J. K. and N. A. Rosenberg (1999). "Use of unlinked genetic markers to detect
population stratification in association studies." Am J Hum Genet 65(1): 220-228.
Pritchard, J. K., M. Stephens, et al. (2000). "Inference of population structure using
multilocus genotype data." Genetics 155(2): 945-959.
Sarasua, S. M., J. S. Collins, et al. (2009). "Effect of population stratification on the
identification of significant single-nucleotide polymorphisms in genome-wide
association studies." BMC Proc 3 Suppl 7: S13.
Saxena, R., B. F. Voight, et al. (2007). "Genome-wide association analysis identifies loci
for type 2 diabetes and triglyceride levels." Science 316(5829): 1331-1336.
Scott, L. J., K. L. Mohlke, et al. (2007). "A genome-wide association study of type 2
diabetes in Finns detects multiple susceptibility variants." Science 316(5829):
1341-1345.
R Development Core Team (2009). R: A Language and Environment for Statistical Computing.
Wang, C., M. T. Salam, et al. (2008). "Effects of in utero and childhood tobacco smoke
exposure and beta2-adrenergic receptor genotype on childhood asthma and
wheezing." Pediatrics 122(1): e107-114.
Wang, K. (2009). "Testing for genetic association in the presence of population
stratification in genome-wide association studies." Genet Epidemiol 33(7): 637-
645.
Wang, L. Y. and W. C. Lee (2008). "Population stratification bias in the case-only study
for gene-environment interactions." Am J Epidemiol 168(2): 197-201.
Wenten, M., K. Berhane, et al. (2005). "TNF-308 modifies the effect of second-hand
smoke on respiratory illness-related school absences." Am J Respir Crit Care Med
172(12): 1563-1568.
Wenten, M., Y. F. Li, et al. (2009). "In utero smoke exposure, glutathione S-transferase
P1 haplotypes, and respiratory illness-related absence among schoolchildren."
Pediatrics 123(5): 1344-1351.
Yang, Q., M. J. Khoury, et al. (1997). "Sample size requirements in case-only designs to
detect gene-environment interaction." Am J Epidemiol 146(9): 713-720.
Zeggini, E., M. N. Weedon, et al. (2007). "Replication of genome-wide association
signals in UK samples reveals risk loci for type 2 diabetes." Science 316(5829):
1336-1341.
Zhu, Y., W. C. Hinds, et al. (2002). "Concentration and size distribution of ultrafine
particles near a major highway." J Air Waste Manag Assoc 52(9): 1032-1042.
1 APPENDIX: SAMPLE POWER AND SAMPLE SIZE CODE
2
3 ## CALCULATION OF POWER AND SAMPLE SIZE FOR SINGLE ‘BASE’
4 ## MODEL
5 # PARAMETER SETTINGS
6 ## INPUT PARAMETERS
7 ## Base Model From Table V, Line I
8
9 ## Minor allele frequency
10 daf <- 0.15
11
12 ## Genotype coding for aa, aA, AA
13 ## Additive: c(0,1,2)
14 ## Dominant: c(0,1,1)
15 ## Recessive: c(0,0,1)
16
17 dom <- c(0,1,2)
18
19 ## Exposure prevalence
20 ep <- 0.3
21
22 ## G-E association
23 ORge_DSL <- 1
24
25 ## G-E association
26 ORge <- 1
27 tau <- 0.53
28
29 ## True relative risk
30 RRg <- 1
31 RRe <- 1
32 RRge <- 2
33
34 ## baseline prevalence
35 p0 <- 0.01
36
37 ## population level g-e association
38 pge <- 0.0
39
40 ## Num controls per case
41 cpercase <- 1
42
43 ## Num of Tests (Genotypes)
44 M <- 1000000
45
46 ## One- or Two-sided Alternative
47 Alt.H <- 2
48
49 ## Power / Significance Level
50 Power <- 0.8
51 Exp_Sig <- 0.05
52
53 ## Sample size required to achieve 80% power for CC analysis
54 N <- 2632
55
56 step1_alpha <- 0.0001
57 Marg_sig <- 0.0001
58 rho <- 0.5
59
60
61 ## CALL TO PROGRAM FUNCTIONS ##
62
63 ## One-Step Case-Control Test for Interaction (CC)
64 # Sample Size above set to baseline result from Table V to achieve 80% power
65 cc <- CC_power(daf=daf, ep=ep, RRg=RRg, RRe=RRe, RRge=RRge, N=N,
66 M=M, p0=p0, ORge=ORge_DSL, pge=pge)
67 N_cc <- cc[1,]
68 Pow_cc <- cc[2,]
69
70 ## Print out for Sample Size Required to Achieve Specified Power (Power = 0.8)
71 print(c("Case-Control Sample Size:" , round(N_cc)), quote=F)
72 [1] Case-Control Sample Size: 2632
73
74 ## Print out of Power of CC Test with Set Sample Size = 2,632
75 print(c("Case-Control Power for N=",N, Pow_cc), sep="", quote=F)
76 [1] Case-Control Power for N= 2632 0.800056703448843
77
78 ## One-Step Case-Only Test for Interaction (CO)
79 co <- CO_power(daf=daf, ep=ep, RRg=RRg, RRe=RRe, RRge=RRge, N=N,
80 M=M, p0=p0, adjust=T, ORge=ORge_DSL, pge=pge)
81 N_co <- co[1,]
82 Pow_co <- co[2,]
83
84 ## Print out for Sample Size Required to Achieve Specified Power (Power = 0.8)
85 print(c("Case-Only Sample Size:" , round(N_co)), quote=F)
86 [1] Case-Only Sample Size: 1116
87
88 ## Print out of Power of CO Test with Set Sample Size = 2,632
89 print(c("Case-Only Power for N=",N, Pow_co), sep="", quote=F)
90 [1] Case-Only Power for N= 2632 0.99998731710789
91
92 ## Two-Step Test of Environment-Gene Two-Step Test for Interaction (EG2)
93 Pow_2_step <- Two_Step_Power(daf=daf, ep=ep, RRg=RRg, RRe=RRe,
94 RRge=RRge, N=N, M=M, p0=p0, ORge=ORge_DSL, pge=pge)
95
96 # Print out of Power for EG2 Test with Set Sample Size = 2,632
97 print(c("Two-Step Power for N=", N, Pow_2_step), sep="", quote=F)
98 [1] Two-Step Power for N= 2632 0.997320928963074
99
100 ## Print out for Sample Size Required to Achieve Specified Power (Power = 0.8)
101 N_2stepA <- round(SampleSize_2sA(N=N, tol=0.00001, a1_A = step1_alpha))
102 print(c("Association Two-Step Sample Size:" , N_2stepA), quote=F)
103 [1] Association Two-Step Sample Size: 1383
104
105
106 ## Two-Step Test of Disease-Gene Two-Step Test for Interaction (DG2)
107 Marg <- Marg_2Step(Marg_Sig=Marg_sig, daf=daf, ep=ep, RRg=RRg,
108 RRe=RRe, RRge=RRge, N=N, p0=p0, ORge=ORge, pge=pge, alpha=Exp_Sig)
109
110 # Print out of Power for DG2 Test with Set Sample Size = 2,632
111 print(c("Two-Step Marg Power for N=", N, Marg), sep="", quote=F)
112 [1] Two-Step Marg Power for N= 2632 0.930109385250377
113
114 ## Print out for Sample Size Required to Achieve Specified Power (Power = 0.8)
115 N_2stepM <- round(SampleSize_2sM(N=N, tol=0.00001, a1_M = Marg_sig))
116 print(c("Marginal Two-Step Sample Size:" , N_2stepM), quote=F)
117 [1] Marginal Two-Step Sample Size: 2078
118
119
120 ## Hybrid Two-Step Test of Interaction (H2)
121 C2s<-Comb_2Step(Marg_Sig=Marg_sig, alpha1=step1_alpha, daf=daf, ep=ep,
122 RRg=RRg, RRe=RRe, RRge=RRge, N=N, M=M, p0=p0, ORge=ORge, pge=pge,
123 alpha=Exp_Sig, A=rho)
124
125 # Print out of Power for H2 Test with Set Sample Size = 2,632
126 print(c("Two-Step Hybrid Power for N=", N, C2s), sep="", quote=F)
127 [1] Two-Step Hybrid Power for N= 2632 0.995726862849578
128
129 ## Print out for Sample Size Required to Achieve Specified Power (Power = 0.8)
130 N_2stepH <- round(SampleSize_2sH(N=N, tol=0.00001, a1_A=step1_alpha,
131 a1_M = Marg_sig, rho=rho))
132 print(c("Hybrid Two-Step Sample Size:" , N_2stepH), quote=F)
133 [1] Hybrid Two-Step Sample Size: 1406
134
135 ## OPTIMIZATION OF EG2 AND DG2
136 optimize(opt_power, c(0,1), maximum=TRUE, daf=daf, ep=ep, RRg=RRg,
137 RRe=RRe, RRge=RRge, N=N, p0=p0, ORge=ORge_DSL, pge=pge)
138 $maximum
139 [1] 6.610696e-05
140
141 $objective
142 [1] 0.997958
143
144 optimize(opt_Marg, c(0,1), maximum=TRUE, alpha=Exp_Sig, daf=daf, ep=ep,
145 RRg=RRg, RRe=RRe, RRge=RRge, N=N, p0=p0, ORge=ORge, pge=pge)
146 $maximum
147 [1] 0.001645910
148
149 $objective
150 [1] 0.9706693
151
152 ## PLOT POWER AS FUNCTION OF INTERACTION EFFECT SIZE
153 ## Set range of Rge
154 Rge <- seq(1.1, 2.5, 0.1)
155
156 CC <- CC_power(daf=daf, ep=ep, RRg=RRg, RRe=RRe, RRge=Rge, N=N,
157 M=M, p0=p0, ORge=ORge)
158
159 CO <- CO_power(daf=daf, ep=ep, RRg=RRg, RRe=RRe, RRge=Rge, N=N,
160 M=M, p0=p0, adjust=T, ORge=ORge_DSL, pge=pge)
161
162 EG2 <- Two_Step_Power(ORge=ORge, pge=pge, daf=daf, ep=ep, RRg=RRg,
163 RRe=RRe, RRge=Rge, N=N, M=M, p0=p0)
164
165 DG2 <-Marg_2Step(Marg_Sig=Marg_sig, daf=daf, ep=ep, RRg=RRg, RRe=RRe,
166 RRge=Rge, N=N, p0=p0, ORge=ORge, pge=pge, alpha=Exp_Sig)
167
168 H2 <-Comb_2Step(Marg_Sig=Marg_sig, alpha1=step1_alpha, daf=daf, ep=ep,
169 RRg=RRg, RRe=RRe, RRge=Rge, N=N, M=M, p0=p0, ORge=ORge, pge=pge,
170 alpha=Exp_Sig, A=rho)
171
172 ## Plotting
173 plot(Rge, CC[2,], ylab="Power", xlab="Interaction Effect Size", type="l",
174 col="red", ylim=c(0,1), lwd=2)
175 lines(Rge, EG2, type="l", col="blue", ylim=c(0,1), lwd=2)
176 lines(Rge, DG2, ylab="Power", type="l", col="orange", ylim=c(0,1), lwd=2)
177 lines(Rge, H2, ylab="Power", type="l", col="green", ylim=c(0,1), lwd=2)
178 lines(Rge, CO[2,], ylab="Power", type="l", col="black", ylim=c(0,1), lwd=2)
179 legend(2, .2,c("CC", "EG2", "DG2", "H2", "CO"), lty=1, col=c("red", "blue",
180 "orange", "green", "black"), merge=T, lwd=2, bty="n")
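The two `optimize()` calls in the listing search the interval (0, 1) for the argument value that maximizes a power function, which here appears to be a step-1 significance threshold. As a minimal sketch of that pattern only, assuming a made-up objective (`toy_power` and its `peak` argument are hypothetical and are not the dissertation's `opt_power` or `opt_Marg`), base R's `optimize()` can be used like this:

```r
# Hypothetical stand-in for a power curve over (0, 1): it rises, peaks at
# a1 = peak, then falls. This is NOT the dissertation's opt_power().
toy_power <- function(a1, peak = 0.25) {
  1 - (a1 - peak)^2
}

# optimize() performs a one-dimensional search over the interval. Extra
# named arguments (here `peak`) are passed through to the objective,
# just as daf, ep, RRg, ... are passed through in the listing.
res <- optimize(toy_power, interval = c(0, 1), maximum = TRUE, peak = 0.25)

res$maximum    # argument value at which toy_power is largest
res$objective  # value of toy_power at that argument
```

With `maximum = TRUE` the result is a list whose `$maximum` element holds the location of the optimum and whose `$objective` element holds the function value there, which matches the shape of the console output shown in the listing.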
Abstract
Many complex diseases (e.g., asthma, diabetes) are likely to result from the interplay of genes and environmental exposures. The standard analysis in a genome-wide association study (GWAS) scans for main effects and ignores the potentially useful information in the available exposure data. This dissertation explores alternative approaches to detect gene-environment (G×E) interactions in GWAS. The first chapter explores a novel approach aimed at prioritizing the large number of SNPs tested to highlight those most likely to be involved in a G×E interaction. This approach screens all markers available in a GWAS using a test that models the G-E association induced by an interaction in the combined case-control sample. Power and Type I error of this approach are compared to those of a traditional approach. In the second chapter, I explore alternative two-step approaches through the development of a likelihood-based software package designed to compute the power and sample size of a variety of approaches to detect G×E interactions. In the final chapter, I demonstrate the use of this software package in the optimization and analysis of a nested case-control sample from the Children’s Health Study to investigate heterogeneity of genetic risk by subgroups defined by environmental exposures on asthma susceptibility in children in southern California. I optimize this procedure to efficiently scan for G×E interactions that affect asthma susceptibility for binary exposures (e.g., in utero tobacco smoke, close proximity to a major road or freeway) and continuous exposures (e.g., PM2.5, ozone, traffic-related pollution such as NOx).
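The screening idea in the abstract can be illustrated with a small simulation. The following is a rough sketch under stated assumptions, not the dissertation's software: all parameter values are invented, the step-1 screen is taken to be a logistic regression of E on G in the combined case-control sample, and step 2 is assumed to apply a Bonferroni correction over only the SNPs that pass step 1.

```r
set.seed(2010)

# --- Simulate a source population (all parameter values are illustrative) ---
n_pop  <- 50000                 # population size
n_snps <- 50                    # markers scanned; only SNP 1 is causal
e  <- rbinom(n_pop, 1, 0.5)     # binary exposure
g1 <- rbinom(n_pop, 2, 0.3)     # causal SNP, additive coding 0/1/2
# Disease risk: pure interaction, odds ratio 2.5 per allele among the exposed
d  <- rbinom(n_pop, 1, plogis(-3 + log(2.5) * g1 * e))

# --- Draw a case-control sample of 1000 cases and 1000 controls ---
idx <- c(sample(which(d == 1), 1000), sample(which(d == 0), 1000))
D <- d[idx]
E <- e[idx]
# Null SNPs are independent of D and E, so they can be generated after sampling
G <- cbind(g1[idx],
           matrix(rbinom(length(idx) * (n_snps - 1), 2, 0.3),
                  nrow = length(idx)))

# --- Step 1: screen every SNP on G-E association in the combined sample ---
alpha1 <- 0.05
p_step1 <- apply(G, 2, function(g) {
  coef(summary(glm(E ~ g, family = binomial)))["g", "Pr(>|z|)"]
})
passed <- which(p_step1 < alpha1)

# --- Step 2: test GxE only for SNPs that passed step 1 ---
alpha <- 0.05
p_step2 <- vapply(passed, function(j) {
  g <- G[, j]
  coef(summary(glm(D ~ g * E, family = binomial)))["g:E", "Pr(>|z|)"]
}, numeric(1))
# Assumed correction: Bonferroni over the number of SNPs passing step 1
hits <- passed[p_step2 < alpha / length(passed)]
```

Because the step-2 correction is over `length(passed)` rather than all `n_snps` markers, a SNP that survives the screen faces a less stringent threshold than it would in a one-step scan of every marker, which is the source of the efficiency gain the abstract refers to.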
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Two-step testing approaches for detecting quantitative trait gene-environment interactions in a genome-wide association study
Combination of quantile integral linear model with two-step method to improve the power of genome-wide interaction scans
Bayesian model averaging methods for gene-environment interactions and admixture mapping
High-dimensional regression for gene-environment interactions
Minimum p-value approach in two-step tests of genome-wide gene-environment interactions
Comparisons of four commonly used methods in GWAS to detect gene-environment interactions
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
Bayesian hierarchical models in genetic association studies
Extending genome-wide association study methods in African American data
Hierarchical approaches for joint analysis of marginal summary statistics
A genome wide association study of multiple sclerosis (MS) in Hispanics
Adaptive set-based tests for pathway analysis
The influence of DNA repair genes and prenatal tobacco exposure on childhood acute lymphoblastic leukemia risk: a gene-environment interaction study
Genomic risk factors associated with Ewing Sarcoma susceptibility
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
Asset Metadata
Creator: Murcray, Cassandra Elizabeth (author)
Core Title: Efficient two-step testing approaches for detecting gene-environment interactions in genome-wide association studies, with an application to the Children’s Health Study
School: Keck School of Medicine
Degree: Doctor of Philosophy
Degree Program: Biostatistics
Publication Date: 09/03/2010
Defense Date: 08/17/2010
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: association studies, case-control, environment, Gene, interactions, OAI-PMH Harvest
Place Name: California (states)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Gauderman, James W. (committee chair), Conti, David V. (committee member), Sun, Fengzhu Z. (committee member)
Creator Email: cassie.murcray@gmail.com, murcray@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m3421
Unique identifier: UC1148851
Identifier: etd-Murcray-4052 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-384704 (legacy record id), usctheses-m3421 (legacy record id)
Legacy Identifier: etd-Murcray-4052.pdf
Dmrecord: 384704
Document Type: Dissertation
Rights: Murcray, Cassandra Elizabeth
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu
Tags
association studies
case-control
environment
interactions