Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Two-step testing approaches for detecting quantitative trait gene-environment interactions in a genome-wide association study
(USC Thesis Other)
Two-step testing approaches for detecting quantitative trait gene-environment interactions in a genome-wide association study
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
I TWO-STEP TESTING APPROACHES FOR DETECTING QUANTITATIVE TRAIT GENE-ENVIRONMENT INTERACTIONS IN A GENOME-WIDE ASSOCIATION STUDY by Pingye Zhang A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (BIOSTATISTICS) May 2016 Copyright 2016 Pingye Zhang II Dedicated to Huicheng Zhang, Hangli Zhao and Eva III Acknowledgements First I would like to thank my advisor, Dr. William Gauderman for guiding and supporting me over the years. You have set an example of excellence as a researcher, mentor, instructor, and role model. I would also like to thank my other committee members, Dr. David Conti, Dr. Juan Pablo Lewinger and Dr. Fengzhu Sun, for all of their guidance through this process; your discussion, ideas, and feedback have been absolutely invaluable. I also owe thanks to other USC faculty members for helping me learn and giving me advice and guidance. I'd also like to thank my colleagues, fellow graduate students from whom I have learned, found support and received helpful advice and suggestions. I am very grateful to all of you. Finally, I would especially like to thank my amazing family for the love, support, and constant encouragement I have gotten over the years. In particular, I would like to thank my parents, Huicheng Zhang and Hangli Zhao, and my girlfriend Eva. You are the salt of the earth, and I undoubtedly could not have done this without you. Your love and support is always the motivation for me to find the better me! IV Table of Contents ACKNOWLEDGMENTS .................................................................................................III LIST OF TABLES ............................................................................................................ VI LIST OF FIGURES ......................................................................................................... VII ABSTRACT ...................................................................................................................... IX CHAPTER 1: INTRODUCTION ........................................................................................1 CHAPTER 2: TWO STEP APPROACHES FOR DETECTING G×E INTERACTION ...................................................................................5 2.1 Materials and Methods .................................................................................5 2.1.1 Marginal (YG) test ...........................................................................6 2.1.2 Interaction Test (G×E) .....................................................................6 2.1.3 Two-Step Test with Screening on Marginal G (YG|G×E) ..............6 2.1.4 Two-Step Test with Screening on Variance Heterogeneity (Var|G×E) ........................................................................................7 2.1.5 Two-Step Test with a Screening on Residual Variance Heterogeneity (rVar|G×E) ...............................................7 2.1.6 Joint Two-Step Test with Screening Based on Both YG and rVar (Joint|G×E) .................................................................9 2.1.7 Hypothesis Testing Approaches in Step 2 .....................................10 2.2 Simulation Study ........................................................................................11 2.2.1 Type 1 Error ...................................................................................14 2.2.2 Power Comparison .........................................................................14 2.3 Results ........................................................................................................15 2.3.1 Type 1 Error ...................................................................................15 2.3.2 Power Comparison .........................................................................16 2.4 Discussion ..................................................................................................22 2.5 Supplemental Materials .............................................................................25 2.5.1 Correlation Between the Variance-per-genotype and G×E Interaction Estimators .....................................................25 2.5.2 Variance-per-genotype Estimator is Uncorrelated with Marginal G estimator .............................................................29 2.5.3 Levene’s Test on Imputed Data .....................................................31 2.5.4 Supplemental Tables and Figures ..................................................32 V CHAPTER 3: SOFTWARE DEVELOPMENT 3.1 General Description ...................................................................................37 3.2 Methods......................................................................................................38 3.2.1 Notation..........................................................................................38 3.2.2 Exhaustive Test ..............................................................................39 3.2.3 Two-Step Procedures .....................................................................40 3.2.4 Step-2 Hypothesis Testing Approaches .........................................44 3.3 Example Dataset ........................................................................................45 3.4 Program Execution.....................................................................................49 3.4.1 G×EScan.exe ..................................................................................50 3.4.2 G×EMerge.exe ...............................................................................55 3.4.3 G×ETest.R .....................................................................................56 3.5 Program Output .........................................................................................57 3.5.1 Summary ........................................................................................57 3.5.2 Exhaustive Tests ............................................................................58 3.5.3 Two-Step Tests ..............................................................................59 3.5.4 Example Summary .........................................................................61 CHAPTER 4: APPLICATION TO THE CHILDREN’S HEALTH STUDY ..................63 4.1 General Description and Motivation ..........................................................64 4.2 Phenotype and Subjects .............................................................................64 4.3 Exposure Data ............................................................................................65 4.4 Genotype Data ...........................................................................................65 4.5 Statistical Analysis .....................................................................................67 4.6 Results ........................................................................................................68 4.7 Conclusion .................................................................................................83 CHAPTER 5: SUMMARY................................................................................................84 REFERENCES ..................................................................................................................85 VI List of Tables Table 2.1 Type 1 error rates for tests of G×E1 interaction when non-QTL SNPs have neither marginal G effect nor G×E1 interaction effect. .......................................................................................16 Table S2.1 Models of G×E1 interaction that produce the indicated marginal genetic effect size (βG). ...............................................................32 Table S2.2 Type 1 error rates for tests of G×E1 interaction when 10 non-QTL SNPs have a marginal G effect but no G×E1 interaction effect. ......................................................................................33 Table S2.3 Power using subset testing for Joint|G×E across a range of step-1 thresholds (α1) compared to the power of weighted hypothesis testing. ......................................................................34 Table 4.1 Descriptive Statistics for Environmental Exposures Considered for Testing G×E Interactions in the Children’s Health Study. ........................65 Table 4.2 Top 10 SNPs from Joint|G×E analysis of 506,788 SNPs for G x Hispanic-ethnicity interaction of FEV1 in the Children’s Health Study. ...........................................................................71 Table 4.3 Most Significant SNPs Involved in G×E Interactions for each Environmental Exposures in the Children’s Health Study using 2-Step methods with subset testing (α=0.05).. ..........................................83 VII List of Figures Figure 2.1 Power to detect G×E1 interaction in the presence of one interaction effect, with 6,000 individuals. (A) Power is presented across a range of magnitudes for the marginal genetic effect (R 2 G) with R 2 GE1=0.4%. (B) Power is presented across a range of magnitudes for G×E1 interaction (R 2 GE1) with R 2 G=0.17%. .........................................................18 Figure 2.2 Power to detect G×E1 interaction in the presence of two interaction effects, with 6,000 individuals. (A) Power is presented across a range of magnitudes for the marginal genetic effect (R 2 G) with R 2 GE1=0.4%, R 2 E2= R 2 GE2=2%. (B) Power is presented across a range of magnitudes for G×E1 interaction (R 2 GE1) with R 2 G=0.17%, R 2 E2= R 2 GE2=2%. (C) Power is presented across a range of magnitudes for the marginal effect of E2 (R 2 E2) with R 2 G=0.17%, R 2 GE1=0.4%, R 2 GE2=2%. (D) Power is presented across a range of magnitudes for G×E2 interaction (R 2 GE2) with R 2 G=0.17%, R 2 GE1=0.4%, R 2 E2=2%. ...................21 Figure S2.1 Variance heterogeneity can be induced by G×E interaction ......................33 Figure S2.2 The removal of the marginal effect of E. (A) The magnitude of G×E interaction will not be affected. (B) The magnitude of marginal G effect will not be affected ......................................................35 Figure S2.3 Power to detect G×E1 interaction in the presence of two interaction effects, with varied qA, PE1 and PE2. Proportions of total variance explained by each component were fixed at R 2 G=0.17%, R 2 GE1=0.4%, R 2 E2= R 2 GE2=2%. (A) Power is presented across a range of MAF for the QTL (qA). (B) Power is presented across a range of prevalence for E1. (C) Power is presented across a range of prevalence for E2. ....................36 Figure 3.1 Summary of Example data results. ............................................................58 Figure 3.2 QQ, Manhattan plots and part of the table of top hits for the standard analysis of G×E interaction. ........................................................59 Figure 3.3 QQ and Manhattan plots for 2-step Joint|G×E procedure with Subset testing. ............................................................................................60 Figure 3.4 Manhattan plot for 2-step Joint|G×E procedure with Weighted hypothesis testing. .....................................................................61 VIII Figure 4.1 Quantile-quantile and Manhattan plots for G x Hispanic-ethnicity interaction analysis of 506,788 SNPs from the Children’s Health Study. (A-B) Marginal G scan. (C-D) Standard G×E scan (E-F) Step-1 from the rVar|G×E method. ...............................................................................69 Figure 4.2 Analysis of 506,788 SNPs for G x Hispanic-ethnicity interaction using two-step methods. Subset testing was used (see Methods), with Step-1 threshold (α1) set to be 0.1. (A-B) Step-2 from the YG|G×E method. (C-D) Step-2 from the rVar|G×E method. (E-F) Step-2 from the Joint|G×E method. ....................................................................................70 Figure 4.3 G x Hipanic-ethnicity interaction pattern for rs1439945 for carriers (G=1 with AA and Aa genotype combined) relative to noncarriers (G=0 for aa genotype) in Hispanic White children (E=0) or non-Hispanic White children (E=1). .........................................................72 Figure 4.4 Analysis of 6,489,739 SNPs for marginal effect scan. Familywise error rate (α) is set to be 0.05. (A) Quantile-quantil plot. (B) Manhattan plot .....................................................................................72 Figure 4.5 Quantile-quantile and Manhattan plots for G×Hispanic-ethnicity interaction analysis of 6,489,739 SNPs from the Children’s Health Study. (A-B) Step-2 from the Joint|G×E method. (C-D) Step-2 from the rVar|G×E method. .....................................................................................74 Figure 4.6 Quantile-quantile and Manhattan plots for G×PM2.5 interaction analysis of 6,489,739 SNPs from the Children’s Health Study. (A-B) Step-2 from the Joint|G×E method. (C-D) Step-2 from the rVar|G×E method. (E-F) Step-2 from the YG|G×E method ....................................................76 Figure 4.7 Quantile-quantile and Manhattan plots for G×NO2 interaction analysis of 6,489,739 SNPs from the Children’s Health Study. (A-B) Step-2 from the Joint|G×E method. (C-D) Step-2 from the rVar|G×E method. ...........78 Figure 4.8 Quantile-quantile and Manhattan plots for G × in utero tobacco smoke interaction analysis of 6,489,739 SNPs from the Children’s Health Study. (A-B) Step-2 from the Joint|G×E method. (C-D) Step-2 from the rVar|G×E method. (E-F) Step-2 from the YG|G×E method. ....................79 Figure 4.9 Quantile-quantile plots for G×sex interaction analysis of 6,489,739 SNPs from the Children’s Health Study. (A) Without adjustment for sex × Hispanic-ethnicity interaction. (B) With adjustment for sex × Hispanic-ethnicity interaction.. ........................................................80 Figure 4.10 Quantile-quantile and Manhattan plots for G×sex interaction analysis of 6,489,739 SNPs from the Children’s Health Study. (A-B) Step-2 from the Joint|G×E method. (C-D) Step-2 from the YG|G×E method. ..............82 IX Abstract A genome-wide association study (GWAS) typically is focused on detecting marginal genetic effects. However, many complex traits are likely to be the result of the interplay of genes and environmental factors. These SNPs may have a weak marginal effect and thus unlikely to be detected from a scan of marginal effects, but may be detectable in a gene-environment (G ×E) interaction analysis. However, a genome-wide interaction scan (GWIS) using a standard test of G ×E interaction is known to have low power, particularly when one corrects for testing multiple SNPs. Two 2-step methods for GWIS have been previously proposed, aimed at improving efficiency by prioritizing SNPs most likely to be involved in a G×E interaction using a screening step. For a quantitative trait, these include a method that screens on marginal effects [Kooperberg and Leblanc, 2008] and a method that screens on variance heterogeneity by genotype [Paré et al., 2010]. In this dissertation I show that the Paré et al. approach has an inflated false-positive rate in the presence of an environmental marginal effect, and we propose an alternative that remains valid. We also propose a novel 2-step approach that combines the two screening approaches, and provide simulations demonstrating that the new method can outperform other GWIS approaches. We develop software, as modules to the program “G×EScan”, to implement the novel methods along with other existing methods. Application of novel methods to a G x Hispanic-ethnicity scan for childhood lung function reveals a SNP near the MARCO locus that was not identified by previous marginal-effect scans. 1 Chapter 1: Introduction Genome-wide association studies (GWAS) have been conducted for years as a tool for identifying genetic susceptibility loci that are associated with complex traits. Several hundred trait-related loci for many human diseases and quantitative traits have been discovered through the scans of marginal genetic effects [Hindorff et al., 2015]. However, there remains a significant amount of heritability left unexplained for many traits after accounting for the SNPs that have been identified. This missing heritability is not likely to be explained by additional marginal- effect loci that might be detectable even with larger sample sizes [Maher, 2008]. A number of alternative explanations for this “missing” heritability have been discussed [Manolio et al., 2009]. One possible reason is that many complex traits are likely to be the result of the interplay of genes and environmental factors, such that a given trait-related locus may be important only for a subgroup of the population defined by some environmental factor. The presence of gene- environment (G ×E) interaction may produce a weak marginal genetic effect that is unlikely to be detected using a standard genome-wide scan. In fact, an interaction with opposite genetic effects in different subgroups (qualitative interaction) could produce virtually no marginal genetic effect. Testing for G ×E interactions in genome-wide association studies is a promising approach to find novel genes that could be missed from the primary scans of marginal effect. Identifying G ×E interaction is considered one possible key to better understanding the genetic architecture of complex traits [Manolio and Collins, 2007; Zuk et al., 2012], and incorporating G ×E interactions into trait prediction models is a primary goal of the President’s Precision Medicine Initiative (http://www.nih.gov/news/health/sep2015/od-17.htm). 2 The traditional analysis of G ×E interaction is based on an exhaustive scan for interaction with each SNP using a regression framework – linear regression for a quantitative outcome and logistic regression for a dichotomous outcome. In the context of dichotomous outcome this approach has been shown to have poor power compared to the test of marginal genetic effect or environmental effect alone [Hwang, Beaty et al., 1994; Garcia-Closas and Lubin, 1999]. In the context of a GWAS, the correction for multiple testing needs to be performed for millions of SNPs and typically requires p-values less than 10 -8 to be declared statistically significant at a genome-wide significance level [Devlin B and Roeder K, 1999]. Moreover, it’s believed that large and pervasive interaction effects are unlikely in human populations [Aschard et al., 2012]. Therefore, more efficient approaches need to be developed for testing G ×E interactions in a GWAS. In the context of case-control sampling, one important contribution was the case-only (CO) analysis. The CO analysis is based on the observation that the G ×E interaction relative risk can be estimated by the odds ratio between genetic variant and environmental exposure in the cases alone under the assumption of independence between genetic variant and environment [Piegorsch, Weinberg et al. 1994]. The CO analysis can provide substantially greater power than the standard case-control (CC) analysis. However if the genetic variant and environment are not independent in the population, CO analysis could have an unacceptably high false-positive rate. For a case-control study (e.g., Unmatched case-control with 1 control per case), we sample cases from the case population and typically sample the same amount of controls from the control population. As a result, the case-control sample will have a case prevalence much higher than it is in the source population. In the presence of G ×E interaction, because a G ×E interaction induces a correlation between environment and genetic variant in cases, there is also an induced, 3 but weaker, correlation between environment and genetic (E-G) in the combined case-control sample due to the over ascertainment of cases from the source population. In addition to the exhaustive scan for G ×E interaction several 2-step approaches have been developed to improve efficiency in the context of case-control sampling [Murcray et al., 2009, 2011; Kooperberg and LeBlanc, 2008; Hsu et al., 2012; Gauderman et al, 2013]. The two- step approaches are based on the idea of reducing the burden of correction for multiple testing by screening out SNPs that are unlikely to have significant interaction effects at the first step (screening step) and only formally testing for interactions on those SNPs passed to the second step (testing step). The key of a two-step approach is to find information in the sample to form a test statistic in the screening step that is independent of the test statistic in the testing step. The validity of a two-step approach is based on the independence assumption between the two steps. If the testing step statistic is dependent on the screening step statistic, then inflated Type I error will be presented among the null markers. Note that most of the 2-step approaches for a case- control sample [Murcray et al., 2009, 2011; 2008; Hsu et al., 2012; Gauderman et al, 2013] are based on the observation that there is also an induced correlation between environment and genetic (E-G) in the combined case-control sample due to the over ascertainment of cases from the source population in the presence of G ×E interaction. For a quantitative trait, power for the traditional approach is also poor and 2-step alternatives have been proposed. Kooperberg and LeBlanc [Kooperberg and Leblanc, 2008] proposed screening SNPs based on their marginal effect, followed by testing of G ×E interaction in Step 2 for only those SNPs that pass a 0.05 significance threshold in the screen. Note that the Kooperberg and LeBlanc approach was proposed in the context of G ×G interaction and studied by simulation only for case-control sample. Here we explored the power of this strategy for 4 testing for G ×E interaction with a quantitative trait. In a scan for quantitative trait loci (QTLs), we typically draw a random sample from the source population without regard for phenotype, and thus there is no induced E-G correlation in the sample. Consequently, unlike the case-control sample, screening on the E-G correlation cannot provide us with information on potential G ×E interaction. Pare et al. [Paré et al., 2010] developed an alternative procedure, with screening in Step 1 based on a test of variance heterogeneity across SNP genotypes. The rationale is that when the magnitude and direction of the effect of a QTL differ depending on environment or other SNP, the variability of the quantitative trait among individuals carrying the risk allele is likely to be larger than the variability among non-carriers. However, as I show in this dissertation, the Pare et al. approach does not preserve the overall Type 1 error, except in limited circumstances. In this dissertation, we develop a solution to this problem that recovers a valid 2- step method using variance heterogeneity screening. We also propose a new 2-step approach that combines both marginal and variance-heterogeneity information in a single screening test. We use simulations to compare the Type 1 error and power of these various approaches across a wide range of underlying models. We also develop modules for the software program “G×EScan” that implement the two novel methods and other existing methods, and apply these methods to a GWIS of childhood lung function in the Children’s Health Study. 5 Chapter 2: Two Step Approaches for Detecting G×E Interaction 2.1 Materials and Methods Let Y be a quantitative phenotype. We define E to be an exposure variable, which can be an environmental variable (e.g. air pollution), personal exposure (e.g. smoking), or other personal characteristic (e.g. sex). We further assume that M SNPs have been genotyped on each of the N study individuals, with G1, G2, …, GM denoting the genotypes at the M loci. We let qA to be the minor allele frequency (MAF) of allele A for the quantitative trait locus (QTL). In the following sections we describe several different approaches that can be used to analyze these data. Marginal (YG) Test A standard GWAS is conducted by testing the null hypothesis βG=0 for each of the M SNPs using a linear regression model of the form 0 G Y G (2.1) Adjustment covariates to control for potential confounders are typically included in the model. A correction [Dudbridge and Gusnanto, 2008] is applied to the p-value (pG) to preserve the family- wise error rate (FWER). In the presence of an environmental factor (E) and a G×E interaction, βG is a weighted average of the corresponding genetic effect in each environmental group, hence the 6 use of the term ‘marginal’ to describe this effect. The same magnitude of βG can result from different underlying patterns of G×E interaction. Interaction Test (G×E) In a follow-up to the primary marginal scan, one can test for G×E interaction using the model 0 G E GE G E G YE (2.2) based on testing the null hypothesis β GE=0 for each of the M SNPs. Let pG×E denote the p-value for the test of G×E interaction. A correction is also needed to preserve the FWER. Two-Step Test with Screening on Marginal G (YG|G×E) In the pursuit of gene-gene (GxG) interactions, Kooperberg and Leblanc [Kooperberg and Leblanc, 2008] developed a 2-step approach that screens SNPs by genetic marginal effects at Step-1 significance level α1. They proposed to formally test for GxG interactions for the subset m M SNPs that pass the Step-1 screen, with a Bonferroni-corrected significance level α/m to preserve the FWER. This approach is based on the observation that most variants involved in the interactions are also likely to display some non-zero (although possibly weak) marginal effect on the phenotype. An analogous screening on marginal effects can also be applied for G×E analysis. We explore the power of this strategy for G×E interactions by screening M SNPs in the first step by genetic marginal effects (model 2.1). One can use either subset testing [Kooperberg and 7 Leblanc, 2008] or an alternative weighted hypothesis testing [Ionita-Laza et al., 2007] in Step 2 (model 2.2). These two hypothesis testing approaches are described in more detail below. Two-Step Test with Screening on Variance Heterogeneity (Var|G×E) An alternative 2-step approach can be constructed by screening for variance heterogeneity across genotypic groups [Paré et al., 2010]. The rationale is that if the effect of a QTL differs depending on an environmental factor, the variability of the quantitative trait among individuals carrying the risk allele will also be different from the variability among non-carriers [see S2.1]. Paré et al. [Paré et al., 2010] proposed to use Levene’s test [Levene H, 1960] for homogeneity of variance across genotypic groups at the screening step for all M SNPs, and then only test for G×E interactions for a subset of SNPs that pass the screening step using model 2.2. They showed that the power of this procedure can be higher than the power of the standard G×E analysis depending on the magnitude of the main effect of the interacting factor E, where the main effect is defined as the effect of E when G=0. However, we show that the correlation between the variance estimator and the G×E interaction estimator is dependent on the marginal effect of E and this correlation will only be equal to zero when no marginal effect of E exists (Details shown in section 2.5.1). In other words, the Step-1 Levene’s test, which is based on comparing variances across G, is correlated with the Step-2 test of G×E in the presence of a marginal effect of E. This correlation violates the basic premise of 2-step testing procedures that requires independence of Steps 1 and 2 for validity [Murcray et al., 2009]. The result, as we will show in our simulation studies, is that the Paré et al. approach can produce an unacceptably high false-positive rate. 8 Two-Step Test with a Screening on Residual Variance Heterogeneity (rVar|G×E) As we show in the Section 2.5.1, the correlation between Levene’s test and the G×E interaction test can be eliminated if the marginal effect of E is removed. Moreover, removing the marginal effect of E will not affect the magnitude of G×E interaction (Section 2.5.4, Fig. S2.1). Based on these two observations, we propose a revised 2-step approach to using variance heterogeneity as follows: Step-0: Given a quantitative trait Y, fit the linear regression model: 0 ii i E YE (2.3) Where i=1,…,N for N individuals in the sample. Step-1: Given K subgroups defined by genotype, test the null hypothesis of equality of variance, H0: σ1= σ2=… σK, where σk is the standard deviation of within the k th subgroup (k=1,…,K, e.g. K=3 for an additive model, K=2 for a dominant model), using Levene’s test defined as: 2 1 2 11 1 k K kk k N K kj k kj N K N Z Z W K Z Z (2.4) where N is the total sample size, Zkj=| kj– k•| where kj is the residual of the regression in (2.3) for j th observation in k th subgroup, and k• is the mean or median of k th subgroup. Zk• is the mean of Zkj for k th subgroup, and Z•• is the mean of all Zkj. Note that Equation 2.4 defined Levene’s 9 test of variance heterogeneity applied to the residuals instead of the original trait Y. The p- value is defined as: rVar 1, P Pr K N K FW (2.5) Where F is the F-distribution with K–1 and N–K degrees of freedom. Step-2: Considering results of the Step-1 screen, test the null hypothesis H0: β GE=0 using model 2.2. The use of residuals removes the marginal E effect and thus eliminates the correlation between the Levene’s test and the test of G×E interaction. This leads to a 2-step procedure that preserves the FWER, as we show by simulation. One can use either subset testing or weighted hypothesis testing in Step 2 (see below). Joint Two-Step Test with Screening Based on Both YG and rVar (Joint|G×E) So far we have seen three pieces of information that can be utilized in the search for G×E interactions: the G×E interaction effect, the marginal G effect, and variance heterogeneity across G. The standard test uses only the G×E interaction effect, Kooperberg and Leblanc [Kooperberg and Leblanc, 2008] use the marginal effect of G to screen and G×E to test, and our modification of Paré et al [Paré et al., 2010] utilizes residual variance heterogeneity to screen and G×E to test. We propose a novel 2-step approach that uses all three of these sources of available information, as follows: Step-0: Remove the marginal E effect and collect residuals from model 2.3, as described for the rVar|G×E approach above. 10 Step-1: Combine the P-value (PG) from marginal G scan and P-value (PrVar) from test of variance heterogeneity using Fisher’s method [Fisher RA, 1932]: int rVar 2 ln P ln P jo G T (2.6) Note that the test of marginal G effect is independent of the test of variance heterogeneity across G (Section 2.5.2). Tjoint therefore follows a chi-squared distribution with 4 degrees of freedom under the joint null H0: σ1= σ2=… σk and βG=0. Step-2: Considering results of the Step-1 screen, test the null hypothesis H0: β GE=0 using model 2.2. The G×E interaction estimator is uncorrelated with both the residual variance per genotype (Section 2.5.1) and with the marginal G estimator [Dai et al., 2012], and so the joint test in Step- 1 is uncorrelated with the Step-2 interaction test. One can either use subset testing or weighted hypothesis testing in Step 2 (see below). Hypothesis Testing Approaches in Step 2 For a 2-step approach, it is important to have an efficient testing scheme in Step 2 to make the best use of the information from Step 1. Previously developed 2-step methods have described both subset testing and weighted-hypothesis testing for Step 2 [Gauderman et al., 2013; Hsu et al., 2012]. In subset testing, a step-1 significance threshold α1 is pre-specified. At the screening step, SNPs with a step-1 p-value less than α1 will be passed into the Step 2 and only that subset of SNPs passed into Step 2 will be formally tested for G×E interaction. A Bonferroni correction for the number of SNPs tested in Step 2 will be utilized. Simulation studies 11 suggest that the power of this 2-step approach depends strongly on the choice of α1 (Gauderman et al, 2013). A larger value of α1 will increase the chance of passing a true trait-related SNP into Step 2, but at the expense of also passing more null markers leading to a larger penalty for multiple testing. A lower value of α1 will produce a smaller subset of SNPs tested in Step 2 and less-severe multiple testing burden, but at the possible cost of screening out a truly associated SNP. Instead of specifying a significance threshold α1 to screen out SNPs, Ionita-Laza et al. [Ionita-Laza et al., 2007] proposed a weighted hypothesis testing scheme. At the screening step, all M SNPs are ordered by their step-1 p-values. At Step 2, the M SNPs are assigned with different weights to the significance level based on their rankings in Step 1. Specifically, an initial bin size B is pre-specified and the B most significant SNPs based on step-1 P-values are tested at significance level α/2B , the next 2B SNPs are tested at significance level α/[2 2 (2B)], the next 4B SNPs will be tested at significance level α/[2 3 (2 2 B)], etc. The weighted testing schema follows the intuition that SNPs with strong step-1 evidence are more likely to be truly involved in a G×E interaction, and thus should be ‘rewarded’ with a less stringent significance threshold than the standard Bonferroni correction. However, this approach ‘punishes’ those SNPs with lower step-1 rankings by requiring a stricter significance threshold than the standard Bonferroni. 2.2 Simulation Study We use simulation to evaluate Type I error rates and compare power of all methods discussed above. We consider two scenarios for the data generating model. In scenario 1, Y is a 12 function of a marginal effect of G (the QTL), a marginal effect of E1 and an interaction effect between G and E1 as follows: 1 1 1 1 0 1 1 G G E E GE G E Y G E G E (2.7) Here we assume μG and μE1 are the population means of the covariates G and E1. For example, if G is coded according to a dominant model (G = 1 for AA or Aa genotype, and G=0 for aa genotype), then μG=qA(2–qA). For simplicity, we assume E1 is a binary indicator of exposure, so that μE1=PE1, is the population prevalence of exposure. In scenario 2, we assume Y is related to two uncorrelated exposures E1 and E2, according a model that includes marginal effects and two G×E interactions as follows: 1 1 1 1 2 2 2 2 2 0 1 1 2 G G E E GE G E E E GE G E Y G E G E E G E (2.8) Similarly, μE2 is the population mean of E2 and the components, G-μG, E1-μE1, (G-μG)×(E1-μE1), E2-μ E2, (G-μG)×(E2-μE2), in model 2.8 are pairwise uncorrelated due to the centering of each variable on its respective mean. The variance of Y conditional on G can be written as: 0 1 1 1 1 1 1 2 1 2 2 2 22 2 1 1 1 1 2 2 2 2 2 | [ ( ) ( ) ( ) ( ) ( ) ( ) ( ) ] ( ) 1 ( ) 1 G G E E GE G E E E GE G E E GE G E E E GE G E E Var Y G Var G E G E E G E G P P G P P (2.9) 13 In this model, βG is the marginal effect of G, βE1 and βE2 are marginal effects of E1 and E2 respectively, βGE1 and βGE2 are the two interaction effects, and ԑ is assumed to have a normal distribution with mean 0 and variance σ 2 . The expression for the variance and interpretations of the parameters are similar for the simpler model in scenario 1. Note that in scenario 2, although the true model includes E2 and G×E2, we assume that the focus is still only on G×E1 interaction, and that data for E2 are not part of the available data, to mimic the typical scenario in a GWIS that we only have limited data on environment. For both scenarios, we specify a ‘base’ model that includes a QTL with qA=0.1635 and a dominant model so that 30% carry at least one “A” allele. The exposure variables E1 and E2 were both assumed to be binary variables with PE1= PE2=0.3. Without loss of generality, we assumed the total variance of Y to be 1.0 and used the strategy proposed by Gauderman [Gauderman, 2003] to partition the variance. Specifically, the proportion of the total variance of Y explained by the marginal effect of G (QTL) was defined as R 2 G=β 2 GVar(G–μG) where Var(G–μG)=Var(G)=qA(2–qA)(1–qA) 2 . Similarly, the proportion of the total variance of Y explained by the marginal effect of E1 was defined as R 2 E1=β 2 E1Var(E1–μ E1) where Var(E1–μE1)=Var(E1)=PE1(1– PE1). The quantity R 2 GE1= β 2 GE1Var[(G–μG)(E1–μE1)] gives the proportion of total variance explained by the interaction effect between G and E1, where Var[(G–μG)(E1–μE1)]=2PE1(1– P E1)qA(2–qA)(1–qA) 2 . The analogous quantities R 2 E2=β 2 E2Var(E2– μE2) and R 2 GE2= β 2 GE2Var[(G–μG)(E2–μE2)] denote the proportion of variance explained by the marginal effects of E2 and G×E2 interaction, respectively. The error ԑ was assumed to have a normal distribution with mean 0 and variance σ 2 =1 – (R 2 G+ R 2 E1+ R 2 GE1) for scenario 1 and σ 2 =1 – (R 2 G+ R 2 E1+ R 2 GE1+ R 2 E2+ R 2 GE2) for scenario 2. For scenario 1, our base model has a modest interaction effect with no effect of G when E=0 and no effect of E when G=0, and with marginal G and E effect sizes of R 2 G= R 2 E1=0.17% and interaction effect size R 2 GE1=0.4%. For scenario 2, 14 we used the same settings as for Scenario 1, but add a marginal effect of E2 (R 2 E2=2%) and an interaction effect between G and E2 (R 2 GE2=2%). In all simulations we generated 1,000 replicate data sets per simulation. For all of the 2-step approaches, we adopted weighted hypothesis testing and assumed the initial bin size of B=5. Type 1 Error We evaluated the Type I error rate for both scenarios 1 and 2. We defined non-QTL (null) SNPs as those have neither marginal G nor G×E1 effect. We performed multiple simulations varying the magnitude of R 2 E1 from 0 to 4% by 2%. For scenario 2, one SNP was generated to have G×E2 interaction effect (but no marginal G and G×E1 effect) and R 2 E2 and R 2 GE2 were fixed at 2%. We also considered an alternative model in which 10 loci had a marginal (but no G×E1 or G×E2) effect on the trait, with R 2 G for those 10 loci randomly sampled from a uniform distribution in the range 0.17%-0.43%. In each replicated data set, we simulated 10,000 null SNPs with 6,000 study individuals. For each of the SNPs we randomly sampled an allele frequency from a uniform distribution in the range 0.1 – 0.4. The Type I error rate for each procedure was estimated as the proportion of replicates in which at least one of the 10,000 non- QTL SNPs was declared statistically significant at a FWER of α=0.05. Power Comparison To estimate power, we simulated 1,000 replicate datasets, each including 1 million SNPs and 6,000 study subjects. One SNP was designated as the QTL, first generated according to the base model and then using alternative models to examine sensitivity of the power comparisons. We performed GWIS using all the methods described above. The marginal effect of E1 is 15 expected to affect neither the power to detect the marginal G effect nor the power to detect G×E1 interaction. Moreover, since the marginal effect of E1 needs to be removed in the rVar|G×E and Joint|G×E methods, the magnitude of βE1 is also expected not to affect the power of the test of variance heterogeneity. Thus, we kept R 2 E1 fixed at 0.17% for all scenarios. We varied R 2 G from 0 to 0.425% by 0.085%. These settings yielded a wide range of interaction models, including both quantitative (effects of G in same direction depending on E1) and qualitative (effects of G in opposite directions depending on E1) interaction models (Table S2.1 in Section 2.5.4). We also performed multiple simulations varying R 2 GE1 from 0.10% to 0.85%. In scenario 2, we performed additional simulations varying the magnitudes of R 2 E2 in the interval [0, 10%], and R 2 GE2 in the interval [0, 5%]. To further explore the robustness of our power comparisons, we also considered alternative models with different QTL minor allele frequency (qA) and exposure prevalence (PE1 and PE2). The power for each method was estimated as the proportion of replicates in which the QTL was declared as statistically significant at a FWER of α=0.05. For each model, we also estimated the power to detect the marginal effect of the QTL, to quantify the chance that the QTL would be identified by the primary G-only scan. 2.3 Results Type 1 Error Both Table 2.1 and Table S2.1 (Section 2.5.4) show that when there is no marginal effect of E1 (R 2 E1), all the methods maintain the correct test size. However, the Type 1 error rate of the Var|G×E approach (Paré et al.) is inflated to unacceptable levels as the marginal effect of E1 increased. The standard G×E test, and the YG|G×E, rVar|G×E and Joint|G×E 2-step methods 16 achieve the nominal Type 1 error rate, whether or not there is a marginal effect of E1. The Type I error rates are similar for all methods under either scenario 1 or 2. Table 2.1: Type 1 error rates for tests of G×E1 interaction when non-QTL SNPs have neither marginal G effect nor G×E1 interaction effect R 2 E1 Method 0.00 0.02 0.04 Exhaustive G×E Scenario 1 0.046 0.046 0.048 Scenario 2 0.042 0.043 0.043 2-step Var|G×E Scenario 1 0.049 0.159 0.264 Scenario 2 0.048 0.134 0.218 YG|G×E Scenario 1 0.054 0.054 0.054 Scenario 2 0.047 0.048 0.047 rVar|G×E Scenario 1 0.053 0.053 0.050 Scenario 2 0.048 0.049 0.047 Joint|G×E Scenario 1 0.053 0.053 0.048 Scenario 2 0.040 0.041 0.041 Scenario 1 involves E1 only. Scenario 2 involves E1 and E2 with R 2 E2 and R 2 GE2 fixed at 2%. Each estimate of Type 1 error is based on the proportion of 1,000 replicate data sets for which the indicated procedure identified at least one statistically significant result among 10,000 non-QTL SNPs, with Type 1 error rate set to be 0.05. Weighted hypothesis testing is used for 2-step approaches. Significantly inflated Type 1 error is indicated in bold Power Comparison Figure 2.1 shows the results when G×E1 is the only interaction contributing to the heritability of Y(scenario 1). The power to detect an interaction of magnitude R 2 GE1=0.4% with 6,000 individuals is quite low (30%) using the standard G×E analysis (Fig. 2.1A). The rVar|G×E method is the least powerful method (2%) under this scenario, almost always failing to detect the 17 QTL. Power for these two methods was nearly independent of the size of the marginal G effect (R 2 G). On the other hand, power for those methods that utilize marginal G information in the screen (YG|G×E and Joint|G×E) depend strongly on the size of R 2 G. When there is no marginal G effect, YG|G×E and Joint|G×E are less powerful than the standard G×E scan. However, both of these 2-step methods outperform the standard G×E scan when QTL has even a small-sized marginal G effect (R 2 G>0.085%). The YG|G×E method is the most powerful method when there is a small to moderate marginal genetic effect and is always more powerful than the Joint|G×E method. The Joint|G×E method is less powerful because the test of variance heterogeneity here has low power and the Joint|G×E method pays a price for incorporating an inefficient source of information in the screen. Note that, even when R 2 G=0.425%, the marginal G scan only has a power of 33% to detect the QTL compared to 91% for the YG|G×E method. Holding the marginal G effect constant at R 2 G=0.17%, power for all the procedures increases as the magnitude of the interaction effect increases (Fig. 2.1B), as we would expect. The rVar|G×E method again performs poorly in this situation. For example, when R 2 GE1=0.85%, power for rVar|G×E is 57% compared to 95% for the standard G×E scan. The YG|G×E and Joint|G×E are more powerful than the standard G×E scan when the magnitude of R 2 GE1 is small to moderate (R 2 GE1 ≤0.55% for YG|G×E and R 2 GE1 ≤0.4% for Joint|G×E). The standard G×E scan becomes the most powerful when R 2 GE1 ≥ 0.7%. The YG|G×E method still provides greater power than the Joint|G×E method across all the magnitudes of R 2 GE1 considered, with a gain of power in the range 2%-9%. The power of the marginal G scan is very low (1%), and is independent of the magnitude of interaction effect. 18 A B Figure 2.1. Power to detect G×E1 interaction in the presence of one interaction effect, with 6,000 individuals. (A) Power is presented across a range of magnitudes for the marginal genetic effect (R 2 G) with R 2 GE1=0.4%. (B) Power is presented across a range of magnitudes for G×E1 interaction (R 2 GE1) with R 2 G=0.17%. Figure 2.2 presents the power to detect G×E1 when two G×E interactions (G×E1 and G×E2) contributing to the heritability of Y (Scenario 2). Unlike the results for scenario 1, the rVar|G×E method provides higher power (by about 10%) than the standard G×E scan across a range of marginal G effect sizes (Fig. 2.2A). The Joint|G×E method is more powerful than the YG|G×E method across all the sizes of R 2 G evaluated, with a gain of power from 4% to 27% (Fig. 2.2A). Note that when the magnitude of marginal genetic effect is small (R 2 G=0.085%), the power for YG|G×E method is about 9% lower than the standard G×E scan, while the Joint|G×E method provides power that is 26% higher than the standard G×E scan. As shown in Fig 2.2B, the standard G×E scan has the least power when the size of R 2 GE1 is small to moderate, in the range of 0.1%-0.55%. The power for rVar|G×E and YG|G×E is about the same across the sizes of R 2 GE1 evaluated. The Joint|G×E method again provides most power 19 across all the sizes of R 2 GE1 evaluated. For example, when R 2 GE1=0.4% the power for the Joint|G×E method is 72%, compared to 43% for YG|G×E, 39% for rVar|G×E and 27% for the standard G×E scan. When R 2 GE1=0.7%, power for the standard G×E scan is about 4% higher than both YG|G×E and rVar|G×E, while Joint|G×E still provides about 9% higher power than the standard G×E scan. Power for YG|G×E and standard G×E scan is independent of the size of R 2 E2 (Fig. 2.2C). On the other hand, power of the 2-step methods that utilize variance heterogeneity information in their screening step (rVar|G×E and Joint|G×E) depends strongly on the magnitude of R 2 E2. When R 2 E2=0, there is little heterogeneity in variance across genotype and as a result, rVar|G×E performs poorly, with power 6% compared to 30% for the standard G×E scan. The Joint|G×E and rVar|G×E methods begin to outperform YG|G×E and the standard G×E scan when R 2 E2>0, with a gain in power increasing with the magnitude of R 2 E2. For example, when R 2 E2 is 0.4%, the powers are 87% for Joint|G×E and 66% for rVar|G×E, compared to 46% for YG|G×E and 30% for the standard G×E scan. Similar trends are observed across a range of different values for R 2 GE2 (Fig. 2.2D). When there is no G×E2 interaction (R 2 GE2=0), the rVar|G×E method has almost no power to detect the QTL (2%). The Joint|G×E method provides a power slightly greater than the standard G×E scan but less than the YG|G×E. The power for the two 2-step approaches that utilize variance heterogeneity information in their screening step increases as the magnitude of the R 2 GE2 increases. The Joint|G×E method again is the most powerful method when R 2 GE2 ≥1%. The power for the 2-step approaches that utilize the variance heterogeneity information also depends on the magnitude of qA for the QTL (Section 2.5.4, Fig. S2.2A). Note that the proportions of variance explained by each component (G, E1, E2, G×E1 and G×E2) are fixed 20 under the base model (See ‘2.2 Simulation Study’ for details). The power for the Joint|G×E and rVar|G×E drops as the magnitude of the minor allele frequency qA increases. The Joint|G×E method remains as the most powerful across the range of qA evaluated (0.01-0.25). The power of all the methods is independent of the size of PE1 and PE2 (Section 2.5.4, Figs. S2.2B, S2.2C). For the 2-step methods, all of the above results are based on weighted hypothesis testing. We also examined the power using subset testing, considering a range of step-1 significance threshold (α1). We explored the power varying the model settings around the base model. Table S3 (Section 2.5.4) shows that the power of Joint|G×E using subset testing depends strongly on the choice of α1 and the optimal magnitude of α1 varies across the underlying models. When the information (marginal G association or residual variance heterogeneity) used in the screening step is weak, for example, when any of R 2 G, R 2 E2 or R 2 GE2 is set to be 0, the highest power for subset testing occurs with a loose step-1 threshold (α1=0.05 or 0.01) and is higher than the power for the weighted testing. However, when any of R 2 G, R 2 E2 or R 2 GE2 has a relatively large value, a strict step-1 threshold (α1=0.0001) leads to the highest power for the subset testing and no choice of α1 for a subset testing leads to as much power as can be achieved using weighted hypothesis testing. 21 A B C D Figure 2.2. Power to detect G×E1 interaction in the presence of two interaction effects, with 6,000 individuals. (A) Power is presented across a range of magnitudes for the marginal genetic effect (R 2 G) with R 2 GE1=0.4%, R 2 E2= R 2 GE2=2%. (B) Power is presented across a range of magnitudes for G×E1 interaction (R 2 GE1) with R 2 G=0.17%, R 2 E2= R 2 GE2=2%. (C) Power is presented across a range of magnitudes for the marginal effect of E2 (R 2 E2) with R 2 G=0.17%, R 2 GE1=0.4%, R 2 GE2=2%. (D) Power is presented across a range of magnitudes for G×E2 interaction (R 2 GE2) with R 2 G=0.17%, R 2 GE1=0.4%, R 2 E2=2%. 22 2.4 Discussion As demonstrated by simulation, the previously proposed 2-step G×E method that screens on variance heterogeneity [Paré et al., 2010] has inflated Type 1 error in presence of a marginal E effect. In this paper, we proposed an alternative that uses variance heterogeneity and preserves the Type 1 error rate whether or not there is a marginal effect of E. We also proposed a novel 2- step method that utilizes both variance heterogeneity and the marginal G effect in a joint screening test. However, when only one G×E interaction is involved in the underlying trait model, our simulations demonstrated that neither the variance heterogeneity nor the joint approach lead to increased power to detect G×E interaction compared to the standard test of G×E interaction. However, these two new methods can provide greater power than the standard test across a wide range of models when the trait depends on two (or more) G×E interactions. When only one interaction with a modest effect size is involved in the development of Y (scenario 1), the variance of Y conditional on G only depends on G×E1 interaction effect βGE1 and marginal effect of E1 βE1 (See equation 2.9). The magnitude of variance heterogeneity across G also depends on βE1 and βGE1 if βGE1≠0. When βE1 is removed, as required in the 2-step approaches that utilize the test of variance heterogeneity in their screening step (Joint|G×E and rVar|G×E), the magnitude of variance heterogeneity is reduced which makes the screening steps of those methods have very limited power to pass the true QTL to the testing step. As a result, power of the 2-step approaches that utilize the test of residual variance heterogeneity will be diminished due to sacrificing degrees of freedom for utilizing an inefficient source of information. The YG|G×E method is generally the most powerful method in the presence of one interaction. The YG|G×E method can outperform the standard G×E test when even a small 23 marginal G effect is present, with increasing gains in power as the marginal G effect increases (Figure 2.1). When two interactions (G×E1 and G×E2) are involved in the development of Y (scenario 2), the test of variance heterogeneity begins to play a part and the rVar|G×E method can provide higher power than the standard G×E test (Fig. 2.2). That is because when βE1 is removed, the magnitude of variance heterogeneity across G still depends on β GE1, βE2 and βGE2. If the sizes of βE2 and βGE2 are large enough, then the magnitude of variance heterogeneity remains an efficient source of information even when the size of βGE1 is small. As demonstrated by simulation, the rVar|G×E method can outperform YG|G×E method when the size of marginal G effect is small. When interactions induce both a marginal G effect and variance heterogeneity across G, the Joint|G×E method can prioritize SNPs for step-2 testing more efficiently than either the YG|G×E or rVar|G×E method. Note that in this paper E1 and E2 are simulated to be independent. It would be of interest to evaluate our new methods in the presence of different relationships between E1 and E2 (positively correlated and negatively correlated), which could be a subject of a future paper. Note that the 2-step approaches for a case-control sample [Murcray et al., 2009, 2011; 2008; Hsu et al., 2012; Gauderman et al, 2013] are all based on the observation that there is an induced correlation between environment and genetic (E-G) in the combined case-control sample due to the over ascertainment of cases from the source population in the presence of G ×E interaction (e.g., Unmatched case-control with 1 control per case). However, in a scan for quantitative trait loci (QTLs), we typically draw a random sample from the source population without regard for phenotype, and thus there is no induced E-G correlation in the sample. Consequently, unlike the case-control sample, screening on the E-G correlation cannot provide 24 us with information on potential G×E interaction. The previously proposed joint/hybrid approaches [Murcray et al., 2009, 2011; 2008; Hsu et al., 2012; Gauderman et al, 2013] are all in the context of a disease trait. In this paper, we adopt the idea of combining useful information in the screening step that was developed for disease traits in the development of a quantitative trait. To our knowledge, ours is the first paper to propose a joint screening test for a quantitative trait. This is also the first paper to our knowledge that has studied the potential gains in power that can be obtained by using marginal screening (YG|GxE) for a quantitative trait, as Kooperberg and LeBlanc (2008) focused their evaluation on case-control data . While we described our methods in the context of a binary environmental factor, we note that the “E” can be replaced by a continuous variable, or a pre-specified candidate gene. The Levene’s test of variance heterogeneity can also be extended to imputed data (Section 2.5.3). It is widely accepted that the etiology of many complex traits involves not only genetic and environmental factors but also interactions between the two as well as G×G interactions. In this paper, we described two new methods for detecting interactions for a quantitative trait in a genomewide setting. Both methods use a novel test of variance heterogeneity that does not depend on the size of marginal G effect. These two new methods, Joint|G×E and rVar|G×E, have the potential to identify novel SNPs that have been missed from primary GWAS scans. 25 2.5 Supplemental Materials Correlation Between the Variance-per-genotype and G×E-interaction Estimators Suppose y 1=( y11 y12 … y1N1 ) T and y 2=( y21 y22 … y 2N2 ) T are two vectors of phenotypic values for N1 risk-allele carriers and N2 non-carriers respectively under a dominant model. Assume y1 and y 2 have a multivariate normal distribution as follows: 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 0 ~ , 0 N I I y x x y x x (S1) Where x1=( x11 x12 … x1N1 ) T and x2=( x21 x22 … x2N1 ) T are two vectors of environmental factor values (E) for carriers and non-carriers respectively, and β1 and β2 are environmental effects among carriers and non-carriers respectively. ԑ1=( ԑ11 ԑ12 … ԑ1N1 ) T and ԑ2=( ԑ21 ԑ22 … ԑ2N2 ) T are two vectors representing error terms and we assume variances are constant across all gene- environment combinations (denote as σ 2 ). The maximum likelihood estimator of βi, i=1,2 is given by: 1 1 11 1 ˆ i i i i TT TT T T T T TT i i i i i i i i i i i i i i i i i i i i i x x x y x x x x x x x x x x x x x x (S2) The interaction effect is the difference between β1 and β2 and the corresponding maximum likelihood estimator can be obtained as: 26 12 ˆ ˆ ˆ GE (S3) Under the null hypothesis of no interaction, β1 is equal to β2 and the expectation of equation 3 will be equal to 0. We further let σ1 2 and σ2 2 denote the within-genotype variances of y for carriers and non-carriers respectively. The maximum likelihood estimator of σi 2 , i=1,2 is defined as: TT T 2 TT T T T T 2 )) 1 11 ( ) ( ) ( ) ( ˆ ) i i i i ii ii i i i i i i ii ii i i i i i i ii NN N NN NN y 1 y 1 yy y 1 y 1 yy 11 i i i i i i i i ( - ( - x x x x (S4) Where 1i=[1 …1 ] T Niˣ1,i=1,2. Then the covariance between 𝛽 ̂ G×E and 𝜎 ̂ 2 1 under the null hypothesis of no G×E interaction can be written as: 27 2 22 11 22 1 2 1 2 1 1 22 1 2 1 1 1 2 2 1 2 1 2 1 ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ , Since =0 under the n ˆ ˆ ˆ ull hypothesi ˆ ˆ s GE G E G E i Cov E E E E E E E E E EE 2 1 11 22 1 1 1 1 1 2 2 2 2 2 1 1 2 1 2 2 11 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 11 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 ˆ 2 2 1 NN N T T T T TT T T T T T T T T T T T T T 11 E E x x x x x x E x x x x E x x x x x x x x x x x x E xx 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 1 2 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 1 N N T T T T T T T T T T T T T T T T T T T T T T 1 1 1 1 1 1 1 1 x x x x x E x x x x x x x x x x x x x x x x E 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 2 2 2 2 1 1 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 2 1 1 1 1 2 2 2 2 1 2 1 1 1 2 2 2 1 2 2 2 1 2 N N T T T T T T T T T T T T T T T T T T T T T T T T T 11 E x x x x x x x x x x x x x x x x E x x x E x x x E x x x x x E x x x E x x 1 1 11 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 12 2 2 2 1 1 1 1 1 1 1 1 1 1 11 12 1 2 11 1 2 2 2 1 1 1 11 11 12 1 22 12 1 1 1 1 1 2 2 11 1 2 2 2 , , ,2 1 , 1 2 N N NN N NN N x x x x N xx T T T T T T T T T T T TT 11 1 1 x x E x x x x E x x x E x x x x E E E x x E E E x 1 1 1 1 11 12 1 2 2 2 2 2 2 2 2 11 2 2 2 11 2 1 12 11 2 1 2 11 2 11 12 2 1 1 1 11 12 1 1 1 1 1 1 2 11 2 1 1 1 Since ~ , 1,2; 1 2 2 ,2,..., ,and 2 N N ij N N ij x x x x x x i j N N N N T T T T 1 E E xx E E E x x x x x E E 11 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 2 2 1 2 2 2 2 2 2 N NN T T T T TT 1 1 x x x x x x x x x x (S5) 28 Similarly, the covariance between 𝛽 ̂ G×E and 𝜎 2 2 under the null hypothesis of no G×E interaction can be written as: 2 2 2 2 1 2 22 2 2 2 2 2 2 2 ˆ 2 ˆ,2 GE NN TT 1 Cov x x x (S6) The variance heterogeneity in Y across G can be used to prioritize SNPs for further interaction testing in the step-1 screen from a 2-step approach. Therefore, we are interested in the difference between variances across G. Under the null hypothesis of no interaction, β1= β2= βE, where βE is the marginal effect of the environmental factor E. Hence, the covariance between 𝛼 ̂ 2 1-𝛼 ̂ 2 2 and 𝛽 ̂ G×E under the null hypothesis has the form as: 1 1 12 2 2 22 22 2 2 2 22 22 21 2 22 22 2 22 11 2 1 1 1 1 22 21 11 2 1 1 1 2 12 1 22 21 ˆ , ˆ ˆ 1 ˆ ˆ ˆ ˆ ,, 2 2 22 2 1 GE G E G E E E EE E N N N N NN N N N N T T T T T T T T 11 11 Cov Cov Cov x x x x x x x x x x x x (S7) Therefore, under the null hypothesis of no interaction, the within genotype variance will be correlated with the G×E interaction estimator when the true marginal effect of environmental factor is not equal to 0. For this reason, we argue that the approach proposed by Paré et al. is not valid in the presence of a marginal effect of the environmental factor. This is important because most investigators will focus their search for G×E interactions on environmental factors that 29 have previously been shown to have an effect (a marginal effect) on the trait of interest. The two estimators will be uncorrelated if the marginal effect of environmental factor is equal to 0. Variance-per-genotype Estimator is Uncorrelated with Marginal G Estimator Suppose y 1=( y11 y12 … y1N1 ) T and y 2=( y21 y22 … y 2N2 ) T are two vectors of phenotypic values for N1 risk-allele carriers and N2 non-carriers respectively under a dominant model. Assume y1 and y 2 have a multivariate normal distribution as follows: 2 1 1 1 1 1 1 11 2 2 2 2 2 2 2 22 0 ~ , 0 N 11 I 11 I y y (S8) Where 1i=[1 …1 ] T Niˣ1,i=1,2. μ1 and μ2 are mean values for y among carriers and non-carriers respectively. ԑ1=( ԑ11 ԑ 12 … ԑ1N1 ) T and ԑ2=( ԑ 21 ԑ 22 … ԑ2N2 ) T are two vectors representing error terms with mean 0 and variance σ1 2 and σ2 2 for carriers and non-carriers. The maximum likelihood estimator of marginal genetic effect (denoted as β G) is given by: T T T T 1 1 2 2 12 T T T T 1 1 2 2 12 12 ˆ G yy NN NN 11 11 (S9) The maximum likelihood estimator of σi 2 , i=1,2 is defined as 30 TT T 2 TT T T T T 2 )) 1 ˆ i i i i ii ii i i i i i i ii ii i i i i i i ii NN N NN NN y 1 y 1 yy y 1 y 1 yy 11 ( - ( - (S10) Then the covariance between 𝛽 ̂ G and 𝜎 ̂ 2 1 can be written as: 2 2 2 1 1 1 T T T T T 2 1 1 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 ˆ ˆ ˆ , ˆ ˆ ˆ 1 0 G G G N N N N N N 1 1 1 1 1 Cov E E E E (S11) Similarly, the covariance between 𝛽 ̂ G and 𝜎 2 2 is equal to 0. Hence, the covariance between 𝛼 ̂ 2 1- 𝛼 ̂ 2 2 and 𝛽 ̂ G is also equal to 0. 1 1 22 2 22 2 ˆ , ˆ ˆ ˆ ˆ ,, ˆ ˆ 0 G GG Cov Cov Cov (S12) 31 Levene’s Test on Imputed Data Suppose i is the residual of regression model 0 i E i i YE for i th observation, where Y is the quantitative trait, E is the environmental variable and i=1,...,N. Let pi1, pi2 and pi3 denote the probabilities obtained from genotype imputation for Gi=0, 1 and 2 respectively, where Gi is the genotypic value for i th observation. These probabilities can be produced by commonly used imputation software such as IMPUTE2 [Howie et al., 2009] and MACH [Li et al., 2009, 2010]. The Levene’s test statistic for imputed data is defined as: 2 1 2 11 1 K kk k KN ik ik k ki N K N Z Z W K p Z Z (S13) Where K is the number of subgroup defined by genotype ( e.g. K=3 for an additive model, K=2 for a dominant model) pik is probability subject i has genotype k 𝑁 is the number of subjects 𝑁 𝑘 = ∑ 𝑝 𝑖𝑘 𝑁 𝑖 =1 𝜏 ̅ 𝑘 = 1 𝑁 𝑘 ∑ 𝑝 𝑖𝑘 𝜏 𝑖 𝑁 𝑖 =1 𝜏 ̃ 𝑘 = 𝜏 𝑚 where ∑ I(𝜏 𝑖 ≤ 𝜏 𝑚 )𝑝 𝑖𝑘 ≥ 0.5 and 𝑁 𝑖 =1 ∑ I(𝜏 𝑖 ≥ 𝜏 𝑚 )𝑝 𝑖𝑘 ≥ 0.5 𝑁 𝑖 =1 𝑍 𝑖𝑘 = { |𝜏 𝑖 − 𝜏 ̅ 𝑘 |, 𝜏 ̅ 𝑘 is the weighted mean of the 𝑘 th group |𝜏 𝑖 − 𝜏 ̃ 𝑘 |, 𝜏 ̃ 𝑘 is the weighted median of the 𝑘 th group 𝑍 .. = 1 𝑁 ∑ ∑ 𝑝 𝑖𝑘 𝑍 𝑖𝑘 𝑁 𝑖 =1 𝐾 𝑘 =1 , the weighted mean of all Zik 𝑍 𝑘 . = 1 𝑁 𝑘 ∑ 𝑝 𝑖𝑗 𝑍 𝑖𝑘 𝑁 𝑖 =1 , the weighted mean of the Zik for group k 32 The p-value is defined as: 1, P Pr K N K FW (S14) Where F is the F-distribution with K-1 and N–K degree of freedoms. Supplemental Tables and Figures Table S2.1: Models of G×E1 interaction that produce the indicated marginal genetic effect size (βG) R 2 G (%) βG βG|E1=0 βG|E1=1 Type 0.000 0.00 -0.09 0.21 0.085 0.06 -0.03 0.27 0.170 0.09 0.00 0.30 0.225 0.11 0.02 0.32 0.340 0.13 0.04 0.34 0.425 0.14 0.05 0.35 Exposure-specific genetic effect (βG|E1) that produce the indicated marginal βG, for a moderate interaction (R 2 GE1=0.4%) with qA=0.1635 and PE=0.3. The “Type” column shows the pattern of mean of Y for carriers (G=1, dashed line) relative to noncarriers (solid line) in subjects with E1=0 (left side of each graph) or E1=1 (right side). 33 Table S2.2: Type 1 error rates for tests of G×E1 interaction when 10 non-QTL SNPs have a marginal G effect but no G×E1 interaction effect R 2 E1 Method 0.00 0.02 0.04 Exhaustive G×E Scenario 1 0.049 0.050 0.050 Scenario 2 0.042 0.043 0.042 2-step Var|G×E Scenario 1 0.055 0.150 0.247 Scenario 2 0.045 0.145 0.234 YG|G×E Scenario 1 0.050 0.055 0.056 Scenario 2 0.050 0.048 0.050 rVar|G×E Scenario 1 0.054 0.054 0.054 Scenario 2 0.047 0.048 0.047 Joint|G×E Scenario 1 0.055 0.045 0.048 Scenario 2 0.046 0.049 0.049 10 SNPs with R 2 G randomly sampled from a uniform distribution in the range 0.170%-0.425%. Scenario 1 involves E1 only. Scenario 2 involves E1 and E2 with R 2 E2 and R 2 GE2 fixed at 2%. Type 1 error is estimated as the proportion of 1,000 replicate data sets for which the indicated procedure identified at least one statistically significant result among 10,000 non QTL SNPs. Weighted hypothesis testing is used for 2-step approaches. Significantly inflated Type 1 error is indicated in bold. Figure S2.1. Variance heterogeneity can be induced by G×E interaction. 34 Table S2.3: Power using subset testing for Joint|G×E across a range of step-1 thresholds (α1) compared to the power of weighted hypothesis testing α1: Expected number of markers passing to Step2 b Model d 0.05:50,000 0.01:10,000 0.005:5,000 0.001:1,000 0.0005:500 0.0001:100 Weighted testing c Base a 0.570 0.639 0.642 0.610 0.576 0.487 0.716 R 2 G (%) 0.000 0.392 0.337 0.309 0.233 0.205 0.123 0.298 0.085 0.446 0.502 0.516 0.471 0.433 0.343 0.539 0.250 0.481 0.602 0.651 0.728 0.744 0.722 0.845 0.340 0.514 0.637 0.687 0.783 0.809 0.855 0.934 0.425 0.494 0.640 0.698 0.801 0.828 0.892 0.963 R 2 GE1 (%) 0.10 0.008 0.011 0.017 0.039 0.056 0.093 0.155 0.25 0.159 0.233 0.272 0.352 0.377 0.406 0.497 0.55 0.775 0.834 0.830 0.809 0.771 0.662 0.878 0.70 0.928 0.927 0.908 0.823 0.771 0.660 0.945 0.85 0.979 0.952 0.929 0.856 0.809 0.690 0.981 R 2 E2 (%) 0 0.431 0.445 0.419 0.325 0.286 0.207 0.400 4 0.517 0.634 0.681 0.748 0.757 0.739 0.865 6 0.527 0.639 0.690 0.773 0.808 0.850 0.928 8 0.509 0.635 0.686 0.790 0.825 0.868 0.948 10 0.504 0.631 0.684 0.794 0.832 0.900 0.969 R 2 GE2 (%) 0 0.395 0.383 0.358 0.282 0.227 0.150 0.341 1 0.470 0.524 0.519 0.461 0.443 0.342 0.538 3 0.510 0.630 0.661 0.723 0.731 0.738 0.861 4 0.513 0.621 0.674 0.791 0.824 0.860 0.941 5 0.499 0.620 0.681 0.787 0.832 0.907 0.960 qA 0.01 0.478 0.572 0.614 0.722 0.767 0.843 0.925 0.05 0.515 0.654 0.704 0.801 0.826 0.866 0.945 0.10 0.529 0.641 0.670 0.762 0.780 0.749 0.863 0.15 0.511 0.595 0.622 0.672 0.658 0.599 0.747 0.20 0.490 0.577 0.584 0.580 0.558 0.480 0.649 0.25 0.503 0.553 0.580 0.549 0.508 0.412 0.637 a Base model is in the presence of two G×E interaction effects (See section ‘2.2 Simulation Study’ for details). b 1 is the significance threshold used for the Step-1 screen. The expected number of SNPs passing to Step 2 is based on screening 1 million SNPs in Step 1. c Initial bin size K=5 d Selected models were varied around the base model. Highest power for subset testing in each setting is indicated in bold with underline Highest power in each setting is indicated in blue font 35 A B Figure S2.2. The removal of the marginal effect of E. (A) The magnitude of G×E interaction will not be affected. (B) The magnitude of marginal G effect will not be affected. Summary of Figure S2.2: We assume a binary exposure variable E (E=0,1) and a dominant genotype model (G=0,1). We further assume a pure interaction pattern which is presented in Figure S2.1A. Line 1 corresponds to study individuals with E=1 and line 2 to individuals with E=0. The interaction effect is equal to a+b. Removing the marginal effect of E is equivalent to moving the line 1 down to the line 3 and the interaction effect then is equal to b+c. Since a=c, 36 a+c=b+c. Therefore, removing the marginal E effect will not affect the magnitude of the interaction effect. Similarly, in Figure S2.1B, Line 1 corresponds to individuals with E=1, line 2 to individuals with E=0, and line 3 is for E=1 and E=0 combined. The marginal effect of G is equal to a+b. Removing the marginal effect of E is equivalent to moving the line 1 down to line 4. The line 5 then is for the E=1 and E=0 groups combined. In this model, the marginal G effect is equal to b+c. Since a=c, a+b=c+b. Therefore, the marginal effect of G will not be affected by removing the marginal effect of E. Figure S2.3. Power to detect G×E1 interaction in the presence of two interaction effects, with varied qA, PE1 and PE2. Proportions of total variance explained by each component were fixed at R 2 G=0.17%, R 2 GE1=0.4%, R 2 E2= R 2 GE2=2%. (A) Power is presented across a range of MAF for the QTL (qA). (B) Power is presented across a range of prevalence for E1. (C) Power is presented across a range of prevalence for E2. 37 Chapter 3: Software Development 3.1 General Description G×EScan performs a genomewide scan for gene-environment (G×E) interaction for both case-control sample [Gauderman et al, 2013] and/or quantitative traits. For quantitative traits, the program implements traditional test of G×E interaction, the 2-degree-of-freedom (df) joint test of G and G×E, and efficient 2-step methods for test of G×E. As a byproduct of the available tests, GxEScan also performs standard marginal G scan, i.e. the standard test of each SNP conducted in a GWAS. The ‘environment’ factor E can be either binary or continuous and can be an exogenous exposure variable (e.g., air pollution), personal exposure (e.g., smoking), other personal characteristic (e.g., sex, age, candidate gene) or a drug treatment. GxEScan will test G×E interaction with measured and/or imputed SNPs on autosomal chromosomes and will control the family-wise error rate (FWER) at a user-defined level (e.g. 5%). GxEScan consists of three components: (1) GxEScan.exe, a C++ program to generate results from a genomewide scan; (2), GxEMerge.exe, a C++ program to combine multiple files generated from GxEScan.exe into a single file that is ordered by p-value; (3) GxETest.R, an R program to synthesize the scans, conduct hypothesis tests, and output QQ plots, Manhattan plots, and tables of top results to an Excel file. GxEScan can be executed on a Windows operating system or a Linux operating system. GxEScan is developed by Jim Gauderman, John Morrison, Pingye Zhang and Chris Edlund, and can be downloaded at http://biostats.usc.edu/software. 38 3.2 Methods Notation We assume M SNPs will be analyzed and the goal is to identify one or more quantitative trait loci (QTL). The following notation will be used throughout this chapter. Y: Quantitative phenotype E: Environmental factor of interest, which can be treated as binary or continuous, e.g. Indicator of exposure (e.g. 1=Exposed, 0=Unexposed) Indicator of sex (e.g. 1=Female, 0=Male) Quantitative exposure (e.g. pack-years of tobacco smoking) Personal factor (e.g. age) Candidate locus genotype (e.g. 0, 1, or 2 minor alleles) Drug treatment (e.g. 1=treatment arm, 0=control arm) G: Genotype at one of M SNPs being scanned for G×E interaction. Only SNPs on autosomal chromosomes can be analyzed. Additive model (e.g. 2=AA, 1=Aa, 0=aa, where A is minor allele and a is the common allele) C: A vector of potential confounders to be adjusted for when assessing genetic association. 39 Exhaustive Test Marginal Test (YG) The marginal effect of a SNP (G) on a quantitative phenotype (Y) can be obtained estimated from the following linear regression model: T 0 GC G YC (3.1) Where ε is assumed to have a normal distribution with mean 0 and common variance σ 2 across G. A standard GWAS of marginal effects is conducted by testing the null hypothesis βG=0 for each of the M SNPs. P-value (pG) is typically tested at a strict significance level (e.g. 5 × 10 -8 ) to preserve the FWER [Dudbridge and Gusnanto, 2008]. Interaction Test (G×E) One could augment the model in equation 2.1 to test the multiplicative G×E interaction using the model: T 0 G G E C E G G E C E Y (3.2) A genomewide interaction scan is based on testing the null hypothesis βG×E=0 for each of the M SNPs. Let pG×E denote the p-value for the test of G×E interaction. A correction is also needed to preserve the FWER. 40 Joint G, G×E Test (2df) Kraft et al. [Kraft et al., 2007] proposed to test the main effect of G and the G×E interaction jointly. The joint test has 2 degree of freedoms (2df) and can be conducted by testing the null hypothesis βG=βG×E=0 in model 3.2. They showed that the joint test sometimes provides greater power to detect the QTL than either the marginal test or the interaction test alone. Note that the joint test evaluates a fundamentally different null hypothesis than the G×E-alone tests and in practice will produce different SNPs on the top lists. Two-Step Procedures Several 2-step methods have been proposed to conduct a genomewide G×E scan. These 2-step methods can provide greater power to detect QTL than the standard G×E scan or even the 2df test under some underlying models. The key requirement for a 2-step method is the independence of step-1 screening and step-2 testing statistics. The 2-step methods described below all have this independence and preserve the nominal Type 1 error rate. Two-Step Test with Screening on Marginal G (YG|G×E) Kooperberg and Leblanc [Kooperberg and Leblanc, 2008] developed a 2-step approach by screening SNPs by genetic marginal effects in Step 1 (model 3.1), and testing for interactions (model 3.2) in Step 2. They originally proposed to use subset-testing in Step 2. G×EScan implements both the subset-testing and weighted-testing in Step 2. These two hypothesis testing approaches are described in more detail in Section 2.2.2.5 below. 41 Two-Step Test with Screening on Residual Variance Heterogeneity (rVar|G×E) Variance heterogeneity in Y across G could be induced in the presence of G×E interaction. This information can be used to prioritize SNPs for Step 2 testing. In Chapter 1 we know that the test statistic of variance heterogeneity on Y will be correlated with the test statistic of interaction term in the presence of marginal E effect, and the correlation can be removed if marginal E effect is removed (see Section 2.5.1). The marginal E effect can be removed by collecting residuals ( ) from the following model: 0 ii i E YE (3.3) Where i=1,…,N for N individuals in the sample. Note that environment variable E does not have marginal effect on residual Then the variance heterogeneity can be tested on residuals using the Levene’s test: 2 1 2 11 1 k K kk k N K kj k kj N K N Z Z W K Z Z (3.4) Where 𝑁 is the number of subjects 𝜏 𝑘𝑗 is the residual of the regression in (3.3) for j th observation in k th subgroup (k=1,…, K, e.g. K=3 for an additive model, K=2 for a dominant model) 𝜏 ̅ 𝑘 = 1 𝑁 𝑘 ∑ 𝜏 𝑘𝑗 𝑁 𝑘 𝑗 =1 42 𝜏 ̃ 𝑘 = 𝜏 𝑚 where ∑ I(𝜏 𝑘𝑗 ≤ 𝜏 𝑚 ) ≥ 0.5 and 𝑁 𝑘 𝑗 =1 ∑ I(𝜏 𝑘𝑗 ≥ 𝜏 𝑚 ) ≥ 0.5 𝑁 𝑘 𝑗 =1 𝑍 𝑘𝑗 = { |𝜏 𝑘𝑗 − 𝜏 ̅ 𝑘 |, 𝜏 ̅ 𝑘 is the weighted mean of the 𝑘 th group |𝜏 𝑘𝑗 − 𝜏 ̃ 𝑘 |, 𝜏 ̃ 𝑘 is the weighted median of the 𝑘 th group 𝑍 .. = 1 𝑁 ∑ ∑ 𝑍 𝑘𝑗 𝑁 𝑘 𝑗 =1 𝑘 𝑗 =1 , the weighted mean of all Zkj 𝑍 𝑘 = 1 𝑁 𝑘 ∑ 𝑍 𝑘𝑗 𝑁 𝑘 𝑗 =1 , the weighted mean of the Zkj for group k The Levene’s test can also be extended to the imputed data. Assume residuals ( are collected from the model 0 i E i i YE , the Levene’s test for imputed data is defined as 2 1 2 11 1 K kk k KN ik ik k ki N K N Z Z W K p Z Z (3.5) Where pik is probability subject i has genotype k obtained from genotype imputation (e.g. IMPUTE2 [Howie et al., 2009] and MACH [Li et al., 2009, 2010]) 𝑁 is the number of subjects 𝑁 𝑘 = ∑ 𝑝 𝑖𝑘 𝑁 𝑖 =1 𝜏 ̅ 𝑘 = 1 𝑁 𝑘 ∑ 𝑝 𝑖𝑘 𝜏 𝑖 𝑁 𝑖 =1 𝜏 ̃ 𝑘 = 𝜏 𝑚 where ∑ I(𝜏 𝑖 ≤ 𝜏 𝑚 )𝑝 𝑖𝑘 ≥ 0.5 and 𝑁 𝑖 =1 ∑ I(𝜏 𝑖 ≥ 𝜏 𝑚 )𝑝 𝑖𝑘 ≥ 0.5 𝑁 𝑖 =1 𝑍 𝑖𝑘 = { |𝜏 𝑖 − 𝜏 ̅ 𝑘 |, 𝜏 ̅ 𝑘 is the weighted mean of the 𝑘 th group |𝜏 𝑖 − 𝜏 ̃ 𝑘 |, 𝜏 ̃ 𝑘 is the weighted median of the 𝑘 th group 43 𝑍 .. = 1 𝑁 ∑ ∑ 𝑝 𝑖𝑘 𝑍 𝑖𝑘 𝑁 𝑖 =1 𝐾 𝑘 =1 , the weighted mean of all Zik 𝑍 𝑘 . = 1 𝑁 𝑘 ∑ 𝑝 𝑖𝑗 𝑍 𝑖𝑘 𝑁 𝑖 =1 , the weighted mean of the Zik for group k For both measured and imputed data, W has a non-central F distribution under the null hypothesis of equality of variance, H0: σ1= σ2=… σK, where σk is the standard deviation of within the k th subgroup. The P-value is defined as rVar 1, P Pr K N K FW Where F is the F-distribution with K–1 and N–K degrees of freedom. Considering the results from Step-1 screen, SNPs tested at Step 2 will be tested formally for G×E interaction using model 3.2. One can either use subset testing or weighted hypothesis testing in Step 2 (see Section 2.2.2.5 below). Joint Two-Step Test with Screening on Both YG and rVar (Joint|G×E) This approach provides a more efficient screen by using all available surrogate information in a single Step-1 screening test. Specifically, model 3.3 is first used to collect the residuals. Then for each of the M SNPs, one combines the P-value (PG) from marginal G scan and P-value (PrVar) from test of variance heterogeneity using Fisher’s method [Fisher RA, 1932]: int rVar 2 ln P ln P jo G T (3.6) 44 Tjoint follows a chi-squared distribution with 4 degrees of freedom under the joint null H0: σ1= σ2=… σk and βG=0. The Step-2 test is based on Equation 3.2. We advocate using weighted hypothesis testing in Step 2 rather than subset testing, although both approaches are supported in G×EScan. Step-2 Hypothesis Testing Approaches G×EScan supports two approaches to hypothesis testing in Step 2 for all of the two-step methods. Subset Testing Test only a subset m<<M SNPs that have P-value < α1 from Step-1 screen. The Step-2 significance level after Bonferroni correction for the number of SNPs that pass Step 1 will be adjusted to α/m. The Step-1 threshold α1 must ne pre-specified by the user. A larger value of α1 will increase the chance of passing a QTL into Step 2, while at the cost of reducing the Step 2 significance level by increasing the number of unassociated SNPs pass into Step 2. A lower value of α1 leads to lower m thus greater chance to identify QTLs in Step 2, but also at the cost of screening out the QTLs in Step 1. The optimal α1 depends on the value of M and also on other parameters (e.g. minor allele frequency of a QTL, exposure frequency, magnitude of the information tested in Step 1, etc.). Weighted Hypothesis Testing Rather than restrict Step-2 testing to a subset of the M SNPs, user can test all M SNPs using the weighted significance level in Step 2 based on the order Step-1 P-values and an initial 45 bin size B [Ionita-Laza et al., 2007]. The basic idea is we can group SNPs into k groups and split the overall FWER α into the k groups so that (α/2)/B + (α/4)/(2B) +…+ (α/2 k )/(2 k-1 B) = α(1/2+1/4+…+1/2 k ) = α. Specifically, Specifically, an initial bin size B is pre-specified and the B most significant SNPs based on step-1 P-values are tested at significance level α/2B , the next 2B SNPs are tested at significance level α/[2 2 (2B)], the next 4B SNPs will be tested at significance level α/[2 3 (2 2 B)], etc. Across many simulation scenarios, we have found that the weighted testing generally provides power that no choice of α1 would lead to except when the Step-1 power is very low. 3.3 Example Dataset To demonstrate execution of G×EScan, we generated a simulated dataset consisting of a quantitative trait (Y) on 1000 individuals, M=1,000 SNPs, and a binary environmental variable (E). The quantitative trait has a normal distribution with population mean=10 and variance=1.05. Exposure is randomly sampled for each subject with prevalence 40% for E=1 (exposed) and 60% for E=0 (unexposed). The SNPs are arbitrarily labeled rs0001, rs0002, …,rs1000. The first SNP, rs0001, is designated as the QTL and is generated according to the following model: Minor allele frequency (MAF) = 0.2 Interaction effect βG×E = 0.5, accounting for 2% of total variance Main effects βG and βE in model 3.2 both set to 0 In other words, the QTL in this simulation is assumed to only increase quantitative trait mean in individuals that are exposed and carry at least one minor allele, according to the following pattern of group means: 46 Number of Minor Alleles 0 1 2 Unexposed (E=0) 10 10 10 Exposed (E=1) 10 10.5 11 For each of the remaining 999 SNPs, a MAF was randomly sampled from a uniform distribution on the range 0.1 to 0.4. None of these SNPs were assumed to have an effect on the quantitative trait. Genotypes for all 1,000 SNPs are coded as 2 2 (homozygous for minor allele), 2 1 (heterozygote), or 1 1 (homozygous for common allele). We also generate a sex for each individual with equal probability of male (sex=1) and female (sex=2). We provide this simulated dataset in the form of five files. All five files adhere to the same basic file formatting as is used in PLINK[Purcell et al, 2007]. For additional details of these file formats, see documentation provided with the PLINK program. example.fam: The .fam file for the entire sample, including 1,000 lines (one per subject) and format given below. 1 1 0 0 1 9 2 2 0 0 1 9 3 3 0 0 1 9 4 4 0 0 2 9 5 5 0 0 1 9 … Column 1: Family ID (unique for each subject in a sample) Column 2: Subject ID (arbitrary values, set equal to family ID in these data) 47 Columns 3, 4: Father ID and Mother ID. Set to zero for these data. Column 5: Sex code (1=male, 2=female) Column 6: “pheno” column (all set to 9 and not used, see pheno.txt below) pheno.txt: A file providing the trait value for each subject: FID IID Y 1 1 10.7 2 2 10.3 3 3 12.1 4 4 11.3 5 5 9.0 … Column 1: Family ID (this must be labelled FID) Column 2: Individual ID (this must be labelled IID) Column 3: Quantitative trait Note: the inclusion of a header is optional. covariate.txt: A file that contains data for E, with one line per subject and format: FID IID E 1 1 1 2 2 1 3 3 1 4 4 1 48 5 5 0 … Column 1: Family ID (this must be labelled FID) Column 2: Individual ID (this must be labelled IID) Column 3: Exposure (1=exposed, 0=unexposed) Note: the inclusion of a header is optional. example.map: A file giving information about each SNP, with 1,000 lines (one per SNP), and format: 1 rs0001 0 101 2 rs0002 0 201 3 rs0003 0 301 4 rs0004 0 401 5 rs0005 0 501 … Column 1: Chromsome number Column 2: SNP identifier Column 3: Genetic distance (morgans, set to zero for all snps in this simulated data) Column 4: Base pair position (arbitrarily values given in this simulated dataset) dosage.txt: A file providing the SNP genotype data in dosage-file format, with 1,000 lines (one per SNP). rs0001 A C 0 1 0 1 … 49 rs0002 A C 1 0 1 0 … rs0003 A C 1 0 0 0 … rs0004 A C 1 0 1 0 … rs0005 A C 0 1 1 0 … … Column 1: SNP identifier Columns 2: SNP allele code for the “non-index” allele, Column 3: SNP allele code for the “index” allele, i.e. the allele to which the dosage values correspond. For example, in the above data, the dosage values 0, 1, or 2 indicate the genotypes AA, AC, and CC, respectively. Remaining columns: SNP genotypes for each subject (2,000 columns for this dataset). The values must be provided in the same subject order as the .fam file. Here, each genotype is represented by two numbers (i.e. the two numbers for rs0001 represent the probability of an A/A, then an A/C genotype. The probability of a C/C is naturally 1 minus the sum of these). GxEScan also supports format=3 (3 genotype probabilities provided for each subject) styles (see PLINK documentation for additional details on these formats). 3.4 Program Execution We assume the user has their data in PLINK [Purcell et al, 2007] format as described in Section 3.3. GxEScan consists of three programs: 1. GxEScan.exe: Written in C++, this program will perform various genomewide scans (e.g. for marginal G effects, G×E, 2 df, etc.). This program can also be used to create 50 binary dosage files from the standard text-based dosage files generated from popular imputation software (e.g. IMPUTE2, MACH). 2. GxEMerge.exe: Written in C++, this program will combine multiple files generated from GxEScan.exe into a single file that is ordered by p-value. This program is needed if GxEScan is applied to multiple input files, e.g. separate dosage files for each chromosome. 3. GxETest.R: Written in R, this program collects all results from GxEScan/GxEMerge, performs some final tests, and produces a single file (in EXCEL format) that includes QQ-plots, Manhattan plots, and tables of top hits for each of the approaches. All three components of G×EScan can be executed in Windows or Linux with the following system requirements: Windows: Windows Vista or newer (including 7, 8, 8.1, 10) or Windows Server 2008 or 2012 Linux: Ubuntu 12.0 or newer, Redhat 5 or newer. G×EScan.exe The primary goal of GxEScan.exe is to perform several genomewide scans to generate the test statistics and effect estimates described in Section 3.2. This program can also be used in a preliminary step to generate a binary gene dosage file from the standard text output file created by popular imputation programs. These uses are described below. 51 Creating a Binary Dosage File Popular genotype imputation programs (IMPUTE2, MACH) produce results in a standard text file. We assume your imputed data is organized as it is in IMPUTE2 output, with one observation per SNP (see format of dosage.txt above). While GxEScan can perform scans using standard text-based imputation files, it is much more computationally efficient to run scans on a binary-format version of these. To create a binary-format file, you will use the following command: GxEScan --fam example.fam --dosage dosage.txt format=2 noheader --map example.map --make-bdosage --out example Required inputs are the family file (e.g. example.fam), text dosage file (e.g. dosage.txt), and map file (e.g. example.map). The following options are supported: format=K (K = 2, or 3) skip0=N (N is a non-negative integer. # columns to skip before SNP ID column) skip1=N (# columns to skip between SNP ID and first Allele code columns) skip2=N (# columns to skip between the last allele code and first dosage data) noheader: Use if there is no header in the text dosage file Two output files will be created. bdosage (e.g. example.bdosage): This is the binary-format version of the input text dosage file. bim (e.g. example.bim): Version of the *.map file that will be used with binary genotype or dosage data. 52 Performing Genomewide Scans GxEScan will automatically generate results for all of the analytic approaches described in Section 3.2. This involves running a series of five distinct scans: 1) Marginal scan (YG, Eq 3.1); 2) G×E scan (G×E, Eq 3.2); 3) 2df test (2df, Eq 3.2); 4) Levene’s test (rVar, Eq 3.4 or 3.5); 5) Joint test on both YG and rVar (Joint, Eq 3.6 ); Results from these analyses can be used to construct all of the tests described in Section 3.2. All of the following files will be generated from a single GxEScan execution. Each file will store the results from the corresponding analysis, in rank order by p-value. GxEscan.snpinfo: File providing a sequential SNP ID used for linking GxEScan output files, and additional SNP information GxEscan_QT_YG.gxeout: 1 df test of G GxEscan_QT_G×E.gxeout: 1 df test of G×E interaction GxEscan_QT_2df.gxeout: 2 df test GxEscan_QT_VH.gxeout: Levene’s test of variance heterogeneity GxEscan_QT_YGVH.gxeout: Joint test of marginal G and variance heterogeneity You can optionally provide a base file name to be used instead of ‘GxEscan’ (see --out option below). In addition to the 5 files listed above, GxEScan will also produce a log file (e.g. GxEscan.log) that provides important notes about program execution. 53 Required Commands Any scan will require the following options to be part of the command line. --gxeout --covar <filename> --covar-name (if you have a header in the covar file) OR --covar-number (if you do not have a header) --gxe-name (if you have used --covar-name) OR --gxe-number (if you have used --covar-number) --linear The following commands will also be required, and depend on the format of your genotype data A. Text-based Dosage Data --fam <filename> --dosage <filename> --map <filename> B. Binary Dosage Data (see Section 3.4.1) --fam <filename> --bdosage <filename> --bim <filename> Optional Commands The following options are supported by GxEScan and have the same action as they do in PLINK (see PLINK documentation for a complete description). All options are case sensitive. 54 Phenotype options --pheno <filename> --pheno-name --mpheno Dosage options --dosage options skip0=N (N is a non-negative integer) skip1=N skip2=N format=K (K = 2, or 3) Filtering options -- filter -- mfilter -- maf Output option --out <base filename> Example 1) The command line to analyze the example dataset using the text-based dosage file is: GxEScan --fam example.fam --pheno pheno.txt --pheno-name Y --covar covariate.txt --covar-name E --gxe-name E --map example.map --dosage dosage.txt format=2 noheader --gxeout --linear --out example 55 2) The corresponding command line to analyze the example dataset using a binary version of the dosage file (see Section 3.4.1) is: GxEScan --fam example.fam --pheno pheno.txt --pheno-name Y --covar covariate.txt --covar-name E --gxe-name E --bim example.bim --bdosage example.bdosage --gxeout --linear --out example GxEMerge.exe GxEMerge is used to merge the results from multiple runs of GxEScan that used the same analysis model. This is done when the genotype dosage data is split into several files. G×EMerge requires an input file that lists the base file names of the GxEScan output to merge and a base output file name. The output from GxEMerge consists of five files with the same extensions listed in 3.4.1. The output file for each analysis method will be in p-value sorted order. The command line for GxEMerge is: GxEMerge <input file> <base output file name> {-linear} For example, it is not unusual to have your genotype dosage data split into 22 separate files, one for each autosomal chromosome. In this situation, you will first run GxEScan 22 times, using each of the 22 dosage files in turn, which will result in 22 sets of output. Assuming that each run had a base filename of the form GxEscan_chrN where N is the chromosome number, you could create a text file called "filelist.txt” that includes the following lines: GxEscan_chr1 GxEscan_chr2 … GxEscan_chr22 56 The following command would then be used: GxEMerge filelist.txt GxEscan_all -linear This command would read in the 22 chromosome-specific sets of GxEscan output files listed in “filelist.txt” and generate five output files beginning with “GxEscan_all” and ending with the extensions listed in section 3.4.1. GxETest.R This program can be run within the R Gui or by using the command “rscript GxETest.R”. Note that you may need to provide the full path if using the command line, e.g. “C:\Program Files\R\R-3.2.2\bin\i386\rscript GxETest.R ” You can use either 64-bit or 32-bit R. Following are the first several lines of the program, with red highlights to note the most important input values/options. ###################### G×ETest.R program #################################### ## Overview # # This program reads output created by G×EScan.exe. # Input files must be in the format created by the --G×Eout option of G×EScan. # If G×EScan has been used to analyze multiple files of SNPs # (e.g. one file per chromosome), please use G×EMerge.exe to combine the files # prior to executing G×ETest.R ###################################################################################### ########################## Required R packages ############################# # (1) data.table # (2) XLConnect # (3) rJava # You can use R function "install.packages()" to install required R packages. ############################################################################# ##### Parameter Input ##### rm(list=ls(all=TRUE)) library("data.table") options(java.parameters = "-Xmx1024m") suppressMessages(library(XLConnect)) library(rJava) GWIS<-function(){ ##################################################################### #args <- commandArgs(trailingOnly = TRUE) args<-c( "--home-dir","/Your/Path/", # Path to G×EScan output files "--alpha","0.05", # Family-wise error rate "--GxEscan","example", # Base filename of G×EScan/G×EMerge output "--marginal-g", # Output Marginal G scan, Section 3.2.2.1 57 "--GxE", # Output G×E test, Section 3.2.2.2 "--2df", # Output 2df test, Section 3.2.2.3 ## Each of the following four lines specifies the available 2-step methods (see Section 3.2.2.4). The first value (e.g. 0.05) is the step-1 significance threshold required for subset testing, and the second value (e.g. 5) is the initial bin size for weighted hypothesis testing (Section 3.2.2.5) "--JointG×E","0.05","5", # Joint|G×E analysis, Section 3.2.2.4.3 "--YGG×E","0.05","5", # YG|G×E, Section 3.2.2.4.1 "--rVarG×E","0.05","5", # rVar|G×E, Section 3.2.2.4.2 ## Output options "--include-ranks", # Output ranks for subset testing "--top","50") # Output top 50 SNPs from each method ###################################################################################### All output from GxETest.R will be stored in a single Excel file, called “<basename>_output.xlsx”, where <basename> is the base filename provided in the --GxEscan option (e.g. “example_output.xlsx” would be produced from the above code). 3.5 Program Output As described above, GxEScan.exe produces multiple text output files corresponding scan results for each analytic method (see Section 3.2.2 and 3.2.3). Executing GxETest.R will produce a single Excel file with multiple worksheets. Selected worksheets are described below in the context of the Example data set. Summary This sheet summarizes the procedures that were applied. The upper portion of this sheet is shown in the inset. It provides the total number of SNPs tested (M), family-wise error rate (alpha), and for each of the 2-step procedures: the Step-1 significance threshold (alpha1), the number of SNPs that passed Step 1 (m), and the Step-2 significance threshold that should be used for subset testing (alpha/m). Also shown is the initial bin size (B) used for weighted Step-2 hypothesis testing. Additional lines in the sheet provide definitions and references for the methods. 58 Figure 3.1. Summary of Example data results Exhaustive Tests For the Example data, the results below show the QQ and Manhattan plots, and part of the table of top hits for the standard analysis of G×E interaction (Section 3.2.2). As shown in the table, the SNP was simulated to have a G×E interaction (rs0001) ranked first of the 1,000 SNPs by p-value, and marked with “***” in the “Sig” column indicating it achieved a genomewide significance threshold. Also shown in the table is the rank of each SNP in the other analytic approaches. Ranks for the 2-step methods (Joint|G×E etc.) are based on subset-testing and will be blank for those SNPs that did not pass the Step-1 screening threshold (Section 3.2.4). 59 Figure 3.2. QQ, Manhattan plots and part of the table of top hits for the standard analysis of G×E interaction. Two-Step Tests Following shows results for the 2-step Joint|G×E procedure (Section 3.2.3), with results from both Subset and Weighted hypothesis testing. Subset Testing: The results below show QQ and Manhattan plots for the M SNPs analyzed in Step 1 (upper plots) and for the m SNPs that pass Step 1 at the α1 significance level and that are analyzed in Step 2 (bottom plots). In the Example data, we see the corresponding Step-2 plots show that a SNP on chromosome 1 has a significant p-value based on the subset of m=60 SNPs tested in Step 2. 60 Figure 3.3. QQ and Manhattan plots for 2-step Joint|G×E procedure with Subset testing. Weighted Hypothesis Testing: The plot below is the Manhattan plot for the G×E test in all M SNPs ordered by their Step-1 p-values and organized into bins. Bin #1 includes the 5 (because initial Bin size is set to 5; see Summary sheet) with the lowest Step-1 p-values. The horizontal line for Bin #1 is the threshold required for SNPs in Bin #1 to achieve significance (with 5 SNPs in Bin 1, this threshold is 0.005 = 0.05/(2*5)). We see that one SNP surpasses this significance threshold. Bin #2 shows the corresponding results for the next 10 most significant SNPs from Step 1. None of these SNPs surpass the corresponding bin-specific threshold of 0.00125 = 61 0.05/(4*10). All remaining SNPs are also plotted based on their bin placements. A table of top hits is also included in each sheet. Figure 3.4. Manhattan plot for 2-step Joint|G×E procedure with Weighted hypothesis testing. Example Summary As we see in the results from the Example data, multiple methods have correctly identified the simulated QTL as having a statistically significant G×E interaction. None of the remaining 999 SNPs were falsely identified as significant by any of the methods. In general, the 2-step methods for testing G×E interaction all provide much better power than a standard G×E analysis. Note that the 2-step rVar|G×E provides less power than a standard G×E analysis if only 62 one interaction involved in the development of Y. The YG|G×E method is generally the most powerful method in the presence of one interaction. The YG|G×E method can outperform the standard G×E test when even a small marginal G effect is present, with increasing gains in power as the marginal G effect increases. When two interactions are involved in the development of Y, the test of variance heterogeneity begins to play a part and the rVar|G×E method can provide higher power than the standard G×E test (Section 2.3.2). In those situations when two interactions are involved and the marginal YG association is small, we have found that Joint|G×E with weighted hypothesis testing typically provides greater power than all of the other methods of testing G×E interaction. Still, no single method will be universally most powerful under all models, which is a reason GxEScan produces results from multiple different approaches. 63 Chapter 4: Application To The Children’s Health Study 4.1 General Description and Motivation The Children’s Health Study (CHS) is an ongoing cohort study investigating both genetic and environmental effect on asthma risk [McConnell et al., 2006] and lung function development [Gauderman et al., 2015] in children from 16 southern California communities. The CHS GWAS was based on a nested case-control sample of 1249 asthmatics and 1751 controls selected from within the cohort. All subjects in this GWAS sample were either Hispanic white (HW) (n=1398) or non-Hispanic white (NHW) (n=1602). Based on questionnaire responses, children were characterized as having physician diagnosed asthma at study entry or during active follow-up (cases), or as never having a diagnosis of asthma (controls). Forced expiratory volume in the first second (FEV1) is a widely used measure to evaluate pulmonary function. Prior GWAS scans have identified several loci that have marginal G association with FEV1 [Hancock et al., 2009]. We hypothesize that heritability for FEV1 can be different across ethnic groups for three possible reasons: 1) there is a difference in allele frequency across groups, but the same effect of the QTL on the trait in all groups, 2) the allele frequency is similar across groups but the QTL effect varies by group, 3) both the allele frequency and QTL effect vary across groups. In the first situation, a standard marginal G scan should provide the best power, while situations 2 and 3 should show evidence of G x Ethnicity interaction. For example, a QTL could have no effect on FEV1 in a non-Hispanic White population but serve as an important contributor to the explained heritability for a Hispanic 64 White population. In an attempt to identify additional loci for FEV1, we performed a genome- wide scan for G x Hispanic-ethnicity interaction using data from the CHS. 4.2 Phenotype and Subjects FEV1 was measured for each child by trained technicians and was log-transformed to satisfy assumptions of the linear regression model. The CHS is a prospective cohort study that has enrolled over 11,000 school children in southern California. The study has enrolled students at 3 separate times to 5 different cohorts. Cohorts A-D were enrolled in 12 southern California communities to study the effects of ambient air pollution on children’s respiratory health. Cohort E children were selected from 13 communities in southern California, including 9 of the original communities and 4 new communities. We focused on participants that had a lung function measurement in 9 th grade (average 15 years of age), which included a total of 1,728 children (684 HW, 1,044 NHW). 4.3 Exposure Data A detailed questionnaire was administered to all study subjects at baseline. This questionnaire included items about health history (including asthma and other respiratory conditions and symptoms), residential history, gender, in utero smoke exposure, and Hispanic ethnicity status. In utero exposure to maternal smoking was assessed by responses to the baseline questionnaire question “Did your child’s biological mother smoke while she was pregnant with your child?” Ethnicity was self-reported by parents on the CHS baseline questionnaire. Air pollution monitoring stations in each of the study communities have been continuously measuring regional air pollutants since 1994. Each station measured average hourly 65 concentrations of nitrogen dioxide (NO2). Stations also collected 2-week integrated filter samples for measuring PM2.5 mass and chemistry. Mean air-pollutant concentrations were calculated for each community over the relevant periods of exposure for each cohort (1994–1997 for cohort C, 1997–2000 for cohort D, and 2007–2010 for cohort E).The distribution and correlation structure of these pollutants across communities, and their effect on lung-function development, have been previously reported(Gauderman, McConnell et al. 2000; Gauderman, Gilliland et al. 2002; Gauderman, Avol et al. 2004). Table 4.1 summarizes the distributions of the exposures considered for this application. Table 4.1. Descriptive Statistics for Environmental Exposures Considered for Testing G×E Interactions in the Children’s Health Study. All Subjects NHW HW N=1,728 N=1,044 N=684 Binary E E=1 % E=1 % E=1 % in utero tobacco smoke 205 0.12 154 0.15 51 0.08 Male 848 0.49 528 0.51 320 0.47 Continuous E Median IQR Median IQR Median IQR PM 2.5 (μg/m 3 ) 15.29 10.84 12.66 10.74 15.29 12.07 NO 2 (ppb) 22.39 14.06 17.89 18.96 22.64 12.82 4.4 Genotype Data Measured data Study samples were genotyped at the USC Epigenome Center using the Illumina HumanHap550, HumanHap550-Duo or Human610-Quad BeadChip microarrays. Individuals were excluded from analysis with call rates < 90% (n=155). The HumanHap550, HumanHap550-Duo and Human610-Quad contained 366, 366 and 418 SNPs respectively that overlapped with a candidate gene study containing a large number of the subjects in this study (n=2,905). The average concordance rate between matching subjects for these overlapping SNPs 66 was > 99.69% for > 99% of the samples having a call rate > 90%. Subjects with poor concordance with genotypes from the candidate gene study were excluded (n=19). SNPs were excluded prior to imputation if they were not concordant between Illumina genotyped HapMap samples and HapMap2_r21 (< 95% in any population) or not concordant between Illumina genotyped HapMap samples on the HumanHap550 and Human610 (< 95% in any population) (n=8,616). Additionally, SNPs were filtered if they had a call rate < 95% (n=84,381) or departed from Hardy-Weinberg equilibrium (p < 10-5) in controls in HW and NHW samples separately (n=762 and 766 respectively). After applying these quality control measures, a total of N=1,728 subjects and M=506,738 SNPs were available for analysis. Imputed data The genotype data consists of SNP calls at roughly 600,000 loci for 3000 individuals from the CHS. Prior to this analysis, the measured SNP data was filtered to remove any SNP with MAF< 1%. It was then phased using SHAPEIT and imputed using IMPUTE2 against the 1000 Genomes Phase 1 integrated variant v3 phased reference (April 2012) by collaborators at UCSF. The imputation process yielded estimated dosage (0,1, or 2) probabilities for 37,538,514 loci common to all 3000 individuals. The dosage estimates include SNPs from all of the 22 autosomal pairs but not the sex chromosomes. The imputation was performed in roughly 5Mb chunks, so QCTOOL was used to reassemble the individual IMPUTE2 output files into 22 bgen (binary genotype) files, one for each chromosome pair. In the process, the SNPs were filtered based on their imputation info metric, wherein SNPs with scores of less than 0.5 were removed from the dataset. The info metric filtering eliminated 14,330,788 SNPs, leaving 23,207,726 SNPs for analysis. We further removed any SNP with MAF < 5% for genomewide scan. After 67 applying these quality control measures, a total of N=1,728 subjects and M=6,489,739 SNPs were available for analysis. 4.5 Statistical Analysis The two novel 2-step approaches (Joint|G×E and rVar|G×E) along with other methods (standard G×E scan and YG|G×E) were used to test for G×E interaction effects for in utero tobacco smoke exposure, gender, Hispanic-ethnicity stratus and exposure to ambient PM2.5 and nitrogen dioxide (NO2). Marginal scans were also performed. The model included adjustment for sex, age, body-mass index (BMI), BMI squared, log transformed height (log height), log height squared, community and estimated individual global ancestry (adjustment for population substructure, These Q-factors reflected the proportional ancestry for each study subject from four populations: Caucasian, African, Asian, and Native American). Both subset testing with α1=0.05 and weighted hypothesis testing with initial bin size of 5 were used for each of the 2-step approaches. 68 4.6 Results Measured data Approaches performed did not highlight any SNPs that achieved genomewide significance for G×E interaction effect for in utero tobacco smoke exposure, sex and exposure to ambient PM2.5 and NO2. For G×Hispanic-ethnicty scan, as shown in the Manhattan plots of Figure 4.2, one SNP was identified by the standard G×E test and the rVar|G×E and Joint|G×E 2- step approaches. The marginal G scan does not identify any SNPs with genome-wide significant P-values after Bonferroni correction (Fig. 4.1A-B). Note that the test of variance heterogeneity is not specific to the G×E interaction tested in step-2 and variance heterogeneity could also be induced by biological mechanisms other than interactions (Takeuchi et al., 2011). As a result, we observed inflation of Levene’s test statistic in the step-1 QQ plot for the rVar|G×E method. 69 A B C D E F Figure 4.1. Quantile-quantile and Manhattan plots for G x Hispanic-ethnicity interaction analysis of 506,788 SNPs from the Children’s Health Study. (A-B) Marginal G scan. (C-D) Standard G×E scan (E-F) Step-1 from the rVar|G×E method. 70 A B C D E F Figure 4.2. Analysis of 506,788 SNPs for G x Hispanic-ethnicity interaction using two-step methods. Subset testing was used (see Methods), with Step-1 threshold (α1) set to be 0.1. (A-B) Step-2 from the YG|G×E method. (C-D) Step-2 from the rVar|G×E method. (E-F) Step-2 from the Joint|G×E method. 71 This SNP is rs1439945 with MAF=0.16 within HW sample, MAF=0.14 within NHW sample and MAF= 0.15 for the combined sample. The SNP is located on chromosome 2 near the MARCO gene, with Step-1 screening P-value 5.4 ˣ 10 -2 for Joint|G×E, 1.8 ˣ 10 -2 for rVar|G×E and Step-2 testing P-value 6.4 ˣ 10 -8 (Table 4.2). The MARCO gene was found to be associated with susceptibility to pulmonary tuberculosis in a Chinese Han population (Ma et al., 2011) and in a Gambian population (Bowdish et al., 2013). This locus exhibits a qualitative interaction (Fig. 4.3), with a 130.3 milliliters decrease in FEV1 per allele for Hispanic whites and a 83.0 milliliters increase in FEV1 per allele for non-Hispanic whites. Note that this SNP was identified by the subset testing with α1=0.1 and would not have been found using the weighted hypothesis testing. The marginal effect of this SNP is weak (β G = –0.004, P-value=0.53), and as shown in Table S3 (Section 2.5.4), subset testing with a liberal screening threshold (e.g. α1=0.05) provides the highest power in this situation. This locus has not been previously identified in marginal G scans of FEV1. Table 4.2. Top 10 SNPs from Joint|G×E analysis of 506,788 SNPs for G x Hispanic- ethnicity interaction of FEV1 in the Children’s Health Study. SNP CHR Location Nearest Gene Step-1 P-value Step-2 P-value Significance Threshold rs1439945 2 119535662 MARCO 5.4 ˣ 10 -2 6.4 ˣ 10 -8 8.0 ˣ 10 -7 rs12338838 9 218848 DOCK8 2.5 ˣ 10 -2 1.8 ˣ 10 -6 8.0 ˣ 10 -7 rs296513 1 199173096 MROH3P 4.7 ˣ 10 -2 1.3 ˣ 10 -5 8.0 ˣ 10 -7 rs4478805 13 115405425 TSPAN2 8.6 ˣ 10 -2 5.0 ˣ 10 -5 8.0 ˣ 10 -7 rs6063141 6 45779866 SULF2 5.8 ˣ 10 -2 6.0 ˣ 10 -5 8.0 ˣ 10 -7 rs6933322 12 53099317 GCM1 2.8 ˣ 10 -2 7.0 ˣ 10 -5 8.0 ˣ 10 -7 rs10173517 12 230737937 SP110 7.3 ˣ 10 -2 7.4 ˣ 10 -5 8.0 ˣ 10 -7 rs2146390 12 98040506 DNTT 7.4 ˣ 10 -2 8.6 ˣ 10 -5 8.0 ˣ 10 -7 rs11237700 15 78437535 TENM4 5.4 ˣ 10 -2 8.8 ˣ 10 -5 8.0 ˣ 10 -7 rs6780603 2 34222683 LOC101928114 4.8 ˣ 10 -2 1.0 ˣ 10 -4 8.0 ˣ 10 -7 rs12296349 9 99835504 ANO4 6.3 ˣ 10 -3 1.1 ˣ 10 -4 8.0 ˣ 10 -7 72 Figure 4.3. G x Hipanic-ethnicity interaction pattern for rs1439945 for carriers (G=1 with AA and Aa genotype combined) relative to noncarriers (G=0 for aa genotype) in Hispanic White children (E=0) or non-Hispanic White children (E=1). Imputed data Figure 4.4 summarizes the results for all M = 6,489,739 SNPs scanned for marginal effects. No SNP had a statistically significant marginal effect at the genome-wide level (p = 7.7E-9). A B Figure 4.4. Analysis of 6,489,739 SNPs for marginal effect scan. Familywise error rate (α) is set to be 0.05. (A) Quantile-quantil plot. (B) Manhattan plot. 3350 3400 3450 3500 3550 3600 3650 G=0 (aa) G=1 (Aa/AA) FEV 1 E=0 (HW) E=1 (NHW) 73 G×Hispanic-ethnicity The same SNP (rs1439945 on chromosome 2 near MARCO gene) observed from the measured data was also detected from the imputed data for G × Hispanic-ethnicity scan. Note that this SNP was only detected by the 2-step approaches Joint|G×E and rVar|G×E (see Fig.4.5), but not detected by the standard G×E test. Two additional SNPs of potential interest were identified by the Joint|G×E and rVar|G×E methods. Those SNPs are rs12811916 (Step-2 testing P-value 1.5 × 10 -6 ) on chromosome 12 near RPL39P27 gene and rs77952193 (Step-2 testing P- value 1.1 × 10 -6 ) on chromosome 3 near IQCJ gene. The SNP rs12811916 has MAF= 0.19 within HW sample, MAF= 0.16 within NHW and MAF=0.17 for the combined sample. This locus exhibits a qualitative interaction, with a 146.1 milliliters decrease in FEV1 per allele for Hispanic whites and a 18.0 milliliters increase in FEV1 per allele for non-Hispanic whites. The SNP rs77952193 has MAF= 0.04 within HW sample, MAF= 0.07 within NHW and MAF=0.06 for the combined sample. This locus exhibits a qualitative interaction, with a 268.4 milliliters decrease in FEV1 per allele for Hispanic whites and a 79.2 milliliters increase in FEV1 per allele for non-Hispanic whites. 74 A B C D Figure 4.5. Quantile-quantile and Manhattan plots for G×Hispanic-ethnicity interaction analysis of 6,489,739 SNPs from the Children’s Health Study. (A-B) Step-2 from the Joint|G×E method. (C-D) Step-2 from the rVar|G×E method. G × PM2.5 scan The Joint|G×E and rVar|G×E methods implemented for G × PM2.5 identify one region on chromosome 5 near LOC100420027 gene (see Fig.4.6 A-D). The top SNP within that region is rs12189290 has MAF= 0.34 within sample exposed to PM2.5 less than 15.3 μg/m 3 (median value of PM2.5), MAF= 0.32 within sample exposed to PM2.5 greater than 15.3 μg/m 3 and MAF=0.33 for the combined sample. This locus exhibits a qualitative interaction, with a 64.7 milliliters 75 decrease in FEV1 per allele for sample exposed to PM2.5 less than 15.3 μg/m 3 and a 95.1 milliliters increase in FEV1 per allele for sample exposed to PM2.5 greater than 15.3 μg/m 3 ; with a 27.3 milliliters decrease in FEV1 per μg/m 3 for non-carriers and a 43.4 milliliters decrease in FEV1 per μg/m 3 for carriers. Note that this locus almost hit the genomewide significance level with a marginally significant P-value of 1.4 × 10 -7 compared to the genomewide significance level of 1.3 × 10 - 7. The YG|G×E method identifies one region with potential interest on chromosome 8 near COL14A1 gene (see Fig.4.6 E-F). The top SNP within that region is rs79137215 (Step-2 testing P-value 1.3 × 10 -5 ) has MAF= 0.09 within sample exposed to PM2.5 less than 54.3 μg/m 3 (median value of PM2.5), MAF= 0.10 within sample exposed to PM2.5 greater than 15.3 μg/m 3 and MAF=0.10 for the combined sample. This locus exhibits a qualitative interaction, with a 64.7 milliliters decrease in FEV1 per allele for sample exposed to PM2.5 less than 15.3 μg/m 3 and a 139.9 milliliters decrease in FEV1 per allele for sample exposed to PM2.5 greater than 15.3 μg/m 3 ; with a 40.1 milliliters decrease in FEV1 per μg/m 3 for non- carriers and a 26.5 milliliters decrease in FEV1 per μg/m 3 for carriers. 76 A B C D E F Figure 4.6. Quantile-quantile and Manhattan plots for G×PM2.5 interaction analysis of 6,489,739 SNPs from the Children’s Health Study. (A-B) Step-2 from the Joint|G×E method. (C-D) Step-2 from the rVar|G×E method. (E-F) Step-2 from the YG|G×E method. 77 G × NO2 scan One region of potential interest was identified from the G × NO2 scan using the Joint|G×E and rVar|G×E methods. This region is on chromosome 19 near C19orf45 gene (see Fig.4.7). The top SNP within that region is rs150213336 (Step-2 testing P-value 8.8 × 10 -6 ) has MAF= 0.07 within sample exposed to NO2 less than 22.4 ppb (median value of NO2), MAF= 0.05 within sample exposed to NO2 greater than 22.4 ppb and MAF=0.06 for the combined sample. This locus exhibits a qualitative interaction, with a 149.0 milliliters increase in FEV1 per allele for sample exposed to NO2 less than 22.4 ppb and a 41.0 milliliters decrease in FEV1 per allele for sample exposed to NO2 greater than 22.4 ppb; with a 30.9 milliliters decrease in FEV1 per ppb for non-carriers and a 69.0 milliliters decrease in FEV1 per ppb for carriers. The rVar|G×E method identifies one additional region with potential interest on chromosome 1 near COL9A2 gene. The top SNP within this region is rs117703269 (Step-2 testing P-value 2.9 × 10 -6 ) which has MAF= 0.06 within sample exposed to NO2 less than 22.4 ppb (median value of NO2), MAF= 0.07 within sample exposed to NO2 greater than 22.4 ppb and MAF=0.06 for the combined sample. This locus exhibits a qualitative interaction, with a 107.6 milliliters increase in FEV1 per allele for sample exposed to NO2 less than 22.4 ppb and a 135.2 milliliters decrease in FEV1 per allele for sample exposed to NO2 greater than 22.4 ppb; with a 32.9 milliliters decrease in FEV1 per ppb for non-carriers and a 83.5 milliliters decrease in FEV1 per ppb for carriers. G × in utero smoke scan The G × in utero tobacco smoke scan finds one region of potential interest using the 2- Step approaches (see Fig.4.8). This region is on chromosome 2 near the LOC105373585 gene. The SNP rs75340301 appears as the top one SNP across all three 2-Step approaches (Step-2 testing P-value 1.7 × 10 -6 ). The SNP rs75340301 has MAF= 0.05 within sample not exposed to 78 in utero tobacco smoke, MAF= 0.08 within sample exposed to in utero tobacco smoke and MAF=0.06 for the combined sample. This locus exhibits a qualitative interaction, with a 131.0 milliliters decrease in FEV1 per allele for sample not exposed to in utero tobacco smoke and a 274.1 milliliters increase in FEV1 per allele for sample exposed to in utero tobacco smoke. A B C D Figure 4.7. Quantile-quantile and Manhattan plots for G×NO2 interaction analysis of 6,489,739 SNPs from the Children’s Health Study. (A-B) Step-2 from the Joint|G×E method. (C-D) Step-2 from the rVar|G×E method. 79 A B C D E F Figure 4.8. Quantile-quantile and Manhattan plots for G × in utero tobacco smoke interaction analysis of 6,489,739 SNPs from the Children’s Health Study. (A-B) Step-2 from the Joint|G×E method. (C-D) Step-2 from the rVar|G×E method. (E-F) Step-2 from the YG|G×E method. 80 G × sex scan Severe inflation (Inflation factor λ=1.4) was observed for G × sex scan using the model with adjustment covariates described in section 4.5 (see Fig.4.9 A). A B Figure 4.9. Quantile-quantile plots for G×sex interaction analysis of 6,489,739 SNPs from the Children’s Health Study. (A) Without adjustment for sex × Hispanic-ethnicity interaction. (B) With adjustment for sex × Hispanic-ethnicity interaction. Keller [Keller, 2014] demonstrated that to properly control for confounders, researchers need to enter the covariate × environment and the covariate × gene interaction terms in the same model that tests the G × E term. Specifically he showed that if G is correlated with a covariate C1, the G×E estimate can be written as: 1 1 , 2 ˆ CG G E G E C E G (4.1) Where βG×E is the true effect of G×E interaction, βC1×E is true effect of C1×E interaction, σC1,G is the covariance between C1 and G, σ 2 G is the variance of G. As a result, in the presence of C1×E 81 interaction and C1-G correlation, controlling for marginal effect of C1 only without C1×E could leads to a biased estimate for G×E interaction. In our G × sex analysis, G is correlated with Hispanic-ethnicity and sex and Hispanic-ethnicity have a significant interaction effect on FEV1 (P-value=4.0×10 -4 ). Therefore, sex × Hispanic-ethnicity interaction must be included in model to adjust for confounding effect, and the inflation was removed after controlling for the sex × Hispanic-ethnicity interaction effect (see Fig.4.9 B). We also tried to include the sex × height interaction term in the model since height is correlated with G and there is also significant sex × height interaction effect on FEV1 (P-value=4×10 -8 ). However, including sex × height interaction term did not remove the inflation. I suspect that the height-G correlation is not as strong as the Hispanic-ethnicity-G correlation. The Joint|G×E and YG|G×E methods identify one region with potential interest (see Fig.4.10). This region is on chromosome 1 within ALX3 gene. The top SNP within that region is rs12133506 (Step-2 testing P-value 1.7 × 10 -6 ) which has MAF= 0.19 for female, MAF= 0.20 for male and MAF=0.20 for the combined sample. This locus exhibits a qualitative interaction, with a 124.1 milliliters increase in FEV1 per allele for female and a 39.5 milliliters decrease in FEV1 per allele for male. 82 A B C D Figure 4.10. Quantile-quantile and Manhattan plots for G×sex interaction analysis of 6,489,739 SNPs from the Children’s Health Study. (A-B) Step-2 from the Joint|G×E method. (C-D) Step-2 from the YG|G×E method. SNPs with most significant interactions from each G×E scan are summarized in Table 4.3. Note that only the interaction identified from the G × Hispanic-ethnicity scan reached the genomewide significance level. The same SNP was also identified from the analysis using the measured data. 83 Table 4.3. Most Significant SNPs Involved in G×E Interactions for each Environmental Exposures in the Children’s Health Study using 2-Step methods with subset testing (α=0.05). 4.7 Conclusion Using these novel approaches applied to the CHS, we identified evidence of G x Hispanic-ethnicity interaction for SNP rs1439945 at genomewide significance based on both measured and imputed genotypes. In our analysis of G x Hispanic-ethnicity for FEV1 in the CHS, we identified the SNP rs1439945 at genomewide significance using both the rVar|G×E and Joint|G×E methods. This SNP has not been detected in previous marginal GWAS scans. Based on our simulations, finding this locus using the rVar|G×E and Joint|G×E approaches may suggest that this SNP also interacts with additional factors other than Hispanic-ethnicity to affect lung function. For example, previous studies in the CHS have shown that FEV1 in children is affected by many factors, including air pollution [Gauderman et al., 2004], the oxidative stress gene GSTM1 [Gilliland et al., 2002], and in-utero tobacco smoke [Breton et al., 2009]. The power to detect interaction of the rs1439945 SNP with Hispanic-ethnicity by either the rVar|G×E or Joint|G×E approach would be increased if it also interacts with one of these additional factors, or some other factor, to affect lung function. Further interpretation of our finding will require additional study, including replication in other samples. 84 Chapter 5: Summary The “missing” heritability suggests that the complex trait is more complicated than can be defined by models with only marginal effect. It is widely accepted that the etiology of many complex traits involves not only genetic and environmental factors but also interactions between the two as well as G x G interactions. The development of powerful tools that incorporate both the effect of genes and environment offers an opportunity to better understand the etiology of complex trait. The focus of this dissertation is the development and application of two new methods for detecting interactions for a quantitative trait in a genomewide setting. Both methods use a novel test of variance heterogeneity that does not depend on the size of marginal G effect. As a result, these two new methods, Joint|G×E and rVar|G×E, have the potential to identify novel SNPs that have been missed from primary GWAS scans. I demonstrated that new methods can preserve the nominal type 1 error rates, and greater power can be achieved when multiple G×E interactions are involved in the development of the complex trait. We have developed efficient software “G×EScan” designed to facilitate researchers to perform G×E scans using an extensive approaches including our new methods. Finally I applied new methods to several environmental factors to investigate the etiology of childhood lung function in Children’s Health Study. One novel locus was identified from G × Hispanic-ethnicity scan at the genomewide significance level. 85 References Breton CV, Vora H, Salam MT, et al. (2009). Variation in the GST mu Locus and Tobacco Smoke Exposure as Determinants of Childhood Lung Function. American Journal of Respiratory and Critical Care Medicine,179(7):601-607. Aschard, H, Lutz, S., Maus, B., Duell, E. J., Fingerlin, T., Chatterjee, N, et al. (2012). Challenges and Opportunities in Genome-Wide Environmental Interaction (GWEI) studies. Human Genetics, 131(10),1591–1613. Devlin, B. and Roeder, K. (1999), Genomic Control for Association Studies. Biometrics, 55: 997–1004. Dudbridge, F., & Gusnanto, A. (2008). Estimation of significance thresholds for genomewide association scans. Genetic Epidemiology, 32(3), 227–234. Dai J, Kooperberg C, LeBlanc M, Prentice R. (2010). On two-stage hypothesis testing procedures via asymptotically independent statistics, Paper 367. (Working Paper Series).University of Washington. Fisher RA. (1932). Statistical Methods for Research Workers. London: Oliver& Boyd. Garcia-Closas M, Lubin JH. (1999). Power and sample size calculations in casecontrol studies of gene-environment interactions: comments on different approaches. Am J Epidemiol 149(8): 689-692. Gauderman WJ, Zhang P, Morrison JL, Lewinger JP. (2013). Finding novel genes by testing G x E interactions in a genome-wide association study. Genet. Epidemiol 37(6):603-613. Gauderman WJ. (2003). Candidate gene association analysis for a quantitative trait, using parent offspring trios. Genet. Epidemiol 25(4):327-338. Gauderman WJ, Urman R, Avol E, et al. (2015). Association of improved air quality with lung development in children. N Engl J Med 372(10):905-913. Gauderman W, Avol E, Gilliland F, et al. (2004). The effect of air pollution on lung development from 10 to 18 years of age. N Engl J Med 351: 1057–67. Gilliland FD, Gauderman WJ, Vora H, et al. (2002). Effects of glutathione-S-transferase M1, T1, and P1 on childhood lung function growth. Am J Respir Crit Care Med 166:710–716. Hindorff LA, MacArthur J, Morales J, Junkins HA, Hall PN, Klemm AK, and Manolio TA. (2015). A Catalog of Published Genome-Wide Association Studies. Available at: www.genome.gov/gwastudies. 86 Hwang SJ, Beaty TH, Liang KY, Coresh J, Khoury MJ. (1994). Minimum sample size estimation to detect geneenvironment interaction in case-control designs. Am J Epidemiol 140(11):1029-1037. Hsu L, Jiao S, Dai JY, Hutter C, Peters U, Kooperberg C. (2012). Powerful cocktail methods for detecting genome-wide gene-environment interaction. Genet Epidemiol 36(3): 183-194. Hancock DB, Eijgelsheim M, Wilk JB, et al. (2010). Meta-analyses of genome-wide association studies identify multiple novel loci associated with pulmonary function. Nat Genet 42(1):45 52. B. N. Howie, P. Donnelly, and J. Marchini. (2009). A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529. Ionita-Laza I, McQueen MB, et al. (2007). Genomewide weighted hypothesis testing in family based association studies, with an application to a 100K scan. Am J Hum Genet 81(3): 607 614. Kraft P, Yen YC, Stram DO, Morrison J, Gauderman WJ. (2007). Exploiting gene-environment interaction to detect genetic associations. Hum Hered 63:111–119. Kooperberg C, Leblanc M. (2008). Increasing the power of identifying gene x gene interactions in genome-wide association studies. Genet Epidemiol 32(3): 255-263. Keller, M. C. 2014. Gene-by-environment interaction studies have not properly controlled for potential confounders: The problem and the (simple) solution. Biological Psychiatry, 75(1): 10. Levene H. 1960. Robust tests for equality of variances. Stanford University Press 278-292. Li Y, Willer CJ, Ding J, Scheet P and Abecasis GR (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 34:816-834. Li Y, Willer CJ, Sanna S and Abecasis GR (2009) Genotype Imputation. Annu Rev Genomics Hum Genet 10:387-406. Maher B. (2008). Personal genomes: The case of the missing heritability. Nature 456(7218):18- 21. Manolio TA, Collins FS, Cox NJ, et al. (2009). Finding the missing heritability of complex diseases. Nature 461(7265):747-753. Manolio TA, Collins FS. (2007). Genes, environment, health, and disease: facing up to complexity. Hum Hered 63(2):63-66. 87 Murcray CE, Lewinger JP, Gauderman WJ. (2009). Gene-environment interaction in genome wide association studies (with commentaries and rejoinder). Am J Epidemiol 169(2): 219 226. Murcray CE, Lewinger JP, et al. (2011). Sample size requirements to detect gene-environment interactions in genome-wide association studies. Genet Epidemiol 35(3): 201-210. McConnell R, Berhane K, Yao L, et al. (2006). Traffic, susceptibility, and childhood asthma. Environ Health Perspect 114(5):766-772. Piegorsch WW, Weinberg CR, Taylor JA (1994) Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat Med 13:153–162. Pare´ G, Cook NR, Ridker PM, Chasman DI. (2010). On the use of variance per genotype as a tool to identify quantitative trait interaction effects: a report from the Women’s Genome Health Study. PLoS Genet 6(6): e1000981. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D., et al. (2007). PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. American Journal of Human Genetics, 81(3), 559–575. Takeuchi F, Kobayashi S, Ogihara T, et al. (2011). Detection of common single nucleotide polymorphisms synthesizing quantitative trait association of rarer causal variants. Genome Res 21:1122-1130. Zuk O, Hechter E, Sunyaev SR, Lander ES. (2012). The mystery of missing heritability: genetic interactions create phantom heritability. Proc Natl Acad Sci USA 109(4):1193-1198.
Abstract (if available)
Abstract
A genome-wide association study (GWAS) typically is focused on detecting marginal genetic effects. However, many complex traits are likely to be the result of the interplay of genes and environmental factors. These SNPs may have a weak marginal effect and thus unlikely to be detected from a scan of marginal effects, but may be detectable in a gene-environment (G×E) interaction analysis. However, a genome-wide interaction scan (GWIS) using a standard test of G×E interaction is known to have low power, particularly when one corrects for testing multiple SNPs. Two 2-step methods for GWIS have been previously proposed, aimed at improving efficiency by prioritizing SNPs most likely to be involved in a G×E interaction using a screening step. For a quantitative trait, these include a method that screens on marginal effects [Kooperberg and Leblanc, 2008] and a method that screens on variance heterogeneity by genotype [Paré et al., 2010]. In this dissertation I show that the Paré et al. approach has an inflated false-positive rate in the presence of an environmental marginal effect, and we propose an alternative that remains valid. We also propose a novel 2-step approach that combines the two screening approaches, and provide simulations demonstrating that the new method can outperform other GWIS approaches. We develop software, as modules to the program “G×EScan”, to implement the novel methods along with other existing methods. Application of novel methods to a G x Hispanic-ethnicity scan for childhood lung function reveals a SNP near the MARCO locus that was not identified by previous marginal-effect scans.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Minimum p-value approach in two-step tests of genome-wide gene-environment interactions
PDF
Efficient two-step testing approaches for detecting gene-environment interactions in genome-wide association studies, with an application to the Children’s Health Study
PDF
Combination of quantile integral linear model with two-step method to improve the power of genome-wide interaction scans
PDF
High-dimensional regression for gene-environment interactions
PDF
Comparisons of four commonly used methods in GWAS to detect gene-environment interactions
PDF
Hierarchical approaches for joint analysis of marginal summary statistics
PDF
Gene-set based analysis using external prior information
PDF
Inference correction in measurement error models with a complex dosimetry system
PDF
Leveraging functional datasets of stimulated cells to understand the relationship between environment and diseases
PDF
Characterization and discovery of genetic associations: multiethnic fine-mapping and incorporation of functional information
PDF
Adaptive set-based tests for pathway analysis
PDF
Polygenic analyses of complex traits in complex populations
PDF
Two-step study designs in genetic epidemiology
PDF
Bayesian model averaging methods for gene-environment interactions and admixture mapping
PDF
Shortcomings of the genetic risk score in the analysis of disease-related quantitative traits
PDF
Statistical analysis of high-throughput genomic data
PDF
A genome wide association study of multiple sclerosis (MS) in Hispanics
PDF
Genome-wide characterization of the regulatory relationships of cell type-specific enhancer-gene links
PDF
Prediction and feature selection with regularized regression in integrative genomics
PDF
Best practice development for RNA-Seq analysis of complex disorders, with applications in schizophrenia
Asset Metadata
Creator
Zhang, Pingye
(author)
Core Title
Two-step testing approaches for detecting quantitative trait gene-environment interactions in a genome-wide association study
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
04/20/2016
Defense Date
03/21/2016
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
GWAS,OAI-PMH Harvest,quantitative trait,two-step methods,variance heterogeneity
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Gauderman, William James (
committee chair
), Conti, David (
committee member
), Lewinger, Juan Pablo (
committee member
), Sun, Fengzhu (
committee member
)
Creator Email
pingyezhangusc@gmail.com,sleepingzpy@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-235839
Unique identifier
UC11279054
Identifier
etd-ZhangPingy-4314.pdf (filename),usctheses-c40-235839 (legacy record id)
Legacy Identifier
etd-ZhangPingy-4314.pdf
Dmrecord
235839
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Zhang, Pingye
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
GWAS
quantitative trait
two-step methods
variance heterogeneity