BAYESIAN MODEL AVERAGING METHODS FOR GENE-ENVIRONMENT INTERACTIONS AND ADMIXTURE MAPPING
by
Lilit Chemenyan Moss
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)
August 2018
Dedication
To my husband and daughter, Rob and Leona, my grandmother Azniv, and my parents,
Adam and Diana.
Acknowledgements
My sincerest gratitude and respect to David V. Conti, the chair of my committee
and advisor, for his infinite patience, wisdom, guidance, support, and encouragement
throughout my time as a graduate student. Special thanks to all of my committee members,
William J. Gauderman, Duncan C. Thomas, Daniel O. Stram, and Lilyana Amezcua for
their advice, direction, and involvement. Thank you also to other members of the USC
biostatistics faculty, in particular, Juan Pablo Lewinger for his encouragement and helpful
suggestions. Thank you also to the very knowledgeable and helpful USC preventive
medicine staff for their support and guidance throughout my time in the program.
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1 Background
1.1 The Bayesian Approach
1.2 Bayesian Model Averaging
1.2.1 Overview
1.2.2 Implementation of BMA
1.3 GxE Interaction Studies
1.3.1 Motivation
1.3.2 Overview
1.3.3 Review of GxE Methods
1.4 Admixture Mapping
1.4.1 Background
1.4.2 Motivation
1.4.3 Admixture Mapping Methods
Chapter 2 GxE Analysis Using Bayesian Model Averaging
2.1 Abstract
2.2 Introduction
2.3 Methods
2.4 Simulations
2.4.1 Single-Marker Simulations
2.4.2 Genome-Wide Simulations
2.6 Results
2.6.1 Single-Marker Simulation Results
2.6.2 Genome-Wide Simulation Results
2.7 Discussion
Chapter 3 Admixture Mapping Using Bayesian Model Averaging
3.1 Abstract
3.2 Introduction
3.3 Methods
3.4 Results
3.4 Discussion
Chapter 4 Application of Methods
4.1 GxE Analysis of Childhood Asthma
4.2.1 Introduction: Childhood Asthma
4.2.2 Methods: Childhood Asthma
4.2.2 Results: Childhood Asthma
4.2.4 Discussion: Childhood Asthma
4.3 Admixture Mapping in Prostate Cancer
4.3.1 Introduction: Prostate Cancer
4.3.2 Methods: Prostate Cancer
4.3.3 Results: Prostate Cancer
4.3.4 Discussion: Prostate Cancer
4.4 Admixture Mapping in Multiple Sclerosis
4.4.1 Introduction: Multiple Sclerosis
4.4.2 Methods: Multiple Sclerosis
4.4.3 Results: Multiple Sclerosis
4.4.4 Discussion: Multiple Sclerosis
Chapter 5 R Packages
5.1 BMA for GxE Interaction Software
5.2 BMA for Admixture Software
Chapter 6 Summary and Future Directions
6.1 Summary of Findings
6.2 Contribution to the State of the Art
6.3 Future Directions
Bibliography
List of Tables
Table 2-1 Type I error rates across one-step methods in scenarios with and without G-E
association. Error rate is calculated as the proportion of independent markers identified
by a given method as having interaction with E out of all independent markers
simulated. Type I error rate for the marginal association model is calculated as the
proportion of simulated SNPs identified by the marginal model as having significant
effect on outcome from all independent SNPs simulated. (Top) Error rates shown for
null effects of both marginal G and GxE interaction; (Bottom) Error rates shown for
marginal G effect OR(G)=1.2 and null GxE interaction effect.
Table 3-1 Effect estimates, standard error of effects, and power for scenarios B
($A_{\ell k}^{\mathrm{Aff}} = 0.005$, $A_{\ell k}^{\mathrm{Unaff}} = 0.0$) and F
($A_{\ell k}^{\mathrm{Aff}} = 0.005$, $A_{\ell k}^{\mathrm{Unaff}} = 0.005$), for
prior model CO:CC odds of 10:1, 1:1, 1:10.
Table 4-1 Stratified marginal analysis of rs6866110 and rs4672623 by exposure group
Table 4-2 Top loci ranked by BMA P-value for G × Hispanicity interaction on asthma
susceptibility
Table 4-3 Top loci ranked by BMA 2DF P-value for G × PM2.5 interaction on asthma
susceptibility
Table 4-4 Mean global African ancestry in AAPC and LAPC samples, and mean global
Amerindian ancestry in LAPC sample across cases and controls.
Table 4-5 Admixture analysis significant regions in prostate cancer on chromosomes 8 and
16 for African and Amerindian ancestry within AAPC and LAPC samples. Effect sizes
are negative due to the definition of deviance as the difference between global and
local ancestry.
Table 4-6 Distribution of multiple sclerosis cases and controls in Hispanic White
individuals
Table 4-7 Mean global European, Amerindian, and African ancestry of multiple sclerosis
cases and controls.
Table 4-8 Most significant regions in admixture mapping of MS on Hispanic whites with
Amerindian ancestry.
List of Figures
Figure 1-1 2x2 contingency tables for outcome and genotype stratified by exposure
(Gauderman et al., 2017)
Figure 1-2 2x2 contingency tables for exposure and genotype stratified by outcome
(Gauderman et al., 2017)
Figure 2-1 Heatmaps depicting power patterns for detection of GxE interaction across a
marginal G and GxE interaction effect range r = [-1.0, +1.0] for one-step methods on
1,000 simulations of 500 cases and 500 controls. Within each heatmap plot in the grid,
the x-axis shows the simulated marginal G effect with the null indicated by a vertical
line. The y-axis is the simulated GxE effect with the null indicated by a horizontal
line. The grid columns of Figure 1 represent the simulated G-E association in the
population.
Figure 2-2 Empirical Power measured across a range r = [-1.0, +1.0] of G-E association
with and without a GxE interaction and marginal effect for CO DF2 and BMA 2DF
approaches. BMA(100:1) and BMA(1:100) represent an analysis of BMA 2DF with
prior weighting based on a 100:1 and 1:100 odds of a CC model being more
appropriate than a CO model respectively. A) OR(GxE)=1.0 & OR(G)=1.0, B)
OR(GxE)=1.5 & OR(G)=1.0, C) OR(GxE)=1.0 & OR(G)=1.2, D) OR(GxE)=1.5 &
OR(G)=1.2.
Figure 2-3 Empirical power vs. OR(GxE) with independence between G and E (plots A-
C). Based on genome-wide simulations of 1 million SNPs with 1000 repetitions and
one designated causal SNP in each repetition. A) OR(G) = 1.0 & OR(E) = 1.0; B)
OR(G) = 1.2 & OR(E) = 1.2; C) Both OR(G) and OR(E) are induced by the interaction
effect and are not held constant.
Figure 2-4 Receiver operating characteristic (ROC) curves for True and False positives in
simulations of 1000 repetitions of 10,000 SNPs. A) 20 SNPs with non-zero GxE
interaction (causal), no presence of non-causal SNPs associated with E, presence of
marginal effect of causal SNPs. B) 20 SNPs with non-zero GxE interaction (causal),
500 non-causal SNPs associated with E, presence of marginal effect of causal SNPs.
C) 20 SNPs with non-zero GxE interaction (causal), 500 non-causal SNPs associated
with E, no marginal effect of causal SNPs.
Figure 3-1 Power and effect estimates vs. simulation scenarios A-G. Scenarios are defined
by the average simulated value of ancestry deviations in controls only,
$A_{\ell k}^{\mathrm{Unaff}}$, for some locus $\ell$ while average deviation in cases is
held constant at $A_{\ell k}^{\mathrm{Aff}} = 0.005$. Scenarios A-G are associated with
the following list of $A_{\ell k}^{\mathrm{Unaff}}$ values respectively:
$-0.00125$, $0.0$, $0.00125$, $0.0025$, $0.00375$, $0.005$, $0.00625$.
Power and effect estimates shown for models CO, CC, and BMA with prior CO:CC
odds 10:1, 1:1, 1:10.
Figure 3-3 Sensitivity analysis of $\mathrm{Var}(\beta_j)$ using $c = 0.1, 0.5, 2.0, 100$
multiples of $SE(A_{\ell k}^{\mathrm{Unaff}})$. $SE(A_{\ell k}^{\mathrm{Unaff}})$ is
abbreviated by $s^*$.
Figure 3-2 Posterior probability of a case-control model for scenarios B
($A_{\ell k}^{\mathrm{Aff}} = 0.005$, $A_{\ell k}^{\mathrm{Unaff}} = 0.0$) and F
($A_{\ell k}^{\mathrm{Aff}} = 0.005$, $A_{\ell k}^{\mathrm{Unaff}} = 0.005$), for
prior model CO:CC odds of 100:1, 10:1, 1:1, 1:10, 1:100.
Figure 4-1 Distribution of PM2.5 exposure (micrograms per cubic meter) among 3000
asthma cases and controls. Bins indicate PM2.5 values observed for all children in each
neighborhood-cohort group.
Figure 4-2 Manhattan plots for Joint BMA (BMA-2DF), Marginal (MA), Case-Control
(CC), and Case-Only (CO) analysis of 6,216,909 SNPs for G × PM2.5 interaction
with childhood asthma.
Figure 4-3 Manhattan plots for Joint BMA (BMA-2DF), Marginal (MA), Case-Control
(CC), and Case-Only (CO) analysis of 6,217,300 SNPs for G × Hispanicity
interaction with childhood asthma.
Figure 4-4 Admixture mapping of prostate cancer using African ancestry in the AAPC on
chromosome 8: -log(P-values) (top), posterior probability of a CC model (middle), and
mean local ancestry relative to global ancestry for cases and controls separately
(bottom). BMA approach (red) was conducted using CO:CC odds of 1:1, with prior
variance $\mathrm{Var}(\beta_{BMA}) = SE(A_{\ell k}^{\mathrm{Unaff}})$ calculated prior
to the analysis. Vertical grey lines indicate known prostate cancer regions.
Figure 4-5 Admixture mapping of prostate cancer using African ancestry in the LAPC on
chromosome 8. BMA approach (red) was conducted using CO:CC odds of 1:1, with
prior variance $\mathrm{Var}(\beta_{BMA}) = SE(A_{\ell k}^{\mathrm{Unaff}})$ calculated
prior to the analysis. Vertical grey lines indicate known prostate cancer regions.
Figure 4-6 Admixture mapping of prostate cancer using Amerindian ancestry in the LAPC
on chromosome 16. BMA approach (red) was conducted using CO:CC odds of 1:1,
with prior variance $\mathrm{Var}(\beta_{BMA}) = SE(A_{\ell k}^{\mathrm{Unaff}})$
calculated prior to the analysis. Vertical grey lines indicate known prostate cancer regions.
Figure 4-7 Admixture mapping of multiple sclerosis using Amerindian ancestry on
chromosome 5. BMA approach (red) was conducted using CO:CC odds of 1:1, with
prior variance $\mathrm{Var}(\beta_{BMA}) = SE(A_{\ell k}^{\mathrm{Unaff}})$ calculated
prior to the analysis.
Figure 4-8 Admixture mapping of multiple sclerosis using Amerindian ancestry on
chromosome 6. BMA approach (red) was conducted using CO:CC odds of 1:1, with
prior variance $\mathrm{Var}(\beta_{BMA}) = SE(A_{\ell k}^{\mathrm{Unaff}})$ calculated
prior to the analysis.
Figure 4-9 Admixture mapping of multiple sclerosis using Amerindian ancestry on
chromosome 18. BMA approach (red) was conducted using CO:CC odds of 1:1, with
prior variance $\mathrm{Var}(\beta_{BMA}) = SE(A_{\ell k}^{\mathrm{Unaff}})$ calculated
prior to the analysis.
Figure 4-10 Admixture mapping of multiple sclerosis using Amerindian ancestry on
chromosome 21. BMA approach (red) was conducted using CO:CC odds of 1:1, with
prior variance $\mathrm{Var}(\beta_{BMA}) = SE(A_{\ell k}^{\mathrm{Unaff}})$ calculated
prior to the analysis.
Abstract
Evidence suggests that identifying genetic contributions to the risk of complex
diseases requires moving beyond independent tests of association between markers and
traits. The purpose of this study is to present two methods within a Bayesian framework to
be used in identifying gene-by-environment (GxE) interactions and genomic regions
contributing to differential disease risk by ancestry. We first introduce a GxE approach
which combines a Bayesian framework with a two-degree-of-freedom (2df) test structure
for a simultaneous test of main and interaction effects. Simulations are used to present a
comparison study of classical and more complex GxE approaches used currently and
demonstrate that our proposed method performs similarly to existing 2df approaches with
increased power and robustness in numerous scenarios. A second approach is introduced
to perform admixture mapping and map susceptibility loci to complex disease with parental
ancestry. Our admixture mapping approach provides a linear regression framework in
which we reformulate the often used case-control and case-only statistics as nested
regression models which are combined within a Bayesian model selection framework.
Simulation is used to demonstrate that this approach is advantageous over using case-control
or case-only statistics alone, offering increased power and robustness. We conduct two genome-wide
interaction studies (GWIS) for childhood asthma using air pollution and ethnicity as
environmental factors in a nested case-control sample from the Children’s Health Study
(CHS). We conduct an admixture mapping of prostate cancer (PrCa) in African Americans
and Latinos from the Multiethnic Cohort as well as multiple sclerosis in Hispanic Whites
using our proposed method, as well as case-control and case-only methods.
Chapter 1 Background
1.1 The Bayesian Approach
At the heart of a Bayesian approach is a simple probability result, Bayes' theorem,
attributed to Thomas Bayes from 1763 (Barnard, 1958), involving two events, A and B:
$$P(B \mid A) = \frac{P(B)\,P(A \mid B)}{P(A)}$$
Here, the event B is of interest and the event A is observed. $P(B)$ is the prior probability
of B (the belief regarding the probability of event B prior to observing data), and
$P(B \mid A)$ is the posterior probability of B given that A has occurred. In essence, Bayes'
theorem can be thought of as a formula for updating a prior probability to a posterior
probability through the observation of data (O'Hagan & Forster, 2004). Bayesian inference,
the application of the Bayesian approach to problems of statistical inference, then interprets
Bayes' theorem as a relationship between an unknown parameter of interest, $\theta$, prior beliefs
regarding $\theta$, and observed data $x$. Thus, a Bayesian approach treats inference as an update
on prior belief. The approach has the following sequence (O'Hagan & Forster, 2004):
1. Define a likelihood function $f(x \mid \theta)$
2. Define a prior distribution for the unknown parameter, $f(\theta)$
3. Derive the posterior distribution $f(\theta \mid x)$ of the unknown parameter through an
application of Bayes’ theorem
4. Make inferences about $\theta$ from its posterior distribution
Fundamentally, the most significant feature distinguishing a Bayesian approach from a
classical frequentist approach is the treatment of parameters as random variables with prior
distributions. In contrast to the Bayesian approach, the frequentist method assumes the
observed data to be the only source of information about 𝜃 , excluding prior belief, and uses
the long-run frequency behavior of an event in an infinite sequence of hypothetical
sampling of similar circumstances to assign probability of the event (Bernardo, 2000).
To illustrate the relationship between prior belief and the observation of data in the
calculation of a posterior, we present an example from O'Hagan and Forster (2004): Let
$x \sim N(\theta, v)$, where $\theta$ is the unknown mean and $v$ is the known variance. Further, assume
a normal prior distribution for $\theta$, where $\theta \sim N(m, w)$. Using Bayes’ theorem, the posterior
distribution is also normal, where $\theta \mid x \sim N(m_1, w_1)$ with
$$m_1 = \frac{wx + vm}{w + v}, \quad \text{and} \quad w_1 = \frac{vw}{w + v}.$$
Here, we note that the posterior mean $m_1$ is a weighted average of the prior mean and the
observed data. Rewriting the posterior mean using precision instead of variance yields
$$m_1 = \frac{v^{-1} x + w^{-1} m}{v^{-1} + w^{-1}},$$
and makes clear that the posterior mean is simply the average of the prior mean and the
observed $x$, each weighted by its precision. Thus, the proximity of $m_1$ to either
source of information depends on the strength of that source. In this dissertation, a
Bayesian approach to inference and model selection is used to construct models for
genome-wide testing of GxE interactions and admixture signals in genetic studies. We use
the Bayesian approach by assigning prior distributions for regression model parameters.
We specify prior distributions by choosing the appropriate prior means and variances. In
our approach for testing GxE interactions, the choice of prior variance influences posterior
distributions of the regression models, and influences the weight placed on information
from the regression models. In our approach to admixture mapping, we aim to combine
multiple regression models by taking a weighted average. By using the Bayesian
framework and assigning prior variance to parameters, we can influence the posterior
weights used in our average. Given a nested set of models, prior variance on a parameter
can determine the range of values acceptable in the observed data. This range can then
inform model posteriors based on whether or not they include the parameter. In other
words, if variance is large for a parameter with prior mean zero, the posterior probabilities
of models without the term may be higher than those of models including the term.
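As a minimal sketch of the normal-normal example above, the following Python snippet computes the posterior mean and variance, and then illustrates how a large prior variance on a parameter with prior mean zero can shift posterior support toward the model that omits it. The function names, the two-model setup ($\theta = 0$ versus $\theta \sim N(0, w)$), the equal prior model probabilities, and the numeric values are illustrative assumptions and are not taken from the dissertation.

```python
import math


def normal_posterior(x, v, m, w):
    """Posterior mean and variance of theta for x ~ N(theta, v), theta ~ N(m, w)."""
    m1 = (w * x + v * m) / (w + v)
    w1 = (v * w) / (w + v)
    return m1, w1


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def prob_null_model(x, v, w, prior_null=0.5):
    """Posterior probability of M0 (theta = 0) versus M1 (theta ~ N(0, w)).

    Marginal likelihoods: x ~ N(0, v) under M0 and x ~ N(0, v + w) under M1.
    """
    term0 = normal_pdf(x, 0.0, v) * prior_null
    term1 = normal_pdf(x, 0.0, v + w) * (1.0 - prior_null)
    return term0 / (term0 + term1)


# Equal prior and data precision: the posterior mean lies halfway between m and x.
m1, w1 = normal_posterior(x=2.0, v=1.0, m=0.0, w=1.0)

# Hypothetical observation x = 2: support for the model without the term
# grows as the prior variance w on the term increases.
probs = [prob_null_model(x=2.0, v=1.0, w=w) for w in (1.0, 100.0, 10000.0)]
```

Under these assumed values the posterior mean is 1.0 with variance 0.5, and the posterior probability of the model without the term increases with the prior variance.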
1.2 Bayesian Model Averaging
1.2.1 Overview
The problem of model selection arises when there is more than one statistical model
that can be used to describe the effects of covariates on a dependent outcome, or to predict
future outcomes. The practice of selecting a “best” model often involves numerous choices
concerning definition of outliers, inclusion of covariates based on significance tests,
variable transformations, and comparisons between possible models. This practice often
ignores power considerations when using significance tests, compares un-nested models,
and ignores the uncertainty surrounding the model selection process itself (Kass & Raftery,
1993). Failing to account for model uncertainty in this way results in underestimating the
uncertainty in inferences made conditionally on the “best” model (Draper, 1995; Draper,
Hodges, Leamer, Morris, & Rubin, 1987; Kass & Raftery, 1993; Kass & Raftery, 1995;
Regal & Hook, 1991); thus, inferences made without accounting for model uncertainty are
overconfident. One way to account for model uncertainty is to incorporate
posterior probabilities of all possible models into the inferences using Bayesian model
averaging (BMA) (Adrian Raftery, 2015; Hoeting, Madigan, Raftery, & Volinsky, 1999; Kass &
Raftery, 1993; Leamer, 1978; Raftery, 1993a; Raftery, Madigan, & Hoeting, 1997;
Roberts, 1965; Stewart, 1987). Roberts (1965) introduced the BMA procedure, but
computational limitations delayed its widespread adoption for many years. As
advancements in computational tools made the method more feasible, Hoeting et al. (1999)
re-introduced BMA and outlined procedures and tools for implementing it.
BMA is ultimately a method of taking weighted averages of estimates from several
to numerous models. It can be used to consider all possible models in a model space, or to
average over a chosen set of $K$ models. Given a parameter, $\beta$, to be estimated using $K$
models, the BMA approach calculates a posterior density, conditional on the observed data
$D$, for the parameter $\beta$:
$$\Pr(\beta \mid D) = \sum_{k=1}^{K} \Pr(\beta_k \mid \mathcal{M}_k, D)\,\Pr(\mathcal{M}_k \mid D)$$
(Leamer, 1978), where $\mathcal{M}_k$ is the $k$th model and $\beta_k$ the model-specific estimate of $\beta$.
$\Pr(\beta_k \mid \mathcal{M}_k, D)$ is the posterior distribution of the estimate of $\beta$ specific to model
$\mathcal{M}_k$, and $\Pr(\mathcal{M}_k \mid D)$ is the posterior probability of model $\mathcal{M}_k$. The posterior
density of the parameter $\beta$ conditional on model $\mathcal{M}_k$ is:
$$\Pr(\beta_k \mid \mathcal{M}_k, D) = \int \Pr(\beta_k \mid D, \theta_k, \mathcal{M}_k)\,\Pr(\theta_k \mid D, \mathcal{M}_k)\,d\theta_k$$
where $\theta_k$ represents the parameters of model $\mathcal{M}_k$.
The posterior probability of model $\mathcal{M}_k$ is:
$$\Pr(\mathcal{M}_k \mid D) = \frac{\Pr(D \mid \mathcal{M}_k)\,\Pr(\mathcal{M}_k)}{\sum_{l=1}^{K} \Pr(D \mid \mathcal{M}_l)\,\Pr(\mathcal{M}_l)}$$
Here, $\Pr(\mathcal{M}_k)$ is the prior probability of model $\mathcal{M}_k$ being the most appropriate model, and
the marginal likelihood, or integrated likelihood, of model $\mathcal{M}_k$ is:
$$\Pr(D \mid \mathcal{M}_k) = \int \Pr(D \mid \theta_k, \mathcal{M}_k)\,\Pr(\theta_k \mid \mathcal{M}_k)\,d\theta_k,$$
where $\Pr(D \mid \theta_k, \mathcal{M}_k)$ is the likelihood of the data under $\mathcal{M}_k$, and $\Pr(\theta_k \mid \mathcal{M}_k)$ is the prior
density of $\theta_k$ (Hoeting et al., 1999). Further, $\Pr(D \mid \mathcal{M}_k)$ is the marginal probability of the
data $D$ under model $\mathcal{M}_k$ because it is the result of integrating the joint distribution of
$(D, \theta_k)$ given $\mathcal{M}_k$ over $\theta_k$ (Raftery & Richardson, 1996). Regarding both prior and posterior
model probabilities, we note that $\sum_{k=1}^{K} \Pr(\mathcal{M}_k) = 1$ and $\sum_{k=1}^{K} \Pr(\mathcal{M}_k \mid D) = 1$. From the
definition of the posterior density of $\beta$, expectation and variance follow:
$$E(\beta \mid D) = \sum_{k=1}^{K} \hat{\beta}_k \Pr(\mathcal{M}_k \mid D),$$
$$Var(\beta \mid D) = \sum_{k=1}^{K} \left[ Var(\beta_k \mid D, \mathcal{M}_k) + \hat{\beta}_k^2 \right] \Pr(\mathcal{M}_k \mid D) - \left[ E(\beta \mid D) \right]^2$$
(Hoeting et al., 1999; Leamer, 1978). Here, the model-specific estimate, $\hat{\beta}_k = E(\beta_k \mid D, \mathcal{M}_k)$,
is obtained as the expected value of $\beta_k$ given the model-specific posterior
density. Thus, the posterior density of parameter $\beta$ is an average of the posterior densities
of the model-specific $\beta_k$’s weighted by the product of the data-driven evidence for and the
prior belief about model $\mathcal{M}_k$.
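The posterior model probabilities and the model-averaged mean and variance defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the log marginal likelihoods, estimates, and variances are made-up inputs, and the model-specific quantities are assumed to have been computed elsewhere.

```python
import math


def posterior_model_probs(log_marginal_likelihoods, prior_probs):
    """Pr(M_k | D) from log Pr(D | M_k) and Pr(M_k), normalized over the K models."""
    log_terms = [ll + math.log(p) for ll, p in zip(log_marginal_likelihoods, prior_probs)]
    shift = max(log_terms)  # subtract the maximum for numerical stability
    weights = [math.exp(t - shift) for t in log_terms]
    total = sum(weights)
    return [wt / total for wt in weights]


def bma_mean_and_variance(estimates, variances, post_probs):
    """Model-averaged posterior mean and variance of beta.

    estimates[k] is E(beta_k | D, M_k); variances[k] is Var(beta_k | D, M_k).
    """
    mean = sum(b * p for b, p in zip(estimates, post_probs))
    second_moment = sum((v + b * b) * p for b, v, p in zip(estimates, variances, post_probs))
    return mean, second_moment - mean ** 2


# Hypothetical two-model example with equal evidence and equal prior probabilities.
probs = posterior_model_probs([-10.0, -10.0], [0.5, 0.5])
mean, var = bma_mean_and_variance([0.0, 1.0], [0.04, 0.04], probs)
```

In this made-up example the averaged variance exceeds either model-specific variance because the two models disagree about $\beta$; that between-model spread is the model uncertainty a single "best" model would ignore.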
1.2.2 Implementation of BMA
The main challenges of implementing a BMA approach have been intractable summations
resulting from a very large model space and the computation or estimation of the marginal
likelihood of $\mathcal{M}_k$, $\Pr(D \mid \mathcal{M}_k)$. For the first, Hoeting et al. (1999) recommend either
using an Occam’s window method (Madigan & Raftery, 1994) to reduce the model space or
approximating the weighted sum using a Markov chain Monte Carlo approach. The Occam’s
window approach to reducing model space is based on two rules. First, if a model predicts
the data significantly worse than the best model given a chosen threshold, it is discarded.
Second, models that are supported by the data less than simpler nested models are
discarded.
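The two Occam's window rules can be sketched as follows. This is a simplified illustration, not the implementation used in this dissertation: each model is represented as a set of included terms, "nested" is taken to mean a proper subset of terms, and the function name, the cutoff of 20, and the posterior probabilities are assumptions chosen for the example.

```python
def occams_window(post_probs, cutoff=20.0):
    """Reduce a model space with the two Occam's window rules.

    post_probs maps a frozenset of included terms to its posterior model probability.
    Returns the retained models with probabilities renormalized to sum to 1.
    """
    best = max(post_probs.values())
    # Rule 1: discard models supported far less than the best model.
    kept = {m: p for m, p in post_probs.items() if p > 0 and best / p <= cutoff}
    # Rule 2: discard models with a simpler nested model that has more support.
    retained = {}
    for model, p in kept.items():
        dominated = any(sub < model and p_sub > p for sub, p_sub in kept.items())
        if not dominated:
            retained[model] = p
    total = sum(retained.values())
    return {m: p / total for m, p in retained.items()}


# Hypothetical posterior probabilities over five candidate models.
models = {
    frozenset(): 0.05,
    frozenset({"G"}): 0.50,
    frozenset({"E"}): 0.001,
    frozenset({"G", "E"}): 0.30,
    frozenset({"G", "E", "GxE"}): 0.149,
}
window = occams_window(models)
```

With these made-up values, the model containing only E is removed by the first rule, and the two larger models are removed by the second rule because the simpler model containing only G has more support.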
For the latter problem, numerous approaches exist to tackle computing the
integrated likelihood for the numerous circumstances and classes of models an investigator
is likely to encounter. For linear regression models, Raftery et al. (1997) present a
closed-form expression for the marginal likelihood of the data under model $\mathcal{M}_k$ using the
standard normal-gamma conjugate class of priors based on Raiffa and Schlaifer (1961). In this
dissertation, we implement BMA for admixture mapping using conjugate priors that allow for
the calculation of a closed form of the marginal likelihood (see Chapter 3). Hoeting et al. (1999) suggest
constructing approximate Bayes factors using the Laplace method of estimation (Tierney
& Kadane, 1986) for generalized linear models where no closed-form solution is available.
Bayes factors (BF) date back to the first half of the 20th century (Good, 1950; Jeffreys,
1935), and are a method of testing hypotheses within a Bayesian framework without using
p-values. Within a model selection context and assuming two models, $\mathcal{M}_1$ vs $\mathcal{M}_2$, we
can represent the posterior probability of model $\mathcal{M}_1$ as
$$\Pr(\mathcal{M}_1 \mid D) = \frac{\Pr(D \mid \mathcal{M}_1)\,\Pr(\mathcal{M}_1)}{\Pr(D \mid \mathcal{M}_1)\,\Pr(\mathcal{M}_1) + \Pr(D \mid \mathcal{M}_2)\,\Pr(\mathcal{M}_2)}$$
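without loss of generality, as the normalized product of the marginal likelihood and the
prior. Then, taking a ratio of posterior probabilities yields
$$\frac{\Pr(\mathcal{M}_1 \mid D)}{\Pr(\mathcal{M}_2 \mid D)} = \frac{\Pr(D \mid \mathcal{M}_1)\,\Pr(\mathcal{M}_1)}{\Pr(D \mid \mathcal{M}_2)\,\Pr(\mathcal{M}_2)}$$
where $\Pr(\mathcal{M}_1 \mid D) / \Pr(\mathcal{M}_2 \mid D)$ is the posterior odds of $\mathcal{M}_1$ versus $\mathcal{M}_2$, since
$\Pr(\mathcal{M}_1 \mid D) + \Pr(\mathcal{M}_2 \mid D) = 1 \Rightarrow \Pr(\mathcal{M}_2 \mid D) = 1 - \Pr(\mathcal{M}_1 \mid D)$, and
$\Pr(\mathcal{M}_1) / \Pr(\mathcal{M}_2)$ is the prior odds following the same logic.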
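The BF here is defined as $B_{12} = \Pr(D \mid \mathcal{M}_1) / \Pr(D \mid \mathcal{M}_2)$, a ratio of marginal likelihoods
representing the evidence for $\mathcal{M}_1$ versus $\mathcal{M}_2$, and can be further thought of as the ratio of
the posterior odds of $\mathcal{M}_1$ to its prior odds (Kass & Raftery, 1995). If the two models are given
equal prior probabilities, the BF equals the posterior odds. Letting $\alpha_1$ be the prior odds of
model $\mathcal{M}_1$, $\alpha_1 = \Pr(\mathcal{M}_1) / (1 - \Pr(\mathcal{M}_1)) = \Pr(\mathcal{M}_1) / \Pr(\mathcal{M}_2)$, we can compute the
posterior probability for $\mathcal{M}_1$ using BFs:
$$\Pr(\mathcal{M}_1 \mid D) = \frac{\alpha_1 B_{12}}{1 + \alpha_1 B_{12}}.$$
If all $K$ models are being considered against a reference model $\mathcal{M}_0$, with
$\alpha_k = \Pr(\mathcal{M}_k) / \Pr(\mathcal{M}_0)$ the prior odds of $\mathcal{M}_k$ against $\mathcal{M}_0$, the posterior probability for
any model $\mathcal{M}_k$ can be computed using BFs in the following equation:
$$\Pr(\mathcal{M}_k \mid D) = \frac{\alpha_k B_{k0}}{\sum_{r=0}^{K} \alpha_r B_{r0}}$$
where $B_{00}$ and $\alpha_0$ are both 1.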
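A minimal Python sketch of this calculation follows, assuming every Bayes factor and prior odds is expressed relative to a common reference model $\mathcal{M}_0$ (so that $B_{00} = \alpha_0 = 1$) and that the normalizing sum runs over all models including the reference. The function name and input values are hypothetical.

```python
def posterior_probs_from_bayes_factors(bayes_factors, prior_odds):
    """Posterior model probabilities from Bayes factors against a reference model.

    bayes_factors[k] is B_k0 and prior_odds[k] is Pr(M_k) / Pr(M_0).
    Index 0 is the reference model, for which both entries are 1.
    """
    weights = [a * b for a, b in zip(prior_odds, bayes_factors)]
    total = sum(weights)
    return [wt / total for wt in weights]


# Two models with equal prior probability; the data favor M_1 over M_0 by a factor of 3.
probs = posterior_probs_from_bayes_factors([1.0, 3.0], [1.0, 1.0])
# probs == [0.25, 0.75]
```

Working with ratios against a single reference model means only relative evidence is needed, which is convenient when marginal likelihoods are available only up to approximation.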
Laplace approximation is a method of approximating integrals using a Taylor
expansion about a global maximum point (de Bruijn, 1981; Tierney & Kadane, 1986).
When approximating BFs, Laplace approximation is an asymptotic approach that can be
used to approximate the marginal likelihood of $D$ under $\mathcal{M}_k$. Here, the global maximum is
assumed to be the posterior mode of the model density, which is usually true in large
sample sizes. Although the Laplace method of approximation for the marginal likelihood
provides relatively accurate estimates for marginal densities (Tierney & Kadane, 1986),
software to compute elements of Laplace estimation (e.g., the posterior mode of $\theta_k$ and the
inverse Hessian evaluated at the mode) has not been widely available, though this is
changing (e.g., Curry, 2013). To work around this problem for implementing BMA,
Raftery (1996a) introduced a method of approximating Bayes factors using only the
maximum likelihood estimator (MLE), the deviance, and the information matrix. This
variation on Laplace approximation makes BMA feasible to run on most statistical
software because most software is able to compute MLEs. In this dissertation, we use the
variation to the Laplace method presented by Kass and Raftery (1995) using the MLE to
implement our gene-by-environment interaction model (see Chapter 2).
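To make the idea concrete, the sketch below applies the standard Laplace approximation (expansion around the posterior mode, not the MLE-based variation used in this dissertation) to a toy model where the marginal likelihood is known exactly: $y$ successes in $n$ Bernoulli trials with a Beta$(a, b)$ prior on the success probability. The model, function names, and numeric values are assumptions chosen only because the exact answer is available for comparison.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def exact_log_marginal(y, n, a, b):
    """Exact log Pr(D) for y successes in n Bernoulli trials with a Beta(a, b) prior."""
    return log_beta(y + a, n - y + b) - log_beta(a, b)


def laplace_log_marginal(y, n, a, b):
    """Laplace approximation: expand log(likelihood x prior) around the posterior mode."""
    s = y + a - 1.0       # exponent of theta in likelihood x prior
    f = n - y + b - 1.0   # exponent of (1 - theta)
    mode = s / (s + f)    # posterior mode (requires s > 0 and f > 0)
    log_joint = s * math.log(mode) + f * math.log(1.0 - mode) - log_beta(a, b)
    # Negative second derivative of the log joint density at the mode.
    neg_hessian = s / mode ** 2 + f / (1.0 - mode) ** 2
    return log_joint + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(neg_hessian)


# Hypothetical data: 37 successes in 100 trials with a Beta(2, 2) prior.
exact = exact_log_marginal(37, 100, 2.0, 2.0)
approx = laplace_log_marginal(37, 100, 2.0, 2.0)
```

For this sample size the two values agree closely on the log scale, consistent with the asymptotic nature of the approximation.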
Thus, the step-by-step implementation of BMA is as follows:
1. Choose prior model odds/probabilities for each model $\mathcal{M}_k$
2. Choose prior mean and variance for model parameters, $\beta_k$
3. If a closed-form solution exists for the marginal likelihood, obtain model posterior
probabilities, $\Pr(\mathcal{M}_k \mid D)$
4. If no closed-form solution exists for the marginal likelihood, calculate approximate
BFs using Laplace estimation with MLEs to obtain model posterior probabilities,
$\Pr(\mathcal{M}_k \mid D)$
5. Obtain model-specific estimates of the parameters of interest, $\hat{\beta}_k = E(\beta_k \mid D, \mathcal{M}_k)$
6. Calculate weighted mean and variance for the parameter of interest, not conditioned
on any one model:
$$\tilde{\beta}_{BMA} = E(\beta_{BMA} \mid D) = \sum_{k=1}^{K} \hat{\beta}_k \Pr(\mathcal{M}_k \mid D),$$
$$Var(\beta_{BMA} \mid D) = \sum_{k=1}^{K} \left[ Var(\beta_k \mid D, \mathcal{M}_k) + \hat{\beta}_k^2 \right] \Pr(\mathcal{M}_k \mid D) - \left[ E(\beta_{BMA} \mid D) \right]^2$$
7. Obtain test statistic $\mathcal{W} = \dfrac{\tilde{\beta}_{BMA}^2}{Var(\beta_{BMA} \mid D)} \sim \chi^2_n$
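Steps 5 through 7 can be sketched in a few lines of Python once the model-specific estimates, their variances, and the posterior model probabilities are in hand. The function name and the numerical inputs below are hypothetical and serve only to illustrate the arithmetic.

```python
def bma_summary(estimates, variances, post_probs):
    """Model-averaged mean, variance, and Wald-type statistic.

    estimates[k]  : model-specific estimate of the parameter under model k
    variances[k]  : model-specific posterior variance under model k
    post_probs[k] : posterior probability of model k (should sum to 1)
    """
    mean = sum(b * p for b, p in zip(estimates, post_probs))
    second_moment = sum((v + b**2) * p for b, v, p in zip(estimates, variances, post_probs))
    var = second_moment - mean**2
    wald = mean**2 / var
    return mean, var, wald


# Illustrative values for two models (not taken from any analysis in this text).
beta_hat = [0.40, 0.25]
var_hat = [0.04, 0.01]
probs = [0.3, 0.7]
bma_mean, bma_var, bma_wald = bma_summary(beta_hat, var_hat, probs)
```

Note that the averaged variance includes both the within-model variances and the spread of the model-specific estimates around the averaged mean, so it reflects model uncertainty as well as sampling uncertainty.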
1.3 GxE Interaction Studies
1.3.1 Motivation
While association studies are a powerful method of identifying low-penetrance
genetic markers, it is known that the contributors to the risk of diseases such as cancer are
complex and are the result of a complicated interplay of genetic and environmental risk
factors. Evidence suggests that additional genetic and environmental factors can be
identified by conducting gene-environment (GxE) interaction studies beyond what is
possible through marginal association studies of genetic markers or environmental
exposures alone (Boffetta et al., 2012; Hunter, 2005; Kraft, Yen, Stram, Morrison, &
Gauderman, 2007; Thomas, 2010). A GxE study examines whether the effect of an
environmental exposure differs based on genotype, or vice versa. Specifically, GxE
studies statistically test if the incidence of disease in the presence of both genetic and
environmental factors differs from the incidence expected given their individual effects.
GxE studies can contribute to the understanding of biological mechanisms and help foster
strategies for prevention and treatment of diseases. Additionally, GxE interaction studies
can aid in identifying high-risk groups for diseases and help inform public health
campaigns in prevention within such groups. Despite the motivations for GxE analysis,
there have been few interactions successfully identified by GxE analysis (e.g., NSAIDs
and the MGST1 gene in colon cancer, pesticide exposure and multiple chromosomal
regions in Parkinson’s Disease, and smoking and the NAT2 and UGT1A6 genes in bladder
cancer) (Dick et al., 2007; Nan et al., 2015; Ritz, Paul, & Bronstein, 2016; Scherr, 2014).
Major challenges to conducting successful GxE studies are exposure assessment, sample
size, and heterogeneity (Thomas, 2010). Thus, part of the reason so few interactions have
been found is that GxE studies require larger sample sizes to achieve the power necessary
to identify interactions (Aschard, 2016), and much of the development of GxE statistical
methods has been aimed at improving power. Below, we review the types of statistical
approaches currently available to conduct GxE studies.
1.3.2 Overview
In conducting GxE studies, there are hypothesis-driven and agnostic approaches.
The former, which study the role of genes and environment considering genetic markers
that are known or thought to play key roles in protein function, include pathway-based and
candidate gene approaches. The latter type is typically an agnostic genome-wide scan of
all available genetic markers and a given environmental exposure, with the aim of detecting
interactions along the entire genome. Whether the study is hypothesis-driven or agnostic,
however, the methods used to test a GxE interaction at a single genetic locus are largely
the same. Since there are two copies of each chromosome in humans, a given single
nucleotide polymorphism (SNP) is usually coded as 0/1/2 indicating the number of minor
alleles at the locus. When conducting an analysis, we can assume an additive effect of the
minor allele and code G as a trinary variable with the original SNP coding indicating the
number of minor alleles. For a dominant effect, G is coded as 0 when there are zero copies
of the minor allele (genotype AA) and as 1 when there are either one or two copies
(genotypes Aa or aa). For a recessive effect, G is coded as a 1 when there are two copies
of the minor allele (genotype aa) and 0 when there are one or zero (genotypes AA or Aa).
Outcomes (Y) and environmental exposures (E) may be categorical or continuous, with some
restrictions from the statistical models chosen. Statistical interaction is scale-dependent,
and either an additive or a multiplicative scale must be chosen. Despite a consensus among
epidemiologists that using an additive model is the appropriate analysis for assessing effect
modification (Greenland, 2009; Rothman, Greenland, & Lash, 2008; Rothman, Greenland,
& Walker, 1980), using additive models in interaction studies is rare (Knol, Egger, Scott,
Geerlings, & Vandenbroucke, 2009). Instead, multiplicative interaction models are
typically used. Statistical tests for interactions using a multiplicative model assume that the
relative risk of disease in the presence of both genetic and environmental factors should be
equivalent to the product of the relative risks of each of the separate factors if no effect
modification is present. Thus, a multiplicative model tests the departure of the observed
joint relative risk of the two factors from the product of their individual relative risks
expected under the null hypothesis. Given this null hypothesis, an observed joint risk that is
greater than the expected risk is known as a synergistic interaction; however, if the observed risk
is less than the expected, it is considered an antagonistic interaction (Rothman, 1976).
Because of their widespread use, our review of methods currently available for GxE
analysis, as well as the remainder of this text, will discuss multiplicative interaction studies.
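The genotype codings and the multiplicative null described in this overview can be sketched as follows. Both function names are hypothetical, and the relative risks in the example are made-up values used only to show the calculation.

```python
def code_genotype(minor_allele_count, mode="additive"):
    """Code a SNP genotype from its number of minor alleles (0, 1, or 2)."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor allele count must be 0, 1, or 2")
    if mode == "additive":
        return minor_allele_count  # 0/1/2 copies of the minor allele
    if mode == "dominant":
        return 1 if minor_allele_count >= 1 else 0  # Aa or aa versus AA
    if mode == "recessive":
        return 1 if minor_allele_count == 2 else 0  # aa versus AA or Aa
    raise ValueError(f"unknown mode: {mode}")


def multiplicative_interaction_ratio(rr_joint, rr_g, rr_e):
    """Observed joint relative risk divided by the product expected under no interaction."""
    return rr_joint / (rr_g * rr_e)


# Illustrative values: a ratio above 1 suggests a synergistic interaction,
# a ratio below 1 an antagonistic one, and a ratio of 1 no multiplicative interaction.
ratio = multiplicative_interaction_ratio(rr_joint=6.0, rr_g=2.0, rr_e=1.5)
```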
1.3.3 Review of GxE Methods
Case-Control and Case-Only Tests for GxE Interaction
For binary G and exposure E, the distribution of case-control status, Y, by exposure
and genotype status, can be represented by two 2x2 contingency tables (see Figure 1-1).
Odds ratios for the effect of G within exposed and unexposed subjects may be calculated
using each of these tables (see Figure 1-1) as $\mathrm{OR}_{GY|E=1} = ad/bc$ and $\mathrm{OR}_{GY|E=0} = eh/fg$.
The case-control (CC) multiplicative GxE interaction effect may then be
calculated as the ratio of odds ratios:
$$\mathrm{OR}_{GxE} = \frac{\mathrm{OR}_{GY|E=1}}{\mathrm{OR}_{GY|E=0}} = \frac{ad/bc}{eh/fg}$$
Conversely, the contingency tables in Figure 1-1 can be arranged to be stratified instead by
disease status (see Figure 1-2). From this configuration, we can calculate the interaction
effect by
$$\mathrm{OR}_{GxE} = \frac{\mathrm{OR}_{GE|Y=1}}{\mathrm{OR}_{GE|Y=0}} = \frac{ag/ce}{bh/df},$$
which allows for explicit representation of the G-E association in the population. We can
represent the G-E association by $\mathrm{OR}_{GE|Y=0} = bh/df$; hence, if we can assume
independence between G and E, we can constrain $\mathrm{OR}_{GE|Y=0} = 1$. Under this assumption,
the GxE interaction effect becomes $\mathrm{OR}_{GxE} = \mathrm{OR}_{GE|Y=1}$, which is the G-E association in
cases, without information from controls. This comprises the case-only (CO) design for
GxE interaction (Piegorsch, Weinberg, & Taylor, 1994). Both the case-control and
case-only approaches can be tested using regression models (see Chapter 2). While a case-only
approach is more powerful than a case-control approach, using cases exclusively does not
supply estimates of main effects of E and G, and it is also susceptible to bias if the G-E
independence assumption is not true.

              E=1                      E=0
          Y=1    Y=0               Y=1    Y=0
   G=1     a      b         G=1     e      f
   G=0     c      d         G=0     g      h

Figure 1-1 2x2 contingency tables for outcome and genotype
stratified by exposure (Gauderman et al., 2017)
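The odds ratio calculations in this subsection can be checked numerically with the cell labels a through h of Figure 1-1. The function name and the counts below are hypothetical; the sketch only confirms that the two stratifications give the same case-control interaction odds ratio and shows what the case-only estimate omits.

```python
def gxe_odds_ratios(a, b, c, d, e, f, g, h):
    """Interaction odds ratios from the 2x2x2 cell counts of Figure 1-1."""
    or_gy_exposed = (a * d) / (b * c)    # G-Y odds ratio among E=1
    or_gy_unexposed = (e * h) / (f * g)  # G-Y odds ratio among E=0
    or_ge_cases = (a * g) / (c * e)      # G-E odds ratio among cases (Y=1)
    or_ge_controls = (b * h) / (d * f)   # G-E odds ratio among controls (Y=0)
    return {
        "cc_by_exposure": or_gy_exposed / or_gy_unexposed,
        "cc_by_disease": or_ge_cases / or_ge_controls,
        "case_only": or_ge_cases,
        "ge_in_controls": or_ge_controls,
    }


# Made-up counts for illustration only.
ors = gxe_odds_ratios(a=60, b=30, c=40, d=70, e=50, f=50, g=50, h=50)
```

In this made-up table the G-E odds ratio in controls is well below 1, so the case-only estimate (1.5) differs markedly from the case-control estimate (3.5), which illustrates the bias that arises when the independence assumption fails.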
2-Degree-of-Freedom Tests
GxE interaction approaches which test main or marginal effects in conjunction with
interaction effects as two degree of freedom (2df) tests have been recently proposed (Dai
et al., 2012; Kraft et al., 2007; Tchetgen Tchetgen, 2011). These approaches jointly test
estimates obtained from CC, CO, and marginal association models to improve power for
discovery in both genetic association and GxE interaction studies. In a post-GWA era in which
marginal association studies have been successfully conducted for numerous diseases
(Hindorff et al., 2009), missing heritability remains an issue for many diseases (Lander,
2011). Thus, an underlying argument for joint testing of estimates is that loci that have not
been uncovered using either marginal association testing or GxE interaction studies may
be uncovered by combining both effects. In this dissertation, we will implement a method
introduced by Kraft et al. (2007) as one 2df procedure in simulation studies and real-data
application in our GxE analysis (see Chapter 4). The method is described as a case-control
test for genetic association allowing for heterogeneity in genetic effect across exposure
strata, and is implemented using a logistic regression model in a case-control sample. A
likelihood ratio test is used to test a null hypothesis that assumes both interaction and main
effects are null, $H_0: \beta_G = \beta_{GxE} = 0$. Similarly, Dai et al. (2012) introduced a method that
again combines a genetic effect with an interaction, but does so by using disparate models
in the calculation of the final test statistic. Additionally, the Dai et al. (2012) approach uses
the marginal, rather than the main, effect of the genetic marker by using two logistic models.
In the formulation of the method we will henceforth use in this dissertation, the approach
uses a marginal association model between G and Y using cases and controls to obtain an
estimate of the effect of G on Y, $\hat{\gamma}_G$, and forms the statistic $\hat{\gamma}_G^2 / \widehat{Var}(\hat{\gamma}_G)$. Then, a logistic model
describing the G-E association in cases only is used to obtain an estimate of the case-only
interaction parameter, $\hat{\theta}_{GE}$, and form the statistic $\hat{\theta}_{GE}^2 / \widehat{Var}(\hat{\theta}_{GE})$. The two statistics are then
combined in a sum to form a 2df $\chi^2$ statistic:
$$W_{MA+CO} = \frac{\hat{\gamma}_G^2}{\widehat{Var}(\hat{\gamma}_G)} + \frac{\hat{\theta}_{GE}^2}{\widehat{Var}(\hat{\theta}_{GE})} \sim \chi^2_2.$$
While there are other formulations described by Dai et al. (2012), we use what is described above
in simulations and data application in comparison studies (see Chapter 4).
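The combined statistic above is simple to compute once the two estimates and their variances are available. The sketch below uses a hypothetical function name and made-up inputs; it relies on the fact that the upper tail probability of a chi-square distribution with two degrees of freedom is exp(-w/2), so no statistical library is needed.

```python
import math


def joint_2df_test(gamma_hat, var_gamma, theta_hat, var_theta):
    """Sum of two independent 1-df Wald statistics and its 2-df chi-square p-value."""
    w = gamma_hat**2 / var_gamma + theta_hat**2 / var_theta
    p_value = math.exp(-w / 2)  # chi-square survival function with 2 df
    return w, p_value


# Illustrative inputs only: marginal estimate 0.30 (SE 0.10), case-only estimate 0.45 (SE 0.15).
w_stat, p_val = joint_2df_test(0.30, 0.01, 0.45, 0.0225)
```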
Bayesian 1df Tests
Several approaches have been developed recently to combine a case-control and
case-only analysis using the Bayesian framework. The empirical Bayes method
(Mukherjee & Chatterjee, 2008) and the Bayes model averaging method (Li & Conti, 2009)
both combine the case-only and case-control models using weighted averages, and both do
so by relaxing the independence assumption of G and E in the population. The Empirical
Bayes method combines information from the case-control and case-only tests to estimate
a GxE interaction effect by taking an average of the interaction estimates from each of the
models weighted by sample size and the strength of the interaction observed within the
data. The Bayes model averaging approach instead combines the two estimates using
posterior model probabilities as weights in the weighted average. The underlying aim for
both approaches is to increase power while minimizing type I error to improve upon
implementing a case-control or case-only model separately.
Two-step Methods
Two-step methods are defined by two distinct steps: a screening step in which
all markers in a study are screened by a p-value, and a testing step in which only a subset
of markers, or a prioritized list of the original markers, is tested for a GxE interaction
(Gauderman et al., 2017). Two-step approaches aim to improve efficiency and control type
I error in GxE studies and depend on the independence between the tests in the two steps
of the analysis. Numerous methods have been introduced for two-step analysis
(Gauderman et al., 2010; Gauderman, Zhang, Morrison, & Lewinger, 2013; Hsu et al.,
2012; Kooperberg & Leblanc, 2008; Murcray, Lewinger, Conti, Thomas, & Gauderman,
2011; Zhang, Lewinger, Conti, Morrison, & Gauderman, 2016) for both categorical and
continuous outcomes. In this dissertation, we implement the EDGE approach in our
simulations as it has been shown to be relatively more powerful in many scenarios than
other two-step approaches (Gauderman et al., 2013).
1.4 Admixture Mapping
1.4.1 Background
Genetic admixture results when two or more previously isolated populations, the
parental populations, interbreed. This mixing yields a population of admixed individuals,
which, after a few generations of interbreeding, develop genomes comprised of distinct
chromosomal segments from each of the parental populations. The chromosomes of
admixed individuals then resemble a mosaic (Shriner, 2017) of these segments. In admixed
genomes, each allele at a given chromosomal locus is thus inherited from one of the
parental populations. Examples of recently admixed populations are African Americans
(Parra et al., 1998; Smith et al., 2004) and Latinos (Carvajal-Carmona et al., 2003; Dipierri
et al., 1998). The genomes of Latinos in North and South America are proportioned by
African, European, and Amerindian ancestries (Bryc, Durand, Macpherson, Reich, &
Mountain, 2015). Genomes of African Americans are likewise segmented by European,
African, and Amerindian ancestries, with differences in proportions of European versus
African ancestries in individuals across different regions of the country (Bryc et al., 2015).
Estimation of the contribution from parental ancestries to an individual genome
is done at each genetic locus (local ancestry), and also as the proportion of the individual’s
genome inherited from each contributing parental ancestry (global ancestry). Because local
ancestry cannot be directly observed, given an ancestral population of interest and a
particular locus, the number of alleles originating from this population at this locus is
inferred. Local-ancestry inference (LAI) is typically done for individuals based on
reference panels to which the individual’s haplotypes are matched to a particular ancestry.
The choice of reference panel population is key to accurate ancestry estimation. Reference
populations should be homogeneous populations representing the parental populations that
make up the admixed genomes. Based on matching, at any given locus on an individual’s
genome there can be 0, 1, or 2 alleles from one of the parental ancestries. Initially, LAI was
performed using a subset of markers called ancestry informative markers (AIMs); however,
the low-cost genotyping available currently allows for LAI to be done on all available
markers. We used a random field-based approach for LAI in all analyses in this text by
using RFMix software (Maples, Gravel, Kenny, & Bustamante, 2013) to infer local
ancestry (see Chapter 4). We estimate global ancestry proportions for each individual by
taking the mean local ancestry across their genome.
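As a minimal sketch of the global ancestry calculation, assume local ancestry at each locus is coded as the number of alleles (0, 1, or 2) inherited from the ancestral population of interest. Dividing the mean by 2 to express it as a proportion, and weighting all loci equally, are assumptions made here for illustration; the function name is hypothetical and no RFMix output format is implied.

```python
def global_ancestry(local_allele_counts):
    """Proportion of the genome from one ancestry, given per-locus counts of 0, 1, or 2."""
    if not local_allele_counts:
        raise ValueError("at least one locus is required")
    # Each locus carries two alleles, so the maximum total is 2 * number of loci.
    return sum(local_allele_counts) / (2 * len(local_allele_counts))


# Hypothetical individual with four loci.
proportion = global_ancestry([2, 1, 0, 1])
```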
Admixture mapping is a method of gene-mapping that uses ancestry to identify
genetic loci associated with complex traits, and is particularly useful for traits which have
differential risk by ancestry (Mani, 2017; Rife, 1954; Shriner, 2017; Zhu & Wang, 2017).
The assumption underlying admixture mapping is that causal variants increasing the risk
of a particular disease will occur in excess in segments inherited from the population with
the greater risk of disease (Shriner, 2017). Admixture mapping works by exploiting
admixture linkage disequilibrium, long haplotype blocks produced in the process of
admixture that has taken place in the past few hundred years (Smith & O’Brian, 2005). Because the LD
blocks are longer in admixed populations (Montana & Pritchard, 2004; Parra et al., 1998),
admixture mapping requires fewer markers than the very large numbers necessary to
successfully perform GWAS (Kruglyak, 1999).
1.4.2 Motivation
GWAS have been able to identify numerous associations between genetic markers
and diseases, an overwhelming majority of which worldwide have been performed on
individuals of European ancestry (Park, Cheng, & Haiman, 2018). One reason GWAS have
lacked diversity has been to avoid spurious associations due to differential LD in
inhomogeneous samples (Ewens & Spielman, 1995; Lander & Schork, 1994). However,
because admixed populations are a valuable source of information due to their varying
patterns of LD (Bonilla et al., 2004; Gonzalez Burchard et al., 2005) and because this
feature makes GWAS less desirable, admixture mapping affords an alternative approach
for studying diverse populations.
1.4.3 Admixture Mapping Methods
Admixture mapping is typically performed as a study of departure of mean local
ancestry from the expected global ancestry mean with an outcome of interest. There are
typically two or three parental ancestries from which all alleles are thought to be descended
in a study, and current methods allow for mapping one ancestral population at a time. Since
the objective of admixture mapping is to correlate genetic loci with a trait of interest, there
are several statistical tests for doing so. The simplest of these tests is the case-only study
design, which tests the departure of mean local ancestry from the global mean in affected
individuals only (Montana & Pritchard, 2004; Patterson et al., 2004). The underlying
assumptions of the case-only approach are that including unaffected individuals in the
analysis will contribute only noise and that there is no confounding due to population
stratification. A case-control approach can also be used to perform admixture analysis by
comparing the departure of mean local ancestry from the global ancestry in cases and
controls (Patterson et al., 2004). While the case-only approach is more powerful than the
case-control approach, the case-control is more robust than a case-only approach. Hence,
both methods are typically implemented in an admixture study to maximize power and
mitigate the number of spurious associations. In this dissertation, we investigate the
performance of each of these models and introduce a novel approach that combines the two
(see Chapter 3).
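One simple way to express the case-only idea of testing the departure of local ancestry from global ancestry is a one-sample z-type statistic on the per-individual differences at a locus. This particular form is an assumption made for illustration and is not necessarily the statistic used by the cited methods; the function name and data are hypothetical.

```python
import math


def case_only_admixture_z(local_props, global_props):
    """z-type statistic for the mean difference between local and global ancestry in cases.

    local_props[i]  : local ancestry proportion at the locus for case i (0, 0.5, or 1)
    global_props[i] : global ancestry proportion for case i
    """
    diffs = [loc - glo for loc, glo in zip(local_props, global_props)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    sample_var = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    return mean_diff / math.sqrt(sample_var / n)


# Made-up data for four cases.
z_stat = case_only_admixture_z([1.0, 0.5, 1.0, 0.5], [0.5, 0.5, 0.5, 0.5])
```

A case-control version would compute the same mean difference in controls and compare the two groups, trading some power for robustness.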
Chapter 2 GxE Analysis Using Bayesian Model Averaging
2.1 Abstract
Genome-wide association studies (GWAS) typically search for marginal
associations between a single nucleotide polymorphism (SNP) and a disease trait while
gene-environment (GxE) interactions remain generally unexplored. More powerful
methods beyond the simple case-control approach leverage either marginal effects or case-
control ascertainment to increase power. However, these potential gains depend on
assumptions whose aptness is often unclear a priori. Here, we review GxE methods and
use simulations to highlight performance as a function of main and interaction effects and
the association of the two factors in the source population. Substantial variation in
performance between methods leads to uncertainty as to which approach is most
appropriate for any given analysis. We present a framework that: (1) balances the
robustness of a case-control approach with the power of the case-only approach; (2)
incorporates main SNP effects; (3) allows for incorporation of prior information; and (4)
allows the data to determine the most appropriate model. Our framework is based on Bayes
model averaging, which provides a principled statistical method for incorporating model
uncertainty. We average over inclusion of parameters corresponding to the main and GxE
interaction effects and the G-E association in controls. The resulting method exploits the
joint evidence for main and interaction effects while gaining power from a case-only
equivalent analysis. Through simulations we demonstrate that our approach detects SNPs
within a wide range of scenarios with increased power over current methods. We illustrate
the approach on a gene-environment scan in the USC Children’s Health Study.
2.2 Introduction
Genome-Wide Association Studies (GWAS) have uncovered many trait-related
SNPs to date, but many SNPs are likely yet undiscovered by GWAS due to insufficient
power as a result of small effect sizes, low allele frequencies, or opposing effects in sample
subgroups. Additionally, evidence suggests that marginal genetic effects alone may not
explain all disease susceptibility (Manolio et al., 2009). It is therefore worthwhile
considering GxE interactions when scanning for novel loci associations and for identifying
genotypes with elevated susceptibility to complex diseases based on exposure to an
environmental contributor. A conventional case-control logistic analysis is broadly
acknowledged to suffer from low power to detect GxE interactions. The case-only design
(Piegorsch et al., 1994) is an alternative which provides a substantial increase in power.
However, the case-only design is subject to significant bias under non-independence of G
and E in the source population, resulting in a highly increased Type I error and a large
number of false discoveries. Numerous approaches have been developed to improve power
while mitigating potential increases in Type I error. These include empirical Bayes (EB)
(Mukherjee & Chatterjee, 2008), Bayes model averaging (BMA) (Li & Conti, 2009),
numerous two-step methods (Gauderman et al., 2013; Kooperberg & Leblanc, 2008;
Murcray, Lewinger, & Gauderman, 2009), and two-degree of freedom joint tests of main
and GxE interaction effects (Dai et al., 2012; Kraft et al., 2007; Tchetgen Tchetgen, 2011).
In this paper, we extend the BMA approach proposed by Li and Conti (2009) and propose
a novel Bayes model averaging approach to weight the case-only and case-control
interaction effects within a two-degree of freedom test. We use simulations to show that
this approach improves power in many scenarios while controlling the false discovery rate
– even in the presence of non-independence of G and E in the source population. Our
comparison study uses GxE approaches that are currently widely used, particularly
powerful, similar to our novel approach, or a combination of the three. We used our
proposed Bayes model averaging approach to analyze the role of air pollutants, Hispanicity
and genotype on childhood asthma in the CHS dataset.
2.3 Methods
We first introduce the basic setup and notation and briefly review standard G and
GxE approaches. For simplicity, we consider a total sample size of N with equal numbers
of cases and controls. Y is a binary indicator for disease status with baseline population
disease risk $\Pr(Y = 1) = p_Y$. Categorical exposure status is denoted as E, where E is binary
for simplicity with population prevalence $\Pr(E = 1) = p_E$. Genotype is denoted as G,
where for simplicity we use dominant coding (G = 1 for AA and Aa genotypes and G = 0
for aa genotypes), with $\Pr(G = 1) = q_A$ as the probability of having the AA or Aa
genotype.
Marginal Association Test (MA)
The most widely used method for finding an association between a genetic marker
and a disease outcome in a GWAS is the marginal test of association (MA) carried out in
samples of cases and controls. The MA method is typically comprised of a regression of a
particular phenotype on a genetic variant (G) with a test of the association. Using a case-
control sample with disease outcome Y, the MA test is typically characterized using the
logistic equation:
$$\mathrm{logit}[\Pr(Y = 1 \mid G)] = \beta_{MA_0} + \beta_{MA_G} G \qquad (1)$$
with adjustment variables included when necessary. Here, $\beta_{MA_G}$ denotes the log-odds ratio
of G on the disease outcome, and a Wald, score, or likelihood ratio test is carried out to test
the null hypothesis, $\beta_{MA_G} = 0$, of no genetic association. Within a GWAS, the MA model
is repeated for each of the markers considered and tested using a specified P-value
threshold $\alpha$ which is adjusted to maintain the family-wise error rate (FWER).
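For a binary G and no adjustment variables, the MLE of the slope in Equation 1 is the log of the sample odds ratio and its estimated variance is the sum of the reciprocal cell counts, so the Wald test can be sketched without fitting software. The function name and counts are hypothetical, and the closed form does not apply once covariates are added.

```python
import math


def marginal_wald_test(cases_g1, controls_g1, cases_g0, controls_g0):
    """Wald test of the G-Y log-odds ratio from a 2x2 table (binary G, no covariates)."""
    beta = math.log((cases_g1 * controls_g0) / (controls_g1 * cases_g0))
    var = 1 / cases_g1 + 1 / controls_g1 + 1 / cases_g0 + 1 / controls_g0
    wald = beta**2 / var
    p_value = math.erfc(math.sqrt(wald / 2))  # upper tail of chi-square with 1 df
    return beta, var, wald, p_value


# Made-up counts for illustration.
beta_ma, var_ma, wald_ma, p_ma = marginal_wald_test(110, 80, 90, 120)
```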
Case-Control Test of 𝐆 × 𝐄 Interaction (CC)
Using a sample of cases and controls, a test of GxE interaction with a binary disease
outcome is often performed as a case-control (CC) model characterized by
$$\mathrm{logit}[\Pr(Y = 1 \mid G, E)] = \beta_{cc_0} + \beta_{cc_E} E + \beta_{cc_G} G + \beta_{cc_{G \times E}} EG \qquad (2)$$
Here, $\beta_{cc_E}$ and $\beta_{cc_G}$ represent the main effects of E and G respectively, while $\beta_{cc_{G \times E}}$ is the
log-odds ratio of the interaction of GxE. The null hypothesis of no interaction,
$H_0: \beta_{cc_{G \times E}} = 0$, is tested using a Wald, score or likelihood ratio test. Though
straightforward to implement and widely used, the CC test of interaction is also known to
suffer from low power.
Case-Only Test of 𝐆 × 𝐄 Interaction (CO)
Using no information from controls, a case-only (CO) approach, where the
association between the genetic variant and exposure is tested in affected individuals only,
is often used as an alternative method to boost the power of detecting a GxE interaction
over a CC approach. A CO logistic model is given by
$$\mathrm{logit}[\Pr(G = 1 \mid E, Y = 1)] = \beta_{co_0} + \beta_{co_{G \times E}} E \qquad (3)$$
Here, the term $\beta_{co_{G \times E}}$ parameterizes the GxE interaction log odds ratio on disease status
(Y), and in the presence of a rare disease and independence of G and E in the population,
$\exp(\beta_{co_{G \times E}})$ equals the GxE interaction relative risk (RR) (Piegorsch et al., 1994). The
CO model yields biased estimates of effect and incorrect Type I error under violations of
this independence assumption. Additionally, since covariate adjustment is done in cases
only, comparisons to the analogous adjustment in CC models need to be considered with
care when using this model.
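With binary G and E and no covariates, the interaction estimates from the CC and CO models have closed forms in terms of the 2x2x2 cell counts, which makes the source of the power difference easy to see. The sketch below reuses the cell labels a through h from Figure 1-1 in Chapter 1; the function names and counts are hypothetical.

```python
import math


def cc_interaction(a, b, c, d, e, f, g, h):
    """Case-control interaction log-odds ratio and its variance (all eight cells)."""
    beta = math.log((a * d * f * g) / (b * c * e * h))
    var = sum(1 / x for x in (a, b, c, d, e, f, g, h))
    return beta, var


def co_interaction(a, c, e, g):
    """Case-only interaction log-odds ratio and its variance (the four case cells)."""
    beta = math.log((a * g) / (c * e))
    var = sum(1 / x for x in (a, c, e, g))
    return beta, var


# Made-up counts: a, c, e, g are cases; b, d, f, h are controls.
beta_cc, var_cc = cc_interaction(60, 30, 40, 70, 50, 50, 50, 50)
beta_co, var_co = co_interaction(60, 40, 50, 50)
```

The CO variance omits the four control-cell terms and is therefore always smaller, which is the source of the power gain; the two point estimates agree only when the G-E odds ratio in controls equals 1.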
Weighted GxE Tests
To increase power while also mitigating bias under independence assumption
violations, two methods have been introduced that combine CC and CO models as
weighted averages. Li and Conti (2009) introduced a Bayes Model Averaging (BMA)
approach which combines the GxE effect estimates from the two models via a weighted
average determined by the posterior probabilities of each of the models. Using loglinear
equivalent forms of Equation 2 (Bishop, Fienberg, & Holland, 1975) and Equation 3
(Umbach & Weinberg, 1997), $\beta_{cc_{G \times E}}$ and $\beta_{co_{G \times E}}$ are averaged using the posterior
probabilities of their respective models given the data, D, hence incorporating model
uncertainty within the resulting estimate. An overall interaction effect estimate is obtained
by averaging the expectation of the interaction effect from each model and tested using a
Wald statistic. A similar approach introduced by Mukherjee and Chatterjee (2008) is the
Empirical Bayes approach, which also takes an average of GxE interaction effects from the
CC and CO models. Rather than using posterior model probabilities as weights, the
empirical Bayes approach uses the CC estimate, $\beta_{cc_{G \times E}}$, its variance, and the uncertainty
about the independence assumption between G and E estimated by the G-E association in
controls (Mukherjee & Chatterjee, 2008). Both methods consider the uncertainty of which
model (CC vs. CO) is most appropriate while aiming to balance efficiency and bias when
estimating the GxE interaction effect.
Two-step approaches
In a genome-wide setting, there are a variety of two-step GxE interaction methods
based on an initial ‘screening’ followed by a ‘testing’ step. The screening first step of a
two-step procedure tests a given association and filters results based on a defined threshold.
Markers which are significant during the screening step are then tested in the second step
using a CC GxE interaction test in Equation 2 and appropriate control of the family-wise
error rate, $\alpha_{FWER}$. To guarantee that Type I error is preserved at the nominal level, the test
statistics used at each of the two steps must be independent. Several approaches exist for
two-step methods that alter the first-step test of association (Gauderman et al., 2013;
Kooperberg & Leblanc, 2008; Murcray et al., 2009). Gauderman et al. (2013) introduced
the EDGE procedure, which combines the association between the disease and the genetic
marker (Y-G) with the association between the environmental factor and the genetic marker
(E-G) by summing the two independent test statistics. The test statistic for the Y-G
association is calculated using Equation 1, while the statistic for the E-G association is
calculated using a chi-square test of association between E and G in a combined case-
control sample. Each test statistic has a $\chi^2$ distribution with one degree of freedom, and
since the statistics are independent, their sum for the screening step has a $\chi^2$ distribution
with two degrees of freedom. Step 1 P-values are ranked and the second step correction for
multiple testing can occur in one of two ways. Using the subset testing approach, within
the original group of W SNPs tested in step 1, a subset, w, of SNPs with P-value ≤ $\alpha_1$ is
then included for the second step GxE test using a second threshold, $\alpha_{FWER}/w$, a
Bonferroni correction for multiple testing. Alternatively, rather than a subset of SNPs, all
W SNPs are tested in the second step according to a weighted significance threshold,
whereby SNPs are tested against a threshold that increases in stringency with increasing
screening step P-values (Ionita-Laza, McQueen, Laird, & Lange, 2007). The EDGE
approach is structurally very different from previous classes of models we have discussed
and has been shown to be more powerful than many two-step methods in many of the
scenarios that we use for comparing the GxE approaches (Gauderman et al., 2013). Thus,
we include the EDGE approach as an important comparison to other GxE methods within
our simulation study.
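The subset testing variant of the two-step procedure can be sketched as follows. The screening statistic is the sum of the two 1-df statistics described above, the step 2 P-values are taken as given, and the function names, thresholds, and data are all hypothetical. The weighted hypothesis testing variant is not shown.

```python
import math


def chi2_2df_pvalue(stat):
    """Upper tail probability of a chi-square distribution with 2 df."""
    return math.exp(-stat / 2)


def two_step_subset_testing(screen_stats, step2_pvalues, alpha1, alpha_fwer):
    """Return indices of SNPs that pass screening and the Bonferroni-corrected step 2 test.

    screen_stats[i]  : (Y-G statistic, E-G statistic), each 1-df chi-square, for SNP i
    step2_pvalues[i] : P-value of the CC GxE interaction test for SNP i
    """
    passed = [i for i, (w_yg, w_eg) in enumerate(screen_stats)
              if chi2_2df_pvalue(w_yg + w_eg) <= alpha1]
    if not passed:
        return []
    threshold = alpha_fwer / len(passed)  # Bonferroni over the w SNPs that passed
    return [i for i in passed if step2_pvalues[i] <= threshold]


# Made-up statistics for four SNPs.
screen = [(6.0, 5.0), (0.5, 0.2), (4.0, 4.5), (1.0, 0.3)]
step2_p = [0.01, 0.001, 0.04, 0.2]
hits = two_step_subset_testing(screen, step2_p, alpha1=0.05, alpha_fwer=0.05)
```

In this example the second SNP has the smallest step 2 P-value but is never tested because it fails the screen, which shows how the power of the procedure depends on the screening statistic being informative about the interaction.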
2 Degree of Freedom Tests
Unlike most single-degree-of-freedom test statistics of interaction, multiple-
degree-of-freedom test statistics jointly test multiple parameters. The first of these tests
was introduced by Kraft, et al. (2007) as a joint test of a main genetic effect and a GxE
interaction effect using the CC model from Equation 2. This approach, denoted as DF2,
tests the hypothesis of no main genetic effect and no GxE interaction effect (i.e.,
$H_0: \beta_{cc_G} = \beta_{cc_{G \times E}} = 0$) using a likelihood ratio test with two degrees of freedom. A similar
test, denoted here as CO 2DF (Dai et al., 2012; Tchetgen Tchetgen, 2011), fits the MA
model from Equation 1 and the CO model from Equation 3 and sums up the corresponding
Wald test statistics from each of the models. The resulting test statistic has an (asymptotic)
Chi-square distribution with two degrees of freedom. Because they are both constructed as
omnibus tests, rejection of the null with the DF2 or CO 2DF test indicates that at least one of
the component parameters is not equal to zero, revealing an association between a particular
locus and a disease outcome but without pinpointing the driver of the association (i.e.,
marginal/main vs. interaction vs. both).
A Review of Loglinear Models
Loglinear models are a class of generalized linear models with a Poisson random
component and a log link (Agresti, 2002). They are used for modeling count data assumed
to have a Poisson distribution. Loglinear models are a specific subtype of Poisson
regression model in which all explanatory variables are categorical, and they
are thus often used to describe cell counts in contingency tables. In a loglinear model, the
data are summarized as cell counts of a table rather than as individual classifications of the subjects,
and the cell counts are treated as independent observations of Poisson random
variables. A feature distinguishing a loglinear model from a logistic model is that
response and explanatory variables are treated symmetrically and without distinction.
Although logistic models require a single categorical response, logit models with
categorical explanatory variables have loglinear equivalent models (Agresti, 2002; Bishop
et al., 1975; Fienberg, 1977).
Using this equivalence, Umbach and Weinberg (1997) show that case-control and
case-only regression models can be represented by loglinear models with equivalent GxE
interaction effect terms across the models for a 2x2x2 contingency table. The benefit of
using loglinear models over logit models is that it allows the explicit parameterization of
effects in controls. Given a 2x2x2 table with binary variables E, G, and Y, a case-control
logistic regression approach would model a GxE interaction in the equation
$$\mathrm{logit}(\Pr[Y \mid E, G]) = \lambda_{Y} + \lambda_{E} E + \lambda_{G} G + \lambda_{GxE} GE$$

where $\exp(\lambda_{GxE})$ is the odds ratio of the interaction. The equivalent loglinear model is

$$\log(\mu_{yge}) = \lambda_{0} + \lambda_{0E} E + \lambda_{0G} G + \lambda_{0GE} GE + \lambda_{Y} Y + \lambda_{E} EY + \lambda_{G} GY + \lambda_{GxE} GEY$$

where $\mu_{yge}$ is the expected cell count, $\lambda_{0}$, $\lambda_{0E}$, $\lambda_{0G}$, and $\lambda_{0GE}$ parameterize the joint
distribution of E and G in controls, and $\lambda_{Y}$, $\lambda_{E}$, and $\lambda_{G}$ are the same values in both logit and
loglinear models. Umbach and Weinberg (1997) show that constraining $\lambda_{0GE} = 0$ in the
following loglinear model:

$$\log(\mu_{yge}) = \lambda'_{0} + \lambda'_{0E} E + \lambda'_{0G} G + 0 + \lambda'_{Y} Y + \lambda'_{E} EY + \lambda'_{G} GY + \lambda'_{GxE} GEY$$

enforces the independence of G and E in controls and effectively changes the interpretation
of $\hat{\lambda}'_{GxE}$ from a case-control estimate of interaction to an estimate of interaction effect from
a case-only model.
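The relationship between the two estimates can be illustrated with closed-form odds ratios. The sketch below is not taken from the source: it assumes a saturated model for a 2x2x2 table, in which the case-control interaction estimate is the G-E log odds ratio in cases minus the G-E log odds ratio in controls, while the case-only estimate is the G-E log odds ratio in cases alone. Variances use the usual large-sample sum of reciprocal cell counts. All counts are hypothetical and chosen so that G and E are exactly independent in controls.

```python
import math

def log_or(n00, n01, n10, n11):
    """Log odds ratio and large-sample variance for one 2x2 table.
    n_ge is the count with G = g and E = e."""
    est = math.log((n11 * n00) / (n10 * n01))
    var = 1 / n00 + 1 / n01 + 1 / n10 + 1 / n11
    return est, var

def table_log_or(t):
    return log_or(t[(0, 0)], t[(0, 1)], t[(1, 0)], t[(1, 1)])

# Hypothetical counts keyed by (g, e); not data from the source.
controls = {(0, 0): 240, (0, 1): 160, (1, 0): 60, (1, 1): 40}
cases = {(0, 0): 220, (0, 1): 150, (1, 0): 60, (1, 1): 70}

co_est, co_var = table_log_or(cases)       # case-only interaction estimate
ctl_est, ctl_var = table_log_or(controls)  # G-E association in controls
cc_est = co_est - ctl_est                  # case-control interaction estimate
cc_var = co_var + ctl_var                  # uses all eight cells

print(f"G-E log OR in controls: {ctl_est:.4f}")
print(f"CC estimate: {cc_est:.4f} (variance {cc_var:.4f})")
print(f"CO estimate: {co_est:.4f} (variance {co_var:.4f})")
```

With G-E independence in controls the two point estimates coincide, and the case-only variance is smaller because it omits the four control-cell terms.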
Novel Bayes Model Averaging Two Degree-of-Freedom Test (BMA 2DF)
We propose a Bayes model averaging (Hoeting et al., 1999; Raftery, 1996a; Raftery
et al., 1997) two-degree-of-freedom test (BMA 2DF) that expands the BMA (Li & Conti,
2009) method. Our approach weights between the CC and CO models to test for both GxE
interaction and G main effects using a multivariate Wald test with two degrees of freedom.
The approach is based on analogous loglinear models for CC and CO logistic models
(Umbach & Weinberg, 1997) given respectively by:
CC: $$\log(n_{egy} \mid G, E, Y) = \alpha^{cc}_{0} + \alpha^{cc}_{G} G + \alpha^{cc}_{E} E + \alpha^{cc}_{GE} GE + \beta^{cc}_{0} Y + \beta^{cc}_{G} GY + \beta^{cc}_{E} EY + \beta^{cc}_{G \times E} EGY \quad (4)$$

CO: $$\log(n_{egy} \mid G, E, Y) = \alpha^{co}_{0} + \alpha^{co}_{G} G + \alpha^{co}_{E} E + \beta^{co}_{0} Y + \beta^{co}_{G} GY + \beta^{co}_{E} EY + \beta^{co}_{G \times E} EGY \quad (5)$$

Here, $n_{egy}$ is the expected number of individuals per cell of the 2x2x2 contingency table
of E, G, and Y, where e, g, and y denote the levels of E, G, and Y respectively. The
parameters $\alpha^{cc}_{0}$, $\alpha^{cc}_{G}$, $\alpha^{cc}_{E}$, and $\alpha^{cc}_{GE}$ in Equation 4 parameterize the joint distribution of
G and E in controls, with $\alpha^{cc}_{GE}$ denoting the association between G and E in controls. The
parameters $\beta^{cc}_{0}$, $\beta^{cc}_{G}$, $\beta^{cc}_{E}$, and $\beta^{cc}_{G \times E}$ in Equation 4 maintain the same interpretation they
have in the logistic CC model in Equation 2, notably that $\beta^{cc}_{G \times E}$ captures the CC
interaction effect. Umbach & Weinberg (1997) showed that constraining $\alpha^{cc}_{GE} = 0$ (i.e.,
assuming independence of G and E in controls) produces a GxE interaction estimate, $\beta^{co}_{G \times E}$
in Equation 5, that is approximately equivalent to the CO logistic estimate of $\beta^{co}_{G \times E}$ in
Equation 3, without reliance on controls and with a smaller variance than $\beta^{cc}_{G \times E}$.
We note that $\alpha^{co}_{0}$, $\alpha^{co}_{G}$, and $\alpha^{co}_{E}$ in Equation 5 parameterize the independent
distribution of G and E in controls and that the CO model in Equation 5 still uses
information from both cases and controls, with estimates of $\alpha^{co}_{G}$ and $\alpha^{co}_{E}$ based on marginal
totals (Umbach & Weinberg, 1997). Similarly, main effects for E and G in Equation 5,
$\beta^{co}_{G}$ and $\beta^{co}_{E}$, are distinguished from $\beta^{cc}_{G}$ and $\beta^{cc}_{E}$ in Equation 4 because they are
dependent on controls through marginal totals only and thus yield smaller variances. The
smaller CO estimator variance results in an improvement in power. The BMA 2DF
approach takes a weighted average over the two disparate estimators, $\beta^{cc}_{G}$ and $\beta^{co}_{G}$, of a
main G effect, as well as the estimators, $\beta^{cc}_{G \times E}$ and $\beta^{co}_{G \times E}$, for a GxE interaction from
Equations 4 and 5, respectively. We then test the resulting averaged estimators, $\tilde{\beta}^{BMA}_{G}$ and
$\tilde{\beta}^{BMA}_{G \times E}$, simultaneously using two degrees of freedom.
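Letting $\mathcal{M}_{cc}$ and $\mathcal{M}_{co}$ denote the models represented in Equations 4 and 5 respectively,
the BMA 2DF approach uses prior probabilities of the models, $\Pr(\mathcal{M}_{cc})$ and $\Pr(\mathcal{M}_{co})$, which
are chosen by the investigator and sum to one. Letting $\boldsymbol{\beta} = [\tilde{\beta}^{BMA}_{G}, \tilde{\beta}^{BMA}_{G \times E}]^{T}$ denote
the BMA estimates of main and interaction effects, we define the posterior distribution of
$\tilde{\beta}^{BMA}_{G}$ and $\tilde{\beta}^{BMA}_{G \times E}$ given the observed data as:

$$\Pr(\boldsymbol{\beta} \mid D) = \Pr(\mathcal{M}_{cc} \mid D) \begin{bmatrix} \Pr(\beta^{cc}_{G} \mid D, \mathcal{M}_{cc}) \\ \Pr(\beta^{cc}_{G \times E} \mid D, \mathcal{M}_{cc}) \end{bmatrix} + \Pr(\mathcal{M}_{co} \mid D) \begin{bmatrix} \Pr(\beta^{co}_{G} \mid D, \mathcal{M}_{co}) \\ \Pr(\beta^{co}_{G \times E} \mid D, \mathcal{M}_{co}) \end{bmatrix} \quad (6)$$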
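For simplicity of notation, we let $i \in \{cc, co\}$ and define the posterior model probability for
each of the models given the observed data as $\Pr(\mathcal{M}_{i} \mid D) \propto \Pr(D \mid \mathcal{M}_{i}) \Pr(\mathcal{M}_{i})$. Here
$\Pr(D \mid \mathcal{M}_{i}) = \int \Pr(D \mid \mathcal{M}_{i}, \boldsymbol{\theta}_{i}) \Pr(\boldsymbol{\theta}_{i} \mid \mathcal{M}_{i}) \, d\boldsymbol{\theta}_{i}$ is the integrated likelihood of model $\mathcal{M}_{i}$ over
its parameters $\boldsymbol{\theta}_{i}$ (Hoeting et al., 1999; Viallefont, Raftery, & Richardson, 2001), where
$\Pr(\boldsymbol{\theta}_{i} \mid \mathcal{M}_{i})$ is the joint prior distribution of parameters for model $\mathcal{M}_{i}$. $\Pr(D \mid \mathcal{M}_{i})$ is
estimated using a Laplace approximation as implemented in the GLIB routine of the BMA R
package (Raftery, 1996a; Raftery & Richardson, 1996). $\Pr(\beta^{i}_{G} \mid D, \mathcal{M}_{i})$ and $\Pr(\beta^{i}_{G \times E} \mid D, \mathcal{M}_{i})$ in Equation 6
denote the posterior probability distributions of $\beta^{i}_{G}$ and $\beta^{i}_{G \times E}$ specific to model $\mathcal{M}_{i}$, and
we estimate these model-specific effects using the expectations $\hat{\beta}^{i}_{G} = E[\beta^{i}_{G} \mid D, \mathcal{M}_{i}]$ and
$\hat{\beta}^{i}_{G \times E} = E[\beta^{i}_{G \times E} \mid D, \mathcal{M}_{i}]$ from these distributions respectively. Model-specific
vectors containing main and interaction effect estimates from each of the CC and CO
models are denoted $\hat{\boldsymbol{\beta}}_{cc} = [\hat{\beta}^{cc}_{G}, \hat{\beta}^{cc}_{G \times E}]^{T}$ and $\hat{\boldsymbol{\beta}}_{co} = [\hat{\beta}^{co}_{G}, \hat{\beta}^{co}_{G \times E}]^{T}$ respectively. The
posterior mean and variance of the main and interaction effects are multivariate extensions
of the mean and variance presented by Hoeting et al. (1999) and are given by:

$$E(\boldsymbol{\beta} \mid D) = \Pr(\mathcal{M}_{cc} \mid D) \hat{\boldsymbol{\beta}}_{cc} + \Pr(\mathcal{M}_{co} \mid D) \hat{\boldsymbol{\beta}}_{co} \quad (7)$$

and

$$\mathrm{Var}(\boldsymbol{\beta} \mid D) = \left\{ \Pr(\mathcal{M}_{cc} \mid D) \left( \Gamma_{cc} + \hat{\boldsymbol{\beta}}_{cc} \hat{\boldsymbol{\beta}}_{cc}^{T} \right) + \Pr(\mathcal{M}_{co} \mid D) \left( \Gamma_{co} + \hat{\boldsymbol{\beta}}_{co} \hat{\boldsymbol{\beta}}_{co}^{T} \right) \right\} - E(\boldsymbol{\beta} \mid D) E(\boldsymbol{\beta} \mid D)^{T}, \quad (8)$$

where

$$\Gamma_{i} = \begin{bmatrix} \mathrm{Var}(\hat{\beta}^{i}_{G} \mid D, \mathcal{M}_{i}) & \mathrm{Cov}(\hat{\beta}^{i}_{G}, \hat{\beta}^{i}_{G \times E} \mid D, \mathcal{M}_{i}) \\ \mathrm{Cov}(\hat{\beta}^{i}_{G}, \hat{\beta}^{i}_{G \times E} \mid D, \mathcal{M}_{i}) & \mathrm{Var}(\hat{\beta}^{i}_{G \times E} \mid D, \mathcal{M}_{i}) \end{bmatrix}$$

is the model-specific covariance matrix for $\mathcal{M}_{i}$. Letting $\boldsymbol{\Gamma} = \mathrm{Var}(\boldsymbol{\beta} \mid D)$, we can
calculate the statistic $\mathcal{W} = \boldsymbol{\beta}^{T} \boldsymbol{\Gamma}^{-1} \boldsymbol{\beta} \sim \chi^{2}_{(2)}$ to perform a multivariate Wald test (White,
1982).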
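The averaging in Equations 7 and 8 and the resulting Wald test can be sketched in a few lines. This is an illustrative sketch only, not the author's R implementation: the function names are hypothetical, the log marginal likelihoods are assumed to have been obtained elsewhere (for example from a Laplace approximation), and every numeric input is made up. The P-value uses the fact that the chi-square survival function with two degrees of freedom is exp(-W/2).

```python
import math

def posterior_model_probs(log_ml_cc, log_ml_co, prior_cc=0.5):
    """Pr(M_i | D) proportional to Pr(D | M_i) Pr(M_i), on the log scale."""
    a = log_ml_cc + math.log(prior_cc)
    b = log_ml_co + math.log(1.0 - prior_cc)
    m = max(a, b)  # subtract the maximum for numerical stability
    w_cc = math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))
    return w_cc, 1.0 - w_cc

def outer(v):
    """Outer product of a length-2 vector with itself."""
    return [[v[0] * v[0], v[0] * v[1]], [v[1] * v[0], v[1] * v[1]]]

def bma_2df(w_cc, w_co, b_cc, g_cc, b_co, g_co):
    """Posterior mean (Eq. 7), variance (Eq. 8), Wald statistic, P-value."""
    mean = [w_cc * b_cc[k] + w_co * b_co[k] for k in range(2)]
    o_cc, o_co, o_m = outer(b_cc), outer(b_co), outer(mean)
    var = [[w_cc * (g_cc[r][c] + o_cc[r][c])
            + w_co * (g_co[r][c] + o_co[r][c]) - o_m[r][c]
            for c in range(2)] for r in range(2)]
    det = var[0][0] * var[1][1] - var[0][1] * var[1][0]
    inv = [[var[1][1] / det, -var[0][1] / det],
           [-var[1][0] / det, var[0][0] / det]]
    wald = sum(mean[r] * inv[r][c] * mean[c]
               for r in range(2) for c in range(2))
    p_value = math.exp(-wald / 2.0)  # chi-square survival function, 2 df
    return mean, var, wald, p_value

# Hypothetical inputs: [beta_G, beta_GxE] and 2x2 covariances per model.
w_cc, w_co = posterior_model_probs(-152.3, -150.9, prior_cc=0.5)
mean, var, wald, p = bma_2df(
    w_cc, w_co,
    [0.15, 0.40], [[0.020, -0.010], [-0.010, 0.045]],
    [0.18, 0.35], [[0.012, -0.005], [-0.005, 0.020]],
)
print(w_cc, w_co, mean, wald, p)
```

The term subtracted in Equation 8 means that disagreement between the CC and CO estimates inflates the averaged variance, which is how model uncertainty enters the test.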
Prior distribution specifications of BMA 2DF
Letting $n_{j}$ represent the number of observations in the $j$th cell of the 2x2x2
contingency table for binary values of E, G and Y, we represent the contingency table as:

Frequency   E   G   Y
n_1         0   0   0
n_2         1   0   0
n_3         0   1   0
n_4         1   1   0
n_5         0   0   1
n_6         1   0   1
n_7         0   1   1
n_8         1   1   1
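Denoting the vector of all cell counts as $\mathbf{n} = [n_{1} \ldots n_{8}]^{T}$, we assume a Poisson distribution
for $\mathbf{n}$ given model $\mathcal{M}_{i}$ with Poisson parameter $\boldsymbol{\mu}$,

$$\mathbf{n} \mid \boldsymbol{\mu}, \mathcal{M}_{i} \sim \mathrm{Poisson}(\boldsymbol{\mu}),$$

and use a natural log link to model the Poisson parameter given model $\mathcal{M}_{i}$ and design
matrix $\mathbf{X}_{i}$,

$$\log(\boldsymbol{\mu} \mid \mathcal{M}_{i}) = \mathbf{X}_{i} \boldsymbol{\beta}_{i}$$

$$\boldsymbol{\mu} \mid \mathcal{M}_{i} = e^{\mathbf{X}_{i} \boldsymbol{\beta}_{i}}$$

$$\boldsymbol{\beta}_{i} \mid \mathcal{M}_{i} \sim N(\mathbf{0}, \sigma^{2} \mathbf{V}_{i})$$

Here, $\boldsymbol{\beta}_{i}$ is the vector of effects for model $\mathcal{M}_{i}$ with a Normal prior distribution. The
marginal likelihood of model $\mathcal{M}_{i}$ is

$$\Pr(\mathbf{n} \mid \mathcal{M}_{i}) = \int \Pr(\mathbf{n} \mid \boldsymbol{\theta}_{i}, \mathcal{M}_{i}) \Pr(\boldsymbol{\theta}_{i} \mid \mathcal{M}_{i}) \, d\boldsymbol{\theta}_{i}, \quad \boldsymbol{\theta}_{i} = [\boldsymbol{\beta}_{i}, \sigma^{2}]^{T}$$

where a closed-form solution for $\Pr(\mathbf{n} \mid \mathcal{M}_{i})$ is not analytically attainable due to the lack of
conjugacy between the Gaussian prior and the Poisson likelihood. Hence, we use the GLIB
(Raftery & Richardson, 1996) routine within the BMA R package, which uses a Laplace
approximation to estimate this likelihood. We implement GLIB with prior covariance matrix

$$\mathbf{V} = \sigma^{2} \begin{bmatrix} \phi^{2} \left( \frac{1}{n} \mathbf{X}_{1}^{T} \mathbf{X}_{1} \right)^{-1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \phi^{2} \left( \frac{1}{n} \mathbf{X}_{p}^{T} \mathbf{X}_{p} \right)^{-1} \end{bmatrix}$$

where the hyperparameter $\phi = 1$ is chosen based on suggestions by Raftery (1993).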
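To make the design matrices concrete, the sketch below builds them for the eight cells in the order of the contingency table above (E varying fastest, then G, then Y). The column order follows the terms of Equations 4 and 5; the helper names are hypothetical and the choice of plain nested lists is only for illustration.

```python
# Cells (E, G, Y) in the order n_1 ... n_8 of the contingency table.
cells = [(e, g, y) for y in (0, 1) for g in (0, 1) for e in (0, 1)]

def cc_row(e, g, y):
    """Columns of Equation 4: 1, G, E, GE, Y, GY, EY, EGY."""
    return [1, g, e, g * e, y, g * y, e * y, e * g * y]

def co_row(e, g, y):
    """Columns of Equation 5: the CC columns without the GE term."""
    row = cc_row(e, g, y)
    return row[:3] + row[4:]

X_cc = [cc_row(*cell) for cell in cells]  # 8 cells x 8 parameters
X_co = [co_row(*cell) for cell in cells]  # 8 cells x 7 parameters

for cell, row in zip(cells, X_cc):
    print(cell, row)
```

The CC matrix is square, so that model is saturated for the 2x2x2 table, while the CO matrix has one fewer column because the G-E association in controls is fixed at zero.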
2.4 Simulations
Single-marker and genome-wide simulations were conducted using an underlying
population that was simulated using the following sampling distributions and logistic
regression equations:
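$$E \sim \mathrm{Bernoulli}(p_{E})$$

$$\mathrm{logit}(\Pr(G = 1 \mid E)) = \mathrm{logit}(q_{A}) + \alpha^{cc}_{GE} (E - \bar{E})$$

$$\mathrm{logit}(\Pr(Y = 1 \mid E, G)) = \mathrm{logit}(p_{Y}) + \beta^{cc}_{E} (E - \bar{E}) + \beta^{cc}_{G} (G - \bar{G}) + \beta^{cc}_{G \times E} (E - \bar{E})(G - \bar{G})$$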
From this population, we sampled equal numbers of cases and controls for all simulation
scenarios for both single-marker and genome-wide simulations. When fitting the 1- and
2-degree-of-freedom BMA models, we used the GLIB function in the BMA R package based
on the Laplace approximation to the marginal likelihood (Raftery, 1996a). Prior means for
all model parameters were set to $\mathbf{0} = [0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0]^{T}$. Prior model weights were set
according to prior specified CC:CO odds for the models, with 1:1 odds ⇒ $\Pr(\mathcal{M}_{cc}) = \Pr(\mathcal{M}_{co}) = 0.5$
and 100:1 odds ⇒ $\Pr(\mathcal{M}_{cc}) = 0.990099$ and $\Pr(\mathcal{M}_{co}) = 0.00990099$.
2.4.1 Single-Marker Simulations
We conducted single-marker simulations of a range of scenarios to compare
empirical power. We generated 1,000 replicate datasets with 500 cases and 500 controls
for a disease outcome (Y), a binary environmental exposure (E) with a marginal OR(E) =
1.2, and a binary genotype (G) assuming a dominant model. We used a population disease
prevalence $p_{Y} = 0.01$, population exposure prevalence $p_{E} = 0.4$, and genotype frequency
of $q_{A} = 0.225$. We simulated datasets across a range of log odds ratios [-1.0, +1.0] for
main and interaction effects (i.e., $\beta^{cc}_{G}$, $\beta^{cc}_{G \times E}$), as well as for G-E association, $\alpha^{cc}_{GE}$.
Analyses and corresponding tests of association were performed for each scenario using
α = 0.05 as the significance threshold. For scenarios simulated to have a non-zero GxE
effect, empirical power was calculated testing GxE interaction as the proportion of
replicates for which the given method detected a significant interaction at a given α-level.
Power for the MA model was calculated as the proportion of simulated causal SNPs found
to have a significant marginal effect on Y at threshold α.
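A minimal sketch of one replicate of this sampling scheme follows. It is not the code used for the reported simulations. It assumes that the centering constants in the generating equations equal the population values ($\bar{E} = p_{E}$, $\bar{G} = q_{A}$), that subjects are drawn from the population one at a time until the case and control quotas are filled, and that OR(E) = 1.2 is applied as the coefficient on E; the function names and the OR(GxE) = 1.5 setting are illustrative.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def draw_subject(rng, p_e, q_a, p_y, a_ge, b_e, b_g, b_gxe):
    """Draw (E, G, Y) from the population model in Section 2.4."""
    # Assumption: E-bar and G-bar are taken to be p_E and q_A.
    e_bar, g_bar = p_e, q_a
    e = 1 if rng.random() < p_e else 0
    g = 1 if rng.random() < expit(logit(q_a) + a_ge * (e - e_bar)) else 0
    eta = (logit(p_y) + b_e * (e - e_bar) + b_g * (g - g_bar)
           + b_gxe * (e - e_bar) * (g - g_bar))
    y = 1 if rng.random() < expit(eta) else 0
    return e, g, y

def sample_case_control(n_cases, n_controls, seed=0, **params):
    """Fill case and control quotas; return counts keyed by (e, g, y)."""
    rng = random.Random(seed)
    counts = {}
    got = {0: 0, 1: 0}
    need = {0: n_controls, 1: n_cases}
    while got[0] < need[0] or got[1] < need[1]:
        e, g, y = draw_subject(rng, **params)
        if got[y] < need[y]:  # discard subjects once a quota is full
            got[y] += 1
            counts[(e, g, y)] = counts.get((e, g, y), 0) + 1
    return counts

table = sample_case_control(
    500, 500, seed=1,
    p_e=0.4, q_a=0.225, p_y=0.01,
    a_ge=0.0, b_e=math.log(1.2), b_g=0.0, b_gxe=math.log(1.5),
)
for cell in sorted(table):
    print(cell, table[cell])
```

The resulting eight counts correspond to the cells of the 2x2x2 table that the loglinear CC and CO models are fitted to.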
2.4.2 Genome-Wide Simulations
Genome-wide simulations were done by generating W SNPs, d of which were
designated as the disease-causing (DSL) SNPs and W − d of which were assumed to be
independent of Y with neither main nor GxE interaction effect. The d ‘causal’ SNPs were
simulated based on their specified associations with E and Y as in the single-marker
simulation. Two sets of simulations were performed to assess power, sensitivity and
specificity for discovery. The first set of simulations consisted of 1,000 replicates of N =
10,000 samples with equal numbers of cases and controls. We specified W = 1 million, d
= 1, $p_{E} = 0.4$, $q_{A} = 0.1$, and $p_{Y} = 0.05$. The second set of genome-wide simulations
consisted of 1,000 replicates of N = 3,750 samples with equal numbers of cases and
controls, W = 10,000, d = 20, $p_{E} = 0.4$, $q_{A} = 0.225$ for all d SNPs, and $p_{Y} = 0.01$. We
note that a smaller sample size was applied in the simulations used to create Receiver
Operating Characteristic (ROC) curves in order to reduce sensitivity across all approaches
and yield informative differentiation in results between methods. Such an approach was
used because all methods in our comparison showed very high sensitivity, making it
impossible to show differences between them. Unless otherwise specified, we simulated
independent E and G, and ‘non-causal’ genetic variants with Pr(G = 1) sampled from a
uniform distribution within the range [0.10, 0.40]. We set a null marginal environmental
effect of OR(E) = 1.0 for both sets of simulations except for one instance in which we
calculated power using induced main G effects. For induced main effects, we increased the
effects of E and G with increasing interaction effect. To measure empirical power in
simulations with one designated ‘causal’ marker, we took the proportion of replicates in
which the ‘causal’ marker was identified to be genome-wide significant (P-value ≤
$5 \times 10^{-8}$). To create ROC plots, we repeatedly simulated sets of markers with d = 20
designated causal markers. The resulting P-values in each iteration were ordered from least
to greatest, and the number of ‘causal’ markers ranked within the set of k smallest P-values,
$P_{k}$, was averaged across 1,000 repetitions. We then calculated sensitivity and specificity
of discovery as follows:

$$\text{Sensitivity} = \frac{\text{True Positives in } P_{k}}{k}$$

and

$$\text{Specificity} = 1 - \frac{\text{False Positives in } P_{k}}{\text{All True Negatives}}.$$
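The two quantities can be computed for a single replicate as sketched below, following the definitions exactly as written above (including k as the sensitivity denominator). The function name and the toy P-values are hypothetical, and averaging over the 1,000 repetitions is left out for brevity.

```python
def sens_spec_at_k(p_values, is_causal, k):
    """Sensitivity and specificity for the k smallest P-values,
    using the definitions given in the text."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    top_k = order[:k]
    true_pos = sum(1 for i in top_k if is_causal[i])
    false_pos = k - true_pos
    all_true_neg = sum(1 for c in is_causal if not c)
    sensitivity = true_pos / k
    specificity = 1.0 - false_pos / all_true_neg
    return sensitivity, specificity

# Toy replicate with four markers, two of them designated causal.
p_values = [0.001, 0.5, 0.02, 0.9]
is_causal = [True, False, True, False]
for k in (1, 2, 3):
    print(k, sens_spec_at_k(p_values, is_causal, k))
```

Sweeping k from 1 to W traces out the points of one ROC curve for a given method.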
Figure 2-3 was created using a simulation of 1,000 replicates of a sample with size N =
10,000 made up of 500 cases and 500 controls. We simulated 999,999 independent SNPs,
and one designated ‘causal’ SNP with a non-zero interaction effect. Part (A) depicts a
simulation without the marginal effects of E and G ($\beta^{cc}_{E} = \beta^{cc}_{G} = \log(1.0)$) for the
designated SNP. Part (B) depicts a simulation with constant marginal effects for all values
of $\beta^{cc}_{G \times E}$, $\beta^{cc}_{E} = \beta^{cc}_{G} = \log(1.2)$, and part (C) is based on marginal effects induced
through the increasing interaction effect based on values produced by Quanto
(http://biostats.usc.edu/Quanto.html). ROC curves shown in Figure 2-4 (A-C) were
produced using a simulation of 1,000 replicates of a sample sized N = 10,000 with 500
cases and 500 controls. We simulated 9980 independent SNPs in part (A) and 9480 in parts (B)
and (C), along with 20 designated ‘causal’ SNPs. Parts (A) and (B) depict effect sizes of
$\beta^{cc}_{G \times E} = \log(1.3)$, $\beta^{cc}_{G} = \log(1.2)$, and $\beta^{cc}_{E} = \log(1.0)$, with $\alpha^{cc}_{GE} = \log(1.0)$ and
$\alpha^{cc}_{GE} = \log(1.2)$ respectively for (A) and (B). Part (C) depicts effect sizes of
$\beta^{cc}_{G \times E} = \log(1.0)$, $\beta^{cc}_{G} = \log(1.0)$, $\beta^{cc}_{E} = \log(1.0)$, and $\alpha^{cc}_{GE} = \log(1.0)$.
Two-step approaches
For two-step methods, we utilized the weighted hypothesis approach to test for
interaction with bin size b = 5 and family-wise error rate of $\alpha_{FWER} = 0.05$, as it is
generally more powerful than subset testing (Gauderman et al., 2013). For all one-step
methods we used an $\alpha_{FWER} = 0.05$ with a correction for testing W SNPs, $\alpha_{FWER}/W$. Of the
available two-step approaches, we include only the EDGE approach in our comparison
study since Gauderman et al. (2013) showed that this approach has the best performance across a
variety of scenarios. For performance comparisons among one-step approaches and the
EDGE approach, empirical power was calculated as the proportion of replicates in which
the designated ‘causal’ SNP was found to be significant, while Type I error was calculated
as the proportion of replicates in which at least 1 of the W − d null SNPs was found to be
significant.
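The two kinds of significance threshold can be sketched as follows. The one-step correction is the division stated above. For the weighted hypothesis approach the text gives only the initial bin size, so the sketch assumes the commonly described allocation in which SNPs are ranked by the step-1 screen, the first bin of b SNPs shares half of the family-wise error rate, and each subsequent bin doubles in size while sharing half of the previous bin's level. That allocation, and both function names, are assumptions rather than details confirmed by the source.

```python
def bonferroni_threshold(alpha_fwer, n_tests):
    """Per-SNP threshold for one-step methods: alpha_FWER / W."""
    return alpha_fwer / n_tests

def weighted_bin_thresholds(n_snps, alpha_fwer=0.05, first_bin=5):
    """Per-SNP thresholds for SNPs ordered by a step-1 screen.
    Assumed scheme: bin sizes b, 2b, 4b, ... with bin-level
    alpha/2, alpha/4, alpha/8, ... split evenly within each bin."""
    thresholds = []
    bin_size, bin_alpha = first_bin, alpha_fwer / 2.0
    while len(thresholds) < n_snps:
        take = min(bin_size, n_snps - len(thresholds))
        thresholds.extend([bin_alpha / bin_size] * take)
        bin_size *= 2
        bin_alpha /= 2.0
    return thresholds

print(bonferroni_threshold(0.05, 1_000_000))
print(weighted_bin_thresholds(15))
```

Because the bin-level allocations form a geometric series, the per-SNP levels sum to at most the family-wise error rate regardless of how many SNPs are tested.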
2.6 Results
2.6.1 Single-Marker Simulation Results
Single marker simulations, depicted by heatmaps in Figure 2-1 show empirical
power across a range of simulated marginal G and GxE interaction effects with each row
indicating results for each approach and red indicating higher power. For the MA approach,
power increases as the horizontal distance from 0 increases both to the left and the right,
indicating an increase in power with larger marginal effects along the x-axis. Likewise for
the CC, CO, and BMA approaches there is increasing power with distance in either
direction along the y-axis away from a null GxE effect. Power for CC, CO, and BMA
approaches increases mostly independent of changes in the marginal G effect. The 2-
degree-of-freedom tests in Figure 2-1 show increasing power in both directions: increasing
marginal effect size and GxE interaction effect size, since these approaches are testing both
effects. As a result, the 2-degree-of-freedom approaches show a circular pattern around the
null values of both parameters while the single-degree-of-freedom tests show a rectangular
pattern surrounding null values of GxE interaction log odds. Visually, performance can be
gauged by tightness of either the rectangle or the circle around the null parameter values.
More frequently occurring warm colored areas, such as red and orange regions, indicate
higher power for a larger proportion of the marginal and interaction effect pairwise
combinations. For instance, the CO 2DF approach in Figure 2-1 has a much tighter circle
surrounding the locus of null main and interaction effects, indicating higher power across
more combinations of the two effects than other 2-degree-of-freedom approaches.
However, Figure 2-1 shows the circle in the CO 2DF approach move across the three G-E
association value columns, OR(G-E) = [0.8, 1.0, 1.2], indicating a bias and an increase in
Type I error with violations of the G-E independence assumption. Thus, Figure 2-1 shows
CO 2DF to also be the most susceptible to violations of this assumption. Figure 2-1 also
highlights the sensitivity of all models incorporating a CO design to G-E associations
within the sample. While no method shows pronounced bias under the no G-E association
(center) column in Figure 2-1, inflated Type I error rates for OR(G-E) = 0.8 and
OR(G-E) = 1.2 are shown for the CO, BMA, CO 2DF, and BMA 2DF models. While these
models have compromised robustness under independence violations, inflation of Type I
error as the G-E association increases is largely mitigated for BMA and BMA 2DF by the
models’ inherent averaging process. Figure 2-2 shows empirical power as a function of the
G-E association ($\alpha^{cc}_{GE}$) for the CO 2DF model and the BMA 2DF approach in four
scenarios. Figure 2-2 depicts two cases of empirical power for the CO 2DF model and the
BMA 2DF test under no genetic effect, without GxE interaction (A) and with GxE
interaction (B). While the curves are close for very small values of $\alpha^{cc}_{GE}$, there is a dramatic
reduction in empirical power for the BMA 2DF test with increasing G-E association due to
weighting the average more heavily toward a CC model via model posteriors. In the
absence of genetic and interaction effects, empirical power effectively becomes the Type I
error rate for a 2-DF test of both $\beta_{G}$ and $\beta_{GxE}$. For scenarios in which a marginal effect
exists (OR(G) = 1.2), the comparisons in Figure 2-2.C and 2-2.D show a similar pattern
with a shift upward to account for detection of a main genetic effect. More detailed Type I
error across methods is displayed in Table 2-1 based on simulations of independent SNPs
which do not interact with E in their effect on disease status. Table 2-1 shows Type I error
across 1-step methods as measured by the proportion of SNPs identified to be statistically
significant which do not have any marginal or interaction effects on disease status. The
Type I error rate is inflated for the CO, BMA, and BMA 2DF approaches in the presence of a G-E
association, albeit less inflated for those BMA approaches weighted towards the CC model.
Table 2-1 also shows Type I error (or empirical power) as measured by the proportion of
SNPs identified to be statistically significant which do have a marginal effect on disease
status (OR(G) = 1.2) but which do not interact with E. Methods that include the CO model
demonstrate an inflation with non-zero G-E association; however, the BMA 2DF model has
mitigated inflation compared with the CO and CO 2DF approaches, which can be
interpreted as power to detect the non-zero main G effect.
Table 2-1 Type I error rates across one-step methods in scenarios with and without
G-E association. Error rate is calculated as the proportion of independent markers
identified by a given method as having interaction with E out of all independent
markers simulated. Type I error rate for the marginal association model is calculated
as the proportion of simulated SNPs identified by the marginal model as having a
significant effect on outcome out of all independent SNPs simulated. (Top) Error rates
shown for null effects of both marginal G and GxE interaction; (Bottom) Error rates
shown for marginal G effect OR(G)=1.2 and null GxE interaction effect.
* Power to detect a main G effect. Does not represent inflated type I error accurately for GxE interaction.
† Composite of power to detect a main G effect and type I error for testing GxE inflated by G-E association.
Figure 2-1 Heatmaps depicting power patterns for detection of GxE interaction across a marginal G and GxE interaction
effect range r = [-1.0, +1.0] for one-step methods on 1,000 simulations of 500 cases and 500 controls. Within each heatmap
plot in the grid, the x-axis shows the simulated marginal G effect with the null indicated by a vertical line. The y-axis is
the simulated GxE effect with the null indicated by a horizontal line. The grid columns of Figure 2-1 represent the simulated
G-E association in the population.
Figure 2-2 Empirical Power measured across a range r = [-1.0, +1.0] of G-E
association with and without a GxE interaction and marginal effect for CO 2DF and
BMA 2DF approaches. BMA(100:1) and BMA(1:100) represent an analysis of BMA
2DF with prior weighting based on a 100:1 and 1:100 odds of a CC model being more
appropriate than a CO model respectively. A) OR(GxE)=1.0 & OR(G)=1.0, B)
OR(GxE)=1.5 & OR(G)=1.0, C) OR(GxE)=1.0 & OR(G)=1.2, D) OR(GxE)=1.5 &
OR(G)=1.2.
2.6.2 Genome-Wide Simulation Results
For our genome-wide simulation of 1 million SNPs with one designated causal
SNP, we investigated empirical power under no violations of the G-E independence
assumption. We investigated three scenarios (see Figure 2-3 (A-C)): A) a constant
OR(G) = 1.0; B) a constant OR(G) = 1.2; and C) a main effect that increases due to
induced effects from the increasing OR(GxE). These scenarios are indicated with red lines
in the corresponding sub-graph to orient the figures presented within the heat maps in
Figure 2-1. As expected, genome-wide simulations in Figure 2-3 show the BMA 2DF
approach empirical power to consistently lie between that of the DF2 and CO 2DF models.
When there is no main effect, Figure 2-3.A shows that a CC to CO weighting scheme of
1:100 towards a CO model coincides with the CO 2DF model, while using a 1:1 prior
weighting scheme has power which lies between that of the CO 2DF and DF2 models and
is nearly identical to the EDGE 2-step method. Figure 2-3.C shows empirical power when
the main effect of G is induced by the interaction effect, rather than being held constant. In
this scenario, approaches that incorporate a test of the SNP association are most sensitive
and reflect the increase in the induced SNP effect with increasing power. In our second
genome-wide simulation of 10,000 SNPs, we investigated the trade-off between increases
in sensitivity and false discovery rates. ROC curves based on rankings in Figure 2-4 (A-C)
show that the performance of the BMA 2DF approach again lies between the DF2 and CO
2DF models in the absence of SNPs, independent of the outcome, which are associated
with E. Results in Figure 2-4 indicate that approaches that include a test of main effect gain
sensitivity when there is a small genetic effect and lose sensitivity when there is no genetic
effect. When SNPs that are associated with E but independent of the disease trait are
introduced into the same simulation, the CO model false positive rate (1 − specificity) is notably higher
(Figure 2-4.B). In this scenario, the BMA 2DF test offers improvement in sensitivity over
the DF2 model while also showing an improvement in robustness over CO 2DF under
violations of the independence assumption. Based on the sensitivity and specificity plots
in Figure 2-4, it is evident that the set of top SNPs based on P-value rankings contains a
large number of false-positives identified by the CO 2DF approach in the scenario where
we have null SNPs associated with E. We exclude the two-step EDGE approach in Figure
2-4 in order to portray all SNPs tested rather than a subset resulting from a step 1 screen.
Figure 2-3 Empirical power vs. OR(GxE) with independence between G and E
(plots A-C). Based on genome-wide simulations of 1 million SNPs with 1000
repetitions and one designated causal SNP in each repetition. A) OR(G) = 1.0 &
OR(E) = 1.0; B) OR(G) = 1.2 & OR(E) = 1.2; C) Both OR(G) and OR(E) are
induced by the interaction effect and are not held constant.
Figure 2-4 Receiver operating characteristic (ROC) curves for True and False
positives in simulations of 1000 repetitions of 10,000 SNPs. A) 20 SNPs with non-
zero GxE interaction (causal), no presence of non-causal SNPs associated with E,
presence of marginal effect of causal SNPs. B) 20 SNPs with non-zero GxE
interaction (causal), 500 non-causal SNPs associated with E, presence of marginal
effect of causal SNPs. C) 20 SNPs with non-zero GxE interaction (causal), 500 non-
causal SNPs associated with E, no marginal effect of causal SNPs.
2.7 Discussion
Within a genome-wide interaction scan, the BMA 2DF approach can provide a
robust and powerful tool for identifying genetic loci with small effect sizes on disease
outcomes, while also providing the flexibility of incorporating prior knowledge regarding
the associations of G and E in the population. By producing a test that combines CC and
CO methods and using two degrees of freedom to incorporate a test of main genetic effects
along with interaction effects, the BMA 2DF method can provide increased power over
many existing methods. BMA 2DF results are also more reliable than the most powerful
CO and CO 2DF tests because we have shown that Type I error and bias are minimized by
the BMA 2DF approach compared to these methods in Table 2-1 and Figure 2-2. Single-
SNP and genome-wide simulation results presented have shown that the BMA 2DF
approach is an appropriate method to use in situations where there may be G-E association
present, particularly where there may be numerous genomic regions correlated with the
environmental factor. Genome-wide simulations in Figure 2-4.B have shown the BMA
2DF method is also an appropriate approach in the situation where there are numerous
SNPs that have null interaction and main effects on the outcome but are nevertheless
associated with the environmental factor. In this specific situation, the BMA 2DF method
parses out spurious associations by weighting more heavily to a CC analysis and gains
robustness. In contrast, the CO 2DF method becomes subject to identifying spurious
associations.
In practice, the BMA 2DF model is recommended for identifying genomic regions
that interact with an environmental agent on a disease outcome in the context of a genome-
wide study. By design, the approach is not meant to make inferences on genomic regions
already known or suspected to be associated with an outcome as in candidate gene studies.
Consideration should be given to the possible associations between the environmental
factor E and genetic markers G when implementing the BMA 2DF approach. If association
is suspected in the population studied, measures should be taken to account for likely bias
that may result under violations of the G-E independence assumption. As in our G x
Hispanicity analysis, directly assigning prior model weights to favor a CC model is an
effective way to ensure that bias and spurious associations are kept to a minimum when a
G-E association is suspected prior to analysis. We recommend setting prior model weights
according to 1:1 odds, favoring both models equally, in all scenarios where G-E correlation
is not known or suspected. Consideration should also be given to setting prior effect
hyperparameters, as it is possible to influence the BMA 2DF approach’s inclination toward
either the CC or CO approach by altering the precision around the prior mean of the $\alpha^{cc}_{GE}$
model parameter. Given that the CC model (see Equation 4) is distinguished from the CO
model (see Equation 5) by the assumption of a non-zero G-E association, decreasing the
precision around the G-E association parameter $\alpha^{cc}_{GE}$ allows for greater acceptance of G-E
association in controls while still identifying $\alpha^{cc}_{G}$ as zero and weighting toward a CO
model. Some user-specified hyperparameters (Kass & Raftery, 1995; Raftery, 1996b), can
influence the posterior weight distribution between CC and CO models; however, we
recommend using prior model weights exclusively to inform model posteriors. We have
used values as recommended by Raftery (1993b) to obtain all results presented in this
paper.
Due to the parameterization of the BMA 2DF model as a loglinear model, it is
necessary to implement the approach using categorical variables in order to maintain
equivalence of parameters between the logistic and loglinear models (Umbach &
Weinberg, 1997). We have presented results and simulations using dichotomous
environmental and confounding variables, but variables with three or more categories are
also appropriate. We have used a dominant genetic model by analyzing genotypes as G = 0
or G = 1; however, additive, recessive, or codominant analyses are also
possible using the BMA 2DF approach, though additional levels of G will result in larger
contingency tables (e.g., trinary coding for G creates a 2x2x3 table whereas binary coding
creates a 2x2x2 table) (Umbach & Weinberg, 1997). The parameterization of the BMA
2DF approach as a loglinear model poses additional considerations pertaining to
confounding covariates. Such variables should be categorical and their inclusion must be
carefully designed to retain equivalency of terms in loglinear equations to their
counterparts in models using the logistic link (Agresti, 2002). While continuous variables
can be used for adjustment in the BMA 2DF approach, the resulting estimates may no
longer have the direct interpretation as when categorical variables are used. Equating
models with logistic and loglinear links is beyond the scope of this paper, and we
recommend that investigators maintain parsimonious models whenever possible. We have
presented simulations with a case:control ratio of 1:1; however, we expect both power and
Type I error of the BMA 2DF approach to increase as the number of cases increases and to
decrease as the number of controls increases (Li & Conti, 2009). Software to conduct our novel BMA
2DF approach is available as an R package with details provided in chapter 5. In a
comparison of CPU run times, we found that the loglinear CC model in Equation 4 without
specified prior distributions for parameters required 40% of the CPU run time of the CC
logistic model in Equation 2. With specified priors, the CC loglinear model required 70% of
the run time necessary for a logistic CC model, and the BMA 2DF approach, estimating
two nested loglinear models (See Equations 4 and 5), had the same run time requirements
as the CC logistic approach outlined in Equation 2.
Chapter 3 Admixture Mapping Using Bayesian Model Averaging
3.1 Abstract
Admixture mapping is typically performed in recently admixed populations to
detect genetic risk loci for diseases that have differential risk by ancestry. Admixture
mapping relies on the deviation of local ancestry, a locus inherited from one ancestral
population, from global ancestry, the proportion of the entire genome inherited from the
same ancestral population. The approaches most commonly used are case-only or case-
control models which respectively compare ancestry deviations in cases only, or test the
effect of the deviation between cases and controls on the disease outcome. The case-only
approach has the potential for increased power in comparison to the case-control approach,
but is prone to spurious associations when deviations of ancestry are not solely due to noise.
In this study, we used Bayes model averaging from estimates obtained from case-only and
case-control models to yield a novel statistical test for admixture mapping. The approach
offers more power than a case-control method while remaining robust in scenarios where
the case-only method is most susceptible to false positives. We use simulations to
demonstrate that our approach detects admixture signals with increased power and
robustness over case-control and case-only methods and illustrate the approach in the
African Ancestry Prostate Cancer (AAPC) and the Latino American Prostate Cancer
(LAPC) consortia.
3.2 Introduction
Genome-wide association studies (GWAS) have been successful at identifying
associations between numerous variants and traits (Hindorff et al., 2009). Because GWAS
leverage linkage disequilibrium (LD), they have been performed most commonly on
homogeneous samples with minimum ancestral diversity in order to prevent spurious
associations due to confounding by population stratification or complications due to
differential linkage disequilibrium (Ewens & Spielman, 1995; Lander & Schork, 1994).
Genetic admixture, present in heterogeneous groups such as African Americans and
Latinos, is the result of the interbreeding of two or more previously isolated populations,
and manifests as a mosaic of chromosomal segments inherited from distinct contributing
parental populations. Admixture mapping is the method of associating genetic markers
with traits that have differential risk by ancestry (Mani, 2017; Rife, 1954; Shriner, 2017;
Zhu & Wang, 2017). While several approaches have been suggested to mitigate the effects
of genetic admixture in association studies (Devlin & Roeder, 1999; Pritchard, Stephens,
Rosenberg, & Donnelly, 2000; Setakis, Stirnadel, & Balding, 2006), admixture mapping
aims to exploit genetic admixture by using the difference in allele frequencies between
populations to identify trait-associated loci. The concept behind admixture mapping is that
affected individuals will likely exhibit higher frequencies of trait-associated alleles from
the parental population with the higher prevalence of the trait (McKeigue, 2005; Patterson
et al., 2004). Additionally, because LD blocks are longer in admixed genomes, admixture
mapping requires fewer markers than a GWAS using marginal association tests. Numerous
regions have been identified by admixture mapping in admixed populations in such traits
as breast cancer (Ruiz-Narvaez et al., 2016), lung health (Burkart et al., 2018), obesity and
blood lipids (Basu, Tang, Arnett, et al., 2009; Basu, Tang, Lewis, et al., 2009), immune
system function (Molineros et al., 2013; Pino-Yanes et al., 2015), etc. Most notably,
admixture mapping has been used in prostate cancer (Bensen et al., 2014; Bock et al., 2009;
Freedman et al., 2006; Wilson et al., 2018) to identify the region 8q24, a region known to
be significantly associated with prostate cancer risk (Cropp et al., 2014; Haiman et al.,
2007; Han et al., 2016; Schumacher et al., 2018; Schumacher et al., 2007; Yeager et al.,
2009).
The inherent variation in parental ancestry and differential linkage disequilibrium
between ancestries along the genomes of admixed individuals make a traditional GWAS
on admixed populations vulnerable to decreased power and spurious associations
(Hirschhorn & Daly, 2005; Liu, Lewinger, Gilliland, Gauderman, & Conti, 2013; Mani,
2017; Rosenberg et al., 2010; Zhu & Wang, 2017). Thus, a major limitation of GWAS is
the difficulty of obtaining consistently valid trait associations in mixed samples,
where a marginal test is liable to detect an association with an allele that is more frequent
in the ethnic group with the higher frequency of the trait (Lander & Schork, 1994; Thomas
& Witte, 2002). Studies in admixed and diverse populations are valuable, however, when
performed using appropriate methods. Studies using admixed samples such as African
Americans and Latinos, for example, can be valuable due to their varying patterns of
linkage disequilibrium (Bonilla et al., 2004; Gonzalez Burchard et al., 2005).
Numerous studies have implemented admixture mapping using approaches such as
locus-genome statistics (Cheng et al., 2012), linear mixed models (Burkart et al., 2018;
Sofer et al., 2017), a combination of case-control and case-only statistics (Bock et al., 2009;
Cheng et al., 2009; Giri et al., 2017; Ruiz-Narvaez et al., 2016), and methods combining
admixture and association tests (Pasaniuc et al., 2011; Seldin, Pasaniuc, & Price, 2011;
Szulc, Bogdan, Frommlet, & Tang, 2017). A case-only (CO) approach to admixture
mapping, considering a susceptible parental population, tests whether local ancestry
from this parental population deviates from the genome-wide average in affected
individuals only. A case-control (CC) approach tests whether this deviation of local
ancestry from the genome-wide average differs between affected and unaffected individuals.
While there are seemingly several approaches for performing admixture mapping, most
studies use variations of the CC and CO approaches. Thus, the CC and CO models are not
only the most popular, but are also often the foundation on which other admixture methods
are built. Both statistics nonetheless have limitations
(Shriner, 2017) that compel investigators to use them in tandem. A CO approach is ideal
to use in the scenario where deviation of local ancestry from a genome-wide average is
observed in affected individuals only while controls show only noise without any
systematic deviations. Thus, while the CO approach is more powerful than a CC
approach, the method requires an assumption that controls contribute only noise. When
this assumption is violated, the CO method is vulnerable to false positive associations. A
CC approach is advantageous to use when there is a difference in deviations between cases
and controls, regardless of whether controls deviate systematically or as a result of noise.
A CC approach, while less susceptible to false positives, often lacks adequate power,
especially in relation to the CO approach (Shriner, 2017). Here, we introduce a method of
combining the two approaches within a Bayes model averaging (BMA) framework
(Hoeting et al., 1999; Raftery, 1996a; Raftery et al., 1997) which can eliminate the need to
either choose between a CC and a CO model or implement both independently. We
use simulations to show that combining the two methods of admixture mapping can provide
a robust and powerful analysis, which improves on the power of a CC approach and is more
reliable than a CO approach in situations where there are excesses of ancestry inherited
from the more disease-susceptible parental population in both cases and controls within
the same region. We use the proposed BMA approach to map prostate cancer in African
Americans and Latinos in the African Ancestry Prostate Cancer (AAPC) and Latino
American Prostate Cancer (LAPC) consortia using African and Amerindian ancestries.
3.3 Methods
We assume an admixed sample with DNA segments inherited from $K$ ancestral
populations, $a_1, a_2, \ldots, a_K$. For simplicity, we consider samples of size $N$ with equal
numbers of cases and controls, where $\mathbf{Y}$ is the vector of an observed binary indicator for
disease status on $N$ individuals with a baseline population disease risk $p_Y = \Pr(Y = 1)$.
The estimated average ancestry inherited from ancestry $a_k$ at a particular locus $\ell$ is
referred to as the individual's local ancestry, and is denoted as $L_{i\ell k}$ for the $i$th
individual. Assuming biallelic loci, $L_{i\ell k}$ values range from 0 to 2. $\bar{L}_{\ell k}$ is the mean local
ancestry across all individuals at a locus $\ell$, with $\bar{L}_{\ell k}^{Aff}$ and $\bar{L}_{\ell k}^{Unaff}$ denoting mean local
ancestry across all cases and controls respectively. Global ancestry, the proportion of an
individual's entire genome inherited from population $a_k$, is estimated by taking an average
of local ancestry, $L_{i\ell k}$, across all genomic loci for individual $i$, and is denoted as $Q_{ik}$.
Mean global ancestry is $\bar{Q}_k$ for the entire sample, with $\bar{Q}_k^{Aff}$ and $\bar{Q}_k^{Unaff}$ denoting mean
global ancestry for cases and controls respectively.
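
As an illustration of the notation above, the following minimal sketch computes global ancestry and the case/control means from a matrix of local ancestry values for a single ancestral population $a_k$. The function name `ancestry_summaries` and the toy values are hypothetical and are not taken from the analyses in this work; NumPy is assumed to be available.

```python
import numpy as np

def ancestry_summaries(local_ancestry, y):
    """Summaries for one ancestral population a_k.

    local_ancestry: array of shape (N, n_loci) holding L_{i,l,k}.
    y: length-N binary disease indicator (1 = case, 0 = control).
    """
    L = np.asarray(local_ancestry, dtype=float)
    case = np.asarray(y).astype(bool)
    Q = L.mean(axis=1)  # global ancestry Q_{ik}: mean of local ancestry over loci
    return {
        "Q": Q,
        "L_bar": L.mean(axis=0),               # mean local ancestry per locus
        "L_bar_aff": L[case].mean(axis=0),     # ... in cases
        "L_bar_unaff": L[~case].mean(axis=0),  # ... in controls
        "Q_bar": Q.mean(),
        "Q_bar_aff": Q[case].mean(),
        "Q_bar_unaff": Q[~case].mean(),
    }

# Toy data (made up): 4 individuals, 3 loci, first two individuals are cases.
summary = ancestry_summaries([[2, 1, 0], [1, 1, 1], [0, 0, 0], [2, 2, 2]], [1, 1, 0, 0])
```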
Case-Control (CC)
A case-control approach to admixture mapping compares the difference of mean
estimated local ancestry at each locus from a genome-wide average ancestry in cases versus
controls (Patterson et al., 2004). A deviation of mean local ancestry $\bar{L}_{\ell k}$ from the
genome-wide average, or global ancestry, at a particular locus observed in cases but not in
controls indicates a potential disease locus. We express this deviation for individual $i$ at
locus $\ell$ as $A_{i\ell k} = Q_{ik} - L_{i\ell k}$, and as $\mathbf{A}_{\ell k}$ for the vector of $N$ observed deviations.
The mean deviations are $\bar{A}_{\ell k}$, $\bar{A}_{\ell k}^{Aff}$, and $\bar{A}_{\ell k}^{Unaff}$ for the total sample studied, cases
only, and controls only respectively. A case-control difference in mean local and global
ancestry is calculated using a t-statistic (Patterson et al., 2004). Using ordinary least
squares (OLS) regression, the case-control test of mean deviation between local and global
ancestry for $a_k$ is given by

CC: $\mathbf{A}_{\ell k} = \alpha + \beta_{cc}\mathbf{Y} + \epsilon$ (1)

Here, $\hat{\alpha}$ estimates the mean ancestry difference in controls, $\bar{A}_{\ell k}^{Unaff}$, and $\hat{\beta}_{cc}$ estimates
the case-control difference in $\bar{A}_{\ell k}$, or the effect of being a case versus a control on $\bar{A}_{\ell k}$.
Thus, the CC approach requires the estimation of two means (in cases and controls) with two
corresponding sources of error. The CC then tests the null hypothesis of no case-control
difference in mean deviation, $H_0: \beta_{cc} = 0$.
Case-Only (CO)
A case-only admixture mapping approach compares the estimated mean local
ancestry in affected individuals only at each locus with the same individuals’ overall global
ancestry (Montana & Pritchard, 2004; Patterson et al., 2004) and often tests this difference
using a t-statistic. Assumptions of the CO approach are 1) that no population stratification
is present and 2) that controls contribute only noise to the ancestral allele frequencies
(Shriner, 2017). In scenarios where local ancestry in both cases and controls deviates from
the respective global mean, this latter assumption is violated. In such scenarios, the CO
approach produces invalid results since the deviation is not indicative of a disease-causing
locus. If there is a systematic deviation in controls only, this assumption is again violated;
however, in this scenario CO results remain valid but power is lost. We represent
this latter assumption as $E(\bar{A}_{\ell k}^{Unaff}) = E(\bar{Q}_{k}^{Unaff} - \bar{L}_{\ell k}^{Unaff}) = 0$, since the
assumption that controls contribute noise implies that the average difference between local
and global ancestry in controls should be zero. Thus, we can constrain $\alpha = 0$ in Equation
1 since $\alpha$ estimates $\bar{A}_{\ell k}^{Unaff}$. The constraint on $\alpha$ leads to a novel representation of the CO
approach as a regression equation without an intercept:

CO: $\mathbf{A}_{\ell k} = \beta_{co}\mathbf{Y} + \epsilon$ (2)

The absence of an intercept indicates that $\hat{\beta}_{co}$ directly estimates the mean deviation in
cases, $\bar{A}_{\ell k}^{Aff}$. Thus, a test of $H_0: \beta_{co} = 0$ is a test of excess ancestry in cases only using
the same t-statistic described by Patterson et al. (2004) and Montana and Pritchard (2004).
The purpose of defining a CO model as in Equation 2 is that the CC and CO models
become a nested set with explicit estimators that can be combined in a Bayesian model
averaging framework.
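
The nested relationship between Equations 1 and 2 can be illustrated with a small least-squares sketch. The helper `ols`, the noise standard deviation (0.03), and the random seed are illustrative assumptions rather than settings from this work; NumPy is assumed to be available.

```python
import numpy as np

def ols(X, a):
    """Ordinary least squares: returns coefficients, standard errors, t statistics."""
    X = np.asarray(X, dtype=float)
    a = np.asarray(a, dtype=float)
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ a
    resid = a - X @ coef
    sigma2 = resid @ resid / (n - p)          # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return coef, se, coef / se

# Simulated deviations A for 250 cases and 250 controls (hypothetical noise SD).
rng = np.random.default_rng(1)
y = np.r_[np.ones(250), np.zeros(250)]
a = np.r_[rng.normal(0.005, 0.03, 250), rng.normal(0.0, 0.03, 250)]

# Equation 1 (CC): intercept plus case indicator.
(alpha_hat, beta_cc), se_cc, t_cc = ols(np.column_stack([np.ones_like(y), y]), a)

# Equation 2 (CO): same regression with the intercept constrained to zero.
(beta_co,), se_co, t_co = ols(y[:, None], a)
```

In this sketch `alpha_hat` equals the control mean, `beta_cc` equals the case mean minus the control mean, and `beta_co` equals the case mean, which is the sense in which the CO model is the CC model with $\alpha = 0$.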
Novel Bayes Model Averaging Admixture (BMA)
Using the nested set provided by Equations 1 and 2, we propose a Bayes model
averaging (Hoeting et al., 1999; Raftery et al., 1997; Raftery, 1993b) approach to admixture
mapping that combines the CC and CO approaches. We let $\mathcal{M}_{cc}$ and $\mathcal{M}_{co}$ represent the CC
and CO models and define a composite BMA estimate, $\tilde{\beta}_{BMA}$, with the following posterior
distribution given observed data $D$:

$\Pr(\beta_{BMA} \mid D) = \Pr(\beta_{cc} \mid \mathcal{M}_{cc}, D)\Pr(\mathcal{M}_{cc} \mid D) + \Pr(\beta_{co} \mid \mathcal{M}_{co}, D)\Pr(\mathcal{M}_{co} \mid D)$

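Defining $j \in \{cc, co\}$, $\Pr(\beta_j \mid \mathcal{M}_j, D)$ is the $\mathcal{M}_j$ model-specific posterior distribution of $\beta_j$
and $\Pr(\mathcal{M}_j \mid D)$ refers to the posterior probability of model $\mathcal{M}_j$ given $D$. $\Pr(\mathcal{M}_j \mid D)$
satisfies the proportionality

$\Pr(\mathcal{M}_j \mid D) \propto \Pr(D \mid \mathcal{M}_j)\Pr(\mathcal{M}_j)$

(Raftery et al., 1997), where $\Pr(D \mid \mathcal{M}_j)$ is the marginal likelihood of $\mathcal{M}_j$ given by

$\Pr(D \mid \mathcal{M}_j) = \int \Pr(D \mid \boldsymbol{\theta}_j, \mathcal{M}_j)\Pr(\boldsymbol{\theta}_j \mid \mathcal{M}_j)\,d\boldsymbol{\theta}_j$

Here, $\boldsymbol{\theta}_j$ is the vector of parameters in model $\mathcal{M}_j$, and $\Pr(\mathcal{M}_j)$ is the prior probability of
$\mathcal{M}_j$ such that $\Pr(\mathcal{M}_{cc}) + \Pr(\mathcal{M}_{co}) = 1$. For both $\mathcal{M}_{cc}$ and $\mathcal{M}_{co}$ we make the following
assumptions about the distributions of errors and effect estimates:

$\epsilon_j \sim N(0, \sigma^2)$
$\beta_j \mid \sigma^2, \mathcal{M}_j \sim N(\mu_j, \sigma^2 V_j)$
$\sigma^2 \sim \text{Inv-Gamma}\left(\frac{\nu}{2}, \frac{\nu\lambda}{2}\right)$,
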
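where $\mu_j$ and $V_j$ are the model-specific prior mean and variance of $\beta_j$, and $\nu$ and $\lambda$ are
hyperparameters to be chosen. Based on the distributional assumptions above, the marginal
likelihood of $\mathcal{M}_j$ can be directly calculated as

$\Pr(D \mid \mathcal{M}_j) = \frac{\Gamma\left(\frac{\nu + N}{2}\right)(\nu\lambda)^{\nu/2}}{\pi^{N/2}\,\Gamma\left(\frac{\nu}{2}\right)\left|\mathbf{I} + \mathbf{X}_j V_j \mathbf{X}_j^T\right|^{1/2}} \times \left\{\lambda\nu + (\mathbf{A}_{\ell k} - \mathbf{X}_j\mu_j)^T(\mathbf{I} + \mathbf{X}_j V_j \mathbf{X}_j^T)^{-1}(\mathbf{A}_{\ell k} - \mathbf{X}_j\mu_j)\right\}^{-\frac{\nu + N}{2}}$
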
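where $\mathbf{X}_j$ is the design matrix for $\mathcal{M}_j$ (Raiffa & Schlaifer, 1961). We can then calculate the
posterior mean and variance of $\tilde{\beta}_{BMA}$ as

$E(\beta_{BMA} \mid D) = \hat{\beta}_{cc}\Pr(\mathcal{M}_{cc} \mid D) + \hat{\beta}_{co}\Pr(\mathcal{M}_{co} \mid D)$

and

$\mathrm{Var}(\beta_{BMA} \mid D) = \Pr(\mathcal{M}_{cc} \mid D)\{\mathrm{Var}(\hat{\beta}_{cc} \mid D, \mathcal{M}_{cc}) + \hat{\beta}_{cc}^2\} + \Pr(\mathcal{M}_{co} \mid D)\{\mathrm{Var}(\hat{\beta}_{co} \mid D, \mathcal{M}_{co}) + \hat{\beta}_{co}^2\} - [E(\beta_{BMA} \mid D)]^2$

where $\hat{\beta}_j$ is the $\mathcal{M}_j$ model-specific effect estimated by $\hat{\beta}_j = E(\beta_j \mid D, \mathcal{M}_j)$ (Draper, 1995;
Raftery, 1993a). The BMA approach then tests the null hypothesis $H_0: \tilde{\beta}_{BMA} = 0$ using the
corresponding Wald statistic, $\mathcal{W} = \frac{E(\beta_{BMA} \mid D)}{\sqrt{\mathrm{Var}(\tilde{\beta}_{BMA} \mid D)}}$.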
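
The quantities above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names `fit_model` and `bma` are hypothetical, the model-specific posterior mean and variance are taken from the standard conjugate normal / inverse-gamma results (the text gives only the marginal likelihood explicitly), the simulated data and noise SD are made up, and the two-sided normal reference distribution for the Wald statistic is an assumption.

```python
import math
import numpy as np

def fit_model(X, a, mu, V, nu, lam):
    """One model M_j: log marginal likelihood (formula in the text) plus the
    conjugate posterior mean and covariance of the regression coefficients."""
    X = np.asarray(X, dtype=float)
    a = np.asarray(a, dtype=float)
    n = a.shape[0]
    S = np.eye(n) + X @ V @ X.T                  # I + X V X^T
    r = a - X @ mu
    quad = float(r @ np.linalg.solve(S, r))      # (A - X mu)^T S^{-1} (A - X mu)
    _, logdet = np.linalg.slogdet(S)
    log_ml = (math.lgamma((nu + n) / 2) + (nu / 2) * math.log(nu * lam)
              - (n / 2) * math.log(math.pi) - math.lgamma(nu / 2)
              - 0.5 * logdet - ((nu + n) / 2) * math.log(lam * nu + quad))
    V_inv = np.linalg.inv(V)
    V_post = np.linalg.inv(V_inv + X.T @ X)
    mean_post = V_post @ (V_inv @ mu + X.T @ a)
    # Coefficients have a multivariate-t posterior; this is its covariance.
    cov_post = (lam * nu + quad) / (nu + n - 2) * V_post
    return log_ml, mean_post, cov_post

def bma(a, y, prior_co, v_alpha, v_beta, nu=4.0, lam=0.5):
    """Average the CC (Equation 1) and CO (Equation 2) effect estimates."""
    X_cc = np.column_stack([np.ones_like(y), y])
    X_co = y[:, None]
    ml_cc, m_cc, c_cc = fit_model(X_cc, a, np.zeros(2), np.diag([v_alpha, v_beta]), nu, lam)
    ml_co, m_co, c_co = fit_model(X_co, a, np.zeros(1), np.array([[v_beta]]), nu, lam)
    # Posterior model probabilities, normalised on the log scale for stability.
    log_w = np.array([ml_cc + math.log(1 - prior_co), ml_co + math.log(prior_co)])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    est = np.array([m_cc[1], m_co[0]])           # beta_cc, beta_co
    var = np.array([c_cc[1, 1], c_co[0, 0]])
    mean = float(w @ est)
    variance = float(w @ (var + est ** 2) - mean ** 2)
    wald = mean / math.sqrt(variance)
    return {"p_cc": float(w[0]), "p_co": float(w[1]), "beta_cc": est[0], "beta_co": est[1],
            "mean": mean, "var": variance, "wald": wald,
            "p_value": math.erfc(abs(wald) / math.sqrt(2))}  # assumed normal reference

# Hypothetical data: 250 cases with mean deviation 0.005 and 250 controls with mean 0.
rng = np.random.default_rng(0)
y = np.r_[np.ones(250), np.zeros(250)]
a = np.r_[rng.normal(0.005, 0.03, 250), rng.normal(0.0, 0.03, 250)]
s = a[y == 0].std(ddof=1) / math.sqrt(250)       # sample standard error in controls
result = bma(a, y, prior_co=0.5, v_alpha=s, v_beta=s)
```

The log-scale normalisation is a design choice for numerical stability; the marginal likelihoods themselves can underflow for samples of this size.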
Simulation Study
Our simulation study was based on seven scenarios (A-G) of global-local ancestry
deviation for cases and controls. $\bar{A}_{\ell k}^{Aff}$ was held constant while $\bar{A}_{\ell k}^{Unaff}$ was increased in
increments. Holding global ancestry fixed at $\bar{Q}_k = 0$ and $\bar{A}_{\ell k}^{Aff} = 0.005$, we simulated
the values $\bar{A}_{\ell k}^{Unaff} = \{-0.00125, 0.0, 0.00125, 0.00250, 0.00375, 0.00500, 0.00625\}$.
Since $\bar{A}_{\ell k}^{Aff}$ was held constant, the values of $\bar{A}_{\ell k}^{Unaff}$ are based on simulating a
difference in local ancestry between cases and controls. As an example, $\bar{A}_{\ell k}^{Unaff} = 0.0025$
when $\bar{A}_{\ell k}^{Aff} = 0.005$ implies a difference between cases and controls of 0.0025,
meaning that cases deviate from their global mean by twice as much as controls. For
simplicity, the prior distribution of $\sigma^2$ was specified such that $E(\sigma^2) = 1$ so that the prior
variance, $\mathrm{Var}(\beta_j \mid \sigma^2, \mathcal{M}_j) = \sigma^2 V_j$, is specified directly through diagonal entries in $V_j$.
Hence, we set prior hyperparameters $\nu = 4$ and $\lambda = \frac{1}{2}$ so that $\sigma^2 \sim \text{Inv-Gamma}(2, 1)$.
Prior means of both $\beta_{co}$ and $\beta_{cc}$ were centered at zero. The prior variance of $\alpha$, $\beta_{cc}$, and $\beta_{co}$
is equivalent to the sample standard error of $A_{\ell k}^{Unaff}$, which is calculated prior to
implementing the BMA model. 2,000 replicates of 250 cases and 250 controls were
simulated with average global ancestry set to zero for both cases and controls, $\bar{Q}_k^{Aff} = 0$
and $\bar{Q}_k^{Unaff} = 0$, for simplicity. Prior model weights were chosen based on CO:CC odds
of 100:1, 10:1, 1:1, 1:10, and 1:100, yielding the following prior model probabilities:

$\Pr(\mathcal{M}_{co}) = [0.9901, 0.9091, 0.5000, 0.0909, 0.0099]$

and

$\Pr(\mathcal{M}_{cc}) = [0.0099, 0.0909, 0.5000, 0.9091, 0.9901]$

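
The conversion from prior odds to prior model probabilities is simple arithmetic; a short sketch is given here for concreteness. The function name `prior_from_odds` is a hypothetical helper.

```python
def prior_from_odds(co, cc):
    """Convert CO:CC prior odds into (Pr(M_co), Pr(M_cc))."""
    total = co + cc
    return co / total, cc / total

# The five CO:CC prior odds listed in the text.
for co, cc in [(100, 1), (10, 1), (1, 1), (1, 10), (1, 100)]:
    p_co, p_cc = prior_from_odds(co, cc)
    print(f"{co}:{cc}  Pr(M_co) = {p_co:.4f}  Pr(M_cc) = {p_cc:.4f}")
```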
Empirical power was calculated as the proportion of the 2,000 replicates in which the
resulting P-value for the BMA Wald statistic, 𝒲 , was less than or equal to 0.05. For
comparison, CC and CO approaches were performed and tested on the same simulated
samples using their nested regression formulations with threshold 0.05. We performed a
sensitivity analysis to ascertain the optimal choices for prior variance of model effects, $\beta_j$.
In a simulation where $\bar{A}_{\ell k}^{Aff} = 0.005$ is held constant across the following levels of
$\bar{A}_{\ell k}^{Unaff} = [-0.005, 0.0000, 0.005, 0.008, 0.009]$, we set $\mathrm{Var}(\alpha) = c\,SE(A_{\ell k}^{Unaff})$ and
$\mathrm{Var}(\beta_j) = c\,SE(A_{\ell k}^{Unaff})$, where $c$ is a constant with values $c = \{0.1, 0.5, 2, 100\}$. The
choice of priors in our simulation study and subsequent analyses of real data (Chapter 4)
was determined by the results of this sensitivity analysis.
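
The empirical power calculation for the CC and CO comparison tests can be sketched as below. The noise standard deviation (`sd=0.03`), the seed, and the use of a normal critical value in place of the t distribution are assumptions made for illustration; the source does not state the simulated noise level, so the resulting power values are not expected to reproduce Table 3-1. The BMA test is omitted here for brevity.

```python
from statistics import NormalDist
import numpy as np

def empirical_power(mean_unaff, mean_aff=0.005, sd=0.03, n=250, reps=2000,
                    level=0.05, seed=0):
    """Proportion of replicates in which the CC and CO tests reject at `level`."""
    rng = np.random.default_rng(seed)
    aff = rng.normal(mean_aff, sd, size=(reps, n))      # deviations A in cases
    unaff = rng.normal(mean_unaff, sd, size=(reps, n))  # deviations A in controls
    m1, m0 = aff.mean(axis=1), unaff.mean(axis=1)
    ss1 = ((aff - m1[:, None]) ** 2).sum(axis=1)
    ss0 = ((unaff - m0[:, None]) ** 2).sum(axis=1)
    # CC (Equation 1): slope = case mean minus control mean, pooled residual variance.
    t_cc = (m1 - m0) / np.sqrt((ss1 + ss0) / (2 * n - 2) * 2 / n)
    # CO (Equation 2): no intercept, so control residuals are taken around zero.
    rss_co = ss1 + (unaff ** 2).sum(axis=1)
    t_co = m1 / np.sqrt(rss_co / (2 * n - 1) / n)
    z = NormalDist().inv_cdf(1 - level / 2)             # normal approximation
    return {"CC": float(np.mean(np.abs(t_cc) > z)),
            "CO": float(np.mean(np.abs(t_co) > z))}

# Scenarios A-G: mean deviation in controls, with cases fixed at 0.005.
scenarios = dict(zip("ABCDEFG",
                     [-0.00125, 0.0, 0.00125, 0.0025, 0.00375, 0.005, 0.00625]))
power = {name: empirical_power(m) for name, m in scenarios.items()}
```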
3.4 Results
Simulations of empirical power as a function of mean ancestry deviation $\bar{A}_{\ell k}^{Unaff}$
in controls (see Figure 3-1) show the BMA approach to have power consistently between
those of the CO and CC approaches for all prior odds. Prior odds favoring a CO model,
10:1, result in empirical power closer to that of the CO approach, while prior odds favoring
a CC model, 1:10, result in power which coincides more closely with CC power. BMA
effect estimates from models with prior odds favoring either the CO or CC
models, $\tilde{\beta}_{BMA_{10:1}}$ and $\tilde{\beta}_{BMA_{1:10}}$, are shown in panel 3 of Figure 3-1 for all seven scenarios.
$\tilde{\beta}_{BMA_{10:1}}$ and $\tilde{\beta}_{BMA_{1:10}}$ have values which closely mimic those of $\hat{\beta}_{co}$ and $\hat{\beta}_{cc}$ respectively,
while $\tilde{\beta}_{BMA_{1:1}}$ lies more nearly midway between $\hat{\beta}_{co}$ and $\hat{\beta}_{cc}$. As panel 1 in Figure 3-1
shows the simulated mean local ancestry in controls increasing toward the constant value
in cases, panel 2 shows power for the CO model to be invariant to this increase since the
CO model does not test controls. Empirical power in the CC model, however, gradually
decreases as the distance between mean local ancestry in cases and controls shrinks with
the rise in mean local ancestry in controls. Scenario A in Figure 3-1 demonstrates that the
CC model will have the greatest power when the mean deviations of local ancestry from
global ancestry in cases and controls are in opposing directions. In Scenario B, where mean
local ancestry in controls does not differ from global ancestry, $\bar{A}_{\ell k}^{Unaff} = 0$ while
$\bar{A}_{\ell k}^{Aff} = 0.005$, both the CC and CO models attain effect estimates approximately equivalent
to 0.005. However, power for the CO model is greater (83%) than power for the CC
approach (52%) (see Table 3-1). This is due to a difference in standard error for the two
effect estimates, $SE(\hat{\beta}_{co}) < SE(\hat{\beta}_{cc})$ (see Table 3-1), resulting from the exclusion of noise
from controls in the CO approach. In scenario F, where mean ancestry deviations are
equivalent between cases and controls, power for the CC approach is 0.05. While the CO
approach tests only a difference between mean ancestry deviation in cases and zero, the
CC approach tests a difference between mean ancestry deviations in cases versus controls;
thus, power for the CC approach in scenario F should be interpreted as type I error. Figure 3-1
and Table 3-1 show the BMA approach with prior odds 1:1 and 1:10 to be sensitive to
the narrowing distance in mean ancestry deviations between cases and controls, and to have
decreased power similar to the CC approach. Figure 3-2 shows increasing posterior
probability for the CC method with increasing prior odds toward the CC model for both
scenarios B and F. While Figure 3-2 shows similar posterior probability patterns for both
scenarios, the probability of the CC model is higher for all prior odds models in scenario
F, where the distance between a CO and CC mean ancestry deviation is zero. This is echoed
in Figure 3-1, which depicts the BMA (1:1) model decreasing in power and effect estimate,
and attaining power (11%) closer to that of the CC model. Results of the sensitivity analysis
shown in Figure 3-3 demonstrate that posterior model weights remain largely invariant to
variation in the prior variance of the CC and CO parameters, $\beta_j$. Conversely, posterior
weights appear to be influenced significantly by the choice of prior variance for $\alpha$. Larger
values of $\mathrm{Var}(\alpha)$ result in larger posterior weights for the CO approach, and smaller
variance leads to weighting toward a CC approach.

Table 3-1 Effect estimates, standard error of effects, and power for scenarios
B ($\bar{A}_{\ell k}^{Aff} = 0.005$, $\bar{A}_{\ell k}^{Unaff} = 0.0$) and F ($\bar{A}_{\ell k}^{Aff} = 0.005$, $\bar{A}_{\ell k}^{Unaff} = 0.005$),
for prior model CO:CC odds of 10:1, 1:1, 1:10.
Figure 3-1 Power and effect estimates vs. simulation scenarios A-G. Scenarios are defined
by the average simulated value of ancestry deviations in controls only, $\bar{A}_{\ell k}^{Unaff}$, for some
locus $\ell$ while average deviation in cases is held constant at $\bar{A}_{\ell k}^{Aff} = 0.005$. Scenarios A-G
are associated with the following list of $\bar{A}_{\ell k}^{Unaff}$ values respectively: −0.00125,
0.0, 0.00125, 0.0025, 0.00375, 0.005, 0.00625. Power and effect estimates shown for
models CO, CC, and BMA with prior CO:CC odds 10:1, 1:1, 1:10.
Figure 3-3 Sensitivity analysis of $\mathrm{Var}(\beta_j)$ using $c = \{0.1, 0.5, 2.0, 100\}$
multiples of $SE(A_{\ell k}^{Unaff})$. $SE(A_{\ell k}^{Unaff})$ is abbreviated by $s^*$.
[Figure panels: rows $SE(\alpha) = 100s^*$, $2s^*$, $0.5s^*$, $0.1s^*$; columns $SE(\beta_j) = 0.1s^*$, $0.5s^*$, $1s^*$, $2s^*$; y-axis: Power; x-axis: $\bar{A}_{\ell k}^{Unaff}$]
Figure 3-2 Posterior probability of a case-control model for scenarios B ($\bar{A}_{\ell k}^{Aff} = 0.005$,
$\bar{A}_{\ell k}^{Unaff} = 0.0$) and F ($\bar{A}_{\ell k}^{Aff} = 0.005$, $\bar{A}_{\ell k}^{Unaff} = 0.005$), for prior model
CO:CC odds of 100:1, 10:1, 1:1, 1:10, 1:100.
[Figure y-axis: Posterior Probability of a Case-Control Model]
3.5 Discussion
When CC and CO estimates are averaged, the resulting BMA estimate $\tilde{\beta}_{BMA}$ gives
rise to a test more powerful than a CC approach and more robust than a CO approach (see
Figure 3-2). While power is increased for the BMA over the CC approach for almost all
meaningful scenarios (A-E), scenario F redefines power as Type I error, since it is the
objective of the analysis to detect excesses in cases, and not in both cases and controls.
Scenario F presents a situation in which both cases and controls have the same amount of
mean local ancestry deviation from their respective global averages. Since the departure of
local ancestry from global ancestry is systematic in controls (i.e., patterned) rather than due
to random noise, scenario F is an example of a violation of the fundamental assumption
made by the CO approach that controls contribute only noise to an analysis. In such a
scenario, the CO approach detects a false positive association between the region and the
trait as it is insensitive to variation in controls. However, while the Type I error resulting from
the BMA approach is still elevated beyond a nominal 5%, it is a modest 11% compared to
83% resulting from the CO approach (see Table 3-1).
Our sensitivity analysis was conducted (see Figure 3-3) to identify the optimal choice of
variance within the prior distributions of the model-specific regression parameters,
$\beta_j \mid \sigma^2, \mathcal{M}_j \sim N(\mu_j, \sigma^2 V_j)$. The results lead us to suggest that an optimal prior variance
should be equivalent to one standard deviation of the noise contributed by controls.
Specifically, we recommend obtaining $SE(A_{\ell k}^{Unaff})$ and using this measure as the prior
variance for $\beta_{co}$, $\alpha$, and $\beta_{cc}$. While posterior model probabilities, $\Pr(\mathcal{M}_j \mid D)$, are largely
invariant to the choice of priors for $\beta_{cc}$, the choice of priors for $\beta_{co}$ and $\alpha$ exerts some
influence. The prior variance on α from the CC model is particularly influential as it sets
the range of variation to be tolerated as noise contributed by controls. If variation in
controls surpasses this range in the data, α will not be constrained to 0, and thus posterior
weights will be influenced toward a CC approach. Thus, a value equivalent to one standard
deviation of ancestry difference in controls ensures tolerance of noise from controls based
on evidence from the data. A larger prior variance implies a larger tolerance for noise, thus
making the BMA approach less sensitive to systematic deviations in controls and weighting
toward a CO model. While it is possible to influence posterior model probabilities through
specification of prior distributions, we recommend incorporating prior knowledge or
preferences through the explicit choice of prior model weights for simplicity. In setting
prior model weights, we recommend maintaining a 1:1 CO:CC ratio in most
implementations, unless sufficient a priori knowledge suggests otherwise.
Chapter 4 Application of Methods
4.1 GxE Analysis of Childhood Asthma
4.2.1 Introduction: Childhood Asthma
Childhood asthma is the most prevalent childhood illness (Asher et al., 2006) with
8.3% of U.S. children currently affected ("CDC.gov," 2018). While clinically asthma can
present as various phenotypes (Borish & Culp, 2008), in general symptoms typically
include wheezing, coughing, labored breathing and feeling weak or tired ("American
Academy of Allergy Asthma and Immunology ", 2018). The risk of asthma in the United
States has some variation with ethnic group and heritability estimates imply 35-80% of
variation in risk is attributable to genetic variation (Duffy, Martin, Battistutta, Hopper, &
Mathews, 1990; Nieminen, Kaprio, & Koskenvuo, 1991; Torgerson et al., 2011). Traffic-
related ambient air pollution has been shown to contribute to the development and
exacerbation of asthma (Barone-Adesi et al., 2015; Gauderman et al., 2005; Gauderman et
al., 2007; Guarnieri & Balmes, 2014; Kravitz-Wirtz et al., 2018; McConnell et al., 2006).
While marginal associations between genetic factors and asthma have been studied
(Moffatt et al., 2010; Torgerson et al., 2011), here we perform a genome-wide interaction
study for discovery of markers with low effect sizes that would have been undetected by
marginal association studies alone. We conduct two GxE scans, one using self-declared
Hispanic ethnicity and one using ambient air pollution, measured as PM2.5, as the exposure.
4.2.2 Methods: Childhood Asthma
We applied the BMA 2DF (see Chapter 2), MA, CC, CO and DF2 methods to the
Children’s Health Study (CHS), an ongoing cohort study spanning 16 southern California
communities investigating genetic and environmental factors leading to childhood
respiratory outcomes. Using GWAS data on a nested case-control sample of 3,000 subjects,
including 1,398 parent-identified Hispanic whites (HW) and 1,602 non-Hispanic whites
(NHW) from the CHS, we analyzed GxE effects on childhood asthma. Childhood asthma
status was based on questionnaire responses from parents affirming doctor-diagnosed
asthma. We used a sample of 1,249 cases, of which 606 individuals were identified as
Hispanic whites, and 1,751 controls, of which 792 individuals were identified as Hispanic
whites. We analyzed two separate interactions: gene by self-reported Hispanicity (G x
Hisp), and gene by ambient air pollution (G x PM2.5). We used micrograms per cubic meter
of PM2.5, particulate matter in the air smaller than 2.5 micrometers, as our measure of air
pollution exposure. PM2.5 exposure was categorized into 'low' (≤ 15.2 μg/m³) and 'high'
(> 15.2 μg/m³) exposure levels, classifying 58.6% of our sample as exposed to low levels
and 41.4% to high levels of PM2.5. Four cohorts of children are included in the analysis
from the 16 neighborhoods. The PM2.5 level for each child is the level measured for their
particular cohort and neighborhood; thus, children from the same neighborhood-cohort
grouping are assigned the same PM2.5 value. Figure 4-1 shows the distribution of PM2.5
exposure in the CHS case-control sample with bins indicating the value of PM2.5 for all
children in each neighborhood-cohort group. Measured genotype data consisted of 630,600
SNPs. These SNPs were phased using SHAPEIT and additional SNPs were imputed using
IMPUTE2 separately for Hispanic and non-Hispanic whites against 1,000 Genomes Phase
1 integrated variant v3 phased reference. Imputed SNPs were filtered using the IMPUTE2
information metric removing SNPs with an information score < 0.7. SNPs with a combined
minor allele frequency for both non-Hispanic whites and Hispanic whites less than 5%
were removed from the analysis. After this QC, a total of 6,216,909 SNPs were available
for analysis. In all analyses, we adjusted for sex and Native American ancestry (<5%,
5-50%, and >50%). We further adjusted for self-reported Hispanicity in all analyses of G x
PM2.5 interaction as well as for the analysis of marginal genetic effects on asthma status.
Based on prior knowledge, the prior weighting in the G x PM2.5 analysis was set to equally
favor the CC and CO models (i.e. 1:1) while the prior weighting for the G x Hispanicity
analysis was set at 100:1 odds that a CC model is more appropriate. These prior weights
are supported empirically, as the overdispersion parameters for the logistic CC and CO
models in the G x Hispanicity analyses are 1.0 and 1.8, respectively. Because we are using
Laplace estimation to obtain marginal likelihoods, the computation time for the BMA 2DF
model is relatively small.
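
The exposure coding and SNP filtering rules described above can be written compactly as follows. This is only a sketch of the stated thresholds; the function names `pm25_group` and `passes_qc` are hypothetical and do not correspond to any tool used in the analysis.

```python
def pm25_group(pm25_ug_m3, cutoff=15.2):
    """Dichotomise PM2.5 (micrograms per cubic meter): 'low' if <= 15.2, else 'high'."""
    return "low" if pm25_ug_m3 <= cutoff else "high"

def passes_qc(info_score, combined_maf, info_min=0.7, maf_min=0.05):
    """Keep an imputed SNP unless its information score is below 0.7
    or its combined minor allele frequency is below 5%."""
    return info_score >= info_min and combined_maf >= maf_min

# Hypothetical example values.
example_groups = [pm25_group(v) for v in (9.8, 15.2, 21.4)]
example_keep = passes_qc(info_score=0.85, combined_maf=0.12)
```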
Figure 4-1 Distribution of PM2.5 exposure (micrograms per cubic meter) among 3,000
asthma cases and controls. Bins indicate PM2.5 values observed for all children in
each neighborhood-cohort group.
[Figure panels: "PM2.5 Distribution" and "PM2.5 Distribution by Neighborhood-Cohort Group"; x-axis: PM2.5; y-axis: Number of Observations]
Childhood Asthma Model Specification
We performed the BMA 2DF approach using the loglinear models outlined in
Chapter 2. To maintain equivalence between parameter interpretations from loglinear and
logistic models, we used the loglinear models below for the BMA 2DF GxE analyses using
Hispanicity and PM2.5. We conducted the G × PM2.5 analysis by specifying the following
case-control equation:

$\log(n \mid G, E, Y, C_k) = \alpha_{cc}^{0} + \alpha_{cc}^{G} G + \alpha_{cc}^{E} E + \alpha^{GE} GE + \beta_{cc}^{0} Y + \beta_{cc}^{G} GY + \beta_{cc}^{E} EY + \beta_{cc}^{G \times E} GEY + \sum_{k=1}^{4} \alpha_{cc}^{C_k} C_k + Y\sum_{k=1}^{4} \beta_{cc}^{C_k} C_k + G\sum_{k=1}^{4} \beta_{cc}^{GC_k} C_k + E\sum_{k=1}^{4} \beta_{cc}^{EC_k} C_k$

where
$C_1$ = Sex (1: male, 0: female)
$C_2$ = Native American Ancestry (1: 5% - 50%, 0: otherwise)
$C_3$ = Native American Ancestry (1: >50%, 0: otherwise)
$C_4$ = Hispanic White (1: Hispanic White, 0: Non-Hispanic White).
Likewise, the G × Hispanicity analysis used the following case-control equation:

$\log(n \mid G, E, Y, C_k) = \alpha_{cc}^{0} + \alpha_{cc}^{G} G + \alpha_{cc}^{E} E + \alpha^{GE} GE + \beta_{cc}^{0} Y + \beta_{cc}^{G} GY + \beta_{cc}^{E} EY + \beta_{cc}^{G \times E} GEY + \sum_{k=1}^{3} \alpha_{cc}^{C_k} C_k + Y\sum_{k=1}^{3} \beta_{cc}^{C_k} C_k + G\sum_{k=1}^{3} \beta_{cc}^{GC_k} C_k + E\sum_{k=1}^{3} \beta_{cc}^{EC_k} C_k$

with the omission of $C_4$ as Hispanicity is captured here by E.
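
The covariate coding $C_1$ through $C_4$ can be illustrated with a short sketch. The function name `covariates` and its argument names are hypothetical; the ancestry cut points follow the categories stated above (<5%, 5-50%, >50%), with <5% as the reference category.

```python
def covariates(sex, native_american_pct, hispanic_white, include_hispanicity=True):
    """Return [C1, C2, C3] and, for the G x PM2.5 model, C4 as well.

    sex: 'male' or 'female'; native_american_pct: percent Native American ancestry;
    hispanic_white: True for Hispanic white, False for non-Hispanic white.
    """
    c1 = 1 if sex == "male" else 0
    c2 = 1 if 5 <= native_american_pct <= 50 else 0
    c3 = 1 if native_american_pct > 50 else 0
    c = [c1, c2, c3]
    if include_hispanicity:  # omitted in the G x Hispanicity model, where E is Hispanicity
        c.append(1 if hispanic_white else 0)
    return c

# Hypothetical subjects.
pm25_model_row = covariates("male", 30, True)
hisp_model_row = covariates("female", 60, False, include_hispanicity=False)
```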
4.2.3 Results: Childhood Asthma
In our analysis of G x PM2.5 interaction on asthma, the BMA 2DF approach
identified a genome-wide significant region on chromosome 22, with the most significant
SNP in the region having a P-value of $5.81 \times 10^{-9}$ (Table 4-2). Table 4-2 also shows the
same region identified by the CC and DF2 models as having a significant interaction with
PM2.5 on asthma. The MA model shows no marginal effect of the region on asthma while
the CO model produced P-values that are low in the region, but do not reach genome-wide
significance. Thus, the finding of rs62227671 by the BMA 2DF approach is largely driven
by its adherence to the CC model with a posterior probability for the CC model of 0.993
(see Figure 4-2).
A second region identified as marginally genome-wide significant by the BMA
2DF model on chromosome 20, rs6122625 (BMA 2DF P-value $5.97 \times 10^{-8}$), was not
identified by any of the other approaches as being genome-wide significant or marginally
significant. While rs6122625 has no marginal effect on asthma (MA P-value $3.77 \times 10^{-1}$),
both CC and CO models yield relatively small P-values, implying that the finding is driven
by the interaction alone, which is also true of the subsequent marginally significant BMA
2DF findings on chromosomes 2 and 8 for the G x PM2.5 analysis.
The BMA 2DF test identifies rs6866110 on chromosome 5 as marginally
significant (P-value $3.24 \times 10^{-7}$) as well, while the MA, CC and CO methods show
relatively weaker signals. To investigate the weak signals from other approaches, we
examine the marginal effect of rs6866110 by PM2.5 exposure group in Table 4-1. Table 4-1
shows the marginal effects of rs6866110 in opposite directions according to the low/high
exposure group.
In the analysis of G x Hispanicity, the BMA 2DF approach identified a genome-wide
significant SNP, rs4672623 (P-value $9.48 \times 10^{-9}$), on chromosome 2 in Table 4-3
and Figure 4-3. Results for rs4672623 from the CC analysis show a marginally significant
interaction, while the marginal test for association between rs4672623 and asthma shows
a much weaker signal in the opposite direction from that of the GxE interaction. Due to
effects of rs4672623 in opposite directions per Hispanicity group (OR[G|E= NHW] =
1.73 and OR[G|E= HW] = 0.71) as shown in Table 4-1, the marginal effect of G in the
combined sample is weakened. However, testing both G and the interaction together in a
2-degree-of-freedom setting as both the BMA 2DF and DF2 methods do, yields a signal
that reaches genome-wide significance. Additionally, the BMA 2DF test identified 3
marginally significant regions on chromosomes 8, 1, and 6. Each of these regions exhibits
protective effects according to marginal tests of association between G and asthma, and
interacts with Hispanicity with effects in the same direction within a CC analysis. Table 4-3
shows rs10955770 on chromosome 8 has opposite CC and CO effects. Assuming this is
a result of a G-E association in controls, the 1:100 CO:CC prior weighting scheme makes the
BMA 2DF P-value more plausible since it results in posteriors heavily in favor of a CC model.
Marginal: G
Case-Only: GxE Case-Control: GxE
Figure 4-2: Manhattan plots for Joint BMA (BMA-2DF), Marginal (MA), Case-
Control (CC), and Case-Only (CO) analysis for G × 𝐏 𝐌 𝟐 .𝟓 interaction with
childhood asthma.
Table 4-1 Stratified marginal analysis of rs6866110 and rs4672623 by exposure
group
Figure 4-3: Manhattan plots for Joint BMA (BMA-2DF), Marginal (MA), Case-
Control (CC), and Case-Only (CO) analysis for G × Hispanicity interaction with
childhood asthma.
[Figure panels: Joint BMA: G and GxE; Marginal: G; Case-Only: GxE; Case-Control: GxE]
73
Table 4-2 Top loci ranked by BMA P-value for G × Hispanicity interaction on asthma susceptibility
Table 4-3 Top loci ranked by BMA 2DF P-value for G × PM2.5 interaction on asthma susceptibility
4.2.4 Discussion: Childhood Asthma
In our analysis of G × PM2.5 exposure using the Children’s Health Study, we
identified a novel region on chromosome 22 that has a genome-wide significant interaction
with PM2.5 (P-value = $5.8 \times 10^{-9}$) on childhood asthma. The SNP with the greatest
effect size in this locus, rs62227671, is in the PARVB gene region. PARVB is involved in
cytoskeleton organization and cell adhesion, and has no previous record of association
with childhood asthma, either directly or as an effect modifier of PM2.5.
From Table 4-2 we can see that the association is largely driven by the interaction effect
from a CC model, where the interaction effect, OR(GxE) = 2.57, is highly significant
(P-value = $7.6 \times 10^{-10}$). Since this SNP has no significant marginal effect on
childhood asthma, it would likely have been overlooked by a standard GWAS using an
MA approach. Additional examination of the relationship of rs62227671 and PM2.5 to
childhood asthma is necessary to determine the true effects and mechanisms of action of
this genetic region on childhood asthma.
Due to the likely correlation between genetic markers and self-reported
Hispanicity, we used a prior weighting scheme based on CC to CO model odds of 100:1,
placing more weight on the CC model. Figure 4-3 shows the inflation resulting from
the CO model, likely due to violation of the G-E independence assumption. Using this
weighting scheme, the BMA 2DF method identified rs4672623 on chromosome 2 as
having a genome-wide significant interaction with self-identified Hispanicity (P-value =
$9.48 \times 10^{-9}$), as shown in Table 4-3. This association is largely driven by the
CC model; however, it is not driven by the CC result alone. The MA and CO associations,
though modest, also appear to contribute to the BMA 2DF signal in this region. We note that,
like the BMA 2DF approach, the DF2 model also captures the association as genome-wide
significant by incorporating the SNP effect in its 2-degree-of-freedom testing scheme.
SNP rs4672623 is in the ErbB4 gene region on chromosome 2, which has been shown to
regulate late fetal lung development (Liu et al., 2010; Zscheppang, Giese, Hoenzke,
Wiegel, & Dammann, 2013), suggesting that the association is plausible. Further
investigation is necessary to determine the true role of rs4672623 in childhood asthma in
Hispanic white and non-Hispanic white children.
4.3 Admixture Mapping in Prostate Cancer
4.3.1 Introduction: Prostate Cancer
Prostate cancer is the second most common cancer and second leading cause of
cancer death in men in the United States ("American Cancer Society," 2018) with men of
African ancestry at greater risk of developing and dying from prostate cancer compared to
men from other populations (DeSantis et al., 2016; Kolonel et al., 2000). As much as 57%
of the variability in population prostate cancer risk is estimated to be due to genetic factors
(Mucci et al., 2016). To date, GWAS have identified more than 160 common variants
associated with prostate cancer risk, which in total account for approximately 30% of the
familial risk of prostate cancer in populations of European ancestry (Eeles et al., 2013;
Schumacher et al., 2018). Risk of prostate cancer is differential across ethnicities, with
rates in African populations higher than those observed in European populations
(Brathwaite, Brathwaite, & del Riego, 2007; Echimane et al., 2000; Kolonel et al., 2000).
Here, we perform an admixture mapping of prostate cancer using African ancestry in
African American and Latino populations. In addition to studying African ancestry, we
conduct an admixture mapping of prostate cancer using Amerindian ancestry in Latinos.
We use the case-control (CC), case-only (CO), and novel BMA admixture approaches
discussed in Chapter 3 to map prostate cancer in African Americans and Latinos in the
African Ancestry Prostate Cancer (AAPC) and Latino American Prostate Cancer (LAPC)
consortia using African and Amerindian ancestries.
4.3.2 Methods: Prostate Cancer
We applied the BMA, CC, and CO admixture approaches to 8,326 individuals (4,227 cases and
4,071 controls) from the African Ancestry Prostate Cancer consortium (AAPC) and
2,288 individuals (1,235 cases and 1,053 controls) from the Latino American Prostate
Cancer consortium (LAPC). Local ancestry for the AAPC samples was estimated by
RFMix (Maples et al., 2013) using 1000 Genomes (phase 1) Europeans and Africans as
reference panels. Local ancestry for the LAPC samples was also estimated by RFMix,
using the PAGE mega global reference panel, consisting of 517 African, 513 European,
and 360 Amerindian genomes, and 1000 Genomes (phase 3) as reference panels. Global
ancestry for both samples is the mean of all local ancestry estimates across the genome for
each individual.
We fit the following CC and CO models:
$$\text{CC: } \mathbf{A}_{\ell k} = \alpha + \beta_{cc}\mathbf{Y}$$
$$\text{CO: } \mathbf{A}_{\ell k} = \beta_{co}\mathbf{Y}$$
where $\mathbf{A}_{\ell k} = \bar{Q} - \bar{L}_{\ell k}/2$, and $k$ denotes African or
Amerindian ancestry. For BMA admixture implementation, we set the prior variance of
$\alpha$, $\beta_{cc}$, and $\beta_{co}$ according to the relation
$Var(\beta_j \mid \sigma^2, M_j) = \sigma^2 V_j$, where diagonal entries of $V_j$ were set to
$SE(A_{\ell k}^{Unaff}) = 0.0053$, $SE(A_{\ell k}^{Unaff}) = 0.0069$, and
$SE(A_{\ell k}^{Unaff}) = 0.016$ for African ancestry in the AAPC, African ancestry in the
LAPC, and Amerindian ancestry in the LAPC, respectively. All additional hyperparameters
were chosen according to the recommendations in chapter 3.
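To make the difference between the two regressions concrete, the following is a minimal
least-squares sketch of the CC and CO models for a single locus. It is an illustration only,
not the author's R package: the function names are hypothetical, the ancestry deviance values
are made up, and it assumes the CO model is the CC model with the intercept fixed at zero, as
the equations above suggest. With a binary Y, the CC slope equals the difference in mean
deviance between cases and controls, while the CO slope equals the mean deviance in cases.

```python
import math


def fit_cc(a, y):
    """Least squares for A = alpha + beta_cc * Y; returns (beta_cc, se)."""
    n = len(a)
    y_bar = sum(y) / n
    a_bar = sum(a) / n
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sya = sum((yi - y_bar) * (ai - a_bar) for yi, ai in zip(y, a))
    beta = sya / syy
    alpha = a_bar - beta * y_bar
    rss = sum((ai - alpha - beta * yi) ** 2 for yi, ai in zip(y, a))
    # Two estimated parameters, so n - 2 residual degrees of freedom.
    return beta, math.sqrt(rss / (n - 2) / syy)


def fit_co(a, y):
    """Least squares through the origin for A = beta_co * Y; returns (beta_co, se)."""
    n = len(a)
    syy = sum(yi * yi for yi in y)
    beta = sum(yi * ai for yi, ai in zip(y, a)) / syy
    rss = sum((ai - beta * yi) ** 2 for yi, ai in zip(y, a))
    # One estimated parameter, so n - 1 residual degrees of freedom.
    return beta, math.sqrt(rss / (n - 1) / syy)


# Hypothetical deviances: cases (Y = 1) and controls (Y = 0) deviate from
# global ancestry by the same amount on average.
y = [1, 1, 0, 0]
a = [0.2, 0.4, 0.2, 0.4]
print(fit_cc(a, y)[0], fit_co(a, y)[0])
```

In this toy input the CC slope is zero while the CO slope is not, which is the kind of situation
in which a case-only analysis can flag a locus even though cases and controls do not differ.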
4.3.3 Results: Prostate Cancer
CO, CC, and BMA approaches all showed a significant difference in mean ancestry
deviation within the 8q24 region in the AAPC analysis using African ancestry. Figure 4-4
shows $-\log_{10}$ P-values, posterior probability of a CC model, and mean local ancestry in
relation to global ancestry for cases and controls. The peak shown in panel 1 of Figure 4-4
between 100Mb and 150Mb shows the P-values attaining genome-wide significance for all
three methods, with the smallest P-value resulting from the CO analysis (P-value =
$1.4962 \times 10^{-15}$). The BMA approach is shown in Figure 4-4 as having P-values between
those of the CO and CC approaches. Panel 2 of Figure 4-4 shows the CC posterior
probability remaining low and flat throughout the 8q24 region, indicating greater weight
toward the CO approach in the region. Panel 3 of Figure 4-4 shows an increase in deviance
of mean local African ancestry in cases from their global average in the 8q24 region, while
controls show no deviation other than noise. Additionally, the global average of African
ancestry in the AAPC for chromosome 8 is higher in cases than in controls
($\bar{Q}_k^{Aff} = 0.8315$ and $\bar{Q}_k^{Unaff} = 0.8245$), though tests for local ancestry
deviance are invariant to mean global ancestry values for all methods. Admixture mapping
using African ancestry in Latinos of the LAPC sample shows a substantially smaller global
proportion of African ancestry (see Table 4-4 and Figure 4-5, panel 3) and smaller
differentiation between cases and controls in global ancestry means. Nonetheless, results
show a jump in $-\log_{10}$ P-values within the 8q24 region in a pattern similar to that of
the AAPC results. Although only the CO approach reaches $p$-value $< 5 \times 10^{-8}$, all
three methods show an increase in $-\log_{10}$ P-values in this region (see Figure 4-5,
panel 1).
Admixture analysis performed on the LAPC sample using Amerindian ancestry
shows an ancestry deviation peak from global Amerindian ancestry between 60Mb and
80Mb, shown in Figure 4-6. All three approaches show an increase in $-\log_{10}$ P-values in
this region, though only the CO approach reaches $p$-value $< 5 \times 10^{-8}$, with
$p$-value $= 1.7433 \times 10^{-8}$ (see Table 4-5). Additionally, though not reaching
$p$-value $< 5 \times 10^{-8}$, there is a second region shown in Figure 4-6, between 5Mb and
10Mb, in which the CO approach shows a peak of increased $-\log_{10}$ P-values. Between 5Mb
and 10Mb, Figure 4-6 shows a mean local ancestry deviation from the global mean in the same
direction for both cases and controls. While the CO approach is insensitive to deviations in
the controls here and shows a peak, the CC approach is able to detect that there is no
difference between cases and controls and thus shows no peak in P-values in this region. The
BMA approach in this region has a posterior model probability for the CC model of 0.84
(Table 4-5). The BMA method adheres more closely to the CC model here and also
shows no peak in P-values. Figure 4-6, panel 2 shows the posterior probability of the CC
model peaking in this region, where mean local ancestry in both cases and controls deviates
from the global average. Table 4-5 shows the most significant markers (rs10956365,
rs16901163, and rs1859088) as identified by the BMA approach and their effect estimates
and P-values resulting from the CO and CC approaches.
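The way prior model odds and the data combine into a posterior probability for the CC model
follows from Bayes' rule: posterior odds equal prior odds times the Bayes factor. The sketch
below shows only this generic relationship. The function name is hypothetical, and the Bayes
factor is taken as an input because the marginal likelihoods used in this dissertation are
defined in chapter 3 and are not reproduced here.

```python
def posterior_prob_cc(bayes_factor_cc_vs_co, prior_odds_cc_to_co=1.0):
    """Pr(CC | data) from the CC-vs-CO Bayes factor and the prior odds CC:CO."""
    if bayes_factor_cc_vs_co < 0 or prior_odds_cc_to_co <= 0:
        raise ValueError("Bayes factor must be >= 0 and prior odds must be > 0")
    posterior_odds = prior_odds_cc_to_co * bayes_factor_cc_vs_co
    return posterior_odds / (1.0 + posterior_odds)


# Equal prior odds (1:1), as used in the admixture analyses: the posterior
# is determined entirely by the Bayes factor.
print(posterior_prob_cc(5.25, 1.0))

# Prior odds of 100:1 in favor of CC, as used for G x Hispanicity: even
# data that are indifferent between the models (Bayes factor 1) leave most
# of the posterior weight on CC.
print(posterior_prob_cc(1.0, 100.0))
```

Under 1:1 prior odds, a posterior probability of 0.84 for the CC model corresponds to a Bayes
factor of 0.84 / 0.16 = 5.25 in favor of CC.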
| Study | Ancestry | Mean global ancestry $\bar{Q}_k$: Cases | Mean global ancestry $\bar{Q}_k$: Controls |
|---|---|---|---|
| AAPC | African | 0.8314 | 0.8247 |
| LAPC | African | 0.0531 | 0.0522 |
| LAPC | Amerindian | 0.3481 | 0.4080 |

Table 4-4 Mean global African ancestry in AAPC and LAPC samples, and mean global
Amerindian ancestry in LAPC sample across cases and controls.
Figure 4-4 Admixture mapping of prostate cancer using African ancestry in the AAPC
on chromosome 8: -log(P-values) (top), posterior probability of a CC model (middle),
and mean local ancestry relative to global ancestry for cases and controls separately
(bottom). BMA approach (red) was conducted using CO:CC odds of 1:1, with prior
variance $Var(\tilde{\beta}_{BMA}) = SE(A_{\ell k}^{Unaff})$ calculated prior to the analysis.
Vertical grey lines indicate known prostate cancer regions.
Plot title: Chromosome 8 Admixture Mapping AAPC: Prostate Cancer African Ancestry (1:1)
Table 4-5 Admixture analysis significant regions in prostate cancer on chromosomes 8 and
16 for African and Amerindian ancestry within AAPC and LAPC samples. Effect sizes are
negative due to the definition of deviance as the difference between global and local ancestry.
Figure 4-5 Admixture mapping of prostate cancer using African ancestry in the LAPC on
chromosome 8. BMA approach (red) was conducted using CO:CC odds of 1:1, with prior
variance $Var(\tilde{\beta}_{BMA}) = SE(A_{\ell k}^{Unaff})$ calculated prior to the analysis.
Vertical grey lines indicate known prostate cancer regions.
Plot title: Chromosome 8 Admixture Mapping LAPC: Prostate Cancer African Ancestry (1:1)
4.3.4 Discussion: Prostate Cancer
The region between 5Mb and 10Mb on chromosome 16 in the admixture mapping of
Latinos (see Figure 4-6) displays a scenario in which a CO approach may be likely to detect
spurious associations. Here, we see deviations of mean local Amerindian ancestry from
mean global Amerindian ancestry in both cases and controls. The posterior probability of
the CC method is increased in this region (see Figure 4-6, panel 2), causing the BMA
approach to adhere closely to the flat statistics of the CC approach rather than show an
increase in $-\log_{10}$(P-values) similar to the CO approach. Because the prior probabilities
of the CO and CC models were set to favor both equally (odds 1:1) in the LAPC
analysis, we can be confident that the BMA approach, without prior weights specifying the
CC method as more appropriate, will remain robust in such scenarios.

Figure 4-6 Admixture mapping of prostate cancer using Amerindian ancestry in the LAPC
on chromosome 16. BMA approach (red) was conducted using CO:CC odds of 1:1, with
prior variance $Var(\tilde{\beta}_{BMA}) = SE(A_{\ell k}^{Unaff})$ calculated prior to the
analysis. Vertical grey lines indicate known prostate cancer regions.
Plot title: Chromosome 16 Admixture Mapping LAPC: Prostate Cancer Amerindian Ancestry (1:1)
Our analysis showed significant peaks in region 8q24 for both AAPC and LAPC
for African ancestry, and in 16q21 - 16q23 in the LAPC for Amerindian ancestry. The 8q24
region is a well-documented region associated with prostate cancer (Cropp et al., 2014;
Haiman et al., 2007; Han et al., 2016; Schumacher et al., 2018; Schumacher et al., 2007;
Yeager et al., 2009), with evidence suggesting an association with African ancestry, which
may explain why African Americans carry a disproportionately higher risk of the disease
(Cropp et al., 2014; Han et al., 2016; Irizarry-Ramirez et al., 2017; Murphy et al., 2012;
Okobia, Zmuda, Ferrell, Patrick, & Bunker, 2011). The regions 16q21 and 16q23 were
recently reported in a GWAS meta-analysis (Schumacher et al., 2018) with index
markers rs11863709 and rs201158093 (indicated by grey vertical lines in Figure 4-6). Our
analysis detects a significant excess of mean Amerindian local ancestry in cases, while no
such departure from global Amerindian ancestry is observed in controls. Further
investigation is necessary to determine the role of this region and the role of Amerindian
ancestry in prostate cancer. In both of these peaks, P-values resulting from the BMA
approach consistently lie between those of the CC and CO approaches, suggesting that it
provides an appropriate substitute for implementing both approaches in tandem. Software
to conduct the BMA admixture approach is available as an R package, with details provided
in chapter 5.
4.4 Admixture Mapping in Multiple Sclerosis
4.4.1 Introduction: Multiple Sclerosis
Multiple sclerosis (MS) is a neurodegenerative autoimmune disease with a
prevalence of 400,000 individuals in the United States and 2.1 million worldwide (Browne
et al., 2014; Zwibel & Smrtka, 2011). MS impacts patients’ quality of life by negatively
affecting employment, productivity, and social relationships (Nortvedt, Riise, Myhr, &
Nyland, 1999; Phillips, 2004; Rao et al., 1991), with a health care cost burden in the United
States of $8,528 to $52,244 per patient per year (Adelman, Rane, & Villa, 2013). The
disease burden of MS is distributed differentially across ethnic groups, with a greater
burden on Whites and African Americans and a smaller burden on Hispanic and Asian
populations (Langer-Gould, Brara, Beaber, & Zhang, 2013; Rivas-Rodríguez & Amezcua, 2018).
Moreover, clinical characteristics differ between ethnic groups, with earlier onset
and greater disease severity in Hispanic and African American individuals compared with
individuals of European ancestry (Amezcua, Lund, Weiner, & Islam, 2011;
Hadjixenofontos et al., 2015; Ventura, Antezana, Bacon, & Kister, 2017). European
ancestry is thought to contribute to MS risk in Hispanics and African Americans (Isobe et
al., 2015; Oksenberg et al., 2004; Ordoñez et al., 2015; Rivas-Rodríguez & Amezcua,
2018). Genetic studies of MS have consisted largely of case-control and family-based
candidate gene studies (International Multiple Sclerosis Genetics et al., 2007; Isobe et al.,
2013; Johnson et al., 2009; Oksenberg et al., 2004) and have identified the HLA region to
be most important in MS risk (Rivas-Rodríguez & Amezcua, 2018). Here, we conduct an
admixture mapping of MS in Latino individuals using Amerindian ancestry.
4.4.2 Methods: Multiple Sclerosis
We applied the BMA, CC, and CO admixture approaches presented in chapter 3 to
1,236 individuals of Hispanic-White self-declared ethnicity. The sample consisted of 1,018
controls from the Los Angeles Latino Eye Study (LALES) and 218 USC recruited multiple
sclerosis cases (see Table 4-6). Cases were genotyped on the Illumina
HumanOmniExpress-12v1_A and Illumina HumanOmniExpress-24v1-0_A arrays.
LALES controls were genotyped on the Illumina HumanOmniExpress-12v1_H and
Illumina HumanOmniExpress-12-v1-1. Imputation was done using IMPUTE2 v2.3.2 and
SHAPEIT with all samples from the 1000 Genomes Phase 3 reference panel. Global ancestry
was estimated using RFMix (Maples et al., 2013) from average RFMix posterior
probabilities for each sample, both haplotypes, genome-wide across three ancestries:
European, African, and Amerindian. Interpolated local ancestry was estimated using typed
and imputed SNPs by RFMix and the 1000 Genomes Project (phase 3). Posterior
probabilities were interpolated using a linear function calculated from the probabilities
from the closest typed SNPs (upstream and downstream) as the y-variable, and their
positions as the x-variable. This function was then applied to each untyped SNP using its
position for the x-variable. We filtered out SNPs with information score < 0.7 and a MAF
< 0.01 for a total of 10,062,132 SNPs included in the analysis.
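The interpolation step described above can be sketched as a straight line drawn between the
two typed SNPs that flank an untyped position. This is a minimal illustration under stated
assumptions, not the pipeline actually used: the function name, positions, and probabilities
are hypothetical, and positions outside the typed range are rejected rather than extrapolated.

```python
from bisect import bisect_left


def interpolate_posterior(pos, typed_pos, typed_prob):
    """Linearly interpolate an ancestry posterior probability at position pos.

    typed_pos must be sorted in ascending order; typed_prob holds the
    posterior probability at each typed SNP.
    """
    i = bisect_left(typed_pos, pos)
    if i < len(typed_pos) and typed_pos[i] == pos:
        return typed_prob[i]  # the SNP is typed, nothing to interpolate
    if i == 0 or i == len(typed_pos):
        raise ValueError("position lies outside the range of typed SNPs")
    # Closest typed SNPs upstream (x0) and downstream (x1) of pos.
    x0, x1 = typed_pos[i - 1], typed_pos[i]
    y0, y1 = typed_prob[i - 1], typed_prob[i]
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (pos - x0)


# Hypothetical typed SNP positions (bp) and Amerindian posterior probabilities.
typed_pos = [1000, 2000, 4000]
typed_prob = [0.2, 0.6, 1.0]
print(interpolate_posterior(1500, typed_pos, typed_prob))
print(interpolate_posterior(3000, typed_pos, typed_prob))
```

Each untyped SNP therefore receives a value between the probabilities of its two flanking typed
SNPs, weighted by physical distance.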
| | Male | Female | Total |
|---|---|---|---|
| Cases | 89 (7%) | 129 (10%) | 218 (18%) |
| Controls | 364 (29%) | 654 (53%) | 1018 (82%) |
| Total | 453 (37%) | 783 (63%) | 1236 |

Table 4-6 Distribution of multiple sclerosis cases and controls in Hispanic White individuals
We fit the following CC and CO models:
$$\text{CC: } \mathbf{A}_{\ell k} = \alpha + \beta_{cc}\mathbf{Y}$$
$$\text{CO: } \mathbf{A}_{\ell k} = \beta_{co}\mathbf{Y}$$
where $\mathbf{A}_{\ell k} = \bar{Q} - \bar{L}_{\ell k}/2$, and $k$ is Amerindian ancestry.
For BMA admixture implementation, we set the prior variance of $\alpha$, $\beta_{cc}$, and
$\beta_{co}$ according to the relation $Var(\beta_j \mid \sigma^2, M_j) = \sigma^2 V_j$, where
diagonal entries of $V_j$ were set to $SE(A_{\ell k}^{Unaff}) = 0.011$. All additional
hyperparameters were chosen according to the recommendations in chapter 3.
4.4.3 Results: Multiple Sclerosis
Global ancestry proportions were primarily concentrated between European and
Amerindian ancestries, with very little mean African ancestry in both cases
($\bar{Q}_{African}^{Aff} = 0.07$) and controls ($\bar{Q}_{African}^{Unaff} = 0.05$).
Table 4-7 shows mean global ancestries for European, Amerindian, and African ancestry.
We found that cases have a higher proportion of both European and African ancestries
compared to controls (see Table 4-7).

| Ancestry | Cases | Controls | Total |
|---|---|---|---|
| European | 0.5035 | 0.4709 | 0.4767 |
| Amerindian | 0.426 | 0.4756 | 0.4669 |
| African | 0.0704 | 0.0535 | 0.0565 |

Table 4-7 Mean global European, Amerindian, and African ancestry of multiple sclerosis
cases and controls.

We observed a peak on chromosome 5 between 125Mb and 135Mb, with the most significant
p-value resulting from the CO approach ($p$-value $= 1.66 \times 10^{-4}$) in Figure 4-7.
The BMA approach had a slightly higher p-value for this region
($p$-value $= 3.16 \times 10^{-4}$) and a posterior probability of the CC model of
Pr(CC) = 0.29 (see Table 4-8). Table 4-8 summarizes the most significant peaks from our
analysis. Chromosome 6 has several peaks from the CO method only in the region between
18Mb and 40Mb (see Figure 4-8). Figure 4-9 shows a peak on chromosome 18 within the region
70Mb – 80Mb. Mean local ancestry in cases here shows an increase in deviance from mean
global Amerindian ancestry, while controls largely show noise with some trend toward
increasing deviance. The CO approach is the most sensitive in this region
($p$-value $= 1.13 \times 10^{-5}$) and produces the lowest p-value in our analysis (see
Table 4-8 and Figure 4-9). Figure 4-10 shows results from chromosome 21 in the 45Mb – 50Mb
region. Here, both cases and controls deviate in mean local ancestry from global ancestry.
The BMA and CC approaches do not show a peak in this region.
| Ancestry | Chr | RS No. | Position | BMA Effect | BMA P-value | Pr(CC) | CO Effect | CO P-value | CC Effect | CC P-value |
|---|---|---|---|---|---|---|---|---|---|---|
| Amerindian | 5 | rs1520246 | 127266725 | 0.08 | 3.16E-04 | 0.29 | 8.25E-02 | 1.66E-04 | 7.85E-02 | 1.16E-03 |
| Amerindian | 6 | rs214621 | 18197475 | 0.05 | 6.82E-02 | 0.99 | 4.85E-02 | 4.30E-04 | 4.54E-02 | 6.82E-02 |
| Amerindian | 18 | rs2727058 | 75476329 | -0.07 | 5.41E-03 | 0.98 | -1.01E-01 | 1.13E-05 | -7.04E-02 | 5.32E-03 |
| Amerindian | 21 | rs2839201 | 47715594 | -0.03 | 2.45E-01 | 1.00 | -9.49E-02 | 3.04E-05 | -2.86E-02 | 2.45E-01 |

Table 4-8 Most significant regions in admixture mapping of MS on Hispanic whites
with Amerindian ancestry.
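As a reading aid for Table 4-8, the sketch below applies the standard Bayesian model averaging
formulas for a two-model average: the averaged effect is the posterior-weighted mean of the
model-specific effects, and the averaged variance adds the between-model spread to the
within-model variances. The function names are hypothetical, and the exact BMA estimator used
in this dissertation is the one defined in chapter 3. The only claim made here is that the
rounded BMA effects in the table are consistent with such a weighted average of the CC and CO
effects.

```python
def bma_effect(beta_cc, beta_co, pr_cc):
    """Posterior-weighted average of the CC and CO effect estimates."""
    return pr_cc * beta_cc + (1.0 - pr_cc) * beta_co


def bma_variance(beta_cc, var_cc, beta_co, var_co, pr_cc):
    """Model-averaged variance: within-model variance plus between-model spread."""
    mean = bma_effect(beta_cc, beta_co, pr_cc)
    pr_co = 1.0 - pr_cc
    return (pr_cc * (var_cc + beta_cc ** 2)
            + pr_co * (var_co + beta_co ** 2)
            - mean ** 2)


# (chromosome, CC effect, CO effect, Pr(CC)) as listed in Table 4-8.
rows = [
    (5, 7.85e-2, 8.25e-2, 0.29),
    (6, 4.54e-2, 4.85e-2, 0.99),
    (18, -7.04e-2, -1.01e-1, 0.98),
    (21, -2.86e-2, -9.49e-2, 1.00),
]
for chrom, beta_cc, beta_co, pr_cc in rows:
    print(chrom, round(bma_effect(beta_cc, beta_co, pr_cc), 2))
```

When Pr(CC) is close to 1, as on chromosomes 6, 18, and 21, the averaged effect sits almost
exactly on the CC estimate, which matches the behavior described in the text.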
Figure 4-7 Admixture mapping of multiple sclerosis using Amerindian ancestry on
chromosome 5. BMA approach (red) was conducted using CO:CC odds of 1:1, with prior
variance $Var(\tilde{\beta}_{BMA}) = SE(A_{\ell k}^{Unaff})$ calculated prior to the analysis.
Plot title: Chromosome 5 Admixture Mapping Results: Multiple Sclerosis (1:1)

Figure 4-8 Admixture mapping of multiple sclerosis using Amerindian ancestry on
chromosome 6. BMA approach (red) was conducted using CO:CC odds of 1:1, with prior
variance $Var(\tilde{\beta}_{BMA}) = SE(A_{\ell k}^{Unaff})$ calculated prior to the analysis.
Plot title: Chromosome 6 Admixture Mapping Results: Multiple Sclerosis (1:1)

Figure 4-9 Admixture mapping of multiple sclerosis using Amerindian ancestry on
chromosome 18. BMA approach (red) was conducted using CO:CC odds of 1:1, with prior
variance $Var(\tilde{\beta}_{BMA}) = SE(A_{\ell k}^{Unaff})$ calculated prior to the analysis.
Plot title: Chromosome 18 Admixture Mapping Results: Multiple Sclerosis (1:1)

Figure 4-10 Admixture mapping of multiple sclerosis using Amerindian ancestry on
chromosome 21. BMA approach (red) was conducted using CO:CC odds of 1:1, with prior
variance $Var(\tilde{\beta}_{BMA}) = SE(A_{\ell k}^{Unaff})$ calculated prior to the analysis.
Plot title: Chromosome 21 Admixture Mapping Results: Multiple Sclerosis (1:1)
4.4.4 Discussion: Multiple Sclerosis
The region between 125Mb and 135Mb on chromosome 5 shows a significant deviation
of mean local ancestry in cases and not in controls, indicating that the region may
represent a meaningful association with multiple sclerosis. All three methods show a peak
in -log($p$-values) in this region. The chromosome 6 region between 18Mb and 40Mb is
detected only by the CO approach (see Figure 4-8, panel 1), and panel 3 of Figure 4-8 shows
divergence of the mean local ancestry from the global average in both cases and controls, in
the same direction. The posterior probability for the CC approach is calculated to be ~0.99,
so the BMA approach adheres to CC estimates. This region presents a scenario in which a CO
approach is likely to detect spurious associations, since it is insensitive to systematic
deviance of local ancestry in controls. However, this region includes 6p21, the human
leukocyte antigen (HLA) region, which is known to have the strongest association with MS
risk (Jersild et al., 1973). The chromosome 18 region between 70Mb and 80Mb shows the most
significant peak in our analysis from the CO approach ($p$-value $= 1.13 \times 10^{-5}$).
The CC approach in this region remains without a peak, and with a posterior probability of
0.98 for the CC model, the BMA approach also yields a less significant p-value
($p$-value $= 5.41 \times 10^{-3}$). This region includes 18q23, which is known for its
association with MS via the Golli-MBP gene (Tienari et al., 1998). The peak from the CO
analysis in the chromosome 21 region between 45Mb and 50Mb shows a scenario in which the CO
model is liable to produce invalid results. This is due to the assumption the CO approach
makes that controls contribute only noise; consequently, systematic deviations in controls
are not considered. Here, we find no difference between cases and controls in deviance.
Sample size proved to be a limitation on power for detecting loci associated with MS, and
resulted in relatively large standard errors. We found several regions in which a CO
approach would be likely to identify spurious associations with MS. In these regions, the
BMA and CC approaches are more reliable. For this reason, we recommend using the BMA
approach genome-wide, as it has been demonstrated to be both powerful and robust.
Chapter 5 R Packages
5.1 BMA for GxE Interaction Software
Software to implement a BMA GxE analysis is available as an R package.
Download and documentation are available through GitHub at
https://github.com/LilithMoss/bma.gxe.git. This program performs a GxE association test
based on the 2df BMA GxE approach outlined in chapter 2 and the 1df BMA approach
introduced by Li and Conti (2009). Y, G, and E variable inputs must be binary for disease
outcomes, genetic factors, and environmental exposures. Using prior model weights
specified by the user, the program constructs a loglinear model with terms based on the
number of covariates included. If no covariates are considered, the models are identical to
equations 4 and 5 in chapter 2. If a covariate 𝐶 is included, the terms 𝐶 and 𝑌 :𝐶 are included
in the models in order to maintain an approximate equivalence with logit CC and CO
models. Results for the 1df BMA test include the BMA GxE effect estimate, standard
error, test statistic, and p-value. The program returns a p-value for the BMA 2DF test. The
program is designed to compute one association at a time, but may be applied genome-wide
using a looping function. Run times are similar to running a CC approach in R.
5.2 BMA for Admixture Software
Software to implement a BMA admixture mapping is available as an R package.
Download and documentation are available through GitHub at
https://github.com/LilithMoss/bma.admix.git. This program performs an admixture
mapping based on the BMA approach outlined in chapter 3. Using prior model weights and
entries for the diagonal prior variance matrix specified by the user, the program computes
a CC, CO, and BMA test for one locus. Results for all three analyses include an effect
estimate, standard error, test statistic, and p-value. The program is designed to compute
one association at a time, but may be applied genome-wide using a looping function. Run
times for the BMA result are similar to running CC and CO approaches serially in R.
Chapter 6 Summary and Future Directions
6.1 Summary of Findings
In chapter 2, we found that the most powerful methods for GxE interaction
discovery are those which incorporate a case-only analysis; however, these are also prone
to inflated type I error rates, and in some scenarios provide results which are entirely
invalid. We found that 2df approaches can enhance power for discovery in scenarios with
small marginal effects. We presented the BMA 2DF approach, which uses Bayesian model
averaging to combine case-control and case-only loglinear models for a 2df test of joint
main and interaction effects. We found that the BMA 2DF model is empirically more powerful
in simulations than both CC and DF2 models, while showing more robustness than CO and
CO 2DF approaches under violations of the G-E independence assumption. We showed through
heatmaps that conducting 2df tests widens the scope of discovery over
single-degree-of-freedom tests such as MA, CC, and CO tests.
We presented a nested regression framework for CC and CO admixture mapping
tests and used this representation of the models to formulate a BMA approach that
combines both models as a weighted average. Through simulations we showed that the
BMA model is more powerful than a CC method and is robust in scenarios where a CO
analysis would be prone to identifying spurious associations; namely, where ancestry
deviance fluctuations in controls are not due solely to noise.
In our application study we conducted two genome-wide interaction scans using
the CHS with childhood asthma as the outcome of interest. Our GxPM2.5 interaction scan
revealed a novel region on chromosome 22 that shows a significant interaction with the
air pollutant PM2.5 in several 2df models ($p$-value $= 5.81 \times 10^{-9}$). This finding
presents an opportunity for further investigation of the region’s role in childhood asthma
risk. In our second analysis, we identified a significant GxHispanicity interaction in a
region on chromosome 2 ($p$-value $= 9.48 \times 10^{-9}$). This region has been implicated
in the literature as being related to fetal lung development. The significant regions
identified from both analyses had marginal signals that were too weak to have been
identified in a traditional GWAS. The region identified by the GxHispanicity analysis was
identified only by 2df approaches, suggesting that 2df approaches offer more power than 1df
tests when there is even a small marginal effect.
We conducted two admixture mapping scans using the admixed populations of
African Americans and Latinos. Our scan using African ancestry in both populations
showed significant local ancestry departure in the 8q24 region, whose association with
prostate cancer risk is well documented. We also detected the previously reported 16q21 -
16q23 region in Latinos using Amerindian ancestry. Here, our analysis showed a departure
of local ancestry in cases, resulting in a genome-wide significant signal in the CO analysis
($p$-value $< 5 \times 10^{-8}$), and a marginally significant result in the BMA analysis
($p$-value $< 5 \times 10^{-6}$). Admixture mapping of multiple sclerosis in Latinos
revealed peaks on chromosomes 5, 6, 18, and 21; however, the sample size of cases proved to
be a limitation on the power to detect significant peaks. The small number of cases also
resulted in large standard errors.
6.2 Contribution to the State of the Art
Two novel approaches using a Bayesian model averaging framework were
presented in this dissertation with software available for download from GitHub. The first
method is a GxE interaction approach that combines the 2df and BMA frameworks and
offers a powerful test, which incorporates prior knowledge and model uncertainty. This
method is designed exclusively for discovery and contributes a robust and powerful
approach for discovery of genetic markers with weak marginal signals. The second BMA
approach is designed to combine CC and CO analyses for admixture mapping, and offers
an alternative to conducting and interpreting multiple models to avoid false positives while
still identifying all potential signals. The application of the BMA framework in both cases
ensures that both approaches account for model uncertainty, which is an improvement upon
the practice of choosing one model.
6.3 Future Directions
Developments underway in our BMA 2DF approach for GxE interaction are
focused on modifying the BMA 2DF approach in order to incorporate continuous exposure
variables. Future work in admixture mapping is to expand the BMA approach to
incorporate a test of marginal marker associations. This would allow for one scan to
conduct a composite marginal association and admixture test, which could be performed
genome-wide. This extension could provide investigators a multifaceted test, which may
detect associations with genomic regions that have low signals marginally and by local
ancestry.
Bibliography
Adelman, G., Rane, S. G., & Villa, K. F. (2013). The cost burden of multiple sclerosis in
the United States: a systematic review of the literature. J Med Econ, 16(5), 639-
647. doi: 10.3111/13696998.2013.778268
Raftery, A., Hoeting, J., Volinsky, C., Painter, I., & Yeung, K. Y. (2015). BMA:
Bayesian Model Averaging (Version 3.18.6). Retrieved from
http://CRAN.R-project.org/package=BMA
Agresti, A. (2002). Loglinear Models for Contingency Tables Categorical Data Analysis
(2 ed., pp. 314-356). New Jersey: John Wiley & Sons.
American Academy of Allergy Asthma and Immunology (2018). Retrieved from
https://www.aaaai.org/conditions-and-treatments/conditions-dictionary/Pediatric-Asthma
American Cancer Society. (2018).
Amezcua, L., Lund, B. T., Weiner, L. P., & Islam, T. (2011). Multiple sclerosis in
Hispanics: a study of clinical disease expression. Mult Scler, 17(8), 1010-1016.
doi: 10.1177/1352458511403025
Aschard, H. (2016). A perspective on interaction effects in genetic association studies.
Genet Epidemiol, 40(8), 678-688. doi: 10.1002/gepi.21989
Asher, M. I., Montefort, S., Bjorksten, B., Lai, C. K., Strachan, D. P., Weiland, S. K., . . .
Group, I. P. T. S. (2006). Worldwide time trends in the prevalence of symptoms
of asthma, allergic rhinoconjunctivitis, and eczema in childhood: ISAAC Phases
One and Three repeat multicountry cross-sectional surveys. Lancet, 368(9537),
733-743. doi: 10.1016/S0140-6736(06)69283-0
Barnard, G. A. (1958). Studies in the history of probability and statistics: IX. Thomas
Bayes's essay towards solving a problem in the doctrine of chances. Biometrika,
45(3-4), 293-295. doi: 10.1093/biomet/45.3-4.293
Barone-Adesi, F., Dent, J. E., Dajnak, D., Beevers, S., Anderson, H. R., Kelly, F. J., . . .
Whincup, P. H. (2015). Long-Term Exposure to Primary Traffic Pollutants and
Lung Function in Children: Cross-Sectional Study and Meta-Analysis. PLoS
ONE, 10(11), e0142565. doi: 10.1371/journal.pone.0142565
Basu, A., Tang, H., Arnett, D., Gu, C. C., Mosley, T., Kardia, S., . . . Risch, N. (2009).
Admixture Mapping of Quantitative Trait Loci for BMI in African Americans:
Evidence for Loci on Chromosomes 3q, 5q, and 15q. Obesity, 17(6), 1226-1231.
doi: 10.1038/oby.2009.24
Basu, A., Tang, H., Lewis, C. E., North, K., Curb, J. D., Quertermous, T., . . . Risch, N. J.
(2009). Admixture mapping of quantitative trait loci for blood lipids in African-
Americans. Human Molecular Genetics, 18(11), 2091-2098. doi:
10.1093/hmg/ddp122
Bensen, J. T., Xu, Z., McKeigue, P. M., Smith, G. J., Fontham, E. T., Mohler, J. L., &
Taylor, J. A. (2014). Admixture mapping of prostate cancer in African Americans
participating in the North Carolina-Louisiana Prostate Cancer Project (PCaP).
Prostate, 74(1), 1-9. doi: 10.1002/pros.22722
Bernardo, J. M., & Smith, A. F. M. (2000). Bayesian theory. Chichester; New York:
Wiley.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate
analysis: theory and practice. Cambridge, Mass.; London: M.I.T. Press.
Bock, C. H., Schwartz, A. G., Ruterbusch, J. J., Levin, A. M., Neslund-Dudas, C., Land,
S. J., . . . Rybicki, B. A. (2009). Results from a prostate cancer admixture
mapping study in African-American men. Hum Genet, 126(5), 637-642. doi:
10.1007/s00439-009-0712-z
Boffetta, P., Winn, D. M., Ioannidis, J. P., Thomas, D. C., Little, J., Smith, G. D., . . .
Khoury, M. J. (2012). Recommendations and proposed guidelines for assessing
the cumulative evidence on joint effects of genes and environments on cancer
occurrence in humans. Int J Epidemiol, 41(3), 686-704. doi: 10.1093/ije/dys010
Bonilla, C., Parra, E. J., Pfaff, C. L., Dios, S., Marshall, J. A., Hamman, R. F., . . .
Shriver, M. D. (2004). Admixture in the Hispanics of the San Luis Valley,
Colorado, and its implications for complex trait gene mapping. Ann Hum Genet,
68(Pt 2), 139-153.
Borish, L., & Culp, J. A. (2008). Asthma: a syndrome composed of heterogeneous
diseases. Annals of Allergy, Asthma & Immunology, 101(1), 1-9. doi:
10.1016/S1081-1206(10)60826-5
Brathwaite, A. F., Brathwaite, N., & del Riego, A. (2007). Epidemiological profile of
cancer for Grand Bahama residents: 1988-2002. West Indian Med J, 56(1), 26-33.
Browne, P., Chandraratna, D., Angood, C., Tremlett, H., Baker, C., Taylor, B. V., &
Thompson, A. J. (2014). Atlas of Multiple Sclerosis 2013: A growing global
problem with widespread inequity. Neurology, 83(11), 1022-1024. doi:
10.1212/WNL.0000000000000768
Bryc, K., Durand, E. Y., Macpherson, J. M., Reich, D., & Mountain, J. L. (2015).
The Genetic Ancestry of African Americans, Latinos, and European Americans
across the United States. Am J Hum Genet, 96(1), 37-53. doi:
10.1016/j.ajhg.2014.11.010
Burkart, K. M., Sofer, T., London, S. J., Manichaikul, A., Hartwig, F. P., Yan, Q., . . .
Barr, R. G. (2018). A Genome-wide Association Study in Hispanics/Latinos
Identifies Novel Signals for Lung Function. The Hispanic Community Health
Study/Study of Latinos. Am J Respir Crit Care Med. doi: 10.1164/rccm.201707-
1493OC
Carvajal-Carmona, L. G., Ophoff, R., Service, S., Hartiala, J., Molina, J., Leon, P., . . .
Ruiz-Linares, A. (2003). Genetic demography of Antioquia (Colombia) and the
Central Valley of Costa Rica. Hum Genet, 112(5-6), 534-541. doi:
10.1007/s00439-002-0899-8
CDC.gov. (2018). Retrieved 2018, from https://www.cdc.gov/
Cheng, C. Y., Kao, W. H., Patterson, N., Tandon, A., Haiman, C. A., Harris, T. B., . . .
Reich, D. (2009). Admixture mapping of 15,280 African Americans identifies
obesity susceptibility loci on chromosomes 5 and X. PLoS Genet, 5(5), e1000490.
doi: 10.1371/journal.pgen.1000490
Cheng, C. Y., Reich, D., Haiman, C. A., Tandon, A., Patterson, N., Selvin, E., . . . Kao,
W. H. (2012). African ancestry and its correlation to type 2 diabetes in African
Americans: a genetic admixture analysis in three U.S. population cohorts. PLoS
ONE, 7(3), e32840. doi: 10.1371/journal.pone.0032840
Cropp, C. D., Robbins, C. M., Sheng, X., Hennis, A. J., Carpten, J. D., Waterman, L., . . .
Nemesure, B. (2014). 8q24 risk alleles and prostate cancer in African-Barbadian
men. Prostate, 74(16), 1579-1588. doi: 10.1002/pros.22871
Curry, E. (2013). LCA (Version 0.1) [R]: cran.r-project.org. Retrieved from
https://CRAN.R-project.org/package=LCA
Dai, J. Y., Logsdon, B. A., Huang, Y., Hsu, L., Reiner, A. P., Prentice, R. L., &
Kooperberg, C. (2012). Simultaneously testing for marginal genetic association
and gene-environment interaction. Am J Epidemiol, 176(2), 164-173. doi:
10.1093/aje/kwr521
de Bruijn, N. G. (1981). Asymptotic Methods in Analysis: Dover Publications.
DeSantis, C. E., Siegel, R. L., Sauer, A. G., Miller, K. D., Fedewa, S. A., Alcaraz, K. I.,
& Jemal, A. (2016). Cancer statistics for African Americans, 2016: Progress and
opportunities in reducing racial disparities. CA Cancer J Clin, 66(4), 290-308.
doi: 10.3322/caac.21340
Devlin, B., & Roeder, K. (1999). Genomic control for association studies. Biometrics,
55(4), 997-1004.
Dick, F. D., De Palma, G., Ahmadi, A., Osborne, A., Scott, N. W., Prescott, G. J., . . .
Felice, A. (2007). Gene‐environment interactions in parkinsonism and Parkinson's
disease: the Geoparkinson study. Occupational and Environmental Medicine,
64(10), 673-680. doi: 10.1136/oem.2006.032078
Dipierri, J. E., Alfaro, E., Martinez-Marignac, V. L., Bailliet, G., Bravi, C. M., Cejas, S.,
& Bianchi, N. O. (1998). Paternal directional mating in two Amerindian
subpopulations located at different altitudes in northwestern Argentina. Hum Biol,
70(6), 1001-1010.
Draper, D. (1995). Assessment and Propagation of Model Uncertainty. Journal of the
Royal Statistical Society Series B-Statistical Methodology, 57(1), 45-97.
Draper, D., Hodges, J. S., Leamer, E. E., Morris, C. N., & Rubin, D. B. (1987). A
research agenda for assessment and propagation of model uncertainty. Santa
Monica, CA: The Rand Corporation.
Duffy, D. L., Martin, N. G., Battistutta, D., Hopper, J. L., & Mathews, J. D. (1990).
Genetics of asthma and hay fever in Australian twins. Am Rev Respir Dis, 142(6
Pt 1), 1351-1358. doi: 10.1164/ajrccm/142.6_Pt_1.1351
Echimane, A. K., Ahnoux, A. A., Adoubi, I., Hien, S., M'Bra, K., D'Horpock, A., . . .
Parkin, D. M. (2000). Cancer incidence in Abidjan, Ivory Coast: first results from
the cancer registry, 1995-1997. Cancer, 89(3), 653-663.
Eeles, R. A., Olama, A. A., Benlloch, S., Saunders, E. J., Leongamornlert, D. A.,
Tymrakiewicz, M., . . . Easton, D. F. (2013). Identification of 23 new prostate
cancer susceptibility loci using the iCOGS custom genotyping array. Nature
Genetics, 45(4), 385-391, 391e381-382. doi: 10.1038/ng.2560
Ewens, W. J., & Spielman, R. S. (1995). The transmission/disequilibrium test: history,
subdivision, and admixture. Am J Hum Genet, 57(2), 455-464.
Fienberg, S. E. (1977). The Analysis of Cross-classified Categorical Data: MIT Press.
Freedman, M. L., Haiman, C. A., Patterson, N., McDonald, G. J., Tandon, A.,
Waliszewska, A., . . . Reich, D. (2006). Admixture mapping identifies 8q24 as a
prostate cancer risk locus in African-American men. Proceedings of the National
Academy of Sciences of the United States of America, 103(38), 14068-14073. doi:
10.1073/pnas.0605832103
Gauderman, W. J., Avol, E., Lurmann, F., Kuenzli, N., Gilliland, F., Peters, J., &
McConnell, R. (2005). Childhood Asthma and Exposure to Traffic and Nitrogen
Dioxide. Epidemiology, 16(6), 737-743.
Gauderman, W. J., Mukherjee, B., Aschard, H., Hsu, L., Lewinger, J. P., Patel, C. J., . . .
Chatterjee, N. (2017). Update on the State of the Science for Analytical Methods
for Gene-Environment Interactions. Am J Epidemiol, 186(7), 762-770. doi:
10.1093/aje/kwx228
Gauderman, W. J., Thomas, D. C., Murcray, C. E., Conti, D., Li, D., & Lewinger, J. P.
(2010). Efficient genome-wide association testing of gene-environment
interaction in case-parent trios. Am J Epidemiol, 172(1), 116-122. doi:
10.1093/aje/kwq097
Gauderman, W. J., Vora, H., McConnell, R., Berhane, K., Gilliland, F., Thomas, D., . . .
Peters, J. (2007). Effect of exposure to traffic on lung development from 10 to 18
years of age: a cohort study. Lancet, 369(9561), 571-577. doi: 10.1016/S0140-
6736(07)60037-3
Gauderman, W. J., Zhang, P., Morrison, J. L., & Lewinger, J. P. (2013). Finding novel
genes by testing G x E interactions in a genome-wide association study. Genet
Epidemiol, 37(6), 603-613. doi: 10.1002/gepi.21748
Giri, A., Hartmann, K. E., Aldrich, M. C., Ward, R. M., Wu, J. M., Park, A. J., . . .
Edwards, T. L. (2017). Admixture mapping of pelvic organ prolapse in African
Americans from the Women's Health Initiative Hormone Therapy trial. PLoS
ONE, 12(6), e0178839. doi: 10.1371/journal.pone.0178839
Gonzalez Burchard, E., Borrell, L. N., Choudhry, S., Naqvi, M., Tsai, H. J., Rodriguez-
Santana, J. R., . . . Risch, N. (2005). Latino populations: a unique opportunity for
the study of race, genetics, and social environment in epidemiological research.
Am J Public Health, 95(12), 2161-2168. doi: 10.2105/AJPH.2005.068668
Good, I. J. (1950). Probability and the weighting of evidence. London: Griffin.
Greenland, S. (2009). Interactions in epidemiology: relevance, identification, and
estimation. Epidemiology, 20(1), 14-17. doi: 10.1097/EDE.0b013e318193e7b5
Guarnieri, M., & Balmes, J. R. (2014). Outdoor air pollution and asthma. The Lancet,
383(9928), 1581-1592. doi: 10.1016/S0140-6736(14)60617-6
Hadjixenofontos, A., Beecham, A. H., Manrique, C. P., Pericak-Vance, M. A., Tornes,
L., Ortega, M., . . . Delgado, S. R. (2015). Clinical Expression of Multiple
Sclerosis in Hispanic Whites of Primarily Caribbean Ancestry.
Neuroepidemiology, 44(4), 262-268.
Haiman, C. A., Patterson, N., Freedman, M. L., Myers, S. R., Pike, M. C., Waliszewska,
A., . . . Reich, D. (2007). Multiple regions within 8q24 independently affect risk
for prostate cancer. Nature Genetics, 39(5), 638-644. doi: 10.1038/ng2015
Han, Y., Rand, K. A., Hazelett, D. J., Ingles, S. A., Kittles, R. A., Strom, S. S., . . .
Haiman, C. A. (2016). Prostate Cancer Susceptibility in Men of African Ancestry
at 8q24. J Natl Cancer Inst, 108(7). doi: 10.1093/jnci/djv431
Hindorff, L. A., Sethupathy, P., Junkins, H. A., Ramos, E. M., Mehta, J. P., Collins, F. S.,
& Manolio, T. A. (2009). Potential etiologic and functional implications of
genome-wide association loci for human diseases and traits. Proceedings of the
National Academy of Sciences of the United States of America, 106(23), 9362-
9367. doi: 10.1073/pnas.0903103106
Hirschhorn, J. N., & Daly, M. J. (2005). Genome-wide association studies for common
diseases and complex traits. Nat Rev Genet, 6(2), 95-108. doi: 10.1038/nrg1521
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model
averaging: A tutorial. Statistical Science, 14(4), 382-401.
Hsu, L., Jiao, S., Dai, J. Y., Hutter, C., Peters, U., & Kooperberg, C. (2012). Powerful
cocktail methods for detecting genome-wide gene-environment interaction. Genet
Epidemiol, 36(3), 183-194. doi: 10.1002/gepi.21610
Hunter, D. J. (2005). Gene-environment interactions in human diseases. Nat Rev Genet,
6(4), 287-298. doi: 10.1038/nrg1578
International Multiple Sclerosis Genetics Consortium, Hafler, D. A., Compston, A., Sawcer, S.,
Lander, E. S., Daly, M. J., . . . Hauser, S. L. (2007). Risk alleles for multiple
sclerosis identified by a genomewide study. N Engl J Med, 357(9), 851-862. doi:
10.1056/NEJMoa073493
Ionita-Laza, I., McQueen, M. B., Laird, N. M., & Lange, C. (2007). Genomewide
weighted hypothesis testing in family-based association studies, with an
application to a 100K scan. Am J Hum Genet, 81(3), 607-614. doi:
10.1086/519748
Irizarry-Ramirez, M., Kittles, R. A., Wang, X., Salgado-Montilla, J., Nogueras-Gonzalez,
G. M., Sanchez-Ortiz, R., . . . Pettaway, C. A. (2017). Genetic ancestry and
prostate cancer susceptibility SNPs in Puerto Rican and African American men.
Prostate, 77(10), 1118-1127. doi: 10.1002/pros.23368
Isobe, N., Gourraud, P.-A., Harbo, H. F., Caillier, S. J., Santaniello, A., Khankhanian, P.,
. . . Oksenberg, J. R. (2013). Genetic risk variants in African Americans with
multiple sclerosis. Neurology, 81(3), 219-227. doi:
10.1212/WNL.0b013e31829bfe2f
Isobe, N., Madireddy, L., Khankhanian, P., Matsushita, T., Caillier, S. J., Moré, J. M., . . .
Oksenberg, J. R. (2015). An ImmunoChip study of multiple sclerosis risk in
African Americans. Brain, 138(6), 1518-1530. doi: 10.1093/brain/awv078
Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability.
Proceedings of the Cambridge Philosophical Society, 31, 203-222.
Jersild, C., Hansen, G., Svejgaard, A., Fog, T., Thomsen, M., & Dupont, B. (1973).
Histocompatibility determinants in multiple sclerosis, with special reference to
clinical course. The Lancet, 302(7840), 1221-1225. doi: 10.1016/S0140-6736(73)90970-7
Johnson, B. A., Wang, J., Taylor, E. M., Caillier, S. J., Herbert, J., Khan, O. A., . . .
Oksenberg, J. R. (2009). Multiple sclerosis susceptibility alleles in African
Americans. Genes And Immunity, 11, 343. doi: 10.1038/gene.2009.81
Kass, R. E., & Raftery, A. E. (1993). Bayes Factors and Model Uncertainty. Seattle,
Washington: University of Washington.
Kass, R. E., & Raftery, A. E. (1995). Bayes Factors. Journal of the American Statistical
Association, 90(430), 773-795. doi: 10.1080/01621459.1995.10476572
Knol, M. J., Egger, M., Scott, P., Geerlings, M. I., & Vandenbroucke, J. P. (2009). When
one depends on the other: reporting of interaction in case-control and cohort
studies. Epidemiology, 20(2), 161-166. doi: 10.1097/EDE.0b013e31818f6651
Kolonel, L. N., Henderson, B. E., Hankin, J. H., Nomura, A. M., Wilkens, L. R., Pike, M.
C., . . . Nagamine, F. S. (2000). A multiethnic cohort in Hawaii and Los Angeles:
baseline characteristics. Am J Epidemiol, 151(4), 346-357.
Kooperberg, C., & Leblanc, M. (2008). Increasing the power of identifying gene x gene
interactions in genome-wide association studies. Genet Epidemiol, 32(3), 255-
263. doi: 10.1002/gepi.20300
Kraft, P., Yen, Y. C., Stram, D. O., Morrison, J., & Gauderman, W. J. (2007). Exploiting
gene-environment interaction to detect genetic associations. Hum Hered, 63(2),
111-119. doi: 10.1159/000099183
Kravitz-Wirtz, N., Teixeira, S., Hajat, A., Woo, B., Crowder, K., & Takeuchi, D. (2018).
Early-Life Air Pollution Exposure, Neighborhood Poverty, and Childhood
Asthma in the United States, 1990–2014. International Journal of Environmental
Research and Public Health, 15(6), 1114.
Kruglyak, L. (1999). Prospects for whole-genome linkage disequilibrium mapping of
common disease genes. Nature Genetics, 22, 139. doi: 10.1038/9642
Lander, E. S. (2011). Initial impact of the sequencing of the human genome. Nature, 470,
187. doi: 10.1038/nature09792
Lander, E. S., & Schork, N. J. (1994). Genetic dissection of complex traits. Science,
265(5181), 2037-2048.
Langer-Gould, A., Brara, S. M., Beaber, B. E., & Zhang, J. L. (2013). Incidence of
multiple sclerosis in multiple racial and ethnic groups. Neurology, 80(19), 1734-
1739. doi: 10.1212/WNL.0b013e3182918cc2
Leamer, E. E. (1978). Specification searches: ad hoc inference with nonexperimental data.
Li, D., & Conti, D. V. (2009). Detecting gene-environment interactions using a combined
case-only and case-control approach. Am J Epidemiol, 169(4), 497-504. doi:
10.1093/aje/kwn339
Liu, J., Lewinger, J. P., Gilliland, F. D., Gauderman, W. J., & Conti, D. V. (2013).
Confounding and heterogeneity in genetic association studies with admixed
populations. Am J Epidemiol, 177(4), 351-360. doi: 10.1093/aje/kws234
Liu, W., Purevdorj, E., Zscheppang, K., von Mayersbach, D., Behrens, J., Brinkhaus, M.
J., . . . Dammann, C. E. (2010). ErbB4 regulates the timely progression of late
fetal lung development. Biochim Biophys Acta, 1803(7), 832-839. doi:
10.1016/j.bbamcr.2010.03.003
Madigan, D., & Raftery, A. E. (1994). Model Selection and Accounting for Model
Uncertainty in Graphical Models Using Occam's Window. Journal of the
American Statistical Association, 89(428), 1535-1546. doi: 10.2307/2291017
Mani, A. (2017). Local Ancestry Association, Admixture Mapping, and Ongoing
Challenges. Circ Cardiovasc Genet, 10(2). doi:
10.1161/CIRCGENETICS.117.001747
Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J.,
. . . Visscher, P. M. (2009). Finding the missing heritability of complex diseases.
Nature, 461(7265), 747-753. doi: 10.1038/nature08494
Maples, B. K., Gravel, S., Kenny, E. E., & Bustamante, C. D. (2013). RFMix: a
discriminative modeling approach for rapid and robust local-ancestry inference.
Am J Hum Genet, 93(2), 278-288. doi: 10.1016/j.ajhg.2013.06.020
McConnell, R., Berhane, K., Yao, L., Jerrett, M., Lurmann, F., Gilliland, F., . . . Peters, J.
(2006). Traffic, susceptibility, and childhood asthma. Environ Health Perspect,
114(5), 766-772.
McKeigue, P. M. (2005). Prospects for admixture mapping of complex traits. Am J Hum
Genet, 76(1), 1-7. doi: 10.1086/426949
Moffatt, M. F., Gut, I. G., Demenais, F., Strachan, D. P., Bouzigon, E., Heath, S., . . .
GABRIEL Consortium. (2010). A large-scale, consortium-based genomewide association
study of asthma. N Engl J Med, 363(13), 1211-1221. doi:
10.1056/NEJMoa0906312
Molineros, J. E., Maiti, A. K., Sun, C., Looger, L. L., Han, S., Kim-Howard, X., . . .
Nath, S. K. (2013). Admixture mapping in lupus identifies multiple functional
variants within IFIH1 associated with apoptosis, inflammation, and autoantibody
production. PLoS Genet, 9(2), e1003222. doi: 10.1371/journal.pgen.1003222
Montana, G., & Pritchard, J. K. (2004). Statistical tests for admixture mapping with case-
control and cases-only data. Am J Hum Genet, 75(5), 771-789. doi:
10.1086/425281
Mucci, L. A., Hjelmborg, J. B., Harris, J. R., Czene, K., Havelick, D. J., Scheike, T., . . .
Nordic Twin Study of Cancer Collaboration. (2016). Familial Risk and Heritability of Cancer
Among Twins in Nordic Countries. JAMA, 315(1), 68-76. doi:
10.1001/jama.2015.17703
Mukherjee, B., & Chatterjee, N. (2008). Exploiting gene-environment independence for
analysis of case-control studies: an empirical Bayes-type shrinkage estimator to
trade-off between bias and efficiency. Biometrics, 64(3), 685-694. doi:
10.1111/j.1541-0420.2007.00953.x
Murcray, C. E., Lewinger, J. P., Conti, D. V., Thomas, D. C., & Gauderman, W. J.
(2011). Sample size requirements to detect gene-environment interactions in
genome-wide association studies. Genet Epidemiol, 35(3), 201-210. doi:
10.1002/gepi.20569
Murcray, C. E., Lewinger, J. P., & Gauderman, W. J. (2009). Gene-environment
interaction in genome-wide association studies. Am J Epidemiol, 169(2), 219-226.
doi: 10.1093/aje/kwn353
Murphy, A. B., Ukoli, F., Freeman, V., Bennett, F., Aiken, W., Tulloch, T., . . . Kittles,
R. A. (2012). 8q24 risk alleles in West African and Caribbean men. Prostate,
72(12), 1366-1373. doi: 10.1002/pros.22486
Nan, H., Hutter, C. M., Lin, Y., Jacobs, E. J., Ulrich, C. M., White, E., . . . on behalf of,
C. G. (2015). Association of aspirin and non-steroidal anti-inflammatory drug use
with risk of colorectal cancer according to genetic variants. JAMA, 313(11), 1133-
1142. doi: 10.1001/jama.2015.1815
Nieminen, M. M., Kaprio, J., & Koskenvuo, M. (1991). A population-based study of
bronchial asthma in adult twin pairs. Chest, 100(1), 70-75.
Nortvedt, M. W., Riise, T., Myhr, K. M., & Nyland, H. I. (1999). Quality of life in
multiple sclerosis: measuring the disease effects more broadly. Neurology, 53(5),
1098-1103.
O'Hagan, A., & Forster, J. J. (2004). Kendall's Advanced Theory of Statistics, volume 2B:
Bayesian Inference, second edition (Second ed. Vol. 2B). West Sussex, United
Kingdom: John Wiley & Sons Ltd.
Okobia, M. N., Zmuda, J. M., Ferrell, R. E., Patrick, A. L., & Bunker, C. H. (2011).
Chromosome 8q24 variants are associated with prostate cancer risk in a high risk
population of African ancestry. Prostate, 71(10), 1054-1063. doi:
10.1002/pros.21320
Oksenberg, J. R., Barcellos, L. F., Cree, B. A. C., Baranzini, S. E., Bugawan, T. L., Khan,
O., . . . Hauser, S. L. (2004). Mapping Multiple Sclerosis Susceptibility to the
HLA-DR Locus in African Americans. The American Journal of Human
Genetics, 74(1), 160-167. doi: 10.1086/380997
Ordoñez, G., Romero, S., Orozco, L., Pineda, B., Jiménez-Morales, S., Nieto, A., . . .
Sotelo, J. (2015). Genomewide admixture study in Mexican Mestizos with
multiple sclerosis. Clinical Neurology and Neurosurgery, 130, 55-60. doi:
10.1016/j.clineuro.2014.11.026
Park, S. L., Cheng, I., & Haiman, C. A. (2018). Genome-Wide Association Studies of
Cancer in Diverse Populations. Cancer Epidemiol Biomarkers Prev, 27(4), 405-
417. doi: 10.1158/1055-9965.EPI-17-0169
Parra, E. J., Marcini, A., Akey, J., Martinson, J., Batzer, M. A., Cooper, R., . . . Shriver,
M. D. (1998). Estimating African American admixture proportions by use of
population-specific alleles. Am J Hum Genet, 63(6), 1839-1851. doi:
10.1086/302148
Pasaniuc, B., Zaitlen, N., Lettre, G., Chen, G. K., Tandon, A., Kao, W. H., . . . Price, A.
L. (2011). Enhanced statistical tests for GWAS in admixed populations:
assessment using African Americans from CARe and a Breast Cancer
Consortium. PLoS Genet, 7(4), e1001371. doi: 10.1371/journal.pgen.1001371
Patterson, N., Hattangadi, N., Lane, B., Lohmueller, K. E., Hafler, D. A., Oksenberg, J.
R., . . . Reich, D. (2004). Methods for high-density admixture mapping of disease
genes. Am J Hum Genet, 74(5), 979-1000. doi: 10.1086/420871
Phillips, C. J. (2004). The cost of multiple sclerosis and the cost effectiveness of disease-
modifying agents in its treatment. CNS Drugs, 18(9), 561-574.
Piegorsch, W. W., Weinberg, C. R., & Taylor, J. A. (1994). Non-hierarchical logistic
models and case-only designs for assessing susceptibility in population-based
case-control studies. Stat Med, 13(2), 153-162.
Pino-Yanes, M., Gignoux, C. R., Galanter, J. M., Levin, A. M., Campbell, C. D., Eng, C.,
. . . Burchard, E. G. (2015). Genome-wide association study and admixture
mapping reveal new loci associated with total IgE levels in Latinos. J Allergy Clin
Immunol, 135(6), 1502-1510. doi: 10.1016/j.jaci.2014.10.033
Pritchard, J. K., Stephens, M., Rosenberg, N. A., & Donnelly, P. (2000). Association
mapping in structured populations. Am J Hum Genet, 67(1), 170-181. doi:
10.1086/302959
Raftery, A. E. (1993a). Bayesian model selection in structural equation models. In K.
Bollen & J. Long (Eds.), Testing Structural Equation Models (pp. 163-180).
Newbury Park, CA: Sage.
Raftery, A. E. (1996a). Approximate Bayes factors and accounting for model uncertainty
in generalized linear models. Biometrika, 83(2), 251-266.
Raftery, A. E., Madigan, D., & Hoeting, J. A. (1997). Bayesian Model Averaging for
Linear Regression Models. Journal of the American Statistical Association,
92(437), 179-191.
Raftery, A. E., Madigan, D. M., & Hoeting, J. (1993b). Model selection and accounting
for model uncertainty in linear regression models (Technical Report). Department of
Statistics, University of Washington.
Raftery, A. E., & Richardson, S. (1996). Model selection for generalized linear models
via GLIB, with application to epidemiology. In D. A. Berry & D. K. Stangl (Eds.),
Bayesian Biostatistics (pp. 321-354). New York: Marcel Dekker.
Raftery, A. E., & Richardson, S. (1996b). Model selection for generalized linear models via GLIB,
with application to epidemiology. In D. A. Berry & D. K. Stangl (Eds.), Bayesian
Biostatistics (pp. 321-354). New York: Marcel Dekker.
Raiffa, H., & Schlaifer, R. (1961). Applied Statistical Decision Theory. Cambridge, MA:
MIT Press.
Raiffa, H., & Schlaifer, R. (1961). Applied statistical decision theory: Division of
Research, Graduate School of Business Administration, Harvard University.
Rao, S. M., Leo, G. J., Ellington, L., Nauertz, T., Bernardin, L., & Unverzagt, F. (1991).
Cognitive dysfunction in multiple sclerosis. II. Impact on employment and social
functioning. Neurology, 41(5), 692-696.
Regal, R. R., & Hook, E. B. (1991). The effects of model selection on confidence
intervals for the size of a closed population. Stat Med, 10(5), 717-721.
Rife, D. C. (1954). Populations of hybrid origin as source material for the detection of
linkage. Am J Hum Genet, 6(1), 26-33.
Ritz, B. R., Paul, K. C., & Bronstein, J. M. (2016). Of Pesticides and Men: A California
Story of Genes and Environment in Parkinson’s Disease. Current environmental
health reports, 3(1), 40-52. doi: 10.1007/s40572-016-0083-2
Rivas-Rodríguez, E., & Amezcua, L. (2018). Ethnic Considerations and Multiple
Sclerosis Disease Variability in the United States. Neurologic Clinics, 36(1), 151-
162. doi: 10.1016/j.ncl.2017.08.007
Roberts, H. V. (1965). Probabilistic Prediction. Journal of the American Statistical
Association, 60(309), 50-62. doi: 10.1080/01621459.1965.10480774
Rosenberg, N. A., Huang, L., Jewett, E. M., Szpiech, Z. A., Jankovic, I., & Boehnke, M.
(2010). Genome-wide association studies in diverse populations. Nat Rev Genet,
11(5), 356-366. doi: 10.1038/nrg2760
Rothman, K. J. (1976). Causes. Am J Epidemiol, 104(6), 587-592. doi:
10.1093/oxfordjournals.aje.a112335
Rothman, K. J., Greenland, S., & Lash, T. L. (2008). Modern Epidemiology: Wolters
Kluwer Health/Lippincott Williams & Wilkins.
Rothman, K. J., Greenland, S., & Walker, A. M. (1980). Concepts of
interaction. Am J Epidemiol, 112(4), 467-470. doi:
10.1093/oxfordjournals.aje.a113015
Ruiz-Narvaez, E. A., Sucheston-Campbell, L., Bensen, J. T., Yao, S., Haddad, S.,
Haiman, C. A., . . . Lunetta, K. L. (2016). Admixture Mapping of African-
American Women in the AMBER Consortium Identifies New Loci for Breast
Cancer and Estrogen-Receptor Subtypes. Front Genet, 7, 170. doi:
10.3389/fgene.2016.00170
Scherr, D. S. (2014). Commentary on “Common Genetic Polymorphisms Modify the
Effect of Smoking on Absolute Risk of Bladder Cancer.” Garcia-Closas M,
Rothman N, Figueroa JD, Prokunina-Olsson L, Han SS, Baris D, Jacobs EJ,
Malats N, De Vivo I, Albanes D, Purdue MP, Sharma S, Fu YP, Kogevinas M,
Wang Z, Tang W, Tardón A, Serra C, Carrato A, García-Closas R, Lloreta J,
Johnson A, Schwenn M, Karagas MR, Schned A, Andriole G Jr., Grubb R 3rd,
Black A, Gapstur SM, Thun M, Diver WR, Weinstein SJ, Virtamo J, Hunter DJ,
Caporaso N, Landi MT, Hutchinson A, Burdett L, Jacobs KB, Yeager M,
Fraumeni JF Jr., Chanock SJ, Silverman DT, Chatterjee N, Division of Cancer
Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA.:
Cancer Res 2013;73(7):2211–20 [Epub 2013 Mar 27]. Urologic Oncology:
Seminars and Original Investigations, 32(2), 213-214. doi:
10.1016/j.urolonc.2013.08.015
Schumacher, F. R., Al Olama, A. A., Berndt, S. I., Benlloch, S., Ahmed, M., Saunders, E.
J., . . . Eeles, R. A. (2018). Association analyses of more than 140,000 men
identify 63 new prostate cancer susceptibility loci. Nature Genetics. doi:
10.1038/s41588-018-0142-8
Schumacher, F. R., Feigelson, H. S., Cox, D. G., Haiman, C. A., Albanes, D., Buring, J., .
. . Hunter, D. J. (2007). A common 8q24 variant in prostate and breast cancer
from a large nested case-control study. Cancer Res, 67(7), 2951-2956. doi:
10.1158/0008-5472.CAN-06-3591
Seldin, M. F., Pasaniuc, B., & Price, A. L. (2011). New approaches to disease mapping in
admixed populations. Nat Rev Genet, 12(8), 523-528. doi: 10.1038/nrg3002
Setakis, E., Stirnadel, H., & Balding, D. J. (2006). Logistic regression protects against
population structure in genetic association studies. Genome Res, 16(2), 290-296.
doi: 10.1101/gr.4346306
Shriner, D. (2017). Overview of Admixture Mapping. Curr Protoc Hum Genet, 94,
1.23.1-1.23.8. doi: 10.1002/cphg.44
Smith, M. W., Patterson, N., Lautenberger, J. A., Truelove, A. L., McDonald, G. J.,
Waliszewska, A., . . . Reich, D. (2004). A high-density admixture map for disease
gene discovery in African Americans. Am J Hum Genet, 74(5), 1001-1013. doi:
10.1086/420856
Sofer, T., Baier, L. J., Browning, S. R., Thornton, T. A., Talavera, G. A., Wassertheil-
Smoller, S., . . . Franceschini, N. (2017). Admixture mapping in the Hispanic
Community Health Study/Study of Latinos reveals regions of genetic associations
with blood pressure traits. PLoS ONE, 12(11), e0188400. doi:
10.1371/journal.pone.0188400
Stewart, L. (1987). Hierarchical Bayesian Analysis Using Monte Carlo Integration:
Computing Posterior Distributions When There are Many Possible Models.
Journal of the Royal Statistical Society. Series D (The Statistician), 36(2/3), 211-
219. doi: 10.2307/2348514
Szulc, P., Bogdan, M., Frommlet, F., & Tang, H. (2017). Joint genotype- and ancestry-
based genome-wide association studies in admixed populations. Genet Epidemiol,
41(6), 555-566. doi: 10.1002/gepi.22056
Tchetgen Tchetgen, E. (2011). Robust discovery of genetic associations incorporating
gene-environment interaction and independence. Epidemiology, 22(2), 262-272.
doi: 10.1097/EDE.0b013e318207ffc3
Thomas, D. (2010). Gene-environment-wide association studies: emerging approaches.
Nat Rev Genet, 11(4), 259-272. doi: 10.1038/nrg2764
Thomas, D. C., & Witte, J. S. (2002). Point: population stratification: a problem for case-
control studies of candidate-gene associations? Cancer Epidemiol Biomarkers
Prev, 11(6), 505-512.
Tienari, P. J., Kuokkanen, S., Pastinen, T., Wikström, J., Sajantila, A., Sandberg-
Wollheim, M., . . . Peltonen, L. (1998). Golli-MBP gene in multiple sclerosis
susceptibility. Journal of Neuroimmunology, 81(1), 158-167. doi: 10.1016/S0165-
5728(97)00171-9
Tierney, L., & Kadane, J. B. (1986). Accurate Approximations for Posterior Moments
and Marginal Densities. Journal of the American Statistical Association, 81(393),
82-86. doi: 10.1080/01621459.1986.10478240
Torgerson, D. G., Ampleford, E. J., Chiu, G. Y., Gauderman, W. J., Gignoux, C. R.,
Graves, P. E., . . . Nicolae, D. L. (2011). Meta-analysis of genome-wide
association studies of asthma in ethnically diverse North American populations.
Nature Genetics, 43(9), 887-892. doi: 10.1038/ng.888
Umbach, D. M., & Weinberg, C. R. (1997). Designing and analysing case-control studies
to exploit independence of genotype and exposure. Stat Med, 16(15), 1731-1743.
Ventura, R. E., Antezana, A. O., Bacon, T., & Kister, I. (2017). Hispanic Americans and
African Americans with multiple sclerosis have more severe disease course than
Caucasian Americans. Mult Scler, 23(11), 1554-1557. doi:
10.1177/1352458516679894
Viallefont, V., Raftery, A. E., & Richardson, S. (2001). Variable selection and Bayesian
model averaging in case-control studies. Stat Med, 20(21), 3215-3230.
White, H. (1982). Maximum-Likelihood Estimation of Mis-Specified Models.
Econometrica, 50(1), 1-25. doi: 10.2307/1912526
Wilson, B. D., Ricks-Santi, L. J., Mason, T. E., Abbas, M., Kittles, R. A., Dunston, G.
M., & Kanaan, Y. M. (2018). Admixture Mapping Links RACGAP1 Regulation
to Prostate Cancer in African Americans. Cancer Genomics Proteomics, 15(3),
185-191. doi: 10.21873/cgp.20076
Yeager, M., Chatterjee, N., Ciampa, J., Jacobs, K. B., Gonzalez-Bosquet, J., Hayes, R.
B., . . . Chanock, S. J. (2009). Identification of a new prostate cancer
susceptibility locus on chromosome 8q24. Nature Genetics, 41(10), 1055-1057.
doi: 10.1038/ng.444
Zhang, P., Lewinger, J. P., Conti, D., Morrison, J. L., & Gauderman, W. J. (2016).
Detecting Gene-Environment Interactions for a Quantitative Trait in a Genome-
Wide Association Study. Genet Epidemiol, 40(5), 394-403. doi:
10.1002/gepi.21977
Zhu, X., & Wang, H. (2017). The Analysis of Ethnic Mixtures. Methods Mol Biol, 1666,
505-525. doi: 10.1007/978-1-4939-7274-6_25
Zscheppang, K., Giese, U., Hoenzke, S., Wiegel, D., & Dammann, C. E. L. (2013).
ErbB4 is an upstream regulator of TTF-1 fetal mouse lung type II cell
development in vitro. Biochim Biophys Acta, 1833(12), 2690-2702. doi:
10.1016/j.bbamcr.2013.06.030
Zwibel, H. L., & Smrtka, J. (2011). Improving quality of life in multiple sclerosis: an
unmet need. Am J Manag Care, 17 Suppl 5 Improving, S139-145.
Abstract
Evidence suggests that identifying genetic contributions to the risk of complex diseases requires moving beyond independent tests of association between markers and traits. This study presents two methods within a Bayesian framework for identifying gene-by-environment (GxE) interactions and genomic regions contributing to differential disease risk by ancestry. We first introduce a GxE approach that combines a Bayesian framework with a two-degree-of-freedom (2df) test structure for a simultaneous test of main and interaction effects. Simulations are used to compare classical and more complex GxE approaches in current use and to demonstrate that our proposed method performs similarly to existing 2df approaches, with increased power and robustness in numerous scenarios. A second approach is introduced to perform admixture mapping, which maps susceptibility loci for complex disease using parental ancestry. Our admixture mapping approach provides a linear regression framework in which we reformulate the often-used case-control and case-only statistics as nested regression models that are combined within a Bayesian model selection framework. Simulation is used to demonstrate that this approach offers greater power and robustness than using the case-control or case-only statistics alone. We conduct two genome-wide interaction studies (GWIS) for childhood asthma, using air pollution and ethnicity as environmental factors, in a nested case-control sample from the Children’s Health Study (CHS). We also conduct admixture mapping of prostate cancer (PrCa) in African Americans and Latinos from the Multiethnic Cohort, and of multiple sclerosis in Hispanic Whites, using our proposed method alongside the case-control and case-only methods.
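As a rough illustration of how candidate models can be combined in a Bayesian model averaging setting, the sketch below converts per-model BIC values into approximate posterior model probabilities and uses them to average an effect estimate. This is a generic textbook approximation, not the dissertation's implementation: the BIC-based weights, the function names `bma_weights` and `bma_estimate`, and all numeric values are assumptions made for illustration only.

```python
import math


def bma_weights(bics, priors=None):
    """Approximate posterior model probabilities from BIC values.

    Uses P(M_k | data) proportional to prior_k * exp(-BIC_k / 2).
    With no priors given, all models are treated as equally likely a priori.
    """
    k = len(bics)
    if priors is None:
        priors = [1.0 / k] * k
    # Subtract the smallest BIC before exponentiating for numerical stability;
    # the shift cancels when the weights are normalized.
    best = min(bics)
    unnorm = [p * math.exp(-0.5 * (b - best)) for b, p in zip(bics, priors)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def bma_estimate(estimates, weights):
    """Model-averaged estimate: weighted sum of the per-model estimates."""
    return sum(w * e for w, e in zip(weights, estimates))


# Hypothetical BIC values and effect estimates for two candidate models
# fitted to the same data (made-up numbers, not results from the study).
bics = [1012.4, 1009.1]
estimates = [0.21, 0.35]

weights = bma_weights(bics)
averaged = bma_estimate(estimates, weights)
print(weights, averaged)
```

The model with the lower BIC receives the larger weight, so the averaged estimate leans toward that model while still reflecting uncertainty about which model is correct.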
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Combination of quantile integral linear model with two-step method to improve the power of genome-wide interaction scans
Characterization and discovery of genetic associations: multiethnic fine-mapping and incorporation of functional information
Efficient two-step testing approaches for detecting gene-environment interactions in genome-wide association studies, with an application to the Children’s Health Study
Comparisons of four commonly used methods in GWAS to detect gene-environment interactions
High-dimensional regression for gene-environment interactions
Bayesian multilevel quantile regression for longitudinal data
A genome wide association study of multiple sclerosis (MS) in Hispanics
Bayesian hierarchical models in genetic association studies
Minimum p-value approach in two-step tests of genome-wide gene-environment interactions
Population substructure and its impact on genome-wide association studies with admixed populations
The influence of DNA repair genes and prenatal tobacco exposure on childhood acute lymphoblastic leukemia risk: a gene-environment interaction study
Two-step testing approaches for detecting quantitative trait gene-environment interactions in a genome-wide association study
Bayesian models for a respiratory biomarker with an underlying deterministic model in population research
Missing heritability may be explained by the common household environment and its interaction with genetic variation
Two-step study designs in genetic epidemiology
Functional based multi-level flexible models for multivariate longitudinal data
Statistical methods and analyses in the Multiethnic Cohort (MEC) human gut microbiome data
Stochastic inference for deterministic systems: normality and beyond
Polygenic analyses of complex traits in complex populations
Identification of differentially connected gene expression subnetworks in asthma symptom
Asset Metadata
Creator
Moss, Lilit Chemenyan (author)
Core Title
Bayesian model averaging methods for gene-environment interactions and admixture mapping
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
07/31/2018
Defense Date
06/19/2018
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
admixture,Bayesian model averaging,BMA,childhood asthma,gene-environment,interactions,mapping,OAI-PMH Harvest
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Conti, David (committee chair), Amezcua, Lilyana (committee member), Gauderman, William James (committee member), Stram, Daniel (committee member), Thomas, Duncan (committee member)
Creator Email
chemenya@usc.edu,liliths8686@yahoo.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-44142
Unique identifier
UC11671921
Identifier
etd-MossLilitC-6584.pdf (filename),usctheses-c89-44142 (legacy record id)
Legacy Identifier
etd-MossLilitC-6584.pdf
Dmrecord
44142
Document Type
Dissertation
Rights
Moss, Lilit Chemenyan
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA