USING MULTI-LEVEL BAYESIAN HIERARCHICAL MODEL
TO DETECT RELATED MULTIPLE SNPS WITHIN MULTIPLE GENES
TO DISEASE RISK
by
Lewei Duan
A Thesis Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(BIOSTATISTICS)
May 2013
Copyright 2013 Lewei Duan
DEDICATION
To my family and friends
ACKNOWLEDGEMENTS
I am grateful to Dr. Duncan Thomas, Dr. Paul Marjoram, Dr. David Conti, and Dr.
Stanley Azen for their invaluable guidance and support.
TABLE OF CONTENTS
DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
ABSTRACT
CHAPTER 1: INTRODUCTION
Previous Approaches to Variant Selection
Motivation for the Multi-level Bayesian Hierarchical Model
Organization
CHAPTER 2: MOTIVATING EXAMPLE: THE WECARE STUDY OF DNA DAMAGE RESPONSE PATHWAY IN SECOND BREAST CANCER FOLLOWING RADIOTHERAPY
DNA Double-Strand Breaks Response Pathway and Breast Cancer
WECARE study design
Summary
CHAPTER 3: THE GENERAL BAYESIAN FRAMEWORK
Bayesian Framework
Bayes Factors
Bayesian Hierarchical Models Structure
CHAPTER 4: MULTI-LEVEL BAYESIAN HIERARCHICAL MODEL: A NOVEL APPROACH FOR PATHWAY ANALYSIS OF GENETIC ASSOCIATION STUDIES
Hierarchical Modeling: Applications to Genetic Data
Multi-level Bayesian Hierarchical Model: Statistical Methods
Model I
Model II
Fitting the Model
Posterior Summarization
CHAPTER 5: SIMULATION STUDIES
CHAPTER 6: APPLICATION TO THE WECARE DATA
CHAPTER 7: DISCUSSION
REFERENCES
LIST OF TABLES
Table 1: Interpretation of Bayes Factors for association (Kass and Raftery, 1995)
Table 2: Frequency Distribution of SNPs based on their MAFs
Table 3: Bayes Factors at the SNP level and gene level from Model I and Model II for Selected SNPs
Table 4: Association Between Selected Variants in DNA-Damage Response Genes and CBC risk
LIST OF FIGURES
Figure 1: Directed acyclic graph describing the structure of the model.
Figure 2: Graphical representation of the A matrix derived from the Gene Ontology.
Figure 3: Comparison of estimated parameters based on different priors.
Figure 4: The correlation analysis between Estimated and True parameters.
Figure 5.1: Simulation analysis for Model I. Sensitivity/Specificity based on proportions of the number of SNPs (a). Sensitivity based on Bayes Factors of SNPs (b). Bayes factors of genes (c).
Figure 5.2: Simulation analysis for Model II. Sensitivity/Specificity based on proportions of the number of SNPs (a). Sensitivity based on Bayes Factors of SNPs (b). Bayes factors of genes (c).
Figure 6.1: Simulation analysis for Model I by MAF. Sensitivity/Specificity based on proportions of the number of SNPs categorized by MAF (Left panels). Sensitivity based on Bayes Factors of SNPs categorized by MAF (Right panels).
Figure 6.2: Simulation analysis for Model II by MAF. Sensitivity/Specificity based on proportions of the number of SNPs categorized by MAF (Left panels). Sensitivity based on Bayes Factors of SNPs categorized by MAF (Right panels).
Figure 7: Trace plots of Conditional LogLikelihood for Model I (a) and for Model II (b).
Figure 8: Gene-Disease association analysis.
Figure 9: SNP inclusion analysis.
ABBREVIATIONS
Abbreviation  Description
BF            Bayes Factor
BRI           Bayesian Risk Index
CBC           Contralateral Breast Cancer
CDCV          Common Disease Common Variants
CI            Confidence Interval
DSB           Double-strand Breaks
GO            Gene Ontology
GWAS          Genome-Wide Association Scan
Gy            Gray
HRR           Homologous Recombination Repair
IR            Ionizing Radiation
LD            Linkage Disequilibrium
MCMC          Markov Chain Monte Carlo
MLE           Maximum Likelihood Estimation
MRN           MRE11A-RAD50-NBN complex
NHEJ          Non-homologous End Joining
OR            Odds Ratio
RERF          Radiation Effects Research Foundation
RR            Relative Risk
RT            Radiation Treatment
SNP           Single Nucleotide Polymorphism
SSGS          Stochastic Search Gene Suggestion
UBC           Unilateral Breast Cancer
WECARE        Women's Environmental Cancer And Radiation Epidemiology Study
ABSTRACT
We propose a novel statistical method to investigate the involvement of
multiple genes thought to be part of a common pathway for a particular disease. If a
gene is identified to be associated with the disease, we are also interested in
discovering which single nucleotide polymorphisms (SNPs) within this gene are
responsible for this association. Here we present a Bayesian hierarchical modeling
strategy that allows for multiple SNPs within each gene, with external prior
information at either the SNP or gene level. The model involves variable selection at
the SNP level through latent indicator variables and Bayesian shrinkage at the gene
level towards a prior mean vector and covariance matrix that depend on external
information. The entire model is fitted using Markov Chain Monte Carlo methods.
Simulation studies show good ability to recover the underlying model. The method
is applied to data on 504 SNPs in 38 candidate genes involved in DNA damage
response in the Women's Environmental Cancer And Radiation Epidemiology study
(WECARE) of second breast cancers in relation to radiotherapy exposures.
Key words: pathway models; candidate genes; hierarchical Bayes models; breast
cancer; DNA damage response; WECARE study.
CHAPTER 1: INTRODUCTION
There is a growing literature on methods for pathway modeling, motivated in
large part by an interest in mining genome-wide association scan (GWAS) data for
commonalities across related genes that individually may not achieve genome-wide
significance, but in the aggregate may point to novel pathways (see [Wang et al.
2007] for a review of gene set enrichment analysis and alternatives). Our goal here
is more modest, guided by an a priori selection of strong candidate genes (Thomas
2010a). Like other methods of pathway analysis, however, we aim to exploit
external knowledge about the biological function of each gene and the relationships
between them (Thomas 2010ab).
A major focus of investigation in recent years has been the search for genetic
risk factors for diseases of interest. However, genetic association studies using
conventional statistical tools have been hindered by complications of multiple
comparisons or selective reporting. Hierarchical modeling may improve the
analyses of genetic data by increasing the precision of the risk estimates and
reducing the likelihood of false positives via shrinkage toward the prior mean
(Hung et al. 2007).
The identification of susceptibility and/or resistance alleles in candidate-
gene studies provides direct evidence that these genes and their biological pathways
are relevant to specific diseases in humans. High-throughput genotyping
technologies such as whole-genome SNP scans have made it possible to find genetic
variants that contribute to the risk of common diseases in the past few years
(Hunter 2005). An agnostic GWAS is currently underway within the Women’s
Environmental Cancer And Radiation Epidemiology (WECARE) study, but we
focus on methods for pathway analysis in candidate gene association studies. The
aim of our study is to provide a comprehensive modeling strategy for examining the
effects of all genes and their SNPs in a pathway, and to apply our approach to
variant selection and gene association analysis in the WECARE data.
Previous Approaches to Variant Selection
Selection of predictors to be included in a multiple regression model has long
been a crucial problem in statistics, from both frequentist and Bayesian
perspectives. Variable selection via Gibbs sampling was proposed for selecting
promising subsets (George and McCulloch 1993) by embedding the regression setup
in a hierarchical normal mixture model where latent variables are used to identify
subset choices. The variables with higher posterior probabilities are identified as
the promising subsets of predictors. The Gibbs sampler is used to indirectly sample
from this multinomial posterior distribution on the set of possible subsets. The
more frequently the subsets appear in the Gibbs sample, the higher the probability
is that they have predictive value.
George and McCulloch (1997) also compared various approaches for prior
specification and posterior computation for Bayesian hierarchical variable selection.
They pointed out that a variety of Markov Chain Monte Carlo (MCMC) algorithms
can be constructed based on the Gibbs sampler and Metropolis-Hastings algorithms,
or even a combination of both, for the purpose of posterior exploration. They also
noted that, when the goal is prediction, it is more appropriate to average predictions
over the posterior distribution than to use predictions from any single model.
Clyde et al. (1996) and Raftery et al. (1993) have discussed the potential of
prediction averaging in the face of uncertainty about variable selection. This concept
has been implemented in our model.
Most of the previous methodological literature on the design of genome-wide
association studies and most applications have used statistical significance as the
sole criterion. However, various investigators have recognized that giving higher
priority to a targeted subset of SNPs with greater biological plausibility of true
association can lead to higher predictive power (Botstein and Risch 2003; Sohns et
al. 2009; Tintle et al. 2009). Information from previously published investigations
indicating the effects of SNPs can be considered for studies that attempt to replicate
association findings (Rebbeck et al. 2004). Methods that prioritize the choice of
genetic variants to be genotyped in molecular epidemiological studies can provide
some useful guidance to optimally select variants for candidate gene association
studies. For instance, 21 SNPs selected by GWAS for association with primary breast
cancer were genotyped in the WECARE study population along with other SNPs in the DNA
damage response pathway, and several of these were found to increase risk for
contralateral breast cancer (Teraoka et al. 2011).
Preliminary analysis of the genetic information based on previous GWAS
findings can provide evidence that will help prioritize the strongest constellations of
results. Using pathway analysis of GWAS results to prioritize genes and pathways
within a biological context is one of the common analytic methods to combine
preliminary GWAS statistics to identify genes, alleles and pathways for deeper
investigations (Cantor et al. 2010). Moreover, Lewinger et al. (2007) illustrated a
hierarchical regression approach that would allow appropriate use of multiple
sources of prior knowledge, so-called “prior covariates”, without prejudging their
informativeness. If at least some of the prior covariates have predictive value, the
ranking by posterior expectations performs better at selecting the true positive
associations than a simple ranking of p-values. For instance, the large number of
markers in a GWAS enables the data to suggest which prior covariates are
correlated with the strength of the marker-disease association.
Motivation for the Multi-level Bayesian Hierarchical Model
Our approach is motivated by many of the hierarchical methods used in
previous studies. For instance, our model is motivated in part by SSGS (Swartz et al.
2006), which implements a stochastic search with a prior covariance structure to find
genes associated with disease; using a prior covariance structure
improves variable selection under the normal linear model. Our model is also
motivated in part by recent work on methods for testing associations with multiple
rare variants using Bayesian methods in cancer data (Wilson 2010b, Quintana et al.
2011), and in next generation sequencing data (Quintana and Conti 2013), where
testing any single variant is nearly impossible because of their rarity and the
enormous multiple comparisons penalty. This motivates our choice of a burden
index for gene-level associations comprising simple -1/0/+1 weights with model
averaging across their uncertainty distribution. An extensive review of hierarchical
Bayesian methodology for variable selection in regression models can be found in
Rockova et al. (2012).
Our starting point is a model for multiple rare variants proposed by Quintana
et al. (2011), which collapses all the variants within a gene into a single “burden” type
index, similar to quite a number of other recent proposals (see Basu and Pan [2011]
for a recent review and comparison by simulation), but extended to allow for both
deleterious and protective effects and to explicitly allow for uncertainty about which
variants to include in the model (and which direction for those that are included) by
Bayesian model averaging. Hoffman and Witte (2010) introduced a step-up variable
selection approach that allows for deleterious and protective effects, but does not
consider model uncertainty except in the form of a permutation procedure for the
overall significance test, so is unable to assess the importance and direction of
particular variants or alternative models. Chen et al. (2010) describe a somewhat
similar model that combines variable selection at the SNP level with shrinkage at the
gene level. In this thesis, we extend this approach to multiple genes, incorporating
prior covariates and prior gene-gene similarity information in a Bayesian
hierarchical modeling framework.
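
As a rough illustration of the burden-type index described above, the sketch below assumes that each SNP in a gene carries a weight of +1 (deleterious), -1 (protective), or 0 (excluded), and that the index for a subject is the weighted sum of minor-allele counts. The function names, the genotype coding, and the example weight draws are hypothetical and are not taken from the thesis; averaging the index over sampled weight vectors is only one simple way to picture model averaging across the uncertainty in the weights.

```python
def burden_index(genotypes, weights):
    """Signed burden index for one subject and one gene.

    genotypes: minor-allele counts (0, 1, or 2) for each SNP in the gene.
    weights:   +1 (deleterious), -1 (protective), or 0 (SNP excluded).
    """
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must have the same length")
    if any(w not in (-1, 0, 1) for w in weights):
        raise ValueError("weights must be -1, 0, or +1")
    return sum(w * g for w, g in zip(weights, genotypes))


def model_averaged_index(genotypes, weight_draws):
    """Average the burden index over sampled weight configurations."""
    indices = [burden_index(genotypes, w) for w in weight_draws]
    return sum(indices) / len(indices)


# Hypothetical subject: minor-allele counts at four SNPs in one gene.
genotypes = [1, 0, 2, 1]

# Hypothetical draws of the weight vector (e.g. from an MCMC sampler).
weight_draws = [
    [1, 0, -1, 0],
    [1, 0, 0, 0],
    [1, 1, -1, 0],
]

averaged = model_averaged_index(genotypes, weight_draws)
```

A weight of zero removes a SNP from the index, so the same representation covers both variable selection (which SNPs enter) and direction of effect (the sign of the weight).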
Organization
The thesis is organized as follows. Chapter 1 (this chapter) outlines the
statistical problem and introduces methods previously used for variable selection.
We provide an overview of the motivations and inspirations for our new models.
Chapter 2 presents a motivating example: the WECARE study of DNA damage
response pathway in second breast cancer following radiotherapy. We describe the
WECARE study design, and summarize our study aim and strategy in this chapter.
Chapter 3 provides background on the Bayesian framework, and introduces general
concepts and components of hierarchical Bayes models and some of their
applications. Chapter 4 briefly reviews the applications of hierarchical models in
genetic studies, and proposes our novel models: multi-level Bayesian hierarchical
models for pathway analysis in gene-disease association studies. We describe the
likelihood, define the hierarchical priors, and introduce the implementation of the
Markov chain to explore the posterior distribution. Next, in chapter 5, we evaluate
the performance of our multi-level Bayesian hierarchical models on simulated data.
In chapter 6, we examine the results from our multi-level Bayesian hierarchical
models applied to the real WECARE data. Finally, we conclude in chapter 7 with a
discussion of the results and implications of our modeling strategies, as well as
potential future work.
CHAPTER 2: MOTIVATING EXAMPLE: THE WECARE STUDY OF DNA
DAMAGE RESPONSE PATHWAY IN SECOND BREAST CANCER
FOLLOWING RADIOTHERAPY
DNA Double-Strand Breaks Response Pathway and Breast Cancer
Various kinds of DNA damage, such as double-strand breaks (DSB), can result
in development of cancer. Ionizing radiation (IR), for example, is known to cause
DSBs. Such damage invokes various DNA damage response mechanisms including
homologous recombination repair (HRR, a type of genetic recombination in which
nucleotide sequences are exchanged between two similar or identical molecules of
DNA), non-homologous end joining DNA repair (NHEJ, in contrast to HRR, repairs
DSBs in DNA by directly ligating the break ends without the need for a homologous
template), apoptosis (biochemical events that lead to characteristic cell changes and
death) and cell cycle arrest (a regulatory process that halts progression through the
cell cycle during one of the normal phases (G1,S,G2,M)).
The invocation of DNA damage response pathways results in activation of
several DNA damage response molecules, some of which play pivotal roles in
sensing DNA damage and initiating the subsequent DNA repair process. These can
repair the damage or cause cell cycle arrest or apoptosis to protect the tissue from
disease, but can also permanently misrepair the damage in a manner that can lead
to cancer. In human cancer cells, some of the proteins in the DNA damage response
pathways are mutated or non-functional. The survival and proliferation of the
cancer cells are reliant on an impaired DNA damage response pathway, which
provides a therapeutic concept for IR treatment. Germline variation in these genes
can also lead to an inherent susceptibility to cancer or sensitivity to exposures like
IR, modifying an individual’s risk following exposure. It is this latter constitutional
genetic susceptibility to second cancers following radiotherapy that is investigated
here, rather than the response of the first cancer to treatment.
Based on a comprehensive program of medical follow-up of survivors of the
atomic bombings of Hiroshima and Nagasaki, Japan, the Radiation Effects Research
Foundation (RERF) has produced quantitative estimates of cancer risk from
exposure to IR. In particular, RERF studies have found an excess risk of breast
cancer in female atomic bomb survivors (Land 1995), decreasing with increasing
age at exposure (Tokunaga, Land et al. 1987).
The influence of IR on breast cancer risk has been further explored in many
other studies, notably various cohorts of medical patients exposed to diagnostic or
therapeutic radiation. For instance, in the WECARE Study, Stovall et al. (2008)
showed an association with radiation dose to the contralateral breast, modified by
age at exposure and time since exposure.
The regulation of the DSB repair pathway induced by IR involves many genes in
a complex structure. Mutations in ATM, MRE11A, RAD50, and NBN lead to genetic
disorders and deficiency that are associated with increased cellular sensitivity to IR
(Helleday et al. 2008). The MRE11A-RAD50-NBN (MRN) complex recognizes the
DSB and stabilizes the broken strands of DNA at the break (Lee and Paull 2005,
Paull and Lee 2005). MRN recruits ATM to the damage site and activates ATM to
phosphorylate a number of downstream targets (Lee et al. 2010). These
downstream targets, including CHEK2, NBN, MDC1, and TP53BP1, amplify the
damage signal by stabilizing proteins such as MRN at the DSB sites and recruiting
MDC1, TP53BP1, and others. This process activates multiple signaling cascades and
invokes several DNA damage response mechanisms including cell-cycle checkpoint
arrest, DNA repair, and apoptosis (Lavin 2008).
Thus, variation in genes involved in the DNA DSB response
pathway is a candidate risk factor for radiation-induced breast cancer. The
WECARE Study (Bernstein et al. 2004) is a typical example of a study aimed at a
comprehensive examination of genes involved in a particular pathway, in our case
the response to DSBs induced by ionizing radiation.
WECARE study design
Genetic epidemiology has been historically dominated by the use of family-
based designs from which inherited susceptibility can be inferred. With the advent
of methods for assessing DNA-sequence variability directly, association studies
using unrelated individuals became more popular. With evidence accrued from
large association (case-control) studies, it is possible to leverage much more
accurate predictions about the impact of genes on disease risk by aggregating the
results from individual rare variants on the basis of characteristics shared by groups
of variants (Capanu and Begg 2011). The WECARE study was designed with the
purpose of examining the joint roles of radiation exposure and genetic susceptibility
in the etiology of breast cancer. It is a multicenter, population-based, nested case-
control study, where cases are women with asynchronous contralateral breast
cancer (CBC) and controls are women with unilateral breast cancer (UBC)
(Bernstein et al. 2004). Participants were identified, recruited, and interviewed
through four population-based cancer registries in the United States that are part of
the National Cancer Institute’s Surveillance, Epidemiology, and End Results
program: the Los Angeles County Cancer Surveillance Program; Cancer Surveillance
System of the Fred Hutchinson Cancer Research Center (Seattle); State Health
Registry of Iowa; and Cancer Surveillance Program of Orange County/San Diego-
Imperial Organization of Cancer Control (Orange County/San Diego). The fifth
registry from which participants were recruited was the Danish Cancer Registry
(Bernstein et al., 2004).
A total of 708 women with CBC were selected as cases from a cohort of 52,536 women
with histologically confirmed breast cancer reported to one of the five
population-based cancer registries. Cases were required to meet the following criteria: (1)
diagnosed between January 1st, 1985 and December 31st, 2000 with UBC followed
by a second primary, in situ or invasive, breast cancer in the contralateral breast,
diagnosed at least 1 year later; (2) resided in the same study reporting area for both
diagnoses; (3) had no previous or intervening cancer diagnosis; (4) were under age
55 years at the time of diagnosis of the first primary breast cancer; (5) were alive at
the time of contact; and (6) provided informed consent, completed an interview, and
provided a blood sample. A 1-year interval between first and second breast cancer
diagnosis was used to rule out synchronous disease (Begg et al. 2008, Bernstein et al.
2004, Borg et al. 2010).
A total of 1399 women in the cohort of survivors of a first breast cancer served as
controls. They were selected based on the same criteria as cases except that they
were not diagnosed with CBC by December 31st, 1999, and that they had not had
prophylactic mastectomy of the contralateral breast during the at-risk interval. (The
time between cases’ two diagnoses defined the “at-risk interval”.) Two controls
were individually matched to each case on year of birth (in 5-year strata), year of
diagnosis (in 4-year strata), registry region, and race/ethnicity. In addition, cases
and controls were counter-matched on registry-reported radiation exposure to
improve statistical efficiency (Langholz and Goldstein 1996, Langholz et al. 2003),
so that each case-control triplet contained two radiation exposed women and one
unexposed (Bernstein et al. 2004). Counter-matching brings the marginal full cohort
exposure information into the sample, and hence is a more efficient sampling design
that increases the variability in exposure values over that of random sampling. In
particular, the counter-matching technique implemented in the WECARE study
results in the following situation in each study set: 1) for each exposed case, one
exposed and one unexposed control were selected from the relevant stratum, 2) for
each unexposed case, two exposed controls were selected. Here, the exposure status
refers to the indicator of radiation treatment as recorded in the cancer registry
where the subjects were identified. The counter-matching technique ensured that
each triplet contributed to the analysis, avoiding the situation where all members of
a matched set had the same radiation status.
The only difference between an analysis of a counter-matched sample and
that of a simple random sample is the inclusion of the weights in the model. The
weights depend on the number of exposed and unexposed subjects in the risk set.
An offset term, the log of the counter-matching weights, is added to the
log-linear model with a coefficient fixed at one. This offset is included in all analyses
to account for the counter-matched sampling design.
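
To make the role of the offset concrete, the following sketch assumes the standard counter-matching weight, namely the number of subjects in the risk set with a given registry-reported exposure status divided by the number sampled from that stratum. The thesis states only that the weights depend on the numbers of exposed and unexposed subjects in the risk set, so this exact form, the function names, and the risk-set counts are illustrative assumptions rather than the WECARE implementation.

```python
import math


def counter_matching_offsets(n_in_risk_set, n_sampled):
    """Offset log(n_l / m_l) for each sampling stratum l.

    n_in_risk_set: stratum -> number of subjects in the full risk set.
    n_sampled:     stratum -> number of subjects sampled into the matched set.
    """
    return {
        stratum: math.log(n_in_risk_set[stratum] / n_sampled[stratum])
        for stratum in n_sampled
    }


def case_probability(linear_predictors, offsets, case_index=0):
    """Conditional-likelihood contribution of one matched set.

    Each member contributes exp(linear predictor + offset); the offset has
    a coefficient fixed at one, as in a weighted conditional logistic model.
    """
    scores = [math.exp(lp + off) for lp, off in zip(linear_predictors, offsets)]
    return scores[case_index] / sum(scores)


# Hypothetical risk set: 300 exposed and 100 unexposed women.
# Each triplet contains two exposed women and one unexposed woman.
offsets_by_stratum = counter_matching_offsets(
    {"exposed": 300, "unexposed": 100},
    {"exposed": 2, "unexposed": 1},
)

# Triplet ordered as (case, control, control); the case is exposed here.
triplet_status = ["exposed", "exposed", "unexposed"]
triplet_offsets = [offsets_by_stratum[s] for s in triplet_status]

# With all linear predictors equal to zero, only the weights matter.
p_case = case_probability([0.0, 0.0, 0.0], triplet_offsets)
```

In this example the weights are 150 for each exposed member and 100 for the unexposed member, so the case contributes 150 / 400 = 0.375 when all regression coefficients are zero.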
The original WECARE study was designed to focus on mutations in the ATM
gene, which plays a central role in recognition of DSBs (Bernstein et al. 2003,
Bernstein et al. 2005, Bernstein et al. 2006). Concannon and colleagues (2008) have
shown that in the same study population, there is an increased CBC risk associated
with rare variants in ATM that are predicted to be deleterious by evolutionary
conservation and a reduced CBC risk associated with some common ATM variants.
Furthermore, a statistically significant increase in CBC risk (Relative Risk [RR] = 2.0,
95%CI 1.1-3.9) has been found among those women who carry rare deleterious
ATM variants and were exposed to radiation treatment to the contralateral breast,
compared with non-carriers who did not receive the radiation treatment (Bernstein et al.,
2010).
Many genes (e.g., BRCA1, BRCA2, CHEK2, ATM) found to be associated with
increased susceptibility to breast cancer function within a common biochemical
pathway involved in signaling and coordinating the DSB response (Thompson and
Easton 2004). The WECARE study was then extended to include BRCA1, BRCA2, and
CHEK2, which are all involved in DNA damage responses, and later still to a broader
set of 38 candidate genes involved in this and other DSB damage responses. The
main effects of ionizing radiation have been previously reported (Stovall et al. 2008,
Langholz et al. 2009), as have marginal associations with ATM (Bernstein et al. 2006,
Concannon et al. 2008), BRCA1/2 (Begg et al. 2008, Bernstein et al. 2010, Borg et al.
2010, Figueiredo et al. 2010, Capanu et al. 2011, Quintana et al. 2011), CHEK2
(Mellemkjaer et al. 2008), and interactions of radiation with ATM (Bernstein et al.
2010) and of BRCA1/2 with treatments and reproductive factors (Poynter et al.
2010, Reding et al. 2010), amongst other risk factors.
Summary
Ionizing radiation causes DSBs in DNA, which can invoke several DNA
damage response mechanisms, such as HRR, NHEJ, apoptosis and cell cycle arrest.
We selected 38 candidate genes that are involved in the DSB damage response pathway.
We are interested in exploring whether there are associations between some of the
candidate genes and the outcome. If there is an association between a gene and the
disease, we want to find out which SNPs within this gene are driving this
association. In what follows, multilevel posterior probabilities, conditional
probabilities and Bayes Factors are used to address these questions. The SNP
inclusion probabilities are grouped according to the genes they belong to. The gene
inclusion probability is the posterior probability that at least one SNP within the
gene is associated with the disease. This gene inclusion posterior probability can be
defined as either the sum of the posterior probabilities of all models that include at
least one of the SNPs within the given gene, or simply as one minus the posterior
probability that none of the SNPs within the corresponding gene is included.
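
The two definitions of the gene inclusion probability can be estimated directly from sampled indicator vectors. The sketch below assumes the sampler output is available as a list of 0/1 inclusion vectors, one per draw; the function name, the toy draws, and the SNP-to-gene mapping are hypothetical.

```python
def inclusion_probabilities(indicator_draws, gene_of_snp):
    """Estimate SNP-level and gene-level posterior inclusion probabilities.

    indicator_draws: list of draws; each draw is a list of 0/1 inclusion
                     indicators, one per SNP.
    gene_of_snp:     gene label for each SNP, in the same order.
    """
    n_draws = len(indicator_draws)
    n_snps = len(gene_of_snp)

    # SNP level: fraction of draws in which each SNP is included.
    snp_prob = [
        sum(draw[j] for draw in indicator_draws) / n_draws
        for j in range(n_snps)
    ]

    # Gene level: one minus the fraction of draws with no SNP of that gene.
    gene_prob = {}
    for gene in sorted(set(gene_of_snp)):
        columns = [j for j, g in enumerate(gene_of_snp) if g == gene]
        n_none = sum(
            1 for draw in indicator_draws
            if not any(draw[j] for j in columns)
        )
        gene_prob[gene] = 1.0 - n_none / n_draws
    return snp_prob, gene_prob


# Hypothetical sampler output: four draws over three SNPs in two genes.
draws = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
]
genes = ["ATM", "ATM", "CHEK2"]

snp_prob, gene_prob = inclusion_probabilities(draws, genes)
```

Here the gene-level probability for ATM (0.75) exceeds either of its SNP-level probabilities (0.5 each), because different draws include different SNPs within the gene.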
CHAPTER 3: THE GENERAL BAYESIAN FRAMEWORK
Bayesian Framework
The WECARE data involves multiple genes and environmental factors, which
interact in complex ways. Most association studies for discovery of novel main
effects and interactions in the previous analyses of WECARE data were based on
standard frequentist approaches. Lately, Bayesian methods have been used in
related WECARE investigations. Capanu et al. (2011) presented a hierarchical
statistical modeling approach involving pseudo-likelihood estimation of the relative
risk parameters with Bayesian estimation of the variance components, to assess
the risks of CBC associated with rare BRCA1 and BRCA2 variants in the WECARE data. Quintana et al.
(2011) introduced the Bayesian risk index (BRI) methodology incorporating model
uncertainty in investigating the aggregate effects of multiple rare variants in BRCA1
and BRCA2 on risk of developing CBC with WECARE data.
The Bayesian framework can be applied to a wide range of statistical models.
Many advantages have been found in using Bayesian methods to model the
relationships in an integrated “systems biology” manner, particularly in a
genome-wide setting (Wilson et al. 2010ab). For instance, it is intuitive to make an inference
within a Bayesian framework, and the Bayesian models are extremely flexible.
Hence, the Bayesian approaches are particularly suited for solving complex genetic
questions.
We chose a Bayesian framework for application to the WECARE study
because it provides a very natural setting for incorporating complex structures and
multiple parameters. In addition, the Bayesian framework allows us to incorporate
external information in the analysis in various ways, such as by specifying prior
probability distributions for the parameters of interest and integrating known
correlation information among covariates into its variance-covariance matrix for a
specified distribution.
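Bayes rule is fundamental to a Bayesian approach:
$$\mathrm{pr}(\theta \mid D) = \frac{\mathrm{pr}(\theta, D)}{\mathrm{pr}(D)} = \frac{\mathrm{pr}(D \mid \theta)\,\mathrm{pr}(\theta)}{\mathrm{pr}(D)}$$
where θ denotes a vector of the parameters and D denotes observed data;
$\mathrm{pr}(\theta, D)$ represents the joint probability; and $\mathrm{pr}(\theta \mid D)$ and $\mathrm{pr}(D \mid \theta)$ represent conditional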
probabilities. In the frequentist approach, the parameters of the model are assumed
fixed and unknown; instead, the Bayesian framework treats the parameters as
random variables. Hence, we specify both the prior distribution of the parameter
and the likelihood of the observed data in order to define the joint probability
model. Intuitively, Bayes rule allows us to update our current beliefs about the
unknown parameter θ given observed data D via the conditional
probability $\mathrm{pr}(\theta \mid D)$.
Previous experiences with genetic studies revealed that it is relatively
infrequent to detect SNP associations. The approach we implemented to explore
multiple hypotheses integrates potential submodels into a single hierarchical model.
Bayes Factors (BFs) are used to compare composite hypotheses in a Bayesian
variable selection framework. Hence, the values of BFs derived from the model
depend on the prior distribution over models. Therefore, selection of appropriate
priors on the model space is important.
There are many ways of formalizing a prior in Bayesian approaches. A
simpler prior formalization is preferred. There are many forms of prior
incorporation, such as lasso (Tibshirani, 1996), BIC (Chen and Chen, 2008), and
Beta-Binomial prior (Wilson et al, 2010b). The “Spike and Slab” prior was
introduced as a Bayesian approach to subset selection for the regression coefficients
in the linear regression model (Mitchell and Beauchamp, 1988). This prior assumes
that the regression coefficients are mutually independent with a two-component
mixture distribution made up of a uniform flat distribution (the slab) and a
degenerate distribution at zero (the spike).
Ishwaran and Rao (2005) introduced a Bayesian rescaled spike and slab
hierarchical model, which was specifically designed for the multi-group gene
detection problem for DNA microarray analysis. This rescaled spike and slab model
reduces model uncertainty and is effective for prediction. They showed that the
spike and slab model identified important biological signals and minimized false
detection. Ishwaran et al. (2010) also applied Bayesian spike and slab models to
compute the estimators in correlated high dimensional problems based on weighted
generalized ridge regression, and demonstrated the effectiveness of this model.
Spike and slab models and their applications have become more and more popular
for challenging problems in different fields over the last decade. One of the most
important advantages for these models is selective shrinkage (Ishwaran and Rao,
2011), which allows the posterior mean to shrink toward zero only for non-
informative variables.
The prior selection for our model is in part inspired by the concept of a
“spike and slab” prior distribution. We extended the “spike and
slab” concept by adding a direction of effect. Instead of a two-component mixture distribution, the
prior on SNP inclusion indicators in our model follows a 3-point distribution
(−1, 0, +1), in which most regression coefficients are zero and a few coefficients have
nonzero effects in either the risk-causing or the protective direction.
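A draw from such a three-point prior can be sketched as follows. This is only an illustration: the function name is hypothetical, and the probabilities of 0.05 for each direction are example values rather than the gene-specific prior probabilities used in the model.

```python
import numpy as np

def sample_three_point(n_snps, p_neg, p_pos, rng):
    """Draw indicators in {-1, 0, +1} with pr(-1) = p_neg and pr(+1) = p_pos."""
    if p_neg < 0 or p_pos < 0 or p_neg + p_pos > 1:
        raise ValueError("p_neg and p_pos must be non-negative and sum to at most 1")
    probs = [p_neg, 1.0 - p_neg - p_pos, p_pos]
    return rng.choice([-1, 0, 1], size=n_snps, p=probs)

rng = np.random.default_rng(0)
w = sample_three_point(1000, p_neg=0.05, p_pos=0.05, rng=rng)
fraction_excluded = float((w == 0).mean())  # most indicators are zero
```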
Markov chain Monte Carlo methods have been widely used to approximate
the posterior distribution of the models (for review and application of MCMC
methods, see [Gilks et al. 1998]). Basically MCMC methods entail iterative sampling
of each unknown in the model (e.g., latent variables or parameters) in turn,
conditional on the observed data and the current values of all the other unknowns;
this process continues for many iterations, ultimately yielding samples from the
joint posterior distribution of all parameters given the data after a “burn in” period
for convergence. Specifically, we use the Metropolis-Hastings algorithm (MH) for
obtaining a sequence of random samples from the multi-dimensional probability
distribution. Basically, a random walk MH algorithm explores models in the
neighborhood of the current model by proposing a new model based on a random
change to the current model, which is then accepted or rejected with a defined
probability (Robert and Casella, 2004).
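The random walk over models described above can be sketched as follows. This is a toy illustration under stated assumptions, not the sampler used in this work: the state is a vector of indicators in {−1, 0, +1}, the proposal changes one randomly chosen indicator to one of its two other values (a symmetric move), and `toy_log_post` is a made-up unnormalized log posterior.

```python
import numpy as np

def mh_model_search(log_post, w0, n_iter, rng):
    """Random-walk Metropolis-Hastings over indicator vectors in {-1, 0, +1}."""
    w = np.array(w0, dtype=int)
    current = log_post(w)
    samples = np.empty((n_iter, w.size), dtype=int)
    for t in range(n_iter):
        proposal = w.copy()
        j = rng.integers(w.size)                      # pick one indicator at random
        others = [v for v in (-1, 0, 1) if v != w[j]]
        proposal[j] = others[rng.integers(2)]         # symmetric change of state
        candidate = log_post(proposal)
        # Accept with probability min(1, posterior ratio)
        if np.log(rng.random()) < candidate - current:
            w, current = proposal, candidate
        samples[t] = w
    return samples

def toy_log_post(w):
    """Made-up target that favours +1 at the first position and 0 elsewhere."""
    target = np.array([1, 0, 0, 0])
    return -3.0 * np.sum(w != target)

rng = np.random.default_rng(1)
samples = mh_model_search(toy_log_post, [0, 0, 0, 0], n_iter=5000, rng=rng)
inclusion = (samples[1000:] != 0).mean(axis=0)  # discard a burn-in of 1000 scans
```

Because the proposal is symmetric, the acceptance probability reduces to the ratio of posterior densities; an asymmetric proposal would also need the Hastings correction.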
Bayes Factors
Jeffreys (1935, 1961) developed the Bayesian approach for quantifying the
evidence in favor of a scientific hypothesis. The centerpiece of this approach was to
compose an index comparing the predictions made by two competing scientific
theories. This approach introduced statistical models to compare the probability of
the data according to each of the two hypotheses, and used Bayes’ theorem to
compute the posterior probability that one of the hypotheses is favored. On the
basis of Jeffreys’ Bayesian approach, Kass and Raftery (1993, 1995) defined the
Bayes Factor (BF) as the ratio of the posterior odds of the alternative hypothesis to its
prior odds. For our analysis, we use Bayes factors at the SNP-level and the gene-
level to measure the change of the evidence provided by data for one hypothesis to
the other.
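Assume two hypotheses ($H_1$ and $H_2$) are to be compared, where $\mathrm{pr}(H_1) = 1 - \mathrm{pr}(H_2)$. Let D represent data, and the posterior probability $\mathrm{pr}(H_1 \mid D) = 1 - \mathrm{pr}(H_2 \mid D)$.
Based on Bayes’s theorem, we can derive
$$\mathrm{pr}(H_i \mid D) = \frac{\mathrm{pr}(D \mid H_i)\,\mathrm{pr}(H_i)}{\mathrm{pr}(D \mid H_1)\,\mathrm{pr}(H_1) + \mathrm{pr}(D \mid H_2)\,\mathrm{pr}(H_2)} \quad (i = 1, 2),$$
therefore,
$$\frac{\mathrm{pr}(H_1 \mid D)}{\mathrm{pr}(H_2 \mid D)} = BF \times \frac{\mathrm{pr}(H_1)}{\mathrm{pr}(H_2)}$$
where BF is defined as
$$BF = \frac{\mathrm{pr}(D \mid H_1)}{\mathrm{pr}(D \mid H_2)}$$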
Specifically, we are interested in exploring which gene is associated with the
outcome of interest (contralateral breast cancer), and we address the following
hypotheses:
Null hypothesis: There is no association between the gene of interest and the outcome.
Alternative hypothesis: At least one SNP within the gene is associated with the outcome of
interest.
We are also interested in finding out which SNP is driving the association:
Null hypothesis: There is no association between the SNP of interest and the outcome.
Alternative hypothesis: The SNP of interest is associated with the outcome.
We can therefore calculate the Bayes factor for each of the above hypotheses
respectively:
Bayes Factors offer a way of evaluating evidence in favor of a null hypothesis,
and provide a way of incorporating external information into the evaluation of
evidence about a hypothesis. The advantages of Bayes factors over conventional
significance tests include that the Bayes factors are very general and do not require
alternative models to be nested.
We use Bayes Factors at both the SNP-level and the gene-level to compare
the posterior odds provided by data to the prior odds for a pair of hypotheses. The
detailed formula can be found in the section on posterior summarization in Chapter
4. Our interpretation of the significance of the analysis results (Table 1) is based on
the Bayes Factors summary of the evidence proposed by Jeffreys (1961), as
modified by Kass and Raftery (1995). We are essentially looking for genes and SNPs
which yield Bayes Factors greater than 3, or approximately 2ln(BF) greater than 2, as
candidate variables for positive associations with disease.
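The cut points above can be applied mechanically. The sketch below is illustrative only: the function name is hypothetical, and the category labels follow the Kass and Raftery (1995) scale for evidence in favor of the alternative hypothesis.

```python
import math

def kass_raftery_category(bf):
    """Return (2 ln BF, evidence label) for a Bayes factor in favor of the alternative."""
    if bf <= 0:
        raise ValueError("Bayes factor must be positive")
    two_ln_bf = 2.0 * math.log(bf)
    if bf > 150:
        label = "very strong"
    elif bf > 20:
        label = "strong"
    elif bf > 3:
        label = "positive"
    elif bf > 1:
        label = "not worth more than a bare mention"
    else:
        label = "no evidence for the alternative"
    return two_ln_bf, label

two_ln_3, _ = kass_raftery_category(3.0)  # 2 ln(3) is about 2.2, hence the rounded cut point of 2
```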
Bayesian Hierarchical Models Structure
Conventional genetic analyses often ignore the external information about
the set of plausible models or their parameters. Hierarchical modeling offers a
solution to problems of multiple inference by incorporating higher-level “prior”
models into a conventional analysis (Greenland 1993). By “borrowing strength”
from the similarities in one’s data, the hierarchical modeling approach can give
estimates that are more plausible and stable than conventional approaches (Witte
1997).
A typical hierarchical Bayes modeling setup might involve two levels. The
first level might be a traditional regression model for the outcome of interest in
relation to various main effects and interactions. The second level can be a model for
the first level regression coefficients in relation to prior covariates that describe
characteristics of the variables, in our case, the pathways they act in. The prior
covariates can be derived from various pathway ontology databases such as Gene
Ontology, Ingenuity Pathways Analysis, Protein Analysis Through Evolutionary
Relationships, or literature mining. The second level model could include
information on gene networks in the covariance of the first level coefficients. One
can add additional levels to allow for SNPs within genes, different variables
accounting for each environmental factor, or to distinguish each type of main effect
or interaction (Thomas 2010b).
Take the analysis of GWAS data as an example. GWASs were designed to
decipher the genetic basis of complex phenotypes. They first investigate hundreds
of thousands of SNPs across the genome, and then further evaluate the most-
promising SNPs with additional subjects, for replication or a joint analysis. The
conventional approach ignores the extensive information known about the SNPs
and entails simply selecting SNPs with the smallest association p-values using a
standard maximum-likelihood test (Satagopan et al. 2002). Instead of assuming
each SNP is equally causal, one can quantitatively incorporate existing information
about the SNPs into the analysis in a weighted false discovery approach, using
prespecified weights (Pe'er et al. 2006, Roeder et al. 2006, Sun et al. 2006).
However, as the number of prior covariates grows, the chance of sparse strata also
grows, and it becomes increasingly difficult to evaluate data with these approaches.
Application of a hierarchical modeling framework avoids these limitations. It
combines various types of prior information simultaneously, and distinguishes true
causal variants from noise more clearly by estimating the weights using the
predictive value of the covariates in a multiple regression fashion, using all the
GWAS data (Chen and Witte 2007).
Assume one collected data on multiple related exposures of interest X and an
outcome of interest Y. One wants to estimate the coefficients β for the effects of X on
Y. Conventionally, one would estimate β from a generalized linear model for the
expectation of Y conditional on X,
$$g_1\{E(Y \mid X)\} = X\beta \qquad (1)$$
where $g_1$ is a monotonic differentiable strictly increasing link function between
random and systematic components, and Y has mean E(Y|X) and variance $\mathrm{Var}(Y \mid X)$.
Conventional analytic approaches to estimating β using (1) include: 1) fitting a full
model containing all of the exposures; 2) reducing a full model with a preliminary
testing algorithm; and 3) constructing numerous one-at-a-time models.
Unfortunately, none of the approaches properly addresses issues of multiple
comparisons (Morris 1983, Thomas et al. 1985).
Hierarchical modeling provides a coherent framework for multiple inference
problems, using shrinkage estimation to improve estimation accuracy (Greenland
1993; Witte et al. 1994). Prior knowledge about gene functions, protein interactions,
and disease pathways has been used in various hierarchical modeling approaches
for a single study (Capanu and Begg 2011, Capanu et al. 2011, Chen and Thomas
2010, Conti et al. 2003, Hoffmann et al. 2010, Hung et al. 2007, Lewinger et al. 2007,
Quintana et al. 2011, Shahbaba et al. 2012, Thomas et al. 2009a, Wilson et al.
2010b), and has been incorporated into joint modeling of related studies with
different designs (Li et al. 2013).
Assuming that one also has information about the expectation and variances
or covariances of the parameter of interest (β), hierarchical modeling uses higher
level “priors” to model β as random variables whose joint distribution is a function
of hyperparameters. Such additional information can be used in a second-stage
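generalized linear model for the expectation of β conditional on this information,
$$g_2\{E(\beta \mid Z)\} = Z\pi \qquad (2)$$
where $g_2$ is a strictly increasing link function, β has mean $E(\beta \mid Z)$ and variance $\mathrm{Var}(\beta \mid Z)$, π is the vector of second-stage coefficients, and Z
is a second-stage design matrix expressing the similarities between the β.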
One can combine results from the different level models by taking weighted
averages to obtain hierarchical estimates (posterior estimates). Weights in this
context reflect how well each stage was able to estimate that level’s parameters (see
[Chen and Witte 2007] for the implementation of hierarchical models on GWAS data
structures).
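For intuition, the weighted average can be written out for the simplest case. The sketch below assumes a normal first-stage estimate with known variance and a normal second-stage prior with known variance, in which case the weights are the inverse variances; the function name and the numbers are hypothetical.

```python
import numpy as np

def shrinkage_estimate(beta_hat, var_hat, prior_mean, tau2):
    """Inverse-variance weighted average of first-stage and second-stage estimates."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    var_hat = np.asarray(var_hat, dtype=float)
    prior_mean = np.asarray(prior_mean, dtype=float)
    w_first = 1.0 / var_hat   # precision of the conventional estimate
    w_second = 1.0 / tau2     # precision of the second-stage (prior) mean
    post_mean = (w_first * beta_hat + w_second * prior_mean) / (w_first + w_second)
    post_var = 1.0 / (w_first + w_second)
    return post_mean, post_var

# Two coefficients with the same estimate but different precision
post_mean, post_var = shrinkage_estimate(
    beta_hat=[1.0, 1.0], var_hat=[0.1, 1.0], prior_mean=[0.0, 0.0], tau2=0.5
)
```

The imprecisely estimated second coefficient is pulled much further toward the prior mean than the first, which is the "borrowing strength" behavior described earlier.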
CHAPTER 4: MULTI-LEVEL BAYESIAN HIERARCHICAL MODEL: A
NOVEL APPROACH FOR PATHWAY ANALYSIS OF GENETIC
ASSOCIATION STUDIES
Hierarchical Modeling: Applications to Genetic Data
The hierarchical modeling approach has been applied in a variety of
situations (Witte 1994, Greenland 1992), particularly in association studies of
candidate genes or regions (Thomas et al 1992, Witte 1997, Kim et al 2001, Conti
and Witte 2003, Hung et al. 2004, Liu et al 2005). Thomas et al. (1992) presented a
hierarchical empirical-Bayes approach for testing associations with large numbers
of candidate genes. The simulation results indicated that hierarchical empirical-
Bayes was superior to maximum likelihood with or without haplotype effects. Witte
(1997) concluded that hierarchical modeling generally gives better estimates than
conventional maximum likelihood as long as the higher level models can provide a
reasonable approximation to reality.
Thus, hierarchical modeling is remarkably well suited to genetic analysis,
where a large amount of information is collected with relevant descriptors for
each variant (Witte 1997). Theoretical and simulation work has demonstrated the
potential improvement possible when incorporating higher level models, especially
for evaluation of large amounts of data on a limited number of subjects (Morris
1983, Greenland 1993, Witte 1994, Witte and Greenland 1996). This is precisely the
situation faced by GWAS. Chen and Witte (2007) illustrated how a hierarchical
method can be used to determine an optimal ranking of SNPs for follow-up in GWAS.
One can enrich the overall GWAS signal by including existing information and
borrowing strength from similarities among SNPs in a hierarchical model.
Mapping the genes for a complex disease requires finding multiple genetic
loci that may contribute to the onset of the disease. Hierarchical modeling has been
successfully adapted to such genetic epidemiology research. In order to model the
probability of transmitting a risk allele to a diseased child for case-parent triad
data, Swartz et al. (2006) proposed Stochastic Search Gene Suggestion (SSGS), a
hierarchical Bayesian model using conditional logistic regression likelihood and
specifically incorporating genetic covariance. Hung et al (2007) applied an
empirical-Bayes approach allowing the prior knowledge regarding the evolutionary
biology and physicochemical properties of the variant to be incorporated into the
hierarchical model, in order to improve the estimation of risk for specific variants
involved in DNA repair and cell cycle control pathways. Wilson et al. (2010b) presented an
efficient Bayesian hierarchical model search strategy for multilevel inference on SNP
associations. This method searches over the space of genetic markers and
alternative genetic parameterizations, and shows higher power to detect association
than standard procedures in simulation studies. These hierarchical Bayesian
models use techniques incorporating prior information into the risk estimates,
stochastically search variables to explore the posterior distribution on the model
space, and use Bayesian model averaging methods to make inference about the
importance of the variables.
In addition, Capanu et al. (2008) have employed hierarchical modeling using
pseudo-likelihood and Gibbs sampling methods to estimate the relative risk of
individual rare variants using data from a case-control study and showed that one
can draw strength from the aggregating power of hierarchical models to distinguish
the variants that contribute to cancer risk. They further studied the statistical
properties of the method and investigated the validity of these hierarchical
modeling techniques (Capanu and Begg 2011). They concluded that hierarchical
modeling is a promising strategy to interpret the evidence from future association
studies that involve sequencing of known or suspected cancer genes (Capanu et al.
2011).
In GWASs, there have been successes in identifying SNPs associated with
complex and common diseases; in breast cancer, many genome-wide association
studies have also been conducted (Easton et al. 2007, Hunter et al. 2007, Gold et al.
2008, Thomas et al. 2009c). The premise of these studies, which involve the use of
common SNPs, is that the major influences on the risk of common diseases (such as
cancer) are likely to involve common variants. This premise has been challenged,
since only a very small proportion of the overall phenotypic variance was explained
by the discovered disease-susceptibility loci for most common diseases (Gorlov et al.
2008, Maher 2008, Schork et al. 2009). Multiple rare variants may influence a trait
either on their own or synergistically with common variants (Bansal et al 2010).
Quintana et al. (2011) sought to avoid the Common Disease Common Variant
(CDCV) hypothesis by constructing a Bayesian risk index based on multiple rare
variants within a region. Further evidence shows that selected rare variants in
known causal genes confer very high risk, while common variants appear to convey
little or no risk; in particular, the genetic risk of breast cancer may be primarily
caused by rare variants (Capanu et al. 2011). Moreover, the improvements of
hierarchical modeling estimates over conventional logistic regression increase as
the variants become rarer (Capanu and Begg 2011).
Both common variants and rare variants contribute to individual
susceptibility to common complex disease, because the allelic architecture of
susceptibility variants has a wide range of allele frequencies and effect sizes (Wen et
al. 2004, Azzopardi et al. 2008, McCarthy 2009, Schork et al. 2009). The usual
association tests for common variants are underpowered for detecting low
frequency variants. Several collapsing methods have been proposed to increase the
power to model multiple markers simultaneously by combining a set of rare
variants and testing their collective frequency in cases vs. controls (Li and Leal
2008, Madsen and Browning 2009, Morgenthaler and Thilly 2007). However, a
major drawback of these methods is that they neglect the directions of the effect, i.e.,
protective or risk causing, of individual variants. Extensive reviews of statistical
methods for rare variants can be found in Asimit and Zeggini (2010), Morris and
Zeggini (2010), and Bansal et al. (2010). Powerful alternative tools are required to
detect variants with various frequencies associated with disease of interest.
Multi-level Bayesian Hierarchical Model: Statistical Methods
We propose a novel approach for pathway analysis of genetic association
studies: Multi-level Bayesian hierarchical modeling. The first (subject)-level model
specification can be described as a logistic regression model that relates the mean of
binary outcome (disease status) to a subset of predictor variables:
$$\operatorname{logit}\,\mathrm{pr}(Y = 1) = \sum_j I_j\,\beta_j\,X_j$$
where Y is a binary outcome variable for an individual, $X_j$ denotes some general
structure or parameterization of a set of covariates, $\beta_j$ is the effect of $X_j$ on the
outcome of interest Y, and $I_j$ indicates whether covariate j is included in the model.
The regression coefficients $\beta_j$ and variable inclusion indicators $I_j$ in this model are
then further modeled in the higher levels of the hierarchy. By averaging across the
distribution of inclusion indicators, one can draw inferences about the probability
and magnitude of the effects of individual genes or SNPs.
Treating each polymorphism as independent may be appropriate in a GWAS
study, since little or no external information about most SNPs is known. However, in
a pathway-driven study, the combination of external knowledge such as the
association of gene with disease, and gene set information such as gene-gene
correlations, may be used as prior biological knowledge about gene function to
facilitate more powerful analysis (for review of using pathway information in
candidate gene studies, see Thomas et al. [2009], Volk et al. [2008], Wang et al.
[2010], Wilson et al. [2010a]). Hence, we specify the relations among the observed
data with three stages to create a joint probability model.
Our model is based on a hierarchical Bayes framework comprising a subject-
level model for the association between genes and disease and a gene-level model
for the regression coefficients of the gene-disease association parameters. The
subject-level model is specified in terms of a SNP-burden for each gene (the SNP
level), comprised of the number of positively associated SNPs minus the number of
negatively associated SNPs.
If covariates are highly correlated, the posterior probability of an association
will be diluted or distributed across several correlated covariates due to competing
evidence for an association. Hence, we used Pearson correlation coefficients among
SNPs within a gene to eliminate the diffusion of the significance among SNPs with
strong linkage disequilibrium (LD). If a set of SNPs within the same gene have a
correlation coefficient beyond some arbitrary value, one of the SNPs is selected to be
included in the model with equal probability. Since there is no standard threshold,
we chose a Pearson correlation coefficient of 0.8 as a cut point.
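The screening step can be sketched as follows. Several details are assumptions made for illustration: a "set" of correlated SNPs is taken to be a connected group in which SNPs are linked when their absolute Pearson correlation exceeds the cut point, and the function name and toy genotypes are hypothetical.

```python
import numpy as np

def prune_correlated_snps(genotypes, threshold=0.8, rng=None):
    """Keep one randomly chosen SNP from each group of highly correlated SNPs.

    genotypes: array of shape (n_subjects, n_snps) for the SNPs of one gene.
    Returns the sorted column indices of the SNPs that are kept.
    """
    rng = np.random.default_rng() if rng is None else rng
    r = np.atleast_2d(np.abs(np.corrcoef(genotypes, rowvar=False)))
    linked = r > threshold          # NaN correlations compare as not linked
    unvisited = set(range(r.shape[0]))
    kept = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        group, stack = [], [start]
        while stack:                # collect all SNPs connected to `start`
            j = stack.pop()
            group.append(j)
            for k in list(unvisited):
                if linked[j, k]:
                    unvisited.remove(k)
                    stack.append(k)
        kept.append(int(rng.choice(group)))  # equal probability within the group
    return sorted(kept)

rng = np.random.default_rng(2)
snp_a = rng.integers(0, 3, size=200)
snp_c = rng.integers(0, 3, size=200)
toy_genotypes = np.column_stack([snp_a, snp_a, snp_c])  # first two SNPs in perfect LD
kept = prune_correlated_snps(toy_genotypes, threshold=0.8, rng=rng)
```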
After the initial screening, the choice of whether a SNP is included or not and,
if included, its direction is governed by prior probabilities that in principle vary
across genes or across SNPs within genes. In the simulation, we assume the
probability of SNP inclusion and its direction is inversely related to the size of the
gene. Each gene has a log relative risk coefficient that also has a prior distribution
that depends upon a vector of external gene-specific “prior covariates” and a
covariance matrix that depends upon external gene-gene connection information.
The overall model is represented as a directed acyclic graph in Figure 1. For the
simulations and the analysis of the real WECARE data, we extracted the information
from Gene Ontology (GO) for the 38 WECARE candidate genes to construct the prior
covariate and correlation information, as described in more detail in the simulation
section.
Model I
Level 1: The subject-level model for case-control data is a conditional logistic model
of the form
$$\operatorname{logit}\,\mathrm{pr}(Y_i = 1) = X_i'\alpha + \sum_g e^{\beta_g}\, G_g(W_g, S_{ig}) + \mathrm{offset}_i$$
where $Y_i$ denotes the case-control status for the ith individual; X denotes a vector of
fixed covariates (confounders) and α is its coefficient vector; G denotes a function
of the raw SNP genotypes for gene g containing $NS_g$ SNPs, with weights $W_{gs}$ ∈ {−1,
0, 1}. In the WECARE study, an offset term is the log value of the counter-matching
weights, as explained in Chapter 2.
Note that the exponential function of $\beta_g$ in the logistic model ensures that the
effects of each gene will be positive, thereby avoiding the label-switching problem
that would arise if the signs of $\beta_g$ and all the $W_{gs}$ were reversed for a given gene.
This also avoids having to deal with truncated normal distributions if $\beta_g$ were
constrained to be positive.
Level 2: The logistic model has regression coefficients $\beta_g$ given by the gene-level of
the hierarchical model:
$$\beta_g = Z_g'\pi + b_g + e_g \qquad (3)$$
where
$$\pi = (\pi_0, \ldots, \pi_{NZ}) \sim N(0, V I)$$
$$b = (b_1, \ldots, b_{NG}) \sim N(0, \tau^2 A)$$
$$e = (e_1, \ldots, e_{NG}) \sim N(0, \sigma^2 I)$$
We incorporate information regarding the relations between the factors into
the design matrix Z, here structured as a gene by pathway matrix of binary values,
each indicating whether a gene is in a particular pathway. Basically, Z contains
second-stage covariates for each of the genetic factors. π is a column vector of
coefficients corresponding to these higher-level effects, and is assigned a normal
probability distribution. We incorporate the prior gene-gene connection
information in the A matrix for b with a multivariate normal distribution centered at
zero. The term e is included as a residual error, also given a zero mean independent
normal distribution, with $\sigma^2$ specifying the residual variance of the second-stage
covariates.
Level 3: The SNP-level deterministic model is designated to obtain G, where each
gene is uniquely determined by SNP inclusion parameters in the model and by sets
of previous states of these SNPs. G serves as a design matrix of genetic factors (we
may consider additional environmental factors and interaction terms in the first
level of the model, if needed) for the individuals within the study. In other words,
the function serves as a risk index for each gene:
$$G_g(W, S) = \sum_{s=1}^{NS_g} W_{gs}\, S_{gs} \qquad (4)$$
where weights $W_{gs}$ = −1, 0, or +1 have probabilities
(5)
and
denotes the average number of SNPs within a gene; we assigned c to be the
minimum number of SNPs within any gene; $NS_g$ denotes the number of SNPs within
gene g. This form of prior probabilities on the SNP indicator variables keeps the
expected number of SNPs included in the model to be roughly similar across genes,
while allowing genes with more SNPs to have similar probabilities of being included
as genes with fewer SNPs. For now, we treat the φs as fixed parameters, but these too
could be given hyperpriors.
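The gene-level risk index can be sketched as a weighted sum of genotypes. This is an illustration only, assuming the index is the sum over the gene's SNPs of the weight times the genotype count, in line with the SNP-burden description given earlier; the function name and the toy data are hypothetical.

```python
import numpy as np

def gene_risk_index(genotypes, weights):
    """Risk index per subject for one gene: sum over SNPs of weight times genotype.

    genotypes: array of shape (n_subjects, n_snps), e.g. minor-allele counts.
    weights:   array of shape (n_snps,) with entries in {-1, 0, +1}.
    """
    s = np.asarray(genotypes, dtype=float)
    w = np.asarray(weights, dtype=float)
    if not np.isin(w, (-1, 0, 1)).all():
        raise ValueError("weights must be -1, 0, or +1")
    return s @ w

toy_genotypes = np.array([[0, 1, 2],
                          [1, 0, 0],
                          [2, 2, 1]])
toy_weights = np.array([1, 0, -1])  # SNP 1 risk-causing, SNP 2 excluded, SNP 3 protective
index = gene_risk_index(toy_genotypes, toy_weights)
```

Excluded SNPs (weight 0) drop out of the index, and a protective SNP offsets a risk-causing one, which collapsing methods that ignore direction cannot represent.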
The posterior estimates for the association parameters resulting from the
three-stage hierarchical Bayesian analysis are an inverse-variance weighted average
between the conventional estimates from the logistic regression only and the
estimated conditional second-stage means, Zπ . Between the maximum likelihood
first-stage estimates and the second stage prior estimates, the weights will favor the
one with smaller variance. This intuitive weight adjustment is one of the important
differences between the Bayesian hierarchical approach and the single-stage logistic
regression analysis.
Finally, the variance components have standard inverse gamma hyperprior
distributions:
$$\sigma^2 \sim IG(df_e, E)$$
$$\tau^2 \sim IG(df_b, B)$$
Model II
Model II shares many similarities with Model I. The differences are stated
below.
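Level 1: The subject-level model for case-control data is also a conditional logistic
model, but the coefficient of each gene effect is $\beta_g$, instead of the exponential of $\beta_g$ in
Model I:
$$\operatorname{logit}\,\mathrm{pr}(Y_i = 1) = X_i'\alpha + \sum_g \beta_g\, G_g(W_g, S_{ig}) + \mathrm{offset}_i$$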
Level 2: The second level of the hierarchical model still aims to obtain the logistic
regression coefficient $\beta_g$, but is in the form:
where φ denotes the probability density function of the normal distribution, and Φ denotes its
cumulative distribution function. This is a proper density for $\beta_g$, since
thus the probability distribution of $\beta_g$ integrates to one.
Level 3: The third level of Model II remains the same as Model I (Equation 4) with
the same probabilities for the weights (Equation 5).
Fitting the Model
Model I
Model fitting is performed in series of MCMC steps. The methods to update each
parameter in each iteration are described below.
The coefficients (α) of subject-level confounders are updated using a single
Newton-Raphson iteration towards the maximum likelihood estimate (MLE)
of α, followed by a random multivariate normal update to sample the new α.
The procedure is based on the approximation that the likelihood for α is
quadratic with flat priors.
Selection of SNPs to include in the model involves evaluating the three
posterior probabilities for d = {−1, 0, +1} and selecting $W_{gs}$ with the
corresponding probability
$$[W_{gs} = d \mid Y, S, W; \beta] \propto [Y \mid \{G_g(\{W_{gs} = d, W_{-gs}\}, S_g), G_{-g}\}; \{\beta_{gsd}, \beta_{-g}\}]\,[W_{gs} = d \mid \phi_d, NS_g]$$
where $\beta_{gsd}$ is a single Newton step iteration towards the MLE of $\beta_g$ if $W_{gs}$ were
set to d.
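Update the vector of regression coefficients β using a multivariate
Metropolis-Hastings move with a normal proposal β′ centered at the current β with
covariance proportional to I, and acceptance probability
$$[\beta \mid Z, \pi, b, Y, S, W, \sigma^2] \propto [Y \mid G(W,S); \beta]\,[\beta \mid Z'\pi + b, \sigma^2 I]$$
Note: an alternative possibility, not yet implemented, would be to use
$$[\beta \mid Z, \pi, A, Y, S, W, \sigma^2] \propto [Y \mid G(W,S); \beta]\,[\beta \mid Z'\pi, \sigma^2 I + \tau^2 A]$$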
Update the vector of random effects b with a similar Metropolis-Hastings
move with acceptance probability
$$[b \mid \beta, Z, \pi, \sigma^2, \tau^2, A] \propto [\beta \mid Z'\pi + b, \sigma^2 I]\,[b \mid 0, \tau^2 A]$$
Again, the vector b and this step could probably be completely omitted
simply by using $[\beta \mid Z'\pi, \sigma^2 I + \tau^2 A]$ in lieu of $[\beta \mid Z'\pi + b, \sigma^2 I]$ above and in the
various other updates below.
Update the prior regression coefficients π by a simple linear regression and
taking a multivariate normal around its MLE:
$$[\pi \mid \beta, Z, b, \sigma^2] \propto [\beta \mid Z'\pi + b, \sigma^2 I]\,[\pi \mid 0, V I]$$
Update the variances $\sigma^2$ and $\tau^2$ using a Metropolis-Hastings move with
normal proposals for ln(σ′) centered at ln(σ), and similarly for τ, with acceptance
probabilities
$$[\sigma, \tau \mid \beta, Z, \pi, A] \propto [\beta \mid Z'\pi, \sigma^2 I + \tau^2 A]\,[\sigma^2]\,[\tau^2]$$
As noted above, for now we’re treating the φs as fixed, but these too could be
given prior distributions and estimated as well.
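As one concrete instance of these steps, the update of the prior regression coefficients π can be sketched as below. Note that this sketch swaps in the exact conjugate normal draw implied by the gene-level model with a N(0, V I) prior on π, rather than the regression-plus-normal-perturbation described above; the function name, argument layout, and toy numbers are hypothetical.

```python
import numpy as np

def draw_pi(beta, b, Z, sigma2, V, rng):
    """Draw pi from its normal full conditional given beta, b, and sigma^2.

    Z has one row per gene and one column per second-stage covariate
    (including a column of ones if an intercept is wanted).
    """
    Z = np.asarray(Z, dtype=float)
    resid = np.asarray(beta, dtype=float) - np.asarray(b, dtype=float)
    precision = Z.T @ Z / sigma2 + np.eye(Z.shape[1]) / V  # likelihood + prior precision
    cov = np.linalg.inv(precision)
    mean = cov @ (Z.T @ resid) / sigma2
    return rng.multivariate_normal(mean, cov), mean, cov

# Toy example: 4 genes, an intercept, and one binary pathway indicator
Z = np.array([[1, 0], [1, 1], [1, 0], [1, 1]], dtype=float)
beta = Z @ np.array([0.5, 1.0])  # gene effects generated without noise
rng = np.random.default_rng(3)
pi_draw, pi_mean, pi_cov = draw_pi(beta, b=np.zeros(4), Z=Z, sigma2=0.01, V=100.0, rng=rng)
```

With a diffuse prior (large V) the conditional mean is close to the least-squares fit of β − b on Z, which is the sense in which this step resembles a simple linear regression.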
Model II
Model fitting for Model II is similar to Model I except for some details in
updating the βs and πs. For sampling the βs, we continue to use Metropolis-Hastings as
in Model I, but the prior contributions for genes with $\beta_g$ = 0 are based on the
cumulative distribution function (CDF) rather than on the probability density function
(PDF) of the distribution. The likelihood contribution is still the same as in Model I.
The πs are updated using Newton-Raphson as in Model I, except that (1) instead of
resetting the πs to zeros before updating them to (score x inverse information) as in
Model I, we accumulate score and information contributions as additions to the old
πs in Model II; (2) we accumulate the first and second derivatives of the cumulative
normal for genes with $\beta_g$ = 0.
Posterior Summarization
We tabulate the following quantities:
For each SNP: the posterior probability of $W_{gs}$ = −1, 0, +1 and Bayes Factor
where the first factor are posterior probabilities given the data D and the
second are the priors.
For each gene: the posterior probability that at least one SNP is included in
the model and its Bayes factor
We also tabulate the posterior means and SDs of each $\beta_g$ and a significance
test as approximately
, along with the mean
number of SNPs included in the model.
For the other parameters, α, π, $\sigma^2$, $\tau^2$: we simply tabulate the posterior means
and SDs.
Finally, we tabulate the posterior distributions of numbers of SNPs and
numbers of genes with at least one SNP included in the model. We still have
to work out the corresponding prior distributions, which are essentially
mixtures of binomials, so we have only approximate BFs for these quantities.
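The SNP-level and gene-level Bayes factors can be computed from the tabulated probabilities as posterior odds divided by prior odds. The sketch below is illustrative: both function names are hypothetical, and the gene-level prior assumes the SNP inclusion indicators are independent a priori.

```python
def bayes_factor(post_prob, prior_prob):
    """Posterior odds of inclusion divided by prior odds of inclusion."""
    if not 0.0 < prior_prob < 1.0:
        raise ValueError("prior_prob must lie strictly between 0 and 1")
    if not 0.0 <= post_prob < 1.0:
        raise ValueError("post_prob must be in [0, 1); a value of 1 gives an infinite BF")
    posterior_odds = post_prob / (1.0 - post_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds

def gene_prior_inclusion(snp_prior_probs):
    """Prior probability that at least one SNP in the gene is included,
    assuming independent SNP indicators (an assumption of this sketch)."""
    p_none = 1.0
    for p in snp_prior_probs:
        p_none *= 1.0 - p
    return 1.0 - p_none

bf_snp = bayes_factor(post_prob=0.5, prior_prob=0.1)   # SNP included in half of the scans
prior_gene = gene_prior_inclusion([0.1, 0.1])          # gene with two SNPs
bf_gene = bayes_factor(post_prob=0.6, prior_prob=prior_gene)
```

A posterior probability estimated as exactly 1 from a finite MCMC run yields an infinite Bayes factor, so in practice such values need to be reported as a lower bound.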
39
CHAPTER 5: SIMULATION STUDIES
To evaluate the performance of our model, we conducted simulation studies
based on the structure of the WECARE Study data. Specifically, we used the real
SNP, covariate, and counter-matching offset data for each risk set and reassigned the
case status in each risk set based on an assumed relative risk model. We assigned
the α their estimated values from the real data and randomly assigned weights
$W_{gs}$ to simulate SNPs. Log relative risk coefficients $\beta_g$ for each gene were simulated
under level 2 of the models (Equation 3). The simulated values of the variances
in this level were assigned as σ = τ = 0.5.
The analysis was performed on a total of 504 SNPs in 38 genes (ranging from
1 to 51 SNPs per gene) involved in DNA damage response pathways (i.e., DNA
repair, cell cycle checkpoint control, and apoptosis). Using the Gene Ontology, we
extracted 860 terms relating to biological process or molecular function annotated
to any of these 38 genes and selected four of these GO terms (DNA damage
checkpoint, MRE11 complex, double-strand break repair via nonhomologous end
joining, and negative regulation of cell cycle) as prior covariates in the Z matrix.
These GO terms were assigned coefficients π = 0.25, 0.5, 0.75, and 1 respectively.
The intercept $\pi_0$ was set to 2 for Model I, and was set to 0 for Model II. All 860 GO
terms were used to construct a correlation matrix A for the similarity in the ways
each pair of genes was described in the GO (Figure 2).
The resulting gene indices $G_g(W,S)$ and the corresponding $\beta_g$, along with the
real $X_i$ and the estimated covariate coefficients and offset terms, were then used
to compute each subject’s relative risk. These were then used to randomly assign which
member of each risk set would be designated as the case. The following estimates
are based on multiple data replicates for each of several parameter choices; each
realization of the data uses 1000 MCMC scans for tabulation after a burn-in of 500
scans. For Model I, we simulated 10 parameter sets and 10 data replicates for each
parameter set. The simulation analysis for Model II was based on a single randomly
chosen parameter set and simulated data set.
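The case-reassignment step described above can be sketched as follows. This is a minimal illustration rather than the thesis code (which was written in C/C++): the function name `assign_cases`, the toy risk sets, and the assumption that the case is drawn with probability proportional to each member's relative risk are ours.

```python
import math
import random


def assign_cases(risk_sets, rng):
    """Pick one case per risk set (assumed: probability proportional to relative risk).

    risk_sets: list of lists of log relative risks, one inner list per risk set
               (each value would already include covariates and the offset term).
    Returns the index of the simulated case within each risk set.
    """
    cases = []
    for log_rrs in risk_sets:
        weights = [math.exp(v) for v in log_rrs]
        idx = rng.choices(range(len(log_rrs)), weights=weights, k=1)[0]
        cases.append(idx)
    return cases


# Hypothetical toy input: two risk sets with three members each.
rng = random.Random(2013)
toy_sets = [[0.0, 0.5, -0.2], [1.0, 0.0, 0.0]]
simulated_cases = assign_cases(toy_sets, rng)
```

Repeating this draw with different seeds would give the data replicates for one parameter set.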
The SNP inclusion indicators were assigned +/- 1 or 0 with probabilities
given by Equation 5 with $\varphi_{-} = \varphi_{+} = 0.05$ and c = 1.
We plotted estimated parameters versus simulated parameters from Model I
(Figure 3a) and Model II (Figure 3b) based on SNP inclusion prior φ=0.05. The
plots suggest that the parameter values estimated by both models correlate
adequately with the true parameters.
We selected different values of φ to ensure that fewer than 10% of the
SNPs were included in the model. We compared the estimated exp(β)s from Model I
(Figure 4a) and the estimated βs from Model II (Figure 4b) under
different SNP inclusion priors φ (0.025, 0.05, 0.1), where the data and parameters
were simulated with fixed values ($\varphi_{-} = \varphi_{+} = 0.05$ and c = 1). Both
graphs suggest that the estimated parameters are insensitive to the choice of prior.
We summarized the BFs for SNP inclusion using Kass and Raftery’s
classification (BF >3, >20, >150; equivalently, 2ln(BF) >2, >6, >10) (Figure 5.1ab for
Model I and 5.2ab for Model II). Each column represents the average posterior
probabilities for truly negative SNPs, most common truly null SNPs, and truly
positive SNPs. Each color represents the estimated probability of SNP inclusion and
its direction, given the true SNP inclusion and its direction. Truly associated
SNPs had much larger posterior probabilities of being associated on average (and in
the right direction) than null SNPs (Figure 5.1a and 5.2a). Specificity is high for both
models; however, the sensitivity of Model II is almost double the sensitivity of Model
I (Figure 5.1b and 5.2b).
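The Bayes factor categories used here can be written out explicitly. The sketch below is illustrative only: the function names are hypothetical, the Bayes factor is assumed to be computed as posterior odds divided by prior odds, and the label for 2ln(BF) below zero is our own addition (the thresholds themselves follow Kass and Raftery, 1995).

```python
import math


def posterior_bayes_factor(post_prob, prior_prob):
    """Bayes factor as posterior odds divided by prior odds (assumed definition)."""
    post_odds = post_prob / (1.0 - post_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return post_odds / prior_odds


def kass_raftery_category(bf):
    """Map a positive Bayes factor to an evidence category via 2ln(BF)."""
    two_ln_bf = 2.0 * math.log(bf)
    if two_ln_bf > 10:
        return "very strong"
    if two_ln_bf > 6:
        return "strong"
    if two_ln_bf > 2:
        return "positive"
    if two_ln_bf >= 0:
        return "bare mention"
    return "favors H0"  # label assumed here; not part of the original table
```

For example, a posterior inclusion probability of 0.5 under a prior of 0.05 gives a Bayes factor of about 19, which falls in the "positive" category.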
In addition to SNP associations, we also summarized simulation results for
gene-level associations (Figure 5.1c and 5.2c). Both models identified genes with
positive to strong association. Red stars indicate the gene having at least one SNP
included in the model. As shown, not every gene with at least one SNP included was
identified as a strong candidate by the models.
We summarized the frequency distribution of SNPs based on their minor
allele frequencies (MAFs) (Table 2). After filtering highly correlated SNPs, 384 SNPs
were employed for the simulation analysis. About 66% of the SNPs included in each
model were common variants (MAF>=0.05); about a quarter were
uncommon variants (0.01<=MAF<0.05); and less than 9% were rare variants
(MAF<0.01). By examining the posterior
probabilities and Bayes factors by categories of MAF (Figure 6.1 and 6.2), we
discovered that both of our models worked satisfactorily in identifying common
variants but not rare variants -- the sensitivity increased along with the categories,
while the specificity remained high (Figure 6 left panels). The histograms based on
SNP Bayes Factors categorized by MAFs also show an increase in sensitivity (Figure
6 right panels). Model II appeared to yield a higher sensitivity than Model I, with an
approximate doubling for common variants if the true SNP indicator was 1, and
with an approximate tripling for common variants if the true SNP indicator was 1.
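The MAF categories of Table 2 can be expressed as a small helper. This is an illustrative sketch only: the function names and the toy genotype counts are assumptions, and the cut-points (0.01 and 0.05) are taken from Table 2.

```python
def minor_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Minor allele frequency from genotype counts at a biallelic SNP."""
    n_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    alt_freq = (n_het + 2 * n_hom_alt) / n_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def maf_category(maf):
    """Classify a SNP as rare, uncommon, or common by its MAF."""
    if maf < 0.01:
        return "rare"
    if maf < 0.05:
        return "uncommon"
    return "common"


# Hypothetical genotype counts: 90 homozygous reference, 9 heterozygous, 1 homozygous alternate.
toy_maf = minor_allele_frequency(90, 9, 1)
toy_label = maf_category(toy_maf)
```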
The data management procedures were conducted with SAS 9.2 (SAS
Institute Inc., Cary NC), R (http://www.R-project.org/), and Microsoft Excel (2007);
the analysis procedures were conducted with C/C++ (Microsoft Visual Studio 2008).
CHAPTER 6: APPLICATION TO THE WECARE DATA
Using the same settings as for the simulation studies, we applied our models
to the real WECARE data. The aim was to identify, from a set of genes thought to
be involved in DSB response pathways, which genes were associated with second
primary breast cancer and, among those associated, which SNPs within the genes
are driving these associations.
Among 2107 breast cancer survivors, 708 were cases with second primary
breast cancer, and 1399 served as controls, matched on age, race and latency of the
disease. Since we focused on the main effects of genes rather than GxE interactions, we
did not impose restrictions based on radiotherapy or radiation dose. Furthermore, no
restrictions based on ATM or BRCA genotypes were applied. Missing genotypes
were treated as non-carriers. The missing data problem can be resolved by using
Monte Carlo maximum likelihood if the complete data likelihood is known
(Thompson and Guo 1991). The MCMC method provides a natural solution by
imputing values for the missing data at each iteration, sampling from their full
conditional distribution given the available data (Gilks et al 1998). Among the 504
SNPs in 38 genes from candidate gene panels, 384 SNPs were employed in the
analysis after eliminating highly correlated SNPs within a gene. Twelve variables
(age, menarche, menopause, family history, pregnancy, histology, treatment, the
FGFR2 GWAS-identified SNP (not thought to be involved in DSB damage response),
and known deleterious variants in ATM, BRCA1, BRCA2, and CHEK2) were treated
as fixed covariates; an offset term to adjust for the counter-matched design was also
included.
We constructed prior covariates and gene-gene connections from the Gene
Ontology. 860 GO terms in the categories of Molecular Function or Biological
Process that were annotated to any of the 38 genes were used to construct the
correlation matrix A. More specifically, the diagonal elements of the A matrix are all
one, and the off-diagonal elements reflect the correlation between a pair of genes
given all 860 GO term values for each gene. We selected 4 of these GO terms (DNA
damage checkpoint, MRE11 complex, DSB repair via NHEJ, negative regulation of
cell cycle) associated with DNA DSBs response to be included as prior covariates Z.
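The construction of the A matrix can be sketched as follows. This is a minimal illustration under stated assumptions: each gene is represented by a vector of GO term indicators (1 if the term is annotated to the gene, 0 otherwise), the correlation is taken to be the Pearson correlation, and the gene labels and annotation vectors are hypothetical toy values rather than real GO data.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(annotations):
    """Build a gene-by-gene matrix: 1 on the diagonal, correlations elsewhere."""
    genes = list(annotations)
    size = len(genes)
    matrix = [[1.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            r = pearson(annotations[genes[i]], annotations[genes[j]])
            matrix[i][j] = r
            matrix[j][i] = r
    return genes, matrix


# Hypothetical annotations over five GO terms for three placeholder genes.
toy_annotations = {
    "G1": [1, 1, 0, 0, 1],
    "G2": [1, 1, 0, 0, 0],
    "G3": [0, 0, 1, 1, 0],
}
toy_genes, toy_A = correlation_matrix(toy_annotations)
```

In the real analysis each vector would have 860 entries and the matrix would be 38 by 38.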
The results of the real data analysis were based on 10,000 iterations after a
4,000-iteration burn-in. The trace plots of joint probability for both models appeared to
have converged (Figure 7); however, the trace plots for some individual β and b in
both models appeared unstable (not shown), whereas the trace plots of some other
estimated parameters, such as π, were highly stable (not shown).
Model I identified one gene (MDC1) with strong association (2ln(BF) > 6),
one gene (RAD51) with positive association (2ln(BF) > 2), and two genes (ATM and
NBN) with weak associations (2ln(BF) > 0) with second primary breast cancer
(Figure 8b). This finding matched the results for the probabilities of gene inclusion
(Figure 8a), where the highest probability of gene inclusion identified the same 4
candidate genes as identified by gene-level Bayes Factors. Model II identified only
one gene (MDC1) with positive association based on gene-level Bayes Factors
(Figure 8d), as well as quite a few genes with weak association. MDC1 also had the
highest posterior probability of being included in the model (Figure 8c).
Based on Bayes Factors, we identified SNPs driving the gene-disease
associations with strong evidence (2ln(BF) > 6) in genes NBN and RAD51, and with
positive evidence (2ln(BF) > 2) in genes ATM, CHEK2, MDC1, MRE11A, using Model
I (Figure 9a). For Model II, we identified SNPs driving the gene-disease associations
with positive evidence in genes ATM, FANCA, LIG4, MDC1, NBN, and RAD51 (Figure
9b). Each model identified 9 SNPs with at least positive evidence for driving
gene-disease association. Among the SNPs identified by the two models, 4 (rs4713354,
rs2269705, rs9297757, rs1801320) are in common.
Table 3 presents the Bayes Factors for selected SNPs identified by Model I
and Model II, as well as those identified by a previous WECARE publication (Brooks
et al. 2012). The log relative risk estimates (lnRR) are based on the adjusted
conditional multivariate logistic regression from each model.
Seven of the nine SNPs identified by Model I have been found to be associated with
breast cancer risk in previous investigations. More specifically, among these seven
SNPs, five were selected by Brooks et al. (2012) on WECARE data, and two
were selected by other breast cancer studies. Brooks et al. (2012) identified six
SNPs with a significant association (p-value < 0.05, uncorrected for multiple
comparisons) with CBC risk, among 152 SNPs in 6 genes (CHEK2, MRE11A, MDC1,
NBN, RAD50, TP53BP1). Five (rs6005861 in CHEK2, rs4713354 in MDC1,
rs13447682 in MRE11A, rs9297757 and rs3736640 in NBN) of the six SNPs that Brooks
and colleagues identified were also identified by our Model I. Model I also
identified SNP rs1800057, a variant in ATM, which was previously shown to be
associated with a statistically significant reduction in CBC risk (Concannon et al.
2008). Furthermore, rs1801320 (135G>C), identified by Model I, is a SNP in the
5'-untranslated region (UTR) of the RAD51 gene. Previous studies have found mixed
results for its role in breast cancer risk (Antoniou et al. 2007,
Yu et al. 2011). In addition to these previously reported SNPs, Model I also
identified rs4987951 in ATM and rs2269705 in MDC1, for which no previous
evidence of association with breast cancer risk was found.
Model II identified rs4713354 in MDC1, rs9297757 in NBN, and rs1801320 in
RAD51, which were also identified using Model I; and these have been shown to be
associated with breast cancer risk in previous studies, as discussed above. In
addition, Model II identified rs664677 in ATM, which was not identified by Model I,
but a recent meta-analysis (Shen et al. 2012) has shown its association with
increased breast cancer risk. Model II also identified rs7187436 in FANCA,
rs9468811 in MDC1, rs1555902 and rs11620361 in LIG4, but no previous literature
was found with evidence for their associations with breast cancer risk. Nevertheless,
these could be novel findings from our models.
Table 4 lists the numbers of subjects homozygous for the reference allele,
heterozygous, and homozygous for the risk allele among cases (CBC) and controls (UBC),
respectively, for all the SNPs identified by our models and by Brooks et al. (2012).
We also report the estimated coefficients (lnRR) from simple logistic regression for each
selected SNP, adjusted for age, menarche, menopause, family history, pregnancy,
histology, treatment, the FGFR2 GWAS-identified SNP, deleterious variants in ATM,
BRCA1, BRCA2, CHEK2 and the offset term. Based on the adjusted simple logistic
regression, among fifteen selected SNPs, nine were statistically significantly
associated with CBC risk (PC41, rs1800057, v_IVS14m55, rs6005861,
rs4713354, rs2269705, rs13447682, rs3736640, rs1801320; p values
<0.05), and three were marginally significant (rs1555902, rs11620361,
rs9297757; 0.05 <=p values <0.1). Eight of the nine variants selected by Model I
were statistically significant (rs1800057, v_IVS14m55, rs6005861, rs4713354,
rs2269705, rs13447682, rs3736640, and rs1801320; p values < 0.05), and one was
marginally significant (rs9297757, p = 0.097). Four of the nine variants selected by
Model II were significant (PC41, rs4713354, rs2269705, rs1801320; p values <
0.05), and three of the rest were marginally significant (rs1555902, rs11620361,
rs9297757; 0.05 <=p values < 0.1). Four of the six variants selected by Brooks et al.
(2012) were significant (rs6005861, rs4713354, rs13447682, rs3736640; p values <
0.05), and one was marginally significant (rs9297757, p value = 0.097).
Among these selected variants with statistical significance and marginal
significance from adjusted simple logistic regression, based on their relative risks,
we identified PC41, rs11620361, rs4713354 and rs2269705 as causal variants,
and rs1800057, v_IVS14m55, rs6005861, rs1555902, rs13447682, rs9297757,
rs3736640 and rs1801320 as protective variants. It is interesting to note that
PC41 and v_IVS14m55 are SNPs in the same gene (ATM), and that rs1555902
and rs11620361 are SNPs in the same gene (LIG4), yet the SNPs within each
gene showed opposite effects on CBC risk.
CHAPTER 7: DISCUSSION
We proposed two models based on a Bayesian hierarchical modeling
framework for multi-level variable selection, motivated by the WECARE study, to
identify effective (causal or protective) genes associated with CBC risk. CBC is a
complex disease involving multiple genetic and environmental factors that
complicate mapping of its genetic components. Our multi-level
Bayesian hierarchical selection methods are applicable to these types of
investigations. These methods are extremely flexible, allowing for model
uncertainty to be taken into consideration. Additionally, the hierarchical nature of
the model provides means to incorporate a priori knowledge about genes (such as
known covariance structure) into the model, which can improve variable selection
among multiple predictors. These approaches can be used in a variety of fields,
particularly in genetic epidemiology, including identifying variants associated with
diseases of interest and their directions of effects.
Using our models, we are able to identify which genes within a pathway are
associated with the outcome of interest and, given that certain genes are associated,
which SNPs are responsible for these associations. As demonstrated in the WECARE
study application section, both Bayesian hierarchical models were able to
detect some positive to strong evidence of gene-disease associations, and also to
pinpoint specific SNPs that are most likely driving these associations. By performing
a model search to determine which variants to include and using Bayesian model
averaging techniques to calculate posterior quantities of interest, our Bayesian
hierarchical models demonstrate reasonable sensitivity with high specificity.
The simulation study shows that Model II had roughly double the sensitivity of
Model I. Both models show high specificity.
Additional simulation studies could help to elucidate the performance of our
methods. A set of independent genetic association study-based simulations could be
further developed to examine the power of the multi-level Bayesian hierarchical
methods. Including some alternative variable selection approaches for power
comparison with our methods would be a suitable avenue for future research. The
alternatives include penalized regression methods such as Lasso (Tibshirani, 1996),
Elastic Net (Zou and Hastie 2005) and Group Bridge (Huang et al. 2009), and basic
Bayesian model uncertainty methods such as choosing a “non-informative” prior
distribution (uniform prior, binomial prior, or beta-binomial prior) over individual
models. The marginal true positive rate (TPR) and false positive rate (FPR), defined
as the proportions of causal and non-causal predictors selected, respectively, can be
calculated for each of the methods; ROC curves can then be plotted to compare power.
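The TPR and FPR calculation mentioned above can be sketched as follows. This is an illustration only: the function names are hypothetical, the scores stand in for any per-predictor selection score (for example, a posterior inclusion probability), and the toy values are not results from the thesis.

```python
def tpr_fpr(selected, causal):
    """Return (TPR, FPR) given equal-length lists of booleans.

    selected[i]: whether predictor i was selected by the method.
    causal[i]:   whether predictor i is truly causal in the simulation.
    """
    true_pos = sum(1 for s, c in zip(selected, causal) if s and c)
    false_pos = sum(1 for s, c in zip(selected, causal) if s and not c)
    n_causal = sum(1 for c in causal if c)
    n_null = len(causal) - n_causal
    return true_pos / n_causal, false_pos / n_null


def roc_points(scores, causal):
    """Sweep the selection threshold over the observed scores.

    Returns (FPR, TPR) pairs, from the strictest threshold to the most lenient.
    """
    points = []
    for threshold in sorted(set(scores), reverse=True):
        selected = [s >= threshold for s in scores]
        tpr, fpr = tpr_fpr(selected, causal)
        points.append((fpr, tpr))
    return points


# Hypothetical scores for four predictors, two of which are truly causal.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_causal = [True, False, True, False]
toy_roc = roc_points(toy_scores, toy_causal)
```

Plotting the resulting pairs for each method on the same axes would give the ROC comparison.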
Since it seems unreasonable to consider the known deleterious variants in
ATM, BRCA1/2, and CHEK2 as exchangeable with the tagging SNPs, we have treated
these variants as fixed covariates, along with age, menarche, menopause, family history,
pregnancy, histology, treatment, and the FGFR2 SNP, forcing these covariates into all
models. Unfortunately, this precludes borrowing strength across all the variants within
these genes. Given that we know that some variants in these genes are
deleterious, it would seem more likely that there would be other causal variants in
the same genes; and if these four genes have similar prior covariate values $Z_g$,
that should inform the estimation of the corresponding $\beta_g$s and draw the estimates
of the $\beta$s for other genes that are highly correlated with them in the A matrix
towards the $\beta_g$ values for these genes.
Our model framework uses a simple MH algorithm to sample models of
interest from the enormous space of possible models. Both of our models take
approximately 3 hours to complete on a single processor when performing 10,000
iterations (after a 4,000-iteration burn-in) of the current MH/MCMC algorithm on the
WECARE data. The trace plots showed that the overall likelihood and most of the
higher-level model parameters converged well before the end of burn-in, although
some of the more specific gene-level parameters remained somewhat unstable.
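A Metropolis-Hastings search over SNP inclusion indicators of the kind described above can be sketched as follows. This is a toy illustration, not the thesis implementation: the function names, the single-SNP symmetric proposal, and the stand-in log-likelihood (a simple sum of per-SNP signals, used here in place of the conditional logistic likelihood) are all assumptions.

```python
import math
import random


def log_prior(states, phi_plus, phi_minus):
    """Log prior of indicators in {-1, 0, +1} with fixed inclusion probabilities."""
    prob = {1: phi_plus, -1: phi_minus, 0: 1.0 - phi_plus - phi_minus}
    return sum(math.log(prob[s]) for s in states)


def mh_model_search(log_lik, n_snps, phi_plus, phi_minus, n_iter, burn_in, rng):
    """Return the post-burn-in fraction of iterations in which each SNP is included."""
    states = [0] * n_snps
    current = log_lik(states) + log_prior(states, phi_plus, phi_minus)
    included = [0] * n_snps
    for it in range(burn_in + n_iter):
        # Symmetric proposal: change one randomly chosen SNP to a different state.
        j = rng.randrange(n_snps)
        proposal = list(states)
        proposal[j] = rng.choice([s for s in (-1, 0, 1) if s != states[j]])
        candidate = log_lik(proposal) + log_prior(proposal, phi_plus, phi_minus)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            states, current = proposal, candidate
        if it >= burn_in:
            for k, s in enumerate(states):
                included[k] += 1 if s != 0 else 0
    return [count / n_iter for count in included]


# Toy stand-in likelihood: SNP 0 favors +1, SNP 3 favors -1, SNPs 1 and 2 are null.
toy_signal = [3.0, 0.0, 0.0, -3.0]


def toy_log_lik(states):
    return sum(a * s for a, s in zip(toy_signal, states))


toy_probs = mh_model_search(toy_log_lik, 4, 0.05, 0.05, 20000, 2000, random.Random(1))
```

The returned fractions play the role of posterior inclusion probabilities; in this toy setting the two SNPs with signal should be included far more often than the two null SNPs.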
Our results are encouraging from a broad perspective. Each model we
proposed identified nine SNPs that are possibly driving the gene-disease
associations; four of them are in common for both models. Model I identified five
SNPs that were discovered by a previous analysis on WECARE data, and two SNPs in
previous general breast cancer risk studies. Model II identified two SNPs discovered
by previous WECARE analyses, and two SNPs discovered in previous studies for
potential association with general breast cancer risk. It is possible that both of our
models identified some novel SNPs responsible for gene-breast cancer association
that were not detected in previous investigations. Although Model I identified more
SNPs with previously discovered association with breast cancer risk than Model II
did in the real data analysis, the simulation study suggested that Model II had more
statistical power than Model I. Further studies of both of these models would be
useful for identifying novel candidate genes and SNPs associated with disease risk.
Our models use latent indicator variables at the SNP level to do variable
selection, and Bayesian shrinkage at the gene level towards a prior mean vector and
covariance matrix that depend on external information. By incorporating these
biological covariates about the gene structures in the WECARE study, we showed
evidence of associations of CBC with the genes ATM, CHEK2, MDC1, MRE11A, NBN,
RAD51 (Model I), or with the genes ATM, FANCA, LIG4, MDC1, NBN, RAD51 (Model
II).
The premise of the WECARE study was that the power to detect main effects
of relatively rare genetic mutations and their interactions with environmental
factors would be considerably enhanced by restricting consideration to women with
a first primary breast cancer and then studying the determinants of developing a
second primary breast cancer. In the current models, once we adjusted for the
possible confounding variables, we focused on identifying main effects within the
variants of interest. However, if we ignore the interactions of genes and
environment, but merely estimate the contributions of genes and environment
separately, we might not be correctly estimating the population attributable risk of
the disease, because a proportion of the disease might be explained by the joint
effects of genes and environment (Hunter 2005). Extending Lewinger et al.’s (2007)
hierarchical Bayes prioritization approach, Sohns et al. (under review) developed a
novel empirical hierarchical Bayes approach to detect GxE interactions in GWAS. It
can serve as one of the recommended methods for follow-up SNP selection,
particularly when G-E associations are suspected.
Deficiencies in cellular response to DNA damage can predispose to cancer.
Ionizing radiation is known to cause cluster damage and DSBs that pose problems
for cellular repair process, and hence is a breast cancer carcinogen. Many genes
encode products that are essential for the normal cellular response to DSBs, but
predispose to breast cancer when mutated. The carriers of certain haplotypes may
be susceptible to the DNA-damaging effects of radiation therapy associated with
radiation-induced breast cancer. These genetic factors and radiation exposure,
individually or via interaction, may contribute to the development of radiation-
induced CBC. A previous WECARE analysis found an interaction between ATM and
radiation treatment contributing to the risk of developing CBC -- women who carry
rare ATM missense variants predicted to be deleterious and were exposed to
radiation had a statistically significant higher risk of CBC compared with unexposed
women who carried the wild-type genotype, or compared with unexposed women
who carried the same deleterious ATM missense variant (Bernstein et al. 2010).
The WECARE study design was based on counter-matching on radiotherapy
status, increasing the informativeness of the collected dosimetry data by increasing
the variability of radiation dose within the case-control sets, and allowing for
unbiased estimation of the main effects and interactions of interest. The focus on
DNA DSB damage response pathway genes was entirely predicated on an interest in
DSBs induced by ionizing radiation. Thus, we had stronger priors for gene-radiation
interactions than for main effects of these genes, and the ability to detect
gene-radiation therapy interaction was consequently enhanced (Bernstein et al. 2004).
A gene-environment interaction involving radiation therapy is therefore highly
plausible at the biological level in the WECARE study. It is worth noting that we tried
to add the environment (radiation therapy or radiation dose) and gene-environment
interaction terms as extensions to our models, but so far no interesting findings have
been detected. Even in the post-GWAS era, analysis of GxE interaction remains one of the
greatest challenges. Our models are highly extensible; it is of future interest to
investigate possible gene-environment interactions with further refinement of our
models.
In our current models, we aim to detect both common and rare variants that
contribute to the gene-disease association; hence, we gave the same weight to
common variants, uncommon variants, and rare variants within a gene. Based on
the simulation results, our models appear to identify effects of common variants
with higher accuracy than for rare variants (Figure 6). However, this result may be
due to the unbalanced number of the common, uncommon, and rare variants across
the SNP data (Table 2). For common variants (and perhaps in candidate gene
studies for uncommon variants), it may be possible to allow each SNP to have its
own regression coefficient from some continuous distribution, but constraints
would be needed if both SNP- and gene-level parameters were to be estimated to
ensure identifiability.
The Bayesian hierarchical framework provides a flexible and powerful basis
for selecting multiple genes within a pathway and selecting multiple SNPs within
each gene, incorporating external prior information at either the SNP or gene level.
Capanu et al. (2011) and Quintana et al. (2013) proposed similar models
incorporating external information at the SNP level. In this thesis, we have only used
external information at the gene level, since the GO does not provide any annotation
of specific variants within genes. However, there are many ways to classify SNPs a
priori, such as simple indicators for whether they are coding or noncoding variants,
or the predictions of programs like SIFT (Ng and Henikoff 2003) and PolyPhen (Xi et
al. 2004) based on predicted effects on protein conformation or evolutionary
conservation. Such information could easily be incorporated into a multinomial
logistic or probit model for the inclusion probabilities φ (Quintana et al., 2013). The
current version of our program treats the deleterious SNP inclusion probability
$\varphi_{+}$ and the protective SNP inclusion probability $\varphi_{-}$ as fixed
constants dependent on the size
of the gene the SNP belongs to, but in principle, other prior distributions on the
model specific coefficients can be incorporated into the framework. For instance,
$\varphi_{+}$ and $\varphi_{-}$ could be assigned prior Beta or Beta-Binomial
distributions, subject to the constraint that $\varphi_{+} + \varphi_{-} < 1$. In
practice, we might want to keep the SNP inclusion
probability below 10%, but this can be adjusted through the parameters of
the prior distribution.
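One way to place priors on the two inclusion probabilities while respecting the sum constraint is sketched below. Note that this swaps in a Dirichlet prior over (deleterious, protective, excluded), which is an assumption on our part rather than the approach of the current program; under it each inclusion probability is marginally Beta distributed and their sum is below 1 by construction. The function name and hyperparameter values are hypothetical.

```python
import random


def sample_inclusion_probs(a_plus, a_minus, a_null, rng):
    """Draw (phi_plus, phi_minus) from a Dirichlet(a_plus, a_minus, a_null).

    Uses the standard construction of a Dirichlet draw from independent
    gamma variates normalized by their sum.
    """
    draws = [rng.gammavariate(a, 1.0) for a in (a_plus, a_minus, a_null)]
    total = sum(draws)
    return draws[0] / total, draws[1] / total


# Hypothetical hyperparameters giving a prior mean of 0.05 for each probability.
rng = random.Random(7)
toy_draws = [sample_inclusion_probs(1.0, 1.0, 18.0, rng) for _ in range(1000)]
toy_mean_plus = sum(p for p, _ in toy_draws) / len(toy_draws)
```

With these hyperparameters the prior mean of the total inclusion probability is 10%, and larger values of the third hyperparameter would concentrate the prior on sparser models.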
The information that can be incorporated into an analysis via prior
covariates is extremely flexible. In particular, for our motivating application of
genetic association studies, a vast amount of external biological information exists
for the variants under consideration, as discussed in Conti et al. (2009). Many of
the methods reviewed in Cooper and Shendure (2011) used a combination of
evolutionary, biochemical and structural information to guide the estimation. We
expect an increase in power if more such information were incorporated into our model
framework.
In summary, we applied multi-level hierarchical modeling in a Bayesian
framework, incorporating a priori knowledge about genes involved in the
radiation-induced DNA DSB pathway into the CBC risk estimates. We applied these models to
WECARE data, and identified potentially effective genes and SNPs associated with CBC
risk, based on Bayes Factor cutoffs. The results of the analysis of real WECARE data
are encouraging; however, independent replication is required to confirm the
associations found in this study. A replication GWAS study (WECARE II) using 2100
CBC cases and 2100 UBC controls is currently in progress. Further investigation of
the association of the effects of gene-radiation interaction and risk estimates is
possible.
To systematically study the effects of rare variants is one of the goals of our
model development in future application. Although the data we used contain mainly
common variants, and the identification test accuracy appears to be satisfactory
only for common variants, our models are in principle equally applicable to data
containing rare variants, e.g., from next generation sequencing data.
TABLES
Table 1: Interpretation of Bayes Factors for association (Kass and Raftery, 1995)
Bayes Factor (BF) (HA:H0) | 2ln(BF) | Evidence against H0
1 to 3 | 0 to 2 | Not worth more than a bare mention
3 to 20 | 2 to 6 | Positive
20 to 150 | 6 to 10 | Strong
>150 | >10 | Very strong
Table 2. Frequency Distribution of SNPs based on their MAFs
MAF | Model I (a) | Model II (b)
<0.01 | 34 (8.9%) | 34 (8.8%)
>=0.01 and <0.05 | 96 (25.0%) | 95 (24.7%)
>=0.05 | 254 (66.1%) | 256 (66.5%)
Total | 384 | 385
a: Average number of SNPs included in Model I from 10 data replicates for each of the 10 parameter sets
b: Number of SNPs included in Model II from a single replicate of data and parameter set
Table 3: SNP-level and gene-level Bayes Factors from Model I and Model II for selected SNPs.
Gene-level values (BFgene, 2lnBFgene, lnRR) are listed on the first row of each gene.
Gene | rs# | Model I: BFsnp | 2lnBFsnp | BFgene | 2lnBFgene | lnRR (d) | Model II: BFsnp | 2lnBFsnp | BFgene | 2lnBFgene | lnRR (d)
ATM | PC41 (b) | 0.00 | n/a | 1.41 | 0.68 | 0.68 | 2.73 | 2.01 | 1.58 | 0.91 | 0.11
ATM | rs1800057 (a) | 4.58 | 3.04 | | | | 1.67 | 1.03 | | |
ATM | v_IVS14m55 (a) | 9.04 | 4.40 | | | | 1.99 | 1.38 | | |
CHEK2 | rs6005861 (a,c) | 7.00 | 3.89 | 0.36 | -2.02 | 0.60 | 1.41 | 0.69 | 1.27 | 0.48 | 0.10
FANCA | rs7187436 (b) | eliminated | | 0.00 | n/a | n/a | 2.87 | 2.11 | 1.69 | 1.05 | 0.08
LIG4 | rs1555902 (b) | 1.23 | 0.41 | 0.25 | -2.78 | 0.34 | 2.96 | 2.17 | 1.63 | 0.98 | 0.10
LIG4 | rs11620361 (b) | 1.38 | 0.64 | | | | 3.17 | 2.31 | | |
MDC1 | rs9468811 (b) | 0.69 | -0.74 | 20.71 | 6.06 | 0.54 | 2.76 | 2.03 | 6.00 | 3.58 | 0.48
MDC1 | rs4713354 (a,b,c) | 9.72 | 4.55 | | | | 13.48 | 5.20 | | |
MDC1 | rs2269705 (a,b) | 15.91 | 5.53 | | | | 8.41 | 4.26 | | |
MRE11A | rs13447682 (a,c) | 5.70 | 3.48 | 0.52 | -1.30 | 0.57 | 1.42 | 0.70 | 0.99 | -0.03 | 0.07
NBN | rs14448 (c) | 0.20 | -3.22 | 2.62 | 1.93 | 0.45 | 1.62 | 0.96 | 1.45 | 0.75 | 0.11
NBN | rs9297757 (a,b,c) | 27.33 | 6.62 | | | | 5.49 | 3.41 | | |
NBN | rs3736640 (a,c) | 4.14 | 2.84 | | | | 1.32 | 0.56 | | |
RAD51 | rs1801320 (a,b) | 21.38 | 6.12 | 3.51 | 2.51 | 0.53 | 3.32 | 2.40 | 1.89 | 1.27 | 0.14
a: SNPs identified by Model I based on Bayes Factors. Only those SNPs with 2ln(BF) exceeding 2 were selected.
b: SNPs identified by Model II based on Bayes Factors. Only those SNPs with 2ln(BF) exceeding 2 were selected.
c: SNPs identified by Brooks et al. 2012 based on per-allele RR (log-additive model), adjusted for age at first diagnosis and
counter-matching weight. Only those SNPs with p value for trend < 0.05 were selected.
d
is estimated by subject-level adjusted conditional logistic regression: in Model I is equivalent to
; in Model
II is equivalent to
.
Table 4. Association Between Selected Variants in DNA-Damage Response Genes and CBC risk
Gene | rs# | Homozygous reference allele: Case (CBC) | Control (UBC) | Heterozygous: Case (CBC) | Control (UBC) | Homozygous risk allele: Case (CBC) | Control (UBC) | lnRR (d) (95% CI) | p value (e)
ATM | PC41 (b) | 207 | 473 | 361 | 671 | 140 | 255 | 0.15 (0.01, 0.28) | 0.038
ATM | rs1800057 (a) | 680 | 1322 | 28 | 76 | 0 | 1 | -0.47 (-0.95, -0.01) | 0.046
ATM | v_IVS14m55 (a) | 674 | 1278 | 34 | 121 | 0 | 0 | -0.66 (-1.32, -0.25) | 0.002
CHEK2 | rs6005861 (a,c) | 680 | 1311 | 27 | 86 | 1 | 2 | -0.40 (-0.85, 0.06) | 0.086
FANCA | rs7187436 (b) | 263 | 468 | 323 | 677 | 122 | 254 | -0.07 (-0.21, 0.07) | 0.313
LIG4 | rs1555902 (b) | 573 | 1098 | 130 | 280 | 5 | 21 | -0.21 (-0.43, 0.01) | 0.065
LIG4 | rs11620361 (b) | 469 | 986 | 216 | 375 | 23 | 38 | 0.18 (0.00, 0.36) | 0.050
MDC1 | rs9468811 (b) | 674 | 1324 | 34 | 73 | 0 | 2 | 0.15 (-0.30, 0.60) | 0.502
MDC1 | rs4713354 (a,b,c) | 535 | 1116 | 157 | 267 | 16 | 16 | 0.47 (0.26, 0.68) | <0.001
MDC1 | rs2269705 (a,b) | 589 | 1220 | 113 | 175 | 6 | 4 | 0.50 (0.25, 0.76) | <0.001
MRE11A | rs13447682 (a,c) | 690 | 1343 | 18 | 54 | 0 | 2 | -0.56 (-1.12, -0.01) | 0.046
NBN | rs14448 (c) | 640 | 1215 | 60 | 171 | 8 | 13 | -0.11 (-0.40, 0.18) | 0.447
NBN | rs9297757 (a,b,c) | 651 | 1233 | 52 | 148 | 5 | 18 | -0.26 (-0.58, 0.05) | 0.097
NBN | rs3736640 (a,c) | 676 | 1288 | 32 | 107 | 0 | 4 | -0.64 (-1.27, -0.21) | 0.003
RAD51 | rs1801320 (a,b) | 646 | 1209 | 58 | 186 | 4 | 4 | -0.31 (-0.62, 0.00) | 0.048
a: SNPs identified by Model I based on Bayes Factors. Only those SNPs with 2ln(BF) exceeding 2 were selected.
b: SNPs identified by Model II based on Bayes Factors. Only those SNPs with 2ln(BF) exceeding 2 were selected.
c: SNPs identified by Brooks et al. 2012 based on per-allele RR. Only those SNPs with p value for trend < 0.05 were selected.
d: lnRR: regression coefficients of each SNP from simple logistic regression, adjusted for age, menarche, menopause, family history,
pregnancy, histology, treatment, the FGFR2 GWAS-identified SNP, deleterious variants in ATM, BRCA1, BRCA2, CHEK2 and the offset
term.
e: p-values associated with the Wald z-test for estimates from simple logistic regression adjusted for the fixed covariates listed in d.
FIGURES
Figure 1: Directed acyclic graph describing the structure of the model. Boxes describe observed data, circles represent
latent variables or model parameters, and the triangle represents a logical node.
61
Figure 2: Graphical representation of the A matrix derived from the Gene Ontology. The lower levels of the graph
indicate sets of genes with high correlations across the 860 GO terms.
62
Figure 3: Comparison of estimated exp( )s based on different priors ( ) from Model I (a); and
comparison of estimated s based on different priors ( ) from Model II (b). The black dots are
simulated true betas corresponding to each gene. The true was set to 0.05.
63
Figure 4: The correlation between Estimated Exp(Beta) and True Exp(Beta) from Model I (a) and the correlation analysis
between Estimated Betas and True Betas from Model II (b). The simulation is based on single replica of data set and
parameter set for both Model (500 burn-ins, 1000 iterations).
64
Figure 5.1: Simulation analysis for Model I. Sensitivity/Specificity based on proportions of the number of SNPs (a), Sensitivity
based on Bayes Factors of SNPs (b), Bayes factors of genes (c). The red stars indicate the gene has at least one SNP included in the
model based on simulated data. The upper panels is based on simulation analysis on 10 data replicates per parameter set for 10
replicates of parameter sets, the lower panel is based on a single replicate of data.
65
Figure 5.2: Simulation analysis for Model II. Sensitivity/specificity based on proportions of the number of SNPs (a), sensitivity
based on Bayes factors of SNPs (b), and Bayes factors of genes (c). The red stars indicate genes that have at least one SNP included in the
model based on simulated data. The upper panels are based on simulation analysis of 10 data replicates per parameter set for 10
replicates of parameter sets; the lower panel is based on a single replicate of data.
Figure 6.1: Simulation analysis for Model I by MAF. Sensitivity/Specificity based on proportions of the number of SNPs
categorized by MAF (Left panels). Sensitivity based on Bayes Factors of SNPs categorized by MAF (Right panels).
Figure 6.2: Simulation analysis for Model II by MAF. Sensitivity/Specificity based on proportions of the number of SNPs
categorized by MAF (Left panels). Sensitivity based on Bayes Factors of SNPs categorized by MAF (Right panels).
Figure 7: Trace plots of the conditional log-likelihood for Model I (a) and for Model II (b). Each plot is based on 10,000
iterations with 4,000 burn-in iterations for the real data.
Figure 8: Gene-disease association analysis for Model I (a, b) and for Model II (c, d). Probability that at least one SNP in the gene
was included in the model (a, c). Candidate gene identification based on Bayes factors (b, d).
Figure 9: SNP inclusion analysis for Model I (a) and for Model II (b).
REFERENCES
Antoniou, A. C., O. M. Sinilnikova, J. Simard, M. Léoné, M. Dumont, S. L. Neuhausen, J.
P. Struewing, D. Stoppa-Lyonnet, L. Barjhoux, D. J. Hughes, I. Coupier, M.
Belotti, C. Lasset, V. Bonadona, Y. J. Bignon, T. R. Rebbeck, T. Wagner, H. T.
Lynch, S. M. Domchek, K. L. Nathanson, J. E. Garber, J. Weitzel, S. A. Narod, G.
Tomlinson, O. I. Olopade, A. Godwin, C. Isaacs, A. Jakubowska, J. Lubinski, J.
Gronwald, B. Górski, T. Byrski, T. Huzarski, S. Peock, M. Cook, C. Baynes, A.
Murray, M. Rogers, P. A. Daly, H. Dorkins, R. K. Schmutzler, B. Versmold, C.
Engel, A. Meindl, N. Arnold, D. Niederacher, H. Deissler, A. B. Spurdle, X. Chen,
N. Waddell, N. Cloonan, T. Kirchhoff, K. Offit, E. Friedman, B. Kaufmann, Y.
Laitman, G. Galore, G. Rennert, F. Lejbkowicz, L. Raskin, I. L. Andrulis, E.
Ilyushik, H. Ozcelik, P. Devilee, M. P. Vreeswijk, M. H. Greene, S. A. Prindiville,
A. Osorio, J. Benitez, M. Zikan, C. I. Szabo, O. Kilpivaara, H. Nevanlinna, U.
Hamann, F. Durocher, A. Arason, F. J. Couch, D. F. Easton, G. Chenevix-Trench,
GEMO, EMBRACE, GCHBOC, kConFab and CIMBA (2007). "RAD51 135G→C modifies
breast cancer risk among
BRCA2 mutation carriers: results from a combined analysis of 19 studies."
Am J Hum Genet 81(6): 1186-1200.
Asimit, J. and E. Zeggini (2010). "Rare variant association analysis methods for
complex traits." Annu Rev Genet 44: 293-308.
Azzopardi, D., A. R. Dallosso, K. Eliason, B. C. Hendrickson, N. Jones, E. Rawstorne, J.
Colley, V. Moskvina, C. Frye, J. R. Sampson, R. Wenstrup, T. Scholl and J. P.
Cheadle (2008). "Multiple rare nonsynonymous variants in the adenomatous
polyposis coli gene predispose to colorectal adenomas." Cancer Res 68(2):
358-363.
Bansal, V., O. Libiger, A. Torkamani and N. J. Schork (2010). "Statistical analysis
strategies for association studies involving rare variants." Nat Rev Genet
11(11): 773-785.
Basu, S. and W. Pan (2011). "Comparison of statistical tests for disease association with rare
variants." Genetic Epidemiology 35(7): 606-619.
Begg, C. B., R. W. Haile, A. Borg, K. E. Malone, P. Concannon, D. C. Thomas, B.
Langholz, L. Bernstein, J. H. Olsen, C. F. Lynch, H. Anton-Culver, M. Capanu, X.
Liang, A. J. Hummer, C. Sima and J. L. Bernstein (2008). "Variation of breast
cancer risk among BRCA1/2 carriers." JAMA 299(2): 194-201.
Bernstein, J. L., S. Teraoka, R. W. Haile, A. L. Borresen-Dale, B. S. Rosenstein, R. A.
Gatti, A. T. Diep, L. Jansen, D. R. Atencio, J. H. Olsen, L. Bernstein, S. L.
Teitelbaum, W. D. Thompson, P. Concannon and W. S. C. Grp (2003).
"Designing and implementing quality control for multi-center screening of
mutations in the ATM gene among women with breast cancer." Human
Mutation 21(5): 542-550.
Bernstein, J. L., B. Langholz, R. W. Haile, L. Bernstein, D. C. Thomas, M. Stovall, K. E.
Malone, C. F. Lynch, J. H. Olsen, H. Anton-Culver, R. E. Shore, J. D. Boice, G. S.
Berkowitz, R. A. Gatti, S. L. Teitelbaum, S. A. Smith, B. S. Rosenstein, A. L.
Borresen-Dale and P. Concannon (2004). "Study design: Evaluating gene-
environment interactions in the etiology of breast cancer - the WECARE
study." Breast Cancer Research 6(3): R199-R214.
Bernstein, J. L., P. Concannon, B. Langholz, W. D. Thompson, L. Bernstein, M. Stovall,
D. C. Thomas and W. S. C. Grp (2005). "Multi-center screening of mutations in
the ATM gene among women with breast cancer - The WECARE study."
Radiation Research 163(6): 698-699.
Bernstein, J., L. Bernstein, B. Langholz, D. Thomas, M. Stovall, M. Capanu, W. D.
Thompson, J. Olson, K. Malone, C. Lynch, H. Anton-Culver, R. Shore, J. Boice, C.
Begg, A. Wolitzer, R. Gatti, B. Rosenstein, A. L. Borresen-Dale, P. Concannon
and R. Haile (2006). "The interaction of radiation, the ATM gene and breast
cancer." American Journal of Epidemiology 163(11): S251-S251.
Bernstein, J. L., S. Teraoka, M. C. Southey, M. A. Jenkins, I. L. Andrulis, J. A. Knight, E.
M. John, R. Lapinski, A. L. Wolitzer, A. S. Whittemore, D. West, D. Seminara, E.
R. Olson, A. B. Spurdle, G. Chenevix-Trench, G. G. Giles, J. L. Hopper and P.
Concannon (2006). "Population-based estimates of breast cancer risks
associated with ATM gene variants c.7271T > G and c.1066-6T > G (IVS10-6T
> G) from the breast cancer family registry." Human Mutation 27(11): 1122-
1128.
Bernstein, J. L., R. W. Haile, M. Stovall, J. D. Boice, Jr., R. E. Shore, B. Langholz, D. C.
Thomas, L. Bernstein, C. F. Lynch, J. H. Olsen, K. E. Malone, L. Mellemkjaer, A.-
L. Borresen-Dale, B. S. Rosenstein, S. N. Teraoka, A. T. Diep, S. A. Smith, M.
Capanu, A. S. Reiner, X. Liang, R. A. Gatti, P. Concannon and W. S. C. Grp
(2010). "Radiation Exposure, the ATM Gene, and Contralateral Breast Cancer
in the Women's Environmental Cancer and Radiation Epidemiology Study."
Journal of the National Cancer Institute 102(7): 475-483.
Borg, A., R. W. Haile, K. E. Malone, M. Capanu, A. Diep, T. Torngren, S. Teraoka, C. B.
Begg, D. C. Thomas, P. Concannon, L. Mellemkjaer, L. Bernstein, L. Tellhed, S.
Xue, E. R. Olson, X. Liang, J. Dolle, A.-L. Borresen-Dale, J. L. Bernstein and W. S.
C. Grp (2010). "Characterization of BRCA1 and BRCA2 Deleterious Mutations
and Variants of Unknown Clinical Significance in Unilateral and Bilateral
Breast Cancer: The WECARE Study." Human Mutation 31(3): E1200-E1240.
Botstein, D. and N. Risch (2003). "Discovering genotypes underlying human phenotypes:
past successes for mendelian disease, future approaches for complex disease."
Nat Genet 33(Suppl): 228-237.
Brooks, J. D., S. N. Teraoka, A. S. Reiner, J. M. Satagopan, L. Bernstein, D. C. Thomas,
M. Capanu, M. Stovall, S. A. Smith, S. Wei, R. E. Shore, J. D. Boice, Jr., C. F.
Lynch, L. Mellemkjaer, K. E. Malone, X. Liang, R. W. Haile, P. Concannon, J. L.
Bernstein and W. S. C. Grp (2012). "Variants in Activators and Downstream
Targets of ATM, Radiation Exposure, and Contralateral Breast Cancer Risk in
the WECARE Study." Human Mutation 33(1): 158-164.
Cantor, R. M., K. Lange and J. S. Sinsheimer (2010). "Prioritizing GWAS results: A
review of statistical methods and recommendations for their application."
Am J Hum Genet 86(1): 6-22.
Capanu, M., I. Orlow, M. Berwick, A. J. Hummer, D. C. Thomas and C. B. Begg (2008).
"The use of hierarchical models for estimating relative risks of individual
genetic variants: an application to a study of melanoma." Stat Med 27(11):
1973-1992.
Capanu, M. and C. B. Begg (2011). "Hierarchical modeling for estimating relative
risks of rare genetic variants: properties of the pseudo-likelihood method."
Biometrics 67(2): 371-380.
Capanu, M., P. Concannon, R. W. Haile, L. Bernstein, K. E. Malone, C. F. Lynch, X.
Liang, S. N. Teraoka, A. T. Diep, D. C. Thomas, J. L. Bernstein, C. B. Begg and W.
S. C. Grp (2011). "Assessment of Rare BRCA1 and BRCA2 Variants of
Unknown Significance Using Hierarchical Modeling." Genetic Epidemiology
35(5): 389-397.
Chen, G. K. and J. S. Witte (2007). "Enriching the analysis of genomewide association
studies with hierarchical modeling." Am J Hum Genet 81(2): 397-404.
Chen, J. H. and Z. H. Chen (2008). "Extended Bayesian information criteria for model
selection with large model space." Biometrika 95(3): 759-771.
Chen, L. S., C. M. Hutter, J. D. Potter, Y. Liu, R. L. Prentice, U. Peters and L. Hsu (2010). "Insights into
colon cancer etiology via a regularized approach to gene set analysis of
GWAS data." Am J Hum Genet 86(6): 860-871.
Chen, G. K. and D. C. Thomas (2010). "Using biological knowledge to discover higher
order interactions in genetic association studies." Genet Epidemiol 34(8):
863-878.
Clyde, M. A., H. DeSimone and G. Parmigiani (1996). "Prediction via orthogonalized
model mixing." J Am Stat Assoc 91: 1197-1208.
Concannon, P., R. W. Haile, A.-L. Borresen-Dale, B. S. Rosenstein, R. A. Gatti, S. N.
Teraoka, A. T. Diep, L. Jansen, D. P. Atencio, B. Langholz, M. Capanu, X. Liang,
C. B. Begg, D. C. Thomas, L. Bernstein, J. H. Olsen, K. E. Malone, C. F. Lynch, H.
Anton-Culver, J. L. Bernstein and E. Women's Environm Canc Radiation
(2008). "Variants in the ATM gene associated with a reduced risk of
contralateral breast cancer." Cancer Research 68(16): 6486-6491.
Conti, D. V., V. Cortessis, J. Molitor and D. C. Thomas (2003). "Bayesian modeling of
complex metabolic pathways." Human Heredity 56(1-3): 83-93.
Conti, D. V., J. P. Lewinger, R. R. Tyndale, N. L. Benowitz, G. E. Swan and P. D. Thomas (2009).
"Using ontologies in hierarchical modeling of genes and exposure in biological
pathways." NCI Monographs September 2009; 20: 539-584.
Conti, D. V. and J. S. Witte (2003). "Hierarchical modeling of linkage disequilibrium:
genetic structure and spatial relations." Am J Hum Genet 72(2): 351-363.
Cooper, G. M. and J. Shendure (2011). "Needles in stacks of needles: finding disease-
causal variants in a wealth of genomic data." Nat Rev Genet 12(9): 628-640.
Easton, D. F., K. A. Pooley, A. M. Dunning, P. D. Pharoah, D. Thompson, D. G. Ballinger,
J. P. Struewing, J. Morrison, H. Field, R. Luben, N. Wareham, S. Ahmed, C. S.
Healey, R. Bowman, K. B. Meyer, C. A. Haiman, L. K. Kolonel, B. E. Henderson,
L. Le Marchand, P. Brennan, S. Sangrajrang, V. Gaborieau, F. Odefrey, C. Y.
Shen, P. E. Wu, H. C. Wang, D. Eccles, D. G. Evans, J. Peto, O. Fletcher, N.
Johnson, S. Seal, M. R. Stratton, N. Rahman, G. Chenevix-Trench, S. E. Bojesen,
B. G. Nordestgaard, C. K. Axelsson, M. Garcia-Closas, L. Brinton, S. Chanock, J.
Lissowska, B. Peplonska, H. Nevanlinna, R. Fagerholm, H. Eerola, D. Kang, K.
Y. Yoo, D. Y. Noh, S. H. Ahn, D. J. Hunter, S. E. Hankinson, D. G. Cox, P. Hall, S.
Wedren, J. Liu, Y. L. Low, N. Bogdanova, P. Schürmann, T. Dörk, R. A.
Tollenaar, C. E. Jacobi, P. Devilee, J. G. Klijn, A. J. Sigurdson, M. M. Doody, B. H.
Alexander, J. Zhang, A. Cox, I. W. Brock, G. MacPherson, M. W. Reed, F. J.
Couch, E. L. Goode, J. E. Olson, H. Meijers-Heijboer, A. van den Ouweland, A.
Uitterlinden, F. Rivadeneira, R. L. Milne, G. Ribas, A. Gonzalez-Neira, J.
Benitez, J. L. Hopper, M. McCredie, M. Southey, G. G. Giles, C. Schroen, C.
Justenhoven, H. Brauch, U. Hamann, Y. D. Ko, A. B. Spurdle, J. Beesley, X. Chen,
A. Mannermaa, V. M. Kosma, V. Kataja, J. Hartikainen, N. E. Day, D. R. Cox, B. A.
Ponder, SEARCH collaborators, kConFab and AOCS Management Group (2007). "Genome-wide
association study identifies novel breast cancer susceptibility loci." Nature
447(7148): 1087-1093.
Figueiredo, J. C., J. D. Brooks, D. V. Conti, J. N. Poynter, S. N. Teraoka, K. E. Malone, L.
Bernstein, W. D. Lee, D. J. Duggan, A. Siniard, P. Concannon, M. Capanu, C. F.
Lynch, J. H. Olsen, R. W. Haile and J. L. Bernstein (2011). "Risk of contralateral
breast cancer associated with common variants in BRCA1 and BRCA2:
potential modifying effect of BRCA1/BRCA2 mutation carrier status." Breast
Cancer Research and Treatment 127(3): 819-829.
George, E. I. and R. E. McCulloch (1993). "Variable selection via Gibbs
sampling." Journal of the American Statistical Association 88(423): 881-889.
George, E. I. and R. E. McCulloch (1997). "Approaches for Bayesian variable
selection." Statistica Sinica 7(2): 339-373.
Gilks, W. R., S. Richardson and D. J. Spiegelhalter (1998). Markov Chain Monte Carlo in
Practice. Boca Raton, Chapman & Hall/CRC.
Gold, B., T. Kirchhoff, S. Stefanov, J. Lautenberger, A. Viale, J. Garber, E. Friedman, S.
Narod, A. B. Olshen, P. Gregersen, K. Kosarin, A. Olsh, J. Bergeron, N. A. Ellis, R.
J. Klein, A. G. Clark, L. Norton, M. Dean, J. Boyd and K. Offit (2008). "Genome-
wide association study provides evidence for a breast cancer risk locus at
6q22.33." Proc Natl Acad Sci U S A 105(11): 4340-4345.
Gorlov, I. P., O. Y. Gorlova, S. R. Sunyaev, M. R. Spitz and C. I. Amos (2008). "Shifting
paradigm of association studies: value of rare single-nucleotide
polymorphisms." Am J Hum Genet 82(1): 100-112.
Greenland, S. (1992). "A semi-Bayes approach to the analysis of correlated multiple
associations, with an application to an occupational cancer-mortality study."
Stat Med 11(2): 219-230.
Greenland, S. (1993). "Methods for epidemiologic analyses of multiple exposures: a
review and comparative study of maximum-likelihood, preliminary-testing,
and empirical-Bayes regression." Stat Med 12(8): 717-736.
Helleday, T., E. Petermann, C. Lundin, B. Hodgson and R. A. Sharma (2008). "DNA
repair pathways as targets for cancer therapy." Nat Rev Cancer 8(3): 193-
204.
Hoffmann, T. J., N. J. Marini and J. S. Witte (2010). "Comprehensive approach to analyzing rare
genetic variants." PLoS One 5(11): e13584.
Huang, J., S. Ma, H. Xie and C. H. Zhang (2009). "A group bridge approach for variable
selection." Biometrika 96(2): 339-355.
Hung, R. J., P. Brennan, C. Malaveille, S. Porru, F. Donato, P. Boffetta and J. S. Witte
(2004). "Using hierarchical modeling in genetic association studies with
multiple markers: application to a case-control study of bladder cancer."
Cancer Epidemiol Biomarkers Prev 13(6): 1013-1021.
Hung, R. J., M. Baragatti, D. Thomas, J. McKay, N. Szeszenia-Dabrowska, D. Zaridze, J.
Lissowska, P. Rudnai, E. Fabianova, D. Mates, L. Foretova, V. Janout, V.
Bencko, A. Chabrier, N. Moullan, F. Canzian, J. Hall, P. Boffetta and P. Brennan
(2007). "Inherited predisposition of lung cancer: a hierarchical modeling
approach to DNA repair and cell cycle control pathways." Cancer Epidemiol
Biomarkers Prev 16(12): 2736-2744.
Hunter, D. J. (2005). "Gene-environment interactions in human diseases." Nat Rev
Genet 6(4): 287-298.
Hunter, D. J., P. Kraft, K. B. Jacobs, D. G. Cox, M. Yeager, S. E. Hankinson, S. Wacholder,
Z. Wang, R. Welch, A. Hutchinson, J. Wang, K. Yu, N. Chatterjee, N. Orr, W. C.
Willett, G. A. Colditz, R. G. Ziegler, C. D. Berg, S. S. Buys, C. A. McCarty, H. S.
Feigelson, E. E. Calle, M. J. Thun, R. B. Hayes, M. Tucker, D. S. Gerhard, J. F.
Fraumeni, R. N. Hoover, G. Thomas and S. J. Chanock (2007). "A genome-wide
association study identifies alleles in FGFR2 associated with risk of sporadic
postmenopausal breast cancer." Nat Genet 39(7): 870-874.
Ishwaran, H. and J. S. Rao (2005). "Spike and slab gene selection for multigroup
microarray data." Journal of the American Statistical Association 100(471):
764-780.
Ishwaran, H., B. Kogalur and J. S. Rao (2010). "Spike and slab variable selection:
Frequentist and Bayesian strategies." Annals of Statistics 33(2): 730-773.
Ishwaran, H. and J. S. Rao (2011). "Consistency of spike and slab regression."
Statistics & Probability Letters 81(12): 1920-1928.
Jeffreys, H. (1935). "Some tests of significance, treated by the theory of probability."
Proceedings of the Cambridge Philosophical Society 31: 203-222.
Jeffreys, H. (1961). Theory of Probability (3rd ed.). Oxford, U.K., Oxford University Press.
Kass, R. E. and A. E. Raftery (1993). "Bayes Factors and Model Uncertainty."
Technical Report 571, Carnegie Mellon University, Dept. of Statistics.
Kass, R. E. and A. E. Raftery (1995). "Bayes factors." Journal of the American
Statistical Association 90(430): 773-795.
Kim, L. L., B. A. Fijal and J. S. Witte (2001). "Hierarchical modeling of the relation
between sequence variants and a quantitative trait: addressing multiple
comparison and population stratification issues." Genet Epidemiol 21 Suppl
1: S668-673.
Land, C. E. (1995). "Studies of cancer and radiation dose among atomic bomb
survivors. The example of breast cancer." JAMA 274(5): 402-407.
Langholz, B. and L. Goldstein (1996). "Risk set sampling in epidemiologic cohort
studies." Statistical Science 11(1): 35-53.
Langholz, B., D. C. Thomas, W. D. Thompson and J. Bernstein (2003). "The WECARE
study to assess genetic susceptibility to radiation-induced breast cancer: A
counter-matched case-control study." Genetic Epidemiology 25(3): 257-257.
Langholz, B., D. C. Thomas, M. Stovall, S. A. Smith, J. D. Boice, Jr., R. E. Shore, L.
Bernstein, C. F. Lynch, X. Zhang and J. L. Bernstein (2009). "Statistical
Methods for Analysis of Radiation Effects with Tumor and Dose Location-
Specific Information with Application to the WECARE Study of Asynchronous
Contralateral Breast Cancer." Biometrics 65(2): 599-608.
Lavin, M. F. (2008). "Ataxia-telangiectasia: from a rare disorder to a paradigm for
cell signalling and cancer." Nat Rev Mol Cell Biol 9(10): 759-769.
Lee, J. H. and T. T. Paull (2005). "ATM activation by DNA double-strand breaks
through the Mre11-Rad50-Nbs1 complex." Science 308(5721): 551-554.
Lee, J. H., A. A. Goodarzi, P. A. Jeggo and T. T. Paull (2010). "53BP1 promotes ATM
activity through direct interactions with the MRN complex." EMBO J 29(3):
574-585.
Lewinger, J. P., D. V. Conti, J. W. Baurley, T. J. Triche and D. C. Thomas (2007).
"Hierarchical Bayes prioritization of marker associations from a genome-
wide association scan for further investigation." Genetic Epidemiology 31(8):
871-882.
Li, B. and S. M. Leal (2008). "Methods for detecting associations with rare variants
for common diseases: application to analysis of sequence data." Am J Hum
Genet 83(3): 311-321.
Li, R., D. V. Conti, D. Diaz-Sanchez, F. Gilliland and D. C. Thomas (2013). "Joint
Analysis for Integrating Two Related Studies of Different Data Types and
Different Study Designs Using Hierarchical Modeling Approaches." Hum
Hered 74(2): 83-96.
Liu, X., E. Jorgenson and J. S. Witte (2005). "Hierarchical modeling in association
studies of multiple phenotypes." BMC Genet 6 Suppl 1: S104.
Madsen, B. E. and S. R. Browning (2009). "A groupwise association test for rare
mutations using a weighted sum statistic." PLoS Genet 5(2): e1000384.
Maher, B. (2008). "Personal genomes: The case of the missing heritability." Nature
456(7218): 18-21.
McCarthy, M. I. (2009). "Exploring the unknown: assumptions about allelic
architecture and strategies for susceptibility variant discovery." Genome Med
1(7): 66.
Mellemkjaer, L., C. Dahl, J. H. Olsen, L. Bertelsen, P. Guldberg, J. Christensen, A. L.
Borresen-Dale, M. Stovall, B. Langholz, L. Bernstein, C. F. Lynch, K. E. Malone,
R. W. Haile, M. Andersson, D. C. Thomas, P. Concannon, M. Capanu, J. D. Boice,
Jr., J. L. Bernstein and W. S. C. Grp (2008). "Risk for contralateral breast
cancer among carriers of the CHEK2*1100delC mutation in the WECARE
Study." British Journal of Cancer 98(4): 728-733.
Mitchell, T. J. and J. J. Beauchamp (1988). "Bayesian variable selection in linear
regression." J Am Stat Assoc 83(404): 1023-1032.
Morgenthaler, S. and W. G. Thilly (2007). "A strategy to discover genes that carry
multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums
test (CAST)." Mutat Res 615(1-2): 28-56.
Morris, C. N. (1983). "Parametric empirical Bayes inference: theory and applications
(with discussion)." J Am Stat Assoc 78: 47-65.
Morris, A. P. and E. Zeggini (2010). "An evaluation of statistical approaches to rare
variant analysis in genetic association studies." Genet Epidemiol 34(2): 188-
193.
Ng, P. C. and S. Henikoff (2003). "SIFT: Predicting amino acid changes that affect
protein function." Nucleic Acids Res 31(13): 3812-3814.
Paull, T. T. and J. H. Lee (2005). "The Mre11/Rad50/Nbs1 complex and its role as a
DNA double-strand break sensor for ATM." Cell Cycle 4(6): 737-740.
Pe'er, I., P. I. de Bakker, J. Maller, R. Yelensky, D. Altshuler and M. J. Daly (2006).
"Evaluating and improving power in whole-genome association studies using
fixed marker sets." Nat Genet 38(6): 663-667.
Poynter, J. N., B. Langholz, J. Largent, L. Mellemkjaer, L. Bernstein, K. E. Malone, C. F.
Lynch, A. Borg, P. Concannon, S. N. Teraoka, S. Xue, A. T. Diep, T. Törngren, C.
B. Begg, M. Capanu, R. W. Haile, J. L. Bernstein and W. S. C. Group (2010).
"Reproductive factors and risk of contralateral breast cancer by BRCA1 and
BRCA2 mutation status: results from the WECARE study." Cancer Causes
Control 21(6): 839-846.
Quintana, M. A., J. L. Bernstein, D. C. Thomas and D. V. Conti (2011). "Incorporating
Model Uncertainty in Detecting Rare Variants: The Bayesian Risk Index."
Genetic Epidemiology 35(7): 638-649.
Quintana, M. A. and D. V. Conti (2013). "Integrative Variable Selection via Bayesian Model
Uncertainty." Statistics in Medicine.
Quintana, M. A., F. R. Schumacher, G. Casey, J. L. Bernstein, L. Li and D. V. Conti
(2013). "Incorporating Prior Biologic Information for High Dimensional Rare
Variant Association Studies."
Raftery, A. E., D. M. Madigan and Hoeting (1993). "Model Selection and Accounting
for Model Uncertainty in Linear Regression Models." Technical Report 2262,
University of Washington, Dept. of Statistics.
Rebbeck, T. R., M. Spitz and X. Wu (2004). "Assessing the function of genetic variants
in candidate gene association studies." Nat Rev Genet 5(8): 589-597.
Roeder, K., S. A. Bacanu, L. Wasserman and B. Devlin (2006). "Using linkage genome
scans to improve power of association in genome scans." Am J Hum Genet
78(2): 243-252.
Robert, C. P. and G. Casella (2004). Monte Carlo Statistical Methods. New York, Springer.
Reding, K. W., J. L. Bernstein, B. M. Langholz, L. Bernstein, R. W. Haile, C. B. Begg, C. F.
Lynch, P. Concannon, A. Borg, S. N. Teraoka, T. Törngren, A. Diep, S. Xue, L.
Bertelsen, X. Liang, A. S. Reiner, M. Capanu, K. E. Malone and W. C. S. Group
(2010). "Adjuvant systemic therapy for breast cancer in BRCA1/BRCA2
mutation carriers in a population-based study of risk of contralateral breast
cancer." Breast Cancer Res Treat 123(2): 491-498.
Rockova, V., E. Lesaffre, J. Luime and B. Lowenberg (2012). "Hierarchical Bayesian
formulations for selecting variables in regression models." Statistics in
Medicine 31(11-12): 1221-1237.
Satagopan, J. M., D. A. Verbel, E. S. Venkatraman, K. E. Offit and C. B. Begg (2002).
"Two-stage designs for gene-disease association studies." Biometrics 58(1):
163-170.
Shahbaba, B., C. M. Shachaf and Z. X. Yu (2012). "A pathway analysis method for
genome-wide association studies." Statistics in Medicine 31(10): 988-1000.
Shen, L., Z. H. Yin, Y. Wan, Y. Zhang, K. Li and B. S. Zhou (2012). "Association between
ATM polymorphisms and cancer risk: a meta-analysis." Mol Biol Rep 39(5):
5719-5725.
Schork, N. J., S. S. Murray, K. A. Frazer and E. J. Topol (2009). "Common vs. rare allele
hypotheses for complex diseases." Curr Opin Genet Dev 19(3): 212-219.
Sohns, M., A. Rosenberger and H. Bickeboller (2009). "Integration of a priori gene set
information into genome-wide association studies." BMC Proc 3 Suppl 7: S95.
Sohns, M., W. Viktorova, C. I. Amos, P. Brennan, G. Fehringer, V. Gaborieau, Y. Han, J.
Heinrich, J. Chang-Claude, R. J. Hung, M. Muller-Nurasyid, A. Risch, D. Thomas
and H. Bickeboller (under review). "Empirical Hierarchical Bayes Approach
for Gene-Environment Interactions: Development and Application to
Genome-Wide Association Studies of Lung Cancer."
Stovall, M., S. A. Smith, B. M. Langholz, J. D. Boice, Jr., R. E. Shore, M. Andersson, T. A.
Buchholz, M. Capanu, L. Bernstein, C. F. Lynch, K. E. Malone, H. Anton-Culver,
R. W. Haile, B. S. Rosenstein, A. S. Reiner, D. C. Thomas, J. L. Bernstein and W.
S. C. Grp (2008). "Dose to the contralateral breast from radiotherapy and risk
of second primary breast cancer in the WECARE study." International Journal
of Radiation Oncology Biology Physics 72(4): 1021-1030.
Sun, L., R. V. Craiu, A. D. Paterson and S. B. Bull (2006). "Stratified false discovery
control for large-scale hypothesis testing with application to genome-wide
association studies." Genet Epidemiol 30(6): 519-530.
Swartz, M. D., M. Kimmel, P. Mueller and C. I. Amos (2006). "Stochastic search gene
suggestion: a Bayesian hierarchical model for gene mapping." Biometrics
62(2): 495-503.
Teraoka, S. N., J. L. Bernstein, A. S. Reiner, R. W. Haile, L. Bernstein, C. F. Lynch, K. E.
Malone, M. Stovall, M. Capanu, X. Liang, S. A. Smith, J. Mychaleckyj, X. Hou, L.
Mellemkjaer, J. D. Boice, Jr., A. Siniard, D. Duggan, D. C. Thomas, P. Concannon
and W. S. C. Grp (2011). "Single nucleotide polymorphisms associated with
risk for contralateral breast cancer in the Women's Environment, Cancer, and
Radiation Epidemiology (WECARE) Study." Breast Cancer Research 13(6).
Thomas, D. C., J. Siemiatycki, R. Dewar, J. Robins, M. Goldberg and B. G. Armstrong
(1985). "The problem of multiple inference in studies designed to generate
hypotheses." Am J Epidemiol 122(6): 1080-1095.
Thomas, D., B. Langholz, D. Clayton, J. Pitkäniemi, E. Tuomilehto-Wolf and J.
Tuomilehto (1992). "Empirical Bayes methods for testing associations with
large numbers of candidate genes in the presence of environmental risk
factors, with applications to HLA associations in IDDM." Ann Med 24(5): 387-
392.
Thomas, D. C., A. Borg, M. Capanu, P. Concannon, D. V. Conti, R. W. Haile, X. Liang,
A. S. Reiner, M. Stovall, S. N. Teraoka and J. L. Bernstein (2009a). "A
Comprehensive Model for DNA Repair Genes and Radiation in Second Breast
Cancers: The WECARE Collaborative Study Group." Genetic Epidemiology
33(8): 807-807.
Thomas, D. C., D. V. Conti, J. Baurley, F. Nijhout, M. Reed and C. M. Ulrich (2009b).
"Use of pathway information in molecular epidemiology." Hum Genomics
4(1): 21-42.
Thomas, G., K. B. Jacobs, P. Kraft, M. Yeager, S. Wacholder, D. G. Cox, S. E. Hankinson,
A. Hutchinson, Z. Wang, K. Yu, N. Chatterjee, M. Garcia-Closas, J. Gonzalez-
Bosquet, L. Prokunina-Olsson, N. Orr, W. C. Willett, G. A. Colditz, R. G. Ziegler,
C. D. Berg, S. S. Buys, C. A. McCarty, H. S. Feigelson, E. E. Calle, M. J. Thun, R.
Diver, R. Prentice, R. Jackson, C. Kooperberg, R. Chlebowski, J. Lissowska, B.
Peplonska, L. A. Brinton, A. Sigurdson, M. Doody, P. Bhatti, B. H. Alexander, J.
Buring, I. M. Lee, L. J. Vatten, K. Hveem, M. Kumle, R. B. Hayes, M. Tucker, D. S.
Gerhard, J. F. Fraumeni, R. N. Hoover, S. J. Chanock and D. J. Hunter (2009c).
"A multistage genome-wide association study in breast cancer identifies two
new risk alleles at 1p11.2 and 14q24.1 (RAD51L1)." Nat Genet 41(5): 579-
584.
Thomas, D. (2010a). "Gene-environment-wide association studies: emerging
approaches." Nature Reviews Genetics 11(4): 259-272.
Thomas, D. (2010b). Methods for Investigating Gene-Environment Interactions in
Candidate Pathway and Genome-Wide Association Studies. Annual Review of
Public Health, Vol 31. J. E. Fielding, R. C. Brownson and L. W. Green. 31: 21-
36.
Thompson, E. A. and S. W. Guo (1991). "Evaluation of likelihood ratios for complex
genetic models." IMA J Math Appl Med Biol 8(3): 149-169.
Thompson, D. and D. Easton (2004). "The genetic epidemiology of breast cancer
genes." J Mammary Gland Biol Neoplasia 9(3): 221-236.
Tibshirani, R. (1996). "Regression shrinkage and selection via the Lasso." J R Stat Soc
58(1): 267-288.
Tintle, N., F. Lantieri, J. Lebrec, M. Sohns, D. Ballard and H. Bickeboller (2009).
"Inclusion of A Priori Information in Genome-Wide Association Analysis."
Genetic Epidemiology 33: S74-S80.
Tokunaga, M., C. E. Land, T. Yamamoto, M. Asano, S. Tokuoka, H. Ezaki and I.
Nishimori (1987). "Incidence of female breast cancer among atomic bomb
survivors, Hiroshima and Nagasaki, 1950-1980." Radiat Res 112(2): 243-
272.
Volk, H. E., F. Gilliland, D. Diaz-Sanchez and D. V. Conti (2008). "Hierarchical
Modeling of Pathway-Based Candidate Genes and Gene-Environment
Interactions." Genetic Epidemiology 32(7): 669-670.
Wang, K., M. Li and M. Bucan (2007). "Pathway-based approaches for analysis of
genomewide association studies." Am J Hum Genet 81(6): 1278-1283.
Wang, K., M. Y. Li and H. Hakonarson (2010). "Analysing biological pathways in
genome-wide association studies." Nature Reviews Genetics 11(12): 843-
854.
Wen, G., S. K. Mahata, P. Cadman, M. Mahata, S. Ghosh, N. R. Mahapatra, F. Rao, M.
Stridsberg, D. W. Smith, P. Mahboubi, N. J. Schork, D. T. O'Connor and B. A.
Hamilton (2004). "Both rare and common polymorphisms contribute
functional variation at CHGA, a regulator of catecholamine physiology." Am J
Hum Genet 74(2): 197-207.
Wilson, M. A., J. W. Baurley, D. C. Thomas and D. V. Conti (2010a). Complex System
Approaches to Genetic Analysis: Bayesian Approaches. Computational
Methods for Genetics of Complex Traits. J. C. Dunlap and J. H. Moore. 72: 47-
71.
Wilson, M. A., E. S. Iversen, M. A. Clyde, S. C. Schmidler and J. M. Schildkraut (2010b).
"Bayesian model search and multilevel inference for SNP
association studies." Annals of Applied Statistics 4(3): 1342-1364.
Witte, J. S., S. Greenland, R. W. Haile and C. L. Bird (1994). "Hierarchical regression
analysis applied to a study of multiple dietary exposures and breast cancer."
Epidemiology 5(6): 612-621.
Witte, J. S. and S. Greenland (1996). "Simulation study of hierarchical regression."
Stat Med 15(11): 1161-1170.
Witte, J. S. (1997). "Genetic analysis with hierarchical models." Genet Epidemiol
14(6): 1137-1142.
Xi, T., I. M. Jones and H. W. Mohrenweiser (2004). "Many amino acid substitution
variants identified in DNA repair genes during human population screenings
are predicted to impact protein function." Genomics 83(6): 970-979.
Yu, K. D., C. Yang, L. Fan, A. X. Chen and Z. M. Shao (2011). "RAD51 135G>C does not
modify breast cancer risk in non-BRCA1/2 mutation carriers: evidence from
a meta-analysis of 12 studies." Breast Cancer Res Treat 126(2): 365-371.
Zou, H. and T. Hastie (2005). "Regularization and variable selection via the elastic
net." Journal of the Royal Statistical Society: Series B (Statistical Methodology)
67(2): 301-320.
Abstract
We propose a novel statistical method to investigate the involvement of multiple genes thought to be part of a common pathway for a particular disease. When a gene is found to be associated with the disease, we are also interested in discovering which single nucleotide polymorphisms (SNPs) within that gene are responsible for the association. We present a Bayesian hierarchical modeling strategy that allows for multiple SNPs within each gene, with external prior information at either the SNP or gene level. The model involves variable selection at the SNP level through latent indicator variables and Bayesian shrinkage at the gene level towards a prior mean vector and covariance matrix that depend on external information. The entire model is fitted using Markov chain Monte Carlo methods. Simulation studies show good ability to recover the underlying model. The method is applied to data on 504 SNPs in 38 candidate genes involved in DNA damage response in the Women's Environmental Cancer and Radiation Epidemiology (WECARE) study of second breast cancers in relation to radiotherapy exposures.
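The latent SNP-level indicators mentioned in the abstract lend themselves to gene-level summaries of the kind named in the Figure 8 caption: the posterior probability that at least one SNP in a gene enters the model, and a Bayes factor for that event. The following is a minimal sketch in Python of how such summaries could be computed from MCMC output. It is an illustration only and not the thesis's implementation: the function name `gene_inclusion_summary`, the input layout, and the independent Bernoulli prior with a common inclusion probability `prior_snp_prob` are all assumptions made for this example. The Bayes factor is taken here as posterior odds divided by prior odds.

```python
def gene_inclusion_summary(indicator_draws, snp_to_gene, prior_snp_prob):
    """Summarise MCMC draws of SNP inclusion indicators at the gene level.

    indicator_draws: list of draws, each a list of 0/1 values (one per SNP).
    snp_to_gene:     list giving the gene label of each SNP.
    prior_snp_prob:  assumed prior inclusion probability per SNP, strictly
                     between 0 and 1 (hypothetical independent Bernoulli prior).
    Returns a dict {gene: (posterior_prob, bayes_factor)}.
    """
    n_draws = len(indicator_draws)
    summary = {}
    for gene in sorted(set(snp_to_gene)):
        idx = [j for j, label in enumerate(snp_to_gene) if label == gene]
        # Posterior probability that at least one SNP in the gene is included.
        hits = sum(1 for draw in indicator_draws if any(draw[j] for j in idx))
        post = hits / n_draws
        # Prior probability of the same event under independent indicators.
        prior = 1.0 - (1.0 - prior_snp_prob) ** len(idx)
        # Bayes factor = posterior odds / prior odds.
        if post >= 1.0:
            bf = float("inf")  # every draw included the gene
        else:
            bf = (post / (1.0 - post)) / (prior / (1.0 - prior))
        summary[gene] = (post, bf)
    return summary


# Toy usage: 4 draws, 3 SNPs, the first two SNPs belong to gene "A".
draws = [[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
result = gene_inclusion_summary(draws, ["A", "A", "B"], prior_snp_prob=0.5)
```

Dividing by the prior odds matters because a gene with many SNPs has a higher prior chance of having at least one SNP selected, so raw inclusion probabilities alone would favor larger genes.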
Asset Metadata
Creator
Duan, Lewei (author)
Core Title
Using multi-level Bayesian hierarchical model to detect related multiple SNPs within multiple genes to disease risk
School
Keck School of Medicine
Degree
Master of Science
Degree Program
Biostatistics
Publication Date
05/02/2013
Defense Date
05/02/2013
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
breast cancer, candidate genes, DNA damage response, hierarchical Bayes models, OAI-PMH Harvest, pathway models, WECARE study
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Thomas, Duncan C. (committee chair), Conti, David V. (committee member), Marjoram, Paul (committee member)
Creator Email
leweidua@usc.edu,leweiduan@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-250211
Unique identifier
UC11288168
Identifier
etd-DuanLewei-1645.pdf (filename),usctheses-c3-250211 (legacy record id)
Legacy Identifier
etd-DuanLewei-1645.pdf
Dmrecord
250211
Document Type
Thesis
Rights
Duan, Lewei
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA