Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Characterization and discovery of genetic associations: multiethnic fine-mapping and incorporation of functional information
(USC Thesis Other)
Characterization and discovery of genetic associations: multiethnic fine-mapping and incorporation of functional information
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
CHARACTERIZATION AND DISCOVERY OF
GENETIC ASSOCIATIONS:
MULTIETHNIC FINE-MAPPING AND
INCORPORATION OF FUNCTIONAL
INFORMATION
by
Kan Wang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY (Biostatistics)
December 2018
Copyright 2018 Kan Wang
Table of Contents
Chapter 1 Introduction .............................................................................................................................. 8
1.1 General background: Multiethnic genetic association ....................................................................... 8
1.1.1 Genome-wide association studies .............................................................................................. 8
1.1.2 Fine-mapping ........................................................................................................................... 10
1.2 Multiethnic fine-mapping................................................................................................................ 11
1.2.1 Meta-analysis ........................................................................................................................... 12
1.2.2 Bayesian model selection ......................................................................................................... 17
1.3 Regularized regression .................................................................................................................... 18
1.3.1 Bias-variance Tradeoff ............................................................................................................. 19
1.3.2 Ridge regression ...................................................................................................................... 21
1.3.3 Least absolute shrinkage and selection operator (Lasso) .......................................................... 22
1.3.4 Quantile regression .................................................................................................................. 25
Chapter 2 Genome-wide Association Study of prostate cancer in men of African Ancestry................... 29
2.1 Abstract ........................................................................................................................................... 29
2.2 Introduction ..................................................................................................................................... 30
2.3 Methods .......................................................................................................................................... 30
2.4 Results ............................................................................................................................................ 32
2.5 Discussion ....................................................................................................................................... 33
2.6 Funding ........................................................................................................................................... 33
Chapter 3 Multiethnic Joint Analysis of Marginal SNP Effects .............................................................. 42
3.1 Abstract ........................................................................................................................................... 42
3.2 Introduction ..................................................................................................................................... 43
3.3 Methods .......................................................................................................................................... 45
3
3.3.1 JAM Linear Model for Summary Statistics .............................................................................. 45
3.3.2 Multiethnic JAM ...................................................................................................................... 46
3.3.3 Limitation of full ranked genotype matrix ............................................................................... 48
3.3.4 Bayesian Model Selection ........................................................................................................ 48
3.3.5 Simulation Study on Structured LD ......................................................................................... 50
3.3.6 Simulation Study with Real Data ............................................................................................. 51
3.3.7 Applied examples..................................................................................................................... 52
3.4 Results ............................................................................................................................................ 53
3.4.1 Simulation Study ...................................................................................................................... 53
3.4.2 Case study of prostate cancer ................................................................................................... 57
3.5 Discussion ....................................................................................................................................... 60
3.6 R package R2BGLiMS ................................................................................................................... 61
3.7 Appendix......................................................................................................................................... 62
Chapter 4 Germline Variation at 8q24 and Prostate Cancer Risk in Men of European Ancestry ............ 72
4.1 Abstract ........................................................................................................................................... 72
4.2 Introduction ..................................................................................................................................... 73
4.3 Methods .......................................................................................................................................... 74
4.4 Results ............................................................................................................................................ 75
4.5 Discussion ....................................................................................................................................... 76
Chapter 5 Quantile(Q1) Regularized Regression .................................................................................... 82
5.1 Introduction ..................................................................................................................................... 82
5.2 Methods .......................................................................................................................................... 83
5.2.1 Quantile Regression ................................................................................................................. 83
5.2.2 Quantile(Q1) Penalty ............................................................................................................... 84
5.2.3 Coordinate Descent for Q1-regularization ............................................................................... 86
5.2.4 Q1 Penalty with External Information ..................................................................................... 89
5.3 Simulation Studies .......................................................................................................................... 90
5.3.1 No external information ........................................................................................................... 92
4
5.3.2 With external information ........................................................................................................ 96
5.3.3 High dimensional scenario ..................................................................................................... 100
5.4 Case study: Breast cancer survival prediction ............................................................................... 101
5.5 Discussion ..................................................................................................................................... 102
5.6 R package hierr ............................................................................................................................. 103
Supplemental Paper 1 Two Novel Susceptibility Loci for Prostate Cancer in Men of African Ancestry
................................................................................................................................................................ 104
Abstract ............................................................................................................................................... 104
Introduction ........................................................................................................................................ 104
Methods .............................................................................................................................................. 105
Results ................................................................................................................................................ 106
Discussion ........................................................................................................................................... 108
Funding ............................................................................................................................................... 109
Supplemental Paper 2 Germline Variation at 8q24 and Prostate Cancer Risk in Men of European
Ancestry .................................................................................................................................................. 112
Abstract ............................................................................................................................................... 112
Introduction ........................................................................................................................................ 112
Results ................................................................................................................................................ 113
Discussion ........................................................................................................................................... 116
Methods .............................................................................................................................................. 119
Acknowledgements ............................................................................................................................. 123
5
Acknowledgments
First and foremost, I would like to express my sincere gratitude to my advisor and the chair of
my committee, Dr. David Conti, for his consistent support and guidance throughout my Ph.D. Your
passion and diligence have always inspired me in the past years. I also owe my deepest gratitude to all the
committee members, Dr. Duncan Thomas, Dr. William Gauderman, Dr. Juan Pablo Lewinger, Dr. Jay
Batroff and, Dr. Yan Liu. Thank you all for your insights and advice and I truly enjoy working with you. I
feel very lucky to have the opportunities to work on novel statistical methods in the P01 group with
tremendous support from the students as well as the faculties. I am also very grateful to the multiethnic
cohort (MEC) group, especially Dr. Christopher Haiman for his support and guidance in two applied
projects.
I would like to extend my gratitude to Dr. Paul Newcombe from the University of Cambridge
for his guidance in the mJAM project.
Finally, I want to especially thank my wife Sisi for her unconditional support for the past ten
years.
6
Abstract
While GWAS has been very successful in identifying SNP-trait associations for many complex diseases,
most of the index Single Nucleotide Polymorphisms(SNPs) are only surrogates of the underlying
functional variants. In the post-GWAS era, it is still an open question on how to fine-map the summary
results from GWAS to prioritize the causal variants. High linkage disequilibrium (LD) remains one of the
major challenges for fine-mapping studies as the surrogates and the true functional variants can be nearly
identical in one single population. Many fine-mapping methods were proposed to address this issue,
mainly from two perspectives: aggregating information across multiple ethnic groups where the LDs are
lower or incorporating external functional information. In the past decade, the increasing availability of
GWAS studies from diverse ethnic groups enormously facilitated the implementation of multiethnic fine-
mapping. On the other hand, the enrichment of gene annotations also provides an unprecedented amount
of high-quality functional information for us to exploit.
In this dissertation, we propose two novel fine-mapping methods: multiethnic Joint Analysis of
Marginal SNP effects (mJAM), and Quantile (Q1) regularized regression, to address the problem from the
above two angles respectively. Each of them is also a statistical method motivated by the applied work I
undertook in two other papers. The first one is presented in chapter 2, where we ran a GWAS on African
American population with 10,202 prostate cancer cases and 10,810 controls and identified two novel loci
of prostate cancer susceptibility regions. Naturally, multiethnic fine-mapping is the next step to follow up
on these results. To address this need, we propose mJAM in chapter 3, which can perform joint analysis
on summary statistics across multiple populations in a scalable Bayesian framework. In the next chapter,
we present the second applied paper, where we performed fine-mapping on 8q24, a region of special
interest for prostate cancer, for a combined 74,449 prostate cancer cases and 54,097 controls of men in
European ancestry to prioritize independent signals in the region. We identified 12 SNPs that account for
25% of variance explained for the familial risk of prostate cancer by known genetic variants. With strong
statistical evidence, it’s crucial to further validate these variants with functional information. In chapter 5,
7
we propose another fine-mapping method Q1 regularized regression that can incorporate functional
information in a quantile regression on the second stage to improve both the prediction accuracy and the
model selection performance. We present this method in the framework of genetic studies in the
dissertation, but it is a very generic method that can be readily applied to other fields as well.
8
Chapter 1 Introduction
1.1 General background: Multiethnic genetic association
1.1.1 Genome-wide association studies
Two decades ago, family-based studies were still the mainstream methods in the attempt to detect the
causal variants in genetic epidemiology studies. Despite its success for rare Mendelian diseases,
increasing evidence indicated its limitations for common diseases whose risk is affected by multiple small
effects with low penetrance. Unfortunately, lots of the common diseases with significant socioeconomic
implications share these characteristics. To address this issue, Cordell and Clayton proposed population-
based genetic association studies (Cordell and Clayton, 2005). Candidate gene studies were the herald of
them all.
Back to the date when only a handful of SNPs could be genotyped at a time, candidate gene
studies rely heavily on prior biological information to pre-select a list of high profile candidate variants to
investigate. Therefore, statistical association analysis was limited solely to these candidates. Criticisms of
candidate gene studies include significant selection bias on prior information and inconsistencies with
follow-up studies due to population stratification (Liu et al., 2013). Nevertheless, it was the most fruitful
genetic association studies in the pre-GWAS era as it identified hundreds of genes associated with
common diseases by 2002.
In the early 21
st
century, the breakthrough of genotyping technology made it possible to scan
the entire genome with millions of typed SNPs, providing an overwhelmingly large number of genetic
variants. After 13 years since its launch in 1990, the Human Genome Project (Lauder et al., 2001) was
9
declared completed in 2003, which started a golden era of GWAS with unprecedented success. Contrary
to candidate gene studies, GWAS is an agnostic approach as it does not favor any region or variant based
on the prior biological information. Millions of SNPs are tested marginally one at a time. The total
number of tests performed became so large that extremely stringent significance threshold should be
applied to keep false positives under control. Most GWAS studies adopt a fixed the significance level at
8
5 10
regardless of the exact number of tests performed. This threshold is equivalent to a Bonferroni
correction for one million independent tests at overall type one error of 0.05. Consequently, the sample
sizes required to detect any genome-wide significant signals are quite demanding, as studies with
hundreds of samples are often underpowered to claim any novel findings at this threshold. What makes it
more challenging for the complex diseases is that most of the risk variants have small effect sizes. In
response to these challenges, different research groups started combining their datasets from various
sources and established many consortia to boost the overall power of identifying novel risk loci. For
prostate cancer, the PRACTICAL (Eeles et al., 2009) consortium was founded in 2008. It currently
includes 123 studies with more than 120,000 cases and 100,000 controls. Founded by the National
Human Genome Research Institute in 2008, the GWAS Catalog have documented 3300 curated
publications and more than 60,00 SNP-trait associations by September 2018, and the speed of inclusion
has been accelerating ever since its founding.
Inevitably, GWAS also has its limitations behind its glorious success. Here we briefly go over
several pitfalls that draw the most criticisms. To begin with, the selection bias in published results almost
certainly lead to inflated effect sizes (Hirschhorn et al., 2002). It could be compensated to some extent by
performing meta-analysis or increase the sample size. Moreover, only a small proportion of heritability
has been explained by the combination of all discovered genetic variants for many highly heritable
disease (McCarthy et al. 2008). Even after considering the potential causes of the missing heredity such as
gene-environmental interactions, gene-gene interactions, and rare variants, the proportion accounted for is
still far from satisfactory. Also, most GWAS studies only considered marginal effects without taking into
10
account the LD structures among populations. More often than not, the SNPs identified by GWAS are
only the surrogates of the biologically causal variants. As a result, low replication rates of previously
reported SNPs in different ethnic groups are not uncommon. In the post-GWAS era, fine-mapping
methods are playing an increasingly crucial role in refining the signals from GWAS and prioritizing
candidate functional variants.
1.1.2 Fine-mapping
With the increasingly matured imputation techniques like impute2, minimac3 and BEAGLE, the density
of genetic variants within a single region have become extremely high, with thousands of SNPs reside
within 1MB of a spanning region on the chromosome. As a result of the relatively lower rate of
recombination, SNPs closer to each other are more likely to be co-inherited to the next generation during
evolution, leading to linkage disequilibrium at these loci. Among those correlated SNPs, as many as
hundreds could be genome-wide significant in a large GWAS study. Take the region 8q24 for prostate
cancer as an example, 1077 SNPs were marginally genome-wide significant with p-values less than
8
5 10
in Illumina Oncoarray with 53,449 cases and 36,224 controls of European ancestry. In this
scenario, univariate analyses like GWAS do not take full advantage of the LD structures to further
localize the candidate functional variants. In response to the limitation of marginal estimates, fine-
mapping methods usually take advantage of correlation structures of a population under a joint or
conditional model to prioritize the causal SNPs.
There are two popular strategies to seek improvement for most fine-mapping methods:
introducing more appropriate statistical models from the data analysis perspective or incorporating more
accurate external information from a biological perspective. As one of the most straightforward model
selection methods, the stepwise selection is still widely used despite its lack of robustness and reliance on
individual-level data. More comprehensive model selection methods like the Integrative Bayesian Model
Uncertainty(iBMU) incorporate biological information under a Bayesian framework (Quintana and Conti,
11
2013), which leads to significant power increase compared to lasso and elastic net. While the above
methods require individual-level data, methods that require only summary statistics are gaining increased
popularity since they are far more accessible. To better utilize a large amount of available GWAS
summary results, Newcombe proposed Joint Analysis of Marginal statistics(JAM), where a joint model is
constructed from marginal results with the assist of a reference genotype matrix. (Newcombe, Conti and
Richardson, 2016). From the study design perspective, fine-mapping in African populations, versus in
other ethnic groups like Europeans and Asians, could end up with fewer selected candidate SNPs
empirically because Africans have relatively smaller LD blocks from more recombination events in their
long evolutionary history. However, due to logistics and coordination challenges, it is more realistic to
follow up on existing GWAS results as they are, where the samples sizes of available GWAS in the
African population are usually much smaller than the European population. In this regard, multiethnic
fine-mapping is becoming an increasingly powerful tool to prioritize functional genetic variants as it can
take advantage of the different LD structures in diverse populations.
1.2 Multiethnic fine-mapping
In the past decade, with the availability of GWAS results from diverse ethnic groups, multiethnic fine-
mapping has gained increased popularity. However, it is always challenging to analyze SNP-trait
associations across multiple ethnic groups. On the one hand, the variation in allele frequencies and LD
structures could be huge for specific regions. On the other hand, prevalence, aggressiveness and survival
rate of the disease all vary among diverse populations. In addition to these fundamental differences,
confounding factors like gene-environment interactions, lifestyles, and diets are also non-negligible.
Despite the technical challenges of analyzing multiple cohorts, their heterogeneities can also be the
fundamental advantage of multiethnic fine-mapping over methods confined to one population.
Specifically, many methods are proposed to leverage heterogeneous LD structures either explicitly or
12
implicitly to improve the fine-mapping resolution. The group of meta-analysis, including fixed-effect
meta-analysis, is currently the most widely applied method in multiethnic fine-mapping.
1.2.1 Meta-analysis
Genome-wide association studies have been a very successful tool as it discovered a significant number
of disease-associated loci with relatively large effect sizes in the past decade. However, an even richer
pool of small effect loci remains mostly untouched by single genome-wide association studies due to their
limited sample sizes. Increasing the sample size of every single study to the desired level with tens of
thousands of participants can be extremely expensive. On the other hand, with an increasing number of
GWAS results available, combining information from existing GWAS is much more cost-efficient. Meta-
analysis is among the most popular tools to aggregate marginal results from different GWAS to boost the
power in identifying disease-associated loci.
One inevitable dilemma to consider for any meta-analysis is the tradeoff between sample size
and between-study heterogeneity. It is tempting to pull all available studies together to reach maximal
sample size, but introducing too much heterogeneity can ultimately hurt the power. If one chooses to
minimize heterogeneity by including only one study, it will defeat the purpose of performing a meta-
analysis in the first place. In practice, there is always a delicate balance between increasing sample size
and keeping between-study heterogeneity in check. The choice of any specific meta-analysis methods also
implies different assumptions on heterogeneity. Fixed-effect assumes a common effect across all studies,
whereas random-effect explicitly model the heterogeneity, allowing each study to have a different effect
size. One interesting observation about the power of these two methods is that the random-effect model
almost never generates more significant p-values than fixed-effect, even in the scenario that the actual
effects are entirely different across studies. This astonishing discovery induced the new random-effect
model, which relaxed the overly stringent null hypothesis of the standard random-effect model. We will
introduce these three methods in the following sections with more details.
13
Denote
ˆ
{ , 1,..., }
k
kK the marginal estimates for a given SNP in several studies, and their
standard errors are
ˆ
{ ( ), 1,..., }
k
SE k K . These are the summary statistics to be meta-analyzed.
Theoretically, if the sample size is large enough, we can assume the study-specific estimates come from a
normal distribution with the mean being their true effect size
k
.
ˆ
k k k
, where ~ (0, )
kk
NV
The variance
k
V can be approximated by the sample variance. Since the mean is zero, it is
equivalent to the squared standard error
2
ˆ
()
k
SE .
Fixed-effect meta-analysis
The fixed-effect model assumes there is a uniformed fixed effect for all the studies combined in the meta-
analysis. Namely,
k
for all 1,..., kK . The maximum likelihood estimator (MLE) for the true
effect size for the fixed-effect model is a weighted average of the effects from individual studies with the
inverse of variance being the weights.
1
1
ˆ
( / )
ˆ
(1/ )
K
kk
k
FE K
k
k
V
V
A likelihood ratio test with one degree of freedom can be used to test if the effect is zero:
2
2
1
1
ˆ
( / )
(1/ )
K
kk
k
FE K
k
k
V
V
14
Random-effect meta-analysis
The traditional random-effect model assumes the study-specific effect sizes are sampled from a normal
distribution with the between-study variance
2
, which can be estimated using the method of moments
(DerSimonian and Laird, 1986). Similar to the fixed-effect, the MLE for the effect estimate is in the form
of a weighted average of study-specific estimates. The only difference is that the weights are the
reciprocals of the sum of between-study variance
2
and estimator variance
2
ˆ
()
k
SE :
2
1
2
1
ˆ
/ ( )
ˆ
1/ ( )
K
kk
k
RE K
k
k
V
V
The likelihood ratio test for the main effect is still a 1-df chi-squared test with the following statistic:
22
2 1
2
1
ˆ
( / ( ))
1/ ( )
K
kk
k
RE K
k
k
V
V
The weights from the random-effect model are more balanced compared to fixed-effect as the
between-study variance
2
is a common term in the denominator for each study. As opposed to fixed-
effect, where the large studies usually dominate the weighted estimation, smaller studies in the random-
effect model will have relatively larger weights. The random effect model does not aim to estimate one
common effect across all studies but to estimate the mean of a distribution of true effects that could vary
between studies.
One concern about the random-effect model is that it implicitly assumes heterogeneity under
the null hypothesis by normalizing the effect size to get the test statistics. This assumption is overly
conservative and leads to less significant p-values under almost all scenarios than the fixed-effect even
when the actual effects are heterogeneous across different groups. To address this issue, Han and Eskin
15
proposed a new random-effect model that combines the null assumption in the fixed-effect and the
alternative assumption of the traditional random-effect (Han and Eskin, 2011). In this framework, the
MLEs can be computed iteratively with the following update rules (Hardy and Thompson, 1996):
22
1
1
22
1
ˆ
ˆ / ( )
ˆ
ˆ (1/ ( ) )
K
t k t
k
t K
kt
k
V
V
2
1
22
2 1
1
22
1
ˆ
ˆ ()
ˆ ()
ˆ
1
ˆ ()
K
k t k
k kt
t K
k kt
V
V
V
With the MLEs, the test statistics for the new random effect is:
22
22
1 1 1
ˆˆ
ˆ ()
log( )
ˆˆ
K K K
k k k
new
k k k
k k k
V
S
V V V
When the total number of studies is sufficiently large,
new
S follows a mixture of
2
1
and
2
2
with
equal weights. With only a handful of studies, the asymptotic p-value tends to be conservative, thus
tabulated values should be used instead (Han et al., 2011).
Bayesian meta-analysis
With an increasing pool of genome-wide association studies from multiple ethnic groups, trans-ethnic
meta-analysis provides an opportunity to increase the power of identifying the causal variants in fine-
mapping by taking advantage of different LD structures of diverse populations. However, the
heterogeneity of both the effect size and the spectrum of the causal variants among remotely related
groups could be so tremendous that fixed-effect meta-analysis would not be appropriate. On the other
hand, random-effect meta-analysis assumes population-specific effects, which does not take into account
16
the genetic relatedness among all study populations. Intuitively, we would expect the effect sizes are more
similar for two groups that are closely-related than those that are distant from each other. Morris proposed
a meta-analysis of trans-ethnic association studies (MANTRA) in a Bayesian framework that groups
studies into clusters based on the relatedness and performed a group-heterogeneous fixed-effect meta-
analysis, where studies in the same group share a uniformed effect size while allowing variations among
different groups.
Under the Bayesian framework, the Bayes' factor is used to quantify the evidence of alternative
model
1
H versus the null model
0
H . It is a ratio of the two marginal likelihoods of the observed effects
under
1
H and
0
H respectively. Namely,
1
0
ˆ
( , )
ˆ
( , )
L V H
L V H
The marginal likelihoods can be computed by integrating the joint posterior density over the
unknown parameter space, which contains the groping information and the parameters of the distributions
of effect sizes within each group.
ˆˆ
( , ) ( , ) ( ) L V H L V P H d
Then, the Metropolis-Hastings MCMC algorithm can be used to get an approximation of the
likelihood of the observed effects under given parameters. In combination with an appropriate prior
distribution for the parameter, we could estimate the marginal likelihood, and therefore the Bayes' factor
as well. Simulation results indicate that MANTRA is more powerful than both the fixed-effect and the
random-effect meta-analysis under heterogeneous effects while remains the same power as the fixed-
effect when the effects are homogeneous.
When the overlapping of SNPs among studies is rather sparse, focus solely on SNPs typed in all
studies results in a small group of unrepresentative variants. Also, imputation methods are unable to work
with only summary statistics. In this case, we can use a multilocus Bayesian meta-analysis proposed by
17
Newcombe et al. to accommodate summary information from all the studies through a Bayesian
hierarchical model.
1.2.2 Bayesian model selection
In the proposed method in chapter 3, we applied Bayesian model selection to integrate the information
from low-dimensional models to provide predictor-level inclusion probabilities. Here we give a brief
overview of Bayesian model selection.
Assume our outcome y
and predictors X are of the same format as in the previous section.
Also, we denote
nq
ZR
to be a matrix of measured confounders that will be included in every model.
For each model in the model space MM
, we can use a generalized linear model to link the
predictors to the mean of the outcome variable :
1
0
: [ ] M g Z X
where g is the links function,
0
is the intercept, is the parameter estimate vector of the confounders,
X
specifies the subset of predictors to be included in the model M
and
is the effect of these
predictors in this model.
Given all possible models as a set M , posterior model probabilities of any model M
measure the degree
of support given by the data to one specific model:
( ) ( )
()
( ) ( )
MM
P y M P M
P M y
P y M P M
18
where () PM
is the prior probability of model M
and () P y M
is its marginal likelihood, which can
be approximated by its MLE. For predictor level inference, posterior inclusion probability of one
predictor
j
X is computed as the sum of all posterior model probabilities that include
j
X :
1
:
( ) ( )
j
j
MM
P y P M y
1.3 Regularized regression
Due to the enormous amount of variants across the whole genome, genetic data usually have more
features than total samples. In this high dimensional setting, regularized regression is a powerful model
selection method to reduce the dimensionality and prioritize functional genetic variants. In this section,
we will introduce L1 Lasso and L2 ridge regression as two benchmarks. Afterward, we will review some
fundamental properties of quantile regression, as a preparation for our novel method quantile (Q1)
regularized regression that will be discussed in chapter 5.
Consider one of the most common regression scenarios with one continuous response variable
n
yR and a matrix of predictors
np
XR
. It is straightforward to compute the ordinary least squares
(OLS) estimates by minimizing the sum of squared errors:
2
2
ˆ
arg min
p
OLS
R
yX
Its analytical
solution has the form
1
ˆ
()
TT
OLS
X X X y
. There are two major concerns about OLS when the number
of features is more than or close to the number of observations. First, OLS has rather poor prediction
accuracy in these scenarios because the variance of OLS estimates scales linearly with the number of
features. This leads to the discussion of the bias-variance tradeoff in the prediction model. To be more
specific, some biased estimators with dramatically smaller variance could help to minimize the mean
squared error (MSE). Soft-threshold operator and regularized regression are two well-developed
techniques to address this tradeoff. We will present more details about this topic in the following section.
19
The second concern is overfitting. In the high dimensional scenario, the interpretation of OLS estimates is
meaningless as there is an infinite number of solutions and at least one variable will have a positive
estimate in one solution and negative estimate in another. To deal with this matter, constraining the
estimators by a regularization set is, again, among the most popular techniques to improve OLS estimates
in high dimensions. With an ever-growing pool of regularized sets thanks to the booming popularity of
big data, L1 and L2 remain the most widely used regularized regressions.
1.3.1 Bias-variance Tradeoff
In supervised learning, the risk of a predictor can be quantified by the expectation of a previously chosen
loss function given the actual distribution of the data. However, it would defeat our purpose of
constructing a prediction model if the underlying distribution is known. Therefore, the empirical risk is
used instead in practice, where we take the sample mean of the loss function across all observations. In
theory, the empirical risk approaches the true risk as the number of observations increases to infinity.
However, the sample size is usually quite limited in reality. When the model is complicated enough, the
empirical risk can always be as small as zero, but the predicting ability drops dramatically due to
overfitting. To better understand the components that drive the prediction error, we need to decompose it
into several parts and analyze their relations.
Suppose there are N observations in the training dataset {( , ), 1,..., } i
i
D x y i N , the
prediction function learned from this data is ()
D
hx , and the loss function is denoted as ( ( ), )
D
l h x y . For
simplicity, we will derive the decomposition in the case of a squared loss
2
( ( ), ) [ ( ) ]
DD
l h x y h x y .
However, it is worth mentioning that similar decomposition is still valid for other loss functions like zero-
one loss and real-valued loss, under certain conditions.
Given the joint distribution of training data ( , ) p x y , the prediction function is a function
concerning this quantity. Therefore, its risk has the form
2
[ ( )] [ ( ) ] ( , )
DD
xy
R h x h x y p x y dxdy
. The
20
averaged risk over the training data then follows
2
[ ( )] [ ( ) ] ( , )
D D D
D x y
E R h x h x y p x y dxdydD
. We
will show this averaged risk can be decomposed into three components: variance, bias, and noise.
The strategy we used to separate the three terms of the above form one by one is very
straightforward. First, the averaged prediction ()
DD
E h x is subtracted from the averaged risk to parse out
variance. Then, the averaged target []
y
Ey is subtracted from the remaining term in a similar fashion to
give us the final form of bias and noise. The trick for this decomposition is the cross-terms are zeroes in
both steps.
Step1: Parse out the variance
22
[ ( )] [ ( ) ( )] ( , ) [ ( ) ] ( , )
D D D D D D D
D x y D x y
Variance
E R h x h x E h x p x y dxdydD E h x y p x y dxdydD
Step 2: Since the remaining does not depend on the training data anymore, we can simplify the integral
22
[ ( ) ] ( , ) [ ( ) ] ( , )
D D D D
D x y x y
E h x y p x y dxdydD E h x y p x y dxdydD
Apply similar trick as step 1 with zero cross-term, we have:
2
2 2 2
[ ( ) ] ( , ) [ ( ) [ ]] ( , ) [ [ ] ] ( , )
D D D D y y
x y x y x y
Noise
Bias
E h x y p x y dxdydD E h x E y p x y dxdydD E y y p x y dxdydD
This completes the decomposition:
2
Risk Variance Bias Noise
In order to construct a predictor with a smaller expected risk, we can either reduce its bias or
shrink its variance as the noise term is independent of our choice of predictors or loss functions. Typically,
complicated models tend to have less bias and larger variance than simpler ones. We could often find
ourselves in the scenario that decreases one would increase the other. To achieve minimal bias, we could
always memorize the labels of the entire training data. However, this model would perform horribly on
test data due to its enormous variance. On the other hand, a predictor with a constant function has zero
variance but massive bias under most circumstances. Either example sacrifices too much on the other
element in the process of minimizing one of them. To better utilize this tradeoff, it often ends up with a
21
compromise of choosing a biased parsimonious model that the reduction of variance is worthy of the bias
introduced.
1.3.2 Ridge regression
Ridge regression was proposed by Hoerl and Kennard in 1970. Its primary motivation is to obtain a better
estimator than ordinary least squares estimator with a smaller mean squared error. It has been noticed that
when the prediction vectors are far from orthogonal, the prediction variance could be so huge that the
estimates might have the wrong signs in some extreme scenarios. In ridge regression, instead of
estimating the parameters in ' XX that has small eigenvalues and hence deviates considerably from a unit
matrix, the estimation is based on ' , 0
p
X X kI k . By adding a small positive value on the diagonal
terms, the new system
*
ˆ
[ ' ] '
p
X X kI X Y behaves more like an orthogonal system.
In reducing the mean square error, ridge regression surely is taking advantage of the bias-
variance tradeoff discussed previously. Let , 1,...,
i
ip be the eigenvalues of ' XX and
2
be the
variance of the errors under the normality assumption. The total squared error can be decomposed into
bias and variance as follows:
* * 2 2 2 2
1
ˆˆ
[( )'( )] / ( ) '( ' )
p
ii
i
E k k X X kI
Denote
22
1
1
( ) : / ( )
p
ii
i
kk
and
22
2
( ) : '( ' ) k k X X kI
. Notice that
1
() k is a
continuous, monotonically decreasing function of k. While
2
() k is continuous and monotonically
increasing with respect to k. It is equivalent to ordinary least squares when there is no regularization.
Namely, k equals to zero. Then the bias term
2
() k is zero. However, as we could quickly discover when
' XX has one or several small eigenvalues, the total variance
1
() k increases dramatically when k is set
to be zero. This is a typical example where we are willing to introduce some bias in
2
() k to reduce the
variance in
1
() k that could help reduce the overall estimation error.
22
The intuitive justification of using a biased estimator as in ridge regression is that the
information we have in the data can imply some reasonable range for the estimator. Therefore, imposing
prior bounds on the estimates here is the right thing to do, even at the cost of introducing bias.
To formalize ridge regression, constraining estimates or penalize the objective function are two most
commonly used forms.
Constraint form: min ( )
p
R
fX
subject to
2
2
t for some t and a smooth convex loss function f .
Equivalently, the penalized form is more popular:
2
2
min ( )
p
R
fX
for some
1.3.3 Least absolute shrinkage and selection operator (Lasso)
Lasso regression was proposed by Tibshirani in 1996. By then, the standard alternatives for ordinary least
squares estimators are subset selection and ridge regression. As discussed previously, ridge regression
induces
2
L regularization that shrinks coefficients and reduces prediction variance. It can obtain stable
estimates, but it does not shrink any components to zero. Therefore, it is unable to perform variable
selection given a large number of features. Empirically, there are usually many small estimates from the
ridge regression, which is difficult to interpret. On the other hand, the subset selection uses a
0
L
regularization that penalizes directly on the number of nonzero coefficients in the model. It is a valid
variable selection method, but its selection results could be extremely unstable as tiny changes in the data
could lead to entirely different sets of selected variables. In addition, the computational burden of subset
selection is usually too challenging to scale up to high dimensions since the optimization object is neither
continuous nor convex.
The motivation of Lasso came from the non-negative garotte proposed by Breiman in 1993. The
constraint form is as follows:
02
1
ˆ
min ( )
N
i j j ij
ij
y c x
subject to 0,
jj
j
c c t
. Intuitively, it
starts with the ordinary least squares estimates
0
ˆ
and weights them by some non-negative values with a
23
constraint sum. This method shrinks some of the estimates to zero and tends to be more stable than subset
selection. Its prediction error is lower than subset selection in almost all scenarios and is comparable to
ridge regression unless there are many small non-zero coefficients in the actual model. One crucial
limitation of non-negative garotte is its dependence on both the sign and magnitude of ordinary least
squares estimates, which themselves could be misleading. Lasso gets around this limitation completely.
Its coefficients could have different signs than the OLS estimates.
The idea and the formulation of Lasso are similar to that of the ridge regression. It also has
constraint and penalized forms, using
1
L regularization instead of
2
L to shrink the estimates.
Constraint form: min ( )
p
R
fX
subject to
1
t , for some t and a smooth convex loss function f .
Penalized form:
2
2
min ( )
p
R
fX
for some .
Under Kuhn-Tucker conditions, these two forms are equivalent.
In a Bayesian framework, the Lasso estimates can be interpreted as the solution that maximizes
the posterior distribution of model parameters, given the prior distribution being an independent double-
exponential. Interestingly, the ridge estimates are equivalent to the maximum a posteriori (MAP)
estimator using a normal prior. Intuitively, double exponential distribution has a steeper peak and larger
tails on both ends than the standard normal distribution. It is not surprising that in most cases, Lasso
estimates tend to be either larger than ridge estimates or precisely zero.
To better visualize the effect of different regularizations adopted by Lasso and ridge regression,
we refer to the classic graphical representation in the 2-dimensional parameter space for linear regression.
The objective function in constraint form
2
1
min ( )
p
N
i j ij
R
ij
yx
is equivalent to the quadratic function
00
ˆˆ
min( ) ( )
p
TT
R
XX
. In the 2-d case, it is an elliptical contour centered at the OLS estimates in
the parameter space with principal axes at 45 degrees to the co-ordinate axes. Without regularization, the
optimal value of the quadratic function would be zero when we take the OLS solution. The Lasso
24
constraint is a rotated square while the ridge constraint is a circle. Both regularization sets are centered at
the origin. The constraint solution for the minimization problem is where the contour first touches the
regularization set as the solution deviates from OLS estimates.
Figure from Chapter 3 of Hastie et al. (2009)
For Lasso, the contour meets the square meet at its corner. Namely,
2
is shrunk to zero in this
example. While for ridge regression, both
1
and
2
are shrunk, but neither is zero. From this graphical
example, we can see the
1
L regularization set is more likely to meet the elliptical contour at its corner,
generating more zero estimates than the
2
L regularization set with a ball shape.
Both
2
L ridge and
1
L Lasso, as well as
0
L subset selection, utilize the tradeoff between variance
and bias to reduce the mean squared errors for prediction. Computationally, the former two have convex
objective functions and therefore are computationally tractable. Ridge regression even has a closed form
solution. Subset selection is discrete and difficult to compute in high dimensional scenarios. From the
variable selection perspective, both Lasso and subset selection are valid choices. However, the selection
of Lasso tends to be more robust to minor changes in the data. On the other hand, ridge regression suffers
from its relatively lower interpretability due to a huge amount of small nonzero coefficients. If compared
by their mean squared errors, all three methods have the potential to beat the other two competitors under
25
specific scenarios. When the number of true effects is rare, and their effect sizes are large, subset selection
has the best performance. If, on the contrary, there are a large quantity of small sized effects, ridge
regression tends to beat the other two. Finally, if both the effect sizes and the number of true effects are
not as extreme, Lasso is likely to out-perform both ridge and subset selection.
1.3.4 Quantile regression
To summarize the relationship between a list of predictors and the variable of interest, linear regression
uses the conditional mean of the outcome given the predictors. Standard linear regression applies ordinary
least squares estimates, which minimizes the sum of squared errors. One fundamental assumption of OLS
is that the errors are normally distributed. In practice, small deviations from normality usually can be
compensated by large sample size. However, ordinary least squares estimates would be very inefficient if
the errors are highly non-normal, especially when they have long-tails in the distribution. When the
number of outliers is at least moderate, estimators that tend to put less weight on the extreme observations
than least squares are more desirable. In these scenarios, one natural alternative summary statistic to
consider is the conditional median since it is more robust to outliers and non-normal errors than the
conditional mean. Median regression is also known as the least absolute deviation regression, with loss
function being the sum of absolute errors. It can be further generalized to quantile regression by adopting
different penalties for overprediction and underprediction. To be more specific, the sum of absolute errors
is weighted by 1 and respectively in the objective function. This asymmetric penalty provides more
flexible ways to characterize the data. By specifying the desired conditional quantile to investigate, we
can capture the heterogeneities in various parts of the distribution.
Before presenting the formal definition of quantile regression, it would be helpful for us to
develop the intuition of the sample quantiles as the solution of an optimization problem. For a continuous
response variable { : 1,..., }
t
y t T , the th sample quantile is the optimal solution to the following
minimization problem:
26
{ : } { : }
min[ (1 ) ]
tt
tt
R
t t y t t y
yy
(1)
Proof: Let
{ : } { : }
( ) : (1 )
tt
tt
t t y t t y
f y y
and y
be the th sample quantile of y .
We will show that for any
*
yR
,
*
( ) ( ) f y f y
Without loss of generality, we assume
*
yy
and denote
*
: y y y
**
* * *
{ : } { : } { : } { : }
( ) ( ) (1 ) (1 )
tt tt
t t q t t
t t y y t t y y t t y y t t y y
f y f y y y y y y y y y
**
**
{ : } { : } { : } { : }
( ) ( (1 ) (1 ) )
tt tt
t t t t
t t y y t t y y t t y y t t y y
y y y y y y y y
**
**
{ : } { : } { : } { : }
( ) (1 ) ( )
tt tt
t t t t
t t y y t t y y t t y y t t y y
y y y y y y y y
(2)
Notice that
* * * *
**
{ : } { : } { : } { : }
t t t t
t t t
t t y y t t y y t t y y t t y y
y y y y y y y y y
Therefore we have
* * *
*
{ : } { : } { : } { : }
t t t t
t t t
t t y y t t y y t t y y y t t y y
y y y y y y y
(3)
Similarly,
**
**
{ : } { : } { : } { : }
tt tt
t t t
t t y y t t y y t t y y t t y y y
y y y y y y y
(4)
Plus (3) and (4) into (3):
* * *
**
{ : } { : } { : } { : }
( ) ( ) (1 ) ( ) ( )
t t t t
tt
t t y y t t y y y t t y y t t y y y
f y f y y y y y y y
(5)
27
Notice we have the following inequalities:
*
*
{ : }
0
t
t
t t y y y
yy
,
**
{ : } { : }
tt
t
t t y y y t t y y y
y y y
Thus (5) can be simplified as the following:
**
*
{ : } { : } { : }
( ) ( ) (1 ) ( )
t tt
t t y y t t y y t t y y y
f y f y y y y
**
{ : } { : } { : } { : } { : }
()
t t t tt
t t y y t t y y t t y y t R t t y y y t t y y
y y y y y y
By the CDF definition of th quantile,
{ : }
t
t t y y t R
yy
We have proved that for any
*
yR
:
*
( ) ( ) f y f y
, where the equality holds if and only if
*
yy
.
Namely, the solution of the minimization problem (1) yields the th sample quantile.
With the above consistency in mind, we can generalize the minimization problem to incorporate a
sequence of the k-dimensional vector { : 1,..., }
t
x t T . The th conditional quantile is precisely the
solution to the minimization problem:
{ : } { :
min[ (1 ) ]
k
t t t t
t t t t
R
t t y x t t y x
y x y x
(6)
Define loss function ( ) : ( ) (1 )( ) u u u
The above objective function has the form ()
tt
t
yx
, where () u is a tilted version of the
absolute function () f x x
28
The following three plots show () u
for 0.5 , 0.1 and 0.9 respectively
To facilitate computations, we would like to use the equivalent form of the quantile loss function:
11
( ) ( ) (1 )( ) ( )
22
u u u u u
Involving either absolute value or indicators, the objective function of quantile regression is
non-differentiable. Therefore no analytical solution is available. By taking advantage of its convexity,
several efficient algorithms can provide numerical solutions to this problem. The most popular solver is
the simplex method in linear programming. Since quantile regression only deals with the order of the
errors, not the scale of them, it is invariant under all monotonic transformations. It is proven that the
quantile regression estimator is asymptotically normal with analytical variance. However, it is rather
difficult to estimate given its complex form. Re-sampling methods such as bootstrap are usually
implemented to get an estimation of the standard errors instead.
Quantile regression might not be the most efficient for a specific error distribution, but it is
flexible enough to deal with a wide range of parametric models with relatively high efficiency. If the
errors are indeed normally distributed, least squares have slightly higher efficiency than quantile
regression. On the other hand, it would be much safer to use the later when there is uncertainty about the
correct parametric model so to avoid catastrophic scenarios like the least squares' zero efficiency for
Cauchy distribution.
29
Chapter 2 Genome-wide Association Study of
prostate cancer in men of African Ancestry
This chapter contains the work I undertook in the following paper:
Conti, D., Wang, K., et al. (2017). Two Novel Susceptibility Loci for Prostate Cancer in Men of African
Ancestry. Journal of National Cancer Institute 109(8): djx084
2.1 Abstract
It is a challenging task locating the genetic variants that are potentially causing a particular disease among
more than 100 million candidate single nucleotide polymorphisms (SNPs) across the entire genome. First,
the number of features is much larger than the number of samples, which usually put us in a high
dimensional scenario. Second, the correlation among SNPs close to each other can be very high in one
particular population. Also, aside from a handful of variants with large effect sizes, the majority of the
SNP-trait associations have small effect sizes, especially for complex diseases like cancer. With all these
challenges, it is unlikely to accomplish the task with one stroke. To circumvent the high dimensional
problem, people resort to the marginal models like the Genome-wide association studies (GWAS), where
variants are analyzed only one at a time across the genome. Despite its limitations in addressing the other
two challenges, GWAS have been very successful in identifying risk loci associated with the disease. In
this chapter, we performed a GWAS in 10,202 prostate cancer cases and 10,810 controls in men of
African ancestry to search for risk loci that potentially account for the additional risk of prostate cancer in
African Americans than in other ethnic groups. We identified two novel risk loci on chromosomes 13q34
and 22q19. At a known risk loci 8q24, we found five independent signals that were conditional genome-
wide significant. We included these seven SNPs, together with the previously reported prostate cancer
30
risk variants, to construct a polygenic risk score (PRS). The top 10% of men with the highest PRSs have a
3-fold prostate cancer risk compared to men with average risks.
2.2 Introduction
African American men have a 1.6-fold prostate cancer incidence compared to other ethnic groups
(Kolonel et al., 2000). The underlying factors that drive this difference may come from multiple aspects:
differences in genetics, environmental exposures, and socioeconomic status. To investigate the genetic
factors that contribute to the disparities, we performed a GWAS on African American population to
identify additional risk loci for prostate cancer. For regions with multiple signals, we conducted a model
selection to obtain a set of independent variants that are more representative for the risk loci than
previously reported variants. We constructed a PRS with the novel variants we identified together with a
list of know PCa risk variants, to capture the overall PCa risk in the population (Supplemental Table 1).
2.3 Methods
To boost the power of identifying risk loci with small effect sizes, we conducted a fixed-effect meta-
analysis to combine the GWAS results from 4 studies with a total of 10,202 prostate cancer cases and
10,810 controls. The four studies are the African Ancestry Prostate Cancer GWAS Consortium with
4,853 cases and 4,678 controls (Han et al., 2016), the Ghana Prostate Study with 474 cases and 458
controls (Cook et al., 2014), the Kaiser/ProHealth Prostate Cancer Study with 601 cases and 1,650
controls (Hoffmann et al., 2015), and the ELLIPSE/PRACTICAL OncoArray Consortium with 4,274
cases and 4,024 controls.
Within each study, we used marginal logistic regressions with study-specific covariates to test
17.8 million risk variants one at a time through a Wald test across the entire genome for their association
31
with the prostate cancer. In this analysis, we focused only on common variants with minor allele
frequencies more than 1% in the African American population. We adopted 5.00×10
-8
as the threshold for
a genome-wide significant P value.
For regions with evidence of multiple signals like 8q24, we performed a revised forward
selection that uses meta statistics as the inclusion criteria to identify independent risk variants. We limited
our candidates to be the SNPs with genome-wide significant marginal P values from the meta-analysis of
four studies. At each step, logistic regressions of all remaining candidate SNPs were examined in two
largest datasets AAPC and Oncoarray separately, conditional on the variants already included in the
previous steps. We then combined the conditional estimates of each remaining candidate through an
inverse variance-weighted meta-analysis. The SNP with the smallest meta P value will be included in the
model if it is still significant (< 5×10
-5
). We started this process with the null model in both populations
and terminated when no remaining candidate was significant in the meta-analysis conditional on all
selected variants.
To examine if the risk variants identified are representative of the population prostate cancer
risk, we constructed a polygenic risk score (PRS) for each individual by averaging their risk allele across
the entire chromosome. The set of SNPs include previously known prostate cancer risk variants plus two
novel variants we discovered at 13q34 and 22q12 as well as the set of independent SNPs at 8q24. Then,
we can group the samples into percentile categories with respect to their PRSs with 7 categories (<1%, 1-
10%, 10-25%, 25-75%, 75-90%, 90-99%, >99%). The relative risk of each category compared to the
interquartile range (25-75% group) is estimated through a logistic regression with the covariates in the
Oncoarray data.
32
2.4 Results
In the meta-analysis combining four studies, 775 SNPs had marginally genome-wide significant P values
(< 5.00×10
-8
) (Supplemental Figure 1). Most of these SNPs reside in the known susceptibility regions for
prostate cancer such as 8q24, 2p15(EHBP1), 2q37(MLPH), 6q22(RFX6), 8p21(NKX3-1),
10q11(MSMB), 11q13(MYEOV), 12q13(KRT8), 17q21(ZNF652), and Xp11(NUDT11/LINC01496).
Especially for 8q24, 501 SNPs were genome-wide significant. In addition to previously known regions,
two novel regions 13q34 and 22q12.1 were discovered with genome-wide significant risk-associated
alleles. At 13q34, rs75823044 (1.5% frequency, OR = 1.55, 95% CI = 1.37 to 1.76, P = 6.10×10
-12
) was
the most significant candidate and was correlated with four other alleles within 45kb of the insulin
receptor substrate 2 (IRS-2) and 20 kb of a long noncoding RNA (LINC00676). One of them
rs151190668 (OR = 1.67, 95% CI = 1.43 to 1.96, P = 1.70×10
-10
) is potentially a functional variant. At
22q12.1, rs78554043 (1.5% frequency, OR = 1.62, 95% CI =1.39 to 1.89, P = 7.50×10
-10
) was the most
significant variant. It is highly correlated with a missense polymorphism rs17886163 (P = 1.38×10
-9
) in
the CHEK2 gene (Table 1).
At 8q24, 501 SNPs were marginally genome-wide significant from the meta-analysis. A
considerable number of SNPs are moderately correlated with the top hit (Supplemental Figure 2). A novel
triallelic (A/T/G) variant rs72725854 was the most significant risk variant across the genome. The risk
allele T (OR = 2.33, 95% CI = 2.16 to 2.50, P = 1.08×10
-109
) is exclusive in the African population. In the
meta forward selection, another four SNPs rs72725879 (OR = 1.31, P = 2.62×10
-30
), rs7463326 (OR =
1.29, P = 9.67×10
-16
), rs7812894 (OR = 1.15, P = 4.78×10
-6
), and rs12549761 (OR = 1.30, P = 7.18×10
-6
)
were also conditionally significantly associated with the prostate cancer (Table 2).
Among the 100 previously known prostate cancer risk variants, we included all 94 of those that
are polymorphic in the African American population with MAF larger than 0.05. In addition to these
variants, we included two novel signals at 13q34 and 22q12.1 as well as the five SNPs at 8q24. For the
33
group of men in the top 10% of the PRS, the relative risk of getting prostate cancer is 3.0 times (95% CI =
2.5 to 3.6) of that in the baseline group (interquartile range of PRS) in the African population. This result
is consistent with the previously published result in the European population where the top 10% PRS
group has a 2.9-fold relative risk (95% CI = 2.75 to 3.12) compared to the group in the interquartile range
of the PRS (Table 3).
2.5 Discussion
African American population has the lowest LD compared to other ethnic groups, which is ideal for
genetic association studies. However, even after combining four large studies, the total sample size of our
study is still limited compared to GWAS in the European population, which limited the statistical power
to identify rare variants (MAF < 0.05) with moderate effect sizes (OR < 1.22). To follow up on the ever-
increasing amount of GWAS results across multiple ethnic groups, we can resort to multiethnic fine-
mapping to aggregate information and increase the power of identifying the causal variants. We will
present a novel method for multiethnic fine-mapping in the next chapter to address this need.
In addition to introducing multiple populations, incorporating biological information is another
way to improve the fine-mapping resolution. In this study, functional information was not integrated into
the discovery of the risk variants. In chapter 5, we will present a novel feature selection method that can
integrate external information in the discovery phase of genetic associations.
2.6 Funding
This work was supported by the National Cancer Institute at the National Institutes of Health, ELLIPSE
GAME-ON U19 initiative for prostate cancer (U19 CA148537).
34
Table 1. Association results for risk variants at 13q34 and 22q12.1
Meta-Analysis
SNP ID
Nearby
Genes
Alleles
b
OR
P-value Chr. (95% CI)
Position
a
rs75823044
IRS2 T/C
1.55
6.1x10
-12
13q34 (1.37-1.76)
110,360,784
rs78554043
CHEK2 C/G
1.62
7.5x10
-10
22q12.1 (1.39-1.89)
28,374,943
a
Genome Build
37/HG19
b
Risk allele/reference allele
c
Risk Allele Frequency in controls
35
Table 2. 8q24 results in AAPC and ELLIPSE/OncoArray
SNP ID Position A1/A2
Conditional
OR
Conditional
P-value
Marginal
OR
Marginal
P-value
rs72725854 128074815 T/A(G) 2.13 5.19E-73 2.32
1.08E-
109
rs72725879 128103969 T/C 1.31 2.62E-30 1.42 6.91E-60
rs7463326 128027954 G/A 1.29 9.67E-16 1.23 7.62E-13
rs7812894 128520479 A/G 1.15 4.78E-06 1.18 4.43E-10
rs12549761 128540776 C/G 1.30 7.18E-06 1.38 2.49E-09
a
Adjusting for age, study, PCs 1-10 and local ancestry at 8q24.
36
Table 3. Polygenic risk scores in men of African ancestry.
Risk Category
Percentiles
a
European
Ancestry
OR (95%CI)
b
African Ancestry
OR (95%CI)
c
<1% 0.19 (0.13-0.27) 0.18 (0.10-0.33)
1-10% 0.31 (0.28-0.35) 0.36 (0.30-0.43)
10-25% 0.52 (0.48-0.55) 0.55 (0.48-0.63)
25-75% 1.00 (Baseline) 1.00 (Baseline)
75-90% 1.78 (1.68-1.88) 1.68 (1.47-1.92)
90-99% 2.93 (2.75-3.12) 3.02 (2.52-3.63)
≥ 9 9 % 5.65 (4.83-6.62) 4.23 (2.39-7.50)
a
Risk score percentiles are population-specific.
b
Based on 100 variants in Al Olama et al. (2014)
c
Based on a subset of variants from Al Olama et al. , 13q34, 22q12.1
and 8q24 (n=101).
37
Supplementary Table 1. Description and study design of the studies included in the
meta-analysis.
Study Name Group No. of
Cases in
study
No. of
Controls in
study
No. of
Cases in
the
analysis
No. of
Controls in
the
analysis
Multiethnic Cohort
(MEC), African
Americans
AAPC GWAS 1841 1758 1784 1669
Southern Community
Cohort Study
AAPC GWAS 263 523 250 513
The Prostate, Lung,
Colorectal, and Ovarian
Cancer Screening Trial
AAPC GWAS 286 269 231 240
The Cancer Prevention
Study II Nutrition
Cohort
AAPC GWAS 76 152 64 112
Prostate Cancer Case-
Control Studies at MD
Anderson
AAPC GWAS 543 474 528 437
Identifying Prostate
Cancer Genes
AAPC GWAS 368 172 354 157
The Los Angeles Study
of Aggressive Prostate
Cancer
AAPC GWAS 296 303 288 287
Prostate Cancer
Genetics Study
AAPC GWAS 75 85 71 85
Case-Control Study of
Prostate Cancer
among African
Americans in
Washington, DC
AAPC GWAS 292 359 263 341
King County
(Washington) Prostate
Cancer Studies
AAPC GWAS 145 81 141 75
The Gene-Environment
Interaction in Prostate
Cancer Study
AAPC GWAS 234 92 224 89
North Carolina Prostate
Cancer Study
AAPC GWAS 216 249 209 241
Selenium and Vitamin
E Cancer Prevention
Trial
AAPC GWAS 223 224 212 208
Prostate Cancer in a
Black Population
AAPC GWAS 238 231 234 224
Ghana Ghana 498 494 474 458
Kaiser ProHealth GWAS 610 1,665
601 1,650
Vanderbilt Bio Vu ELLIPSE/OncoArray 213 0 204 0
Center for Prostate
Disease Research
ELLIPSE/OncoArray 145 44 135 41
38
EPIdemiology of
Prostate CAncer
ELLIPSE/OncoArray 64 63 20 9
Karuprostate ELLIPSE/OncoArray 384 411 363 386
Multiethnic Cohort
Study
ELLIPSE/OncoArray 489 529 475 523
Moffitt Prostate Cancer
Study
ELLIPSE/OncoArray 106 93 101 91
Nashville Men's Health
Study
ELLIPSE/OncoArray 188 201 176 188
Prostate Cancer
Prevention Trial
ELLIPSE/OncoArray 44 129 44 121
The North Carolina-
Louisiana Prostate
Cancer Project
ELLIPSE/OncoArray 1022 0 967 0
The Prostate Cancer
and Environment Study
ELLIPSE/OncoArray 72 58 70 57
CerePP French
Prostate Cancer Case-
Control Study
ELLIPSE/OncoArray 107 105 101 85
Southern Community
Cohort Study
ELLIPSE/OncoArray 301 1557 291 1498
South Carolina
Prostate Cancer Study
ELLIPSE/OncoArray 64 39 57 32
Selenium and Vitamin
E Cancer Prevention
Trial
ELLIPSE/OncoArray 30 173 28 170
San Francisco Prostate
Cancer Study
ELLIPSE/OncoArray 86 37 81 36
A Case Control Study
in Uganda
ELLIPSE/OncoArray 571 485 560 480
UK Prostate Cancer
Study
ELLIPSE/OncoArray 375 0 365 0
San Antonio
Biomarkers of Risk
ELLIPSE/OncoArray 106 106 105 106
Wake Forest Prostate
Cancer Study
ELLIPSE/OncoArray 59 66 59 49
Washington University
Prostate Cancer Study
ELLIPSE/OncoArray 75 153 72 152
39
Supplemental Figure 1. Manhattan Plot from the Meta-Analysis (10,202 cases and 10,810
controls)
40
Supplemental Figure 2. Regional association plot of 8q24.
41
References for Chapter 2
1. Kolonel LN, Henderson BE, Hankin JH, et al. A multiethnic cohort in Hawaii and Los Angeles:
Baseline characteristics. Am J Epidemiol. 2000;151(4):346–357.
2. Al Olama AA, Kote-Jarai Z, Berndt SI, et al. A meta-analysis of 87,040 individuals identifies 23 new
susceptibility loci for prostate cancer. Nat Genet. 2014;46(10):1103–1109.
3. Eeles RA, Olama AA, Benlloch S, et al. Identification of 23 new prostate cancer susceptibility loci
using the iCOGS custom genotyping array. Nat Genet. 2013;45(4):385–391, 391e1–391e2.
4. Gudmundsson J, Sulem P, Gudbjartsson DF, et al. Genome-wide association and replication studies
identify four variants associated with prostate cancer susceptibility. Nat Genet. 2009;41(10):1122–1126.
5. Gudmundsson J, Sulem P, Gudbjartsson DF, et al. A study based on whole-genome sequencing yields a
rare variant at 8q24 associated with prostate cancer. Nat Genet. 2012;44(12):1326–1329.
6. Takata R, Akamatsu S, Kubo M, et al. Genome-wide association study identifies five new
susceptibility loci for prostate cancer in the Japanese population. Nat Genet. 2010;42(9):751–754.
7. Thomas G, Jacobs KB, Yeager M, et al. Multiple loci identified in a genome-wide association study of
prostate cancer. Nat Genet. 2008;40(3):310–315.
8. Haiman CA, Patterson N, Freedman ML, et al. Multiple regions within 8q24 independently affect risk
for prostate cancer. Nat Genet. 2007;39(5):638–644.
9. Han Y, Rand KA, Hazelett DJ, et al. Prostate cancer susceptibility in men of African Ancestry at 8q24.
J Natl Cancer Inst. 2016;108(7)
10. Cook MB, Wang Z, Yeboah ED, et al. A genome-wide association study of prostate cancer in West
African men. Hum Genet. 2014;133(5):509–521.
11. Hoffmann TJ, Van Den Eeden SK, Sakoda LC, et al. A large multiethnic genome-wide association
study of prostate cancer identifies novel risk variants and substantial ethnic differences. Cancer Discov.
2015;5(8):878–891.
12. Patti ME, Sun XJ, Bruening JC, et al. 4PS/insulin receptor substrate (IRS)-2 is the alternative
substrate of the insulin receptor in IRS-1-deficient mice. J Biol Chem. 1995;270(42):24670–24673.
13. Southey MC, Goldgar DE, Winqvist R, et al. PALB2, CHEK2 and ATM rare variants and cancer risk:
Data from COGS. J Med Genet. 2016;53(12):800–811.
14. Freedman ML, Haiman CA, Patterson N, et al. Admixture mapping identifies 8q24 as a prostate
cancer risk locus in African-American men. Proc Natl Acad Sci U S A. 2006;103(38):14068–14073.
15. Popejoy AB, Fullerton SM. Genomics is failing on diversity. Nature. 2016; 538(7624):161–164.
42
Chapter 3 Multiethnic Joint Analysis of
Marginal SNP Effects
3.1 Abstract
To follow up regions identified through GWAS, multiethnic fine-mapping can improve the ability to
identify an underlying causal variant by leveraging different linkage disequilibrium (LD) structures across
diverse populations. In this context, the fixed-effect meta-analysis (FE) remains the most commonly used
approach for its ease of interpretation and its ability to pinpoint the causal SNP, especially under the
assumption of a common effect across populations. Previously, Joint Analysis of Marginal SNP Effects
(JAM) demonstrated how fine-mapping based on summary statistics from a single population could be
improved through joint modeling whereby correlation structures are explicitly accounted for from a
reference panel and use summary statistics to construct joint model likelihoods within one population.
Here, we expand upon JAM to a multiethnic setting (mJAM) by jointly fitting a two-stage framework of
JAM and FE and explicitly incorporating the different LD structures across populations. The rankings of
causal SNPs are determined via Bayesian model selection with reversible jump MCMC. We present
several simulation studies showing that for realistic effect sizes, mJAM out-performs competitive
methods like FE and the conditional and joint analysis (COJO) step-wise selection. We also investigated
its performance with imbalanced sample sizes across different populations, mimicking a real data scenario
in prostate cancer where we performed a case study to prioritize candidate SNPs at several known loci.
43
3.2 Introduction
Genome-wide association studies (GWAS) have achieved success in identifying statistical associations of
genomic regions with complex diseases and traits. Although thousands of risk regions have been
identified in GWAS for a variety of diseases, there remains a challenging task to identify the specific
causal variants or further prioritize genetic variants for functional studies within each region. The index
SNPs identified in GWAS are usually not biologically causal to the disease but instead in LD with the
underlying causal SNPs [Visscher et al., 2012]. Through genotype imputation or targeted sequencing, the
density of a region can be improved massively leading to many potential variants with modest to high
correlations with the causal SNPs. However, a high correlation among variants could lead to more false
positives which decreases the precision of identifying the true causal variants. This also hinders inference
on the number of causal variants when the associations are examined one at a time.
Combining information across multiple ethnic groups is one potential way to facilitate
identification of causal SNPs, especially when the correlation within one population is too high to provide
any solid conclusions. Due to different population histories, major ethnic groups may have distinct LD
structures. This variation in LD structure can then be leveraged to more accurately identify the underlying
causal SNP or reduce the inferred credible set. For example, compared to Europeans and Asians, African
Americans typically have smaller LD blocks with weaker correlations as the number of recombination
events for each region is expected to be higher due to a longer evolutionary history [Campell et al., 2008].
If a true causal variant exists across the populations its corresponding estimated association across
populations should be more consistent than the estimated association for proxy SNPs. Despite not
explicitly considering the LD structures within each population, fixed effect meta-analysis performs
rather well in practice, especially under the assumption of a common effect across populations.
Alternative approaches build upon the FE approach by either leveraging heterogeneity or incorporating
functional information [Morris et al., 2011; Kichaev et al., 2014,2015].
44
Compared to univariate analysis, alternative approaches often consider multiple genetic variants
in a region together and use joint or conditional modeling to attempt to identify the true causal variant in
the presence of high LD. With individual-level data, joint analysis can help identify secondary signals at a
locus [Servin and Stephens, 2007; Lango Allen et al., 2010; Ripke et al., 2011; Sklar et al., 2011] and
step-wise selection has been used to discover multiple signals at a locus [Galarneau et al., 2010; Trynka et
al., 2011]. However, step-wise selection can be very unstable with a large amount of highly correlated
SNPs and resulting p values from the final selected model tend to be conservative. Penalized regression
techniques like Lasso [Tibshirani, 1996] and elastic net [Cho et al., 2009] are potentially more stable but
can tend to be too sparse and unable to effectively prioritize the true causal SNPs in high LD situations.
Unlike penalized methods that determine non-zero effect sizes by cross-validation through optimizing
prediction errors, Bayesian methods compute posterior probabilities for models within the model space to
infer the probability that each SNP is causal. Ideally, exact inference is possible by enumerating all
possible models or combinations of SNPs. However, as the number of SNPs increases the model space
increases rapidly and exhaustive searches become impractical. Therefore, stochastic search algorithms are
often used to perform inference on posterior distributions. For example, piMASS [Guan et al., 2011] and
BVS [Quintana et al., 2012; Quintana et al., 2013] both use a MCMC to search through the model space,
while the later can also incorporate external annotations as prior information in the Bayesian model
selection to further prioritize causal SNPs.
In practice, many GWAS efforts combine summary statistics from numerous studies and
implementation of joint models on individual-level data is limited in practice. Several methods have been
proposed to construct joint models on summary results [Verzilli et al., 2008; Hormozdiari et al., 2014;
Benner et al., 2015; Yang et al., 2012]. In general, these methods all use reference data to estimate the
correlations between SNPs and then integrate this correlation structure in a multivariable regression
framework to approximate the corresponding individual-level analysis. Differences between methods are
due to variations in the assumptions for residual error and algorithms for model selection. For example,
45
Newcombe et al. [2016] proposed the Joint Analysis of Marginal SNP Effects (JAM) by invoking a
Cholesky transformation on the linear regression likelihood, adopting a g-prior for effect estimates, and
implementing a computationally efficient framework Bayesian model selection via a Reversible Jump
MCMC stochastic search algorithm.
In this paper, we present multi-ethnic JAM, an extension that simultaneously fits joint models
while explicitly incorporating correlation structures from different populations.
3.3 Methods
3.3.1 JAM Linear Model for Summary Statistics
To follow up on the signals identified in GWAS, where only univariate estimates are reported, JAM
utilizes the correlation structure from a reference panel to perform joint analysis using summary-level
marginal statistics. Assume individual-level genotype data
NP
XR
is available and we model the
phenotype trait
N
yR with a linear regression:
2
~ ( , )
N
y N X I , (1)
where X is mean centered with respect to each SNP and y is also mean centered. The conditional effect
of all SNPs is denoted by
P
R . As defined in Newcombe et al. [2016], we consider a new vector
' z X y , where the mth element is the sum of each individuals’ trait values multiplied by their
respective number of risk alleles of the mth SNP. Thus, z follows a multivariate normal distribution with
induced covariance ' XX :
2
~ ( ' , ' )
P
z MVN X X X X , (2)
46
Since any empirical covariance ' XX is inherently Hermitian and positive semi-definite, we can perform a
Cholesky decomposition as follows:
'' X X L L ,
where L is upper triangular with non-negative diagonal elements. If we assume X is full column rank,
then ' XX is also full rank and therefore positive definite. In this case, the Cholesky decomposition L is
also invertible. If we apply its transposed inverse
1
' L
to z , the residual variance of the transformed
distribution is again independent:
12
' ~ ( , )
PP
L z MVN L I
, (3)
We estimate z and ' XX based on the marginal estimates reported in GWAS and a reference panel that
has a similar correlation structure as the study population. To estimate the above statistics, we need to
assume the population reaches Hardy-Weinberg Equilibrium along with an additive genetic model.
Details of this derivation are provided in the appendix in Newcombe et al. [2016]. This approach assumes
that the marginal summary statistics are from a linear model. When the summary statistics come from the
analysis of a binary outcome, the transformation is performed following the approach presented in
[Benner et al.,2015; Chen et al., 2015] and shown to perform well for small effect sizes. [Pirinen et al.,
2013]
3.3.2 Multiethnic JAM
When univariate estimates for the same set of SNPs are available from multiple populations, we can
conceptually obtain the joint estimates from a given model using JAM in each population separately, and
then meta-analyze these estimates in a two-stage framework. Here, we propose to fit these two steps
jointly. To simplify notation and without loss of generality, we consider the scenario with three
populations. Within each population and for a given set of M SNPs within each region, a JAM linear
47
model can be constructed to establish the relationship between the transformed total trait burden and the
transformed genotype dosage.
( ) 1 ( ) ( ) 2
' ~ ( , )
i i i
PP
L z MVN L I
, for 1,2,3 i (4)
where
() i
L is the Cholesky decomposition of the empirical covariance matrix of the ith population and
() i
z is the trait burden over all individuals on the risk alleles across the set of SNPs for this population,
with
( ) ( ) 1 ( )
:'
i i i
L
z L z
, for 1,2,3 i .
With the ethnic-specific conditional estimates
()
ˆ
iP
R obtained from all three populations i=1,2,3 in the
first level, we can subsequently fit a fixed-effect meta-analysis model on the second level:
(1)
(2)
(3)
ˆ
ˆ
ˆ
P
P
P
I
I
I
(5)
where
2
~ (0, ) N and
P
R contains the vector of SNP effects, which are assumed to be identical in
each population.
By replacing the 𝛽 s in equation (4) with equation (5), we have the following linear-mixed model:
(1) (1)
(2) (2)
(3) (3)
00
00
00
LP
LP
LP
z L I
z L I
z L I
(6)
where
2
3
~ (0, )
P
NI
After simplifying the design matrices, the above linear model is equivalent to
48
(1) (1)
(2) (2)
(3) (3)
L
L
L
zL
zL
zL
We define new notations for integrated inputs of the transformed summary statistics
() i
L
z and Cholesky
decomposition of covariance matrix
() i
L as follows:
(1)
* (2)
(3)
:
L
LL
L
z
zz
z
,
(1)
* (2)
(3)
:
L
LL
L
The above linear regression has the following simple form:
* * 2
33
~ ( , )
L P P
z MVN L I (7)
3.3.3 Limitation of full ranked genotype matrix
One practical limitation of JAM and mJAM is that for each population, the genotype matrix is required to
be of full rank. While pruning to remove highly correlated SNPs is one option, we propose a
regularization to enforce a positive semi-definite matrix to be positive definite by introducing a small
positive element on the diagonal. This regularization term has a very trivial effect on the estimates while
guaranteeing the invertibility of the empirical covariance matrix X'X.
3.3.4 Bayesian Model Selection
Given the above new mJAM model, we follow the same model selection procedure as described in JAM
Newcombe et al. [2016].
49
Briefly, denote an indicator vector of length P that represents the SNPs to be included in the model.
Correspondingly, let M
be the sub-model with estimates
M
and
M
L
be the corresponding columns of
L . For each sub-model M
, the likelihood becomes:
22
33
( , , ) ( , )
L M P M M P
p z M MVN L I
We adopt a conjugate g-prior on
M
, and an inverse-gamma prior on
2
:
22
3
( , ) (0, ' )
M P M M
p M MVN L L
2
00
( ) ( , ) p M InvGa a b
The joint posterior distribution
2
( , , )
ML
p M z
is proportional to
2 2 2
( , , ) ( , ) ( ) ( )
L M M
p z M p M p M p M
We integrate over the conjugate priors for
2
( , ) , to obtain a simplified marginal posterior over the
model space. A beta-binomial with hyper-parameters (1, P), is specified to increase sparsity as the number
of SNPs increase thereby providing an intrinsic correction for multiplicity [Wilson et al., 2010]. Thus, the
analytical expression for the posterior support of each model in the model space is:
/2 (2 1)/2
( ) ( 1) (2 ( ))
2
(2 )
p aP
L
P
p M z b S M
P
PP
P
where
1
( ) ( )' ( )' ( ' ) '
1
L L L M M M M L
S M z z z L L L L z
50
For inference on specific SNPs, we sum up the posterior probabilities of all models containing a particular
SNP to get the posterior inclusion probability with respect to that SNP:
:1
( 1 ) ( )
j
j L L
M
p z p M z
For a limited number of SNPs, we can enumerate through all the sub-models in the model space. For
larger numbers of SNPs, we use a stochastic search with reversible jump MCMC to help reduce the
running time.
3.3.5 Simulation Study on Structured LD
We conducted a simulation study to compare the performance of mJAM to two potential alternative
approaches: fixed-effect meta-analysis and COJO step-wise selection. Currently, fixed-effect meta-
analysis (FE) takes an inverse-variance weighted average of the marginal estimates from multiple studies
or populations. Designed to construct multivariate models from summary statistics, COJO adopts a very
similar model likelihood to JAM and uses a step-wise selection using marginal p-values. Additionally, for
use of COJO on multiple populations we used a pooled LD. We used positive predictive value(PPV) and
sensitivity among top-ranked SNPs to measure the performance of competing methods. For mJAM, the
SNPs are ranked by posterior probability of inclusion and for COJO and FE they are ranked by p-value.
We also investigate the ability for mJAM to infer the number of causal SNPs by computing the Bayes
factors, i.e., the ratio of posterior odds over prior odds, for models of a particular size or number of SNPs.
We performed two sets of scenarios: one with simulated correlation structures designed with similar LD
structures across populations but different in the overall level of LD; and another simulation using real
genetic correlation structures observed in the PRACTICAL/ELLIPSE Consortium [Schumacher et al.,
2018].
51
For the first set of scenarios, we simulated genotypes for three populations varying the sample
sizes in each population and the total number of candidate SNPs. In the baseline scenario, each population
has 5000 individuals and a total of 50 SNPs are simulated in 5 blocks of 10 SNPs each. Within each block
of 10 SNPs, the pair-wise correlations are uniformly identically set to a constant value r
2
for each
population. For simplicity, the same r
2
is used for each of the 5 blocks within one population. r
2
varies
across the three populations with the r
2
set to 0.9
2
, 0.8
2
and 0.7
2
, respectively. We then selected 5 of the
simulated SNPs across the first three of the five blocks as causal SNPs with each having a relatively weak
odds ratio of 1.1 for the disease. We also perform scenarios with 1 and 2 causal SNPs for comparison.
Corresponding LD heatmaps are shown in Appendix, Figure 1.
To investigate the performance under imbalanced sample sizes, we simulated three additional
scenarios with varying numbers of individuals for each population while keeping the total number of
subjects at 15,000. This includes a scenario with 12,500 samples for the first population and only 1,250
for each of the remaining two populations. We also simulated two extreme examples where all 15,000
samples come from the same population with either the lowest or the highest LD, respectively.
To evaluate the scalability of mJAM, we expanded upon our baseline simulation setting with balanced
sample sizes with two additional scenarios with 1,000 SNPs in the region: the first includes 100 10-SNPs
blocks and the second includes 5 200-SNP blocks. For each scenario, the 5 causal SNPs are in the first
three blocks.
3.3.6 Simulation Study with Real Data
To better capture realistic LD patterns, we performed simulations based on real correlation within three
populations (Europeans, African Americans, and Asians) from the PRACTICAL/ELLIPSE Consortium.
The available sample sizes for these three populations are 89,674 Europeans, and 9,531 African
Americans. We investigated two scenarios with balanced and imbalanced sample sizes. In the first
scenario, we simulated 40 SNPs within the 17q24 region for 5,000 from each population using a
52
multivariate normal model and the corresponding estimated correlation structure. A total of 4 causal SNPs
was chosen such that the signal-noise ratio was the same as the previous examples. In the second real LD
based simulation, we used all available individuals in the three populations and chose the same 4 causal
SNPs. The heatmaps of this region for each ethnicity are shown in the Appendix, Figure 4.
3.3.7 Applied examples
We applied mJAM on 37 known regions for prostate cancer across two populations: 89,674 Europeans
and 9,531 African Americans. Prior to the model selection, we used Priority Pruner to remove highly
correlated (r
2
>= 0.9) in either population by first pruning in the European population and then in the
African American population, conditional on those selected. For each specific region, we ran mJAM with
summary statistics and reference data from the European and African-American populations. We also ran
JAM in each population separately. Under each setting, we selected the top SNPs based on their posterior
probabilities and constructed 95% credible sets from those SNPs included in the models with a
cumulative posterior probability of 95%. To reduce the uncertainty of model selection, we obtained our
credible set by overlapping the SNPs presented in the credible sets in four independent runs. We report
the number of signals inferred and the sizes of 95% credible sets selected by mJAM and JAM. We also
performed analyses using COJO and a fixed-effect meta-analysis and report the SNPs that pass a
Bonferroni corrected significant threshold.
We conducted another case study on the region 8q24, which harbors multiple prostate cancer
risk loci. For this region, we used all the SNPs in the spanning region (127.6Mb to 129.0Mb) that were
present in both the European and the African American Oncoarray data and we overlapped the dosage
with the meta results in the European population [Schumacher et al., 2018] and the meta results in the
African American population [Conti et al., 2017]. We used the Oncoarray data as the reference panels. In
four independent runs of mJAM, we obtained the overlapping SNPs in the 95% credible sets and
clustered them into multiple groups by running Priority Pruner on the credible sets. Furthermore, we
53
evaluated the correlation between variants in the credible set and known signals at 8q24. Previously, 12
independent signals were identified in the European population at 8q23 [Marco et al., 2018] and 5 were
identified in the African American population [Conti et al., 2017]. For each variant in the credible set, we
located the known signal that has the highest r
2
.
3.4 Results
3.4.1 Simulation Study
The simulation results for 50 candidates SNPs with 5 blocks of 10 SNPs each and 5 causal SNPs are
shown in Figure 1(a)-(d). While the total sample is fixed at 15,000 the sample sizes for each population
differ across the scenarios. Each figure displays the results averaged over 1,000 replicates for the mean
PPV (solid lines) and sensitivity (dashed lines) for rank SNPs from mJAM (in red), FE (in green), COJO
(in blue) and pooled JAM (in black). For the baseline scenario with 5,000 samples for each population, all
four methods have rather high PPV with the top two ranked SNPs. However, mJAM provides higher PPV
and sensitivity as we increase the number of top-ranked SNPs to the top 5 ranked SNPs. In this scenario,
the pooled LD has a slight impact on performance (mJAM’s sensitivity of 0.69 vs. the pooled JAM
analysis with sensitivity = 0.66). In contrast, the Bayesian model selection procedure has an advantage
over COJO’s stepwise model selection (pooled JAM analysis with sensitivity = 0.66 vs. COJO’s
sensitivity =0.51). In the COJO analysis, the step-wise model selection and the accompanied Bonferroni
correction for inclusion in the final model great impacts performance. Scenario (b) mimics a more
realistic situation in which a single population (e.g., a study of individuals of European ancestry) has the
largest sample by a considerable margin in comparison to the two remaining populations. In this scenario
(Figure 1(b)), FE’s performance drops dramatically compared to the equal sample size scenario
(sensitivity = 0.46 for the top 5 SNPs). While COJO performs well for a small number of top SNPs (e.g.,
sensitivity = 0.51 for top 5 SNPs), it is unable to capture a large proportion of the true signal even when
54
we include the top 40% of all the candidate SNPs (sensitivity = 0.68 at top 20 SNPs). Similar trends are
observed in Figure 1(c) where all 15,000 individuals are from one population with high LD. Both
conditional methods perform well for a few top rank SNPs (e.g., mJAM’s sensitivity = 0.54 and COJO’s
sensitivity = 0.48 for the top 5 SNPs), but performance drops for COJO as the number of ranked SNPs
increases, reflecting COJO’s reliance on a single best model in comparison to mJAM’s Bayesian selection
of the posterior probability of inclusion across numerous models. In the scenario in which all 15,000
individuals come from the population with the lowest LD (Figure 1(d)), all methods show slightly better
performance than the baseline scenario 1(a).
Figure 1(d) 15,000 individuals from one
population with the lowest LD.
Figure 1(c) 15,000 individuals from one
population with the highest LD.
Figure 1(a) 5,000 individuals in each
population.
Figure 1(b) 12,500, 1,250 and 1,250
individuals in each population.
55
Figure 2 shows the log of Bayes factors of all models with exact size i (red bars) and all models
with at least size i (blue bars) or greater under the baseline simulation scenario (i.e. 5,000 individuals in
each of three populations) with 1, 2 and 5 causal SNP, respectively. Demonstrating correct inference,
when only one causal SNP is simulated (Figure 2a), the Bayes factors for all models with size 1(red bar at
i=1) is very significant (>100) and for all models with at least size 1 (blue bar at i=1). All other Bayes
factors are very small (<1.2). With 2 causal SNPs simulated (Figure 2b), the Bayes factor of all model
with size i=2 is larger than 100 while all other red bars including i=1 are very low, indicating strong
evidence for two signals. When there are 5 causal SNPs (Figure 2c), there is extremely strong evidence
for at least 5 signals (BF ~ 1000 for i=5).
For simulation scenarios with 1,000 simulated SNPs with the total number of causal SNPs
maintained at 5, inference becomes more difficult as the signal/noise ratio decreases dramatically.
Although the simulation of 100 10-SNPs blocks adds substantial “noise” from blocks in low LD with the
causal SNPs in the first three blocks, the overall impact on performance was limited. As shown in Figure
3(a), mJAM (sensitivity=0.67) and FE (sensitivity=0.58) had similar performance for top-ranked SNPs,
but the increase in sensitivities is slower than the corresponding scenario with only 40 SNPs total. In
contrast, for COJO the performance (sensitivity = 0.40) is greatly impacted due to the conservative nature
of Bonferroni correction. For the scenario in which 5 LD blocks are maintained but each block consists of
200 correlated SNPs, there is a large impact on performance. In Figure 3(b), the performance of all three
Figure 2(a) Bayes Factors with 1
causal SNP
Figure 2(b) Bayes Factors with 2
causal SNPs
Figure 2(c) Bayes Factors with 5
causal SNPs
56
methods is modest, although mJAM still maintains a considerable margin over FE and COJO in PPV and
sensitivity.
For simulation based on real data for 40 SNPs in region 17q24 the ability of all methods to
identify the simulated causal SNP was diminished since the correlations are rather high and the LD blocks
are quite large. In Figure 4(a), 5,000 individuals were simulated from each population. Here, mJAM has
considerable increase in power (sensitivity = 0.48) compared to FE (sensitivity = 0.31) and COJO
(sensitivity = 0.39) for the top 4 SNPs. In using an averaging correlation reference, JAM has similar
performance (sensitivity = 0.36) as COJO, with both suffering from not leveraging the population-specific
LD. In Figure 4(b), where the simulation used a large European population (89,674) relative to the other
two populations (9,531 African-Americans and 2,075 Asian), all three methods have improved PPV and
Figure 3(a) 1,000 SNPs: 100 blocks, 10
SNPs each block.
Figure 3(b) 1,000 SNPs: 5 blocks, 200
SNPs each block.
Figure 4(a) Real genotype at 17q24 with
5000 individuals in each population.
Figure 4(b) Real genotype at 17q24 with
89,674 EUR, 9,531 AFR and 2,075 ASN.
57
sensitivity. mJAM has a demonstrated edge over COJO in identifying the causal variants for the top 4
rank SNPs and above (sensitivities of 0.93 and 0.92, respectively). Of note, FE (sensitivity = 0.35)
performance is limited as it does not model the SNPs jointly with a single large population and
corresponding LD structure dominating the results.
3.4.2 Case study of prostate cancer
For region 8p21, a total of 1620 SNPs entered the candidate set for model selection after pruning in both
EUR and AFR populations. The locuszoom plots for the meta-analysis marginal p-values are shown in
Figure 5(a) and 5(b) with the LD structure indicated of the EUR and AFR populations, respectively.
Running JAM in the EUR population alone indicated there is only one signal in the region. The 95%
credible set includes 6 SNPs, which form three independent clusters. The largest cluster includes four
SNPs, including the top hit rs7835802(chr8: 23522452:C:T) and three other SNPs: rs7821330,
rs200617944, and rs4872176 with pair-wise correlations range from 0.76 to 0.94 in the European
population. These four SNPs are also correlated with each other in the African American population
(Pearson's r between 0.40 and 0.75). In addition to this cluster of four SNPs, two other independent SNPs
are also included in the credible set: rs143055702 and rs74434323. While no SNPs entered the JAM
model selection in AFR population alone, incorporating the AFR population in the mJAM also showed
strong evidence for one signal and was able to pinpoint 2 SNPs out of the 6 identified in EUR. The top hit
rs7821330 was selected in both models, together with an independent signal rs7821330. By including the
AFR population in mJAM, we were able to filter out three SNPs rs7835802, rs200617944 and rs4872176
that are less correlated with the top hit rs7821330 in AFR population than in EUR. The pair-wise Pearson
correlation of the 6 SNPs is shown in Appendix table 1. In addition, one other independent SNP
rs74434323 is no longer in the credible set. After re-considering all the candidate SNPs that were tagged
by selected SNPs but filtered out by Priority Pruner, mJAM ended up with a final 95% credible set of 16
SNPs, while JAM in EUR population had 46. Therefore, by incorporating the AFR population into the
58
analysis, and leveraging the different LD pattern, mJAM was better able to pinpoint the signal leading to
a considerably more precise set of candidate SNPs. In contrast, a total of 73 SNPs passed Bonferroni
corrected significance threshold for COJO and 507 SNPs were genome-wide significant in FE.
As a second example, we conducted the same analysis on 17q12, where 1053 SNPs were
retained after running Priority Pruner. The locuszoom plots are shown in Figure 6(a) and 6(b). While both
mJAM and JAM detected only one signal in this region, mJAM ended up with a smaller credible set of 2
SNPs in the regions versus 3 by running JAM in EUR alone. The top hit in the locuszoom plot
rs11651755 (chr17:36099840:C:T) was selected in both model but mJAM was able to narrow down the
only signal in this region to two correlated candidate rs11651755 and rs11263761(Pearson correlation
0.93 in EUR and 0.37 in AFR). While running JAM in EUR population was unable to filter out two other
SNPs rs11649743 and rs718961 that are independent with the top signal rs11651755. The final 95%
credible set of mJAM is reduced to 7 SNPs from 11 in JAM. In comparison, 26 SNPs were selected by
COJO, and 245 SNPs passed the genome-wide significance threshold when using FE to meta-analyze
their estimates in both populations.
Figure 5(a) LocusZoom plot for 1620
SNPs at 8p21 with EUR LD
Figure 5(b) LocusZoom plot for 1620
SNPs at 8p21 with AFR LD
59
In another region 12q13, running mJAM in EUR and AFR resulted in the same credible sets as
JAM in EUR population with only 2 SNPs. In comparison, COJO ended up with 13 SNPs, and FE had
211 genome-wide significant SNPs. We performed the above analysis in 6 other regions. Together with
the three regions mentioned above, the average size of 95% credible set for mJAM is 11.9, which has a 33%
reduction compared to that of running JAM in EUR alone (average size 17.7). In addition, the average
model size for COJO is 30.6 and the average number of SNPs that are marginally genome-wide
significant in FE is 316.3. Despite the overall reduction in the size of the credible set, we do observe a
larger credible set for mJAM in some regions. For example, at 20q13, the credible set of mJAM includes
35 SNPs, compared to 28 when running JAM in EUR alone.
Finally, for region 8q24, a total of 5377 SNPs entered the mJAM analysis without pruning. In 4
independent runs of mJAM, we have strong evidence for 14 signals (BF >100) within the region. The 95%
credible set had 52 SNPs, including 7 from the previously identified 8 SNPs in the European fine-
mapping and 3 from the 5 signals from the African American fine-mapping. By running priority pruner
on the credible sets with European or African American LD, the 52 SNPs fell into 15 groups with r
2
=0.1
threshold (Table 3).
Figure 6(a) LocusZoom plot for 1053
SNPs at 17q12 with EUR LD
Figure 6(b) LocusZoom plot for 1053
SNPs at 17q12 with AFR LD
60
3.5 Discussion
We propose a novel multiethnic fine-mapping method, mJAM that uses a Bayesian model selection
framework and can perform joint analysis on summary statistics across multiple populations
simultaneously. More importantly, mJAM models the covariance structure for each population
independently so additional information is contributed from each population by the joint models that
combine this information. Our work is a multi-ethnic extension of JAM, proposed by Newcombe et al.
[2016] and therefore inherits many of its advantages. Among them, a closed-form marginal model
likelihood and the reversible jump MCMC stochastic search are essential to the scalability of this method.
In addition, the convenience of requiring only summary statistics to perform a joint analysis is even more
valuable for multi-ethnic fine-mapping than single population scenarios.
In a set of realistic simulation settings, mJAM demonstrated the ability to make inference on the
number of signals, to differentiate signals from noise, and finally to provide higher sensitivity and
positive predictive value than fixed-effect meta-analysis and COJO step-wise selection. We also
investigated the influence of imbalanced sample sizes on the performance of mJAM and found out that
inclusion of even a smaller number of samples from another population can help identify causal variants.
Although the sensitivity and positive predictive values of mJAM both decrease slightly with more
imbalanced sample sizes in populations, this effect is rather small compared to alternative methods like
the fixed-effect meta-analysis.
We developed a numeric trick to avoid the practical limitation of requiring a full ranked
genotype matrix for mJAM by introducing a regularization term on the empirical covariance matrix X'X.
This greatly expanded the scenarios where mJAM can be applied to and improved its computational
stability.
We also carried out a case study of prostate cancer where mJAM is applied to several
previously known prostate cancer susceptible regions. In most regions, mJAM was able to further
61
prioritize candidate SNPs by generating much smaller credible sets, compared to running JAM in either
ethnicity alone. However, there were some regions where mJAM generated larger credible sets than
running JAM in the largest EUR sample alone. We think this is likely caused by conflicting evidence
from the two populations in these particular regions; possibly due to heterogeneous causal SNP sets, or
due to differences in genotyping accuracy, which would lead to apparently different patterns of signals. In
this case, given our primary interest is to prioritize the causal SNPs that are common across ethnic groups,
mJAM is likely to include those potentially influential SNPs that are otherwise left out in single
population fine-mapping.
3.6 R package R2BGLiMS
We collaborated with Dr. Paul Newcombe to expand his R package R2BGLiMS, incorporating the
mJAM function. Previously, R2BGLiMS can perform JAM model selection in a single population via a
reversible jump MCMC. With the latest updates, marginal statistics from multiple populations can be fit
in a joint model with population-specific correlation structures. In terms of the efficiency of the algorithm,
both the running time and memory requirement scale linearly with the number of populations.
In addition to the new functionality, we overcame a practical limitation of running JAM/mJAM
models. Previously, we required the reference genotype matrices to be full-ranked. This assumption is
easily violated given a larger set of candidate SNPs with multiple populations. Pruning is usually required
to meet this assumption for a single population. However, for multiple populations, pruning tends to be
very labor-intensive and its strategy is not well-defined. We purpose to introduce a ridge regularization
term in the empirical covariance matrices to enforce their invertibility. By adopting a small ridge term, the
covariance matrices are the same as before numerically but are guaranteed to be positive definite. This
dramatically improves the usability of the package, as users now can input any genotype matrix without
pruning.
62
3.7 Appendix
Supplemental Figure 1(a). The correlation structure of three simulated populations.
Supplemental Figure 1(b). Real correlation structures for three populations: Europeans,
African Americans and Asians
63
Table 1(a). Pair-wise correlation of 6 SNPs in the 95% credible sets at 8p21 in the
European population
rs143055702 rs7821330 rs7835802 rs200617944 rs4872176 rs74434323
rs143055702 1.00
rs7821330 -0.01 1.00
rs7835802 -0.01 0.94 1.00
rs200617944 -0.01 0.88 0.92 1.00
rs4872176 -0.01 0.88 0.83 0.76 1.00
rs74434323 0.00 0.00 -0.01 -0.01 0.00 1.00
Table 1(b). Pair-wise correlation of 6 SNPs in the 95% credible sets at 8p21 in the African
American population
rs143055702 rs7821330 rs7835802 rs200617944 rs4872176 rs74434323
rs143055702 1.00
rs7821330 0.00 1.00
rs7835802 -0.01 0.54 1.00
rs200617944 -0.01 0.74 0.40 1.00
rs4872176 -0.01 0.75 0.63 0.55 1.00
rs74434323 0.00 0.00 -0.01 -0.01 0.00 1.00
64
Table 2(a). Pair-wise correlation of 4 SNPs in the 95% credible sets at 17q12 in the
European population
rs11263761 rs11651755 rs11649743 rs718961
rs11263761 1.00
rs11651755 0.93 1.00
rs11649743 -0.12 -0.10 1.00
rs718961 -0.08 -0.07 0.85 1.00
Table 2(b). Pair-wise correlation of 4 SNPs in the 95% credible sets at 17q12 in the
African American population
rs11263761 rs11651755 rs11649743 rs718961
rs11263761 1.00
rs11651755 0.37 1.00
rs11649743 -0.08 0.03 1.00
rs718961 -0.06 0.03 0.61 1.00
65
Table 3. 52 SNPs from the 95% credible sets of mJAM
SNP Position A1 A2 RAF_EUR MetaOR_EUR RAF_AFR MetaOR_AFR
rs35904296 127902364 A G 0.68 1.08 0.91 1.16
rs35171476 127905632 C T 0.68 1.08 0.91 1.16
rs7828737 127909080 G A 0.68 1.08 0.91 1.16
rs1914295 127910317 T C 0.68 1.08 0.93 1.21
rs11775030 127911006 G A 0.49 1.11 0.69 0.98
rs35554116 127915160 T C 0.78 1.09 0.96 1.31
rs13271478 127921284 A G 0.68 1.07 0.91 1.16
rs13271795 127921396 A T 0.68 1.07 0.93 1.22
rs35561981 127923347 A T 0.67 1.07 0.93 1.21
rs17755283 127923443 C T 0.67 1.07 0.93 1.21
rs10441523 127924563 C T 0.30 1.12 0.08 0.88
rs1487240 128021752 A G 0.75 1.16 0.83 1.24
rs9297750 128022973 A G 0.75 1.16 0.83 1.25
rs7463326 128027954 G A 0.75 1.16 0.84 1.26
Chr8:128083282:A:G 128083282 G A 0.00 1.32 0.04 2.41
rs76229939 128085394 G A 0.00 1.28 0.04 2.44
rs114798100 128085434 G A 0.00 1.36 0.04 2.44
rs111906932 128086204 A G 0.00 1.74 0.02 1.90
rs1016343 128093297 T C 0.20 1.25 0.21 1.06
rs1031587 128093472 A G 0.57 1.14 0.69 1.17
rs4871008 128093541 C T 0.57 1.14 0.69 1.17
rs77084358 128093695 G A 0.00 2.07 0.02 1.82
Chr8:128093856:V1 128093856 TA TAA 0.57 1.14 0.67 1.18
rs74421890 128096183 T C 0.00 1.21 0.03 1.70
rs6997559 128097578 T C 0.53 1.06 0.68 1.16
rs7837848 128099532 T G 0.17 1.15 0.20 1.05
rs28663168 128100409 A G 0.17 1.15 0.32 0.96
rs7000967 128103360 G A 0.51 1.05 0.78 1.21
rs1456315 128103937 T C 0.32 1.08 0.54 1.29
rs72725879 128103969 T C 0.18 1.16 0.35 1.41
rs5013678 128103979 T C 0.78 1.19 0.90 1.34
rs111724742 128108418 G T 0.02 1.65 0.03 1.23
rs35365584 128108726 G A 0.67 1.16 0.87 1.24
rs1456306 128116500 G A 0.72 1.13 0.87 1.24
rs4341134 128121288 C T 0.03 1.55 0.36 1.30
rs76927191 128158835 C T 0.95 1.17 0.94 1.10
rs371053591 128161166 TAAA TAA 0.44 1.03 0.08 0.85
rs58642622 128161670 A ACT 0.44 1.03 0.12 0.90
rs116146766 128169317 C T 0.95 1.18 0.95 1.10
rs115294184 128169322 C T 0.95 1.18 0.95 1.10
66
rs13276380 128175538 T C 0.09 1.11 0.02 0.87
rs13266271 128181807 T C 0.09 1.11 0.02 0.87
rs72609852 128193303 A G 0.09 1.11 0.02 0.87
rs17464492 128342866 A G 0.71 1.15 0.78 1.14
rs6983267 128413305 G T 0.51 1.22 0.89 1.31
rs7832031 128516952 A G 0.10 1.42 0.05 1.24
rs7812429 128520173 A G 0.10 1.42 0.08 1.19
rs7814837 128522202 T G 0.10 1.42 0.08 1.19
rs4582524 128528435 G C 0.10 1.42 0.06 1.24
rs13255059 128530616 A G 0.10 1.42 0.06 1.25
rs11986220 128531689 A T 0.10 1.43 0.06 1.25
rs12549761 128540776 C G 0.88 1.27 0.96 1.41
67
Table 4. Groupings of 52 SNPs from the 95% credible sets of mJAM
SNP EURGroup EURBestTag EURR2 AFRGroup AFRBestTag AFRR2
rs35904296 Group4 rs1914295 0.99 Group4 rs1914295 0.74
rs35171476 Group4 rs1914295 1.00 Group4 rs1914295 0.75
rs7828737 Group4 rs1914295 1.00 Group4 rs1914295 0.75
rs1914295 Group4 rs1914295 1.00 Group4 rs1914295 1.00
rs11775030 Group4 rs1914295 0.47 Group4 rs1914295 0.15
rs35554116 Group4 rs1914295 0.56 Group4 rs1914295 0.52
rs13271478 Group4 rs1914295 0.98 Group4 rs1914295 0.73
rs13271795 Group4 rs1914295 0.98 Group4 rs1914295 0.98
rs35561981 Group4 rs1914295 0.92 Group4 rs1914295 0.93
rs17755283 Group4 rs1914295 0.92 Group4 rs1914295 0.93
rs10441523 Group4 rs1914295 0.20 Group15 rs6983267 0.03
rs1487240 Group1 rs1487240 1.00 Group1 rs1487240 1.00
rs9297750 Group1 rs1487240 1.00 Group1 rs1487240 0.99
rs7463326 Group1 rs7463326 1.00 Group1 rs7463326 1.00
Chr8:128083282:A:G Group8 rs72725879 7.39E-04 Group8 rs72725879 0.04
rs76229939 Group8 rs72725879 1.85E-04 Group8 rs72725879 0.04
rs114798100 Group8 rs6983267 1.73E-04 Group8 rs72725879 0.04
rs111906932 Group9 rs6983267 6.79E-05 Group9 rs5013678 3.70E-03
rs1016343 Group6 rs72725879 0.69 Group14 rs72725879 0.15
rs1031587 Group10 rs5013678 0.25 Group14 rs5013678 0.16
rs4871008 Group10 rs5013678 0.25 Group14 rs5013678 0.16
rs77084358 Group9 rs6983267 0.00 Group9 rs5013678 4.01E-03
Chr8:128093856:V1 Group10 rs5013678 0.25 Group14 rs5013678 0.15
rs74421890 Group9 rs7812894 0.00 Group9 rs5013678 0.00
rs6997559 Group10 rs5013678 0.22 Group14 rs5013678 0.16
rs7837848 Group6 rs72725879 0.89 Group14 rs72725879 0.16
rs28663168 Group6 rs72725879 0.89 Group14 rs72725879 0.04
rs7000967 Group10 rs72725879 0.23 Group10 rs72725879 0.14
rs1456315 Group6 rs72725879 0.47 Group6 rs72725879 0.45
rs72725879 Group6 rs72725879 1.00 Group6 rs72725879 1.00
rs5013678 Group5 rs5013678 1.00 Group5 rs5013678 1.00
rs111724742 Group14 rs183373024 0.29 Group9 rs72725879 0.01
rs35365584 Group5 rs5013678 0.28 Group5 rs5013678 0.40
rs1456306 Group5 rs5013678 0.37 Group5 rs5013678 0.37
rs4341134 Group15 rs77541621 0.50 Group6 rs72725879 0.18
rs76927191 Group11 rs5013678 0.01 Group11 rs72725879 0.03
rs371053591 Group10 rs5013678 0.10 Group10 rs6983267 0.09
rs58642622 Group10 rs5013678 0.10 Group10 rs6983267 0.05
rs116146766 Group11 rs5013678 0.01 Group11 rs72725879 0.02
rs115294184 Group11 rs5013678 0.01 Group11 rs72725879 0.02
68
rs13276380 Group12 rs183373024 0.12 Group12 rs183373024 0.08
rs13266271 Group12 rs183373024 0.12 Group12 rs183373024 0.08
rs72609852 Group12 rs183373024 0.11 Group12 rs183373024 0.07
rs17464492 Group7 rs17464492 1.00 Group7 rs17464492 1.00
rs6983267 Group2 rs6983267 1.00 Group2 rs6983267 1.00
rs7832031 Group13 rs7812894 0.99 Group13 rs7812894 0.25
rs7812429 Group13 rs7812894 1.00 Group13 rs7812894 0.45
rs7814837 Group13 rs7812894 1.00 Group13 rs7812894 0.45
rs4582524 Group13 rs7812894 0.97 Group13 rs7812894 0.16
rs13255059 Group13 rs7812894 0.97 Group13 rs7812894 0.15
rs11986220 Group13 rs7812894 0.98 Group13 rs7812894 0.15
rs12549761 Group3 rs12549761 1.00 Group3 rs12549761 1.00
69
References for Chapter 3
1. Al Olama AA, Kote-Jarai Z, Berndt SI, et al. 2014. A meta-analysis of 87,040 individuals identifies 23
new susceptibility loci for prostate cancer. Nat Genet. 46(10):1103–1109.
2. Asimit JL, Hatzikotoulas K, McCarthy M, Morris AP, Zeggini E. 2016. Trans-ethnic study design
approaches for fine-mapping. Eur J Hum Genet. 24(9):1330-6. doi: 10.1038/ejhg.2016.
3. Benner C, Spencer CCA, Ripatti S, Pirinen M. 2015. FINEMAP: efficient variable selection using
summary data from genome-wide association studies. bioRxiv. doi: 10.1101/027342.
4. Brown PJ, Vannucci M, Fearn T. 1998. Multivariate Bayesian variable selection and
prediction. J R Stat Soc Ser B, 60(3):627–641.
5. Campbell MC, Tishkoff SA. 2008. African genetic diversity: implications for human demographic
history, modern human origins, and complex disease mapping. Annu Rev Genomics Hum Genet. 9:403-33.
doi: 10.1146/annurev.genom.9.081307.164258.
6. Chen W, Larrabee BR, Ovsyannikova IG, Kennedy RB, Haralambieva IH, Poland GA, Schaid DJ.
2015. Fine Mapping Causal Variants with an Approximate Bayesian Method Using Marginal Test
Statistics. Genetics. 200(3):719-36. doi: 10.1534/genetics.115.176107.
7. Daniel J Schaid, Wenan Chen, Nicholas B Larson. 2018. From genome-wide associations to candidate
causal variants by statistical fine-mapping. Nature Reviews Genetics. 19,491-504
8. Dadaev T, Saunders EJ et al. 2018. Fine-mapping of prostate cancer susceptibility loci in a large meta-
analysis identifies candidate causal variants. Nature Communications. 9, 2256 (2018)
9. Galarneau G, et al. Fine-mapping at three loci known to affect fetal hemoglobin levels explains
additional genetic variation. Nat. Genet. 2010; 42:1049–1051.
10. Yongtao Guan Y, Stephens M. 2011. Bayesian variable selection regression for genome-wide
association studies and other large-scale problems. Ann. Appl. Stat. Vol. 5, No. 3, 1780-1815
11. Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. 2014. Identifying causal variants at loci
with multiple signals of association. Genetics 198(2):497-508. doi: 10.1534/genetics.114.167908.
12. Kichaev G, Pasaniuc B. 2015. Leveraging Functional-Annotation Data in Trans-ethnic Fine-Mapping
Studies. Am J Hum Genet. 97(2):260-71. doi: 10.1016/j.ajhg.2015.06.007.
13. Kichaev G, Yang WY, Lindstrom S, Hormozdiari F, Eskin Ef, Price AL, Kraft P, Pasaniuc B. 2014.
Integrating functional data to prioritize causal variants in statistical finemapping studies. PLoS Genet
10(10): e1004722.
14. Lango Allen H, Estrada K, Lettre G, Berndtf SI, Weedon MN. 2010. Hundreds of variants clustered in
genomic loci and biological pathways affect human height. Nature 467(7317):832–838.
15. Li YR, Keating BJ. 2014. Trans-ethnic genome-wide association studies: advantages and challenges
of mapping in diverse populations. Genome Med. 6(10):91. doi: 10.1186/s13073-014-0091-5.
70
16. Liang F, Paulo R, Molina G, Clyde MA, Berger JO. 2008. Mixtures of g priors for Bayesian variable
selection. J Am Stat Assoc 103: 410–423.
17. Manolio TA. 2010. Genomewide association studies and assessment of the risk of disease. N Engl J
Med 363:166–176.
18. Morris AP. 2011. Transethnic Meta-Analysis of Genomewide Association Studies. Genet Epidemiol.
35(8): 809–822. doi: 10.1002/gepi.20630
19. Newcombe PJ, Conti DV, Richardson S. 2016. JAM: A Scalable Bayesian Framework for Joint
Analysis of Marginal SNP Effects. Genet Epidemiol. 40(3):188-201. doi: 10.1002/gepi.21953.
20. Newcombe PJ, Verzilli C, Casas JP, Hingorani AD, Smeeth L, Whittaker JC. 2009. Multilocus
Bayesian meta-analysis of gene-disease associations. Am J HumGenet 84(5):567–580.
21. Pickrell JK. 2014. Joint analysis of functional genomic data and genome-wide association studies of
18 human traits. Am J HumGenet 94(4):559–573.
22. Quintana MA, Schumacher FR, Casey G, Bernstein JL, Li L, Conti DV. 2012. Incorporating prior
biologic information for high-dimensional rare variant association studies. Hum Hered. 74(3-4):184-95.
doi: 10.1159/000346021.
23. Quintana MA, Conti DV. 2013. Integrative variable selection via Bayesian model uncertainty. Stat
Med 32(28):4938–4953.
24. Ripke S, Sanders AR, Kendler KS, Levinson DF, Sklar P, Holmans PA, Lin DY, Duan J, Ophoff RA,
Andreassen OA, et al. 2011. Genome-wide association study identifies five new schizophrenia loci. Nat
Genet 43(10):969–976.
25. Servin B, Stephens M. 2007. Imputation-based analysis of association studies: candidate regions and
quantitative traits. PLoS Genet 3(7):1296–1308.
26. Sklar P,Ripke S, Scott L,Andreassen O,Cichon S,CraddockN, Edenberg H,Nurnberger J, Rietschel M,
Blackwood D, et al. 2011. Large-scale genome-wide association analysis of bipolar disorder identifies a
new susceptibility locus near ODZ4. Nat Genet 43(10):977–983.
27. Tibshirani R. 1996. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B 58(1):267–
288.
28. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. 2005. Sparsity and smoothness via the fused
lasso. J R Stat Soc Ser B 67(1):91–108.
29. Trynka G, et al. Dense genotyping identifies and localizes multiple common and rare variant
association signals in celiac disease. Nat. Genet. 2011; 43:1193–1201.
30. Verzilli C, et al. 2008. Bayesian meta-analysis of genetic association studies with different sets of
markers. Am J Hum Genet 82(4):859-72. doi: 10.1016/j.ajhg.2008.01.016.
71
31. Visscher PM, Brown MA, McCarthy MI, Yang J. 2011. Five years of GWAS discovery. Am J Hum
Genet. 90(1):7-24. doi: 10.1016/j.ajhg.2011.11.029.
32. Wakefield J. 2009. Bayes factors for genome-wide association studies: comparison with P-values.
Genet Epidemiol 33(1):79–86.
33. Wu TT, Chen YF, Hastie T, Sobel E, Lange K. 2009. Genome-wide association analysis by lasso
penalized logistic regression. Bioinformatics 25(6):714–721.
34. Yang J, Ferreira T, Morris A P, Medland S E, Madden PAF, Heath AC, Martin NG, Montgomery GW,
Weedon MN, Loos RJ, et al. 2012. Conditional and joint multiple-SNP analysis of GWAS summary
statistics identifies additional variants influencing complex traits. Nat Genet 44(4):369–375.
35. Zou H, Hastie T. 2005. Regularization and variable selection via the elastic net. J R Stat Soc Ser B
67(2):301–320.
72
Chapter 4 Germline Variation at 8q24 and
Prostate Cancer Risk in Men of European
Ancestry
This chapter contains the work I undertook in the following paper:
Germline Variation at 8q24 and Prostate Cancer Risk in Men of European Ancestry
Matejcic, M., Sanders, E., Dadaev, T., Brook, M., Wang, K, et al. (Accepted by Nature Communications)
4.1 Abstract
The 8q24 region harbors multiple risk variants for distinct cancers. In genome-wide association studies
(GWAS) for prostate cancer, the 8q24 region typically has hundreds of marginally significant variants
which form several linkage disequilibrium (LD) blocks. Despite the overwhelming evidence of the
statistical association of prostate cancer with these variants, their biological mechanism is still largely
unknown. To follow up the associations identified in GWAS, we performed a comprehensive fine-
mapping analysis of the 8q24 region in men of European ancestry from the Illumina OncoArray and the
iCOGs with a total of more than 71,000 prostate cancer cases and 53,000 controls. We identified 3 novel
variants among 12 independent signals (conditional p < 4.28×10
-15
). Their allele dosages weighted by
their corresponding per-allele conditional ORs were used to construct a polygenic risk score (PRS) in
both datasets. Men in the highest PRS group (top 1%) have a 4-fold (95% CI=3.62-4.40) risk compared to
the baseline group with the PRS lies in the interquartile range. For the familial relative risk (FRR) of
prostate cancer that currently can be explained by the genetic variants, the identified 12 SNPs accounted
for 25%, the largest portion among all known risk loci. We also performed eQTL analysis to integrate
73
functional annotations and GWAS results to provide further insight into the potential biological
mechanism of these variants.
4.2 Introduction
With the development of imputation techniques, the density of genetic variants within a single region has
become extremely high, leading to high linkage disequilibrium (LD) for SNPs that are close to each other.
It is still an open question how to follow up on the risk loci identified in GWAS which could contain
numerous highly correlated genetic variants. Take the region 8q24 for prostate cancer as an example,
1077 SNPs were marginally genome-wide significant with p-values less than
8
5 10
in Illumina
Oncoarray with 53,449 cases and 36,224 controls of European ancestry. In this scenario, univariate
analyses are unable to filter out the surrogate SNPs that have high significance due to their correlation to
the functional variants. To address this issue, people usually use a joint or conditional model to adjust for
the correlation among candidate variants. In chapter 3, we discussed a novel Bayesian model selection
method that can perform joint analysis on summary statistics across multiple populations, taking
advantage of both the LD structure within each population and the heterogeneity across them. In this
chapter, we resort to another strategy to refine fine-mapping resolution: combining functional information
with statistical analysis. Specifically, we performed a novel meta-stepwise selection on 2,772 genotyped
and imputed variants spanning the risk region (127.6-129.0Mb) in two datasets: Illumina OncoArray and
the iCOGs. We also performed joint analysis of marginal summary statistics (JAM) on the iCOGs data
using to overall multiethnic (>93,000 cases and >72,000 controls) meta-analysis statistics to confirm the
variants we discovered in the meta-stepwise selection. For the SNPs that were identified in the statistical
analysis, we followed up on their functional annotations to better understand their biological mechanisms.
Also, we integrated the functional information and the marginal statistics from GWAS through a eQTL
analysis to better localize the true causal variants.
74
4.3 Methods
We used two fine-mapping approaches to prioritize the candidate variants at 8q24. The first approach
requires individual-level genotype data while the second one works on summary statistics with a
reference panel.
In the first fine-mapping method, we performed a meta-stepwise selection with the individual-
level imputed data from two datasets: Illumina OncoArray with 53,449 prostate cancer cases and 36,224
controls and the iCOGs with 18,086 prostate cancer cases and 16,711 controls. We filtered the original
5,600 SNPs in the spanning risk region (127.6-129.0Mb) by their marginal meta p values with threshold
5x10
−2
. A total of 2,772 SNPs entered as the candidates for analysis. A modified version of forward and
backward stepwise model selection was performed on the meta p-values combining the two datasets. At
each step, all remaining candidate variants were included into a study-specific logistic regression model
one at a time conditioning on the variables already in the model. The corresponding p-values from a Wald
test of the newly included candidate variant were then meta-analyzed with both studies. The candidate
with the most statistically significant p-value from the meta-analysis is included in the model if the
corresponding p-value is genome-wide significance 5x10
−8
. A backward elimination step is also included.
In some scenarios, when only summary level estimates are available,
In the second fine-mapping approach, we performed JAM using summary statistics, the
marginal meta estimates, with randomly selected 20,000 cases and 20,000 controls from the OncoArray
dataset. Our candidate set is the 2,772 variants that are nominally statistically significant. Before running
JAM, Priority Pruner was used to prune the candidates at r
2
=0.9 level. We performed 4 independent runs
of JAM and took the overlapping SNPs together with their surrogates that were pruned previously as the
credible set. We checked if the variants identified in the previous step were present in the credible set to
confirm our findings. For the SNPs presented in the credible set of JAM, we also followed up on their
75
functional annotations in publicly available data sources. Further functional analysis was performed
where we integrated GWAS and eQTL analysis as described in Nica et al.
With a set of SNPs identified in the fine-mapping studies, a PRS was calculated as the weighted
sum of dosages with the weights being the conditional estimates from a logistic regression model with all
the identified SNPs and the study-specific covariates like principal components and country of sub-
studies. The resulting scores were then categorized into percentile categories (<1%, 1-10%, 10-25%, 25-
75%, 75-90%, 90-99%, >99%). The risk for each category was estimated relative to the interquartile
range in each dataset separately and then meta-analyzed in a fixed-effect meta-analysis. We compared the
FRR contribution of the 12 variants at 8q24 to the FRR that can be explained by all 175 known prostate
cancer risk variants across the genome. Heritability is also estimated through the LMM method.
4.4 Results
We identified 12 variants in the meta-stepwise selection with risk allele frequencies between 0.006 and
0.998 that surpassed genome-wide statistical significance (p-values between 3.2×10
-8
and 8.00×10
-78
) and
with per-allele odds ratios ranging from 1.10 (rs5013678) to 2.65 (rs183373024) (Table 1). Among the 12
SNPs identified, two novel variants rs190257175 and rs12549761 are not correlated with any of the
known prostate cancer risk variants. For the other 10 variants, 8 are either previously known risk variants
or correlated (r
2
≥ 0.42) to known risk variants: rs1487240, rs77541621, rs72725879, rs5013678,
rs183373024, rs17464492, rs6983267, rs7812894. The remaining two SNPs: rs1914295 and rs7851380
are at most modestly correlated with previously reported prostate cancer risk variant.
Subjects in the lowest PRS group has an OR of 0.52 (95% CI: 0.45-0.59). While in the highest
group the OR increases to 3.99 (95% CI: 3.62-4.40). The JAM analysis ended up with 50 variants in the
credible set, which were mapped to 174 variants before pruning. Among the12 SNPs identified in the
stepwise selection, 11 were included in the credible set. The only exception is rs190257175, which had
76
the weakest conditional association. The functional annotations of these 12 SNPs indicated overwhelming
biological evidences: 6 are located within putative transcriptional enhancers (rs72725879, rs5013678,
rs78511380, rs6983267, and rs7812894), 8 intersect with transcription factor binding sites (rs77541621,
rs190257175, rs5013678, rs183373024, rs78511380, rs17464492, rs6983267, and rs7812894), 7 are
located near long noncoding RNAs associated with prostate cancer (rs1914295, rs1487240, rs77541621,
rs72725879, rs5013678, rs183373024, and rs78511380).
The overall effect of these 12 SNPs over prostate cancer risk is computed in a PRS. Men in the
top 1% of the PRS group has a 4-fold risk compared to the baseline group within the interquartile range of
the PRS (Table 2). The total FRR explained by all known prostate cancer variants (175 in total) is 37%.
The 12 SNPs at 8q24 are responsible for 9.4%, a proportion of 25.4% of the total explained FRR (Table
3). The total inheritability explained by all known variants was 0.118, where 22.2% came from the 12
SNPs at 8q24.
This large fine-mapping analysis in men of European ancestry has increased the number of potential risk
loci at 8q24 from 7 to 12. These findings further highlight the importance of this genomic region in
prostate cancer susceptibility. Carriers in the top one percent have a 3-fold increase in PC risk.
4.5 Discussion
We identified 12 independent SNPs associated with prostate cancer risk using two fine-mapping methods:
meta-stepwise selection using individual-level data and joint analysis of summary statistics with marginal
estimates from GWAS. We found strong biological evidence for these SNPs. In addition, they are
responsible for 25% of the FRR and 22% of the inheritability that can be explained by all known prostate
cancer risk loci. We incorporated the functional information in our analysis in two approaches. First, we
followed up on the functional annotation of the risk variants identified in the statistical models. Also, we
performed an integrated analysis combining the eQTL analysis with GWAS results. However, both
77
approaches have their limitations. In the first approach, the discovery phase relies solely on statistical
associations, which could leave out some biologically functional variants even with strong evidence in the
candidate sets. For the second approach, the biological information was combined with marginal GWAS
results, which trends to be noisy in the presence of high LD. Ideally, we want to incorporate the
functional information in the discovery phase of the genetic associations with joint or conditional effects.
This motivates the novel model selection method we are going to discuss in the next chapter: Quantile
(Q1) regularized regression.
78
Table 1. Marginal and conditional estimates for genetic markers at 8q24 independently
associated with prostate cancer risk
Conditional association
1
Marginal association
Variant ID
2
Position
3
Allele
4
RAF
5
LD cluster
6
OR (95%CI)
7
p-value OR (95%CI)
8
p-value
rs1914295 127910317 T/C 0.68 block 1
1.10 (1.08-1.12) 7.30x10
-25
1.09 (1.07-1.11) 3.07x10
-21
rs1487240 128021752 A/G 0.74 block 1
1.20 (1.17-1.22) 2.77x10
-66
1.16 (1.14-1.18) 2.97x10
-54
rs77541621 128077146 A/G 0.02 block 2
1.85 (1.76-1.94) 2.93x10
-137
1.83 (1.74-1.92) 4.33x10
-137
rs190257175 128103466 T/C 0.99 block 2
1.60 (1.42-1.80) 4.28x10
-15
1.36 (1.22-1.53) 6.90 x10
-8
rs72725879 128103969 T/C 0.18 block 2
1.31 (1.28-1.35) 1.26x10
-83
1.17 (1.14-1.19) 3.96x10
-48
rs5013678 128103979 T/C 0.78 block 2
1.10 (1.08-1.13) 1.58x10
-19
1.20 (1.17-1.22) 4.44x10
-68
rs183373024 128104117 G/A 0.01 block 2
2.67 (2.43-2.93) 4.89x10
-95
3.20 (2.92-3.50) 6.60x10
-138
rs78511380 128114146 T/A 0.92 block 2
1.19 (1.14-1.23) 3.48x10
-18
0.97 (0.94-1.00) 0.027
rs17464492 128342866 A/G 0.72 block 3
1.16 (1.14-1.18) 3.01x10
-52
1.17 (1.15-1.19) 9.05x10
-61
rs6983267 128413305 G/T 0.51 block 4
1.18 (1.16-1.20) 5.68x10
-84
1.23 (1.21-1.25) 3.15x10
-135
rs7812894 128520479 A/T 0.11 block 5
1.37 (1.33-1.40) 1.55x10
-122
1.45 (1.41-1.49) 1.20x10
-181
rs12549761 128540776 C/G 0.87 block 5 1.21 (1.18-1.24) 1.61x10
-45
1.28 (1.25-1.31) 1.38x10
-78
1
Each variant was incorporated in the stepwise model based on the strength of marginal association from the meta-analysis of OncoArray and
iCOGS data.
2
Variant that remained genome-wide significantly associated with PCa risk (p<10
-8
) in the final stepwise model
3
Chromosome position based on human genome build 37.
4
Risk allele/reference allele.
5
Risk allele frequency.
6
LD clusters were inferred based on recombination hotspots using Haploview 4.2
29
and defined as previously reported by Al Olama et al.
7
Per-allele odds ratio and 95% confidence interval adjusted for country, 7(OncoArray)/8(iCOGS) principal components and all other variants in
the table.
8
Per-allele odds ratio and 95% confidence interval adjusted for country and 7(OncoArray)/8(iCOGS) principal components.
79
Table 2. Relative risk of PCa for polygenic risk score (PRS) groups
1
No. of individuals Risk estimates for PRS groups
Risk category percentile
2
Controls Cases
OR (95% CI)
3
p-value
≤ 1% 530 339 0.52 (0.45-0.59) 2.11x10
-20
1%-10% 4771 3636
0.62 (0.59-0.65) 6.26x10
-90
10%-25% 7936 7359
0.75 (0.72-0.78) 3.62x10
-54
25%-75% 26464 32743
1.00 (Ref)
75%-90% 7940 13431
1.37 (1.32-1.41) 6.55x10
-77
90%-99% 4766 11451
1.93 (1.86-2.01) 4.13x10
-249
>99% 528 2576 3.99 (3.62-4.40) 5.64x10
-172
1
PRS were calculated for variants from the final stepwise model with allele dosage from OncoArray and iCOGS
weighted by the per-allele conditionally adjusted odds ratios from the meta-analysis.
2
Risk category groups were based on the percentile distribution of risk alleles in overall controls.
3
Estimated effect of each PRS group relative to the interquartile range (25-75%) in OncoArray and iCOGS datasets
separately, and then meta-analyzed across the two studies; odds ratios were adjusted for country and
7(OncoArray)/8(iCOGS) principal components.
80
Table 3. Proportion of familial relative risk (FRR) and heritability (hg
2
) of PCa explained
by known risk variants
Source No. of variants
Proportion of FRR
(95%CI)
% of total
FRR hg
2
(SE)
% of total
hg
2
8q24
1
12
9.42 (8.22-10.88) 25.4
0.027 (0.011) 22.2
HOXB13
2
1
1.91 (1.20-2.85) 5.2
0.004 (0.005) 3.0
All other
variants
2,3
162
25.77 (22.94-29.36) 69.5
0.092 (0.010) 74.9
Total 175 37.08 (32.89-42.49) 100 0.118 (0.012) 100
1
Conditional estimates were derived by fitting a single model with all variants from OncoArray data.
2
Risk estimates and allele frequencies for regions with a single variant are from a meta-analysis of OncoArray,
iCOGS and 6 additional GWAS
3
.
3
Risk variants included from fine-mapping of PCa susceptibility loci in European ancestry populations
11
.
81
References for Chapter 4
1. Al Olama, A. A., et al. (2014). A meta-analysis of 87,040 individuals identifies 23 new susceptibility
loci for prostate cancer. Nat Genet.
2. Amos C. The OncoArray Consortium: a Network for Understanding the Genetic Architecture of
Common Cancers. Cancer Epidemiol Biomarkers Prevent.
3. Schumacher, F. R. et al. Association analyses of more than 140,000 men identify 63 new prostate
cancer susceptibility loci. Nat. Genet. (2018). doi:10.1038/s41588-018-0142-8
4. Al Olama, A. A. et al. Multiple loci on 8q24 associated with prostate cancer susceptibility. Nat. Genet.
41, 1058–1060 (2009).
5. Haiman, C. A. et al. Multiple regions within 8q24 independently affect risk for prostate cancer. Nat.
Genet. 39, 638–644 (2007)
6. Han, Y. et al. Generalizability of established prostate cancer risk variants in men of African ancestry.
Int. J. Cancer 136, 1210–1217 (2015).
7. Han, Y. et al. Prostate Cancer Susceptibility in Men of African Ancestry at 8q24. J. Natl. Cancer Inst.
108, (2016).
8. Conti, D. V. et al. Two Novel Susceptibility Loci for Prostate Cancer in Men of African Ancestry. J.
Natl. Cancer Inst. 109, (2017).
9. Hoffmann, T. J. et al. A large multiethnic genome-wide association study of prostate cancer identifies
novel risk variants and substantial ethnic differences. Cancer Discov. 5, 878–891 (2015).
10. Newcombe, P. J., Conti, D. V. & Richardson, S. JAM: A Scalable Bayesian Framework for Joint
Analysis of Marginal SNP Effects. Genet. Epidemiol. 40, 188–201 (2016).
11. Dadaev, T. et al. Fine-mapping of prostate cancer susceptibility loci in a large meta-analysis identifies
candidate causal variants. Nat. Commun. 9, 2256 (2018).
12. Nica, A. C. et al. Candidate causal regulatory effects by integration of expression QTLs with complex
trait genetic associations. PLoS Genet. 6, e1000895 (2010).
82
Chapter 5 Quantile(Q1) Regularized Regression
5.1 Introduction
Assume we have a continuous outcome of interest and a large set of features that are linearly associated
with it. Two questions can help us to better understand their relationship: Can we identify the features that
are the most relevant to the outcome? And with these features, how well can we predict the outcome of
interest? To single out the features that are the most interesting, numerous variable selection methods
have been proposed. For its simplicity and interpretability, the step-wise selection is widely used in
practice. However, it proved to be very unstable in the sense that minor changes in the data could result in
entirely different sets of selected variables. P-values, as its inclusion criteria, have been under critical
review in the recent decades as an increasing number of researchers believe posterior probabilities should
be used instead. In light of this recognition, Hierarchical model selection as a more robust selection
method is gaining popularity in many fields. Among them, Lasso has been widely embraced as one of the
most popular methods, where estimates are shrunk towards zero to produce a parsimonious model. In
Lasso, every feature has the same degree of shrinkage regardless of their magnitudes or directions. In
many realistic settings, this uniform shrinkage is suboptimal in terms of prediction, especially compared
to differential shrinkage among features. However, there is always a trade-off between the versatility of
shrinkage and the additional parameters to estimate with a fixed sample size to distribute among training
and validation. Here, we present a more flexible framework that expands the L1 penalty to a quantile(Q1)
penalty which adopts asymmetric shrinkage for estimates in different quadrants. Noted that the L1 penalty
is a special case of the Q1 penalty with the quantile set to be 0.5 (i.e the median). By cross-validating
among a set of candidate quantiles, the Q1 regularized regression is able to adjust to the optimal quantile
that best suits the pattern of the data. Importantly, with just one additional quantile parameter, the shape
of the constraint region becomes much more flexible, which is particularly useful with the presence of
83
imbalanced numbers of positive and negative estimates. In addition to the above extension on the first
level in the hierarchical model, we can further improve the feature selection and prediction performance
by incorporating predictor level external information on the second level. For example, in genetics, gene
annotations can help prioritize genetic variants that are potentially causal for a variety of diseases. Here,
we propose to adopt a quantile regression on the second level of the hierarchical model to incorporate
external information when its dimension is relatively low compared to the feature dimension. In case of
high dimensional external information, we can further expand the second level into a regularized quantile
regression. Extensive simulations indicate that our model has higher precision in identifying truly
effective features and better prediction performance on the outcome, compared to Lasso.
5.2 Methods
5.2.1 Quantile Regression
Ordinary least squares characterize the linear relationship between an outcome variable and a group of
predictors using conditional mean. Both its computational convenience and interpretability are two of its
many advantages. Central limit theory definitely contributed to its primacy in the 70s in the applied field
as the normality assumption of residuals seemed well-founded in most real-world scenarios. However,
this assumption should not be taken for granted even when the sample size is large, especially when the
error distribution has long tails. With a sizable number of outliers, ordinary least squares have rather poor
performance. In 1978, Roger Koenker and Gilbert Bassett, Jr. proposed the use of conditional quantile to
replace conditional mean in these scenarios. By specifying a specific quantile to investigate, we're able to
depict the relationship of outcome and predictors at a certain position of their distribution. This method is
more robust to outliers and has the flexibility to capture the heterogeneities at a different part of the
distribution. Actually, before Koenker and Bassett published their work on quantile regression, its special
case median regression had already exhibited its robustness to outliers compared to least squares and
84
started to accumulate popularity among the research community. Similar to ordinary least squares that
minimize the sum of squared errors, median regression minimizes the sum of absolute errors. To further
generalize to quantile regression, the objective function becomes a weighted sum of absolute errors with
two different weights for overprediction and underprediction respectively.
For a continuous response variable { : 1,..., }
t
y t T , we have proved in chapter 1 that the th
sample quantile can be defined as the solution of the following problem:
{ : } { : }
min[ (1 ) ]
tt
tt
R
t t y t t y
yy
After incorporating a sequence of the k-dimensional vector { : 1,..., }
t
x t T . The th
regression quantile is defined as the solution to the minimization problem:
{ : } { :
min[ (1 ) ]
k
t t t t
t t t t
R
t t y x t t y x
y x y x
Although there are no analytical solutions to the above optimization problem, the objective
function is convex. It is equivalent to tilt the absolute function in an appropriate manner. The simplex
method is very efficient to obtain its numerical solution. To estimate the standard errors, bootstrap re-
sampling is usually applied in place of estimating its analytical variance. One note should be made that
the convexity of the objective function is very crucial in the implementation of our proposed hierarchical
Bayesian model selection method in the following section.
5.2.2 Quantile(Q1) Penalty
In this section, we propose a novel variable selection method that expands upon Lasso L1 penalty to a
more flexible Q1 penalty without adding the external information into consideration.
Assume our response variable y is a vector of length N: : { , 1,..., }
N
n
y y R n N R .
85
Our explanatory variables X is an N by P matrix with each row corresponding to one observation:
: { , 1,..., }
P N P
n
X x R n N R
The first level of our model is multiple linear regression with all the candidates with
P
R :
0
yX
We can either solve the following minimization problem analytically or treat it as an optimization
problem to get a numerical solution:
2
2
ˆ
arg min
p
R
yX
Recall Lasso introduces an L1 penalty on the second level. The penalized form of the optimization
problem is as follows:
2
21
arg min( )
p
R
yX
Figure from Chapter 3 of Hastie et al. (2009)
86
By applying Q1-penalty in place for the L1 penalty, the optimization problem takes the form:
2
arg min( ( ))
p
j
R
j
yX
, where ( ) ( ) (1 )( ) u u u
(5.1)
For computational convenience, we plugged in the following equation to simplify our objective function:
11
( ) ( ) (1 )( ) ( )
22
u u u u u
(5.2)
One note that should be mentioned is when we consider the median, namely 0.5 , the quantile
regularization will be exactly the same as Lasso. The following plots show the graphical representation of
the constraint region for Q1-regularization with different quantiles.
5.2.3 Coordinate Descent for Q1-regularization
Similar to Lasso, optimization problem (5.1) does not have an analytical solution. Yet we are in serious
need of a computationally efficient algorithm to find numerical solutions as the computational burden for
Q1 regularization is one dimension higher than Lasso, at least in theory. This is because we need to find
87
the optimal quantile that controls the shape of the constraint region as well as the optimal lambda that
controls the size of the constraint region as in Lasso. Between these two parameters, tuning lambda
remains the bigger challenge because it takes additional efforts just to find the range that contains the
optimal lambda while for the quantile parameter, there is a natural range from 0 to 1. Another key
difference between these two parameters is conceptual, lambda is considered more of an intermediate
internal parameter and is usually not reported while the quantile parameter is of our primary interest as it
indicates how imbalanced the optimal constraint region is in the data. With these recognitions, instead of
interactive search through the entire model space for the optimal hyper-parameter pair lambda and
quantile, we currently implement a more straightforward strategy that treats these two parameters
separately. At the first step, we pre-specify the number of quantiles to consider and generate a list of
candidate quantiles. Obviously, the more refined the interval is, the closer we will be to the global optimal.
However, we need to consider if the extra insight we get from more refined values of quantiles is worthy
of the extra computational cost, which scales linearly with the number of candidate values in the range.
Also, the increasing number of candidate quantiles introduces extra burden for multiple comparisons as
well. In the extreme case, if we treat the quantile to be a real number in the interval [0,1], the extra
precision might not be achieved anyways since it is a numerical solution after all. Even if we can obtain
this real-number precision for the quantile parameter, it is not as interpretable as a categorical one where
we can report the probability mass function of the optimal quantile instead of a probability density
function. Noted that we can always categorize the later, but the extra computational burden could be huge.
For now, our suggested number of quantiles to consider is between 10 to 100, which should produce close
to optimal results.
With a set of candidate quantiles, we can use coordinate descent to find the optimal solution for
the convex optimization problem with a given quantile in (5.1). It is important to note the following
conditions with which the coordinate descent is guaranteed to find the optimal solution.
88
For ( ) :
n
f x R R , assume there exists one point
n
xR such that () fx is minimized along each
coordinate axis. We are interested in the optimality of this solution.
Scenario1: If () fx is convex and differentiable, then ( ) min ( ) f x f z
Scenario2: If () fx is convex but non-differentiable, then ( ) min ( ) f x f z
Scenario3: If
1
( ) ( ) ( )
P
jj
j
f x g x h x
where () gx is convex and differentiable, each ()
j
hx is convex,
then ( ) min ( ) f x f z
Both the objective functions of Lasso and Q1 satisfy the conditions in scenario3. Therefore, the
coordinate descent is guaranteed to converge to the optimal solution.
In the following part, we derive the update rule of coordinate descent for the Q1 penalty.
For computational simplicity, we redefine the objective function to be:
2 1
( ) 2 ( )
2
j
j
f x y X
, where
11
( ) ( )
22
u u u
(5.3)
Taking partial derivative with respect to
j
, we have:
()
( ) (2 1)
TT
j j j j j j j
j
fx
x x x X y
Here
j
X
is X without column j ,
j
x is the j th column of X
By setting the above to zero, we have the following update rule:
()
( 1)
( ) (2 1)
()
T
jj
Tt
j j j t
j T
jj xx
x y X
S
xx
(5.4)
89
5.2.4 Q1 Penalty with External Information
Previously, we discussed the strategy to expand the L1 penalty to a Q1 penalty to provide flexibility in the
shape of the constraint region that potentially has a better fit in the data. In this section, we expand the
previous model further to incorporate predictor level external information in the variable selection as well
as prediction tasks under a Q1 penalty. Recall the first level model is a linear regression of our
explanatory variables on a continuous outcome. Now at the second level, a quantile regression is used to
incorporate external information
PM
ZR
on the estimates obtained at first level covariates:
( ) ~ QZ , where () Q is the quantile function
Here Z does not include the design column for intercept. Intuitively, the quantile loss function
is a tilted L1 norm with distinct shapes in different quadrants of the parameter space.
2
2
,
arg min ( ( ))
pM
tt
RR
t
y X Z
, where ( ) ( ) (1 )( ) u u u
(5.5)
By solving objective function (5.5), we jointly estimate the effect of each SNP on the first level
while considering the external information on covariates on the second level. If the dimension of external
variables is smaller than the number of features in the first level, we can still find the optimal solution of
(5.5) using similar strategy discussed in the previous section. However, when the dimension of external
information is extremely high, additional regularization at the second level is required to prevent
overfitting. We propose to adopt an L1 penalty on the second level in the high dimensional setting to aid
the model selection. To be more specific, the revised minimization problem for the hierarchical model
becomes:
2
12
21
,
arg min ( ( ) )
pM
tt
RR
t
y X Z
(5.6)
90
It is indeed a challenge to efficiently find the optimal hyper-parameters in the above
optimization problem as we still have two continuous parameters without a targeted range to start with
even after we fixed the quantile parameter. A brute force grid search would increase the number of
candidates along with the number of paths until the coordinate descent converges. Thanks to the ongoing
work of Garrett Weaver, where he proposed a much more efficient algorithm to search through the grid,
the computational time of a Q1-L1 penalty model is still manageable under realistic dimensions of
external variables.
5.3 Simulation Studies
To evaluate the performance of Q1 regularized regression, we conduct comprehensive simulations and
examine its performance in two aspects: feature selection and prediction of the outcome. We report false
positive and false negative rates for feature selection, and test mean squared error (MSE) for prediction.
The distribution of the optimal quantile is also reported under each simulation setting to provide further
insights. There are two sets of simulations: (i) Q1-regularized regression without external information as
in equation (5.1); (ii) Q1-regularized regression with external information as in equation (5.5). Here we
only investigated the scenarios where the external information has a lower dimension than the predictors.
To begin with, we simulated X, which contains 1000 samples for 100 uncorrelated features with
categorical values 0,1,2 which mimics genotyped SNPs in genetics. Then, we specify a vector of true
features with their effect sizes and use a linear regression model to simulate a continuous outcome of
interest y for each sample. With X and y, Q1 regularized regression and Lasso can be readily applied and
their performances can be compared. To investigate the influence of incorporating external information,
we further simulated a one-dimensional vector Z, with indicator values for each predictor in the first level,
to mimic gene annotations at the second level of the hierarchical model.
We explored our simulations with varying model settings in several aspects:
91
(1) The number of true signals: 3 scenarios
With 5, 10 and 50 true signals simulated from the total 100 features;
(2) The proportion of positive effects in all signals: 3 scenarios
With 50%, 80% and 100% of the true signal having positive effects;
(3) The informativeness of external information in terms of sensitivity and specificity: 4 scenarios
With (sensitivity, specificity) = {(0.9, 0.9), (0.9, 0.5), (0.5, 0.9), (0.5, 0.5)}
(4) The informativeness of external information in terms of the direction of effects: 3 scenarios
(4.1) Z 1 = {0,1}: The external variable is only informative on the positive effects.
Sensitivity: the probability of all positive features having Z 1 = 1
Specificity: the probability of all negative features and null features having Z 1 = 0
(4.2) Z 2 = {0,1}: The external variable is informative on all the true effects but is unable to differentiate
the direction of effects.
Sensitivity: the probability of all positive and negative features having Z 2 = 1
Specificity: the probability of all null features having Z 2 = 0
(4.3) Z 3 = {-1,0,1}: The external variable can differentiate positive effects and negative effects.
Sensitivity: the probability of correctly assign Z 3 for positive and negative features
Specificity: the probability of correctly assign Z 3 for null features
92
We assume there is no "cross-over", meaning the external information would not indicate a risk
feature being protective, etc. Therefore, the true positive features will have Z 3 = {0,1}; the true negative
features will have Z 3 = {-1,0}; the null features will have Z 3 = {-1,0,1}
We did a grid search of all the above settings (1)-(4) with and without external information for a
total of 117 scenarios. Under each scenario, we report statistics averaged across 1000 replicates. For each
implementation of quantile regularization, we pre-specified to take 11 values: 9 values from 0.1 to 0.9
with 0.1 space between adjacent values, together with two extreme quantiles 0.01 and 0.99.
Our method was implemented in the R package hierr while the most widely-used R package for
Lasso is glmnet. Both of these packages use the coordinate descent algorithm to obtain a numerical
solution. Theoretically, Q1 regularized regression without external information is equivalent to Lasso
when the quantile is set to 0.5. Therefore, we compared our parameter estimates in this setting to that of
Lasso’s. With the same cross-validation folder and random seeds, the estimates are consistent up to seven
decimal places.
5.3.1 No external information
In the first set of scenarios, we compared the performance of Lasso to Q1 regularized regression without
incorporating any external information. We set the baseline model to have 10 signals with positive effects
among 100 features. We then explored the scenarios with different signal-noise ratios and proportions of
positive effects of all signals to investigate their impact on the performance of the prediction and the
model selection.
As shown in Figure 1, the test MSE of Q1 regularized regression is 1.11 in the baseline model,
a 12% reduction compared to that of Lasso (1.26). This improvement is no surprise since our cross-
validation criteria are MSE. Q1 regularized regression always has lower internal MSE and, if not
overfitted, gets a lower test MSE as well. In this scenario, all the effects are positive values. Therefore,
93
the smaller quantiles that penalize less on positive effects are more likely to be the optimal ones. Indeed,
the most frequently chosen quantiles are Q0.2 (36%), Q0.3 (32%) and Q0.4 (14%). In contrast, only 3%
of the time Lasso (Q0.5) is the optimal model. The Q1 regularized regression reduces the false negative
rate from Lasso’s 6.0% to 3.6% but has an increased false positive rate of 14.4%, compared to Lasso’s
false positive rate 6.9%.
Next, we keep all signals to be positive but change the number of signals to alter the signal-
noise ratio. Since the total feature size is kept to 100, it is equivalent to change the number of signals.
When we increase the number of signals from 10 to 50, Lasso has a 137% increase in MSE compared the
baseline scenario while Q1 regularized regression only increases by 0.5%.(Figure 1(a)) With 50 positive
signals, Q1 has a 63% reduction in MSE compared to Lasso. The selection of optimal quantiles is also
shifted away from the median and towards more extreme quantiles with 41% chance being the Q0.1 and
38% being Q0.2. With more extreme lower quantiles being selected, the penalty on the positive effects is
further reduced, while the stringent penalty on the negative effects shrinks them towards zero. The
combination of these two results in a reduction in false negative rates, Q1 only has 4.0% false negatives
while Lasso has 11.1%. However, false positive rate increases from Lasso’s 19.3% to Q1’s 23.8%. In the
opposite model setting, if there are only 5 signals, the MSE of Lasso (1.09) is only slightly higher than Q1
regularized regression (1.05). The optimal quantiles are less extreme: Q0.3 (33%), Q0.2 (21%), Q0.4
(18%). As a result, the false negative rate is 2.6% compared to Lasso’s 3.5% and the false positive rate is
9.0% compared to Lasso’s 4.3%.
Then we keep the total number of signals as 10 but decrease the proportion of positive effects
from 100% in the baseline model to 80% and eventually to 50%. (Figure 1(b)) As we are moving towards
a more balanced decomposition of positive and negative effects from the baseline model, the
improvement of test MSE for Q1 over Lasso gradually decreases from 0.15 to 0.07 and finally to 0.01.
This is consistent with the distribution of optimal quantiles. When the majority of the signals have
positive effects, the lower quantiles are more likely to be chosen and only 2.7% of the time when Q0.5 is
94
the optimal quantile. However, in the balanced scenario with 5 positive effects and 5 negative effects,
Q0.5 has the highest probability of 0.34 being optimal among all quantiles. The reduction of MSE is only
5.1%, from Q1’s 1.039 to Lasso’s 1.034. Intuitively, as the number of positive and negative signals are
more balanced, there is less evidence to penalize the quadrants differentially. In general, Q1 has lower
false negatives but higher false positives compared to Lasso. For now, the most dramatic improvement of
MSE comes from the scenario with a large signal-noise ratio (e.g., 1) and extremely imbalanced effects
(all positive effects). When we have a relatively smaller signal-noise ratio (0.1) and perfectly balanced
effects, the improvements become quite trivial. Can we get any meaningful improvement with a large
signal-noise ratio (1) and perfectly balanced effects? To answer this question, we simulated 50 signals
from 100 features with 25 positive effects and 25 negative effects. In this setting, although Q0.5 has a 50%
chance of being optimal, Q1 still has a 5% decrease in MSE (1.12) compared to Lasso (1.17). In this case,
the model selection results are quite similar. Q1 has a false negative rate of 10.9%, compared to Lasso’s
11.0%. The false positive rate of Q1 is 19.8%, slightly higher than Lasso’s 19.5%.
To summarize the prediction performance, the additional flexibility of choosing the optimal
quantile always provide opportunities for Q1 to improve upon Lasso. Even under the optimal scenario for
Lasso with a balanced number of positive and negative effects, Q1 out-performs Lasso in a consistent
manner. In general, the improvement potential becomes greater with a larger signal-noise ratio and very
imbalanced direction of effects. With only one additional parameter introduced compared to Lasso, over-
fitting is very unlikely to occur for Q1. As for the model selection, there is a trade-off between false
positives and false negatives. In general, as the optimal quantiles in Q1 move towards extreme values, the
false negative rate will drop at the cost of an increased false positive rate. One thing worth noticing is that
current criteria for choosing the optimal quantile in cross-validation are the MSEs, which can be easily
changed to other statistics that could have a tighter correlation with false positive or false negative rates.
However, it is unlikely to break the trade-off relationship between these two statistics unless additional
information is provided.
95
Figure 1(a) Test MSE of 5,10 and,50 signals
with all positive effects
Figure 1(b) Test MSE of 10 signals with 50%,
80%, and 100% positive effects
Figure 1(c) False positive rates of 5,10 and,50
signals with all positive effects
Figure 1(d) False positive rates of 10 signals
with 50%, 80%, and 100% positive effects
Figure 1(e) False negative rates of 5,10 and,50
signals with all positive effects
Figure 1(f) False negative rates of 10 signals
with 50%, 80%, and 100% positive effects
96
5.3.2 With external information
In this section, we discuss the performance of Q1 regularized regression, where external information is
incorporated through a quantile regression at the second stage. We expand the 9 baseline scenarios
without external information from the previous section to 108 by introducing 3 types of information with
4 levels of informativeness. We compare their performance across different settings of external
information and to their corresponding baseline scenarios without external information. Test MSE is
reported for prediction ability. False positive and false negative rates are reported for model selection.
First, we investigate the impact of the information quality in terms of sensitivity and specificity
on the improvement scale of model performance. Intuitively, population prevalence plays an important
role in conveying the influence of sensitivity and specificity, which are method-based metrics, to
population-based metrics like false positive and false negative rates. Indeed, we find it necessary to
stratify the discussion into two categories with the presences of many signals or very few signals in the
candidate features.
To investigate the scenario when the signals are rather sparse, we show two examples of 5 and
10 signals out of 100 features respectively. Under both scenarios, we compare the model performance
when incorporating three types of external information of all levels of informativeness: high sensitivity
with high specificity, high sensitivity with low specificity, the low sensitivity with high specificity, and
low sensitivity with low specificity. For prediction performance, specificity has more impact than
sensitivity in both scenarios, especially when the signal is sparse. Incorporating high specificity external
information improves the prediction performance, regardless of the sensitivity levels and the type of
information (Figure 2(a), (b)). The test MSE can drop below the baseline with no external information
when we incorporate low specificity and high sensitivity information (Figure 2(a)). The false positive rate
is dominated by specificity as well. It goes up quickly with decreased specificity (Figure 2(c), (d)). False
negative is impacted by sensitivity but usually under control compared to the baseline. With high
97
sensitivity, there is a huge improvement in false negative rates. In all informativeness levels, we observe
some improvement in false negative rates compared to the baseline.
Figure 2(b) Test MSE of 10 positive signals
Figure 2(a) Test MSE of 5 positive signals
Figure 2(c) False positive rates of 5 positive signals
Figure 2(d) False positive rates of 10 positive signals
Figure 2(e) False negative rates of 5 positive signals
Figure 2(f) False negative rates of 10 positive signals
98
When the signals are quite common, we find a more dramatic divergence in the performance for
different types of external information. With a comparable number of signals and noise features, both
sensitivity and specificity play a vital role in prediction. Given high sensitivity and specificity information,
we can make a considerable decrease in MSE when the type of information is effective in capturing the
information with the underlying pattern of the true effects (Z1, Z2 in Figure 3(a), Z3 in Figure 3(b)).
When the type of information is misleading, it could have a negative(Z3 in Figure 3(a)) or little (Z1, Z2 in
Figure 3(b)) gain. With a lot of features affected by either sensitivity or specificity, we could gain
information with only one good metric (Z3 in Figure 3(b)) or lose information with one bad one (Z1 in
Figure 3(a)). The type of information really matters. For false positive rates, high specificity usually
improves it (Z1,2,3 in Figure 3(c); Z3 in Figure 3(d)) unless the type of information violates the pattern of
the effects. (Z1, Z2 in Figure 3(d)). False negatives are usually less of a problem and are dominated by
sensitivity (Figure 3(e), (f)).
In summary, to improve the MSE, we need to have a high number of correct information,
affected by sensitivity, specificity, and prevalence. When the signals are sparse, it is usually a trade-off
between FP and FN and quite difficult to improve FP (Figure 4(c)). With many signals, high-quality
information could improve both metrics simultaneously (Z1, Z2, Z3 in Figure 4(a)) with FP dominated by
specificity and FN dominated by sensitivity.
Last, we compare the performance of different types of information. When the signal is
balanced in both directions, Z3 has a huge advantage over Z1 and Z2 for both predictions (Figure 3(b))
and model selection (Figure 3(d), 3(f)). For imbalanced cases, Z1 and Z2 perform very similarly for
prediction (Figure 3(a)), but Z1 tend to have more balanced FP and FN than Z2. (Figure 3(c), (e)).
99
Figure 3(a) Test MSE of 50 positive signals
Figure 3(b) MSE of 25 positive and 25 negative signals
Figure 3(c) False positive rates of 50 positive signals
Figure 3(d) False positive rates of 25 positive and 25
negative signals
Figure 3(e) False negative rates of 50 positive signals
Figure 3(f) False negative rates of 25 positive and 25
negative signals
100
5.3.3 High dimensional scenario
We expand the total features from 100 to 10,000 in our baseline scenario without external information to
investigate the impact of high dimensional data on our model. Ten positive signals are simulated, and the
sample size is kept at 1,000. In this case, Q1 regularized regression still out-performs Lasso by 8.6% in
test MSE (Figure 5). This improvement is not as much compared to the baseline scenario with only 100
candidate features (16.2%) but it Q1 regularized regression still possess a clear edge over Lasso in high
dimensional settings.
Figure 4(a) FP vs FN for 50 positive signals Figure 4(b) FP vs FN for 25 positive and 25
negative signals
Figure 4(c) FP vs FN for 10 positive signals
101
5.4 Case study: Breast cancer survival prediction
We performed a case study to predict the survival status of breast cancer for the Molecular Taxonomy of
Breast Cancer International Consortium (METABRIC) data. We categorized the original outcome
survival time to be the survival status at 5 years. The predictor variables are clinical variables and 20000
probes. After filtering the observations with censored outcome status, we further exclude the individuals
with ER-negative status, as previous literature showed a lack of predictive power of genetic variants for
this group of patients. Finally, 594 samples remained in the training data. We used four meta-gene
categories that have been reported in previous papers to be associated with breast cancer survival as our
predictor-level external information. We performed Q1 regularized regression with and without
incorporating the external information Z and obtain test AUC on a separate validation dataset that
contains 564 individuals. As a direct comparison to our model, we run Lasso and use the features being
selected to construct a prediction model. We also run ridge regression as a second comparison to our
methods. Since the training sample is rather small, the different split of folds in the cross-validation could
affect the results. Therefore, we run 10 replicates for all four methods with a randomized fold ID across
replicates, while keeping the same fold ID within each replicate across all the methods. For ten replicates,
Figure 5 Test MSE Lasso vs Q1 No Z for 10
positive signals among different number of features
102
the mean test AUC for Lasso is 0.629. Without incorporating external information, Q1 regularized
regression only improves slightly over Lasso with a test AUC of 0.633. After incorporating meta-gene
information as external information, Q1 regularized regression improves the AUC dramatically to 0.693.
As another comparison, ridge regression performs better than Lasso in general with 0.651 test AUC, but
not as good as Q1 with external information.
5.5 Discussion
In this section, we proposed three model selection methods under the same 2-stage hierarchical model
framework: (i) Quantile regularized regression; (ii) Quantile regularized regression with external
information; (iii) Quantile-Lasso regularized regression with external information for high dimensional
data.
Quantile regularized regression expands upon the L1 penalty in Lasso to a more flexible
quantile penalty. This additional flexibility of differentially penalizing estimates in different quadrants
provides a data-driven way of asymmetric shrinkage for the L1 penalty. Both the simulation results and
the real data analysis show its improved prediction performance over Lasso even without external
information. This is due to choosing the optimal internal MSE across all quantiles, which is likely to
translate into higher test MSE when the sample size is large enough. However, when the data is high-
dimensional, cross-validation can overfit the model and the choice of hyper-parameters is often
suboptimal. This could pose a greater challenge for Q1 than for L1 since we have an additional quantile
parameter to estimate.
When the external variable is high-dimensional, we need to introduce regularization on the
second level to prevent over-fitting. It is worth mentioning that the computational cost of imposing
regularizations on the second level is quite expansive. Since the benefit of incorporating external
information is dependent on the informativeness of the external data, we suggest pre-processing this
103
information with extra caution to guarantee its quality and also to reduce the runtime by reducing its
dimension.
5.6 R package hierr
The methods discussed in this chapter are implemented by Garrett Weaver in an R package hierr, which is
developed specifically for hierarchical regularized regression. In the package, users can specify the type
of regularization penalties on either the first level or the second level of the hierarchical model. To run
quantile penalty without external information, the user needs to specify the first level regularization
penalty to be Q1 and do not import external data. To run quantile penalty with external information, the
user can use the same argument while importing the external data. If the second level information is high-
dimensional, it is suggested to impose a second level regularization together with the first level quantile
regularization. The hierr package is flexible enough to handle all combinations of different types of
penalties on these two levels including L1, L2 and Q1 penalty. For instructions and more detailed
descriptions, please refer to the following Github page: https://github.com/USCbiostats/hierr.
104
Supplemental Paper 1 Two Novel Susceptibility
Loci for Prostate Cancer in Men of African
Ancestry
Conti, D., Wang, K., et al. (2017). Journal of National Cancer Institute 109(8): djx084
Abstract
Prostate cancer incidence is 1.6-fold higher in African Americans than in other populations. The risk
factors that drive this disparity are unknown and potentially consist of social, environmental, and genetic
influences. To investigate the genetic basis of prostate cancer in men of African ancestry, we performed a
genome-wide association meta-analysis using two-sided statistical tests in 10,202 cases and 10,810
controls. We identified novel signals on chromosomes 13q34 and 22q12, with the risk-associated alleles
found only in men of African ancestry (13q34: rs75823044, risk allele frequency = 2.2%, odds ratio [OR]
= 1.55, 95% confidence interval [CI] = 1.37 to 1.76, P = 6.10×10
-12
; 22q12.1: rs78554043, risk allele
frequency = 1.5%, OR=1.62, 95% CI=1.39 to 1.89, P = 7.50×10
-10
). At 13q34, the signal is located 5’ of
the gene IRS2 and 3’ of a long noncoding RNA, while at 22q12 the candidate functional allele is a
missense variant in the CHEK2 gene. These findings provide further support for the role of ancestry-
specific germline variation in contributing to population differences in prostate cancer risk.
Introduction
The incidence of prostate cancer (PCa) in African American men is 1.6-fold higher than in other
racial/ethnic populations
1
, remaining one of the most important unanswered health disparities globally.
105
Reasons for this disparity are unknown and likely involve a multitude of factors, including social and
environmental factors and inherited susceptibility. Genome-wide association studies (GWAS) have
identified more than 160 common risk alleles for PCa (2-7), including the susceptibility region on
chromosome 8q24, which harbors multiple variants that have been suggested to contribute to racial/ethnic
differences in PCa risk (8,9).
Methods
To search for additional PCa risk variants in men of African ancestry that may contribute to their greater
disease incidence, we combined genetic association results from the African Ancestry Prostate Cancer
GWAS Consortium (AAPC, 4,853 cases and 4,678 controls) (9), the Ghana Prostate Study (474 cases and
458 controls) (10), the Kaiser/ProHealth Prostate Cancer Study (601 cases and 1,650 controls) (11) ,and
the ELLIPSE/PRACTICAL OncoArray Consortium (4,274 cases and 4,024 controls) (Supplemental
Table 1). Subjects provided written informed consent to participate in the study. The protocol and consent
documents were approved by the institutional review boards at each of the participating institutions. A
total of 17.8 million genotyped and imputed single nucleotide polymorphisms (SNPs) and
insertion/deletion variants with frequencies of 1% or more were tested for an association with PCa risk.
For each SNP, per-allele odds ratios (ORs) and 95% confidence intervals (CIs) were estimated using
unconditional logistic regression, and we tested for allele dosage effects through a 1-degree of freedom
Wald trend test. All statistical tests were two-sided. Results from each study were combined through a
meta-analysis of 10,202 cases and 10,810 controls (Supplementary Methods). The cut-point for genome-
wide statistical significance was a P value of less than 5.00×10
-8
.
106
Results
Only minor evidence of inflation in the test statistic was observed following adjustment for global genetic
ancestry (λ=1.04). In the meta-analysis, 775 alleles achieved genome-wide statistical significance (P <
5.00×10
-8
). These alleles were located at the 8q24 risk region (543 alleles) and other known susceptibility
regions on chromosomes 2p15(EHBP1), 2q37(MLPH), 6q22(RFX6), 8p21(NKX3-1), 10q11(MSMB),
11q13(MYEOV), 12q13(KRT8), 17q21(ZNF652), and Xp11(NUDT11/LINC01496) (Supplementary
Figure 1A, available online). Outside of these regions, genome-wide statistically significant associations
were also observed on chromosomes 13q34 and 22q12.1 (Table 1; Supplementary Figure 1B), with the
risk-associated alleles found almost exclusively in men of African ancestry. At 13q34, marker
rs75823044 (2.2% frequency) was associated with an odds ratio of 1.55 (95% CI = 1.37 to 1.76, P =
6.10×10
-12
). This marker is located within a cluster of five moderately correlated alleles (r
2
> 0.30)
approximately 45kb 3’ of the gene insulin receptor substrate 2 (IRS-2), a signaling protein that mediates
the effect of insulin and insulin-like growth factor 1 (12), and 20 kb 5’ of a long noncoding RNA
(LINC00676). Of these five variants, rs151190668 (OR = 1.67, 95% CI = 1.43 to 1.96, P = 1.70×10
-10
)
appears to be the best functional candidate because it is located in a region containing epigenetic
chromatin modifications and androgen receptor and FOXA1 binding consistent with regulatory sequences
(Figure 1; Supplementary Methods).
At 22q12.1, the association signal was also defined by multiple low-frequency African
ancestry-specific variants spanning approximately 944 kb, with rs78554043 being the most statistically
significant variant (1.5% frequency, OR = 1.62, 95% CI =1.39 to 1.89, P = 7.50×10
-10
) (Table 1).
The variant rs78554043 is correlated (r
2
= 1) with a missense polymorphism (rs17886163, Ile448Ser, P =
1.38×10
-9
) in the CHEK2 gene, which is a likely candidate for the underlying biologically functional
allele. Although the Ile448Ser missense is characterized as “benign” by Polyphen2 and ClinVar and
“tolerated” by SIFT 10A5Q6 (Supplementary Methods), it involves a nonconservative nonpolar to polar
change in the amino acid. While the possibility of rare regulatory variation cannot be excluded, this
107
nonconservative change provides support for previous studies suggesting that rare/less common missense
variants in CHEK2 may be important in PCa development (13).
The risk alleles rs75823044 and rs78554043 are found almost exclusively in African ancestry
populations. In the 1000 Genomes Project populations (n = 2504 subjects), the risk allele for rs75823044
was found in 48 of 661 African ancestry samples (AFR), one of 85 Peruvians, and one of 96 Punjabi. For
rs78554043, the risk allele was found in 30 of 661 AFR samples, one of 104 Puerto Ricans, and one of 94
Colombians.
At 13q34 and 22q12.1, no nominally statistically significant (P < .05) evidence of effect
heterogeneity was noted by age (above vs below the median age in cases plus controls of 64, P ≥ .27) or
disease aggressiveness (high-risk vs low-risk PCa, P≥ .20) (Supplementary Methods). GWAS of high-
risk disease (vs controls) and high- (n = 2984) vs low-risk (n = 3012) disease (Supplementary Methods)
did not reveal any novel PCa loci of genome-wide statistical significance that could differentiate risk by
disease aggressiveness (Supplementary Figure 1, C and D). In addition, aside from 8q24 (14), admixture
mapping using 220 474 genotyped SNPs in case-case and case-control comparisons of local ancestry
(Supplementary Methods) failed to identify any novel risk regions harboring risk alleles that are highly
differentiated in frequency between men of African and European ancestry (Supplementary Figure 2A).
The most statistically significant PCa risk association genome-wide was observed with a novel triallelic
(A/T/G) variant at 8q24, with the T allele found in approximately 12% of cases and approximately 6% of
controls (rs72725854 at position 128,074,815 located in “region 2”) (8). The risk allele (T) is only found
in populations of African ancestry with a per-allele odds ratio of 2.33 (95% CI = 2.16 to 2.50, P =
1.08×10
-109
) (Supplementary Figure 2B) and is in linkage disequilibrium with African ancestry-specific
risk alleles rs114798100 (4%, OR = 2.43, 95% CI = 2.21 to 2.66, P = 4.07×10
-81
) and rs111906932 (2%,
OR = 1.92, 95% CI = 1.70 to 2.16, P = 1.44×10
-26
) (8,9). These SNPs are not correlated (r
2
= 0 for
rs114798100 and rs111906932) but define all observed haplotypes with the risk allele T of rs72725854,
and thus describe the same association signal. In stepwise models, four additional variants were found to
108
capture risk (P < 10
-5
) across the 8q24 locus (127.8–128.8 Mb) in men of African ancestry
(Supplementary Table 2).
Of the 100 reported PCa risk loci, 94 variants are polymorphic with a MAF of 0.05 or greater, 81
are directionally consistent with previous results in other populations (OR > 1), and 47 are nominally
statistically significant associations (P < .05) in men of African ancestry. Based on a polygenic risk score
(Supplementary Methods) comprising these risk variants as well as novel variants at 13q34 and 22q12.1
and variants shown to capture risk at 8q24 in men of African ancestry (n = 5), the 10% of men with the
highest polygenic risk scores have a 3.0-fold relative risk (95% CI = 2.5 to 3.6) of PCa compared with
men with “average risk” (polygenic risk scores in the 25th to 75th percentiles) (Supplementary Table 3),
which is comparable with that observed for the top 10% of the risk score distribution in men of European
ancestry (OR = 2.9, 95% CI = 2.8 to 3.1) (2). Estimates for the top 1% of the polygenic risk scores in each
population are 4.2 and 5.7, respectively.
Discussion
The main limitation of this study is suboptimal statistical power (<80%) to detect modest effects (OR <
1.22) at genome-wide levels of statistical significance for common alleles with minor allele frequencies of
less than 10%, particularly in analyses stratified by disease aggressiveness. Another limitation is the lack
of understanding regarding the biological mechanisms through which genetic variation in these
susceptibility regions influences risk.
Our findings substantiate the importance of conducting large-scale genetic studies in diverse
populations for the discovery of novel risk loci that are ancestry-specific (15). Further discovery efforts
and fine-mapping of known loci will be needed to better understand the contribution of germline variation
to PCa in men of African ancestry.
109
Funding
This work was supported by the National Cancer Institute at the National Institutes of Health, ELLIPSE
GAME-ON U19 initiative for prostate cancer (U19 CA148537).
110
Figure 1. Regional plot of a novel genome-wide statistically significant prostate cancer risk region
at chromosome 13q34. Single nucleotide polymorphisms (SNPs) are plotted by their position 110 kb on
either side of the index SNP (purple diamond) on the chromosome against their association (_log10 P)
with prostate cancer risk in men of African ancestry. SNPs surrounding the index SNP are colored to
indicate the local LD structure using pairwise r2 data from the African ancestry samples panel of the 1000
Genomes Project (November 2014 phase III). Below are peaks from TF and histone modification ChIP-
seq experiments in the same genomic window (see the Supplementary Methods). All ChIP-seq in LNCaP
unless otherwise indicated.
111
References for Supplemental Paper 1
1. Kolonel LN, Henderson BE, Hankin JH, et al. A multiethnic cohort in Hawaii and Los Angeles:
Baseline characteristics. Am J Epidemiol. 2000;151(4):346–357.
2. Al Olama AA, Kote-Jarai Z, Berndt SI, et al. A meta-analysis of 87,040 individuals identifies 23 new
susceptibility loci for prostate cancer. Nat Genet. 2014;46(10):1103–1109.
3. Eeles RA, Olama AA, Benlloch S, et al. Identification of 23 new prostate cancer susceptibility loci
using the iCOGS custom genotyping array. Nat Genet. 2013;45(4):385–391, 391e1–391e2.
4. Gudmundsson J, Sulem P, Gudbjartsson DF, et al. Genome-wide association and replication studies
identify four variants associated with prostate cancer susceptibility. Nat Genet. 2009;41(10):1122–1126.
5. Gudmundsson J, Sulem P, Gudbjartsson DF, et al. A study based on whole-genome sequencing yields a
rare variant at 8q24 associated with prostate cancer. Nat Genet. 2012;44(12):1326–1329.
6. Takata R, Akamatsu S, Kubo M, et al. Genome-wide association study identifies five new
susceptibility loci for prostate cancer in the Japanese population. Nat Genet. 2010;42(9):751–754.
7. Thomas G, Jacobs KB, Yeager M, et al. Multiple loci identified in a genome-wide association study of
prostate cancer. Nat Genet. 2008;40(3):310–315.
8. Haiman CA, Patterson N, Freedman ML, et al. Multiple regions within 8q24 independently affect risk
for prostate cancer. Nat Genet. 2007;39(5):638–644.
9. Han Y, Rand KA, Hazelett DJ, et al. Prostate cancer susceptibility in men of African Ancestry at 8q24.
J Natl Cancer Inst. 2016;108(7)
10. Cook MB, Wang Z, Yeboah ED, et al. A genome-wide association study of prostate cancer in West
African men. Hum Genet. 2014;133(5):509–521.
11. Hoffmann TJ, Van Den Eeden SK, Sakoda LC, et al. A large multiethnic genome-wide association
study of prostate cancer identifies novel risk variants and substantial ethnic differences. Cancer Discov.
2015;5(8):878–891.
12. Patti ME, Sun XJ, Bruening JC, et al. 4PS/insulin receptor substrate (IRS)-2 is the alternative
substrate of the insulin receptor in IRS-1-deficient mice. J Biol Chem. 1995;270(42):24670–24673.
13. Southey MC, Goldgar DE, Winqvist R, et al. PALB2, CHEK2 and ATM rare variants and cancer risk:
Data from COGS. J Med Genet. 2016;53(12):800–811.
14. Freedman ML, Haiman CA, Patterson N, et al. Admixture mapping identifies 8q24 as a prostate
cancer risk locus in African-American men. Proc Natl Acad Sci U S A. 2006;103(38):14068–14073.
15. Popejoy AB, Fullerton SM. Genomics is failing on diversity. Nature. 2016; 538(7624):161–164.
112
Supplemental Paper 2 Germline Variation at
8q24 and Prostate Cancer Risk in Men of
European Ancestry
Marco Matejcic, Edward J. Saunders, Tokhir Dadaev, Mark N. Brook, Kan Wang, et al. (2018) Nature
Communications
Abstract
Chromosome 8q24 is a susceptibility locus for multiple cancers, including prostate cancer. Here we
combine genetic data across the 8q24 susceptibility region from 71,535 prostate cancer cases and 52,935
controls of European ancestry to define the overall contribution of germline variation at 8q24 to prostate
cancer risk. We identify 12 independent risk signals for prostate cancer (p<4.28x10
-15
), including three
risk variants that have yet to be reported. From a polygenic risk score (PRS) model, derived to assess the
cumulative effect of risk variants at 8q24, men in the top 1% of the PRS have a 4-fold (95%CI=3.62-4.40)
greater risk compared to the population average. These 12 variants account for ~25% of what can be
currently explained of the familial risk of prostate cancer by known genetic risk factors. These findings
highlight the overwhelming contribution of germline variation at 8q24 on prostate cancer risk which has
implications for population risk stratification.
Introduction
Prostate cancer (PCa) is the most common cancer among men in the US, with 161,360 new cases and
26,730 related deaths estimated in 2017 1. Familial and epidemiological studies have provided evidence
113
of substantial heritability of PCa
2
, and ~170 common risk loci have been identified through genome-wide
association studies (GWAS)
3
. The susceptibility region on chromosome 8q24 has been shown to be a
major contributor to PCa risk, with multiple variants clustered in five linkage disequilibrium (LD) blocks
spanning ~600 Mb that are independently associated with risk
4
. Many of these association signals
reported at 8q24 have been replicated across racial/ethnic populations
5,6
, pointing to common shared
functional variants within 8q24. However, rare ancestry-specific variants have also been detected, which
confer larger relative risks of PCa (odds ratios [ORs] >2.0) than common risk variants in the region and
signify allelic heterogeneity in the contribution of germline variation at 8q24 to PCa risk across
populations
7
.
In the current study, we perform a comprehensive investigation of genetic variation across the 1.4 Mb
cancer susceptibility region at 8q24 (127.6-129.0 Mb) in relation to PCa risk. We combine genotyped and
imputed data from two large GWAS consortia (PRACTICAL/ELLIPSE OncoArray and iCOGS)
including >124,000 individuals of European ancestry to search for novel risk variants as well as to
determine the overall contribution of genetic variation at 8q24 to PCa heritability. Our findings
underscore the sizable impact of genetic variation in the 8q24 region in explaining inter-individual
differences in PCa risk, with potential clinical utility for genetic risk prediction.
Results
Marginal and Conditional Association Analysis. Genotype data from the Illumina OncoArray and
iCOGS array and imputation to 1000 Genomes Project (1KGP) were generated among 71,535 PCa cases
and 52,935 controls of European ancestry from 86 case-control studies (see Methods). Of the 5,600
genotyped and imputed variants at 8q24 (127.6-129.0 Mb) with minor allele frequency (MAF)>0.1%
retained for analysis (see Methods), 1,268 (23%) were associated with PCa risk at p<5x10
-8
while 2,772
(49%) were marginally associated at p<0.05. These 5,600 markers capture, at r
2
>0.8, 90% and 97% of all
114
variants at 8q24 (127.6-129.0 Mb) with MAF≥1% and ≥5%, respectively (based on 1KGP Phase 3 EUR
panel). In a forward and backward stepwise selection model on variants marginally associated with PCa
risk (p<0.05, n=2,772; see Methods), we identified 12 variants with conditional p-values from the Wald
test between 2.93x10
-137
and 4.28x10
-15
(Table 1). None of the other variants were statistically significant
at p<5x10
-8
after adjustment for the 12 independent hits (Figure 1). The 8q24 region by major association
signals is provided as Supplementary Figure 1. Of these 12 stepwise signals, three had alleles with
extreme risk allele frequencies (RAFs) that conveyed large effects (rs77541621, RAF=2%, OR=1.85,
95%CI=1.76-1.94; rs183373024, RAF=1%, OR=2.67, 95%CI=2.43-2.93; rs190257175, RAF=99%,
OR=1.60, 95%CI=1.42-1.80). The remaining variants had RAFs between 0.11 and 0.92 and conditional
ORs that were more modest and ranged from 1.10 to 1.37 (Table 1). For 8 of the 12 variants, the allele
found to be positively associated with PCa risk was the predominant allele (i.e. >50% in frequency). For
two variants, rs78511380 and rs190257175, the marginal associations were not genome-wide significant
and substantially weaker than those in the conditional model. For rs78511380, the marginal OR was
slightly protective (OR=0.97; p=0.027), but reversed direction and was highly statistically significant
when conditioning on the other 11 variants (OR=1.19; p=3.5x10
-18
; Table 1).
Haplotype Analysis. The haplotype analysis showed an additive effect of the 12 independent risk
variants consistent with that predicted in the single variant test; co-occurrence of the 8q24 risk alleles on
the same haplotype does not further increase the risk of PCa (Supplementary Table 1). The unique
haplotype carrying the reference allele for rs190257175 (GCTTAT, 0.5% frequency) is also the sole
haplotype associated with a reduced risk of PCa, suggesting that having the C allele confers a protective
effect. The reference allele for rs78511380 (A, 8% frequency) occurs on a haplotype in block 2 together
with the risk alleles for rs190257175, rs72725879 and rs5013678 (haplotype GTTTAA, 8%) which
obscures the positive association with the T allele of rs78511380. Thus, the marginal protective effect
associated with the risk allele for rs78511380 reflects an increased risk associated with the occurrence on
a risk haplotype with other risk alleles (Supplementary Table 1).
115
Correlation with Known Risk Loci. The 12 risk variants spanned across the five LD blocks previously
reported to harbor risk variants for PCa at 8q24
4
, with block 2 harboring six signals, blocks 1 and 5 two
signals each, and blocks 3 and 4 only one (Supplementary Figure 2). Except for a weak correlation
between rs72725879 and rs78511380 in block 2 (r
2
=0.28), the risk variants were uncorrelated with each
other (r2≤0.09; Supplementary Data 1), which corroborates their independent association with PCa risk.
Eight of the variants (rs1487240, rs77541621, rs72725879, rs5013678, rs183373024, rs17464492,
rs6983267, rs7812894) have been previously reported either directly (Supplementary Table 2) or are
correlated (r
2
≥0.42) with known markers of PCa risk from studies in populations of European, African or
Asian ancestry (Supplementary Data 1)
4,7–10
. The marginal estimates for previously published PCa risk
variants at 8q24 in the current study are shown in Supplementary Table 2. The variant rs1914295 in block
1 is only weakly correlated with the previously reported risk variants rs12543663 and rs10086908
(r
2
=0.17 and 0.14, respectively), while rs7851380 is modestly correlated with the previously reported risk
variant rs1016343 (r
2
=0.28). The remaining two variants, rs190257175 and rs12549761, are not correlated
(r
2
<0.027) with any known PCa risk marker.
Polygenic Risk Score and Familial Relative Risk. To estimate the cumulative effect of germline
variation at 8q24 on PCa risk, a polygenic risk score (PRS) was calculated for the 12 independent risk
alleles from the final model based on allele dosages weighted by the per-allele conditionally adjusted ORs
(see Methods). Compared to the men at ‘average risk’ (i.e. the 25th-75th PRS range among controls), men
in the top 10% of the PRS distribution had a 1.93-fold relative risk (95%CI=1.86-2.01) (Table 2), with the
risk being 3.99-fold higher (95%CI=3.62-4.40) for men in the top 1%. Risk estimates by PRS category
are not modified by family history (FamHist-yes: OR=4.24, 95%CI=2.85-6.31; FamHist-no: OR=3.38,
95%CI=2.88-3.97). To quantify the impact of germline variation at 8q24, we also estimated the
proportion of familial relative risk (FRR) and heritability of PCa contributed by 8q24 and compared this
to the proportions explained by all known PCa risk variants including 8q24 (see Methods). The 175
116
established PCa susceptibility loci identified to date
3,11
are estimated to explain 37.08% (95%CI=32.89-
42.49) of the FRR of PCa, while the 12 independent signals at 8q24 alone capture 9.42% (95%CI=8.22-
10.88), which is 25.4% of the total FRR explained by known genetic risk factors for PCa (Table 3). This
is similar to the proportion of heritability explained by 8q24 variants (22.2%) compared to the total
explained heritability by the known risk variants (0.118). In comparison, the next highest contribution of
an individual susceptibility region to the FRR of PCa is the TERT region at chromosome 5p15, where 5
independent signals contributed 2.63% (95%CI=2.34-3.00). No other individual GWAS locus has been
established as explaining >2% of the FRR, including the low frequency, non-synonymous, moderate
penetrance HOXB13 variant (rs138213197) at chromosome 17q21 that is estimated to explain only 1.91%
(95%CI=1.20-2.85) of the FRR 11.
JAM Analysis. We explored our data with a second fine-mapping approach, JAM (Joint Analysis of
Marginal summary statistics)
12
, which uses GWAS summary statistics to identify credible sets of variants
that define the independent association signals in susceptibility regions (see Methods). The 95% credible
set for the JAM analysis confirmed all of the independent signals from stepwise analysis except
rs190257175, for which evidence for an association was weak (variant-specific Bayes factor (BF) =1.17).
There were 50 total variants included in the 95% credible set, and 174 after including variants in high LD
(r2>0.9) with those in the credible set (Supplementary Data 2).
Discussion
In this large study of germline genetic variation across the 8q24 region, we identified 12 independent
association signals among men of European ancestry, with three of the risk variants (rs1914295,
rs190257175 and rs12549761) being weakly correlated (r
2
≤0.17) with known PCa risk markers. The
combination of these 12 independent signals at 8q24 capture approximately one quarter of the total PCa
117
FRR explained by known genetic risk factors, which is substantially greater than any other known PCa
risk locus.
The 8q24 region is the major susceptibility region for PCa; however, the underlying biological
mechanism(s) through which germline variation in this region influences PCa risk remains uncertain. For
each of the 12 risk variants at 8q24, the 95% credible set defined noteworthy (i.e. putative functional)
variants based on summary statistics while accounting for LD. To inform biological functionality, we
overlaid epigenetic functional annotation using publicly available datasets (see Methods) with the location
of the 12 independent signals (and corresponding 174 variants within their 95% credible sets;
Supplementary Data 3). Of the 12 independent lead variants, 6 are situated within putative transcriptional
enhancers in prostate cell-lines; either through intersection with H3K27AC (rs72725879, rs5013678,
rs78511380, rs6983267 and rs7812894) or through a ChromHMM enhancer annotation (rs17464492,
rs6983267, rs7812894). Eight of the 12 stepwise hits (rs77541621, rs190257175, rs5013678,
rs183373024, rs78511380, rs17464492, rs6983267, rs7812894) also intersect transcription factor binding
site peaks from multiple ChIP-seq datasets representing the AR, ERG, FOXA1, GABPA, GATA2,
HOXB13 and NKX3.1 transcription factors, with all 8 intersecting a FOXA1 mark and half an AR
binding site. These variants may therefore exert their effect through regulation of enhancer activity and
long-range expression of genes important for cancer tumorigenesis and/or progression
13
. The variant
rs6983267 has also been shown to act in an allele-specific manner to regulate prostate enhancer activity
and expression of the proto-oncogene MYC in vitro and in vivo
14,15
. However, despite the close
proximity to the MYC locus, no direct association has been detected between 8q24 risk alleles and MYC
expression in normal and tumor human prostate tissues
16
. The rare variant with the largest effect on risk,
rs183373024, shows high evidence of functionality based on overlap with multiple DNaseI and
transcription factor binding site peaks (for AR, FOXA1, HOXB13 and NKX3.1), which supports previous
findings of an allele-dependent effect of this variant on the disruption of a FOXA1 binding motif
17
.
Seven independent signals (rs1914295, rs1487240, rs77541621, rs72725879, rs5013678, rs183373024,
rs78511380) and variants correlated at r
2
>0.9 with these signals (Supplementary Data 2) are located
118
within or near a number of prostate cancer–associated long noncoding RNAs (lncRNAs),
including PRNCR1, PCAT1 and CCAT2, previously reported to be upregulated in human PCa cells
18
and tissues
19,20
. Based on eQTL annotations in prostate adenocarcinoma cells, the independent signal
rs1914295 and three correlated variants (r
2
>0.9; Supplementary Data 2) are associated with
overexpression of FAM84B, a gene previously associated with progression and poor prognosis of PCa in
animal studies
21
. Variants correlated at r
2
>0.9 with rs7812894 (n=9; Supplemental Table 4) are eQTLs
for POU5F1B, a gene overexpressed in cancer cell lines and cancer tissues
22,23
, although its role in PCa
development is unknown. Whilst we have successfully refined the 8q24 region and identified a subset of
variants with putative biological function within our credible set, multi-ethnic comparisons may help
refine the association signals even further and precisely identify the functional alleles and biological
mechanisms that modify PCa risk.
Whereas the individual associations of the 8q24 variants with PCa risk are relatively modest (ORs<2.0,
except for rs183373024), their cumulative effects are substantial, with risk being 4-fold higher for men in
the top 1% of the 8q24-only PRS. The contribution to the overall FRR of PCa is substantially greater for
the 8q24 region (9.42%) than for any other known GWAS locus, including the moderate penetrance non-
synonymous variant in HOXB13 (1.91%). The ability of these markers to explain ~25.4% of what can be
currently explained by all known PCa risk variants is a clear indication of the important contribution of
germline variation at 8q24 on PCa risk. Our study was predominantly powered to analyze variants with
MAF>1% as the imputed variants with MAF=0.1-1% were most likely to fail quality control (QC);
however, the high density of genotyped markers and haplotypes at 8q24 in the OncoArray and iCOGS
studies provided a robust backbone for imputation and increased the chances to impute lower MAF
variants with high imputation quality score. Understanding of the biology of these variants and the
underlying genetic basis of PCa could provide new insights into the identification of reliable risk-
prediction biomarkers for PCa, as well as enable the development of effective strategies for targeted
screening and prevention.
119
Methods
Study Subjects, Genotyping and Quality Control. We combined genotype data from the PRACTICAL/
ELLIPSE OncoArray and iCOGS consortia
3,24
, which included 143,699 men of European ancestry from
86 case-control studies largely based in either the US or Europe. In each study, cases primarily included
men with incident PCa while controls were men without a prior diagnosis of the disease.
Both of the OncoArray and iCOGS custom arrays were designed to provide high coverage of common
alleles (minor allele frequency [MAF]>5%) across 8q24 (127.6-129.0 Mb) based on the 1000 Genomes
Project (1KGP) Phase 3 for OncoArray, and the European ancestry (EUR) panel from HapMap Phase 2
for iCOGS. A total of 57,580 PCa cases and 37,927 controls of European ancestry were genotyped with
the Illumina OncoArray, and 24,198 PCa cases and 23,994 controls of European ancestry were genotyped
with the Illumina iCOGS array. For both studies, sample exclusion criteria included duplicate samples,
first-degree relatives, samples with a call rate <95% or with extreme heterozygosity (p<10
−6
), and
samples with an estimated proportion of European ancestry <0.8
3,24
. In total, genotype data for 53,449
PCa cases and 36,224 controls from OncoArray and 18,086 PCa cases and 16,711 controls from iCOGS
were included in the analysis. Genetic variants with call rates <0.95, deviation from Hardy-Weinberg
equilibrium (p<10
-7
in controls), and genotype discrepancy in >2% of duplicate samples were excluded.
Of the final 498,417 genotyped variants on the OncoArray and 201,598 on the iCOGS array that passed
QC, 1,581 and 1,737 within the 8q24 region, respectively, were retained for imputation.
All studies complied with all relevant ethical regulations and were approved by the institutional review
boards at each of the participating institutions. Informed consent was obtained from all study participants.
Additional details of each study are provided in the Supplementary Note 1.
Imputation Analysis. Imputation of both OncoArray and iCOGS genotype data was performed using
SHAPEIT 25 and IMPUTEv2
26
to the October 2014 (Phase 3) release of the 1KGP reference panel. A
total of 10,136 variants from OncoArray and 10,360 variants from iCOGS with MAF>0.1% were imputed
120
across the risk region at 8q24 (127.6-129.0 Mb). Variants with an imputation quality score >0.8 were
retained for a total of 5,600 overlapping variants between the two datasets.
Statistical Analysis. Unconditional logistic regression was used to estimate per-allele odds ratios (ORs)
and 95% confidence intervals (CIs) for the association between genetic variants (single nucleotide
polymorphisms and insertion/deletion polymorphisms) and PCa risk adjusting for country and principal
components (7 for OncoArray and 8 for iCOGS). Allele dosage effects were tested through a 1-degree of
freedom two-tailed Wald trend test. The marginal risk estimates for the 5,600 variants at 8q24 that passed
QC were combined by a fixed effect meta-analysis with inverse variance weighting using METAL
27
. A
modified forward and backward stepwise model selection with inclusion and exclusion criteria of
p≤5x10
−8
was performed on variants marginally associated with PCa risk from the meta results (p<0.05,
n=2,772). At each step, the effect estimates for the candidate variants from both studies (OncoArray and
iCOGS) were meta-analyzed and each variant was incorporated into the model based on the strength of
association. All remaining variants were included one-at-a-time into the logistic regression model
conditioning on those already incorporated in the model. We applied a conservative threshold for
independent associations, with variants kept in the model if their meta p-value from the Wald test was
genome-wide significant at p≤5x10
−8
after adjustment for the other variants in the model. Correlations
between variants in the final model and previously published PCa risk variants at 8q24 were estimated
using the 1KGP Phase 3 EUR panel (Supplementary Data 1).
Haplotype analysis. Haplotypes were estimated in the Oncoarray data only using variants from the final
stepwise model selection (n=12) and the EM algorithm
28
within LD block regions inferred based on
recombination hotspots using Haploview 4.2 (Broad Institute, Cambridge, MA, USA)
29
. Only haplotypes
with an estimated frequency ≥0.5% were tested.
121
Polygenic Risk Score and Familial Relative Risk. An 8q24-only polygenic risk score (PRS) was
calculated for variants from the final model (n=12) with allele dosage from OncoArray and iCOGS
weighted by the per-allele conditionally adjusted ORs from the meta-analysis. Categorization of the PRS
was based on the percentile distribution in controls, and the risk for each category was estimated relative
to the interquartile range (25-75%) in OncoArray and iCOGS separately, and then meta-analyzed across
the two studies. We estimated the contribution of 8q24 variants to the familial (first-degree) relative risk
(FRR) of PCa (FRR=2.5)
30
under a multiplicative model, and compared this to the FRR explained by all
known PCa risk variants including 8q24 (Supplementary Data 4). We also estimated heritability of PCa
using the LMM approach as implemented in GCTA
31
. For regions which have been fine-mapped using
the OncoArray meta-analysis data, we used the updated representative lead variants, otherwise the
originally reported variant was included provided that it had replicated at genome-wide significance in the
meta-analysis; this identified a total of 175 independently associated PCa variants for the FRR and
heritability calculations
3,11
. For these analyses, we used conditional estimates from fitting a single model
with all variants in the OncoArray dataset for regions with multiple variants and the overall marginal
meta-analysis results from Schumacher et al.
3
for regions with a single variant. To correct for potential
bias in effect estimation of newly discovered variants, we implemented a Bayesian version of the
weighted correction
32
, which incorporates the uncertainty in the effect estimate into the final estimates of
the bias-corrected ORs, 95%CIs and the corresponding calculations of percent FRR explained.
JAM Analysis. To confirm the stepwise results and identify candidate variants for potential functional
follow-up, we used a second fine-mapping approach, JAM (Joint Analysis of Marginal summary statistics)
12
. JAM is a multivariate Bayesian variable selection framework that uses GWAS summary statistics to
identify the most likely number of independent associations within a locus and define credible sets of
variants driving those associations. JAM was applied to summary statistics from the meta-analysis results
using LD estimated from imputed individual level data from 20,000 cases and 20,000 controls randomly
selected from the OncoArray sub-study. LD pruning was performed using Priority Pruner
122
(http://prioritypruner.sourceforge.net/) on the 2,772 marginally associated variants at r
2
=0.9, resulting in
825 tag variants analyzed in four independent JAM runs with varying starting seeds. Credible sets were
determined as the tag variants that were selected in the top models that summed to a specific cumulative
posterior probability in all four of the independent JAM runs, plus their designated high LD proxy
variants from the pruning step.
Functional Annotation. Variants in the 95% credible set (n=50) plus variants correlated at r
2
>0.9 with
those in the credible set (n=174) were annotated for putative evidence of biological functionality using
publicly available datasets as described by Dadaev et al.
11
. Briefly, variants were annotated for proximity
to gene (GENCODEv
19
), miRNA transcripts (miRBase release
20
), evolutionary constraint (according to
GERP++, SiPhy and PhastCons algorithms), likelihood of pathogenicity (CADDv1.3) and overlap with
prospective regulatory elements in prostate-specific datasets (DNaseI hypersensitivity sites, H3K27Ac,
H3K27me3 and H3K4me3 histone modifications, and for AR, CTCF, ERG, FOXA1, GABPA, GATA2,
HOXB13 and NKX3.1 transcription factor binding sites) in a mixture of LNCaP, PC-3, PrEC, RWPE1
and VCaP cell lines and human prostate tumor tissues downloaded from the Cistrome Data Browser
(http://cistrome.org/db/). The chromatin state in which each variant resides was assessed using
ChromHMM annotations from two prostate cell lines (PrEC and PC3). Cis-gene regulation was evaluated
using 359 prostate adenoma cases from The Cancer Genome Atlas (TCGA PRAD; https://gdc-
portal.nci.nih.gov) that passed QC
11
. The eQTL analysis was performed using FastQTL with 1000
permutations for each gene within a 1Mb window. We then used the method by Nica et al.
33
that
integrates eQTLs and GWAS results in order to reveal the subset of association signals that are due to cis
eQTLs. For each significant eQTL, we added the candidate variant to the linear regression model to
assess if the inclusion better explains the change in expression of the gene. We retrieved the p-value of
the model, assigning p-value of 1 if the eQTL and variant are the same. Then we ranked the p-values in
descending order for each eQTL, and finally calculated the colocalization score for each pair of eQTL and
123
variants. In general, if an eQTL and candidate variant represent the same signal, this will be reflected by
the variant having a high p-value, a low rank and consequently a high colocalization score.
Data availability
The authors declare that data supporting the findings of this study are available within the paper [and in
the supplementary information files]. However, some of the data used to generate the results of this study
are available from the first author and the PRACTICAL Consortium upon request.
Acknowledgements
Genotyping of the OncoArray was funded by the US National Institutes of Health (NIH) [U19 CA
148537 for ELucidating Loci Involved in Prostate Cancer SuscEptibility (ELLIPSE) project and
X01HG007492 to the Center for Inherited Disease Research (CIDR) under contract number
HHSN268201200008I]. Additional analytic support was provided by NIH NCI U01 CA188392 (PI:
Schumacher).
The PRACTICAL consortium was supported by Cancer Research UK Grants C5047/A7357,
C1287/A10118, C1287/A16563, C5047/A3354, C5047/A10692, C16913/A6135, European Commission's
Seventh Framework Programme grant agreement n° 223175 (HEALTH-F2-2009-223175), and The
National Institute of Health (NIH) Cancer Post-Cancer GWAS initiative grant: No. 1 U19 CA 148537-01
(the GAME-ON initiative).
We wish to thank all GWAS study groups contributing to the data set from which this study was
conducted: OncoArray; iCOGS; The PRACTICAL (Prostate Cancer Association Group to Investigate
Cancer-Associated Alterations in the Genome) Consortium; and The GAME-ON/ELLIPSE Consortium.
Detailed acknowledgements and funding information for all GWAS study groups and from all the
individual studies involved in the PRACTICAL Consortium are included in Supplementary Note 1.
124
We would also like to thank the following for funding support: The Institute of Cancer Research and The
Everyman Campaign, The Prostate Cancer Research Foundation, Prostate Research Campaign UK (now
Prostate Action), The Orchid Cancer Appeal, The National Cancer Research Network UK, The National
Cancer Research Institute (NCRI) UK. We are grateful for support of NIHR funding to the NIHR
Biomedical Research Centre at The Institute of Cancer Research and The Royal Marsden NHS
Foundation Trust.
125
Table 1. Marginal and conditional estimates for genetic markers at 8q24 independently
associated with prostate cancer risk
Conditional association
1
Marginal association
Variant ID
2
Position
3
Allele
4
RAF
5
LD cluster
6
OR (95%CI)
7
p-value OR (95%CI)
8
p-value
rs1914295 127910317 T/C 0.68 block 1
1.10 (1.08-1.12) 7.30x10
-25
1.09 (1.07-1.11) 3.07x10
-21
rs1487240 128021752 A/G 0.74 block 1
1.20 (1.17-1.22) 2.77x10
-66
1.16 (1.14-1.18) 2.97x10
-54
rs77541621 128077146 A/G 0.02 block 2
1.85 (1.76-1.94) 2.93x10
-137
1.83 (1.74-1.92) 4.33x10
-137
rs190257175 128103466 T/C 0.99 block 2
1.60 (1.42-1.80) 4.28x10
-15
1.36 (1.22-1.53) 6.90 x10
-8
rs72725879 128103969 T/C 0.18 block 2
1.31 (1.28-1.35) 1.26x10
-83
1.17 (1.14-1.19) 3.96x10
-48
rs5013678 128103979 T/C 0.78 block 2
1.10 (1.08-1.13) 1.58x10
-19
1.20 (1.17-1.22) 4.44x10
-68
rs183373024 128104117 G/A 0.01 block 2
2.67 (2.43-2.93) 4.89x10
-95
3.20 (2.92-3.50) 6.60x10
-138
rs78511380 128114146 T/A 0.92 block 2
1.19 (1.14-1.23) 3.48x10
-18
0.97 (0.94-1.00) 0.027
rs17464492 128342866 A/G 0.72 block 3
1.16 (1.14-1.18) 3.01x10
-52
1.17 (1.15-1.19) 9.05x10
-61
rs6983267 128413305 G/T 0.51 block 4
1.18 (1.16-1.20) 5.68x10
-84
1.23 (1.21-1.25) 3.15x10
-135
rs7812894 128520479 A/T 0.11 block 5
1.37 (1.33-1.40) 1.55x10
-122
1.45 (1.41-1.49) 1.20x10
-181
rs12549761 128540776 C/G 0.87 block 5 1.21 (1.18-1.24) 1.61x10
-45
1.28 (1.25-1.31) 1.38x10
-78
1
Each variant was incorporated in the stepwise model based on the strength of marginal association from the meta-analysis of OncoArray and
iCOGS data.
2
Variant that remained genome-wide significantly associated with PCa risk (p<10
-8
) in the final stepwise model
3
Chromosome position based on human genome build 37.
4
Risk allele/reference allele.
5
Risk allele frequency.
6
LD clusters were inferred based on recombination hotspots using Haploview 4.2
29
and defined as previously reported by Al Olama et al.
4
.
7
Per-allele odds ratio and 95% confidence interval adjusted for country, 7(OncoArray)/8(iCOGS) principal components and all other variants in
the table.
8
Per-allele odds ratio and 95% confidence interval adjusted for country and 7(OncoArray)/8(iCOGS) principal components.
126
Table 2. Relative risk of PCa for polygenic risk score (PRS) groups
1
No. of individuals Risk estimates for PRS groups
Risk category percentile
2
Controls Cases
OR (95% CI)
3
p-value
≤ 1% 530 339 0.52 (0.45-0.59) 2.11x10
-20
1%-10% 4771 3636
0.62 (0.59-0.65) 6.26x10
-90
10%-25% 7936 7359
0.75 (0.72-0.78) 3.62x10
-54
25%-75% 26464 32743
1.00 (Ref)
75%-90% 7940 13431
1.37 (1.32-1.41) 6.55x10
-77
90%-99% 4766 11451
1.93 (1.86-2.01) 4.13x10
-249
>99% 528 2576 3.99 (3.62-4.40) 5.64x10
-172
1
PRS were calculated for variants from the final stepwise model with allele dosage from OncoArray and iCOGS
weighted by the per-allele conditionally adjusted odds ratios from the meta-analysis.
2
Risk category groups were based on the percentile distribution of risk alleles in overall controls.
3
Estimated effect of each PRS group relative to the interquartile range (25-75%) in OncoArray and iCOGS datasets
separately, and then meta-analyzed across the two studies; odds ratios were adjusted for country and
7(OncoArray)/8(iCOGS) principal components.
127
Table 3. Proportion of familial relative risk (FRR) and heritability (hg
2
) of PCa explained
by known risk variants
Source No. of variants
Proportion of FRR
(95%CI)
% of total
FRR hg
2
(SE)
% of total
hg
2
8q24
1
12
9.42 (8.22-10.88) 25.4
0.027 (0.011) 22.2
HOXB13
2
1
1.91 (1.20-2.85) 5.2
0.004 (0.005) 3.0
All other
variants
2,3
162
25.77 (22.94-29.36) 69.5
0.092 (0.010) 74.9
Total 175 37.08 (32.89-42.49) 100 0.118 (0.012) 100
1
Conditional estimates were derived by fitting a single model with all variants from OncoArray data.
2
Risk estimates and allele frequencies for regions with a single variant are from a meta-analysis of OncoArray,
iCOGS and 6 additional GWAS
3
.
3
Risk variants included from fine-mapping of PCa susceptibility loci in European ancestry populations
11
.
1
References for Supplemental Paper 2
1. Siegel, R. L., Miller, K. D. & Jemal, A. Cancer Statistics, 2017. CA. Cancer J. Clin. 67, 7–30 (2017).
2. Hjelmborg, J. B. et al. The heritability of prostate cancer in the Nordic Twin Study of Cancer. Cancer
Epidemiol. Biomark. Prev. Publ. Am. Assoc. Cancer Res. Cosponsored Am. Soc. Prev. Oncol. 23, 2303–
2310 (2014).
3. Schumacher, F. R. et al. Association analyses of more than 140,000 men identify 63 new prostate
cancer susceptibility loci. Nat. Genet. (2018). doi:10.1038/s41588-018-0142-8
4. Al Olama, A. A. et al. Multiple loci on 8q24 associated with prostate cancer susceptibility. Nat. Genet.
41, 1058–1060 (2009).
5. Haiman, C. A. et al. Multiple regions within 8q24 independently affect risk for prostate cancer. Nat.
Genet. 39, 638–644 (2007).
6. Han, Y. et al. Generalizability of established prostate cancer risk variants in men of African ancestry.
Int. J. Cancer 136, 1210–1217 (2015).
7. Gudmundsson, J. et al. A study based on whole-genome sequencing yields a rare variant at 8q24
associated with prostate cancer. Nat. Genet. 44, 1326–1329 (2012).
8. Han, Y. et al. Prostate Cancer Susceptibility in Men of African Ancestry at 8q24. J. Natl. Cancer Inst.
108, (2016).
9. Hoffmann, T. J. et al. A large multiethnic genome-wide association study of prostate cancer identifies
novel risk variants and substantial ethnic differences. Cancer Discov. 5, 878–891 (2015).
10. Conti, D. V. et al. Two Novel Susceptibility Loci for Prostate Cancer in Men of African Ancestry. J.
Natl. Cancer Inst. 109, (2017).
11. Dadaev, T. et al. Fine-mapping of prostate cancer susceptibility loci in a large meta-analysis identifies
candidate causal variants. Nat. Commun. 9, 2256 (2018).
12. Newcombe, P. J., Conti, D. V. & Richardson, S. JAM: A Scalable Bayesian Framework for Joint
Analysis of Marginal SNP Effects. Genet. Epidemiol. 40, 188–201 (2016).
13. Jia, L. et al. Functional enhancers at the gene-poor 8q24 cancer-linked locus. PLoS Genet. 5,
e1000597 (2009).
14. Pomerantz, M. M. et al. The 8q24 cancer risk variant rs6983267 shows long-range interaction with
MYC in colorectal cancer. Nat. Genet. 41, 882–884 (2009).
15. Wasserman, N. F., Aneas, I. & Nobrega, M. A. An 8q24 gene desert variant associated with prostate
cancer risk confers differential in vivo activity to a MYC enhancer. Genome Res. 20, 1191–1197 (2010).
2
16. Pomerantz, M. M. et al. Evaluation of the 8q24 prostate cancer risk locus and MYC expression.
Cancer Res. 69, 5568–5574 (2009).
17. Hazelett, D. J., Coetzee, S. G. & Coetzee, G. A. A rare variant, which destroys a FoxA1 site at 8q24,
is associated with prostate cancer risk. Cell Cycle Georget. Tex 12, 379–380 (2013).
18. Chung, S. et al. Association of a novel long non-coding RNA in 8q24 with prostate cancer
susceptibility. Cancer Sci. 102, 245–252 (2011).
19. Prensner, J. R. et al. Transcriptome sequencing across a prostate cancer cohort identifies PCAT-1, an
unannotated lincRNA implicated in disease progression. Nat. Biotechnol. 29, 742–749 (2011).
20. Zheng, J. et al. The up-regulation of long non-coding RNA CCAT2 indicates a poor prognosis for
prostate cancer and promotes metastasis by affecting epithelial-mesenchymal transition. Biochem.
Biophys. Res. Commun. 480, 508–514 (2016).
21. Wong, N. et al. Upregulation of FAM84B during prostate cancer progression. Oncotarget 8, 19218–
19235 (2017).
22. Suo, G. et al. Oct4 pseudogenes are transcribed in cancers. Biochem. Biophys. Res. Commun. 337,
1047–1051 (2005).
23. Hayashi, H. et al. The OCT4 pseudogene POU5F1B is amplified and promotes an aggressive
phenotype in gastric cancer. Oncogene 34, 199–208 (2015).
24. Eeles, R. A. et al. Identification of 23 new prostate cancer susceptibility loci using the iCOGS custom
genotyping array. Nat. Genet. 45, 385–391, 391e1–2 (2013).
25. Delaneau, O., Marchini, J. & Zagury, J.-F. A linear complexity phasing method for thousands of
genomes. Nat. Methods 9, 179–181 (2011).
26. Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for
the next generation of genome-wide association studies. PLoS Genet. 5, e1000529 (2009).
27. Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide
association scans. Bioinforma. Oxf. Engl. 26, 2190–2191 (2010).
28. Excoffier, L. & Slatkin, M. Maximum-likelihood estimation of molecular haplotype frequencies in a
diploid population. Mol. Biol. Evol. 12, 921–927 (1995).
29. Barrett, J. C., Fry, B., Maller, J. & Daly, M. J. Haploview: analysis and visualization of LD and
haplotype maps. Bioinforma. Oxf. Engl. 21, 263–265 (2005).
30. Johns, L. E. & Houlston, R. S. A systematic review and meta-analysis of familial prostate cancer risk.
BJU Int. 91, 789–794 (2003).
3
31. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait
analysis. Am. J. Hum. Genet. 88, 76–82 (2011).
32. Zhong, H. & Prentice, R. L. Bias-reduced estimators and confidence intervals for odds ratios in
genome-wide association studies. Biostat. Oxf. Engl. 9, 621–634 (2008).
33. Nica, A. C. et al. Candidate causal regulatory effects by integration of expression QTLs with complex
trait genetic associations. PLoS Genet. 6, e1000895 (2010).
4
Bibliography
Al Olama, A. A., Kote-Jarai, Z., Berndt, S. I., et al. (2014). A meta-analysis of 87,040 individuals identifies 23 new
susceptibility loci for prostate cancer. Nat Genet, 46(10):1103-9
Al Olama, A. A., Kote-Jarai, Z., Giles, G. G., et al. (2009). Multiple loci on 8q24 associated with prostate cancer
susceptibility. Nat Genet 41, 1058-1060.
Altshuler, D., Daly, M. J., and Lander, E. S. (2008). Genetic Mapping in Human Disease. Science, 322(5903), 881–
888.
Asimit, J. L., Hatzikotoulas, K., McCarthy, M., Morris, A. P., and Zeggini, E. (2016) Trans-ethnic study design
approaches for fine-mapping. European Journal of Human Genetics, 24(9), 1330-6
Cantor, R. M., Lange, K., and Sinsheimer,J. S. (2010). Prioritizing GWAS results: A review of statistical
methods and recommendations for their application. The American Journal of Human Genetics, 86(1), 6–22.
Chen, W., Larrabee, B. R., Ovsyannikova, I. G., et al. (2015). Fine Mapping Causal Variants with an Approximate
Bayesian Method Using Marginal Test Statistics. Genetics 200, 719-736.
Conti, D. V., and Witte, J. S. (2003). Hierarchical modeling of linkage disequilibrium: genetic structure and spatial
relations. The American Journal of Human Genetics 72, 351-363.
Coram, M. A., Candille, S. I., Duan, Q., et al. (2015). Leveraging Multi-ethnic Evidence for Mapping Complex
Traits in Minority Populations: An Empirical Bayes Approach. Am J Hum Genet 96, 740-752.
Cordell, H. J., and Clayton, D. G. (2005). Genetic association studies. The Lancet, 366(9491), 1121– 1131.
Clayton, D. (2012). Link Functions in Multi-Locus Genetic Models: Implications for Testing, Prediction,
and Interpretation. Genetic Epidemiology 36, 409–418.
Eeles, R. A., Kote-Jarai, Z., Al Olama, A. A., et al. (2009). Identification of seven new prostate cancer
susceptibility loci through a genome-wide association study. Nat Genet 41, 1116-1121.
Ewing, C. M., Ray, A. M., Lange, E. M., et al. (2012). Germline mutations in HOXB13 and prostate-cancer risk. N
Engl J Med 366, 141-149.
Frazer, K. A., Murray, S. S., Schork, N. J., and Topol, E. J. (2009). Human genetic variation and its contribution
to complex traits. Nat Rev Genet, 10(4), 241–251.
Freedman, M. L., Monteiro, A. N. A., Gayther, S. A., Coetzee, G. A., Risch, A., Plass, C., et al. (2011). Principles
for the post-GWAS functional characterization of cancer risk loci. Nature Publishing Group, 43(6), 513–518.
George, E. I., and Foster, D. P. (2000). Calibration and empirical Bayes variable selection. Biometrika, 87(4), 731–
747.
Haiman, C. A., Chen, G. K., Blot, W. J., et al. (2011a). Characterizing genetic risk at known prostate cancer
susceptibility loci in African Americans. PLoS Genet 7, e1001387.
5
Haiman, C. A., Chen, G. K., Blot, W. J., et al. (2011b). Genome-wide association study of prostate cancer in men of
African ancestry identifies a susceptibility locus at 17q21. Nat Genet 43, 570-573.
Han, B., and Eskin, E. (2011). Random-Effects Model Aimed at Discovering Associations in Meta-
Analysis of Genome-wide Association Studies. The American Journal of Human Genetics 88, 586-598.
Han, Y., Hazelett, D. J., Wiklund, F., et al. (2015). Integration of multiethnic fine-mapping and genomic annotation
to prioritize candidate functional SNPs at prostate cancer susceptibility regions. Hum Mol Genet 24(19):5603-18
Han, Y., Rand, K. A., Hazelett, D. J., et al. (2016a). Prostate Cancer Susceptibility in Men of African Ancestry at
8q24. J Natl Cancer Inst 108(7).
Han, Y., Rand, K. A., Hazelett, D. J., et al. (2016b). Prostate Cancer Susceptibility in Men of African Ancestry at
8q24. J Natl Cancer Inst 108(7).
Han, Y., Signorello, L. B., Strom, S. S., et al. (2014). Generalizability of established prostate cancer risk variants in
men of African ancestry. Int J Cancer. 136(5):1210-7
Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning; Data Mining,
Inference and Prediction, Springer, New York. Second edition.
Hazelett, D. J., Conti, D. V., Han, Y., et al. (2016). Reducing GWAS Complexity. Cell Cycle 15, 22-24.
Hirschhorn, J. N., Lohmueller, K., Byrne, E., and Hirschhorn, K. (2002). A comprehensive review of genetic
association studies. Genetics in Medicine, 4(2), 45–61.
Hoerl, A. E., and Kennard R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems.
Technometrics 12(1), 55-67.
Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B., and Eskin, E. (2014). Identifying causal variants
at loci with multiple signals of association. Genetics 198, 497-508.
Liu, J., Lewinger, J. P., Gilliland, F. D., Gauderman, W. J., and Conti, D. V. (2013). Confounding and
Heterogeneity in Genetic Association Studies with Admixed Populations. American Journal of Epidemiology,
177(4):351-60.
Kichaev, G., and Pasaniuc, B. (2015). Leveraging Functional-Annotation Data in Trans-ethnic Fine-Mapping
Studies. Am J Hum Genet 97, 260-271.
Kichaev, G., Yang, W. Y., Lindstrom, S., et al. (2014). Integrating functional data to prioritize causal variants in
statistical fine-mapping studies. PLoS Genet 10, e1004722.
Koenker, R., and Bassett G. Jr (1978). Regression quantiles. Econometrica 46(1), 33-50.
Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., et al. (2001). Initial
sequencing and analysis of the human genome. Nature, 409(6822), 860–921.
Li, Y. R., and Keating, B. J. (2014). Trans-ethnic genome-wide association studies: advantages and challenges of
mapping in diverse populations. Genome Med 6, 91.
6
Libbrecht, M. W., and Noble, W. S. (2015). Machine learning applications in genetics and genomics. Nat Rev Genet
16, 321-332.
Lin, X., Qu, L., Chen, Z., et al. (2013). A novel germline mutation in HOXB13 is associated with prostate cancer
risk in Chinese men. Prostate 73, 169-175.
McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., et al. (2008). Genome-wide association studies
for complex traits: consensus, uncertainty and challenges. Nat Rev Genet, 9(5), 356–369.
Morris, A. P. (2011). Trans-ethnic meta-analysis of genome-wide association studies. Genet Epidemiol 35, 809-822.
Neale, M., Cardon, L., and Division, N. A. T. O. S. A. (1992). Methodology for Genetic Studies of Twins and
Families: Springer.
Newcombe, P. J., Conti, D. V., and Richardson, S. (2016) JAM: A scalable Bayesian framework for joint
analysis of marginal SNP effects. Genet Epidemiol 40(3), 188-201.
Newcombe, P. J., Verzilli, C., Casas, J. P., Hingorani, A. D., Smeeth, L. and Whittaker, J. C. (2009).
Multilocus Bayesian Meta-Analysis of Gene-Disease Associations. The American Journal of Human
Genetics 84, 567-580.
Ong, R. T., Wang, X., Liu, X., and Teo, Y. Y. (2012). Efficiency of trans-ethnic genome-wide meta-analysis and
fine-mapping. Eur J Hum Genet 20, 1300-1307.
Pasaniuc, B., and Price, A. L. (2017) Dissecting the genetics of complex traits using summary association
statistics. Nat Rev Genet 18
Quintana, M. A., Berstein, J. L., Thomas, D. C., and Conti, D. V. (2011). Incorporating model uncertainty in
detecting rare variants: the Bayesian risk index. Genetic Epidemiology, 35(7), 638–649.
Quintana, M. A., and Conti, D. V. (2013). Integrative variable selection via Bayesian model uncertainty.
Statistics in Medicine 32(28), 4938-53.
Quintana, M. A., Schumacher, F. R., Casey, G., Bernstein, J. L., Li, L., and Conti, D. V. (2012).
Incorporating prior biological information for high-dimensional rare variant association studies. Human
Heredity 74, 184-195.
Spain, S. L., and Barrett, J. C. (2015). Strategies for fine-mapping complex traits. Hum Mol Genet. 24(R1), R111-9
Stephens, M., and Balding, D. J. (2009). Bayesian statistical methods for genetic association studies. Nat Rev
Genet.
Stram, D. (2013). Design, Analysis, and Interpretation of Genome-Wide Association Scans: Springer New York.
Raftery, A. E. (1996). Approximate Bayes factors and accounting for model uncertainty in generalised
linear models. Biometrika 83(2), 251-266.
7
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society.
Series B (Methodological), 267–288.
Tsionas, E. G. (2003). Bayesian quantile inference. Journal of Statistical Computation and Simulation, 73(9), 659–
674.
Wakefield, J. (2007). A Bayesian measure of the probability of false discovery in genetic epidemiology studies. Am
J Hum Genet 81, 208-227.
Wakefield, J. (2009). Bayes factors for genome-wide association studies: comparison with P-values. Genet
Epidemiol 33, 79-86.
Yuan, M., and Lin, Y. (2005a). Efficient empirical Bayes variable selection and estimation in linear models. Journal
of the American Statistical Association, 100(472), 1215–1225.
Zaitlen, N., Pasxaniuc, B., Gur, T., Ziv, E., and Halperin, E. (2010) Leveraging Genetic Variability across
Populations for the Identification of Causal Variants. The American Journal of Human Genetics, 86, 23-33.
Abstract (if available)
Abstract
While GWAS has been very successful in identifying SNP-trait associations for many complex diseases, most of the index Single Nucleotide Polymorphisms (SNPs) are only surrogates of the underlying functional variants. In the post-GWAS era, it is still an open question on how to fine-map the summary results from GWAS to prioritize the causal variants. High linkage disequilibrium (LD) remains one of the major challenges for fine-mapping studies as the surrogates and the true functional variants can be nearly identical in one single population. Many fine-mapping methods were proposed to address this issue, mainly from two perspectives: aggregating information across multiple ethnic groups where the LDs are lower or incorporating external functional information. In the past decade, the increasing availability of GWAS studies from diverse ethnic groups enormously facilitated the implementation of multiethnic fine-mapping. On the other hand, the enrichment of gene annotations also provides an unprecedented amount of high-quality functional information for us to exploit. ❧ In this dissertation, we propose two novel fine-mapping methods: multiethnic Joint Analysis of Marginal SNP effects (mJAM), and Quantile (Q1) regularized regression, to address the problem from the above two angles respectively. Each of them is also a statistical method motivated by the applied work I undertook in two other papers. The first one is presented in chapter 2, where we ran a GWAS on African American population with 10,202 prostate cancer cases and 10,810 controls and identified two novel loci of prostate cancer susceptibility regions. Naturally, multiethnic fine-mapping is the next step to follow up on these results. To address this need, we propose mJAM in chapter 3, which can perform joint analysis on summary statistics across multiple populations in a scalable Bayesian framework. In the next chapter, we present the second applied paper, where we performed fine-mapping on 8q24, a region of special interest for prostate cancer, for a combined 74,449 prostate cancer cases and 54,097 controls of men in European ancestry to prioritize independent signals in the region. We identified 12 SNPs that account for 25% of variance explained for the familial risk of prostate cancer by known genetic variants. With strong statistical evidence, it’s crucial to further validate these variants with functional information. In chapter 5, we propose another fine-mapping method Q1 regularized regression that can incorporate functional information in a quantile regression on the second stage to improve both the prediction accuracy and the model selection performance. We present this method in the framework of genetic studies in the dissertation, but it is a very generic method that can be readily applied to other fields as well.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Hierarchical regularized regression for incorporation of external data in high-dimensional models
PDF
Bayesian hierarchical models in genetic association studies
PDF
Hierarchical approaches for joint analysis of marginal summary statistics
PDF
Incorporating prior knowledge into regularized regression
PDF
Prediction and feature selection with regularized regression in integrative genomics
PDF
Generalized linear discriminant analysis for high-dimensional genomic data with external information
PDF
High-dimensional regression for gene-environment interactions
PDF
Identification and fine-mapping of genetic susceptibility loci for prostate cancer and statistical methodology for multiethnic fine-mapping
PDF
Enhancing model performance of regularization methods by incorporating prior information
PDF
Bayesian model averaging methods for gene-environment interactions and admixture mapping
PDF
Two-step testing approaches for detecting quantitative trait gene-environment interactions in a genome-wide association study
PDF
Leveraging functional datasets of stimulated cells to understand the relationship between environment and diseases
PDF
Latent unknown clustering with integrated data (LUCID)
PDF
Polygenic analyses of complex traits in complex populations
PDF
Bayesian models for a respiratory biomarker with an underlying deterministic model in population research
PDF
A genome wide association study of multiple sclerosis (MS) in Hispanics
PDF
Combination of quantile integral linear model with two-step method to improve the power of genome-wide interaction scans
PDF
High-dimensional feature selection and its applications
PDF
Improving the power of GWAS Z-score imputation by leveraging functional data
PDF
Genome-wide characterization of the regulatory relationships of cell type-specific enhancer-gene links
Asset Metadata
Creator
Wang, Kan
(author)
Core Title
Characterization and discovery of genetic associations: multiethnic fine-mapping and incorporation of functional information
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
05/26/2019
Defense Date
09/14/2018
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
Bayesian hierarchical models,feature selection,fine-mapping,high-dimensional data,OAI-PMH Harvest,regularized regression
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Conti, David (
committee chair
), Batroff, Jay (
committee member
), Gauderman, William (
committee member
), Lewinger, Juan Pablo (
committee member
), Thomas, Duncan (
committee member
)
Creator Email
kanwang@usc.edu,kanwang7scorpio@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-107974
Unique identifier
UC11675339
Identifier
etd-WangKan-6962.pdf (filename),usctheses-c89-107974 (legacy record id)
Legacy Identifier
etd-WangKan-6962.pdf
Dmrecord
107974
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Wang, Kan
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
Bayesian hierarchical models
feature selection
fine-mapping
high-dimensional data
regularized regression