IDENTIFICATION AND FINE-MAPPING OF GENETIC
SUSCEPTIBILITY LOCI FOR PROSTATE CANCER
AND STATISTICAL METHODOLOGY FOR
MULTIETHNIC FINE-MAPPING
by
Ying Han
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY (Biostatistics)
August 2016
Copyright 2016 Ying Han
To my dear parents
Peihua Li and Shengquan Han
and my beloved husband
Dezhi Kang
Acknowledgements
I would like to express my sincere gratitude to my mentors, Dr. Christopher Haiman and
Dr. Daniel Stram, for their tremendous help and unconditional support over the past six
years. They introduced me to the exciting interdisciplinary field of statistical genetics,
provided me rigorous training in both theory and practical application, offered me
precious opportunities to work on many cutting-edge projects, collaborate with top-tier
scientists in this field, write papers and peer reviews, and present my work at
international conferences. Scientific research is not always smooth sailing, but their
insights and knowledge have guided me to navigate through uncertainty. Their dedication
to science and mentoring influences me a lot, and I am extremely grateful to be their
student.
I would like to extend my appreciation to my committee members, Dr. David
Conti, Dr. Fredrick Schumacher, and Dr. Gerhard Coetzee, for their constructive
feedback and constant support along the way. It has been an awesome experience to work
closely with them on many different projects. Their valuable inputs and motivating
discussions are crucial for me to keep improving my work. I have learned so much from
every one of them.
I am very grateful to be part of the multiethnic cohort (MEC) group. My special
thanks go to Dr. Brian Henderson, who founded the MEC study and inspired many
people including me. I would also like to thank Dr. Gary Chen and Alexander Stram for
teaching me the computational tools and infrastructure for processing large genetic
datasets, Grace Sheng and Peggy Wan for their assistance on data preparation and
integration across many collaborative studies, Dr. Kristin Rand and Dr. Dennis Hazelett
for their vital contribution to incorporating sequencing and functional annotation data
into fine-mapping, and Dr. Neal Tambe for his stimulating discussions and selfless
support. The MEC group feels like a big family, and I am sure the wonderful mentorship
and friendship that I have found here will last for a long time.
Last but not least, I would like to thank my family and friends for being so
supportive, especially my parents Peihua Li and Shengquan Han, and my husband Dr.
Dezhi Kang. Six years have gone fast, and their love has always been with me in this
rewarding journey, sharing joyful moments during good times, giving me strength during
tough times, and celebrating my growth along the way. Completing my PhD is just a
beginning, and I am excited about the new adventures to come.
Table of Contents
Dedication
Acknowledgements
Chapter 1 Introduction
1.1 Overview of prostate cancer genetics
1.1.1 Heritability
1.1.2 Linkage analysis
1.1.3 Admixture mapping
1.1.4 Genome-wide association studies (GWAS)
1.1.5 Fine-mapping
1.2 Post-GWAS studies of prostate cancer
Chapter 2 Pleiotropy analysis identifies a novel prostate cancer variant at 6p21:
The PAGE, PRACTICAL, and BPC3 consortia
2.1 Abstract
2.2 Introduction
2.3 Results
2.4 Discussion
2.5 Materials and Methods
Chapter 3 Genome-wide association testing of putative functional exonic variants
and high-density imputed variants with prostate cancer risk in multiethnic
populations
3.1 Introduction
3.2 Genome-wide testing of putative functional exonic variants in relationship with
prostate cancer risk in a multiethnic population
3.2.1 Materials and Methods
3.2.2 Results
3.2.3 Discussion
3.3 A meta-analysis of 87,040 individuals identifies 23 new susceptibility loci for prostate
cancer
Chapter 4 Generalizability of established prostate cancer risk variants in men of
African ancestry
4.1 Abstract
4.2 Introduction
4.3 Materials and Methods
4.4 Results and Discussion
4.5 Acknowledgements
Chapter 5 Prostate cancer susceptibility in men of African ancestry at 8q24
5.1 Abstract
5.2 Results and Discussion
5.3 Funding
5.4 Notes
Chapter 6 Integration of multiethnic fine-mapping and genomic annotation to
prioritize candidate functional SNPs at prostate cancer susceptibility regions
6.1 Abstract
6.2 Introduction
6.3 Results
6.4 Discussion
6.5 Materials and Methods
6.6 Acknowledgements
Chapter 7 Statistical methodologies for multiethnic fine-mapping
7.1 Introduction
7.2 Methods
7.2.1 Quasi-simulation
7.2.2 Genotype simulation
7.2.3 Phenotype simulation and association testing
7.2.4 Meta-analytical approaches
7.2.5 Performance evaluation
7.2.6 Multiethnic study design
7.3 Results
7.3.1 Key factors that influence fine-mapping power
7.3.2 Comparison of different methods for combining summary statistics
7.3.3 Comparison of different study designs using multiethnic populations
7.4 Discussion
7.5 Appendix
7.5.1 NCP calculation for a causal SNP
7.5.2 NCP calculation for non-causal SNPs
7.5.3 MVN distribution of test statistics
7.5.4 Meta-analytic statistical inferences
Chapter 8 Summary and future directions
8.1 Introduction
8.2 Further extension of fine-mapping
8.3 Genetic profiling for personalized medicine
Bibliography
Chapter 1 Introduction
1.1 Overview of prostate cancer genetics
Prostate cancer is the most common non-skin cancer among men in the United States,
with one of every seven men developing prostate cancer during his lifetime. It is also the
second leading cause of cancer death (10%) among U.S. males, exceeded only by lung
cancer (28%) (American Cancer Society, 2014). To date, the most established risk factors
for prostate cancer are age, race/ethnicity and family history. The incidence of prostate
cancer rises rapidly with increasing age, from 1 in 298 for men aged 49 years or younger
to 1 in 9 for men aged 70 years and older (American Cancer Society, 2014). The
age-adjusted incidence and mortality rates are highest among African Americans,
intermediate among European Americans, and lowest among Asians (Brawley,
2012). A family history of a father or brother with prostate cancer at least doubles the risk
of developing this disease, and the risk is dramatically higher for men with several
affected relatives, particularly if diagnosed at young ages (Hemminki, 2012). The varying
incidence across racial/ethnic populations and the familial aggregation of prostate cancer
suggest an important role of inherited genetic variation. From twin studies, it is estimated
that 42%-58% of the disease risk variance may be attributed to genetic factors
(Hjelmborg et al., 2014; Lichtenstein et al., 2000; Page et al., 1997). Therefore,
understanding the genetic basis of prostate cancer is a key question and may
provide insight into the biological mechanisms of this disease. In this section, we
introduce various approaches that have been used to investigate the genetic
predisposition to prostate cancer, including heritability estimation, linkage analysis,
admixture mapping, genome-wide association study, and fine-mapping.
1.1.1 Heritability
Heritability quantifies the contribution of genetics to any trait of interest by measuring
the proportion of phenotypic variation that can be attributed to genetic differences. To
estimate heritability, variance component approaches can be used to partition genetic and
environmental effects that contribute to the phenotypic variance (Gordon, Byth and
Balaam, 1972). Broad-sense heritability reflects the total genetic contributions, whereas
narrow-sense heritability is estimated based on additive genetic effects only (Visscher,
Hill and Wray, 2008). Traditionally, family-based studies are often used to estimate
narrow-sense heritability by measuring the phenotypic resemblance between individuals
with a known genetic relationship, such as parents and offspring, full and half siblings,
and monozygotic and dizygotic twin pairs. Among these designs, the twin study is most
frequently used because of its natural advantage in splitting the total variance into genetic
and environmental components (Neale, Cardon and Division, 1992). From twin studies,
the heritability of prostate cancer is estimated to be 42%-58% (Hjelmborg et al., 2014;
Lichtenstein et al., 2000; Page et al., 1997). However, the heritability estimate in related
individuals is subject to upward bias in the presence of shared environmental factors,
non-additive genetic effects (e.g., dominance and epistasis) and/or gene-environment
interactions; it is also subject to downward bias in the presence of assortative mating
(Neale et al., 1992).
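
To make the variance partition concrete, the following minimal sketch applies Falconer's classical approximation, which recovers the additive genetic, shared-environment, and unique-environment components from twin-pair correlations. It is an illustration only and not necessarily the model fitted in the cited twin studies; the function name and the correlation values are hypothetical.

```python
def falconer_components(r_mz, r_dz):
    """Classical twin decomposition (Falconer's approximation).

    r_mz, r_dz: trait (or liability) correlations in monozygotic and
    dizygotic twin pairs. Assumes purely additive genetic effects, equally
    shared environments for both twin types, and random mating.
    """
    h2 = 2.0 * (r_mz - r_dz)  # additive genetic share (narrow-sense heritability)
    c2 = 2.0 * r_dz - r_mz    # shared-environment share
    e2 = 1.0 - r_mz           # unique environment plus measurement error
    return {"h2": h2, "c2": c2, "e2": e2}


# Hypothetical correlations, chosen only to illustrate the arithmetic.
parts = falconer_components(r_mz=0.60, r_dz=0.35)
print(parts)
```

Because the formula attributes any excess of the monozygotic over the dizygotic correlation to additive genetics, violations of its assumptions (for example, dominance or more similar environments in monozygotic pairs) inflate h2, which is the upward bias described above.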
Thanks to widely conducted genome-wide association studies (GWAS; refer
to section 1.1.4), narrow-sense heritability can also be estimated in unrelated individuals,
which avoids the biases discussed above. One approach is to estimate the genetic relationship
matrix using genotyped variants and then fit a linear mixed model with the observed
quantitative trait or disease liability (Yang et al., 2010). In another approach, the genetic
relationship is estimated from local ancestry variation in an admixed population followed
by variance component analysis (Zaitlen et al., 2014). When these two methods were applied to
the African American Prostate Cancer (AAPC) consortium of 4,207 cases and 4,008
controls, the estimated heritability was 18% and 33%, respectively (Zaitlen et al., 2014),
which indicates that 1) not all of the causal variants are genotyped or properly tagged on
the chip and that 2) the heritability might be overestimated in twin studies (i.e., upward
bias).
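
As an illustration of the first approach, the sketch below builds a genetic relationship matrix (GRM) from genotype counts using the standardized cross-product form described by Yang et al. (2010). It is a simplified sketch under stated assumptions: the same formula is used for diagonal and off-diagonal entries, allele frequencies are estimated from the sample itself, the function name and the toy genotypes are hypothetical, and the subsequent mixed-model fit is omitted.

```python
def genetic_relationship_matrix(genotypes):
    """Build a GRM from genotype counts.

    genotypes[j][i] is the allele count (0, 1 or 2) of individual j at SNP i.
    Entry (j, k) averages, over polymorphic SNPs, the product of the two
    individuals' genotypes after centering by 2p and scaling by
    sqrt(2p(1 - p)), where p is the sample allele frequency.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = [sum(row[i] for row in genotypes) / (2.0 * n) for i in range(m)]
    # Monomorphic SNPs carry no information and would cause division by zero.
    keep = [i for i in range(m) if 0.0 < freqs[i] < 1.0]
    z = [
        [(row[i] - 2.0 * freqs[i]) / (2.0 * freqs[i] * (1.0 - freqs[i])) ** 0.5
         for i in keep]
        for row in genotypes
    ]
    return [
        [sum(a * b for a, b in zip(z[j], z[k])) / len(keep) for k in range(n)]
        for j in range(n)
    ]


# Toy data: 4 individuals by 5 SNPs (the fourth SNP is monomorphic and dropped).
toy_genotypes = [
    [0, 1, 2, 0, 1],
    [1, 1, 0, 0, 2],
    [2, 0, 1, 0, 1],
    [1, 2, 1, 0, 0],
]
grm = genetic_relationship_matrix(toy_genotypes)
```

In practice the matrix would be computed from hundreds of thousands of genotyped SNPs, and the heritability captured by those SNPs would then be estimated by fitting the trait (or disease liability) in a linear mixed model with this matrix as the covariance structure.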
Multiple measures can be used to assess the contribution of previously reported
genetic variants to a disease, including the heritability explained, from a
quantitative genetics perspective, and the proportion of the familial relative risk explained,
from an epidemiological point of view (Witte, Visscher and Wray, 2014). Unlike
Mendelian diseases that are caused by single-gene variation, polygenic diseases including
prostate cancer are influenced by many genetic loci with each locus contributing a
relatively small fraction to the total risk variance. For prostate cancer, 100 genetic risk-
associated loci have been discovered to date and in aggregate they can explain 33% of the
familial risk in European-ancestry populations (Al Olama et al., 2014). This proportion is
expected to increase as more genetic risk loci are uncovered in large collaborative
efforts with adequate statistical power.
1.1.2 Linkage analysis
Linkage analysis is one of the earliest approaches applied for localizing disease
susceptibility genes in the human genome. The basic idea evolved from an exception to
Mendel’s law of independent assortment when two genetic loci are in close proximity on
the same chromosome. During meiosis, recombination may occur between two copies
(i.e., maternal and paternal) of a chromosome at the crossover locations; however, a lack
of recombination may lead to allelic co-segregation. For any two loci on the same
chromosome, the closer they are, the lower the chance that recombination will happen between
them, and thus the alleles of the two loci are more likely to be co-inherited in the next
generation, which is termed “linkage”. Conversely, one can infer the genetic distance
between two loci by looking at their co-inheritance. Based on this principle, linkage
analysis is mainly conducted in high-risk families with multiple generations of affected
and unaffected individuals that are genotyped on an extensive panel of genetic markers
spanning the entire genome. If an underlying genetic locus increases
the disease risk, then under a genetic inheritance (e.g., dominant, recessive, low-penetrant) or
allele-sharing (e.g., identity-by-descent) model, it can be identified and coarsely localized
to a region by searching for “linked” genetic markers. A widely used measure of the
linkage between any genetic marker and the underlying susceptibility gene is the LOD
score (Morton, 1955), with methods extended to more complex genetic models or
pedigrees discussed by Elston et al. (Elston, Satagopan and Sun, 2012). Once a region of
interest is identified, follow-up studies are required to validate whether there truly is a
susceptibility gene in that region and to further localize the gene by testing additional
markers at finer scales within the region.
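
The LOD score can be illustrated with the simplest textbook case. The sketch below assumes phase-known, fully informative meioses, so that the likelihood depends only on the number of recombinants observed between the marker and the disease locus; real pedigree analyses require more general likelihood calculations. The function names and the counts are hypothetical.

```python
import math


def lod_score(recombinants, meioses, theta):
    """LOD score at recombination fraction theta.

    Compares the likelihood of the observed recombinant count under linkage
    (recombination fraction theta) with the likelihood under free
    recombination (theta = 0.5), on a log10 scale.
    """
    non_recombinants = meioses - recombinants
    likelihood_linked = theta ** recombinants * (1.0 - theta) ** non_recombinants
    likelihood_unlinked = 0.5 ** meioses
    if likelihood_linked == 0.0:
        return float("-inf")  # theta = 0 is impossible once a recombinant is seen
    return math.log10(likelihood_linked / likelihood_unlinked)


def max_lod(recombinants, meioses, step=0.01):
    """Grid search over theta in [0, 0.5] for the maximum LOD score."""
    grid = [i * step for i in range(int(round(0.5 / step)) + 1)]
    best_theta = max(grid, key=lambda t: lod_score(recombinants, meioses, t))
    return best_theta, lod_score(recombinants, meioses, best_theta)


# Hypothetical data: 2 recombinants observed in 20 informative meioses.
best_theta, best_lod = max_lod(recombinants=2, meioses=20)
print(best_theta, round(best_lod, 2))
```

A maximum LOD score of 3 or more, meaning odds of at least 1000:1 in favor of linkage, is the conventional threshold for declaring linkage.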
For prostate cancer, genome-wide linkage analysis has not been very
successful in finding susceptibility genes, except for a few examples. Motivated by the
evidence of linkage at 17q21-22 (Cropp et al., 2011; Lange et al., 2003), fine-mapping
and targeted sequencing performed in this particular region led to the identification of
recurrent mutations in the HOXB13 gene (Akbari et al., 2012; Breyer et al., 2012; Lin et
al., 2013) that substantially increase the familial risk, especially of early-onset prostate
cancer (Witte et al., 2013). Although many genomic regions have been implicated in
prostate cancer risk in linkage scans, conflicting evidence from different families or
studies is noted for most of the reported regions (Eeles et al., 2014). Since linkage studies
tend to identify rare (<1%) and highly penetrant (i.e., with large effect size) variation
responsible for the disease, the lack of consistent findings suggests that prostate cancer is
a polygenic disease with each genetic susceptibility locus, which could be common or
rare, conferring a small to moderate risk that requires a more powerful design for
detection. This hypothesis has been further confirmed by subsequent GWAS studies
(refer to section 1.1.4). Moreover, the fact that individuals with sporadic prostate
cancer are not clinically differentiable from those with inherited cancer within
families, as well as the heterogeneity of causal loci between families, may partially explain
the lack of robust findings from linkage studies. It is therefore sensible to shift to
population-based studies, which are more powerful in detecting common yet low-penetrant
variants, in an attempt to better understand the genetic basis of prostate cancer.
1.1.3 Admixture mapping
Admixture mapping, also known as mapping by admixture linkage disequilibrium, is a
population-based approach to localize genetic variation that underlies the disease of
interest. The basic assumption is that if the disease risk is substantially different between
two ancestral populations, then the genetic susceptibility loci will be enriched in the
population with higher incidence. In an admixed population, the genome of each
individual consists of chromosomal segments from both parental populations after
generations of recombination. As discussed in section 1.1.2, genetic loci that are in close
proximity or away from recombination hotspots tend to be co-inherited across generations and
thus form linkage disequilibrium (LD) blocks in the genome. As a consequence, two or
more alleles in LD are observed together in certain combinations more often than expected
by chance. Large LD blocks from the ancestors have been split as recombination events
accumulated across generations; therefore, populations with a longer history (i.e., more
generations), such as African populations, tend to have smaller LD blocks than other ancestral
populations. In contrast, recent admixture between ethnically diverse populations may create
temporary long-range LD, which greatly reduces the
number of tag markers needed to capture variation across the genome. Taking advantage of
this, admixture mapping became an attractive and economical approach in the last decade
to detect disease susceptibility loci by examining the local ancestry differences between
cases and controls (Hoggart et al., 2004; Montana and Pritchard, 2004; Patterson et al.,
2004). This approach is most effective when applied to recent admixture of two
historically separated and genetically isolated populations such as African Americans, an
admixture of European and African populations (Smith et al., 2004; Tian et al., 2006). It
has also been extended to admixed populations of three or more ancestral groups such as
Latinos, who inherit a mixture of European, Native American and African ancestry
(Mao et al., 2007; Price et al., 2007). Compared to linkage analysis (refer to section 1.1.2)
in which recombination happens over a few generations within pedigrees, admixture
mapping uses populations for which admixture has occurred over several hundred years
and generally has better resolution in localizing variants that are involved in disease
susceptibility (Smith and O'Brien, 2005). However, it limits the search to genetic
variation with substantial frequency differences between ancestral populations.
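
The contrast between long-range LD in recently admixed populations and short LD blocks in older populations follows from the standard population-genetic result that, under random mating, the LD coefficient between two loci decays by a factor of (1 - theta) per generation, where theta is their recombination fraction. The sketch below illustrates this decay; the formula is a textbook idealization rather than a result from this work, and the function name and parameter values are hypothetical.

```python
def ld_after_generations(d0, theta, generations):
    """Expected LD coefficient D after a number of generations of random mating.

    d0: initial LD coefficient (e.g., created at the time of admixture)
    theta: recombination fraction between the two loci
    """
    return d0 * (1.0 - theta) ** generations


# Two loci with a 1% recombination fraction, starting from an arbitrary D of 0.2.
d0 = 0.2
recent = ld_after_generations(d0, theta=0.01, generations=10)     # recent admixture
ancient = ld_after_generations(d0, theta=0.01, generations=1000)  # old population
print(recent / d0, ancient / d0)
```

After 10 generations most of the initial LD between such loci remains, which is why a sparse panel of ancestry-informative markers suffices for admixture mapping, whereas after 1,000 generations it has essentially vanished.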
For prostate cancer, admixture mapping has been applied in African Americans
because men of African ancestry are at higher risk of prostate cancer than men of
European ancestry. On a genetic basis, regions enriched in African ancestry might be
responsible for the risk differences between groups and therefore are likely to be involved
in the disease pathogenesis. The high-risk region for prostate cancer, 8q24, was initially
identified through admixture mapping in African Americans (Freedman et al., 2006), and
was further replicated in an independent study in which 5q35 and 7q31 were also identified
(Bock et al., 2009). These findings highlighted a few regions for future investigation.
1.1.4 Genome-wide association studies (GWAS)
Genome-wide association studies (GWAS) are population-based studies designed to
identify common (>1%) and low-penetrant genetic variants associated with the trait of
interest. In contrast to linkage analysis (refer to section 1.1.2), which is best suited to
study Mendelian traits (i.e., single-gene disorders), GWAS is a powerful tool to discover
common risk variants for polygenic traits. The variants examined in GWAS are typically
single nucleotide polymorphisms (SNPs) genotyped on commercial or customized arrays.
As one type of genetic variation, a SNP is a single-nucleotide DNA change that commonly occurs
in the population, and SNPs account for most of the differences in DNA sequence between
any two individuals. A SNP usually has two alleles drawn from A, T, C, and G. The alleles of SNPs
in close proximity tend to be inherited together, which gives rise to LD (refer to section 1.1.3)
in the population. Taking advantage of existing LD, one can roughly screen the
entire genome for association signals without genotyping every single SNP; this process
is usually called a ‘GWAS scan’. Due to the large number of SNPs tested in one scan, a
more stringent significance threshold than p<0.05 is needed to control the overall false
positive rate. Traditionally, a genome-wide significance level of p<5×10⁻⁸ has been
applied in GWAS studies, which corresponds to the Bonferroni correction for ~1 million
independent markers being tested. For dichotomous traits such as disease status, the
association of any SNP with the disease risk is essentially reflected by its allele frequency
(AF) differences between cases and controls. Given that the AF may vary across ancestral
populations, the association is subject to confounding by the natural AF differences
between cases and controls in the sample population due to their diverse ancestral
backgrounds. Therefore, to reduce spurious associations, analyses need to adjust for global
ancestry estimated from principal component analysis (Price et al., 2006). In addition,
promising signals revealed in the original GWAS need to be further validated in
replication studies.
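
The significance threshold and the allele-frequency comparison described above can be sketched as follows. The example computes the Bonferroni threshold for one million tests and an unadjusted allelic odds ratio with a 1-df Pearson chi-square test from allele counts. It is a simplified illustration: it omits the ancestry adjustment discussed above, and the function names and allele counts are hypothetical.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Allelic odds ratio and 1-df Pearson chi-square test from a 2x2 allele table."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    # Shortcut form of the Pearson chi-square statistic for a 2x2 table.
    numerator = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    denominator = (
        (case_alt + case_ref) * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt) * (case_ref + ctrl_ref)
    )
    chi2 = numerator / denominator
    # Upper tail of a chi-square with 1 df equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


threshold = bonferroni_threshold(alpha=0.05, n_tests=1_000_000)

# Hypothetical allele counts: 500 cases and 500 controls (1,000 alleles each).
odds_ratio, chi2, p_value = allelic_association(
    case_alt=300, case_ref=700, ctrl_alt=200, ctrl_ref=800
)
print(threshold, odds_ratio, chi2, p_value, p_value < threshold)
```

In this toy table the risk allele is clearly more frequent in cases (30% versus 20%), yet the p-value does not reach the genome-wide threshold, which illustrates why GWAS of modest effects require large samples.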
GWAS studies have been fruitful in finding genetic susceptibility loci for prostate
cancer, with 100 loci identified to date which can explain ~1/3 of the familial risk in
European populations (Al Olama et al., 2014; Eeles et al., 2014). The largest effect sizes
[per-allele odds ratio (OR)] were observed for SNPs at 8q24 (OR>2.0), while the vast
majority of the known risk variants only have modest effects (OR<1.5), indicating the
polygenic nature of prostate cancer for which a Mendelian inheritance model may not be
suitable. Interestingly, some of the established risk regions for prostate cancer, such as
8q24 and 5p15 (TERT), have also been found to contain risk variants for other cancers,
suggesting shared biological mechanisms among different types of cancer. This
observation motivated the search for additional prostate cancer risk loci among
SNPs previously associated with other cancers, which will be fully discussed in Chapter
2. Despite the considerable success of GWAS, one limitation of this strategy is
that it requires very large sample sizes to detect rare risk variants (<1%) or common
variants with very small effects. To examine rare variation, targeted or genome-wide
sequencing in a large number of cases and controls needs to be performed, although a
whole-exome sequencing study in ~2,000 cases and ~2,000 controls of African ancestry
did not reveal any novel signals (Rand et al., 2016). To overcome the “small effect”
issue, large-scale collaborative efforts have been made by either pooling the original data
or combining the summary statistics from multiple studies, such as the AAPC (Haiman et
al., 2011b), PRACTICAL (Eeles et al., 2009; Eeles et al., 2013), and GAME-ON/ELLIPSE
(Al Olama et al., 2014) consortia. Although the existence of LD greatly
reduces the number of SNPs that need to be genotyped and tested in GWAS, it also
imposes a limitation: the reported SNPs are likely to be merely proxies for the
underlying functional alleles. Even with adequate power (>80%), the limited replication of
previously reported risk associations in other racial/ethnic populations (Cheng et al.,
2012; Haiman et al., 2011a; Han et al., 2014) further suggests that the biologically relevant
variants (referred to as ‘causal’ variants) still remain to be identified. Therefore, fine-
mapping of the known risk loci in search of the underlying causal variation is of great
importance to understand the genetic predisposition to prostate cancer.
1.1.5 Fine-mapping
Benefiting from recent advances in sequencing, genotyping, and imputation of high-density
genetic variants in targeted regions, fine-mapping has emerged as a powerful
approach to refine the index signals reported in previous GWAS, in an attempt to identify
the underlying causal variants and nearby secondary signals. Assuming the causal variant
is located within a certain distance (e.g., 500 kb) from the index SNP, one can examine
all variants in the region that are correlated with the index SNP to some degree in the
population in which the original GWAS discovery was made. Variants that might better
capture the signal than the index SNP (referred to as ‘better markers’) can be defined as
those with larger effect size and stronger statistical significance in association with the
trait of interest. Sometimes there could be hundreds of SNPs in the same LD block
spanning over a large region with similar strength of associations, which makes it
extremely challenging to localize the causal variant or prioritize a subset of causal
candidates. It is believed that fine-mapping in African populations, which have
smaller LD blocks than other populations (refer to section 1.1.3), can narrow the
location of the underlying causal variant to a relatively small region with a reduced set of candidates.
Likewise, the fine-mapping resolution may also improve by leveraging the different LD
structures in ethnically diverse populations. The rationale and methods for multiethnic
fine-mapping are discussed thoroughly in Chapter 7. When fine-mapping is conducted in
admixed populations, where extended LD blocks exist, adjusting for local ancestry may
help localize the causal variants (Zhang and Stram, 2014). In addition to statistical
mapping, functional annotation can be integrated to further prioritize candidate variants
for future biological testing (Han et al., 2015; Kichaev et al., 2014). Beyond refining
the index signals discovered in previous GWAS, fine-mapping of known regions may
also reveal novel secondary signals. A secondary signal can be defined by variants that
are minimally correlated with the index SNP and remain significantly associated with the
trait of interest after conditioning on the index signal. Follow-up studies are required to
validate the secondary signals uncovered through fine-mapping.
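
The role of correlation with the index SNP in the definitions above can be sketched as follows. The example computes the squared Pearson correlation (r²) between genotype vectors and partitions regional variants into those correlated with the index SNP, which are candidates for better markers, and those only weakly correlated with it, which are candidates for secondary signals pending conditional analysis. The r² threshold of 0.2, the function names, the variant labels, and the toy genotypes are all hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype (allele count) vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic variant is uncorrelated with everything
    return sxy * sxy / (sxx * syy)


def split_by_ld(index_genotypes, region_genotypes, threshold=0.2):
    """Partition regional variants by their r^2 with the index SNP."""
    correlated, weakly_correlated = {}, {}
    for name, genotypes in region_genotypes.items():
        r2 = r_squared(index_genotypes, genotypes)
        (correlated if r2 >= threshold else weakly_correlated)[name] = r2
    return correlated, weakly_correlated


# Toy genotypes for 8 individuals.
index_snp = [0, 1, 2, 1, 0, 2, 1, 0]
region = {
    "snp_a": [0, 1, 2, 1, 0, 2, 1, 0],  # identical to the index SNP
    "snp_b": [2, 1, 0, 1, 2, 0, 1, 2],  # same SNP coded on the other allele
    "snp_c": [1, 0, 1, 2, 1, 0, 0, 1],  # weakly correlated with the index SNP
}
correlated, weakly_correlated = split_by_ld(index_snp, region)
print(sorted(correlated), sorted(weakly_correlated))
```

Whether a weakly correlated variant constitutes a secondary signal still depends on its association remaining significant after conditioning on the index SNP, which this sketch does not perform.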
For prostate cancer, previous fine-mapping efforts include region-specific studies
of 5p15/TERT (Kote-Jarai et al., 2013), 8p21/NKX3.1 (Akamatsu et al., 2010),
10q11/MSMB (Lou et al., 2009), 11q13 (Chung et al., 2012; Chung et al., 2011a; Zheng
et al., 2009), 17q12/HNF1B (Berndt et al., 2011), and 19q13/KLK3 (Kote-Jarai et al.,
2011a; Parikh et al., 2011) in European or Asian ancestry populations, and
comprehensive examinations of the known risk regions in men of African ancestry
(Haiman et al., 2011a; Han et al., 2014). These statistical findings, together with
functional characterization, highlighted a few SNPs that have strong implications in the
biological mechanisms of prostate cancer, such as rs10993994 at 10q11.2/MSMB (Lou et
al., 2009) and rs339331 at 6q22.1/RFX6 (Huang et al., 2014; Spisak et al., 2015). The
8q24 region harbors multiple independent signals in a complex genetic architecture (Al
Olama et al., 2009; Haiman et al., 2011a; Haiman et al., 2007), including an African-
specific risk variant recently revealed through fine-mapping (Han et al., 2016b) as
discussed in Chapter 5. In a large multiethnic sample, we also integrated statistical fine-
mapping, genomic annotation, and expression quantitative trait loci (eQTL) analysis to
characterize most of the risk regions reported to date. The results and comparison with a
concurrent fine-mapping study in a European sample are discussed in Chapter 6.
1.2 Post-GWAS studies of prostate cancer
The following chapters focus on the genetic studies of prostate cancer that we have
conducted in the post-GWAS era. Motivated by the genetic loci associated with multiple
cancers (e.g., 8q24 and 5p15/TERT), we performed an exploratory pleiotropy analysis on
cancer SNPs reported in previous GWAS and identified a novel pleiotropic risk variant
(rs6457327) for prostate cancer in the HLA region (Chapter 2). To search for putative
functional variants located in exonic regions, we conducted an exome-wide association
study in multiethnic populations and found a non-synonymous SNP (rs2274911) in the
gene GPRC6A, which is a plausible causal variant for the risk association at this
locus (Chapter 3). In 2013, the GAME-ON/ELLIPSE consortium was assembled to
combine a large number of GWAS studies with each study imputed to the 1000 Genomes
Project. The primary goal was to detect common, low-penetrance risk variants that were
missed in previous GWAS due to limited sample size and/or low-density genotyping
platforms. We performed genome-wide scans for most of the participating studies and
then conducted meta-analyses to combine all the studies (Chapter 3). This large
collaborative effort identified 23 novel loci, which brings the number of prostate cancer
risk loci to 100, and in total they can explain ~1/3 of the familial risk in European
populations. Since the vast majority of the risk variants were identified in GWAS of
European or Asian ancestry, we evaluated the generalizability of the known risk loci in
men of African ancestry and optimized the set of alleles for risk modeling in this
population (Chapter 4). Approximately 50% of the index SNPs did not replicate (at
p<0.05) in men of African ancestry despite sufficient power (>80%), suggesting that they
are not adequately correlated with the underlying causal variants in this population and
therefore fine-mapping is needed. Using the same sample, we conducted fine-mapping at
8q24 and revealed a novel risk allele (rs111906932, A, 3% frequency) that is only found
in African populations (Chapter 5). In a large multiethnic sample from the GAME-
ON/ELLIPSE consortium, we performed fine-mapping of 67 known risk regions and
identified better markers in 30 regions as well as novel secondary signals in two regions
(Chapter 6). Utilizing samples from multiple racial/ethnic populations presents both
advantages and challenges. In Chapter 7, we developed a novel method that is tailored for
combining summary statistics from multiethnic studies. Through simulation studies, we
compared the novel method with existing methods by evaluating their performances in
signal discovery and fine-mapping.
Chapter 2 Pleiotropy analysis identifies a novel prostate cancer
variant at 6p21: The PAGE, PRACTICAL, and BPC3
consortia
This manuscript is in preparation for submission.
Ying Han^1, Daniele Campa^2, Logan Dumitrescu^3,4, Shelly-Ann Love^5, Steven Buyske^6,
Lucia A. Hindorff^7, Konstantinos K. Tsilidis^8,9, William S. Bush^4,10, Anne M. Butler^5,
Federico Canzian^11, Iona Cheng^12, Lynne R. Wilkens^13, Jay H. Fowke^14, Dana C.
Crawford^3,4, Jose Luis Ambite^15, Nora Franceschini^5, Loic Le Marchand^13, S. Lani Park^1,
Christopher A. Haiman^1, The PRACTICAL consortium^16, The BPC3 consortium^16,
Fredrick R. Schumacher^1
1 Department of Preventive Medicine, Keck School of Medicine, University of Southern
California, Los Angeles, California, USA
2 Division of Cancer Epidemiology, German Cancer Research Center (DKFZ),
Heidelberg, 69120, Germany
3 Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville,
Tennessee, USA
4 Center for Human Genetics Research, Vanderbilt University, Nashville, Tennessee,
USA
5 Department of Epidemiology, University of North Carolina, Chapel Hill, North
Carolina, USA
6 Department of Statistics and Biostatistics, Rutgers University, Piscataway, New Jersey,
USA
7 Division of Genomic Medicine, National Human Genome Research Institute, National
Institutes of Health, Bethesda, Maryland, USA
8 Department of Hygiene and Epidemiology, University of Ioannina Medical School,
Ioannina, Greece
9 Cancer Epidemiology Unit, University of Oxford, Oxford, UK
10 Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee,
USA
11 Genomic Epidemiology Group, German Cancer Research Center (DKFZ), Heidelberg,
69120, Germany
12 Cancer Prevention Institute of California, Fremont, California, USA
13 Epidemiology Program, University of Hawaii Cancer Center, University of Hawaii,
Honolulu, Hawaii, USA
14 Vanderbilt Epidemiology Center, Vanderbilt University, Nashville, Tennessee, USA
15 Information Sciences Institute, University of Southern California, Marina del Rey,
California, USA
16 A full list of members is provided in the Supplementary Note
Corresponding author:
Fredrick R. Schumacher
Harlyne Norris Research Tower
1450 Biggy Street, Room 2504
Los Angeles, CA 90033
Telephone: (323) 442-7882
Fax: (323) 442-7925
E-mail: fschumac@usc.edu
2.1 Abstract
Background: Genome-wide association studies have identified hundreds of
susceptibility loci for various cancers including prostate cancer. Several of the risk loci
have demonstrated pleiotropic effects with more than one type of cancer, suggesting
shared biological pathways in carcinogenesis. In the present study, we systematically
investigated the association between prostate cancer and genetic variants previously
associated with other cancers.
Methods: We examined prostate cancer against 196 genetic risk variants previously
associated with at least one of 18 cancers or cancer-related traits. A multiethnic sample of
28,135 prostate cancer cases and 37,218 controls was assembled from three consortia:
Population Architecture using Genetics and Epidemiology (PAGE), Prostate Cancer
Association Group to Investigate Cancer Associated Alterations in the Genome
(PRACTICAL) and Breast and Prostate Cancer Cohort Consortium (BPC3). Study- and
ethnic-specific results from logistic regression were combined through fixed-effect meta-analysis
for each variant, applying a Bonferroni-corrected significance threshold of 3.6×10^-4.
Results: Three variants, rs6010620 at 20q13 (RTEL), rs2853676 at 5p15 (TERT) and
rs6457327 at 6p21 (HLA-C), demonstrated statistically significant associations with
prostate cancer risk after correcting for multiple comparisons. Originally reported as a
follicular lymphoma risk allele, the C allele of rs6457327 was associated with decreased
prostate cancer risk at the genome-wide significance level (per-allele OR=0.93, 95% CI,
0.91-0.96; p=4.5×10^-8).
Conclusions: This systematic pleiotropy analysis identified a novel susceptibility locus
in the HLA region for prostate cancer in a large multiethnic population, suggesting shared
etiologic mechanisms between prostate cancer and follicular lymphoma.
2.2 Introduction
For men, prostate cancer is the most commonly diagnosed cancer excluding non-
melanoma skin cancer. To date, approximately 30% of the genetic heritability of prostate
cancer in men of European descent has been explained by known susceptibility loci
(Eeles et al., 2013); therefore, many more genetic risk variants remain to be discovered.
The principle of pleiotropy provides the opportunity to increase the likelihood of
identifying a prostate cancer risk variant. Pleiotropy occurs when a single genetic locus
influences the risk of multiple phenotypes. For cancer, pleiotropic associations (Sakoda,
Jorgenson and Witte, 2013) suggest that various cancers may share common molecular
mechanisms. For example, the 8q24 region contains multiple susceptibility loci for
prostate, breast, colorectal, bladder, ovarian, and other cancers (Al Olama et al., 2009;
Cheng et al., 2013; Crowther-Swanepoel et al., 2010; Easton et al., 2007; Goode et al.,
2010; Haiman et al., 2007; Rothman et al., 2010; Schumacher et al., 2007; Shete et al.,
2009; Yeager et al., 2009; Yeager et al., 2007). This region, merely considered as a “gene
desert” in the past, has now been revealed to contain important regulatory enhancers for
the oncogene c-Myc (Ahmadiyeh et al., 2010; Jia et al., 2009; Sotelo et al., 2010).
Similarly, genetic variants at 5p15, which contains the telomerase reverse transcriptase
(TERT) gene, have been associated with many cancers including prostate cancer (Haiman
et al., 2011c; Kote-Jarai et al., 2013; Landi et al., 2009; Peters et al., 2012; Petersen et al.,
2010; Rafnar et al., 2009; Shete et al., 2009; Turnbull et al., 2010), implying the possible
role of telomere length control in carcinogenesis (Willeit et al., 2010). Despite these
striking examples, potential pleiotropic effects of cancer risk variants identified from
genome-wide association studies (GWAS) have yet to be systematically explored for
prostate cancer.
In the present study, we selected 196 cancer risk variants previously identified
from GWAS and fine-mapping studies to investigate their associations with prostate
cancer. These 196 genetic markers also included variants previously associated with a
strong risk factor for cancer, such as nicotine dependence. The analyses used a
multiethnic sample from 30 studies combined through the PAGE, PRACTICAL and
BPC3 consortia. Our results may provide insight into etiologic mechanisms shared
between prostate cancer and other cancers.
2.3 Results
We investigated 196 single-nucleotide polymorphisms (SNPs) in a multiethnic
population of 28,135 prostate cancer cases and 37,218 controls from the PAGE,
PRACTICAL and BPC3 consortia (Table 1). Each consortium consisted of multiple
studies with differing levels of diversity in five racial/ethnic groups (Supplementary
Table S1). In total, the vast majority of participants were of European descent (83.5%),
with the remainder comprising 8.0% African, 3.9% Latino, 3.9% Asian and 0.7% Native
Hawaiian. The main characteristics for each study/consortium are presented in
Supplementary Table S2. The 196 SNPs examined in this study encompassed established
risk variants for 18 cancers or cancer-related traits (including nicotine
dependence, a lung cancer-related trait) from previous GWAS (Fig. 1; Supplementary Table S3). Of
the 196 risk variants, 59 were either originally identified for prostate cancer (n=53) or
located at 8q24 (n=6), a known high-risk region for prostate cancer (Al Olama et al.,
2009; Freedman et al., 2006; Haiman et al., 2007; Liu et al., 2012); the remaining 137
SNPs were considered as non-prostate cancer risk variants.
2.3.1 Generalizability of established prostate cancer risk variants across
racial/ethnic groups
In the multiethnic meta-analysis, all 59 prostate cancer SNPs were associated with
increased prostate cancer risk (per-allele odds ratio (OR)>1.0), with 88% (n=52) reaching
nominal statistical significance (p<0.05; Supplementary Table S4). Of the 52 SNPs that
were available across all five racial/ethnic populations, 24 (46%) SNPs had consistent
directions of effects and four SNPs at 8q24 (rs13254738, rs4242382, rs10090154,
rs7837688) were nominally statistically significant (p<0.05) in all populations
(Supplementary Table S4), providing further support that 8q24 generalizes across
racial/ethnic groups as a prostate cancer risk locus.
2.3.2 Pleiotropic effects of non-prostate cancer risk variants
Of the 137 risk variants previously associated with other cancers, 74 (54%) were
associated with prostate cancer at p<0.05 in the overall meta-analysis (Supplementary
Table S5). The three variants that remained significant after Bonferroni correction
(p<3.6×10^-4) were rs6010620 (OR=1.10, p=6.9×10^-11), rs2853676 (OR=0.92, p=3.0×10^-9),
and rs6457327 (OR=0.93, p=4.5×10^-8; Fig. 2, Table 2), all of which also reached
the genome-wide significance level (p<5.0×10^-8).
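The two thresholds used above can be checked with simple arithmetic. The following is a minimal sketch using only the numbers quoted in the text; the variable names are illustrative and not part of the original analysis.

```python
# Bonferroni threshold for the 137 non-prostate cancer SNPs (numbers taken from the text).
n_tests = 137
alpha = 0.05
threshold = alpha / n_tests  # about 3.6e-4

# Reported multiethnic meta-analysis p-values for the three pleiotropic variants.
reported_p = {
    "rs6010620": 6.9e-11,
    "rs2853676": 3.0e-9,
    "rs6457327": 4.5e-8,
}
genome_wide = 5.0e-8  # conventional genome-wide significance level

# Compare each reported p-value against both thresholds.
passes_bonferroni = {snp: p < threshold for snp, p in reported_p.items()}
passes_genome_wide = {snp: p < genome_wide for snp, p in reported_p.items()}

print(f"Bonferroni threshold: {threshold:.2e}")
print(passes_bonferroni)
print(passes_genome_wide)
```

All three variants fall below both thresholds, although rs6457327 is only marginally below the genome-wide level.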
The most significant SNP rs6010620 (intronic to RTEL gene), originally reported
as a glioma risk variant, was correlated (r^2>0.5 in EUR, 1000 Genomes Project) with a
known prostate cancer SNP rs6062509 (Eeles et al., 2013). In a conditional analysis using
the largest available data (the PRACTICAL consortium), rs6010620 was no longer
significant (p=0.32) when adjusted for rs6062509. The second most significant SNP
rs2853676 (intronic to TERT gene), previously reported for glioma, has also been
confirmed to be associated with prostate cancer in a fine-mapping study (Kote-Jarai et al.,
2013). Notably, the third SNP rs6457327 remained significant (p=xx) when conditioning
on a nearby prostate cancer risk variant rs130067 (Kote-Jarai et al., 2011b) (44kb away;
r^2=0.11 in EUR, 1000 Genomes Project), suggesting that the effect of rs6457327 was
independent of rs130067.
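The r^2 values quoted above measure pairwise linkage disequilibrium. As a minimal sketch of how this quantity is defined from haplotype and allele frequencies, the function below computes r^2 = D^2 / [pA(1-pA)pB(1-pB)], where D = pAB - pA·pB. The function name and the input frequencies are hypothetical and are not taken from the 1000 Genomes Project data.

```python
def ld_r2(p_ab: float, p_a: float, p_b: float) -> float:
    """Squared correlation r^2 between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2
    p_a, p_b: marginal frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Hypothetical frequencies, for illustration only.
r2_independent = ld_r2(p_ab=0.12, p_a=0.40, p_b=0.30)  # haplotype frequency equals pA*pB, so D = 0
r2_perfect = ld_r2(p_ab=0.30, p_a=0.30, p_b=0.30)      # the two alleles always occur together

print(r2_independent, r2_perfect)
```

A low r^2 such as 0.11 means that one SNP is a poor proxy for the other, which is why the conditional analysis is informative about independence.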
Here we identified rs6457327 at 6p21 (the HLA region) as a novel pleiotropic risk
variant for prostate cancer. Previously known as a follicular lymphoma risk allele (C), it
was significantly associated with reduced prostate cancer risk in European populations
(OR=0.93, p=2.5×10^-8) and in the multiethnic meta-analysis (OR=0.93, p=4.5×10^-8,
p_het=0.93; Table 2). When stratified by race/ethnicity and by consortium, we observed
consistent protective effects in men of European (OR=0.93, 95% CI, 0.90-0.95), African
(OR=0.92, 95% CI, 0.84-1.02) and Latino ancestry (OR=0.97, 95% CI, 0.86-1.11; Fig. 3,
Table 2). This allele is common across populations, with frequencies ranging from 0.55
in Native Hawaiians to 0.71 in men of African ancestry. In summary, this
follicular lymphoma risk variant (rs6457327) demonstrated pleiotropic effects on prostate
cancer risk in multiethnic populations, and interestingly, it also represents a novel genetic
risk locus for prostate cancer.
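As a rough consistency check, a Wald p-value can be recovered from a reported odds ratio and its 95% confidence interval. The sketch below applies this to the overall estimate for rs6457327 (OR=0.93, 95% CI, 0.91-0.96). The helper name is hypothetical, and because the inputs are rounded to two decimals, the recovered p-value is only expected to agree with the reported value in order of magnitude.

```python
import math

def wald_p_from_or_ci(odds_ratio: float, ci_low: float, ci_high: float) -> tuple[float, float, float]:
    """Recover the log-OR standard error, z statistic, and two-sided p from an OR and its 95% CI."""
    beta = math.log(odds_ratio)
    # The 95% CI spans 2 * 1.959964 standard errors on the log scale.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.959964)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard normal tail probability
    return se, z, p

se, z, p = wald_p_from_or_ci(0.93, 0.91, 0.96)
print(f"SE={se:.4f}, z={z:.2f}, p={p:.1e}")
```

The recovered p-value falls in the range of 10^-7 to 10^-8, which is compatible with the reported p=4.5×10^-8 given the rounding of the inputs.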
2.3.3 Cumulative effect of non-prostate cancer risk variants
We further evaluated the cumulative effect of non-prostate cancer risk variants through a
genetic risk score. Due to missing data across studies and linkage disequilibrium (LD)
among SNPs, 20 unlinked variants were included in this analysis (see Methods), which
did not include the three pleiotropic variants. The multi-SNP risk score was associated
with increased prostate cancer risk at p<0.05 in men of European ancestry (per allele
OR=1.01, p=0.0046) and in the multiethnic meta-analysis (per allele OR=1.01, p=0.021).
2.4 Discussion
In a large multiethnic sample of 28,135 prostate cancer cases and 37,218 controls from
the PAGE, PRACTICAL and BPC3 consortia, we examined 196 cancer-related SNPs
that were identified from previous studies. To date, this is the first systematic
investigation of pleiotropy between prostate cancer and established genetic risk variants
for other cancers. At p<0.05, we replicated the associations with 52 of the 59 prostate
cancer SNPs. At a genome-wide significance level (p<5.0×10^-8), we identified three
pleiotropic variants for prostate cancer. Of these, rs6010620 (RTEL) and rs2853676
(TERT) have been confirmed by two separate investigations in the PRACTICAL
consortium (Eeles et al., 2013; Kote-Jarai et al., 2013).
Here we report rs6457327 at 6p21, a follicular lymphoma risk variant (Conde et
al., 2010; Skibola et al., 2009), as a novel susceptibility marker for prostate cancer. It is
located 5kb downstream of the 3’UTR of the C6orf15 gene (encoding protein STG) and
telomeric to the human leukocyte antigen (HLA)-C locus on chromosome 6. Although
rs6457327 per se could be involved in miRNA recognition or in the regulation of a
nearby gene, we cannot rule out the possibility that it is tagging some other underlying
biologically functional variants. To the best of our knowledge, the closest and most
correlated (r^2=0.59) putatively functional variant is rs1265054, a non-synonymous SNP
located in an exonic splicing enhancer motif in the C6orf15 (STG) gene. Studies on the
molecular biology of the STG protein are needed to evaluate its potential role in
carcinogenesis. In addition, rs6457327 is also in LD with HLA-C polymorphisms (Nair et
al., 2006) that determine the structure and function of class-I major histocompatibility
complex (MHC) receptors. MHC class I molecules are expressed in nearly all cells, and
present antigenic peptides on the cell surface to attract killer T-cells, initiating the immune
response (Neefjes et al., 2011), which indicates that rs6457327 may provide a link
between inflammation and cancer pathogenesis. In a whole-exome sequencing study
(Fitzgerald et al., 2013), two rare germline missense variants in an MHC class II
associated gene (BTNL2) were found to be associated with elevated prostate cancer risk,
which further suggests the role of autoimmune and inflammatory conditions in prostate
cancer etiology.
In previous GWAS, the C allele of rs6457327 was associated with
increased risk of follicular lymphoma (OR=1.69; p=4.7×10^-11) in participants of
European descent (Conde et al., 2010; Skibola et al., 2009). Interestingly, this allele was
protective against prostate cancer at a genome-wide significance level of p<5.0×10^-8 in
men of European ancestry (OR=0.93, p=2.5×10^-8) and in the multiethnic meta-analysis
(OR=0.93, p=4.5×10^-8). In other words, the same allele (rs6457327, C) demonstrated
opposite effects on two cancer types: it appears to increase the risk of follicular
lymphoma and decrease the risk of prostate cancer. The “opposite-effect” phenomenon
has also been observed for a few other HLA alleles in autoimmune diseases (Sirota et al.,
2009). Intuitively, several hypotheses fit the observation for rs6457327: 1) the same allele
is tagging two distinct causal variants that have opposite roles in cancer etiology; 2) it is
tagging a single underlying causal variant that has different effects in the prostate gland
and lymph nodes; and 3) it influences a biologically relevant component with multiple
functions during tumorigenesis. The evidence that the MHC molecules may be involved
in either the promotion or inhibition of tumor growth (Leek et al., 1996) provides support
for the last hypothesis. Further studies are required to fully dissect this genetic association
at the HLA-C locus with prostate cancer and follicular lymphoma.
We observed statistically significant associations with the three pleiotropic SNPs
and 52 established prostate cancer SNPs in the multiethnic meta-analysis. When stratified
by race/ethnicity, the vast majority of these SNPs were significant in men of European
ancestry but not in all the populations. One possibility is that the allele frequency of the
SNP being examined and/or the LD structure between the tested SNP and the underlying
causal variant varies across populations. Comprehensive fine-mapping of these loci in a
large multiethnic sample is underway, which may help to better understand and localize
the true signal. More likely, some ethnic groups are underpowered due to limited sample
sizes; for example, in this study we have only 462 Native Hawaiian participants. Larger
sample sizes for non-European populations are required to further evaluate the ethnic-
specific associations with these risk variants.
The existence of genetic pleiotropy between prostate cancer and other cancers
implies shared genetic mechanisms and provides insight for clinical screening. As
suggested in this study, positive screening results for certain prostate cancer genetic
markers may have implications for the risk of glioma and lymphoma. In the risk profile
for cancers, variants at pleiotropic loci including 8q24, RTEL, TERT and HLA-C should
be prioritized in genetic testing. The emerging field of high-throughput targeted
sequencing will generate sufficient data to directly identify the underlying biologically
relevant variants in these regions. Moreover, investigating etiologic mechanisms that are
shared among different cancer types will greatly expand our understanding of cancer
biology as well as present both opportunities and challenges for drug development.
These clinical implications highlight the importance of identifying and characterizing
pleiotropic loci for cancers and other human complex diseases.
Strengths of this study include the large sample size from the collaboration of
well-designed prostate cancer studies, the racial/ethnic diversity of the participants, and
the systematic exploration of SNPs that had shown strong associations with any cancer in
previous studies. As described above, limitations of this study include the incomplete
database of cancer-associated SNPs by the start of this investigation (January 2010), the
reduced power for non-European populations, as well as the available genotype data from
only a subset of the studies to check the independence between the newly identified
variant (rs6457327) and a previously reported variant (rs130067) nearby. However,
public data from the 1000 Genomes Project further confirmed that they are
minimally correlated.
In summary, we discovered that a follicular lymphoma risk allele at 6p21 (HLA-C)
was protective against prostate cancer with a combined allelic p-value of 4.5×10^-8. It is
the first identified genetic variant that has pleiotropic effects on follicular lymphoma and
prostate cancer. Moreover, it also represents a novel prostate cancer susceptibility locus.
As numerous cancer SNPs have been identified since the beginning of this study, future
investigations using a more comprehensive profile of cancer-related genetic variants may
reveal additional pleiotropic loci for prostate cancer.
2.5 Materials and Methods
2.5.1 Study Populations
Three consortia contributed data to this meta-analysis study: the Population Architecture
using Genetics and Epidemiology (PAGE) (https://www.pagestudy.org/) (Matise et al.,
2011), Prostate Cancer Association Group to Investigate Cancer Associated Alterations
in the Genome (PRACTICAL) (http://ccge.medschl.cam.ac.uk/consortia/practical/)
(Kote-Jarai et al., 2008) and Breast and Prostate Cancer Cohort Consortium (BPC3)
(http://epi.grants.cancer.gov/BPC3/) (Hunter et al., 2005). This collaboration comprised
28,135 prostate cancer cases and 37,218 controls from 30 studies (Supplementary Table
S1). Details regarding these participating studies are described in the Supplementary
Note. Demographic, genetic and epidemiologic information obtained by each study are
summarized in Supplementary Table S2. The majority of studies utilized incident prostate
cancer cases and controls with no history of cancer. The five BPC3 studies were based on
advanced prostate cancer cases and controls without a diagnosis of prostate cancer.
Institutional review board approval was obtained for all studies.
2.5.2 SNP Selection and Genotyping
A total of 196 SNPs previously associated with 18 cancers and cancer-related traits (i.e.,
nicotine dependence for lung cancer) were selected from the National Human Genome
Research Institute GWAS catalog (http://www.genome.gov/26525384) (Hindorff et al.,
2009) as well as review of the cancer GWAS and fine-mapping literature (Matise et al.,
2011) as of January 2010 (Supplementary Table S3). In the ARIC study (part of the
PAGE consortium), proxies for 4 SNPs were used in genotyping and 122 SNPs were
imputed with a median quality score of 0.99 (ranging from 0.61 to 1.00). Details of
genotyping, imputation, and quality control for each study/consortium are provided in the
Supplementary Note.
2.5.3 Statistical Analysis
Ethnic-specific association with prostate cancer risk was evaluated using an additive
model for each candidate SNP, where the genotype was coded as the number of risk alleles
(0, 1 or 2). The risk alleles were aligned for increased risk of the originally reported trait
to aid in the interpretation of prostate cancer associations. For example, if allele A of a
known breast cancer marker was the reported risk allele, A was also coded as the risk allele
for prostate cancer regardless of allele
frequency. Ethnic- and study-specific unconditional logistic regression models minimally
adjusted for age and global ancestry were used to assess associations (see Supplementary
Note). Summary estimates for overall effects were evaluated using a fixed-effect
inverse-variance-weighted meta-analysis. Heterogeneity was evaluated using Cochran’s Q test
and I^2 statistics (Higgins et al., 2003). To account for multiple testing of 137 non-prostate
cancer SNPs, we employed a Bonferroni-corrected threshold of α=3.6×10^-4 (where
3.6×10^-4=0.05/137) to assign statistical significance for novel risk variants. For the 59
prostate cancer SNPs, we applied an uncorrected threshold of α=0.05 to assess the
generalizability across racial/ethnic groups.
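The fixed-effect inverse-variance-weighted meta-analysis and the heterogeneity statistics described above can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: it pools the five ethnic-specific estimates for rs6457327 from Table 2, with standard errors back-calculated from the rounded 95% confidence intervals. The function name is hypothetical. Because the study combined study- and ethnic-specific strata, the heterogeneity statistics computed here need not match the reported p_het.

```python
import math

# Ethnic-specific per-allele ORs and 95% CIs for rs6457327, copied from Table 2.
strata = {
    "European": (0.93, 0.90, 0.95),
    "African": (0.92, 0.84, 1.02),
    "Latino": (0.97, 0.86, 1.11),
    "Japanese": (1.03, 0.92, 1.15),
    "Native Hawaiian": (1.03, 0.78, 1.37),
}

def fixed_effect_meta(estimates):
    """Inverse-variance-weighted fixed-effect pooling of log odds ratios."""
    betas, weights = [], []
    for odds_ratio, low, high in estimates:
        # Standard error of the log OR, recovered from the 95% CI width.
        se = (math.log(high) - math.log(low)) / (2 * 1.959964)
        betas.append(math.log(odds_ratio))
        weights.append(1.0 / se ** 2)
    total_w = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total_w
    pooled_se = math.sqrt(1.0 / total_w)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2 expressed as a proportion, truncated at zero.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, df, i2

pooled, pooled_se, q, df, i2 = fixed_effect_meta(strata.values())
pooled_or = math.exp(pooled)
ci = (math.exp(pooled - 1.959964 * pooled_se), math.exp(pooled + 1.959964 * pooled_se))
print(f"pooled OR={pooled_or:.3f}, 95% CI {ci[0]:.2f}-{ci[1]:.2f}, Q={q:.2f} on {df} df, I^2={i2:.2f}")
```

Under the null hypothesis of homogeneity, Q is compared against a chi-square distribution with df degrees of freedom. The pooled estimate from the rounded inputs is close to the overall OR reported in Table 2.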
In addition to evaluating single variant associations, we assessed the cumulative
effect of non-prostate cancer risk variants through a genetic risk score analysis. SNPs
were selected based on the following criteria: (i) never reported as prostate cancer risk
variants, (ii) not in 8q24 region, which is highly associated with prostate cancer (Al
Olama et al., 2009; Freedman et al., 2006; Haiman et al., 2007; Liu et al., 2012), (iii)
available from all the participating studies and (iv) LD-independent, with LD threshold of
r^2<0.2 in the EUR population from the 1000 Genomes Project (Feb 2012 release). The final
analysis included 20 SNPs (Supplementary Table S3). The multi-SNP genetic risk score
was defined as the unweighted sum of risk alleles carried by each individual. The average
per-allele effect was approximated by summarizing the coefficients and standard errors
from consortium-level meta-analytic results for each individual SNP included in the risk
score calculation (Dastani et al., 2012). We also meta-analyzed the risk score association
results from each consortium under a fixed-effect model by ethnicity and overall.
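The risk score definition above can be sketched in a few lines. The first function implements the unweighted sum of risk alleles as stated in the text. The second function is an assumption: it shows one common way to approximate the per-allele effect of an unweighted score from per-SNP summary statistics (an inverse-variance-weighted average of the per-SNP log odds ratios), which may differ in detail from the computation actually used. All function names and input values are hypothetical.

```python
import math

def unweighted_risk_score(risk_allele_counts):
    """Unweighted genetic risk score: total number of risk alleles carried across SNPs."""
    if any(count not in (0, 1, 2) for count in risk_allele_counts):
        raise ValueError("each SNP must contribute 0, 1 or 2 risk alleles")
    return sum(risk_allele_counts)

def approx_score_effect(betas, ses):
    """Approximate per-allele log OR of an unweighted score from per-SNP summary statistics.

    Assumption: inverse-variance-weighted average of the per-SNP log ORs; not necessarily
    the exact computation used in the study.
    """
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return beta, se

# Hypothetical genotypes of one individual at 20 SNPs (illustrative values only).
genotypes = [0, 1, 2, 1, 0, 0, 1, 2, 1, 1, 0, 2, 1, 0, 1, 1, 2, 0, 1, 1]
score = unweighted_risk_score(genotypes)

# Hypothetical per-SNP log ORs and standard errors.
beta, se = approx_score_effect([0.02, 0.00, 0.01], [0.01, 0.01, 0.01])
print(score, round(math.exp(beta), 3), round(se, 4))
```

With an unweighted score, each additional risk allele is assumed to carry the same effect, so a per-allele OR close to 1 (such as the 1.01 reported in the Results) reflects the average contribution of the included SNPs.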
Table 1. Sample size by race/ethnicity in each consortium

                                      Consortium
Race/Ethnicity    Prostate cancer   PRACTICAL     PAGE      BPC3     Total
European          Case                 19,622    1,971     2,107    23,700
                  Control              19,715    7,367     3,773    30,855
                  Total                39,337    9,338     5,880    54,555
African           Case                    623    1,406         -     2,029
                  Control                 569    2,600         -     3,169
                  Total                 1,192    4,006         -     5,198
Latino            Case                      -    1,160         -     1,160
                  Control                   -    1,402         -     1,402
                  Total                     -    2,562         -     2,562
Japanese          Case                      -    1,082         -     1,082
                  Control                   -    1,494         -     1,494
                  Total                     -    2,576         -     2,576
Native Hawaiian   Case                      -      164         -       164
                  Control                   -      298         -       298
                  Total                     -      462         -       462
Overall           Case                 20,245    5,783     2,107    28,135
                  Control              20,284   13,161     3,773    37,218
                  Total                40,529   18,944     5,880    65,353
Table 2. Summary results of the pleiotropic variants associated with prostate cancer

SNP [a]      Region [b]        Race/Ethnicity    RAF [c]   OR (95% CI) [d]     P [e]
rs6010620    20q13, 62309839   European          0.78      1.11 (1.08-1.15)    1.7×10^-10
G/A          RTEL1             African           0.94      1.13 (0.95-1.36)    0.17
             Glioma            Latino            0.74      1.04 (0.91-1.20)    0.57
                               Japanese          0.32      1.08 (0.96-1.23)    0.19
                               Native Hawaiian   0.67      0.91 (0.66-1.25)    0.55
                               Overall [f]       -         1.10 (1.07-1.14)    6.9×10^-11
                                                                               p_het [g]=0.96
rs2853676    5p15, 1288547     European          0.26      0.92 (0.89-0.95)    3.0×10^-7
A/G          TERT              African           0.23      0.87 (0.78-0.97)    0.015
             Glioma            Latino            0.22      0.89 (0.77-1.04)    0.14
                               Japanese          0.15      0.92 (0.78-1.08)    0.31
                               Native Hawaiian   0.14      0.76 (0.50-1.16)    0.2
                               Overall [f]       -         0.92 (0.89-0.94)    3.0×10^-9
                                                                               p_het [g]=0.72
rs6457327    6p21, 31074030    European          0.61      0.93 (0.90-0.95)    2.5×10^-8
C/A          STG               African           0.71      0.92 (0.84-1.02)    0.11
             Non-Hodgkin's     Latino            0.61      0.97 (0.86-1.11)    0.68
             lymphoma          Japanese          0.57      1.03 (0.92-1.15)    0.63
                               Native Hawaiian   0.55      1.03 (0.78-1.37)    0.83
                               Overall [f]       -         0.93 (0.91-0.96)    4.5×10^-8
                                                                               p_het [g]=0.93

[a] Single-nucleotide polymorphism (SNP) ID, risk/reference allele. Risk allele is the allele
associated with increased risk of the primary cancer in previous studies.
[b] Chromosomal locus, base-pair position in GRCh37/hg19, nearest gene, the primary cancer for
which the SNP was initially reported.
[c] Risk allele frequency (RAF).
[d] Per-allele odds ratio (OR) and 95% confidence interval (CI) of the risk allele in association
with prostate cancer.
[e] P-value from 1-degree of freedom Wald test of trend.
[f] Meta-analysis of all populations using an inverse-variance-weighted fixed-effect model.
[g] P-value from Cochran's Q test of heterogeneity.
Figure 1. Pie chart of the genetic risk variants previously associated with 18 cancers
or cancer-related traits examined in this study. Nicotine dependence is a lung cancer-
related trait. The size of each slice is proportional to the percentage of risk variants
originally reported for the given cancer. Details regarding each variant are provided in
Supplementary Table S3.
[Figure 1 pie chart. Slice labels: Breast, Ovarian, Nicotine dependence, Colorectal, Testicular,
Bladder, Leukemia, Esophageal, Glioma, Lung, Melanoma, Neuroblastoma, Prostate,
Nasopharyngeal, Pancreatic, Thyroid, Basal cell, Non-Hodgkin's lymphoma]
Figure 2. Manhattan plot of the multiethnic meta-analysis association between non-
prostate cancer risk variants and prostate cancer. Each symbol represents a genetic
variant, and is color/shape-coded according to the cancer (or related trait) for which the
variant was primarily identified. The variants are ordered on the horizontal axis according
to their genomic positions. The vertical axis represents the negative base-10 logarithm of the
p-value in association with prostate cancer risk. The blue dotted line is the
Bonferroni-corrected significance threshold; the red dotted line is the genome-wide significance
threshold. Details regarding each association are provided in Supplementary Table S5.
[Figure 2 plot. Horizontal axis: Chromosome; vertical axis: −log10(p), from 0 to 10.
Legend (originally reported cancer or risk factor association): Breast, Ovarian, Nicotine
dependence, Colorectal, Testicular, Bladder, Leukemia, Esophageal, Glioma, Lung, Melanoma,
Neuroblastoma, Nasopharyngeal, Pancreatic, Thyroid, Basal cell, Non-Hodgkin's lymphoma.
Labeled variants: rs6010620, rs2853676, rs6457327. Threshold lines: p=5×10^-8 and p=3.6×10^-4.]
Figure 3. Forest plot of the association between rs6457327 at HLA-C locus and
prostate cancer risk. The effects of the C allele (risk allele for follicular lymphoma) on
prostate cancer are plotted. Round dots represent the per-allele odds ratios (OR) in the
ethnic- and consortium-specific associations; diamonds represent the summary estimates
in a fixed-effect meta-analysis. Horizontal bars represent the 95% confidence intervals
(CI). Dotted vertical line represents OR=1.
[Figure 3 plot. Horizontal axis: Odds Ratio, from 0.60 to 1.40. Row labels: Overall - Summary;
Native Hawaiian - PAGE; Latino - PAGE; Asian - PAGE; African - Summary; African - PAGE;
African - PRACTICAL; European - Summary; European - BPC3; European - PAGE;
European - PRACTICAL.]
References
Ahmadiyeh, N., Pomerantz, M. M., Grisanzio, C., et al. (2010). 8q24 prostate, breast, and
colon cancer risk loci show tissue-specific long-range interaction with MYC. Proc Natl
Acad Sci U S A 107, 9742-9746.
Al Olama, A. A., Kote-Jarai, Z., Giles, G. G., et al. (2009). Multiple loci on 8q24
associated with prostate cancer susceptibility. Nat Genet 41, 1058-1060.
Cheng, I., Kocarnik, J. M., Dumitrescu, L., et al. (2013). Pleiotropic effects of genetic
risk variants for other cancers on colorectal cancer risk: PAGE, GECCO and CCFR
consortia. Gut.
Conde, L., Halperin, E., Akers, N. K., et al. (2010). Genome-wide association study of
follicular lymphoma identifies a risk locus at 6p21.32. Nat Genet 42, 661-664.
Crowther-Swanepoel, D., Broderick, P., Di Bernardo, M. C., et al. (2010). Common
variants at 2q37.3, 8q24.21, 15q21.3 and 16q24.1 influence chronic lymphocytic
leukemia risk. Nat Genet 42, 132-136.
Dastani, Z., Hivert, M. F., Timpson, N., et al. (2012). Novel loci for adiponectin levels
and their influence on type 2 diabetes and metabolic traits: a multi-ethnic meta-analysis
of 45,891 individuals. PLoS Genet 8, e1002607.
Easton, D. F., Pooley, K. A., Dunning, A. M., et al. (2007). Genome-wide association
study identifies novel breast cancer susceptibility loci. Nature 447, 1087-1093.
Eeles, R. A., Olama, A. A., Benlloch, S., et al. (2013). Identification of 23 new prostate
cancer susceptibility loci using the iCOGS custom genotyping array. Nat Genet 45, 385-
391, 391e381-382.
Fitzgerald, L. M., Kumar, A., Boyle, E. A., et al. (2013). Germline missense variants in
the BTNL2 gene are associated with prostate cancer susceptibility. Cancer Epidemiol
Biomarkers Prev 22, 1520-1528.
Freedman, M. L., Haiman, C. A., Patterson, N., et al. (2006). Admixture mapping
identifies 8q24 as a prostate cancer risk locus in African-American men. Proc Natl Acad
Sci U S A 103, 14068-14073.
Goode, E. L., Chenevix-Trench, G., Song, H., et al. (2010). A genome-wide association
study identifies susceptibility loci for ovarian cancer at 2q31 and 8q24. Nat Genet 42,
874-879.
Haiman, C. A., Chen, G. K., Vachon, C. M., et al. (2011). A common variant at the
TERT-CLPTM1L locus is associated with estrogen receptor-negative breast cancer. Nat
Genet 43, 1210-1214.
Haiman, C. A., Patterson, N., Freedman, M. L., et al. (2007). Multiple regions within
8q24 independently affect risk for prostate cancer. Nat Genet 39, 638-644.
Higgins, J. P., Thompson, S. G., Deeks, J. J., and Altman, D. G. (2003). Measuring
inconsistency in meta-analyses. BMJ 327, 557-560.
Hindorff, L. A., Sethupathy, P., Junkins, H. A., et al. (2009). Potential etiologic and
functional implications of genome-wide association loci for human diseases and traits.
Proc Natl Acad Sci U S A 106, 9362-9367.
Hunter, D. J., Riboli, E., Haiman, C. A., et al. (2005). A candidate gene approach to
searching for low-penetrance breast and prostate cancer genes. Nat Rev Cancer 5, 977-
985.
Jia, L., Landan, G., Pomerantz, M., et al. (2009). Functional enhancers at the gene-poor
8q24 cancer-linked locus. PLoS Genet 5, e1000597.
Kote-Jarai, Z., Easton, D. F., Stanford, J. L., et al. (2008). Multiple novel prostate cancer
predisposition loci confirmed by an international study: the PRACTICAL Consortium.
Cancer Epidemiol Biomarkers Prev 17, 2052-2061.
Kote-Jarai, Z., Olama, A. A., Giles, G. G., et al. (2011). Seven prostate cancer
susceptibility loci identified by a multi-stage genome-wide association study. Nat Genet
43, 785-791.
Kote-Jarai, Z., Saunders, E. J., Leongamornlert, D. A., et al. (2013). Fine-mapping
identifies multiple prostate cancer risk loci at 5p15, one of which associates with TERT
expression. Hum Mol Genet 22, 2520-2528.
Landi, M. T., Chatterjee, N., Yu, K., et al. (2009). A genome-wide association study of
lung cancer identifies a region of chromosome 5p15 associated with risk for
adenocarcinoma. Am J Hum Genet 85, 679-691.
Leek, R. D., Lewis, C. E., Whitehouse, R., Greenall, M., Clarke, J., and Harris, A. L.
(1996). Association of macrophage infiltration with angiogenesis and prognosis in
invasive breast carcinoma. Cancer Res 56, 4625-4629.
Liu, M., Wang, J., Xu, Y., Wei, D., Shi, X., and Yang, Z. (2012). Risk loci on
chromosome 8q24 are associated with prostate cancer in northern Chinese men. J Urol
187, 315-321.
Matise, T. C., Ambite, J. L., Buyske, S., et al. (2011). The Next PAGE in understanding
complex traits: design for the analysis of Population Architecture Using Genetics and
Epidemiology (PAGE) Study. Am J Epidemiol 174, 849-859.
Nair, R. P., Stuart, P. E., Nistor, I., et al. (2006). Sequence and haplotype analysis
supports HLA-C as the psoriasis susceptibility 1 gene. Am J Hum Genet 78, 827-851.
Neefjes, J., Jongsma, M. L., Paul, P., and Bakke, O. (2011). Towards a systems
understanding of MHC class I and MHC class II antigen presentation. Nat Rev Immunol
11, 823-836.
Peters, U., Hutter, C. M., Hsu, L., et al. (2012). Meta-analysis of new genome-wide
association studies of colorectal cancer risk. Hum Genet 131, 217-234.
Petersen, G. M., Amundadottir, L., Fuchs, C. S., et al. (2010). A genome-wide
association study identifies pancreatic cancer susceptibility loci on chromosomes
13q22.1, 1q32.1 and 5p15.33. Nat Genet 42, 224-228.
Rafnar, T., Sulem, P., Stacey, S. N., et al. (2009). Sequence variants at the TERT-
CLPTM1L locus associate with many cancer types. Nat Genet 41, 221-227.
Rothman, N., Garcia-Closas, M., Chatterjee, N., et al. (2010). A multi-stage genome-
wide association study of bladder cancer identifies multiple susceptibility loci. Nat Genet
42, 978-984.
Sakoda, L. C., Jorgenson, E., and Witte, J. S. (2013). Turning of COGS moves forward
findings for hormonally mediated cancers. Nat Genet 45, 345-348.
Schumacher, F. R., Feigelson, H. S., Cox, D. G., et al. (2007). A common 8q24 variant in
prostate and breast cancer from a large nested case-control study. Cancer Res 67, 2951-
2956.
Shete, S., Hosking, F. J., Robertson, L. B., et al. (2009). Genome-wide association study
identifies five susceptibility loci for glioma. Nat Genet 41, 899-904.
Sirota, M., Schaub, M. A., Batzoglou, S., Robinson, W. H., and Butte, A. J. (2009).
Autoimmune disease classification by inverse association with SNP alleles. PLoS Genet
5, e1000792.
Skibola, C. F., Bracci, P. M., Halperin, E., et al. (2009). Genetic variants at 6p21.33 are
associated with susceptibility to follicular lymphoma. Nat Genet 41, 873-875.
Sotelo, J., Esposito, D., Duhagon, M. A., et al. (2010). Long-range enhancers on 8q24
regulate c-Myc. Proc Natl Acad Sci U S A 107, 3001-3005.
Turnbull, C., Rapley, E. A., Seal, S., et al. (2010). Variants near DMRT1, TERT and
ATF7IP are associated with testicular germ cell cancer. Nat Genet 42, 604-607.
Willeit, P., Willeit, J., Mayr, A., et al. (2010). Telomere length and risk of incident cancer
and cancer mortality. JAMA 304, 69-75.
Yeager, M., Chatterjee, N., Ciampa, J., et al. (2009). Identification of a new prostate
cancer susceptibility locus on chromosome 8q24. Nat Genet 41, 1055-1057.
Yeager, M., Orr, N., Hayes, R. B., et al. (2007). Genome-wide association study of
prostate cancer identifies a second risk locus at 8q24. Nat Genet 39, 645-649.
Chapter 3 Genome-wide association testing of putative
functional exonic variants and high-density imputed variants
with prostate cancer risk in multiethnic populations
3.1 Introduction
While genome-wide association studies (GWAS) have been remarkably successful in
identifying common genetic variants associated with prostate cancer risk, the genetic
basis underlying susceptibility has yet to be completely revealed. First, the effect sizes of
the GWAS-identified risk alleles have been modest (relative risk, RR, of 1.1–1.4) and, in
most cases, even in sum they explain only a fraction of familial risk or disease
heritability. Second, GWAS have relied almost exclusively on Illumina and Affymetrix
SNP arrays, with SNP content selected primarily from HapMap to capture a large fraction
of common variation in coding and non-coding regions in populations of European
ancestry. The vast majority of alleles not included on the genotyping
arrays, in particular those with frequencies ≤ 1%, have not been tested. Thus, to date, a
large fraction of genetic variation has yet to be explored with respect to disease etiology.
Given that, in search of additional risk variants for prostate cancer, we performed
genome-wide scans of exonic variants (refer to section 3.2) and high-density imputed
variants (refer to section 3.3).
3.2 Genome-wide testing of putative functional exonic variants
in relationship with prostate cancer risk in a multiethnic population
To date, a lack of technology to survey the genome and accurately enumerate and test the
variants in large numbers of samples has limited the exploration of less common and rare
alleles. In 2011, the Illumina Infinium HumanExome array (or “exome chip”) was
developed in collaboration with investigators who combined whole-exome sequencing
conducted in >12,000 individuals of primarily European ancestry as well as in small
numbers of other racial/ethnic minorities including African Americans, Hispanics, and
Asians; the content on the array includes >200,000 putative functional exonic variants
and aims to provide comprehensive coverage of all non-synonymous variants above
0.1% frequency in Europeans. In the present study, we have utilized this array to test the
hypothesis that there are less common and rare functional variants in the coding regions
of genes that convey risk for prostate cancer of greater magnitude than the common
variants revealed through GWAS. We tested both single markers as well as gene
summaries of the burden of rare alleles in multiethnic studies of incident prostate cancer
in the Multiethnic Cohort study (MEC: 4,675 cases and 8,021 controls). In addition we
conducted exploratory analyses of rare variants in relationship with several cancer-related
traits ascertained at baseline in the entire MEC sample (n=15,837, including 3,141 breast
cancer cases). This work was published in PLoS Genetics (Haiman et al., 2013).
In this study, I contributed to the quality control of genotyping data, allele
frequency summary by ethnicity, principal component analysis for the entire sample,
statistical analysis for prostate cancer and many cancer-related traits, as well as data
visualization and results interpretation. Detailed methods, results and discussion are
provided below.
3.2.1 Materials and Methods
3.2.1.1 Study Population
The MEC consists of more than 215,000 men and women in California and Hawaii aged
45–75 at recruitment, and comprises mainly five self-reported racial/ethnic populations:
African Americans, Japanese, Latinos, Native Hawaiians, and European Americans
(Kolonel et al., 2000). Between 1993 and 1996, adults enrolled in the study by
completing a 26-page mailed questionnaire asking detailed information about
demographic factors, personal behaviors, and prior medical conditions. Potential
participants were identified through driver’s license files from Departments of Motor
Vehicles, voter registration lists, and Health Care Financing Administration data files.
Incident breast and prostate cancer, as well as stage and hormone receptor status, were
identified by linkage of the cohort to the Surveillance, Epidemiology, and End Results
cancer registries covering Hawaii and California. Between 1995 and 2006, blood
specimens were collected prospectively from ~67,000 participants for genetic and
biomarker analyses. Currently, the breast cancer case-control study nested in the MEC
includes 3,141 women diagnosed with invasive breast cancer and 3,721 frequency-
matched controls without breast cancer, matched by race/ethnicity and age (in 5-year age
categories). The case-control study of prostate cancer includes 4,675 men diagnosed with
incident prostate cancer and 4,300 male controls without prostate cancer. The
Institutional Review Boards at the University of Southern California and University of
Hawaii approved the study protocol.
3.2.1.2 Genotyping and Quality Control
Genotyping of the Illumina Human Exome BeadChip (n = 247,895 SNPs) was conducted
at the USC Genomics Core Laboratory. Of the 15,837 samples described above,
genotyping was successful with call rates ≥98% for 15,573 samples; of these, we removed
17 samples for which reported sex conflicted with assessment of X chromosome
heterozygosity, and 651 samples based on relatedness. Relatedness was determined using
the identity-by-descent (IBD) calculation in PLINK (Purcell et al., 2007), and we removed one member of each
estimated MZ twin, sibling, parent-offspring, half-sibling, or first-cousin pair. In the
analysis, we also removed SNPs with <98% call rates (n = 2,531). To assess genotyping
reproducibility, we included 338 replicate samples that passed genotyping QC; among
these samples the concordance rate of heterozygote calls, defined as number concordant/(number
concordant + number discordant), was 99.6% or greater for all replicate samples (average
99.99%). The final analysis dataset included 245,339 SNPs genotyped on 2,984 breast
cancer cases and 3,568 controls, and 4,376 prostate cancer cases and 3,977 controls.
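The sample accounting above can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in this section; the variable names and the helper concordance_rate are hypothetical, and the helper simply restates the concordance definition given in the text.

```python
# Sample counts quoted in Section 3.2.1.2
samples_called = 15573      # call rate >= 98%
sex_mismatch = 17           # reported sex conflicted with X heterozygosity
related = 651               # removed based on relatedness
samples_analysed = samples_called - sex_mismatch - related

# Final analysis groups quoted in the same section
final_groups = {
    "breast_cases": 2984,
    "breast_controls": 3568,
    "prostate_cases": 4376,
    "prostate_controls": 3977,
}
final_total = sum(final_groups.values())


def concordance_rate(n_concordant, n_discordant):
    """Concordance = number concordant / (number concordant + number discordant)."""
    return n_concordant / (n_concordant + n_discordant)
```

Both routes give 14,905 samples, the analysis total reported in the Results.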
3.2.1.3 Statistical Analysis
We relied on documentation files from the University of Michigan, posted at
ftp://share.sph.umich.edu/exomeChip/IlluminaDesigns/, for the assessment of SNP type
(e.g., nonsynonymous (NS), splice site (SP), stop gain or loss) and the amino acid
affected. The array also includes SNPs that do not code for protein changes, including
synonymous SNPs, intergenic SNPs such as ancestry informative markers,
and GWAS-identified risk SNPs for a number of diseases and outcomes. All SNPs were
analyzed and their results are shown in Tables S1-S9. However, our primary analysis
focused on the 191,032 putative functional variants in the following categories (NS, SP
and stop gain or loss) that passed the quality control procedures discussed above. We
estimated principal components in the entire sample using EIGENSTRAT (Price et al.,
2006) based on 2,887 autosomal ancestry informative markers on the array. We adjusted
for the top 10 principal components in all analyses.
Association testing of single markers. For all analyses except those of the X and Y
chromosomes, all controls (men and women combined) were utilized in the analysis of
each cancer in order to increase statistical power. Only controls of the same sex were
used to analyze X or Y chromosome variants. Analyses were performed overall and
within each racial/ethnic group. For each genotyped SNP, odds ratios (OR) and 95%
confidence intervals (95% CI) were estimated using unconditional logistic regression of
case/control status adjusting for age at diagnosis (cases) or blood draw (controls), and
reported race/ethnicity in the overall analyses, and the first 10 eigenvectors in both
overall and ethnic-specific analyses. For each SNP, we tested for allele dosage effects
through a 1 d.f. score chi-square trend test. When exposures are rare but have very strong
effects, the score test can be more powerful than the usual Wald test for reasons described
by Hauck and Donner (1977). However, we found the score test to be
overly liberal when the exposures are rare and the number of cases in a given
analysis is small compared to the number of controls (as in the analysis of advanced prostate
cancer). Therefore, we followed up any apparently globally significant findings from
the score test by rerunning the analysis using the exact logistic regression procedure
implemented in SAS (Cary, NC); when using the exact test we dropped the eigenvectors
and age from the analysis and used only reported race/ethnicity as an adjustment variable.
The presentation of results is focused on putative functional exonic SNPs (i.e. NS, SP and
stop gain or loss); the most significant results for all SNPs (including non-functional
SNPs) are provided in Tables S3, S4.
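To make the 1 d.f. score trend test concrete, the sketch below implements its simplest form: the score test for the allele-dosage coefficient in a logistic model with an intercept only. This is an illustration, not the analysis pipeline used in the study, which additionally adjusted for age, race/ethnicity and principal components; the function name score_trend_test and the toy data are hypothetical.

```python
import math


def score_trend_test(dosages, is_case):
    """Unadjusted 1 d.f. score test for an allele-dosage effect.

    dosages: minor-allele counts (0, 1 or 2) per subject.
    is_case: 1 for cases, 0 for controls.
    Returns (chi-square statistic, p-value).
    """
    n = len(dosages)
    mean_dosage = sum(dosages) / n
    case_fraction = sum(is_case) / n
    # Score: derivative of the log-likelihood at beta = 0
    score = sum(y * (x - mean_dosage) for x, y in zip(dosages, is_case))
    # Information for beta at beta = 0, with the intercept profiled out
    info = case_fraction * (1 - case_fraction) * sum(
        (x - mean_dosage) ** 2 for x in dosages
    )
    if info == 0:
        raise ValueError("monomorphic SNP or only one outcome class")
    chi2 = score ** 2 / info
    # Upper tail of a chi-square with 1 d.f.: P(X > c) = erfc(sqrt(c / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data: carriers are all cases
chi2_toy, p_toy = score_trend_test([0, 0, 1, 1], [0, 0, 1, 1])
```

Because the statistic relies on a large-sample chi-square approximation, it can misbehave when minor-allele counts are very small and cases are far fewer than controls, which is the situation in which the text falls back on exact logistic regression.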
Evaluation of the known risk loci for prostate cancer. We also examined whether
known risk alleles (generally intergenic or intronic) from GWAS studies of prostate
cancer may be reflecting an underlying signal from a nearby protein-altering variant. In
these analyses for each GWAS SNP (n=83) we initially interrogated nearby SNPs known
to be or likely to be in LD with the index signal. Because LD data are not yet available for
the majority of the SNPs on the HumanExome array, we expanded the associations
considered to all those within 100 kb on either side of the index signal, since
LD between common SNPs can sometimes extend this far. For this region, the Results
and Discussion highlight SNPs with modest signals of association (p<0.05) as
well as more strongly significant SNPs. Here the common SNPs are likely to be in high
LD with the (generally common) GWAS variants, and the rare SNPs could be producing
synthetic associations. We then widened this 200 kb region to 1 Mb (500 kb on either side
of the index signal) in order to expand our examination of possible synthetic associations
between rare SNPs and the index GWAS findings, since LD with rare SNPs can extend
considerably further than with common SNPs.
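The two-tier window described above can be expressed as a small classifier. The sketch below assumes both SNPs lie on the same chromosome and that positions are base-pair coordinates in a common genome build; the function name, the labels and the example coordinates are hypothetical.

```python
def classify_by_distance(snp_pos, index_pos, near_kb=100, far_kb=500):
    """Place a coding SNP in the near or extended window around a GWAS index SNP."""
    distance_bp = abs(snp_pos - index_pos)
    if distance_bp <= near_kb * 1000:
        return "within 100 kb"   # inside the initial 200 kb region
    if distance_bp <= far_kb * 1000:
        return "100-500 kb"      # inside the extended 1 Mb region only
    return "outside"


# Hypothetical coordinates: an index SNP at 1,000,000 bp
near_label = classify_by_distance(1_050_000, 1_000_000)
extended_label = classify_by_distance(1_412_000, 1_000_000)
```

Whether the boundaries are inclusive is an assumption of this sketch; the text does not specify it.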
3.2.2 Results
The analysis included 217,601 putative functional variants (of 247,870 total markers
listed on the array), predicted to alter the protein coding sequence, and which passed
quality control procedures (see Methods). Of the 15,837 samples, 14,905 were included
in the analysis (3,315 European Americans, 3,854 African Americans, 3,106 Latinos,
3,843 Japanese Americans and 787 Native Hawaiians; see Methods for exclusion
criteria). A few mitochondrial SNPs were included on the array (n = 165 SNPs passing
quality control) but are not discussed here (no associations with them were seen in the top
ranked 1,000 associations for either breast or prostate cancer). The number of prostate
cancer cases and controls are shown in Table 1. In this multiethnic sample, 191,032
(88%) putative functional variants were found to be polymorphic in at least one
population, with 26,569 (12%) being monomorphic in all five populations (Figure 1).
The percentage of monomorphic SNPs ranged from 34.1% in African Americans, 39.6%
in European Americans and 43.3% in Latinos to 66.8% in Native Hawaiians and 74.2%
in Japanese Americans (Figure S1). Of the polymorphic SNPs, 178,776 (93.4%) were
nonsynonymous (NS) variants, 8,308 (4.4%) splice site (SP) variants, and 3,948 (2.1%)
nonsense variants which either lead to a gain or loss of a stop codon. Of the polymorphic
SNPs, 34,834 (18.2%) were polymorphic in all four of the largest populations (excluding
Native Hawaiians), with 81,713 SNPs (42.7%) being polymorphic in African Americans,
Latinos and European Americans (Figure 2). African Americans had the largest number
of unique polymorphic SNPs (21,908, 11.4%), followed by European Americans (16,653,
8.7%), Japanese Americans (6,776, 3.5%) and Latinos (5,134, 2.7%).
In the pooled sample, 190,662 putative functional (NS, SP, or stop) SNPs had a
minor allele frequency (MAF) <1% (56,759<0.01%; 85,897 between 0.01% and 0.1%,
and 48,006 between 0.1% and 1%) (Figure 1, Figure S1). The minor allele frequency
distributions were similar across three of the five populations with African Americans,
European Americans and Latinos having roughly the same number of SNPs with
frequencies greater than 0 and less than 1% (100–110 thousand); however, there were
only 37,979 SNPs with a frequency above zero and less than 1% in Japanese Americans
and 52,985 in Native Hawaiians. The number of SNPs with a frequency >1% ranged
from approximately 18–35 thousand between sampled populations.
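The frequency categories used in this paragraph can be reproduced from allele counts. The following sketch bins a SNP by minor allele frequency; the function name maf_bin, the label strings and the treatment of values that fall exactly on a boundary are assumptions made for illustration.

```python
def maf_bin(allele_count, n_samples):
    """Assign a SNP to a minor-allele-frequency category.

    allele_count: number of copies of one allele observed.
    n_samples: number of diploid individuals genotyped.
    """
    total_alleles = 2 * n_samples
    minor_count = min(allele_count, total_alleles - allele_count)
    maf = minor_count / total_alleles
    if maf == 0:
        return "monomorphic"
    if maf < 0.0001:
        return "<0.01%"
    if maf < 0.001:
        return "0.01%-0.1%"
    if maf < 0.01:
        return "0.1%-1%"
    return ">=1%"


# With 14,905 individuals, a single observed minor allele has MAF of about 0.003%
singleton_bin = maf_bin(1, 14905)
```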
Inspection of the distribution of the chi-square (score) tests from models for
overall prostate cancer showed evidence of over-dispersion of test statistics (genomic
control lambda estimated to be approximately 1.20); however, when very rare SNPs were
removed (MAF<0.1% overall) then the test statistics appeared to be sampled from an
overall central chi-square distribution (genomic control lambda = 1.05). In the gene
burden analyses, the distribution of observed score tests showed mild evidence of over-
dispersion (lambda = 1.06). When the single SNP analysis was restricted to advanced
prostate cancer, where there were many more controls than cases included in each model,
then the behavior of the score test for the single SNP associations was problematic for
rare SNPs. For such SNPs we followed up any apparently globally significant
associations with exact logistic regression analysis, in order to reduce what appeared to
be a proliferation of false positive signals.
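The genomic control lambda quoted above is conventionally the median of the observed 1 d.f. chi-square statistics divided by the median of the central chi-square distribution with 1 d.f. (about 0.455). The sketch below shows that calculation; the function name is hypothetical, and whether the study used the median-based estimator is an assumption.

```python
import statistics

# Median of a central chi-square with 1 d.f. is the squared 75th percentile
# of the standard normal distribution (approximately 0.4549)
CHI2_1DF_MEDIAN = statistics.NormalDist().inv_cdf(0.75) ** 2


def genomic_control_lambda(chi2_stats):
    """Median-based genomic control inflation factor."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


# Toy example: statistics whose median equals the null median give lambda = 1
lambda_null = genomic_control_lambda([0.1, CHI2_1DF_MEDIAN, 3.0])
```

Values near 1 suggest the test statistics follow the null distribution in bulk; values such as 1.20 indicate over-dispersion.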
3.2.2.1 Prostate cancer single SNP associations
For overall prostate cancer (4,376 cases and 7,545 controls), none of the single SNP
associations with prostate cancer met the Bonferroni adjustment for multiple comparison
testing (nominal p<3.9×10⁻⁷). The top two associations found for prostate cancer were for
rare NS variants in F13A1 (rs140712764, Val170Ile, OR=28.0, p=9.1×10⁻⁷) and ANXA4
(rs146778617, Val315Phe, OR=4.52, p=6.0×10⁻⁶) (Table 2; see also Table S3).
F13A1 is a coagulation factor gene not obviously related to prostate cancer etiology.
ANXA4 encodes a protein that has been discussed as a possible marker for gastric cancer
(Lin, Huang and Juan, 2012). Of note, the third most significant association was for a
common NS variant in GPRC6A (rs2274911, Pro91Ser, OR=0.88, p=1.3×10⁻⁵). This
gene is near RFX6, which harbors an intronic variant (rs339331) that has been
reported in a GWAS of prostate cancer in Japanese men (Takata et al., 2010). The SNP
rs2274911 is common in all populations (MAFs of 24-43%) (Table 2) and the protective
effect of the minor allele was generally consistent in each group (OR=0.78 to 0.95 over
the five groups). This NS variant is correlated with the known intronic variant (rs339331,
which is included on the Illumina HumanExome array) in all populations (r² between
0.74 and 0.98); in conditional analyses neither of these two SNPs remained significant
after the other was forced into the model (p>0.2); thus these two variants are probably
capturing the same signal, with the NS SNP in GPRC6A a potentially plausible
susceptibility variant. The top 10 ranked associations (Table 2) were all NS variants, and
four were common with a MAF>10% in all ethnic groups.
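The r² values quoted for the two GPRC6A/RFX6 variants measure how well one SNP predicts the other. A minimal sketch of one common estimator, the squared Pearson correlation between genotype dosages, is given below. This genotype-based r² only approximates the haplotype-based LD measure, the study's own LD calculation is not described in this section, and the function name and toy genotypes are hypothetical.

```python
def r_squared(dosages_a, dosages_b):
    """Squared Pearson correlation between two SNPs coded as 0/1/2 dosages."""
    n = len(dosages_a)
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("r-squared is undefined for a monomorphic SNP")
    return cov ** 2 / (var_a * var_b)


# Toy genotypes: two perfectly correlated SNPs and two uncorrelated SNPs
r2_perfect = r_squared([0, 1, 2, 1], [0, 1, 2, 1])
r2_none = r_squared([0, 0, 2, 2], [0, 2, 0, 2])
```

When r² is close to 1, as between rs2274911 and rs339331, a conditional model has little power to separate the two signals, which is consistent with neither SNP remaining significant once the other is included.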
When restricted to advanced cases (n=499), many associations with very rare
SNPs were nominally significant using the score test (69 total for SNPs with fewer than 10
minor alleles observed), but the p-values failed to stand up to further investigation using
exact logistic regression (with p-values all >3×10⁻⁵). In order to reduce discussion of a
large number of likely false positive tests we considered in subtype (advanced/non-
advanced) analyses only SNPs with at least 10 minor alleles seen over all cases and
controls used in the analysis. Of the remaining SNPs we found that four NS SNPs with at
least 10 minor alleles present were nominally significant using the score test criteria
(Table 2, Table S4). These included NS variants in KLHL30 (exm280349, Arg108His,
OR=13.9, p=1.7×10⁻⁹), PPP1R15A (rs45533432, Arg65Gly, OR=4.67, p=1.2×10⁻⁸),
MUC12 (rs143984295, Ala101Thr, OR=14.4, p=1.5×10⁻⁸) and RP1 (rs114797722,
Ala1326Pro, OR=13.4, p=2×10⁻⁸). These SNPs were all quite rare in the four largest
populations (0.1% - 1%). P-values from exact logistic regression for these SNPs were
again less significant, with p-values between 1.4×10⁻⁶ and 4.6×10⁻⁴.
For non-advanced disease (n=3,666 cases), the strongest associations were with
the same SNPs as overall prostate cancer (rs140712764 in F13A1, rs146778617 in
ANXA4, rs2274911 in GPRC6A) and also with rs61746620 in ZKSCAN2 (Ala574Val,
OR=13.4, p=1.3×10⁻⁵), although none of these were significant at our Bonferroni criterion.
No SNPs were significantly associated with overall prostate cancer in the
ethnic-specific analyses (Table S3).
3.2.2.2 Analysis at known risk loci for prostate cancer
GWAS Loci: Table S7 gives results for SNP associations for genes located at known
prostate cancer susceptibility regions revealed through GWAS (e.g. regions harboring
globally significant associations) as of the time of this report (n= 89) (Eeles et al., 2012).
For each region, we list the genes having one or more genotyped coding variants that lie
within 500 kb of the known GWAS SNP and summarize associations (smallest p-value)
with coding variants in those genes and with the burden of coding variants (all SNPs and
rare SNPs). The most significant GWAS-related association, as described above, was
with rs2274911 (Pro91Ser) in GPRC6A. The next most significant finding was with
rs16836525 (Val125Met) in PMVK at 1q21 (p=3.0×10⁻⁴). This SNP was only common in
African Americans (20% frequency; ≤1% in the other populations). An additional eight
nearby genes had SNPs with corrected p-values between 0.001 and 0.05: ITGA6, VGLL3,
TECPR1, TPCN2, FAM83F, PBXIP1, FARP2 and TTLL12 (Table S7). Seven SNPs were
correlated with a GWAS index SNP at r²≥0.3 in the 200kb window and significant at
p<0.05 (SLC2A4RG, PDLIM5, RNMTL1, KLK3, MLPH, RTEL1 as well as GPRC6A).
Given the modest effects noted with the initial GWAS signals as well as observed with
these correlated coding SNPs (OR per allele of ~1.1; Table S7), and the lack of strong
signals noted for the index signals across populations (Chen et al., 2010), conditional
analyses will be needed in much larger samples of the GWAS population (mainly
European ancestry) to determine whether these coding SNPs are the biologically
functional alleles underlying the GWAS signal. (Our ability to perform informative
conditional analysis here is further hampered by the fact that only a minority of the index
GWAS hits are included on the Illumina array).
Extended Associations: Because of the interest in the possibility that rare coding
variants with large effect sizes (OR > 1.5 or higher) may underlie GWAS signals and
since LD with rare SNPs can extend much further than with common SNPs, we report in
Table S7 the strongest associations for all coding variants in each gene within 500kb of
each GWAS index signal. The strongest single SNP associations with prostate cancer
(from 100 to 500kb) were in SNED1 (412 kb from rs3771570 on chromosome 2,
p=3.5×10⁻⁴) and PASK (317kb from the same index SNP on chromosome 2, p=4.8×10⁻⁴).
No other associations in this distance range had p<0.001 for overall prostate cancer.
High-Risk Genes: We also examined genes implicated in family-based studies of
prostate cancer (Table S7) as they are strong candidates. We analyzed 5 genes and did
not observe an over-representation of SNP associations at p<0.05 (observed/tested:
BRCA2, 2/83; ELAC2, 1/9; HOXB13, 0/2; MSR1, 1/22; RNASEL, 2/21). However, we did
observe suggestive evidence of associations with burden testing of rare (MAF<0.01)
SNPs in ELAC2 (OR=1.67, p=0.03) and in RNASEL (OR=1.26, p=0.02). The most
significant associations included a very rare NS variant in ELAC2 that was mainly
observed in African Americans (rs149544601, Ile356Val, MAF=6.6×10⁻⁵; OR=14.0,
p=0.0014), and a nonsense SNP (rs74315364, Glu265Ter) and NS variant (rs151296858,
Gly59Ser) in RNASEL (both with OR=2.51, p=0.012) that were observed in the same
individuals. We did not observe significant associations with any of the reported risk
variants in these genes (Ala541Thr, Ser217Leu in ELAC2; Ser41Thr, Asp174Tyr,
Pro275Ala, Arg293Ter in MSR1, or Arg462Gln, Glu541Asp in RNASEL, Table S7; the
recently reported HOXB13 variant, Gly84Glu (Ewing et al., 2012), was not included on
the array).
3.2.2.3 Other phenotypes and traits
We also examined additional cancer-related traits: body mass index (BMI), alcohol intake,
as well as circulating PSA levels (Table S8). A number of NS variants have already
been strongly associated with many of these traits, such as rs671 (Glu504Lys) in ALDH2
with alcohol intake (Takeuchi et al., 2011), rs17632542 (Ile179Thr) in KLK3 and
circulating PSA levels (Gudmundsson et al., 2010; Parikh et al., 2011) and rs198977
(Arg250Trp) in KLK2 and the ratio of free to total PSA (Klein et al., 2010). For each trait,
the 10 most associated variants on the array (including non-functional SNPs, i.e. GWAS
SNPs) are provided in Table S9. We also observed a number of suggestive associations
at p<3.9×10⁻⁷ with rare coding variants in some genes that are biologically plausible for
each trait. Three variants were strongly associated with blood PSA levels (chr19: Hg19
position: 4552446, Thr326Met, SEMA6B, 0.1% MAF in African Americans and
monomorphic in all of the other populations, beta=3.8, p=3.8×10⁻⁹; rs17632542,
Ile136Thr, beta=-0.4588, p=1.0×10⁻⁸, MAF 0.06 in European Americans; rs148595483,
Asn322Lys, CCDC78, 0-0.1% MAF across populations, beta=-2.9, p=2.4×10⁻⁸). We also
found a number of significant associations between BMI and very rare NS variants that were observed
in 2-7 individuals (rs146199292, Asn31Lys, OSBPL11, beta=19.9, p=1.2×10⁻¹⁰;
rs149954327, Leu458Val, STON1-GTF2A1L, beta=15.2, p=1.5×10⁻⁹; rs146922831,
Lys608Asn, LRGUK, beta=9.2, p=3.0×10⁻⁸). The variants were very rare in African
Americans with frequencies <0.09% and monomorphic in all of the other populations
except for rs146199292 in Latinos (0.02%). Variations in these genes have been reported
in association with conditions related to BMI, including cardiovascular risk factors,
type 2 diabetes and polycystic ovarian syndrome (Bouchard et al., 2009; Chen et al.,
2011; Laramie et al., 2008). The carriers of these rare alleles were clustered at the
extreme high end of the BMI distribution. All these potentially novel associations will
need further follow-up.
3.2.3 Discussion
This study presents an initial investigation of the role of coding variation in the genetics
of prostate cancer. Our initial analysis fails to find strong evidence for the hypothesis that
relatively rare coding variation is highly determinative of prostate cancer risk either
overall or by subtype. The sample size in each racial/ethnic group was relatively
small (roughly 1,000 cases and 2,000 controls in the largest groups); however, these
sample sizes are large enough to detect risk alleles with moderate to large effects (odds
ratios of 3-13) appearing at quite low frequency (0.1-1%) and to examine whether such
coding variation underlies (by so-called synthetic association (Dickson et al., 2010)) many
GWAS associations. While caution is advised in interpreting our results, especially for
other than European racial/ethnic groups (since the array utilized was predominantly
based upon sequence information for Europeans and is not expected to cover other
groups equally well), it appears that future studies to understand the relationship between
rare coding variation and prostate cancer risk will likely require the very large sample
sizes needed to target much less penetrant alleles.
For prostate cancer (all cases) the third strongest association result was for a
common NS coding variant (rs2274911) in GPRC6A that is in very high LD with the
known intronic GWAS variant rs339331. In our data the NS variant was slightly more
associated (Table S3) with prostate cancer risk (p=1.3×10⁻⁵) than was rs339331
(p=2.1×10⁻⁵). The coding SNP is arguably a more likely causal variant than the intronic
SNP since expression of GPRC6A is substantially increased in prostate cancer cell lines,
and mice deficient in GPRC6A show retarded prostate cancer progression (Pi and Quarles,
2012). In addition, GPRC6A deficiency in mice also attenuates the rapid signaling
responses to testosterone, an androgen that is critical for initiation and progression of
prostate cancer (Pi, Parrill and Quarles, 2010).
Other suggestive findings for prostate cancer include SNPs in a variety of genes
such as F13A1 expression of which has been associated with bone metastasis in prostate
cancer (Morrissey et al., 2008), ANXA4 which is up-regulated in gastric and other cancers
(Lin et al., 2012 ), NSD1 where cryptic translocations may be involved in AML
occurrence (Hollink et al., 2011) and MUC12, expression of which has been reported to
be a prognostic marker in colon cancer (Matsuyama et al., 2010).
We also evaluated associations in regions surrounding known (GWAS) risk
alleles as a partial fine-mapping exercise; we specifically focused upon (1) coding alleles
reported to be in high LD (in Europeans using 1000 Genomes data) with the index
marker, and (2) other (generally less common) coding alleles within 500kb of the GWAS
alleles that might show associations that could underlie (by synthetic association
(Dickson et al., 2010)) GWAS associations. A number of GWAS risk alleles are in
reasonable LD (r^2>0.3) with coding SNPs on the array and several of the latter show
nominal associations (p<0.05) with prostate cancer risk. The observation is made most
notably for GPRC6A but also for MLPH (GWAS index = rs7584330, chromosome 2,
p=0.003), PDLIM5 (rs12500426, chromosome 4, p=0.019), RNMTL1 (rs684232,
chromosome 17, p=0.024), KLK3 (rs2735839, chromosome 19, p=0.0046), and RTEL1
(rs6062509, chromosome 20, p=0.001). Previous reports (Kote-Jarai et al., 2011a; Parikh
et al., 2011) have highlighted the NS SNP rs17632542 in KLK3 as highly associated with
PSA level and a highly significant risk variant in fine-mapping of the locus near
rs2735839 (Kote-Jarai et al., 2011a); while no report for prostate cancer exists for coding
SNPs in RTEL1, another NS SNP, rs3208008, in RTEL1 has been found to be associated
with glioma risk (Egan et al., 2011).
Other coding SNPs that could include causal variants producing synthetic
associations (associations of rare with common SNPs of high penetrance) include SNPs
in SNED1 and PASK for prostate cancer. These do not have high r^2 with the GWAS
variants as they are mostly rare (and are >100kb away from the index signal) but their
nominally strong associations (p-values < 1×10^{-3}) might be indicative of signals
extending for many thousands of base pairs, although it will take much larger studies to
verify or refute this.
We found little evidence that the NS, SP, or nonsense variants captured by the
HumanExome SNP array that fall within known or suspected high risk genes for prostate
cancer are meaningfully associated with the disease. The Illumina array does not directly
interrogate the rare, high-risk mutations, such as frameshift mutations in BRCA1 or
BRCA2 (e.g. c.68_69delAG) (King, Marks and Mandell, 2003), as very few indels are
included on this array (just 136 were examined here). The inability to address frameshift
mutations either within known risk genes or more widely is a limitation of this report.
Other limitations include the focus on Europeans in the development of the array (as
seems to be particularly reflected in the relatively small fraction of SNPs found to be
polymorphic in Japanese Americans), and the loss of some targeted SNPs in the
manufacturing process and in our QC procedures. In addition, this technology (unlike
exome sequencing) cannot address the role of either private variation or of variants too
rare to have been reliably identified during the discovery phase of the development of the
array.
Genotyping cases and controls from our prospective cohort allowed us an
opportunity to examine other cancer-related phenotypes and traits for which data and
specimens had been collected prior to prostate cancer diagnosis. While two of these
endpoints (BMI, alcohol) were based on self-report, we were able to strongly replicate a
number of known associations, such as rs671 in ALDH2 with alcohol intake, which is
proof of principle that the exome array has the potential to reveal biologically relevant
coding variants. Apparently novel findings for PSA, BMI, and alcohol consumption will
need to be replicated in large-scale exome association analyses; hopefully making the
results from these preliminary analyses in a multiethnic population broadly available will
contribute to novel discoveries and further understanding of the genetic basis of these traits.
Realistically our study only begins the assessment of whether a range of effects
for "moderately rare" coding variants is possible: the detectable ORs in this study range
from approximately 3 to 13 for alleles with frequency 1% to 0.1%, respectively. While
these are large ORs the above argument indicates that such effect sizes are not
unreasonable if rarer protein coding variation plays a similar role in the heritability of risk
as does common variation genome-wide. Our failure to find such ORs for the rarer alleles
may be providing evidence against coding variation having a predominant role in prostate
cancer heritability and risk (outside of high risk families).
In summary, the analyses and methods described here do not support NS variants
on the current exome chip as conveying moderate to high risk for prostate cancer. While
some suggestive findings are noted, it is likely that very large sample sizes, of the order
that can only be developed through collaborative efforts such as those now engaged in the
NCI GAME-ON post-GWAS meta-analysis of common variants, will be required in
order to further the understanding of the role of rare NS and other coding variation in
disease genetics. Exome sequencing of high-risk families will continue to be important to
reveal biologically relevant coding variants for these cancers, both for insertion/deletion
variants that were not covered by the current array, and to capture rarer variation
(including private variants) that cannot be captured except by sequencing.
Figure 1. Minor allele frequency for all variants successfully genotyped using the
Illumina Human Exome array.
[Bar chart. x-axis: Minor Allele Frequency, in bins 0, (0,0.0001], (0.0001,0.001], (0.001,0.01], (0.01,0.05], (0.05,0.1], (0.1,1]. y-axis: Number of SNPs, 0 to 200,000.]
Figure 2. Number of polymorphic putative functional variants by racial/ethnic group.
Table 1. The Descriptive Characteristics of the Multiethnic Case-control Studies of Prostate Cancer.
Prostate Cancer | n (Cases/Controls) | Age (years; mean(sd)) (Cases/Controls) | n (%) (Advanced/Non-advanced)
All Groups | 4376/7545 | 70(7.2)/68(8.6) | 499(11)/3666(84)
European Americans | 879/1682 | 69(7.7)/68(8.9) | 100(11)/749(85)
African Americans | 1117/2146 | 70(7.3)/69(8.4) | 116(10)/932(83)
Latinos | 1190/1302 | 69(6.6)/67(7.8) | 145(12)/986(83)
Japanese Americans | 1022/2012 | 72(7.4)/69(8.6) | 114(11)/863(84)
Native Hawaiians | 168/403 | 69(6.7)/64(8.6) | 24(14)/136(81)
Table 2. The Most Significant Associations of Single Coding Variants with Prostate Cancer Risk
All Cases (n=4,376) vs Controls (n=7,545)
SNP ID^a | Chr | Position^b | rs# | A1/A2^c | Type | Gene | OR^d | P | AA MAF^e | NH MAF | JA MAF | LA MAF | EA MAF
exm514211 | 6 | 6266854 | rs140712764 | T/C | Val170Ile | F13A1 | 28.007 | 9.1E-07 | 0.000233 | 0 | 0 | 0 | 0
exm199465 | 2 | 70052624 | rs146778617 | T/G | Val315Phe | ANXA4 | 4.523 | 6.0E-06 | 0.002563 | 0 | 0 | 0 | 0
exm574153 | 6 | 117130704 | rs2274911 | G/A | Pro91Ser | GPRC6A | 0.875 | 1.3E-05 | 0.2379 | 0.2717 | 0.4332 | 0.2657 | 0.2542
exm68152 | 1 | 70896038 | rs145785987 | C/T | Cys229Arg | CTH | 9.011 | 3.1E-05 | 0.000699 | 0 | 0 | 0 | 0
exm971959 | 11 | 134128968 | NA | G/A | Ser186Asn | ACAD8 | >999.999 | 3.2E-05 | 0 | 0 | 0 | 0 | 0
exm1105738 | 14 | 61180657 | rs3742636 | T/G | His605Pro | SIX4 | 1.125 | 4.0E-05 | 0.4256 | 0.2742 | 0.4688 | 0.2479 | 0.2912
exm1474666 | 19 | 43414890 | rs116433230 | A/C | Gly183Val | PSG6 | 0.223 | 4.3E-05 | 0.01235 | 0 | 0 | 0.001536 | 0
exm506442 | 5 | 176637576 | rs28932178 | C/T | Ser457Pro | NSD1 | 0.871 | 6.1E-05 | 0.1051 | 0.4243 | 0.5249 | 0.217 | 0.1507
exm1507288 | 19 | 55527081 | rs2304167 | C/T | Ala249Thr | GP6 | 0.879 | 7.0E-05 | 0.4653 | 0.2903 | 0.2239 | 0.1889 | 0.1819
exm1478994 | 19 | 45296806 | rs3208856 | T/C | His359Tyr | CBLC | 0.687 | 7.3E-05 | 0.04613 | 0.008685 | 0.000497 | 0.01882 | 0.04667

Advanced Cases (n=499) vs Controls (n=7,545)
SNP ID^a | Chr | Position^b | rs# | A1/A2^c | Type | Gene | OR^d | P | AA MAF^e | NH MAF | JA MAF | LA MAF | EA MAF
exm280349 | 2 | 239049718 | NA | A/G | Arg108His | KLHL30 | 13.991 | 1.7E-09 | 0 | 0 | 0 | 0 | 0.002081
exm1488544 | 19 | 49376683 | rs45533432 | G/A | Arg65Gly | PPP1R15A | 4.677 | 1.2E-08 | 0.002097 | 0.002481 | 0 | 0.00384 | 0.009512
exm643590 | 7 | 100634145 | rs143984295 | A/G | Ala101Thr | MUC12 | 14.425 | 1.5E-08 | 0 | 0 | 0 | 0.000384 | 0.001486
exm701486 | 8 | 55540418 | rs114797722 | C/G | Ala1326Pro | RP1 | 13.409 | 2.0E-08 | 0.001631 | 0 | 0 | 0 | 0
exm782688 | 9 | 130224593 | rs150292099 | G/A | Val157Ile | LRSAM1 | 10.488 | 3.5E-07 | 0.002097 | 0 | 0 | 0 | 0
exm2275251 | 17 | 58235051 | rs185658468 | T/A | SP | CA4 | 7.137 | 5.1E-07 | 0 | 0.002481 | 0.003976 | 0 | 0
exm1321007 | 17 | 39520119 | rs150620728 | T/C | Arg395His | KRT33B | 7.485 | 5.8E-07 | 0.002816 | 0 | 0 | 0.001152 | 0
exm942022 | 11 | 75439894 | rs141331999 | G/A | Asn237Ser | MOGAT2 | 8.489 | 9.4E-07 | 0.000932 | 0 | 0 | 0.000384 | 0.001784
exm594160 | 6 | 167343185 | rs35716361 | T/G | Ser221Tyr | RNASET2 | 8.129 | 1.1E-06 | 0.00303 | 0 | 0 | 0 | 0
exm1607984 | 22 | 38483189 | NA | G/A | Ser401Pro | BAIAP2L2 | 6.084 | 2.4E-06 | 0.001865 | 0 | 0.001244 | 0.001923 | 0.0002973

Non-Advanced Cases (n=3,666) vs Controls (n=7,545)
SNP ID^a | Chr | Position^b | rs# | A1/A2^c | Type | Gene | OR^d | P | AA MAF^e | NH MAF | JA MAF | LA MAF | EA MAF
exm514211 | 6 | 6266854 | rs140712764 | T/C | Val170Ile | F13A1 | 28.366 | 8.3E-07 | 0.000233 | 0 | 0 | 0 | 0
exm1228070 | 16 | 25255366 | rs61746620 | A/G | Ala574Val | ZKSCAN2 | 13.396 | 1.3E-05 | 0.0002331 | 0 | 0.0002485 | 0 | 0
exm199465 | 2 | 70052624 | rs146778617 | T/G | Val315Phe | ANXA4 | 4.275 | 3.4E-05 | 0.002563 | 0 | 0 | 0 | 0
exm574153 | 6 | 117130704 | rs2274911 | G/A | Pro91Ser | GPRC6A | 0.876 | 4.1E-05 | 0.2379 | 0.2717 | 0.4332 | 0.2657 | 0.2542
exm1311040 | 17 | 32688826 | rs138527286 | C/T | Ile56Val | CCL1 | 2.343 | 4.1E-05 | 0.01072 | 0 | 0 | 0 | 0
exm68152 | 1 | 70896038 | rs145785987 | C/T | Cys229Arg | CTH | 8.761 | 6.3E-05 | 0.000699 | 0 | 0 | 0 | 0
exm1105738 | 14 | 61180657 | rs3742636 | T/G | His605Pro | SIX4 | 1.127 | 8.6E-05 | 0.4256 | 0.2742 | 0.4688 | 0.2479 | 0.2912
exm392074 | 4 | 22390167 | rs9002 | T/C | Val1043Met | GPR125 | 1.139 | 1.2E-04 | 0.2823 | 0.134 | 0.1886 | 0.2342 | 0.202
exm1419304 | 19 | 8564474 | rs4239541 | T/G | Pro73Gln | PRAM1 | 1.133 | 1.5E-04 | 0.6842 | 0.1563 | 0.1499 | 0.3263 | 0.2701
exm854616 | 10 | 106025864 | rs116993524 | T/C | Thr163Ile | GSTO1 | 3.756 | 1.6E-04 | 0.000932 | 0 | 0 | 0 | 0.002378
AA, African Americans; NH, Native Hawaiians; JA, Japanese Americans; LA, Latinos; EA, European Americans; SP, splice-site variant.
^a SNP ID from db135.
^b Position based on GRCh37.
^c A1 is the minor allele based on the entire multiethnic sample and the tested allele; A2 is the reference allele.
^d Odds ratio per allele based on the pooled analysis adjusted for age and the first 10 principal components.
^e MAF is minor allele frequency in controls.
3.3 A meta-analysis of 87,040 individuals identifies 23 new
susceptibility loci for prostate cancer
Genome-wide association studies (GWAS) have identified 76 variants associated with
prostate cancer risk predominantly in populations of European ancestry. To identify
additional susceptibility loci, we conducted a meta-analysis of >10 million SNPs in
43,303 prostate cancer cases and 43,737 controls from studies in populations of
European, African, Japanese and Latino ancestry. In each study, imputation was
performed using a cosmopolitan reference panel from the 1000 Genomes Project (March
2012). Across the various studies, 5.8–16.8 million genotyped and imputed SNPs, as well
as insertion-deletion variants with a frequency of ≥1%, were examined in association
with prostate cancer risk. Twenty-three new susceptibility loci were identified at
association P < 5 × 10^{−8}; 15 variants were identified among men of European ancestry, 7
were identified in multi-ancestry analyses and 1 was associated with early-onset prostate
cancer. These 23 variants, in combination with known prostate cancer risk variants,
explain 33% of the familial risk for this disease in European-ancestry populations. These
findings provide new regions for investigation into the pathogenesis of prostate cancer
and demonstrate the usefulness of combining ancestrally diverse populations to discover
risk loci for disease. More details can be found in the paper published in Nature Genetics
(Al Olama et al., 2014).
In this study, I conducted statistical analyses for the African Ancestry Prostate
Cancer GWAS Consortium (AAPC), GWAS in the Japanese (JAPC) and Latino (LAPC)
populations from the Multiethnic Cohort. The outcomes of interest include overall
prostate cancer risk (cases vs. controls) and aggressive prostate cancer risk (aggressive
cases vs. non-aggressive cases). For each study/consortium, I performed quality control
on the imputed data (quality score ≥0.3 and minor allele frequency ≥1%) followed by a
genome-wide scan on the high-density variants, adjusted for age, global ancestry (the first
10 principal components) and study in the model as they are potential confounders. I
checked the concordance between the genotyped and imputed data by comparing the
allele frequencies and effect sizes of overlapping alleles. I evaluated genomic
inflation using quantile-quantile (Q-Q) plots, and visualized the genome-wide association
results using Manhattan plots. Furthermore, I contributed to the identification of novel risk
loci through summarizing meta-analysis results, estimating the correlation with known
risk loci, and conducting conditional analysis in AAPC, JAPC and LAPC.
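As a rough illustration of the genomic inflation check mentioned above, the following Python sketch computes a genomic inflation factor (lambda) from a list of association p-values: each two-sided p-value is converted back to a 1-degree-of-freedom chi-square statistic, and the observed median is divided by the median expected under the null. The function name and the toy p-values are hypothetical and are not taken from the analyses reported here, which may have used different software.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
# If Z is standard normal, P(Z^2 <= m) = 0.5 exactly when Phi(sqrt(m)) = 0.75.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Return lambda: median observed 1-df chi-square over its null median.

    Assumes every p-value lies strictly between 0 and 1 (or equals 1).
    """
    norm = NormalDist()
    # Convert each two-sided p-value to the corresponding chi-square statistic.
    chi2 = [norm.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# Hypothetical p-values spread evenly over (0, 1), as expected under the null.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
lambda_null = genomic_inflation(null_p)

# Hypothetical inflated scan: the same p-values pushed towards zero.
lambda_inflated = genomic_inflation([p * p for p in null_p])
```

A value close to 1 suggests little inflation, while values clearly above 1 point to residual stratification or other systematic bias.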
References
Al Olama, A. A., Kote-Jarai, Z., Berndt, S. I., et al. (2014). A meta-analysis of 87,040
individuals identifies 23 new susceptibility loci for prostate cancer. Nat Genet.
Bouchard, L., Faucher, G., Tchernof, A., et al. (2009). Association of OSBPL11 gene
polymorphisms with cardiovascular disease risk factors in obesity. Obesity (Silver
Spring) 17, 1466-1472.
Chen, F., Stram, D. O., Le Marchand, L., et al. (2010). Caution in generalizing known
genetic risk markers for breast cancer across all ethnic/racial populations. Eur J Hum
Genet 19, 243-245.
Chen, Z. J., Zhao, H., He, L., et al. (2011). Genome-wide association study identifies
susceptibility loci for polycystic ovary syndrome on chromosome 2p16.3, 2p21 and
9q33.3. Nat Genet 43, 55-59.
Dickson, S. P., Wang, K., Krantz, I., Hakonarson, H., and Goldstein, D. B. (2010). Rare
variants create synthetic genome-wide associations. PLoS Biol 8, e1000294.
Eeles, R. A., Olama, A. A. A., Benlloch, S., et al. (2012). Identification of 23 novel
prostate cancer susceptibility loci using a custom array (the iCOGS) in an international
consortium, PRACTICAL. In review for Nature Genetics.
Egan, K. M., Thompson, R. C., Nabors, L. B., et al. (2011). Cancer susceptibility variants
and the risk of adult glioma in a US case-control study. J Neurooncol 104, 535-542.
Ewing, C. M., Ray, A. M., Lange, E. M., et al. (2012). Germline mutations in HOXB13
and prostate-cancer risk. N Engl J Med 366, 141-149.
Gudmundsson, J., Besenbacher, S., Sulem, P., et al. (2010). Genetic correction of PSA
values using sequence variants associated with PSA levels. Sci Transl Med 2, 62ra92.
Haiman, C. A., Han, Y., Feng, Y., et al. (2013). Genome-wide testing of putative
functional exonic variants in relationship with breast and prostate cancer risk in a
multiethnic population. PLoS Genet 9, e1003419.
Hauck, W., and Donner, A. (1977). Wald's Test as Applied to Hypotheses in Logit
Analysis. JASA 72, 851-853.
Hollink, I. H., van den Heuvel-Eibrink, M. M., Arentsen-Peters, S. T., et al. (2011).
NUP98/NSD1 characterizes a novel poor prognostic group in acute myeloid leukemia
with a distinct HOX gene expression pattern. Blood 118, 3645-3656.
King, M. C., Marks, J. H., and Mandell, J. B. (2003). Breast and ovarian cancer risks due
to inherited mutations in BRCA1 and BRCA2. Science 302, 643-646.
Klein, R. J., Hallden, C., Cronin, A. M., et al. (2010). Blood biomarker levels to aid
discovery of cancer-related single-nucleotide polymorphisms: kallikreins and prostate
cancer. Cancer Prev Res (Phila) 3, 611-619.
Kolonel, L. N., Henderson, B. E., Hankin, J. H., et al. (2000). A multiethnic cohort in
Hawaii and Los Angeles: baseline characteristics. Am J Epidemiol 151, 346-357.
Kote-Jarai, Z., Amin Al Olama, A., Leongamornlert, D., et al. (2011). Identification of a
novel prostate cancer susceptibility variant in the KLK3 gene transcript. Hum Genet 129,
687-694.
Laramie, J. M., Wilk, J. B., Williamson, S. L., et al. (2008). Polymorphisms near EXOC4
and LRGUK on chromosome 7q32 are associated with Type 2 Diabetes and fasting
glucose; the NHLBI Family Heart Study. BMC Med Genet 9, 46.
Lin, L. L., Huang, H. C., and Juan, H. F. (2012). Revealing the molecular mechanism of
gastric cancer marker annexin A4 in cancer cell proliferation using exon arrays. PLoS
One 7, e44615.
Matsuyama, T., Ishikawa, T., Mogushi, K., et al. (2010). MUC12 mRNA expression is an
independent marker of prognosis in stage II and stage III colorectal cancer. Int J Cancer
127, 2292-2299.
Morrissey, C., True, L. D., Roudier, M. P., et al. (2008). Differential expression of
angiogenesis associated genes in prostate cancer bone, liver and lymph node metastases.
Clin Exp Metastasis 25, 377-388.
Parikh, H., Wang, Z., Pettigrew, K. A., et al. (2011). Fine mapping the KLK3 locus on
chromosome 19q13.33 associated with prostate cancer susceptibility and PSA levels.
Hum Genet 129, 675-685.
Pi, M., Parrill, A. L., and Quarles, L. D. (2010). GPRC6A mediates the non-genomic
effects of steroids. J Biol Chem 285, 39953-39964.
Pi, M., and Quarles, L. D. (2012). GPRC6A regulates prostate cancer progression.
Prostate 72, 399-409.
Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich,
D. (2006). Principal components analysis corrects for stratification in genome-wide
association studies. Nat Genet 38, 904-909.
Purcell, S., Neale, B., Todd-Brown, K., et al. (2007). PLINK: a tool set for whole-
genome association and population-based linkage analyses. Am J Hum Genet 81, 559-
575.
Takata, R., Akamatsu, S., Kubo, M., et al. (2010). Genome-wide association study
identifies five new susceptibility loci for prostate cancer in the Japanese population. Nat
Genet 42, 751-754.
Takeuchi, F., Isono, M., Nabika, T., et al. (2011). Confirmation of ALDH2 as a Major
locus of drinking behavior and of its variants regulating multiple metabolic phenotypes in
a Japanese population. Circ J 75, 911-918.
Chapter 4 Generalizability of established prostate cancer risk
variants in men of African ancestry
This paper was published in the International Journal of Cancer (Han et al., 2014).
Ying Han^1, Lisa B. Signorello^{2,3}, Sara S. Strom^4, Rick A. Kittles^5, Benjamin A. Rybicki^6,
Janet L. Stanford^7, Phyllis J. Goodman^8, Sonja I. Berndt^9, John Carpten^{10}, Graham
Casey^{1,11}, Lisa Chu^{12}, David V. Conti^1, Kristin A. Rand^1, W. Ryan Diver^{13}, Anselm JM
Hennis^{14,15,16,17}, Esther M. John^{12,18}, Adam S. Kibel^{19}, Eric A. Klein^{20}, Suzanne Kolb^7,
Loic Le Marchand^{21}, M. Cristina Leske^{14}, Adam B. Murphy^{22}, Christine Neslund-Dudas^6,
Jong Y. Park^{23}, Curtis Pettaway^{24}, Timothy R. Rebbeck^{25}, Susan M. Gapstur^{13}, S. Lilly
Zheng^{26}, Suh-Yuh Wu^{14}, John S. Witte^{27}, Jianfeng Xu^{26}, William Isaacs^{28}, Sue A.
Ingles^1, Ann Hsing^{12,18}, The PRACTICAL Consortium^{29}, The ELLIPSE GAME-ON
Consortium^{30}, Douglas F. Easton^{31}, Rosalind A. Eeles^{32,33}, Fredrick R. Schumacher^{1,11},
Stephen Chanock^9, Barbara Nemesure^{14}, William J. Blot^{34,35,36}, Daniel O. Stram^1, Brian E.
Henderson^{1,11}, Christopher A. Haiman^{1,11}
1 Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
2 Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA
3 Dana-Farber/Harvard Cancer Center, Boston, MA, USA
4 Department of Epidemiology, Division of Cancer Prevention and Population Sciences, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
5 Department of Medicine, University of Illinois at Chicago, Chicago, IL, USA
6 Department of Public Health Sciences, Henry Ford Hospital, Detroit, MI, USA
7 Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
8 SWOG Statistical Center, Seattle, WA, USA
9 Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
10 The Translational Genomics Research Institute, Phoenix, AZ, USA
11 Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, USA
12 Cancer Prevention Institute of California, Fremont, CA, USA
13 Epidemiology Research Program, American Cancer Society, Atlanta, GA, USA
14 Department of Preventive Medicine, Stony Brook University, Stony Brook, NY, USA
15 Chronic Disease Research Centre, University of the West Indies, Bridgetown, Barbados
16 Faculty of Medical Sciences, University of the West Indies, Bridgetown, Barbados
17 Ministry of Health, Bridgetown, Barbados
18 Division of Epidemiology, Department of Health Research & Policy, and Stanford Cancer Institute, Stanford University School of Medicine, Stanford, CA, USA
19 Division of Urologic Surgery, Brigham and Women’s Hospital, Dana-Farber Cancer Institute, Boston, MA, USA
20 Glickman Urologic and Kidney Institute, Cleveland Clinic, Cleveland, OH, USA
21 Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, USA
22 Department of Urology, Northwestern University, Chicago, IL, USA
23 Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, USA
24 Department of Urology, The University of Texas M.D. Anderson Cancer Center, Houston, TX, USA
25 University of Pennsylvania School of Medicine and the Abramson Cancer Center, Philadelphia, PA, USA
26 Center for Cancer Genomics, Wake Forest University School of Medicine, Winston-Salem, NC, USA
27 Institute for Human Genetics, Departments of Epidemiology and Biostatistics and Urology, University of California, San Francisco, San Francisco, CA, USA
28 James Buchanan Brady Urological Institute, Johns Hopkins Hospital and Medical Institutions, Baltimore, MD, USA
29 A full list of members is provided in (Eeles et al., 2013)
30 http://epi.grants.cancer.gov/gameon/
31 Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge, UK
32 The Institute of Cancer Research, London and Sutton, UK
33 Royal Marsden National Health Service Foundation Trust, London and Sutton, UK
34 Division of Epidemiology, Department of Medicine, Vanderbilt Epidemiology Center, Vanderbilt University School of Medicine, Nashville, TN, USA
35 The Vanderbilt-Ingram Cancer Center, Vanderbilt University School of Medicine, Nashville, TN, USA
36 International Epidemiology Institute, Rockville, MD, USA
Novelty & Impact Statements:
In the largest study to date, we found the vast majority of established prostate cancer risk
variants also contribute to prostate cancer risk in men of African ancestry. A subset of
variants were also found to be informative for prostate cancer risk modeling in this
population. These findings motivate further genomic characterization to understand the
contribution of these loci to risk in this population which may be influenced by linkage
disequilibrium patterns.
Corresponding Author:
Christopher A. Haiman
Harlyne Norris Research Tower
1450 Biggy Street, Room 1504
Los Angeles, CA 90033
Telephone: (323) 442-7755
Fax: (323) 442-7749
E-mail: haiman@usc.edu
Conflict of Interest^{1,2}
^1 We declare that Rosalind A. Eeles has received educational grants from Tepnel (now GenProbe), Vista,
^2 We declare that Adam S. Kibel is on the advisory board of Dendron and Sanofi-Aventis.
4.1 Abstract
Genome-wide association studies have identified more than eighty risk variants for
prostate cancer, mainly in European or Asian populations. The generalizability of these
variants in other racial/ethnic populations needs to be understood before the loci can be
utilized widely in risk modeling. In this study, we examined 82 previously reported risk
variants in 4,853 prostate cancer cases and 4,678 controls of African ancestry. We
performed association testing for each variant using logistic regression adjusted for age,
study, and global ancestry. Of the 82 known risk variants, 68 (83%) had effects that were
directionally consistent in their association with prostate cancer risk and 30 (37%) were
significantly associated with risk at p<0.05, with the most statistically significant variants
being rs116041037 (p=3.7×10^{-26}) and rs6983561 (p=1.1×10^{-16}) at 8q24, as well as
rs7210100 (p=5.4×10^{-8}) at 17q21. By exploring each locus in search of better markers,
the number of variants that captured risk in men of African ancestry (p<0.05) increased
from 30 (37%) to 44 (54%). An aggregate score comprised of these 44 markers was
strongly associated with prostate cancer risk (per-allele odds ratio (OR)=1.12, p=7.3×10^{-98}).
In summary, the consistent directions of effects for the vast majority of variants in
men of African ancestry indicate common functional alleles that are shared across
populations. Further exploration of these susceptibility loci is needed to identify the
underlying biologically relevant variants to improve prostate cancer risk modeling in
populations of African ancestry.
4.2 Introduction
Prostate cancer is the most common non-skin cancer and the second leading cause of
cancer death for men in the United States. In African Americans, the incidence rate is 1.6
times that in European Americans and the mortality rate is 2.5 times greater (Brawley,
2012). Reasons for the greater disease burden, which has also been suggested in
Africa (Jemal et al., 2011), are not known. However, studies have revealed that some risk
variants are more common in men of African ancestry than in other racial/ethnic
populations (Haiman et al., 2011a; Haiman et al., 2011b), suggesting a genetic basis for
the greater disease burden. Genome-wide association studies (GWAS) and large-scale
collaborative replication efforts have identified more than eighty prostate cancer risk
variants, mainly in populations of European or Asian ancestry (Akamatsu et al., 2012; Al
Olama et al., 2009; Al Olama et al., 2012; Eeles et al., 2009; Eeles et al., 2013;
Gudmundsson et al., 2009; Gudmundsson et al., 2008; Haiman et al., 2011b; Haiman et
al., 2007; Jia et al., 2009; Kote-Jarai et al., 2011b; Lindstrom et al., 2011; Lindstrom et
al., 2012; Schumacher et al., 2011; Sun et al., 2009; Xu et al., 2012). Direct testing of
these variants in other populations will be required to characterize the risk conveyed by
these loci globally. In an earlier study in African Americans (3,425 cases and 3,290
controls) we examined 49 risk variants in 28 regions, and found that 71% of the variants
(n=35) had effects that were directionally consistent with the initial reports (Haiman et al.,
2011a). Similar results were also noted in a study by Chang et al. (2011),
which included many of these same participating studies. Although preliminary, these
findings suggest that the majority of risk alleles found to date in other populations are
also common in African Americans. In the present study, we continue to investigate the
question of risk locus generalizability in African-ancestry populations, through testing a
more comprehensive set of risk variants (n=82) in a larger sample of prostate cancer
cases (n=4,853) and controls (n=4,678) of African ancestry, which includes the subjects
from our previous investigation. For each variant, we compared the magnitude of
association and risk allele frequency between the African-ancestry population and the
initial GWAS population.
We also modeled prostate cancer risk based on a cumulative score of associated alleles.
4.3 Materials and Methods
4.3.1 Study Populations
We assembled a consortium of prostate cancer studies that included men of African
ancestry and conducted a GWAS to search for additional risk loci that may be more
common in men of African descent. Initial findings from the GWAS have been reported
previously (Haiman et al., 2011a; Haiman et al., 2011b). The current study of
prostate cancer in men of African ancestry includes 5,096 cases and 4,972 controls (see
Supplementary Note), the vast majority of which are African Americans (95%). This
sample includes 11 studies that were part of our original investigation (cases/controls:
Multiethnic Cohort, 1,094/1,096; The Southern Community Cohort Study, 212/419; The
Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial, 286/269; The Cancer
Prevention Study II Nutrition Cohort, 76/152; Prostate Cancer Case-Control Studies at
MD Anderson, 543/474; Identifying Prostate Cancer Genes, 368/172; The Los Angeles
Study of Aggressive Prostate Cancer, 296/303; Prostate Cancer Genetics Study, 75/85;
Case-Control Study of Prostate Cancer among African Americans in Washington, DC,
292/359; King County (Washington) Prostate Cancer Study, 145/81; and The Gene-
Environment Interaction in Prostate Cancer Study, 234/92), as well as three additional
studies (cases/controls: The North Carolina Prostate Cancer Study, 216/249; Selenium
and Vitamin E Cancer Prevention Trial, 223/224; and Prostate Cancer in a Black
Population, 238/231) and additional samples from two of the original studies
(cases/controls: Multiethnic Cohort, 747/662; and The Southern Community Cohort
Study, 51/104). Institutional review board approval was obtained for all participating
studies.
4.3.2 Genotyping and Quality Control
Genotyping of the 10,068 samples (7,123 included in our previous report plus 2,945
additional samples) was conducted using the Illumina Infinium Human1M-Duo bead
array. Samples (n=537) were removed based on the following exclusion criteria: (i)
unknown replicates across studies, (ii) call rates <95%, (iii) >10% mean heterozygosity
on the X chromosome and/or <10% mean intensity on the Y chromosome, (iv) ancestry
outliers (>4 standard deviations from the mean of eigenvector 1 or 2 as calculated using
EIGENSTRAT (Price et al., 2006)), and (v) samples that were related (1 from each group:
monozygotic twin, parent-offspring, full- and half-sibling pairs as estimated in PLINK
(http://pngu.mgh.harvard.edu/purcell/plink/)). To assess genotyping reproducibility, we
included 215 replicate samples; the concordance rate was ≥99.9% for all pairs. The final
analysis included 4,853 cases and 4,678 controls (Supplementary Table S1). Of the 82
single-nucleotide polymorphisms (SNP) under investigation, 69 were genotyped and all
had call rates >99%.
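To make exclusion criterion (iv) concrete, the sketch below flags samples lying more than 4 standard deviations from the mean of either of the first two eigenvectors. It is only an illustration: the function name and the toy eigenvector values are hypothetical, the use of the sample standard deviation is an assumption, and the study itself derived the eigenvectors with EIGENSTRAT.

```python
from statistics import mean, stdev


def ancestry_outliers(pc1, pc2, n_sd=4.0):
    """Return sorted indices of samples more than n_sd SDs from the mean
    of eigenvector 1 or eigenvector 2."""
    flagged = set()
    for values in (pc1, pc2):
        centre = mean(values)
        spread = stdev(values)  # sample SD (assumption; not stated in the text)
        for i, v in enumerate(values):
            if abs(v - centre) > n_sd * spread:
                flagged.add(i)
    return sorted(flagged)


# Hypothetical eigenvector values: 50 tightly clustered samples plus one
# sample (index 50) far from the rest on the first eigenvector.
pc1 = [0.01 * (i % 5) for i in range(50)] + [5.0]
pc2 = [0.01 * (i % 3) for i in range(51)]
outliers = ancestry_outliers(pc1, pc2)
```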
4.3.3 SNP Imputation
In order to test the established prostate cancer risk variants that were not directly
genotyped and to further explore the known risk loci, we performed imputation using the
software IMPUTE2 (Howie, Donnelly and Marchini, 2009). Phased haplotype data from a
multiethnic reference panel of 1,092 individuals in the 1000 Genomes Project (March 2012
release) were used to infer linkage disequilibrium (LD) patterns in order to impute
missing markers. All 13 of the remaining variants were well imputed as indicated by the
“info score”, an imputation quality metric (Howie et al., 2009). For these SNPs, the mean
info score generated by IMPUTE2 was 0.97 with a range of 0.88 to 1.
4.3.4 Statistical Analysis
We tested 82 prostate cancer risk variants in 54 regions (with some regions having more
than one variant associated with risk) that were identified in previous GWAS. All 82
variants were weakly correlated with each other (r^2<0.2 in EUR/AFR in 1000 Genomes
Project). For some regions, such as 8q24, not all previously reported SNPs were
presented because of high correlations with SNPs that were presented. For each SNP,
the per-allele odds ratio (OR) and 95% confidence interval (CI) were estimated using
unconditional logistic regression adjusted for age (i.e. at diagnosis for cases and at blood
draw or reference date for controls), study, and the first 10 principal components of
global ancestry. We tested for allele dosage effects through a 1-degree of freedom Wald
test. In the results and discussion section, “directionally consistent” refers to the direction
of the association (OR) and not statistical significance (p-value).
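For readers less familiar with the quantities reported here, the following minimal sketch shows how a per-allele OR, its 95% CI and a two-sided 1-degree-of-freedom Wald p-value follow from a log-odds coefficient and its standard error. The function name and the numeric inputs are hypothetical; in the study these coefficients came from logistic regression models adjusted for age, study and the first 10 principal components, which this sketch does not fit.

```python
import math


def wald_summary(beta, se, z_crit=1.959964):
    """Per-allele OR, 95% CI and two-sided Wald p-value.

    beta is the log-odds coefficient for allele dosage; se is its standard error.
    """
    odds_ratio = math.exp(beta)
    ci = (math.exp(beta - z_crit * se), math.exp(beta + z_crit * se))
    z = beta / se
    # Two-sided normal tail probability; equivalent to a 1-df chi-square test on z^2.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return odds_ratio, ci, p_value


# Hypothetical coefficient: log(1.2) per risk allele with standard error 0.05.
example_or, example_ci, example_p = wald_summary(math.log(1.2), 0.05)
```

Because "directionally consistent" refers only to the sign of beta (OR above or below 1), a variant can be directionally consistent even when the confidence interval includes 1.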
We examined each locus in search of better markers of risk in this population
using genotyped and well-imputed (info score≥0.8) common variants (minor allele
70
frequency≥1%). For each known risk variant (referred to as index SNP), we examined all
highly correlated variants (r
2
≥0.8) in the racial/ethnic population in which the original
discovery was made. A marker that better captures risk in men of African ancestry
(referred to as an AA marker) was defined as a variant that was nominally significant
(p<0.05), more statistically significant than the index SNP, and had a larger effect
size (OR) than the index SNP.
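The selection rule described above can be sketched as a simple filter. This is an illustrative reading of the criteria, not the code used in the study: the `Variant` class and `pick_aa_marker` function are hypothetical names, ORs are assumed to be oriented to the risk allele, and choosing the most significant eligible variant when several qualify is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    rsid: str
    odds_ratio: float      # per-allele OR in men of African ancestry
    p_value: float         # association p-value in men of African ancestry
    r2_with_index: float   # r² with the index SNP in the discovery population
    info: float = 1.0      # imputation info score (placeholder default)
    maf: float = 0.5       # minor allele frequency (placeholder default)


def pick_aa_marker(index: Variant, candidates: List[Variant]) -> Optional[Variant]:
    """Return the best candidate meeting the AA marker criteria, or None."""
    eligible = [
        v for v in candidates
        if v.r2_with_index >= 0.8          # highly correlated with the index SNP
        and v.info >= 0.8 and v.maf >= 0.01  # well-imputed, common variant
        and v.p_value < 0.05               # nominally significant
        and v.p_value < index.p_value      # more significant than the index SNP
        and v.odds_ratio > index.odds_ratio  # larger effect size
    ]
    # Assumption: when several variants qualify, keep the most significant one.
    return min(eligible, key=lambda v: v.p_value) if eligible else None


if __name__ == "__main__":
    # ORs, p-values and the European r² are the 12q13 values reported in the
    # Results; info and MAF are left at placeholder defaults.
    index = Variant("rs902774", 0.94, 0.28, 1.0)
    candidate = Variant("rs55958994", 1.17, 2.5e-4, 0.82)
    print(pick_aa_marker(index, [candidate]).rsid)
```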
For each index SNP and AA marker, we also examined the association with
prostate cancer risk by disease severity and performed a case-only analysis (aggressive
versus non-aggressive disease). Aggressive disease was defined as metastatic disease,
PSA>100 (ng/mL), Gleason Score ≥8 and/or prostate cancer as a cause of death (n=1,238
cases).
We modeled the cumulative genetic risk of prostate cancer using index SNPs
from previous GWAS (n=82). More specifically, we summed the number of risk alleles
for each individual as a genetic risk score, which is appropriate for unlinked variants with
independent effects of approximately the same magnitude for each allele, and estimated
the odds ratio per allele and by quintile for this aggregate score. For individuals missing
genotypes for a given SNP, we assigned the average number of risk alleles for that SNP
to replace the missing value. The vast majority of subjects (96.4%) had no missing
genotype for any SNP, with only two subjects missing ≥5% of the SNPs. We compared
the results to a model of risk-associated variants in men of African ancestry (OR>1.0 and
p<0.05), with index SNPs substituted by the AA markers when available (n=44). We
stratified the analysis by age (>65 versus ≤65 years) and tested for the interaction
between genetic risk score and age groups. We also examined the risk score by disease
severity.
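The construction of the risk score can be sketched as follows. This is a minimal illustration under stated assumptions: genotypes are coded as counts of risk alleles with `None` for missing values, the function names are hypothetical, and the quintile cut points are taken from the control distribution (an assumption suggested by the nearly equal control counts per quintile in Table 1, not stated explicitly in the text).

```python
import bisect
import statistics


def risk_scores(genotypes):
    """Sum risk-allele counts per person, mean-imputing missing genotypes.

    genotypes: list of per-person lists; each entry is the number of risk
    alleles (0, 1, 2, or an imputed dosage) at one SNP, or None if missing.
    """
    n_snps = len(genotypes[0])
    snp_means = []
    for j in range(n_snps):
        observed = [person[j] for person in genotypes if person[j] is not None]
        snp_means.append(sum(observed) / len(observed))
    return [
        sum(snp_means[j] if person[j] is None else person[j] for j in range(n_snps))
        for person in genotypes
    ]


def quintile_cutpoints(control_scores):
    """Four cut points splitting the control score distribution into fifths."""
    return statistics.quantiles(control_scores, n=5)


def assign_quintile(score, cutpoints):
    """Map a score to quintile 1..5 given the four cut points."""
    return bisect.bisect_left(cutpoints, score) + 1


if __name__ == "__main__":
    # Toy data: three people, three SNPs, two missing genotypes.
    toy = [[2, 1, None], [0, 1, 2], [1, None, 0]]
    print(risk_scores(toy))
```

The per-allele and per-quintile odds ratios would then be estimated by entering the score, or indicator variables for quintiles 2 to 5, into the same adjusted logistic regression model used for single variants.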
4.4 Results and Discussion
Of the 82 known risk alleles we examined, 68 (83%) were associated with increased
prostate cancer risk (OR>1.0) and 30 (37%) reached nominal statistical significance
(p<0.05) in men of African ancestry (Supplementary Table S2). The number of variants
with consistent directions of effects (68 out of 82) is statistically significantly more than
expected (p=5.7×10⁻¹⁰, one-tailed binomial test). The vast majority of the 82 variants had
consistent effects across the 14 study groups, with only 7 variants (9%) exhibiting
statistically significant heterogeneity (P_het<0.05; Supplementary Table S2), which is
slightly more than expected given the number of comparisons. In contrast to Europeans,
one allele at 8q24 (rs12543663, C) was significantly associated with reduced prostate
cancer risk (OR=0.86, 95% CI, 0.79-0.94, p=8.6×10⁻⁴), which is consistent with our
previous observation in a subset of these samples (Haiman et al., 2011a). Shown in Figure
1 is a comparison of the effect estimates from previous GWAS and those in men of
African ancestry. The marked directional consistency of effects for many variants
suggests that they are generalized markers of prostate cancer risk and that a common
functional variant is shared across populations at most susceptibility loci.
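The one-tailed binomial test for directional consistency quoted above can be checked with a few lines of standard-library Python. The sketch assumes a null probability of 0.5, that is, a variant with no effect in this population is equally likely to show an OR above or below 1; the function name is hypothetical.

```python
from math import comb


def binom_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): a one-tailed exact binomial test."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # 68 of 82 index variants had OR>1.0 in men of African ancestry.
    # Under the assumed null of 0.5 this should print a value near 5.7e-10.
    print(binom_upper_tail(68, 82))
```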
Among all tested variants, the most associated markers were rs116041037
(OR=2.23, p=3.7×10⁻²⁶) and rs6983561 (OR=1.29, p=1.1×10⁻¹⁶) at 8q24, as well as
rs7210100 (OR=1.40, p=5.4×10⁻⁸) at 17q21 (Supplementary Table S2), which are located
in the regions that we have shown to be the most significantly associated with prostate
cancer risk in our GWAS in men of African ancestry (Haiman et al., 2011b). Of these, the
risk alleles of rs116041037 (A) (Haiman et al., 2007) and rs7210100 (A) (Haiman et al.,
2011b) have only been found in populations of African ancestry with a frequency of 2-3%
and 4-5%, respectively.
Our sample size provided ≥80% power to detect the reported effect size (i.e. the
OR from the largest replication stage in previous GWAS) for 50 (61%) of the 82 variants
at a significance level of α=0.05 (Supplementary Fig. S1, Supplementary Table S2).
However, even with ≥80% power, 25 variants (50%) were not replicated at p<0.05,
which suggests that these variants might not be adequately correlated with the underlying
biologically relevant variant in populations of African descent as demonstrated in our
initial study (Haiman et al., 2011a).
To address this hypothesis, we examined each locus in search of markers that
might better capture risk in men of African ancestry (referred to as AA markers; see
Methods). An AA marker identified at 12q13 is shown in Figure 2 as an example.
The index SNP (rs902774), originally identified in a European GWAS (Schumacher et al.,
2011), was not significantly associated with risk in men of African ancestry (OR=0.94,
p=0.28) given 89% statistical power. The most significant variant (rs55958994;
OR=1.17, p=2.5×10⁻⁴) in this region is located 27kb downstream from the index SNP.
These two markers are well-correlated in Europeans (r²=0.82) but are uncorrelated in
Africans (r²=0.01), which suggests that rs55958994 is a better proxy of the underlying
biologically relevant variant in men of African ancestry. Among all loci, an AA marker
was identified for 21 index SNPs (Supplementary Table S3). Taking these AA markers
into account, 44 (54%) of the 82 known risk signals reached nominal statistical
significance (p<0.05).
The frequencies of the index risk alleles were slightly greater, on average, in men
of African ancestry than in GWAS populations of European or Asian ancestry
(Supplementary Table S4); the two African-specific variants were not considered in the
comparisons (rs116041037 and rs7210100). The mean risk allele frequency (RAF) was
4% greater, with 42 (53%) of the 80 index risk alleles being more common in men of
African ancestry than in the initial GWAS population. For the nominally significant risk-
associated index variants in men of African ancestry (OR>1.0 and p<0.05; n=28), which
are likely to be more strongly correlated with the functional alleles in this population, the
mean RAF difference was 8%. Similar differences were also observed when evaluating
the median RAFs and when incorporating AA markers (Supplementary Table S4).
Although based on only a subset of markers that were significantly associated with risk at
p<0.05 (n=28 index SNPs or n=42 index plus AA markers), there is a suggestion that the
risk alleles for prostate cancer at these loci may be more common in men of African
ancestry than in the initial GWAS population. This assertion will need to be reassessed
and formally tested once the underlying biologically functional alleles are discovered.
When considering the index SNPs and AA markers, only 5 (6%) of the 82 known
risk signals were more associated with aggressive disease based on the case-only analysis
(4 expected at α=0.05), with the top three variants being rs7141529 (P_het=0.0087) at
14q24, rs339331 (P_het=0.0093) at 6q22, and rs721048 (P_het=0.018) at 2p15
(Supplementary Table S5). Of these, rs339331 is linked to a non-synonymous variant in
the gene GPRC6A (Haiman et al., 2013); it has also been suggested to function by
regulating RFX6 expression through modulating HOXB13 chromatin binding (Huang et
al., 2014).
We further examined the cumulative effect of the risk signals through a composite
risk score (see Methods). Using the index SNPs (n=82), the risk per allele was 1.06 (95%
CI, 1.05-1.07, p=6.7×10⁻⁵³) and individuals in the top quintile of the risk score
distribution were at 2.7-fold greater risk compared to those in the lowest quintile (Table
1). As expected, the risk modeling was improved when incorporating AA markers and
restricting to variants that were significantly associated with risk in men of African
ancestry (n=44; Supplementary Table S6). The risk per allele was 1.12 (95% CI,
1.11-1.13, p=7.3×10⁻⁹⁸), with the risk comparing the top versus the lowest quintile being 3.7
(Table 1). In aggregate, these variants can stratify men more effectively than the strongest
known risk factor, a first-degree family history of prostate cancer, which has a relative
risk of ~2.0 (Ahn et al., 2008). Associations of similar strength were observed for
aggressive and non-aggressive prostate cancer (P_het=0.05; Table 1). When stratifying by
age, risk for younger men (age≤65) in the top quintile was 4.4 times that of those in the
lowest quintile, while the odds ratio for men older than 65 years of age was 2.9
(P_interaction=0.02;
Supplementary Table S6). While these variants are informative for stratifying prostate
cancer risk in men of African ancestry, their combined effects in each stratum are modest
and they have limited ability to differentiate aggressive versus non-aggressive disease.
Thus, their potential for predictive clinical utility remains limited. Together with
identifying and directly testing the biologically functional alleles at these known loci,
which is likely to improve population risk stratification, efforts are still needed to reveal
variants that are of particular importance and potentially unique to men of African
ancestry, such as those at 8q24(Haiman et al., 2007) and 17q21(Haiman et al., 2011b).
Larger-scale replication testing of variants from this GWAS in men of African ancestry is
underway as part of the NCI GAME-ON Consortium
(http://epi.grants.cancer.gov/gameon/), in an attempt to further discover loci that may
help us to better understand the higher risk of prostate cancer in this population as well as
to develop genetic risk prediction profiles that may be more suitable and tailored for this
population.
Compared to our previous study (case/control: 3,425/3,290) (Haiman et al.,
2011a), the current study has greater power because of increased sample size
(case/control: 4,853/4,678). Of the 38 SNPs that we have re-examined in this study, 25
(66%) were more significantly associated with prostate cancer risk than in our previous
study due to the greater sample size. Furthermore, we included 44 additional risk variants
that have been discovered since our last publication and imputed to 1000 Genomes
Project (versus HapMap in our previous study), which allowed for a more comprehensive
assessment of common variation at each locus.
To date, this is the largest study of prostate cancer in men of African ancestry to
examine the generalizability of risk with the established variants. These findings suggest
that the vast majority of currently known variants also contribute to prostate cancer risk
in men of African ancestry. Although power was <80% for 32 (39%) of the 82 variants
tested, the direction of effect sizes for these variants in men of African ancestry was
generally consistent with the previous reports. In exploring each locus, the number of
variants that were significantly associated with risk increased from 30 (37%) to 44 (54%).
Further fine-mapping of these susceptibility loci in larger multiethnic samples will be
required to reveal the underlying biologically relevant variants as well as the best set of
genetic markers for prostate cancer risk stratification in men of African ancestry.
4.5 Acknowledgements
The MEC and the genotyping in this study were supported by NIH grants CA63464,
CA54281, CA1326792, CA148085 and HG004726. Genotyping of the PLCO samples
was funded by the Intramural Research Program of the Division of Cancer Epidemiology
and Genetics, NCI, NIH. LAAPC was funded by grant 99-00524V-10258 from the
Cancer Research Fund, under Interagency Agreement #97-12013 (University of
California contract #98-00924V) with the Department of Health Services Cancer
Research Program. Cancer incidence data for the MEC and LAAPC studies have been
collected by the Los Angeles Cancer Surveillance Program of the University of Southern
California with Federal funds from the NCI, NIH, Department of Health and Human
Services, under Contract No. N01-PC-35139, and the California Department of Health
Services as part of the statewide cancer reporting program mandated by California Health
and Safety Code Section 103885, and grant number 1U58DP000807-3 from the Centers
for Disease Control and Prevention. KCPCS was supported by NIH grants CA056678,
CA082664 and CA092579, with additional support from the Fred Hutchinson Cancer
Research Center. MDA was supported by grants CA68578, ES007784, DAMD
W81XWH-07-1-0645, and CA140388. GECAP was supported by NIH grant ES011126. CaP Genes
was supported by CA88164 and CA127298. IPCG was supported by DOD grant
W81XWH-07-1-0122. DCPC was supported by NIH grant S06GM08016 and DOD
grants DAMD W81XWH-07-1-0203, DAMD W81XWH-06-1-0066 and DOD
W81XWH-10-1-0532. SCCS is funded by NIH grant CA092447. SCCS sample
preparation was conducted at the Epidemiology Biospecimen Core Lab that is supported
in part by the Vanderbilt-Ingram Cancer Center (CA68485). Data on SCCS cancer cases
used in this publication were provided by the Alabama Statewide Cancer Registry;
Kentucky Cancer Registry; Tennessee Department of Health, Office of Cancer
Surveillance; Florida Cancer Data System; North Carolina Central Cancer Registry,
North Carolina Division of Public Health; Georgia Comprehensive Cancer Registry;
Louisiana Tumor Registry; Mississippi Cancer Registry; South Carolina Central Cancer
Registry; Virginia Department of Health, Virginia Cancer Registry; Arkansas Department
of Health, Cancer Registry. The Arkansas Central Cancer Registry is fully funded by a
grant from National Program of Cancer Registries, Centers for Disease Control and
Prevention (CDC). Data on SCCS cancer cases from Mississippi were collected by the
Mississippi Cancer Registry which participates in the National Program of Cancer
Registries (NPCR) of the Centers for Disease Control and Prevention (CDC). The
contents of this publication are solely the responsibility of the authors and do not
necessarily represent the official views of the CDC or the Mississippi Cancer Registry.
The authors thank Drs. Christine Berg and Philip Prorok, Division of Cancer Prevention,
NCI, the screening center investigators and staff of the PLCO Cancer Screening Trial,
Mr. Thomas Riley and staff at Information Management Services, Inc., and Ms. Barbara
O’Brien and staff at Westat, Inc. for their contributions to the PLCO Cancer Screening
Trial. We also acknowledge the technical support of Marta Gielzak and Guifang Yan.
CPS-II is supported by the American Cancer Society. This work was also supported by
European Commission's Seventh Framework Programme grant agreement No. 223175
(HEALTH-F2-2009-223175), Cancer Research UK Grants C5047/A7357,
C1287/A10118, C5047/A3354, C5047/A10692, C16913/A6135, and The National
Institutes of Health (NIH) Cancer Post-Cancer GWAS initiative grant: No. 1 U19 CA
148537-01 (the GAME-ON initiative).
Table 1. A genetic risk score for prostate cancer in men of African ancestry.

                              Index markers       Risk-associated markers in men of African ancestry (n=44)
                              from GWAS (n=82)
Prostate cancer               Overall             Overall             Aggressiveᵃ         Non-aggressive

Per alleleᵇ
   N (cases/controls)         4,853/4,678         4,853/4,678         1,238/4,678         3,615/4,678
   OR (95% CI)ᶜ               1.06 (1.05-1.07)    1.12 (1.11-1.13)    1.14 (1.12-1.16)    1.12 (1.10-1.13)
   P-value                    6.7×10⁻⁵³           7.3×10⁻⁹⁸           7.5×10⁻⁵⁰           2.6×10⁻⁷⁷

Quintiles of risk allelesᵈ
Q1 N (cases/controls)         568/934             488/936             124/936             364/936
   OR (95% CI)ᶜ               1.0 (ref.)          1.0 (ref.)          1.0 (ref.)          1.0 (ref.)
   P-value                    -                   -                   -                   -
Q2 N (cases/controls)         725/937             658/935             161/935             497/935
   OR (95% CI)ᶜ               1.26 (1.09-1.46)    1.39 (1.19-1.61)    1.34 (1.04-1.74)    1.41 (1.19-1.67)
   P-value                    1.9×10⁻³            2.5×10⁻⁵            2.5×10⁻²            5.4×10⁻⁵
Q3 N (cases/controls)         993/935             893/936             220/936             673/936
   OR (95% CI)ᶜ               1.74 (1.51-2.01)    1.91 (1.65-2.22)    1.85 (1.44-2.36)    1.93 (1.64-2.27)
   P-value                    1.0×10⁻¹⁴           6.0×10⁻¹⁸           1.2×10⁻⁶            2.3×10⁻¹⁵
Q4 N (cases/controls)         1,061/936           1,105/935           281/935             824/935
   OR (95% CI)ᶜ               1.85 (1.61-2.13)    2.39 (2.07-2.77)    2.51 (1.98-3.20)    2.37 (2.02-2.78)
   P-value                    8.7×10⁻¹⁸           4.9×10⁻³²           6.6×10⁻¹⁴           3.1×10⁻²⁶
Q5 N (cases/controls)         1,506/936           1,709/936           452/936             1,257/936
   OR (95% CI)ᶜ               2.67 (2.33-3.06)    3.69 (3.20-4.26)    4.07 (3.23-5.13)    3.57 (3.06-4.18)
   P-value                    1.6×10⁻⁴⁴           1.9×10⁻⁷²           2.0×10⁻³²           1.3×10⁻⁵⁷

ᵃ Metastatic disease, PSA>100 (ng/mL), Gleason Score ≥8 and/or prostate cancer as a cause of death.
ᵇ Among controls, mean and range for the 82 index alleles is 77 (53-97); for the 44 risk-associated alleles the
mean and range is 40 (26-56).
ᶜ Odds ratio and 95% confidence interval adjusted for age, study, and global ancestry (the first 10
eigenvectors).
Figure 1. Effect Size Comparison of Known Risk Variants in Previous GWAS and
in Men of African Ancestry. The odds ratio (OR) and 95% confidence interval (CI) for
82 known risk variants in previous GWAS and in men of African ancestry (AA). For
SNPs reported in multi-stage GWAS, the OR and 95% CI from the largest replication
stage was used for comparison. The red dots represent ORs in this study; the blue
diamonds represent ORs in previous GWAS. The horizontal bars represent the
corresponding 95% CIs. For each tested allele, frequency and statistical power (%) in AA
are provided in the parentheses. The SNPs are sorted based on the ORs in AA. Detailed
information for each SNP is provided in Supplementary Table S2.
Figure 2. A Regional Association Plot of the Prostate Cancer Risk Locus at
Chromosome 12q13. The -log₁₀ p-values are from the association with prostate cancer
risk in men of African ancestry (AA). Squares are genotyped SNPs and circles are
imputed SNPs. The index SNP (rs902774), originally identified in a European GWAS, is
designated by a purple square. The r² shown is estimated in Europeans from 1000
Genomes Project (1000G EUR) in relation to rs902774. Grey symbols are SNPs not in
1000G EUR (r² cannot be estimated). The top red circle represents a better marker of risk
in AA (rs55958994) at this locus. The plot was generated using LocusZoom
(http://csg.sph.umich.edu/locuszoom/).
References
Ahn, J., Moslehi, R., Weinstein, S. J., Snyder, K., Virtamo, J., and Albanes, D. (2008).
Family history of prostate cancer and prostate cancer risk in the Alpha-Tocopherol, Beta-
Carotene Cancer Prevention (ATBC) Study. Int J Cancer 123, 1154-1159.
Akamatsu, S., Takata, R., Haiman, C. A., et al. (2012). Common variants at 11q12,
10q26 and 3p11.2 are associated with prostate cancer susceptibility in Japanese. Nat
Genet 44, 426-429, S421.
Al Olama, A. A., Kote-Jarai, Z., Giles, G. G., et al. (2009). Multiple loci on 8q24
associated with prostate cancer susceptibility. Nat Genet 41, 1058-1060.
Al Olama, A. A., Kote-Jarai, Z., Schumacher, F. R., et al. (2012). A meta-analysis of
genome-wide association studies to identify prostate cancer susceptibility loci associated
with aggressive and non-aggressive disease. Hum Mol Genet.
Brawley, O. W. (2012). Prostate cancer epidemiology in the United States. World J Urol
30, 195-200.
Chang, B. L., Spangler, E., Gallagher, S., et al. (2011). Validation of genome-wide
prostate cancer associations in men of African descent. Cancer Epidemiol Biomarkers
Prev 20, 23-32.
Eeles, R. A., Kote-Jarai, Z., Al Olama, A. A., et al. (2009). Identification of seven new
prostate cancer susceptibility loci through a genome-wide association study. Nat Genet
41, 1116-1121.
Eeles, R. A., Olama, A. A., Benlloch, S., et al. (2013). Identification of 23 new prostate
cancer susceptibility loci using the iCOGS custom genotyping array. Nat Genet 45, 385-
391, 391e381-382.
Gudmundsson, J., Sulem, P., Gudbjartsson, D. F., et al. (2009). Genome-wide association
and replication studies identify four variants associated with prostate cancer
susceptibility. Nat Genet 41, 1122-1126.
Gudmundsson, J., Sulem, P., Rafnar, T., et al. (2008). Common sequence variants on
2p15 and Xp11.22 confer susceptibility to prostate cancer. Nat Genet 40, 281-283.
Haiman, C. A., Chen, G. K., Blot, W. J., et al. (2011a). Characterizing genetic risk at
known prostate cancer susceptibility loci in African Americans. PLoS Genet 7, e1001387.
Haiman, C. A., Chen, G. K., Blot, W. J., et al. (2011b). Genome-wide association study
of prostate cancer in men of African ancestry identifies a susceptibility locus at 17q21.
Nat Genet 43, 570-573.
Haiman, C. A., Han, Y., Feng, Y., et al. (2013). Genome-wide testing of putative
functional exonic variants in relationship with breast and prostate cancer risk in a
multiethnic population. PLoS Genet 9, e1003419.
Haiman, C. A., Patterson, N., Freedman, M. L., et al. (2007). Multiple regions within
8q24 independently affect risk for prostate cancer. Nat Genet 39, 638-644.
Howie, B. N., Donnelly, P., and Marchini, J. (2009). A flexible and accurate genotype
imputation method for the next generation of genome-wide association studies. PLoS
Genet 5, e1000529.
Huang, Q., Whitington, T., Gao, P., et al. (2014). A prostate cancer susceptibility allele at
6q22 increases RFX6 expression by modulating HOXB13 chromatin binding. Nat Genet
46, 126-135.
Jemal, A., Bray, F., Center, M. M., Ferlay, J., Ward, E., and Forman, D. (2011). Global
cancer statistics. CA Cancer J Clin 61, 69-90.
Jia, L., Landan, G., Pomerantz, M., et al. (2009). Functional enhancers at the gene-poor
8q24 cancer-linked locus. PLoS Genet 5, e1000597.
Kote-Jarai, Z., Olama, A. A., Giles, G. G., et al. (2011). Seven prostate cancer
susceptibility loci identified by a multi-stage genome-wide association study. Nat Genet
43, 785-791.
Lindstrom, S., Schumacher, F., Siddiq, A., et al. (2011). Characterizing associations and
SNP-environment interactions for GWAS-identified prostate cancer risk markers--results
from BPC3. PLoS One 6, e17142.
Lindstrom, S., Schumacher, F. R., Campa, D., et al. (2012). Replication of five prostate
cancer loci identified in an Asian population--results from the NCI Breast and Prostate
Cancer Cohort Consortium (BPC3). Cancer Epidemiol Biomarkers Prev 21, 212-216.
Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich,
D. (2006). Principal components analysis corrects for stratification in genome-wide
association studies. Nat Genet 38, 904-909.
Schumacher, F. R., Berndt, S. I., Siddiq, A., et al. (2011). Genome-wide association
study identifies new prostate cancer susceptibility loci. Hum Mol Genet 20, 3867-3875.
Sun, J., Zheng, S. L., Wiklund, F., et al. (2009). Sequence variants at 22q13 are
associated with prostate cancer risk. Cancer Res 69, 10-15.
Xu, J., Mo, Z., Ye, D., et al. (2012). Genome-wide association study in Chinese men
identifies two new prostate cancer risk loci at 9q31.2 and 19q13.4. Nat Genet 44, 1231-
1235.
Chapter 5 Prostate cancer susceptibility in men of African
ancestry at 8q24
This paper was published in the Journal of the National Cancer Institute (Han et al., 2016b).
Ying Han¹,*, Kristin A. Rand¹,*, Dennis J. Hazelett¹, Sue A. Ingles¹,², Rick A. Kittles³,
Sara S. Strom⁴, Benjamin A. Rybicki⁵, Barbara Nemesure⁶, William B. Isaacs⁷, Janet L.
Stanford⁸,⁹, Wei Zheng¹⁰, Fredrick R. Schumacher¹,², Sonja I. Berndt¹¹, Zhaoming
Wang¹¹,¹², Jianfeng Xu¹³, Nadin Rohland¹⁴,¹⁵, David Reich¹⁴⁻¹⁶, Arti Tandon¹⁴,¹⁵, Bogdan
Pasaniuc¹⁷,¹⁸, Alex Allen¹⁴,¹⁵, Dominique Quinque¹⁴,¹⁵, Swapan Mallick¹⁴⁻¹⁶, Dimple
Notani¹⁹, Michael G. Rosenfeld¹⁹, Ranveer Singh Jayani¹⁹, Suzanne Kolb⁸, Susan M.
Gapstur²⁰, Victoria L. Stevens²⁰, Curtis A. Pettaway²¹, Edward D. Yeboah²²,²³, Yao
Tettey²²,²³, Richard B. Biritwum²²,²³, Andrew A. Adjei²²,²³, Evelyn Tay²²,²³, Ann
Truelove²⁴, Shelley Niwa²⁴, Anand P. Chokkalingam²⁵, Esther M. John²⁶,²⁷, Adam B.
Murphy²⁸, Lisa B. Signorello²⁹, John Carpten³⁰, M. Cristina Leske⁶, Suh-Yuh Wu⁶,
Anslem J.M. Hennis⁶,³¹, Christine Neslund-Dudas³², Ann W. Hsing²⁶,²⁷, Lisa Chu²⁶,²⁷,
Phyllis J. Goodman³³, Eric A. Klein³⁴, S. Lilly Zheng³⁵, John S. Witte³⁶,³⁷, Graham
Casey¹,², Alex Lubwama³⁸, Loreall C. Pooler¹, Xin Sheng¹, Gerhard A. Coetzee¹,²,³⁹,
Michael B. Cook¹¹, Stephen J. Chanock¹¹, Daniel O. Stram¹,², Stephen Watya³⁸,⁴⁰,
William J. Blot¹⁰,²⁹, David V. Conti¹,², Brian E. Henderson¹,²,†, Christopher A. Haiman¹,²
1. Department of Preventive Medicine, Keck School of Medicine, University of
Southern California, Los Angeles, CA 90089, USA
2. Norris Comprehensive Cancer Center, University of Southern California, Los
Angeles, CA 90033, USA
3. University of Arizona College of Medicine and University of Arizona Cancer
Center, Tucson, AZ 85711, USA
4. Department of Epidemiology, University of Texas M.D. Anderson Cancer
Center, Houston, TX 77230, USA
5. Department of Public Health Sciences, Henry Ford Hospital, Detroit, MI 48202,
USA
6. Department of Preventive Medicine, Stony Brook University, Stony Brook, NY
11794, USA
7. James Buchanan Brady Urological Institute, Johns Hopkins Hospital and Medical
Institution, Baltimore, MD 21287, USA
8. Division of Public Health Sciences, Fred Hutchinson Cancer Research Center,
Seattle, WA 98109, USA
9. Department of Epidemiology, School of Public Health, University of
Washington, Seattle, WA 98195, USA
10. Division of Epidemiology, Department of Medicine, Vanderbilt Epidemiology
Center, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
11. Division of Cancer Epidemiology and Genetics, National Cancer Institute,
National Institutes of Health, Bethesda, MD 20892, USA
12. Cancer Genomics Research Laboratory, NCI-DCEG, SAIC-Frederick Inc.,
Frederick, MD 21702, USA
13. Program for Personalized Cancer Care and Department of Surgery, NorthShore
University HealthSystem, Evanston, IL 60201, USA
14. Department of Genetics, Harvard Medical School, Harvard University, Boston,
MA 02115, USA
15. Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
16. Howard Hughes Medical Institute, Harvard Medical School, Boston, MA 20815,
USA
17. Department of Pathology and Laboratory Medicine, David Geffen School of
Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
18. Department of Human Genetics, David Geffen School of Medicine, University of
California, Los Angeles, Los Angeles, CA 90095, USA
19. Howard Hughes Medical Institute, Department of Medicine, School of Medicine,
University of California, San Diego, San Diego, CA 92093, USA
20. Epidemiology Research Program, American Cancer Society, Atlanta, GA 30303,
USA
21. Department of Urology, University of Texas M.D. Anderson Cancer Center,
Houston, TX 77030, USA
22. Korle Bu Teaching Hospital, Accra, Ghana
23. University of Ghana Medical School, Accra, Ghana
24. Westat, Rockville, MD 20850, USA
25. School of Public Health, University of California, Berkeley, Berkeley, CA
94720, USA
26. Cancer Prevention Institute of California, Fremont, CA 94538, USA
27. Stanford University School of Medicine and Stanford Cancer Institute, Palo Alto,
CA 94305, USA
28. Department of Urology, Northwestern University, Chicago, IL 60611, USA
29. Department of Epidemiology, Harvard School of Public Health, Boston, MA
02115, USA
30. The Translational Genomics Research Institute, Phoenix, AZ 85004, USA
31. Chronic Disease Research Centre and Faculty of Medical Sciences, University of
the West Indies, Bridgetown, Barbados
32. Department of Public Health Sciences, Henry Ford Hospital, Detroit, MI 48202,
USA
33. SWOG Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, WA
98101, USA
34. Glickman Urological & Kidney Institute, Cleveland Clinic, Cleveland, OH
44195, USA
35. Center for Cancer Genomics, Wake Forest School of Medicine, Winston-Salem,
NC 27157, USA
36. Department of Epidemiology and Biostatistics, University of California, San
Francisco, San Francisco, CA 94158, USA
37. Institute for Human Genetics, University of California, San Francisco, San
Francisco, CA 94158, USA
38. School of Public Health, Makerere University College of Health Sciences,
Kampala, Uganda
39. Department of Urology, Keck School of Medicine, University of Southern
California, Los Angeles, CA 90033, USA
40. Uro Care, Kampala, Uganda
†In Memoriam.
*Authors contributed equally to this work.
Corresponding Author:
Christopher A. Haiman, ScD
Harlyne Norris Research Tower
1450 Biggy Street, Room 1504
Los Angeles, CA 90033
Telephone: (323) 442-7755
Fax: (323) 442-7749
E-mail: haiman@usc.edu
5.1 Abstract
The 8q24 region harbors multiple risk variants for distinct cancers, including >8 for
prostate cancer. In this study, we conducted fine-mapping of the 8q24 risk region (127.8-
128.8 Mb) in search of novel associations with common and rare variation in 4,853
prostate cancer cases and 4,678 controls of African ancestry. All statistical tests were
two-sided. We identified three independent associations at p<5.00×10⁻⁸, all of which
were replicated in studies from Ghana and Uganda (combined sample, 5,869 cases, 5,615
controls; rs114798100: risk allele frequency [RAF]=0.04, per-allele odds ratio
[OR]=2.31, 95% confidence interval [CI]=2.04 to 2.61, p=2.38×10⁻⁴⁰; rs72725879:
RAF=0.33, OR=1.37, 95% CI=1.30 to 1.45, p=3.04×10⁻²⁷; and rs111906932: RAF=0.03,
OR=1.79, 95% CI=1.53 to 2.08, p=1.39×10⁻¹³). Risk variants rs114798100 and
rs111906932 are only found in men of African ancestry, with rs111906932 representing a
novel association signal. The three variants are located within or near a number of
prostate cancer-associated long non-coding RNAs (lncRNAs), including PRNCR1,
PCAT1 and PCAT2. These findings highlight ancestry-specific risk variation and
implicate prostate-specific lncRNAs at the 8q24 prostate cancer susceptibility region.
5.2 Results and Discussion
Genetic variation at 8q24 is a major contributor to prostate cancer (PCa) susceptibility
globally (Amundadottir et al., 2006; Haiman et al., 2007; Takata et al., 2010; Yeager et
al., 2009). African ancestry has been found to be over-represented in this region in
African American men with PCa, which suggests that underlying risk variants may be
more common in men of African than European ancestry (Freedman et al., 2006). Rare,
ancestry-specific alleles have been revealed in African and European ancestry
populations, highlighting allelic heterogeneity in the overall contribution of this region to
PCa risk (Gudmundsson et al., 2012; Haiman et al., 2007). However, the biological
mechanism(s) underlying the PCa risk associations is not entirely clear, with studies
implicating both MYC and long non-coding RNAs (lncRNAs) in this region (Ahmadiyeh
et al., 2010; Chung et al., 2011b; Kim et al., 2014).
Given the importance of this region in men of African ancestry, we conducted a
comprehensive investigation of common and rare variation across the 8q24 region
(127.8-128.8 Mb) in 4,853 cases and 4,678 controls from the African Ancestry Prostate
Cancer GWAS Consortium (AAPC; Supplementary Table 1 and Supplementary Note)
(Haiman et al., 2011b). Genotyping was conducted using the Illumina Infinium 1M-Duo
with imputation to a cosmopolitan reference panel from the 1000 Genomes Project
(1KGP, March 2012; Supplementary Methods). For each SNP, per-allele odds ratios
(ORs) and 95% confidence intervals (CIs) were estimated using unconditional logistic
regression. We tested for allele dosage effects through a 1-degree of freedom Wald trend
test. Multivariable logistic regression was utilized to identify independent risk variants
across the 8q24 locus (127.8-128.8 Mb) by conditioning on the most statistically
significant SNPs in a stepwise fashion. All statistical tests were two-sided.
We identified 199 variants at 8q24 associated with PCa risk (p<5.00×10⁻⁸), all
located between 127.894-128.233 Mb [spanning a region previously described as ‘region
2’ (Haiman et al., 2007)] (Figure 1; Supplementary Table 2). Associations with 10 of
the 14 known risk variants at 8q24 were replicated at p<0.05 (Supplementary Table 3).
Through forward selection (Supplementary Table 3 and Supplementary Figure 1), we
identified three independent association signals at p<5.00×10⁻⁸ (Table 1). The most
statistically significant association was with a low frequency SNP (RAF=0.04),
rs114798100 (conditional OR=2.07, 95% CI=1.80 to 2.38, p=2.98×10⁻²⁴, imputation info
score=0.93), which is only found in populations of African ancestry (Table 1) and is
correlated with a known African-specific risk variant rs116041037 (r²=0.63, AFR 1KGP)
(Haiman et al., 2007). A second nearby signal captured by rs72725879 (conditional
OR=1.27, 95% CI=1.19 to 1.36, p=2.77×10⁻¹³, imputation info score=0.95; Table 1), is
more common in populations of African (RAF=0.33) than European (RAF=0.19, EUR
1KGP) ancestry and is most common in Asian populations (RAF=0.66, ASN 1KGP);
variant rs72725879 is the strongest risk signal across the 8q24 region in Japanese men
(Chung et al., 2011b). A third and novel signal was defined by variant rs111906932
(conditional OR=1.75, 95% CI=1.47 to 2.07, p=1.52×10
-10
, imputation info score=0.88;
Table 1). Like rs114798100, the signal captured by rs111906932 is uncommon and only
found in African ancestry populations (RAF=0.03). The correlation for genotyped and
imputed variants for these three variants was >0.90 (Supplementary Table 4).
Subsequent conditional analyses revealed four additional variants with suggestive
91
independent associations (conditional p<10
-4
; Table 1). Of previously reported risk
variants at 8q24, only rs6983267 (p=0.0091, imputation info score=0.93) and rs7000448
(p=0.0091, imputation info score=1.00; Supplementary Table 3) remained nominally
statistically significant after conditioning on the seven markers described above. None of
these markers was statistically significantly associated with disease aggressiveness (data
not shown).
The associations with rs114798100 (OR=1.93, 95% CI=1.23 to 3.03, p=4.30×10⁻³),
rs72725879 (OR=1.30, 95% CI=1.04 to 1.63, p=2.02×10⁻²) and rs111906932
(OR=2.04, 95% CI=1.34 to 3.11, p=9.36×10⁻⁴) were replicated in the Ghana Prostate
Study (GPS; 474 cases, 458 controls) (Cook et al., 2014) and in a study from Uganda
(UGPCS; 542 cases, 479 controls; rs114798100, OR=2.54, 95% CI=1.75 to 3.69,
p=9.84×10⁻⁷; rs72725879, OR=1.37, 95% CI=1.14 to 1.65, p=9.00×10⁻⁴; and
rs111906932, OR=2.51, 95% CI=1.22 to 5.15, p=1.24×10⁻²) (Supplementary Table 5).
The associations with these variants were highly statistically significant in a
meta-analysis of AAPC, GPS and UGPCS (5,869 cases, 5,615 controls: rs114798100,
OR=2.31, 95% CI=2.04 to 2.61, p=2.38×10⁻⁴⁰; rs72725879, OR=1.37, 95% CI=1.30 to
1.45, p=3.04×10⁻²⁷; and, rs111906932, OR=1.79, 95% CI=1.53 to 2.08, p=1.39×10⁻¹³).
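The pooling of study-specific estimates can be sketched as follows. The text does not state which meta-analysis model was used, so this example assumes fixed-effect inverse-variance weighting; the helper names `log_or_and_se` and `fixed_effect_meta` are hypothetical. The inputs are the published ORs and 95% CIs for rs72725879 in AAPC (marginal estimate from Table 1), GPS and UGPCS.

```python
import math

Z_CRIT = 1.959964  # normal quantile for a 95% interval

def log_or_and_se(odds_ratio, ci_low, ci_high):
    """Convert an OR and its 95% CI to a log-OR and standard error."""
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * Z_CRIT)
    return beta, se

def fixed_effect_meta(estimates):
    """Inverse-variance weighted pooling of (beta, se) pairs (assumed model)."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled_beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled_beta, pooled_se

# rs72725879: (OR, lower 95% limit, upper 95% limit) as reported in the text
studies = {
    "AAPC": (1.38, 1.29, 1.47),
    "GPS": (1.30, 1.04, 1.63),
    "UGPCS": (1.37, 1.14, 1.65),
}

estimates = [log_or_and_se(*values) for values in studies.values()]
pooled_beta, pooled_se = fixed_effect_meta(estimates)
pooled_or = math.exp(pooled_beta)
ci = (math.exp(pooled_beta - Z_CRIT * pooled_se),
      math.exp(pooled_beta + Z_CRIT * pooled_se))
print(f"pooled OR={pooled_or:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
```

Under these assumptions the pooled OR is about 1.37, close to the reported meta-analysis estimate; small differences in the interval are expected because the inputs are rounded and the model actually used may differ.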
An analysis of targeted sequencing data (~15× mean coverage) of the 8q24 risk
locus (127.8-128.8 Mb) was also conducted in 1,644 cases and 1,459 controls to
investigate rarer variation that may have been missed through imputation
(Supplementary Methods and Supplementary Figure 2). None of the 4,186 variants
identified in ‘region 2’, including 2,604 with a frequency<1%, could explain the
associations observed with the three risk variants in this region (data not shown).
In African American men, 8q24 was initially highlighted by an admixture signal
(identified in a subset of the AAPC samples) (Freedman et al., 2006). Here we find that
the three most statistically significant risk variants (rs114798100, rs72725879 and
rs111906932), being more prevalent in men of African than European ancestry, can
account for the rise in local African ancestry in the region in African American men with
PCa (OR per African chromosome at 8q24=1.16, 95% CI=1.07 to 1.26, p=3.76×10⁻⁴; OR
adjusted for the three risk variants=1.03, 95% CI=0.94 to 1.12, p=0.57; Supplementary
Table 6; Table 1).
These findings provide further evidence of rare ancestry-specific variants in
region 2 of 8q24 that have substantial effects on risk (ORs per allele, 1.8-2.9)
(Gudmundsson et al., 2012). Effect size heterogeneity is also a hallmark of risk variants
at 8q24, as exemplified by rs72725879 with an OR of 1.75 (95% CI=1.57 to 1.95) in
Japanese men (Chung et al., 2011b) and 1.38 (95% CI=1.29 to 1.47) in men of African
ancestry. Such heterogeneity exists even after sequencing in these populations, and
implies an impact of genetic background, or differences in linkage disequilibrium
structure between these markers and one or more functional variants in the region. The
number, location and frequency of risk alleles at 8q24 also vary between populations. For
example, rs6983561 in region 2 (1KGP RAF=0.49 AFR, 0.21 ASN, and 0.03 EUR) is no
longer associated with risk at p<10⁻³ when adjusting for rs114798100 and rs72725879
(Supplementary Table 3); however, rs6983561 remains an independent signal in
Japanese and European men (Al Olama et al., 2009; Cheng et al., 2012). Likewise,
rs1016343, which is common in all populations (RAF>0.15), is the strongest signal in
region 2 (aside from rs188140481) in European men but is not found as an independent
signal in Asian (Cheng et al., 2012) or African populations (p=0.21; Supplementary
Table 3) (Cheng et al., 2012; Xu et al., 2009). Together, these observations suggest a
complex relationship between the underlying functional alleles at 8q24.
The most statistically significantly associated risk variants in region 2 are located
near a number of PCa-associated lncRNAs, including PRNCR1, PCAT1 and PCAT2
(Figure 1). PRNCR1 has been shown to be overexpressed in aggressive PCa, and to
influence androgen receptor-mediated gene activation (Yang et al., 2013). PCAT1 has
been implicated in the regulation of double-strand break repair through the repression of
BRCA2 (Prensner et al., 2014; Prensner et al., 2011). Nearby lncRNAs have also been
implicated at the prostate/colorectal cancer 8q24 risk locus rs6983267 (in region 3) (Kim
et al., 2014). Based on epigenetic annotations in PCa cell lines (Supplementary
Methods), rs72725879 was found to lie within an H3K27Ac marked enhancer
overlapping a FOXA1 ChIP-seq peak while four SNPs correlated with rs111906932 were
found in putative enhancers within the PRNCR1 transcript (Figure 1; Supplementary
Tables 7 and 8). These data therefore implicate lncRNAs and/or enhancers of unknown
target genes involved in PCa etiology at 8q24. To ascertain possible functions of
lncRNAs, their knock-down (siRNA) or overexpression in prostate cells can be followed
by phenotypic assays. To identify the targets of enhancers, experiments such as 1)
CRISPR-cas9-mediated genome editing to either knock out or replace alleles; 2)
chromatin interaction assays that identify physical proximity between the locations of the
risk variants and functional target regions; and 3) eQTL or ELMER (Yao et al., 2015)
associations will be required to assess whether and how the alleles highlighted in this
study are functional. Such functional follow-up may yield insight into the mechanism(s)
underlying the risk associations in this region.
There are several limitations to this study. Compared with other ongoing efforts in
European populations, the sample size in men of African ancestry still remains small, so
there may be other less common, low-risk alleles in the region that we did not have
power to detect. We performed targeted sequencing to investigate rarer alleles in the
region; however, large sections were missed because of repetitive sequence. Efforts that
combine studies across multiple racial/ethnic populations will be required to understand
the complex genetic architecture of this region on PCa risk.
With the identification of a second risk variant for PCa at 8q24 that is only found
in men of African ancestry, these findings strongly reinforce the importance of rarer
genetic variation in this region which may contribute, in part, to their greater risk of PCa.
5.3 Funding
The MEC and the genotyping in this study were supported by NIH grants CA63464,
CA54281, CA1326792, CA148085 and HG004726. Genotyping of the PLCO samples
was funded by the Intramural Research Program of the Division of Cancer Epidemiology
and Genetics, NCI, NIH. LAAPC was funded by grant 99-00524V-10258 from the
Cancer Research Fund, under Interagency Agreement #97-12013 (University of
California contract #98-00924V) with the Department of Health Services Cancer
Research Program. Cancer incidence data for the MEC and LAAPC studies have been
collected by the Los Angeles Cancer Surveillance Program of the University of Southern
California with Federal funds from the NCI, NIH, Department of Health and Human
Services, under Contract No. N01-PC-35139, and the California Department of Health
Services as part of the statewide cancer reporting program mandated by California Health
and Safety Code Section 103885, and grant number 1U58DP000807-3 from the Centers
for Disease Control and Prevention. KCPCS was supported by NIH grants CA056678,
CA082664 and CA092579, with additional support from the Fred Hutchinson Cancer
Research Center and the Intramural Program of the National Human Genome Research
Institute. MDA was supported by grants CA68578, ES007784, DAMD W81XWH-07-1-0645,
and CA140388. CaP Genes was supported by CA88164 and CA127298. SELECT
is funded by Public Health Service cooperative Agreement grant CA37429 awarded by
the NCI, NIH. GECAP was supported by NIH grant ES011126. IPCG was supported by
DOD grant W81XWH-07-1-0122. DCPC was supported by NIH grant S06GM08016 and
DOD grants DAMD W81XWH-07-1-0203, DAMD W81XWH-06-1-0066 and DOD
W81XWH-10-1-0532. SCCS is funded by NIH grant CA092447. SCCS sample
preparation was conducted at the Epidemiology Biospecimen Core Lab that is supported
in part by the Vanderbilt-Ingram Cancer Center (CA68485). Data on SCCS cancer cases
used in this publication were provided by the Alabama Statewide Cancer Registry;
Kentucky Cancer Registry; Tennessee Department of Health, Office of Cancer
Surveillance; Florida Cancer Data System; North Carolina Central Cancer Registry,
North Carolina Division of Public Health; Georgia Comprehensive Cancer Registry;
Louisiana Tumor Registry; Mississippi Cancer Registry; South Carolina Central Cancer
Registry; Virginia Department of Health, Virginia Cancer Registry; Arkansas Department
of Health, Cancer Registry. The Arkansas Central Cancer Registry is fully funded by a
grant from National Program of Cancer Registries, Centers for Disease Control and
Prevention (CDC). Data on SCCS cancer cases from Mississippi were collected by the
Mississippi Cancer Registry which participates in the National Program of Cancer
Registries (NPCR) of the Centers for Disease Control and Prevention (CDC). The
contents of this publication are solely the responsibility of the authors and do not
necessarily represent the official views of the CDC or the Mississippi Cancer Registry.
CPS-II is supported by the American Cancer Society. KAR is supported in part by the
Margaret Kersten Ponty postdoctoral fellowship endowment, Achievement Rewards for
College Scientists (ARCS) Foundation, Los Angeles Founder Chapter. Sequencing in this
study was supported by NCI grant CA165862.
5.4 Notes
The study funders had no role in the design of the study; the collection, analysis, and
interpretation of the data; the writing of the manuscript; and the decision to submit the
manuscript for publication.
We are forever indebted to Dr. Brian Henderson, who passed away before this
paper was published. Without his efforts in co-founding the MEC, this work would not
have been possible.
We thank all of the men who took part in these studies. We thank Drs. Christine
Berg and Philip Prorok, Division of Cancer Prevention, NCI, the screening center
investigators and staff of the PLCO Cancer Screening Trial, Mr. Thomas Riley and staff
at Information Management Services, Inc., and Ms. Barbara O’Brien and staff at Westat,
Inc. for their contributions to the PLCO Cancer Screening Trial. We also acknowledge
the technical support of Marta Gielzak and Guifang Yan.
Table 1. Prostate cancer risk variants at 8q24 in men of African ancestry.

SNP            Position*   Alleles, RAF†    OR(95% CI)‡,||    P§,||        Conditional OR(95% CI)‡,¶   Conditional P§,¶
rs7816007#     128012359   A/G, 0.80/0.74   1.21(1.12-1.30)   1.73×10⁻⁶    1.20(1.11-1.30)             3.71×10⁻⁶
rs114798100#   128085434   G/A, 0.04/0      2.32(2.02-2.66)   1.61×10⁻³³   2.07(1.80-2.38)             2.98×10⁻²⁴
rs111906932#   128086204   A/G, 0.03/0      1.72(1.45-2.03)   4.32×10⁻¹⁰   1.75(1.47-2.07)             1.52×10⁻¹⁰
rs72725879#    128103969   T/C, 0.33/0.19   1.38(1.29-1.47)   1.07×10⁻²³   1.27(1.19-1.36)             2.77×10⁻¹³
rs2445605      128161944   C/T, 0.90/0.96   1.30(1.18-1.44)   3.67×10⁻⁷    1.24(1.11-1.37)             5.97×10⁻⁵
rs7824868      128524414   T/C, 0.04/0.11   1.43(1.25-1.64)   2.62×10⁻⁷    1.40(1.22-1.61)             2.57×10⁻⁶
rs11784480#    128762529   T/A, 0.77/0.48   1.18(1.09-1.28)   8.17×10⁻⁵    1.19(1.10-1.30)             3.61×10⁻⁵

* Base-pair position in hg19 (GRCh37). SNP, single nucleotide polymorphism.
† Risk allele/reference allele. RAF, risk allele frequency in controls of African/European (EUR 1KGP) ancestry populations.
‡ OR, odds ratio with reference allele as the reference category. CI, confidence interval.
§ P-value from two-sided Wald test with 1 degree of freedom.
|| Adjusted for age, study, global ancestry (the first 10 principal components), and local ancestry.
¶ Additionally adjusted for all variants in this table.
# Imputed; imputation quality score range = 0.82-0.98.
Figure 1. Regional association plot of the 8q24 risk region in men of African ancestry. Single
nucleotide polymorphisms (SNPs) are plotted by position (x-axis) and -log10 p-value (y-axis).
The most associated SNP (purple diamond) is rs114798100 and the surrounding SNPs are colored
to indicate pairwise correlation in African ancestry populations (AFR panel in 1000 Genomes).
The panel below shows the overlap of the three most associated variants in ‘region 2’ as well as variants
correlated at r²≥0.7 with rs114798100 (green), rs72725879 (red) and rs111906932 (blue)
(Supplementary Table 7) and functional annotations from DNAseI, histone modification and
ChIP-seq experiments in LNCaP (Supplementary Table 8 and Supplementary Methods). All
statistical tests were two-sided.
[Figure 1 image not reproduced. Recoverable labels: chr8 (hg19), 8q24.21, 50 kb scale bar, positions 128,050,000 and 128,100,000; annotation tracks AROR, FoxA1, DNase I + androgen, H3K27ac, H3K27ac+DHT, H3K4me1; transcripts PCAT1, PCAT2, PRNCR1; variants rs76229939, rs114798100, rs72725879, rs75986367, rs145656844(del), rs7842223, rs111906932, rs77084358, rs112708750, rs76639131, rs74421890, rs76111959.]
References
Ahmadiyeh, N., Pomerantz, M. M., Grisanzio, C., et al. (2010). 8q24 prostate, breast, and
colon cancer risk loci show tissue-specific long-range interaction with MYC. Proc Natl
Acad Sci U S A 107, 9742-9746.
Al Olama, A. A., Kote-Jarai, Z., Giles, G. G., et al. (2009). Multiple loci on 8q24
associated with prostate cancer susceptibility. Nat Genet 41, 1058-1060.
Amundadottir, L. T., Sulem, P., Gudmundsson, J., et al. (2006). A common variant
associated with prostate cancer in European and African populations. Nat Genet 38, 652-
658.
Cheng, I., Chen, G. K., Nakagawa, H., et al. (2012). Evaluating genetic risk for prostate
cancer among Japanese and Latinos. Cancer Epidemiol Biomarkers Prev 21, 2048-2058.
Chung, S., Nakagawa, H., Uemura, M., et al. (2011). Association of a novel long non-
coding RNA in 8q24 with prostate cancer susceptibility. Cancer Sci 102, 245-252.
Cook, M. B., Wang, Z., Yeboah, E. D., et al. (2014). A genome-wide association study of
prostate cancer in West African men. Hum Genet 133, 509-521.
Freedman, M. L., Haiman, C. A., Patterson, N., et al. (2006). Admixture mapping
identifies 8q24 as a prostate cancer risk locus in African-American men. Proc Natl Acad
Sci U S A 103, 14068-14073.
Gudmundsson, J., Sulem, P., Gudbjartsson, D. F., et al. (2012). A study based on whole-
genome sequencing yields a rare variant at 8q24 associated with prostate cancer. Nat
Genet 44, 1326-1329.
Haiman, C. A., Chen, G. K., Blot, W. J., et al. (2011). Genome-wide association study of
prostate cancer in men of African ancestry identifies a susceptibility locus at 17q21. Nat
Genet 43, 570-573.
Haiman, C. A., Patterson, N., Freedman, M. L., et al. (2007). Multiple regions within
8q24 independently affect risk for prostate cancer. Nat Genet 39, 638-644.
Kim, T., Cui, R., Jeon, Y. J., et al. (2014). Long-range interaction and correlation
between MYC enhancer and oncogenic long noncoding RNA CARLo-5. Proc Natl Acad
Sci U S A 111, 4173-4178.
Prensner, J. R., Chen, W., Iyer, M. K., et al. (2014). PCAT-1, a long noncoding RNA,
regulates BRCA2 and controls homologous recombination in cancer. Cancer Res 74,
1651-1660.
Prensner, J. R., Iyer, M. K., Balbin, O. A., et al. (2011). Transcriptome sequencing across
a prostate cancer cohort identifies PCAT-1, an unannotated lincRNA implicated in
disease progression. Nat Biotechnol 29, 742-749.
Takata, R., Akamatsu, S., Kubo, M., et al. (2010). Genome-wide association study
identifies five new susceptibility loci for prostate cancer in the Japanese population. Nat
Genet 42, 751-754.
Xu, J., Kibel, A. S., Hu, J. J., et al. (2009). Prostate cancer risk associated loci in African
Americans. Cancer Epidemiol Biomarkers Prev 18, 2145-2149.
Yang, L., Lin, C., Jin, C., et al. (2013). lncRNA-dependent mechanisms of androgen-
receptor-regulated gene activation programs. Nature 500, 598-602.
Yao, L., Shen, H., Laird, P. W., Farnham, P. J., and Berman, B. P. (2015). Inferring
regulatory element landscapes and transcription factor networks from cancer
methylomes. Genome Biol 16, 105.
Yeager, M., Chatterjee, N., Ciampa, J., et al. (2009). Identification of a new prostate
cancer susceptibility locus on chromosome 8q24. Nat Genet 41, 1055-1057.
Chapter 6 Integration of multiethnic fine-mapping and
genomic annotation to prioritize candidate functional SNPs at
prostate cancer susceptibility regions
This paper was published in Human Molecular Genetics (Han et al., 2015).
Ying Han¹¶, Dennis J. Hazelett¹¶, Fredrik Wiklund², Fredrick R. Schumacher¹,³, Daniel
O. Stram¹,³, Sonja I. Berndt⁴, Zhaoming Wang⁴,⁵, Kristin A. Rand¹, Robert N. Hoover⁴,
Mitchell J. Machiela⁴, Merideth Yeager⁵, Laurie Burdette⁴,⁵, Charles C. Chung⁴, Amy
Hutchinson⁴,⁵, Kai Yu⁴, Jianfeng Xu⁶, Ruth C. Travis⁷, Timothy J. Key⁷, Afshan Siddiq⁸,
Federico Canzian⁹, Atsushi Takahashi¹⁰, Michiaki Kubo¹¹, Janet L. Stanford¹²,¹³,
Suzanne Kolb¹², Susan M. Gapstur¹⁴, W. Ryan Diver¹⁴, Victoria L. Stevens¹⁴, Sara S.
Strom¹⁵, Curtis A. Pettaway¹⁶, Ali Amin Al Olama¹⁷, Zsofia Kote-Jarai¹⁸, Rosalind A.
Eeles¹⁸,¹⁹, Edward D. Yeboah²⁰,²¹, Yao Tettey²⁰,²¹, Richard B. Biritwum²⁰,²¹, Andrew A.
Adjei²⁰,²¹, Evelyn Tay²⁰,²¹, Ann Truelove²², Shelley Niwa²², Anand P. Chokkalingam²³,
William B. Isaacs²⁴, Constance Chen²⁵, Sara Lindstrom²⁵, Loic Le Marchand²⁶, Edward
L. Giovannucci²⁷,²⁸, Mark Pomerantz²⁹, Henry Long³⁰, Fugen Li³⁰, Jing Ma³¹, Meir
Stampfer²⁷,²⁸, Esther M. John³²,³³, Sue A. Ingles¹,³, Rick A. Kittles³⁴, Adam B.
Murphy³⁵, William J. Blot³⁶,³⁷, Lisa B. Signorello³⁸, Wei Zheng³⁷, Demetrius Albanes⁴,
Jarmo Virtamo³⁹, Stephanie Weinstein⁴, Barbara Nemesure⁴⁰, John Carpten⁴¹, M. Cristina
Leske⁴⁰, Suh-Yuh Wu⁴⁰, Anselm J. M. Hennis⁴⁰,⁴², Benjamin A. Rybicki⁴³, Christine
Neslund-Dudas⁴³, Ann W. Hsing³²,³³, Lisa Chu³²,³³, Phyllis J. Goodman⁴⁴, Eric A.
Klein⁴⁵, S. Lilly Zheng⁴⁶, John S. Witte⁴⁷,⁴⁸, Graham Casey¹,³, Elio Riboli⁴⁹, Qiyuan Li⁵⁰,
Matthew L. Freedman²⁹, David J. Hunter²⁵, Henrik Gronberg², Michael B. Cook⁴,
Hidewaki Nakagawa⁵¹, Peter Kraft²⁵,⁵², Stephen J. Chanock⁴, Douglas F. Easton¹⁷, Brian
E. Henderson¹,³, Gerhard A. Coetzee¹,³,⁵³, David V. Conti¹,³, Christopher A. Haiman¹,³*
¹ Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America
² Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden
³ Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, California, United States of America
⁴ Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, United States of America
⁵ Cancer Genomics Research Laboratory, NCI-DCEG, SAIC-Frederick Inc., Frederick, Maryland, United States of America
⁶ Program for Personalized Cancer Care and Department of Surgery, NorthShore University HealthSystem, Evanston, Illinois, United States of America
⁷ Cancer Epidemiology Unit, Nuffield Department of Population Health, University of Oxford, Oxford, United Kingdom
⁸ Department of Genomics of Common Disease, School of Public Health, Imperial College London, London, United Kingdom
⁹ Genomic Epidemiology Group, German Cancer Research Center, Heidelberg, Germany
¹⁰ Laboratory for Statistical Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
¹¹ Laboratory for Genotyping Development, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
¹² Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
¹³ Department of Epidemiology, School of Public Health, University of Washington, Seattle, Washington, United States of America
¹⁴ Epidemiology Research Program, American Cancer Society, Atlanta, Georgia, United States of America
¹⁵ Department of Epidemiology, University of Texas M.D. Anderson Cancer Center, Houston, Texas, United States of America
¹⁶ Department of Urology, University of Texas M.D. Anderson Cancer Center, Houston, Texas, United States of America
¹⁷ Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
¹⁸ The Institute of Cancer Research, London, United Kingdom
¹⁹ Royal Marsden National Health Services (NHS) Foundation Trust, London and Sutton, United Kingdom
²⁰ Korle Bu Teaching Hospital, Accra, Ghana
²¹ University of Ghana Medical School, Accra, Ghana
²² Westat, Rockville, Maryland, United States of America
²³ School of Public Health, University of California, Berkeley, Berkeley, California, United States of America
²⁴ James Buchanan Brady Urological Institute, Johns Hopkins Hospital and Medical Institution, Baltimore, Maryland, United States of America
²⁵ Program in Genetic Epidemiology and Statistical Genetics, Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, United States of America
²⁶ Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii, United States of America
²⁷ Department of Nutrition, Harvard School of Public Health, Boston, Massachusetts, United States of America
²⁸ Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, United States of America
²⁹ Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
³⁰ Dana-Farber Cancer Institute, Department of Medical Oncology, Center for Functional Cancer Epigenetics, Boston, Massachusetts, United States of America
³¹ Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
³² Cancer Prevention Institute of California, Fremont, California, United States of America
³³ Division of Epidemiology, Department of Health Research and Policy, and Stanford Cancer Institute, Stanford University School of Medicine, Stanford, California, United States of America
³⁴ University of Arizona College of Medicine and University of Arizona Cancer Center, Tucson, Arizona, United States of America
³⁵ Department of Urology, Northwestern University, Chicago, Illinois, United States of America
³⁶ International Epidemiology Institute, Rockville, Maryland, United States of America
³⁷ Division of Epidemiology, Department of Medicine, Vanderbilt Epidemiology Center, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
³⁸ Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, United States of America
³⁹ Department of Chronic Disease Prevention, National Institute for Health and Welfare, Helsinki, Finland
⁴⁰ Department of Preventive Medicine, Stony Brook University, Stony Brook, New York, United States of America
⁴¹ The Translational Genomics Research Institute, Phoenix, Arizona, United States of America
⁴² Chronic Disease Research Centre and Faculty of Medical Sciences, University of the West Indies, Bridgetown, Barbados
⁴³ Department of Public Health Sciences, Henry Ford Hospital, Detroit, Michigan, United States of America
⁴⁴ SWOG Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
⁴⁵ Department of Urology, Glickman Urological & Kidney Institute, Cleveland Clinic, Cleveland, Ohio, United States of America
⁴⁶ Center for Cancer Genomics, Wake Forest School of Medicine, Winston-Salem, North Carolina, United States of America
⁴⁷ Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, California, United States of America
⁴⁸ Institute for Human Genetics, University of California, San Francisco, San Francisco, California, United States of America
⁴⁹ Department of Epidemiology & Biostatistics, School of Public Health, Imperial College, London, United Kingdom
⁵⁰ Medical College, Xiamen University, Xiamen, China 361102
⁵¹ Laboratory for Genome Sequencing Analysis, RIKEN Center for Integrative Medical Sciences, Tokyo, Japan
⁵² Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America
⁵³ Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America
¶ Authors contributed equally to this work.
* Corresponding Author:
Christopher A. Haiman
Harlyne Norris Research Tower
1450 Biggy Street, Room 1504
Los Angeles, CA 90033
Telephone: (323) 442-7755
Fax: (323) 442-7749
E-mail: haiman@usc.edu
6.1 Abstract
Interpretation of biological mechanisms underlying genetic risk associations for prostate
cancer is complicated by the relatively large number of risk variants (n=100) and the
thousands of surrogate SNPs in linkage disequilibrium. Here we combined three distinct
approaches: multiethnic fine-mapping, putative functional annotation (based upon
epigenetic data and genome-encoded features), and expression quantitative trait loci
(eQTL) analyses, in an attempt to reduce this complexity. We examined 67 risk
regions using genotyping and imputation-based fine-mapping in populations of European
(cases/controls: 8,600/6,946), African (cases/controls: 5,327/5,136), Japanese
(cases/controls: 2,563/4,391) and Latino (cases/controls: 1,034/1,046) ancestry. Markers
at 55 regions passed a region-specific significance threshold (p-value cutoff range:
3.9×10⁻⁴ to 5.6×10⁻³) and in 30 regions we identified markers that were more significantly
associated with risk than the previously reported variants in the multiethnic sample.
Novel secondary signals (p<5.0×10⁻⁶) were also detected in two regions
(rs13062436/3q21 and rs17181170/3p12). Among 666 variants in the 55 regions with
p-values within one order of magnitude of the most-associated marker, 193 variants (29%)
in 48 regions overlapped with epigenetic or other putative functional marks. In 11 of the
55 regions, cis-eQTLs were detected with nearby genes. For 12 of the 55 regions (22%),
the most significant region-specific, prostate-cancer associated variant represented the
strongest candidate functional variant based on our annotations; the number of regions
increased to 20 (36%) and 27 (49%) when examining the 2 and 3 most significantly
associated variants in each region, respectively. These results have prioritized subsets of
candidate variants for downstream functional evaluation.
6.2 Introduction
Prostate cancer is the most common non-skin cancer and the second leading cause of
cancer death among men in the United States. The risk of prostate cancer varies across
racial/ethnic populations, with the incidence rate in African Americans being 1.6 times that
in European Americans, and 2.6 times that in Asian Americans (Brawley, 2012).
Genome-wide association studies (GWAS) and large-scale collaborative replication
efforts have identified 100 prostate cancer risk variants (Akamatsu et al., 2012; Al Olama
et al., 2014; Al Olama et al., 2009; Al Olama et al., 2012; Eeles et al., 2009; Eeles et al.,
2013; Gudmundsson et al., 2009; Gudmundsson et al., 2008; Haiman et al., 2011b;
Haiman et al., 2007; Kote-Jarai et al., 2011b; Schumacher et al., 2011; Sun et al., 2009;
Xu et al., 2012) (referred to as index variants), mainly in populations of European or
Asian ancestry. Whether the associations with these risk variants generalize and define
the biologically relevant variation in other populations are important questions. In prior
studies, examining previously identified risk variants in men of African ancestry (Haiman
et al., 2011a; Han et al., 2014), we have noted directionally consistent associations at the
majority of risk loci (83%) which suggests that the underlying functional variant is
common and shared across populations. Fine-mapping in these regions in men of African
ancestry revealed markers that have greater statistical significance and larger effect sizes
[odds ratios (ORs)] for 27 (out of 82) index variants in this population (Han et al., 2014).
Due to the varying linkage disequilibrium (LD) patterns and allele frequencies observed
across racial/ethnic groups, studies in diverse populations, and most notably African-
American populations, have been suggested to increase power for fine-mapping by
reducing the number of proxies that are correlated with the underlying functional allele
(Diabetes Genetics Replication Meta-analysis Consortium et al., 2014; Franceschini et
al., 2012; Liu et al., 2014; Wu et al., 2013).
Since the vast majority of index variants (and their proxies) are located in regions
outside of protein-coding exons, identifying biologically functional candidate variants
and the genes they influence are substantial challenges in human genetics. It is now clear
that GWAS trait-associated variants are enriched amongst regulatory elements (Maurano
et al., 2012; Nicolae et al., 2010; Parker et al., 2013; Pickrell, 2014). Recently, we and
others have developed approaches to identify candidate functional variants by
intersecting genetic information with epigenetic marks that characterize regulatory
elements (Boyle et al., 2012; Coetzee et al., 2010; Coetzee et al., 2012; Grisanzio et al.,
2012; Ward and Kellis, 2012). Identifying the target gene of a regulatory element also
poses a challenge since regulatory elements can act over great distances. Expression
quantitative trait loci (eQTL) analysis has emerged as a powerful method to nominate
candidate genes (Cookson et al., 2009; Li et al., 2013; Westra and Franke, 2014). Such
approaches have led to the identification of putative functional variants and candidate
genes for a number of prostate cancer risk regions, including 8q24 (Ahmadiyeh et al.,
2010; Jia et al., 2009; Li et al., 2014), 10q11/MSMB (Pomerantz et al., 2010), 6q22/RFX6
(Huang et al., 2014) and 8p21/NKX3.1 (Akamatsu et al., 2010).
In the present study, we combined multiethnic fine-mapping results with detailed
tissue-specific functional annotation and eQTL data for prostate cancer. Specifically, we
conducted genotyping and imputation-based fine-mapping of 67 regions (see Materials
and Methods) in a large multiethnic sample comprised of 17,524 prostate cancer cases
and 17,519 controls from populations of European (8,600 cases and 6,946 controls),
African (5,327 cases and 5,136 controls), Japanese (2,563 cases and 4,391 controls) and
Latino (1,034 cases and 1,046 controls) ancestry to further refine the complexity of
prostate cancer-associated variants as well as elucidate novel risk variants (i.e. secondary
signals) for this malignancy. We used epigenetic and gene expression information to
functionally annotate the most-associated variants in an attempt to identify a subset of
variants in each region to be prioritized for functional testing.
6.3 Results
6.3.1 Statistical Fine-mapping
The 67 regions contained 69 index risk variants; 3p11-p12 and 4q22 each harbored 2
index SNPs (see below). In the analysis of 17,524 prostate cancer cases and 17,519
controls (Supplementary Table S1-S3; Supplementary File S1), a high degree of
directional consistency of the per-allele odds ratios (ORs) was noted with the index
signals in these populations, consistent with what we previously observed in many of
these same samples/populations (Cheng et al., 2012; Haiman et al., 2011a; Han et al.,
2014). Of the 69 risk alleles, 68 were available (frequency≥0.01) in populations of
European ancestry and all 68 alleles (100%) were positively associated with risk, with 50
(74%) nominally statistically significant (p<0.05); whereas these proportions (positive
OR versus nominally significant) were 90% (62/69) and 33% (23/69) in the African, 84%
(54/64) and 41% (26/64) in the Japanese and 81% (55/68) and 25% (17/68) in the Latino
ancestry populations, respectively. We observed significant effect heterogeneity across
populations for six index SNPs ($p_{\mathrm{het}}<9.1\times10^{-4}$ and $I^2>80.0\%$; see Materials and
Methods), two of which (rs2660753/3p12 and rs9600079/13q22) were directionally
inconsistent, while the other four (rs12653946/5p15, rs1512268/8p21, rs7501939/17q12
and rs1859962/17q24) were directionally consistent but had large differences in
estimated effect sizes across populations (Supplementary Table S4).
Using a region-specific threshold of statistical significance (see Materials and
Methods), we found 55 of 67 (82%) regions contained signals that were significantly
associated with prostate cancer risk (Supplementary Table S4). Among these 55
regions, the index SNP remained the most significantly associated marker at 10 regions,
while a correlated variant was marginally more significantly associated ($r^2\geq0.2$ and <1
order of magnitude change in the p-value compared with the index SNP) at 15 regions
(Supplementary Table S4). The effect sizes (ORs) of the index SNP and the most-
associated correlated variant in these 15 regions were similar in magnitude in both the
multiethnic sample and the racial/ethnic population in which the discovery GWAS was
conducted (referred to as the discovery GWAS population), with no statistically
significant heterogeneity noted.
In 30 regions, combined data from multiple populations revealed variants that
were more significantly associated with risk than the index variant, which we defined as a
>1 order of magnitude change in the p-value (Table 1; Supplementary Fig. S1). A
complete list of these variants can be found in Supplementary Table S4. The most
significantly associated markers at three regions (rs13017478/2p21, rs58235267/2p15
and rs76925190/3q26) were weakly correlated with the index variant in each of these
respective regions ($r^2$ range, 0.15-0.18), but were still able to capture the index signals by
conditional analysis (see Materials and Methods; Supplementary Table S5). In these
30 regions, the ORs of the most-associated markers demonstrated marked directional
consistency for 27 (90%) regions, compared with only 18 of the 32 (56%) index variants
(Table 1). However, each of the 55 regions had a set of risk-associated SNPs from the
meta-analysis with similar effect sizes and corresponding p-values. While the markers in each
set are statistically indistinguishable, they define a relatively small subset that most
likely contains the underlying functional variant in each region.
Interestingly, two index SNPs located 357 kb apart and previously reported as
independent signals (rs2660753/3p12 (Kote-Jarai et al., 2008) and rs2055109/3p11
(Akamatsu et al., 2012)), could be explained by the most-associated marker in the region
after fine-mapping (rs76668454). Both index SNPs are modestly correlated with
rs76668454 in the discovery GWAS populations ($r^2\geq0.30$). This scenario was also
observed at 4q22 with the two index SNPs (rs12500426 and rs17021918), located 48 kb
apart, captured by marker rs60063444 (Supplementary Table S6).
When evaluating the most-associated variants instead of the index SNPs,
associations at three regions (rs76668454/3p12, rs7327286/13q22 and rs6501436/17q24)
were no longer significantly heterogeneous across populations. However, three other
regions (rs4975758/5p15, rs1160267/8p21 and rs11263763/17q12) remained
significantly heterogeneous, likely due to the larger estimated effect sizes observed in the
Japanese population (Table 1).
An example of a region illustrating the improvement in the association signal
through multiethnic fine-mapping is shown in Fig. 1. At 13q22, the index SNP
rs9600079, originally identified in this Japanese sample (Takata et al., 2010) (Table 1),
was not significantly associated with prostate cancer risk in the other racial/ethnic
populations (European: OR=1.02, p=0.41; African: OR=0.97, p=0.22; Latino: OR=1.03,
p=0.65). In testing all common variants that are correlated with rs9600079 in Asians
($r^2\geq0.2$), the most-associated variant in the multiethnic meta-analysis was rs7327286
(Overall: $p=6.1\times10^{-10}$), which is located 15 kb upstream from the index SNP (Fig. 1).
This variant is highly correlated with the index SNP in Asians ($r^2=0.83$), but is minimally
correlated in Europeans ($r^2=0.19$) and Africans ($r^2=0.01$). It was more statistically
significant and had a larger effect than the index SNP in each population and overall, and
statistical evidence for heterogeneity no longer remained ($p_{\mathrm{het}}=0.19$ versus
$p_{\mathrm{het}}=7.8\times10^{-5}$ for the index SNP).
At 17q24 (Fig. 2), the index SNP rs1859962 was originally reported in a
European GWAS (Eeles et al., 2009). The association with the index SNP was
significantly heterogeneous across racial/ethnic populations ($p_{\mathrm{het}}=9.8\times10^{-6}$; Table 1),
with the largest effect and the most significant association observed in Europeans
(OR=1.20, $p=1.1\times10^{-13}$). When examining all correlated ($r^2\geq0.2$) variants in Europeans,
rs6501436, a SNP located 10 kb downstream from the index SNP, was the most
associated marker in the multiethnic analysis ($p=1.5\times10^{-14}$). This marker is strongly
correlated with the index SNP in both European ($r^2=0.96$) and Asian ancestry populations
($r^2=0.94$), but is minimally correlated ($r^2=0.08$) in Africans. Moreover, in men of African
ancestry, this SNP was more significantly associated with risk than the index SNP
(OR=1.13, $p=4.7\times10^{-4}$ versus OR=1.00, p=0.91). The effect heterogeneity of rs6501436
was no longer significant across populations ($p_{\mathrm{het}}=0.005$ vs
$p_{\mathrm{het}}=9.8\times10^{-6}$ for the index SNP).
Investigating associations in multiple populations also aided in deciphering
potentially ethnic-specific risk variants. As an example, at 10q26, the index variant
rs2252004, initially identified in this Japanese sample where the signal is the strongest
(OR=1.21, $p=2.0\times10^{-5}$), is common in all populations (RAF range, 0.49-0.90) and is only
weakly associated with risk in Europeans (OR=1.08, p=0.04; Table 1). In examining all
variants correlated with rs2252004 ($r^2>0.2$, ASN 1KGP), the most-associated marker,
rs77929344, was only found in Japanese (RAF=0.87; OR=1.31, $p=9.3\times10^{-7}$). Markers
correlated with rs2252004 were only modestly associated with prostate cancer risk in the
other populations (p-values>0.003; region-specific threshold for significance p=0.001;
Supplementary Table S4), suggesting that this may be a Japanese-specific risk signal.
We also identified evidence of potential secondary signals in two regions through
conditional analyses (at $p<5.0\times10^{-6}$; see Materials and Methods; Supplementary Table
S7). At 3q21, rs13062436, located 179 kb from the index SNP (rs10934853), was
significantly associated with prostate cancer risk when conditioning on the index signal
(OR=1.14, $p=5.0\times10^{-8}$; Supplementary Fig. S2). Similarly at 3p12, rs17181170 was
significantly associated with risk in conditional analyses (OR=1.10, $p=5.9\times10^{-8}$;
Supplementary Fig. S3). As expected, both of these novel risk variants are uncorrelated
with the index SNPs or the most-associated markers for the index signals in each
population ($r^2\leq0.06$).
6.3.2 Functional Annotation
Multi-ethnic fine-mapping in each region defined sets of alleles based on statistical
significance, with many having similar effect sizes (Supplementary Table S4). To
further prioritize which of the most associated variants have putative functionality, we
mapped them relative to epigenetic marks and transcription factor binding data from
publicly available sources (see Materials and Methods). Here we limited the
annotation to the 55 regions that were significantly associated with prostate cancer risk in
the multiethnic analysis (as described above) and the 666 variants in these regions that
had p-values that were within 1 order of magnitude of the most-associated marker
(referred to as ‘top-order’ variants). Since this deterministic approach relies heavily on p-
value rankings, we compared this approach with the ranking distribution obtained by re-
sampling the effects of all candidate SNPs in each region for each population from a
multivariate distribution (see Materials and Methods). At 49 (out of 55, 89%)
regions, the set of top-order variants contained the top-ranked SNP when resampling
(Supplementary Table S8). Moreover, 84% on average (100% median) of the top-order
SNPs were within the 95% joint posterior probabilities from resampling. For 28 regions,
the entire set of top-order SNPs was included within the 95% joint posterior probabilities.
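The selection rule for top-order variants described above can be sketched in a few lines. This is a minimal illustration rather than the code used in the study: the helper name `top_order_set`, the inclusive boundary at exactly one order of magnitude, and the SNP labels and p-values are all assumptions made for the example.

```python
def top_order_set(pvalues, orders=1.0):
    """Return SNP ids whose p-value lies within `orders` orders of magnitude
    of the smallest p-value in the region, sorted from most to least significant.

    `pvalues` maps a SNP id to its meta-analysis p-value. Treating the
    boundary as inclusive is an assumption; the text does not specify it.
    """
    best = min(pvalues.values())
    cutoff = best * 10 ** orders
    selected = [snp for snp, p in pvalues.items() if p <= cutoff]
    return sorted(selected, key=lambda snp: pvalues[snp])


# Made-up p-values for one hypothetical region (not taken from the study).
region = {"snpA": 2.0e-10, "snpB": 9.0e-10, "snpC": 1.9e-9, "snpD": 5.0e-8}
print(top_order_set(region))
```

In this toy region the first three SNPs fall within one order of magnitude of the smallest p-value, while `snpD` is excluded.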
Of the 666 variants, 193 (29%) overlapped with functional marks and could be
assigned into one of four predicted functional categories: missense variants, enhancers,
promoter or promoter-proximal enhancers, and untranslated regions of coding exons
(UTR) (Supplementary Tables S9 and S10). Of the 193 SNPs, 2 were non-synonymous
substitutions (rs2292884/H347R in MLPH and rs699664/R325Q in GGCX), 152 (79%)
were present in enhancers, 29 (15%) were in promoters, and 10 (5%) were located within
untranslated regions. The most statistically associated variants in 12 of the 55 (22%)
regions represented the best functional candidates in the region, whereas this number
increased to 20 (36%) and 27 (49%) when examining the 2 or 3 most significantly
associated markers in each region, respectively (Supplementary Table S9). To
determine whether the top-order variants are enriched for enhancer annotations, we
examined 721,371 SNPs (MAF>1%) and insertion/deletion variants in the 1KGP
database within windows of 1 Mb (centered on the index SNP) at the 55 regions and
found that 102,735 (14%) overlapped with the features that we used to identify and
annotate enhancers. Compared with this figure, our 666 top-order variants contained 152
enhancer overlaps (23%), constituting a 1.6-fold enrichment over background ($p=1.1\times10^{-9}$).
We also performed a cis-eQTL analysis with the top-order variants (n=666) in
145 prostate tumor samples to further assess whether the risk signals defined by the top
SNPs may also be associated with an eQTL. The cis-eQTL associations in this sample
based on the index variants or correlated markers ($r^2>0.5$) in the CEU 1KGP population
have been published previously (Li et al., 2014). In 11 regions, we found suggestive
evidence of cis-eQTL associations ($p<9.1\times10^{-4}$; see Materials and Methods) between
top-order SNPs and the expression of one or more nearby genes (Supplementary Table
S9) with the most significant associations observed at 5p15 and IRX4 ($p=2.4\times10^{-15}$) (Xu
et al., 2014), 6q21 and SESN1 ($p=6.3\times10^{-8}$), 6q25 and RGS17 ($p=2.5\times10^{-6}$), 11q22 and
MMP7 ($p=2.3\times10^{-6}$), and at 17p13 with VPS53 ($p=1.3\times10^{-6}$) and FAM57A ($p=6.4\times10^{-6}$).
Five cis-eQTL associations (SLC6A19, C10orf32, CTBP2, MMP7 and FAM57A) were
not identified previously when focusing on the index variant or those correlated with the
index SNP in the European ancestry population (Li et al., 2014). To determine whether
the top-order variants were enriched for cis-eQTL associations, we examined all common
variants (MAF≥1%) in the 55 regions where fine-mapping was conducted. Of 334,357
variants in these regions, 4,822 (1.4%) were significant cis-eQTLs. In comparison, 87 out
of the 666 top-order variants in the same regions were significant cis-eQTLs (13%),
constituting a 9-fold enrichment over background ($p=5.0\times10^{-55}$).
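The two fold-enrichment figures reported above (for enhancer overlap and for cis-eQTLs) follow directly from the quoted counts. The short sketch below reproduces only that arithmetic; the function name `fold_enrichment` is hypothetical, and the significance test behind the reported p-values is not reproduced because the text does not name it.

```python
def fold_enrichment(hits, total, bg_hits, bg_total):
    """Proportion of hits in the candidate set divided by the background proportion."""
    return (hits / total) / (bg_hits / bg_total)


# Counts as quoted in the text: top-order variants versus background variants.
enhancer_fold = fold_enrichment(152, 666, 102_735, 721_371)
eqtl_fold = fold_enrichment(87, 666, 4_822, 334_357)

print(f"enhancer overlap: {enhancer_fold:.2f}-fold")
print(f"cis-eQTL: {eqtl_fold:.2f}-fold")
```

Both ratios round to the values stated in the text (1.6-fold and 9-fold).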
Four of the regions were unique in that the statistical association, functional
annotation and eQTL evidence converged on a small number of variants (Fig. 3). These
regions are discussed below; detailed annotation information for all regions is provided in
the Supplementary Fig. S4 and Supplementary File S1.
At 10q24, the most promising functional candidate is variant rs12773833 (7
annotations, $p=1.8\times10^{-6}$), one of the most associated variants in the region. It overlaps
with ChIPseq peaks for FOXA1, the Androgen Receptor in tumor and normal prostate
tissue samples, DNaseI, H3K27Ac and H3K4me1 and is an eQTL ($p=6.5\times10^{-4}$) with the
AS3MT gene ~200 kb telomeric (Fig. 3A; Supplementary Table S9). A second
candidate, rs7094325 ($p=1.7\times10^{-6}$), is an eQTL for AS3MT ($p=5.1\times10^{-4}$) and C10orf32
($p=7.7\times10^{-4}$; Supplementary Table S9), overlaps a CTCF peak in LNCaP (Fig. 3A) and
disrupts a potential CTCF response element (Supplementary Table S10).
At 10q26, the top-order SNPs include rs11598549 ($p=2.3\times10^{-8}$), which is situated
in a DNaseI site (Fig. 3B) and is an eQTL with CTBP2 ($p=7.9\times10^{-4}$; Supplementary
Table S9). Other top-order SNPs, rs4962419, rs7077275, rs12769019 and rs12769682
($p<2\times10^{-7}$), also overlap the DNaseI hypersensitive and transcription-factor bound region
of this enhancer, with the low-risk A allele of rs12769019 predicted to disrupt an
Androgen Receptor response element (Fig. 3B; Supplementary Table S10). A recent
report confirmed that the Androgen Receptor binds to this region and that the risk allele
of rs12769019 mediates increased androgen responsiveness of the enhancer in a reporter
assay (Takayama et al., 2014).
At 12q13, the most significantly associated SNP rs55958994 ($p=2.8\times10^{-13}$;
Supplementary Table S9) overlaps 6 annotations (Fig. 3C) and is situated within a
DNaseI hypersensitive site in an H3K27Ac and H3K4me1 marked active promoter-
proximal enhancer of KRT8. In a previous fine-mapping study in this African sample we
reported rs55958994 as the most significant association in the region (Han et al., 2014).
The multiethnic results presented here with a larger sample and multiple populations
further support rs55958994 as the most-associated marker and best functional candidate
in the region.
At 17p13, the index SNP rs684232 (five annotations, $p=5.3\times10^{-5}$) and the most-
associated SNP rs2474694 (four annotations, $p=3.3\times10^{-5}$) are in different DNaseI
hypersensitive sites within the same H3K27Ac and H3K4me3 marked promoter of the
VPS53 gene (Fig. 3D). Both SNPs are also eQTLs with VPS53 (rs684232 $p=4.6\times10^{-6}$;
rs2474694 $p=4.6\times10^{-4}$) and FAM57A (rs684232 $p=2.3\times10^{-5}$; rs2474694 $p=2.3\times10^{-5}$;
Supplementary Table S9). Of the two, rs684232 is predicted to disrupt a GATA6
binding site, whereas rs2474694 is located within the 5' UTR of VPS53.
6.4 Discussion
In this multiethnic fine-mapping study of prostate cancer risk loci, 55 regions (out of 67
examined) passed region-specific significance thresholds, and in 25 regions, the index
SNP and correlated variants were most effective in denoting the risk association. In 30
regions, we identified markers that are more statistically associated with risk than the
index SNP in the multiethnic sample. In applying information on functional annotation
from prostate cancer cell lines we were able to further define variants in many regions
that may be considered high-priority functional candidates for experimental testing.
GWAS have revealed 100 genetic risk variants for prostate cancer, more than any
other common cancer. Although most of the associations were originally identified in
populations of European ancestry, the vast majority of these risk variants are also
associated with risk in other populations, pointing to common shared functional variants
(Cheng et al., 2012; Haiman et al., 2011a; Han et al., 2014). Some loci do not replicate
despite adequate statistical power. Given the directional consistency noted across
populations, this observation suggests that the index variant is not the biologically
relevant allele or is not a valid proxy for the true causal variant in all populations. Fine-
mapping using trans-ethnic populations has been shown both in theory (Morris, 2011;
Ong et al., 2012; Teo et al., 2010; Zaitlen et al., 2010) and in practice (Diabetes Genetics
Replication Meta-analysis Consortium et al., 2014; Franceschini et al., 2012; Liu et al.,
2014; Wu et al., 2013) for a number of traits to have better performance than using a
single population of relatively homogeneous ancestry. Based on the expectation that most
functional alleles are common and shared, the success of this approach requires having
good coverage of the genetic variation across populations. Here we used GWAS
genotyping plus imputation based on a high-density reference panel from 1KGP, which is
currently the most comprehensive and efficient approach for testing common alleles at
risk regions.
The primary goal of our study was to reduce the complexity in each region
defined by an index signal and to identify a subset of variants that most likely contain
functional variation affecting prostate cancer risk in all populations. However, this does
not rule out the possibility of additional risk variants in these regions as demonstrated at
3q21 and 3p12, or the possibility of allelic heterogeneity in a region in these populations.
Additionally, multiple variants may jointly contribute to the disease risk in a non-additive
fashion such as through haplotypes or gene-gene interactions, which requires further
investigation.
In selecting regions for fine-mapping, we applied a conservative region-specific
significance threshold to reduce the potential for identifying and reporting false-positive
associations. Several methods for prioritizing the best candidate markers in a region have
been proposed and utilized in trans-ethnic fine-mapping studies (Diabetes Genetics
Replication Meta-analysis Consortium et al., 2014; Franceschini et al., 2012; Haiman et
al., 2011a; Liu et al., 2014; Morris, 2011; Wu et al., 2013). In general, these methods
attempt to identify a single variant or so-called “best marker” in a region amongst a group
of variants mutually correlated. For example, MANTRA (Morris, 2011) leverages
heterogeneity across multiple ethnic groups to improve power and provide a credible set
of variants. When applied to our data, the credible set often contained hundreds of SNPs
with a ranking similar to the p-value ranking (data not shown). Instead, we opted to
leverage the differential LD structure and investigate SNPs with consistent estimates
across populations ranked by p-value, in order to provide a narrower list of top-order
variants as promising candidates for functional annotation. To account for the uncertainty
from estimation on the resulting ranking of p-values, we used an empirical resampling
approach to estimate the probability that each SNP is the most-associated marker within
each region. For the vast majority (89%) of the regions that we examined, the top-ranked
SNP from resampling is included within our defined set of top-order variants. Although
requiring validation, this suggests that the underlying functional allele, which is expected
to have the largest effect size and smallest p-value, has a good probability of being
included within the subset of alleles that we have defined in each region (shown in
Supplementary Table S4).
Compared with past fine-mapping studies for prostate cancer, our study is
empowered by improved imputation coverage, increased sample size and different LD
structure from multiethnic populations. Previous fine-mapping efforts limited to single
populations include region-specific studies of 5p15 (TERT) (Kote-Jarai et al., 2013), 8p21
(NKX3.1) (Akamatsu et al., 2010), 10q11 (MSMB) (Lou et al., 2009), 11q13 (Chung et
al., 2012; Chung et al., 2011a; Zheng et al., 2009), 17q12 (HNF1B) (Berndt et al., 2011;
Sun et al., 2008) and 19q13 (KLK3) (Kote-Jarai et al., 2011a; Parikh et al., 2011) in
European or Asian ancestry populations as well as an initial characterization of the
known regions in part of this African ancestry sample (Haiman et al., 2011a; Han et al.,
2014). In a concurrent fine-mapping study, which is the largest fine-mapping study to
date in men of European ancestry, Al Olama et al. (Human Molecular Genetics, in press)
examined 64 regions in 25,779 prostate cancer cases and 26,218 controls, with the
majority of samples genotyped with the iCOGS array (Eeles et al., 2013). Leveraging
individual-level data, they used stepwise regression to suggest more significant markers
at 47 regions. In total, they indicate 1,623 variants as candidates via statistical fine-
mapping (21.6 SNPs per region on average; median of 13). Of these variants, they
identify 403 with additional functional annotation with an average of 5.4 per region
(median=3). In comparison, our multiethnic study suggests a total of 666 SNPs in 55
regions (average of 12.1 SNPs per region and a median of 6). With additional functional
annotation our study identifies an average of 3.5 markers (median=2). Comparing final
conclusions for the 46 overlapping regions across the two studies, there are 29 (63.0%)
regions with at least one overlapping functional candidate marker. Moreover, 11 of the
46 regions (23.9%) have a single overlapping candidate – a strong candidate marker for
future more detailed functional investigation. For these overlapping regions, Al Olama et
al. reported novel secondary signals ($p<1.0\times10^{-5}$) with 13 variants in these regions. In our
study, nine of these variants were nominally associated with prostate cancer risk
(p<0.05), with two markers correlated ($r^2$ range, 0.47-0.93, EUR 1KGP) with the
secondary signals we identified at 3q21 and 3p12. We believe that these contrasting
findings highlight the value of using both homogeneous as well as multiethnic samples in
fine-mapping, with the latter focused on prioritizing risk variants that generalize across
populations and thus, are most likely to be or tag the biologically functional variant.
In order to assess the variants for biological relevance we determined their
locations relative to chromatin biofeatures and other primary sequence features in cell
types that reflect the epithelial cell types of origin for this cancer. These include PrEC
and RWPE1, which are immortalized, non-transforming prostate epithelial cell lines, and
LNCaP, a prostate cancer cell line that was originally isolated from lymph node
metastases. Previously, we examined GWAS risk variants and proxies within biofeatures
in these cell types and identified 727 putative functional SNPs in LD ($r^2>0.5$) with the
index SNP in 77 risk regions (Hazelett et al., 2014). A large proportion (667, or 88%) of
these SNPs fell within 217 distinct enhancer regions. While this subset of variants likely
captures many true biologically functional variants, associations with prostate cancer risk
were not directly assessed. Here we show that through multiethnic association mapping
we are able to eliminate many of the proxies and reduce the number of candidate
functional variants to a much smaller subset within one or two enhancers in many
regions. In particular, we were able to narrow down our list of candidates from 666 top-
order SNPs to 193 within chromatin features of biological interest in the cell type of
origin, of which 152 (79%) are located in 82 distinct enhancer regions, with an average of
1.5 enhancers per region over 55 regions. Contrast this with the previous study where we
found 217 putative risk enhancers over 77 regions, equivalent to 2.8 enhancers per risk
region. We also evaluated eQTLs to further highlight potential functional alleles and
target genes that may be altered by inherited variation. Through combining different
sources of evidence we have attempted to reduce the number of variants and hypotheses
for researchers to test in follow-up studies designed to reveal the biological mechanisms
underlying genetic risk.
We showed that relative to random selections of SNPs from 1KGP, our top-order
set was highly enriched for enhancer and promoter regulatory elements, consistent with
previous findings that the majority of GWAS signals for many cancers overlap enhancer
regions (Hazelett et al., 2014; Rhie et al., 2014). Moreover, our finding of enrichment
strongly suggests that there is a real bias toward perturbation of regulatory sequences in
risk for prostate cancer. It also suggests, because of the types of disruptions, that risk is
mediated biologically at the level of point-disruptions of enhancer-promoter interactions
mediated by transcription factor binding to the chromosome at the location of the SNP.
More studies are needed to assess the gene regulation for each risk enhancer, identify
potential gene targets, and determine which targets mediate phenotypes consistent with
increased risk for cancer using cell-based assays. The combination of new technologies, including
CRISPR-Cas9 RNA-mediated genome editing for knockout and allele replacement
experiments and 4C chromatin interaction assays to identify physical interactions
between risk enhancers and their target regions, is expected to yield insight into the
mechanism of GWAS-identified risk at many of the regions.
In summary, we have characterized 55 prostate cancer risk
regions through statistical multiethnic fine-mapping, functional annotation and cis-eQTL
analyses and have revealed variants in ~50% of these regions that may be functional and
in turn should be prioritized in future experimental testing to understand biological
mechanisms at prostate cancer risk regions.
6.5 Materials and Methods
6.5.1 Study Populations and Genotyping
We combined data from studies with existing high-density SNP genotyping in prostate
cancer GWAS in the following populations: European ancestry [8,600 cases and 6,946
controls from the Cancer of the Prostate in Sweden (CAPS) (Duggan et al., 2007), Breast
and Prostate Cancer Cohort Consortium (BPC3) (Schumacher et al., 2011) and
PEGASUS]; African ancestry [5,327 cases and 5,136 controls from the African Ancestry
Prostate Cancer GWAS Consortium (AAPC) (Haiman et al., 2011b) and the Ghana
Prostate Study (Cook et al., 2014)]; Japanese ancestry [2,563 cases and 4,391 controls
from GWAS in Japanese in the Multiethnic Cohort (MEC) (Cheng et al., 2012), and in
Biobank Japan (Akamatsu et al., 2012; Takata et al., 2010)]; and Latino ancestry [a
GWAS of 1,034 cases and 1,046 controls from the MEC (Cheng et al., 2012)]. Details of
each study are provided in the Supplementary File S1 and Supplementary Table S1.
Genotyping of each study was performed using Illumina or Affymetrix GWAS arrays and
quality control procedures of each GWAS have been described previously and are
provided in Supplementary Table S2. Imputation was performed in each study using a
cosmopolitan reference panel from the 1000 Genomes Project (1KGP; March, 2012).
Across each region, genotyped SNPs, imputed SNPs, and insertion/deletion variants ≥1%
frequency were examined for association with prostate cancer risk. SNPs with an
imputation $r^2$ [‘info score’ (Howie et al., 2009)] <0.3 were not tested for association. The
vast majority of SNPs reported in this paper (581/666: 87%) had an $r^2\geq0.8$ in all studies.
This study was approved by the Institutional Review Board at the University of Southern
California.
6.5.2 Statistical Analysis
Here we focus on 69 of the 100 known risk variants (referred to as index SNPs) in 67
regions; exclusions include 23 regions/variants where multiethnic fine-mapping in this
sample has already been reported (Al Olama et al., 2014), 8q24 (Al Olama et al., 2009;
Amundadottir et al., 2006; Haiman et al., 2007; Takata et al., 2010; Yeager et al., 2009)
and 11q13 which harbor multiple independent risk variants (Chung et al., 2012; Chung et
al., 2011a; Zheng et al., 2009) and will be reported separately, and 19q13 (KLK3, a gene
that encodes PSA) which is a risk region for low-grade prostate cancer (Eeles et al., 2008;
Kote-Jarai et al., 2011a; Parikh et al., 2011). Regions that contain two index SNPs are
3p11-p12 (rs2055109/3p11 and rs2660753/3p12) and 4q22 (rs12500426 and
rs17021918). Within +/-500 kb of each index SNP, association testing with prostate
cancer risk was conducted within each study using unconditional logistic regression,
adjusted for global ancestry in an additive model. We summarized the ethnic-specific and
overall effect using a fixed-effect inverse-variance-weighted meta-analysis. For each
SNP, we report a per-allele odds ratio (OR), 95% confidence interval (CI), and a p-value
obtained from a 1-degree of freedom Wald test.
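As an illustration of the summary step described above, the following sketch pools per-population log odds ratios with fixed-effect inverse-variance weights and derives the OR, 95% CI and Wald p-value. It is a minimal sketch under stated assumptions, not the analysis code of the study: the function name `ivw_meta` and the input estimates are hypothetical, and the 1-degree-of-freedom Wald test is computed through its equivalent two-sided normal form.

```python
import math


def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis of log odds ratios.

    `betas` are per-study log(OR) estimates and `ses` their standard errors.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided normal p-value; identical to a 1-df Wald chi-square test on z**2.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    ci95 = (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se))
    return {"or": math.exp(beta), "ci95": ci95, "z": z, "p": p}


# Made-up per-population estimates for a single SNP (illustration only).
betas = [math.log(1.10), math.log(1.05), math.log(1.15), math.log(1.08)]
ses = [0.03, 0.04, 0.05, 0.09]
result = ivw_meta(betas, ses)
print(result)
```

Because the weights are the inverse variances, the larger populations dominate the pooled estimate, which always lies between the smallest and largest population-specific OR.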
In our multiethnic fine-mapping, we focused initially on SNPs that are correlated
($r^2\geq0.2$) with the index SNP in the racial/ethnic population in which the original
discovery was made, and are more statistically significant (>1 order of magnitude change
in p-value in the multiethnic sample). The $r^2$ threshold of 0.2 was lowered when
conditional analyses demonstrated that a more weakly correlated variant could
account for the association signal defined by the index variant. To determine statistical
significance within each region, we applied a region-specific threshold to correct for
multiple independent tests conducted for SNPs correlated ($r^2\geq0.2$) with the index SNP in
the original GWAS population. For each region, an empirically determined threshold
accounting for the number of correlated SNPs was estimated as 0.05 divided by the
number of tags that can capture all of the common tested SNPs [minor allele frequency
(MAF)≥0.05] at $r^2>0.8$ in AFR 1KGP. These region-specific significance cutoffs range
from $3.9\times10^{-4}$ to $5.6\times10^{-3}$ and are conservative given that African ancestry populations
often require more ‘tags’ and represent only approximately one-third of our study sample,
and thus, will reduce the number of false-positive signals. Of the 67 regions examined,
the significance of the index SNP or those correlated with the index SNP (at $r^2\geq0.2$)
surpassed region-specific thresholds for 55 regions (Supplementary Table S4). In these
regions we evaluated effect heterogeneity across racial/ethnic populations using
Cochran’s Q test and $I^2$ statistics (Higgins et al., 2003). Statistically significant
heterogeneous effects were defined as those with $p_{\mathrm{het}}<9.1\times10^{-4}$ (0.05/55 regions) and
$I^2>80.0\%$. For 15 regions in which the most associated SNP, or a SNP with nominal p-
value $<10^{-5}$, was weakly correlated with the index variant ($r^2<0.2$), we performed
conditional analysis with both SNPs in the same model to determine if a secondary signal
could be identified. A secondary signal was defined as significantly associated with
prostate cancer risk in the overall meta-analysis at $p<5.0\times10^{-6}$, with no impact on the
effect or degree of statistical significance of the index SNP or most significantly
associated marker in the region (Haiman et al., 2011a).
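The heterogeneity criterion used in this paragraph (Cochran's Q with a Bonferroni-corrected threshold of 0.05/55, combined with I² above 80%) can be sketched as follows. This is an illustrative implementation under stated assumptions: the function names and the input estimates are hypothetical, and the chi-square tail probability is computed with a closed form for integer degrees of freedom so the example needs only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))
    # Odd df: erfc term plus a finite sum of half-integer powers.
    extra = sum(half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * extra


def heterogeneity(betas, ses):
    """Return Cochran's Q, its p-value and I^2 (in percent) for per-population effects."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, chi2_sf(q, df), i2


ALPHA_HET = 0.05 / 55  # about 9.1e-4, the threshold stated in the text


def is_heterogeneous(betas, ses):
    """Apply both criteria from the text: p_het below the threshold and I^2 > 80%."""
    _, p_het, i2 = heterogeneity(betas, ses)
    return p_het < ALPHA_HET and i2 > 80.0


# Made-up log(OR) estimates for four populations (illustration only).
example_betas = [math.log(1.20), math.log(1.00), math.log(1.18), math.log(1.02)]
example_ses = [0.025, 0.03, 0.04, 0.07]
print(heterogeneity(example_betas, example_ses))
print(is_heterogeneous(example_betas, example_ses))
```

Requiring both conditions keeps a SNP from being flagged when Q is significant only because of very large sample sizes, or when I² is high but based on imprecise estimates.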
For each statistically significant region, we defined a “top-order” set of SNPs as those
with p-values within one order of magnitude change from the most-associated marker.
This definition was used for ease of implementation and interpretation. However, several
new statistical approaches for fine-mapping have been presented and rely on the
conversion of a marginal p-value to a posterior probability (Onengut-Gumuscu et al.,
2015), Bayes Factors, or scaled Bayes Factor (Wellcome Trust Case Control et al., 2012)
for re-ranking the SNPs and providing probabilistic interpretations for the final set of
SNPs selected (i.e. credible sets). The main advantage of these approaches is that they
incorporate the corresponding SNP variance, weighted relative to a prior variance, to re-
rank the SNPs. The re-ranking of the SNPs is dependent upon the specification of the
prior, and Wakefield (2009) discusses several options, including a prior with
effect size-MAF dependence and a prior equivalent to ranking via p-values.
Importantly, Wakefield (2007) notes that these priors implicitly
assume that the SNPs are independent; thus they may not be appropriate for fine-mapping
and may lead to false conclusions when relying on final probabilistic interpretations (i.e.
95% credible sets). More sophisticated hierarchical modeling approaches, such as
those in Conti and Witte (2003), may be necessary, but these lack ease
of implementation. Moreover, the major strength of our overall approach is the
leveraging of differential linkage disequilibrium across multiple ethnic groups to narrow
the set of SNPs. In this situation, it is unclear how sensitive final SNP rankings are to
various pre-specified priors. To account for the potential uncertainty in the ranking of
p-values due to effect estimation within each ethnic group, we performed a resampling
approach similar in spirit to the methods described in Zaitlen et al. (2010).
That is, for all candidate SNPs ($r^2 \geq 0.2$ with the index SNP in the original GWAS population), we
resampled the ethnic-specific effect estimates from a multivariate normal distribution
with a mean given by the estimated ethnic-specific marginal maximum likelihood
estimates (MLE) $\hat{\beta}$ and a structured covariance matrix with the diagonal elements equal to
the estimated variance for each marginal effect, $\hat{\sigma}^2_{\beta_i}$, and the off-diagonal elements equal
to the approximated covariance for two marginal effects, $\rho_{ij} \hat{\sigma}_{\beta_i} \hat{\sigma}_{\beta_j}$, where $\rho_{ij}$ is the
estimated pairwise correlation between SNP$_i$ and SNP$_j$ from 1KGP for the corresponding
racial/ethnic population. For each region, resampled ethnic-specific effect estimates were
then meta-analyzed in a fixed-effect model, weighted by the estimated variance. This
approach provides posterior probabilities (under a prior with a point mass at the MLE) for
the joint ranking of all candidate SNPs within a region and avoids assumptions of
independence.
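The top-set rule and the resampling scheme above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the analysis: all function names and input values are hypothetical, the fixed-effect meta-analysis is assumed to use inverse-variance weights, and the summary reported (how often each SNP is top-ranked across draws) is only one possible way to express the joint ranking. Only the standard library is used, so the multivariate normal draw is built from a hand-written Cholesky factor.

```python
import math
import random


def top_set(p_values):
    """SNPs whose p-value is within one order of magnitude of the smallest one."""
    cutoff = 10.0 * min(p_values.values())
    return sorted(snp for snp, p in p_values.items() if p <= cutoff)


def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def covariance(se, ld):
    """cov[i][j] = rho_ij * se_i * se_j; the diagonal is se_i**2 since rho_ii = 1."""
    n = len(se)
    return [[ld[i][j] * se[i] * se[j] for j in range(n)] for i in range(n)]


def draw_mvn(mean, chol, rng):
    """One multivariate normal draw: mean + L z with z standard normal."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(chol[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mean)]


def top_rank_frequencies(populations, n_draws=2000, seed=1):
    """Fraction of draws in which each SNP has the largest meta-analysis |z|."""
    rng = random.Random(seed)
    n_snps = len(populations[0]["beta"])
    chols = [cholesky(covariance(p["se"], p["ld"])) for p in populations]
    # Assumed inverse-variance weights, one per population and SNP.
    weights = [[1.0 / s ** 2 for s in p["se"]] for p in populations]
    wins = [0] * n_snps
    for _ in range(n_draws):
        draws = [draw_mvn(p["beta"], c, rng) for p, c in zip(populations, chols)]
        z_abs = []
        for i in range(n_snps):
            w_sum = sum(w[i] for w in weights)
            pooled = sum(w[i] * d[i] for w, d in zip(weights, draws)) / w_sum
            z_abs.append(abs(pooled) * math.sqrt(w_sum))  # |beta| / SE of pooled beta
        wins[z_abs.index(max(z_abs))] += 1
    return [w / n_draws for w in wins]


# Hypothetical inputs: two populations, three candidate SNPs.
populations = [
    {"beta": [0.30, 0.10, 0.05], "se": [0.03, 0.03, 0.03],
     "ld": [[1.0, 0.8, 0.2], [0.8, 1.0, 0.2], [0.2, 0.2, 1.0]]},
    {"beta": [0.28, 0.05, 0.02], "se": [0.04, 0.04, 0.04],
     "ld": [[1.0, 0.3, 0.1], [0.3, 1.0, 0.1], [0.1, 0.1, 1.0]]},
]
print(top_rank_frequencies(populations, n_draws=500))
print(top_set({"snp_a": 1e-10, "snp_b": 5e-10, "snp_c": 2e-9}))
```

Ranking by the absolute z-score is equivalent to ranking by the two-sided p-value. Note that a pair of perfectly correlated SNPs ($\rho_{ij} = 1$) makes the covariance matrix singular, in which case the Cholesky step in this sketch fails.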
6.5.3 Functional Annotation
For each region we mapped the most associated SNPs (Supplementary Tables S9 and
S10) to putative functional domains using bedtools software
(bedtools.readthedocs.org/en/latest/#) and in-house Python scripts to assemble a matrix of
positive overlaps. We used a number of publicly available prostate epithelia and PrCa
ENCODE datasets of chromatin features to identify putative enhancer/regulatory regions
in each risk region (Hazelett et al., 2014; Thurman et al., 2012). These datasets included
LNCaP and RWPE1 DNaseI HS sites (GSE32970) ENCODE; PrEC DNaseI HS sites
(GSE29692) ENCODE; LNCaP CTCF ChIP-seq peaks (GSE33213) ENCODE; LNCaP
H3K27ac and TCF7L2 (GSE51621) (Hazelett et al., 2014), H3K4me3 and H3K4me1
histone modification ChIP-seq peaks (GSE27823) (Wang et al., 2011); FoxA1 ChIP-seq
peaks (GSE28264) (Tan et al., 2012); Androgen Receptor (AR) ChIP-seq peaks (Andreu-
Vieyra et al., 2011) and AR binding sites (GSE28219) (Sharma et al., 2013); NKX3-1
ChIP-seq peaks (GSE28264) (Tan et al., 2012). We also included AR ChIP-seq data on 7
normal and 13 tumor prostate tissue samples (Pomerantz et al., submitted for publication).
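To illustrate what a matrix of positive overlaps could look like, the sketch below marks each SNP against each annotation track with 1 or 0. It is a plain-Python stand-in for the bedtools-based workflow, not a reproduction of the in-house scripts: the function names, track names, and coordinates are hypothetical, and intervals are assumed to follow the BED convention (0-based start, exclusive end).

```python
def overlaps(chrom, pos, intervals):
    """True if a 0-based position falls in any half-open interval (chrom, start, end)."""
    return any(c == chrom and start <= pos < end for c, start, end in intervals)


def overlap_matrix(snps, tracks):
    """Rows are SNPs, columns are annotation tracks; 1 marks a positive overlap."""
    return {
        name: {track: int(overlaps(chrom, pos, ivs)) for track, ivs in tracks.items()}
        for name, (chrom, pos) in snps.items()
    }


# Hypothetical coordinates and track names, for illustration only.
snps = {"snp_a": ("chr13", 1050), "snp_b": ("chr13", 5000)}
tracks = {
    "DNaseI_HS": [("chr13", 1000, 1100)],
    "H3K27ac": [("chr13", 900, 1200), ("chr13", 4900, 5000)],
}
matrix = overlap_matrix(snps, tracks)
for name, row in matrix.items():
    print(name, row)
```

Because the end coordinate is exclusive, the hypothetical `snp_b` at position 5000 does not overlap the interval ending at 5000.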
We subsequently classified the putative functionality of each SNP according to
the mapped features. These fall into four categories: promoter, enhancer, coding
disruption, and untranslated exonic region. For coding exons we used dbSNP (Sherry et
al., 2001) to assess the nature of disruptions in protein coding sequence. For 5' or 3' UTR
SNPs we reported overlap with miRcode highcons predicted target sequences (Jeggari,
Marks and Larsson, 2012). To assess potential disruptions of transcription factor response
elements, we performed motif analysis as previously described (Hazelett et al., 2014),
reporting motifs with a >85% match and 70% difference between the reference and effect
alleles in the position frequency matrix describing the motif. To calculate enrichment of
SNPs in enhancers we used the hypergeometric distribution as implemented in base
packages of R. The hypergeometric distribution measures the probability of k successes
in n draws without replacement from a finite population given that the entire makeup of
the population is known.
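The enrichment calculation can be sketched as an upper-tail hypergeometric probability. The snippet below is a Python illustration with hypothetical names and counts, not the R code used here; in base R the same tail is expected to correspond to `phyper(observed - 1, successes, population - successes, draws, lower.tail = FALSE)`.

```python
from math import comb


def hypergeom_enrichment_p(population, successes, draws, observed):
    """P(X >= observed) for draws made without replacement.

    population: total number of items (e.g. all SNPs considered)
    successes:  items in the population with the feature (e.g. in enhancers)
    draws:      number of items drawn (e.g. SNPs in the set of interest)
    observed:   number of drawn items that carry the feature
    """
    upper = min(draws, successes)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / comb(population, draws)


# Hypothetical counts: 10 SNPs, 4 in enhancers, 3 drawn, at least 2 in enhancers.
print(hypergeom_enrichment_p(10, 4, 3, 2))
```

Exact integer arithmetic through `math.comb` keeps the example free of external dependencies; for large counts a log-space implementation would be preferable.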
6.5.4 Cis-eQTL Analysis
For the most associated variants in each region we examined the associations with
expression of nearby genes in 145 prostate tumor samples from the TCGA database (Feb
2013). If a variant was not represented in the TCGA data, the genotypes were imputed
using IMPUTE2 (Howie et al., 2009). A cis-eQTL analysis was performed for these
variants and any transcript within a 1 Mb interval (500 kb on either side). Gene
expression values were adjusted for somatic copy number and CpG methylation as
previously described (Li et al., 2013). Each risk variant was corrected for the number of
transcripts in the interval. Significant associations were defined as a nominal p-value of
$< 9.1 \times 10^{-4}$, which is a Bonferroni correction for the number of regions examined (n=55).
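The cis window and the significance threshold can be illustrated with a few lines of Python. The names below are hypothetical, each transcript is assumed to be represented by a single coordinate (for example its start site), and the per-variant correction for the number of transcripts is deliberately left out because its exact form is not specified in the text.

```python
CIS_FLANK_BP = 500_000                 # 500 kb on either side of the variant
N_REGIONS = 55
REGION_THRESHOLD = 0.05 / N_REGIONS    # Bonferroni threshold, about 9.1e-4


def cis_transcripts(variant_pos, transcript_pos):
    """Names of transcripts whose assumed coordinate lies within the 1 Mb window."""
    return sorted(
        name for name, pos in transcript_pos.items()
        if abs(pos - variant_pos) <= CIS_FLANK_BP
    )


def is_significant(p_value):
    """Apply the region-level Bonferroni threshold to a nominal p-value."""
    return p_value < REGION_THRESHOLD


# Hypothetical variant position and transcript coordinates on one chromosome.
print(cis_transcripts(1_000_000, {"A": 600_000, "B": 1_500_000, "C": 1_500_001}))
print(round(REGION_THRESHOLD, 5), is_significant(5e-4), is_significant(1e-3))
```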
6.6 Acknowledgements
This work was supported by the National Institutes of Health (NIH), National Cancer
Institute (NCI) GAME-ON U19 initiative for prostate cancer (ELLIPSE, CA148537). The
BPC3 was supported by NIH/NCI cooperative agreements U01-CA98233 to D.J.H., U01-CA98710
to S.M.G., U01-CA98216 to E.R., and U01-CA98758 to B.E.H., and by the
Intramural Research Program of the NIH/National Cancer Institute, Division of Cancer
Epidemiology and Genetics. ATBC, PEGASUS/PLCO and the Ghana Prostate Study
were supported in part by the Intramural Research Program of the NIH and NCI.
Additionally, this research was supported by U.S. Public Health Service contracts N01-
CN-45165, N01-RC-45035, N01-RC-37004 and HHSN261201000006C from the NCI,
Department of Health and Human Services. CAPS was supported by the Cancer Risk
Prediction Center (CRisP; www.crispcenter.org), a Linneus Centre (Contract ID
70867902) financed by the Swedish Research Council, the Swedish Research Council (grant
no K2010-70X-20430-04-3), the Swedish Cancer Foundation (grant no 09-0677), the
Hedlund Foundation, the Söderberg Foundation, the Enqvist Foundation, ALF funds from
the Stockholm County Council, Stiftelsen Johanna Hagstrand och Sigfrid Linnér’s Minne, and
Karlsson’s Fund for urological and surgical research. The AAPC studies were supported
as follows: The MEC and the genotyping in AAPC were supported by NIH grants
CA63464, CA54281, CA1326792, CA148085 and HG004726. Functional annotation
was supported by grant CA136924. Genotyping of the PLCO samples was funded by the
Intramural Research Program of the Division of Cancer Epidemiology and Genetics, NCI,
NIH. LAAPC was funded by grant 99-00524V-10258 from the Cancer Research Fund,
under Interagency Agreement #97-12013 (University of California contract #98-00924V)
with the Department of Health Services Cancer Research Program. Cancer incidence data
for the MEC and LAAPC studies have been collected by the Los Angeles Cancer
Surveillance Program of the University of Southern California with Federal funds from
the NCI, NIH, Department of Health and Human Services, under Contract No. N01-PC-
35139, and the California Department of Health Services as part of the state-wide cancer
reporting program mandated by California Health and Safety Code Section 103885, and
grant number 1U58DP000807-3 from the Centers for Disease Control and Prevention.
KCPCS was supported by NIH grants CA056678, CA082664 and CA092579, with
additional support from the Fred Hutchinson Cancer Research Center and the Intramural
Program of the National Human Genome Research Institute. MDA was supported by grants
CA68578, ES007784, DAMD W81XWH-07-1-0645, and CA140388. GECAP was
supported by NIH grant ES011126. CaP Genes was supported by CA88164 and
CA127298. IPCG was supported by DOD grant W81XWH-07-1-0122. DCPC was
supported by NIH grant S06GM08016 and DOD grants DAMD W81XWH-07-1-0203,
DAMD W81XWH-06-1-0066 and DOD W81XWH-10-1-0532. CPS-II is supported by
the American Cancer Society. SELECT is funded by Public Health Service grants
CA37429 and 5UM1CA182883 from the National Cancer Institute. SCCS is funded by
NIH grant CA092447. SCCS sample preparation was conducted at the Epidemiology
Biospecimen Core Lab that is supported in part by the Vanderbilt Ingram Cancer Center
(CA68485). Data on SCCS cancer cases used in this publication were provided by the
Alabama Statewide Cancer Registry; Kentucky Cancer Registry; Tennessee Department
of Health, Office of Cancer Surveillance; Florida Cancer Data System; North Carolina
Central Cancer Registry, North Carolina Division of Public Health; Georgia
Comprehensive Cancer Registry; Louisiana Tumor Registry; Mississippi Cancer
Registry; South Carolina Central Cancer Registry; Virginia Department of Health,
Virginia Cancer Registry; Arkansas Department of Health, Cancer Registry. The
Arkansas Central Cancer Registry is fully funded by a grant from National Program of
Cancer Registries, Centers for Disease Control and Prevention (CDC). Data on SCCS
cancer cases from Mississippi were collected by the Mississippi Cancer Registry which
participates in the National Program of Cancer Registries (NPCR) of the Centers for
Disease Control and Prevention (CDC). The contents of this publication are solely the
responsibility of the authors and do not necessarily represent the official views of the
CDC or the Mississippi Cancer Registry. JAPC and LAPC were supported by NIH
grants CA63464, CA54281 and CA098758, the NIH/National Human Genome Research
Institute by U01 HG004726-01 and the Genome Coordinating Center was supported by
U01 HG004446. BBJ was supported by the Ministry of Education, Culture, Sports,
Sciences and Technology of the Japanese government, and this study was supported in
part by the Princess Takamatsu Cancer Research Fund and Takeda Science Foundation
Award (to H. Nakagawa). G.A. Coetzee was supported by grant R01 CA136924 and M.L.
Freedman was supported by grant R01 GM107427. We would like to thank all of the
men who took part in these studies.
Conflict of Interest Statement
None declared.
Figure 1. A Regional Association Plot of the Prostate Cancer Risk Region at
Chromosome 13q22.1. The $-\log_{10}$ p-values are from the multiethnic meta-analysis. The
index SNP (rs9600079), originally discovered in this Japanese sample (Takata et al.,
2010), is designated by a purple circle. The $r^2$ shown is estimated in Asians from the 1000
Genomes Project (ASN 1KGP) in relation to rs9600079. Gray circles are SNPs not in
ASN 1KGP ($r^2$ cannot be estimated). The top red circle represents the marker
(rs7327286) that is most strongly associated with risk in this region. The plot was
generated using LocusZoom (http://csg.sph.umich.edu/locuszoom/).
[Plot not reproduced. Panel: Multiethnic. Axes: $-\log_{10}$(p-value), 0 to 10; recombination rate (cM/Mb), 0 to 100; position on chr13 (Mb), 73.4 to 74.2. Color scale: $r^2$, 0.2 to 0.8. Labeled SNPs: rs9600079, rs7327286. Genes: MZT1, BORA, DIS3, PIBF1, KLF5, LINC00392. Region: 13q22.1.]
Figure 2. A Regional Association Plot of the Prostate Cancer Risk Region at
Chromosome 17q24.3. The $-\log_{10}$ p-values are from the multiethnic meta-analysis. The
index SNP (rs1859962), originally discovered in a European GWAS (Eeles et al., 2009),
is designated by a purple circle. The $r^2$ shown is estimated in Europeans from the 1000
Genomes Project (EUR 1KGP) in relation to rs1859962. Gray circles are SNPs not in
EUR 1KGP ($r^2$ cannot be estimated). The top red circle represents the marker
(rs6501436) that is most strongly associated with risk in this region. The plot was
generated using LocusZoom (http://csg.sph.umich.edu/locuszoom/).
[Plot not reproduced. Panel: Multiethnic. Axes: $-\log_{10}$(p-value), 0 to 15; recombination rate (cM/Mb), 0 to 100.]
!
!
!
!
! ! !
! !
!
! !
!
!
!
!
!
!
! !
!
!
! !
!
!
!
!
!
!
! ! !
!
!
!
!
! ! ! !
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! ! !
! !
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
! !
!
!
!
! !
!
!
!
!
!
!
! ! ! ! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
! !
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
! !
!
!
!
!
!
!
!
!
! ! ! !
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
! !
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! ! !
!
!
!
!
!
!
!
!
!
!
!
!
!
! ! !
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
! !
!
!
!
!
! !
!
! !
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! ! !
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
! ! !
!
!
!
! !
!
!
!
!
!
! !
! !
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
! !
!
!
!
!
!
!
!
!
!
!
! ! ! !
!
!
!
! !
!
!
!
!
!
! ! ! !
! !
!
!
! !
!
!
!
!
!
! !
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! ! !
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
! !
!
! ! !
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
! ! !
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
! !
!
!
!
!
!
!
!
!
! !
!
! !
!
!
!
!
! !
!
!
!
!
! ! ! !
!
!
!
!
!
! !
!
!
!
!
! !
!
! ! !
!
! ! !
!
! !
!
!
!
!
! ! ! !
!
!
! !
!
!
!
! !
!
!
!
!
!
! !
!
!
!
! ! ! !
!
! !
!
!
!
! ! ! ! !
! ! ! ! ! !
!
! ! !
!
! !
!
!
! !
!
!
!
!
!
!
!
! ! !
!
! !
!
!
! !
!
!
! !
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
! !
! !
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! ! !
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
! !
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! ! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
! !
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
! !
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! ! ! !
!
! !
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
! !
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
! !
!
!
!
!
!
!
!
! !
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
! !
!
! !
!
! !
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
! ! ! !
!
!
!
!
!
!
!
!
! !
!
! !
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
! !
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
! !
!
! !
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
! ! !
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
! !
!
!
!
! !
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
! !
!
!
! !
! !
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
! !
!
!
!
!
!
! !
!
! ! !
!
!
! ! !
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
! !
!
!
! !
!
!
! !
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
! !
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
! ! !
!
!
!
!
!
!
! !
!
!
!
!
! ! !
!
! !
!
!
! !
!
!
! ! !
!
!
!
!
!
!
! !
!
! ! !
!
! !
! ! !
!
!
! ! !
!
!
!
! ! !
!
!
! !
!
!
!
! !
!
!
! ! !
!
!
!
!
!
!
!
! !
!
!
! ! !
! ! ! !
! !
! ! ! ! !
! !
!
!
!
! !
!
!
!
! ! !
! !
!
! !
! ! !
!
!
!
!
!
!
! ! !
! !
!
!
! ! !
!
!
!
!
!
!
! !
! ! ! !
!
! ! !
! !
!
!
!
!
! ! ! !
! !
!
! ! ! ! ! ! ! ! ! !
!
! ! !
! !
!
!
! ! !
!
! !
!
! !
!
!
!
!
! ! !
!
!
! !
! ! !
!
! !
!
!
!
!
!
!
!
! !
!
! ! ! ! ! !
!
!
! ! !
! !
!
! !
! ! !
! ! !
!
!
!
!
!
!
!
!
!
!
!
! ! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
! !
!
! !
!
!
!
!
!
! !
! !
! ! !
! !
!
! !
!
!
!
! !
! !
!
!
!
!
!
! ! !
!
!
!
! ! !
!
!
!
!
!
!
!
!
!
! !
! !
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
! !
!
!
!
!
!
!
!
! !
!
!
!
! !
!
!
! ! !
!
!
!
!
!
!
!
!
!
!
! !
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! ! !
!
!
! ! !
!
! !
!
!
!
! !
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! ! ! !
! !
! ! !
!
!
!
!
! !
!
!
!
! !
!
!
!
!
!
!
!
!
!
! ! ! !
!
!
!
!
! !
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
! !
!
!
!
! !
!
!
!
! !
!
!
!
!
!
! !
!
! ! !
!
!
!
!
!
! !
!
!
!
! ! !
!
! ! !
!
!
!
!
!
! ! !
!
!
! ! ! !
! ! ! !
!
!
!
! ! !
!
!
!
! !
!
!
!
!
!
!
! !
!
!
!
!
!
!
! !
!
!
!
! !
!
!
!
! !
!
!
!
!
!
!
!
! !
!
!
!
!
!
! !
!
!
!
!
!
!
!
! ! !
!
!
!
!
!
!
! !
!
!
!
!
! !
! ! !
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
! ! !
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
! !
!
! !
!
!
! !
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! ! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
! !
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
! !
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
! ! !
!
!
!
! ! !
!
!
! !
!
! ! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
! !
!
! !
! ! !
!
!
!
! ! !
! !
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
! !
!
!
!
!
!
!
! !
!
! ! ! ! !
!
!
!
!
! !
!
!
!
! !
!
!
! ! !
! !
!
!
!
!
!
!
!
! !
! !
!
!
! !
! ! !
!
! ! !
!
! !
! !
!
!
!
!
!
!
!
!
!
!
!
! ! !
!
!
!
! ! !
!
!
!
!
!
!
!
!
!
!
!
!
! ! !
!
! ! !
!
! !
!
!
!
! ! ! ! ! !
!
! ! !
!
!
!
!
!
!
!
!
!
! !
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
! ! ! ! !
!
!
! !
!
!
!
!
!
! !
!
!
! !
!
!
!
!
! ! ! ! ! ! ! ! !
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
! ! !
!
!
! !
!
!
! !
! ! !
!
! ! ! ! !
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
[Regional association plot, 17q24.3. x-axis: position on chr17 (Mb), 68.8 to 69.6; plotted SNPs colored by r2 (0.2 to 0.8); gene track: CASC17; labeled SNPs: rs6501436, rs1859962.]
Figure 3. Genome Browser Views of Four Candidate Risk Regions. SNP locations are
shown relative to epigenetic features in LNCaP (and other) cell lines. Intervals for
significant peak regions are denoted with black rectangles for each dataset indicated at
left. SNPs are colored by order of p-value magnitude relative to the top SNP (lilac) in the
region: 1st order red, 2nd order green, 3rd order blue, all others black. Red dotted lines
guide the exact position of the best functional candidates relative to epigenetic features.
Gray lines guide the position of other SNPs. For the most notable SNPs in enhancer or
promoter regions an alignment of the surrounding DNA sequence to a response element
match is shown, with gray boxes indicating position of the SNP. (A) chromosome
10q24.32 region. (B) chromosome 10q2.13 region. (C) chromosome 12q13.13 region.
(D) chromosome 17p13.3 region.
Table 1. The index variants and most significantly associated markers in 30 known prostate cancer susceptibility regions.
Rows come in pairs: the index variant, with its region and ethnicity (a), followed by the most-associated marker, with its r2 with the index variant in EUR/AFR/ASN (b).
Variant | Region, ethnicity (a) / r2 with index in EUR/AFR/ASN (b) | Alleles (c) | Multiethnic OR (d) | Multiethnic P (e) | Multiethnic Phet (f) | European RAF (g) | European OR (d) | European P (e) | African RAF (g) | African OR (d) | African P (e) | Japanese RAF (g) | Japanese OR (d) | Japanese P (e) | Latino RAF (g) | Latino OR (d) | Latino P (e)
rs11902236 | 2p25.1, EUR | T/C | 1.01 | 0.72 | 0.055 | 0.27 | 1.06 | 0.036 | 0.62 | 0.98 | 0.56 | 0.11 | 0.92 | 0.13 | 0.24 | 0.94 | 0.4
rs7575106 | 0.20/0.08/----- | G/A | 1.11 | 0.0011 | 0.033 | 0.07 | 1.16 | 0.0013 | 0.09 | 1.11 | 0.038 | 0 | - | - | 0.06 | 0.80 | 0.099
rs13385191 | 2p24.1, ASN | G/A | 1.08 | 0.0003 | 0.043 | 0.24 | 1.03 | 0.24 | 0.06 | 1.00 | 0.99 | 0.56 | 1.16 | 6.1×10^-5 | 0.28 | 1.14 | 0.053
rs9306894 | 0.45/0.27/0.82 | G/A | 1.10 | 5.8×10^-8 | 0.21 | 0.37 | 1.08 | 0.0011 | 0.12 | 1.05 | 0.28 | 0.57 | 1.17 | 1.1×10^-5 | 0.32 | 1.12 | 0.1
rs1465618 | 2p21, EUR | T/C | 1.10 | 2.3×10^-6 | 0.013 | 0.22 | 1.08 | 0.0097 | 0.11 | 1.02 | 0.62 | 0.68 | 1.22 | 3.6×10^-7 | 0.41 | 1.04 | 0.6
rs13017478 | 0.17/0.02/0.73 | C/T | 1.10 | 7.4×10^-8 | 0.013 | 0.70 | 1.09 | 0.00091 | 0.80 | 1.07 | 0.081 | 0.71 | 1.23 | 4.4×10^-7 | 0.79 | 0.96 | 0.65
rs721048 | 2p15, EUR | A/G | 1.07 | 0.0091 | 0.21 | 0.18 | 1.05 | 0.13 | 0.04 | 1.12 | 0.15 | 0.05 | 1.01 | 0.94 | 0.17 | 1.24 | 0.0088
rs58235267 | 0.15/0.03/0.00 | G/C | 1.13 | 3.9×10^-12 | 0.73 | 0.51 | 1.12 | 1.7×10^-5 | 0.44 | 1.16 | 1.5×10^-6 | 0.73 | 1.11 | 0.013 | 0.54 | 1.09 | 0.2
rs10187424 | 2p11.2, EUR | T/C | 1.07 | 3.6×10^-5 | 0.24 | 0.58 | 1.04 | 0.12 | 0.36 | 1.07 | 0.016 | 0.63 | 1.10 | 0.0079 | 0.66 | 1.18 | 0.015
rs1561198 | 0.82/0.40/0.99 | C/T | 1.09 | 3.6×10^-7 | 0.69 | 0.54 | 1.08 | 0.0015 | 0.32 | 1.07 | 0.02 | 0.63 | 1.10 | 0.011 | 0.63 | 1.17 | 0.02
rs2660753 (h) | 3p12.1, EUR | T/C | 1.07 | 0.00057 | 5.1×10^-7 | 0.10 | 1.14 | 0.00047 | 0.50 | 0.95 | 0.061 | 0.27 | 1.20 | 4.7×10^-6 | 0.21 | 1.23 | 0.0071
rs76668454 | 0.37/0.02/0.28 | C/T | 1.29 | 4.6×10^-19 | 0.19 | 0.07 | 1.27 | 1.2×10^-7 | 0.05 | 1.15 | 0.041 | 0.16 | 1.34 | 1.3×10^-9 | 0.13 | 1.46 | 8.5×10^-5
rs2055109 (h) | 3p11.2, ASN | C/T | 1.08 | 0.00024 | 0.024 | 0.23 | 1.05 | 0.088 | 0.12 | 1.06 | 0.21 | 0.11 | 1.27 | 2.5×10^-5 | 0.17 | 1.07 | 0.43
rs76668454 | 0.06/0.00/0.30 | C/T | 1.29 | 4.6×10^-19 | 0.19 | 0.07 | 1.27 | 1.2×10^-7 | 0.05 | 1.15 | 0.041 | 0.16 | 1.34 | 1.3×10^-9 | 0.13 | 1.46 | 8.5×10^-5
rs7611694 | 3q13.2, EUR | A/C | 1.05 | 0.0028 | 0.068 | 0.58 | 1.08 | 0.0017 | 0.65 | 0.99 | 0.63 | 0.20 | 1.11 | 0.024 | 0.67 | 1.06 | 0.37
rs12629813 | 0.80/0.04/0.71 | C/T | 1.08 | 1.3×10^-5 | 0.97 | 0.57 | 1.08 | 0.0029 | 0.57 | 1.07 | 0.027 | 0.23 | 1.09 | 0.066 | 0.64 | 1.10 | 0.15
rs10934853 | 3q21.3, EUR | A/C | 1.06 | 0.00039 | 0.037 | 0.28 | 1.12 | 1.1×10^-5 | 0.71 | 1.04 | 0.24 | 0.50 | 1.01 | 0.7 | 0.40 | 0.97 | 0.67
rs4857837 | 0.90/0.18/0.11 | A/G | 1.13 | 2.1×10^-11 | 0.15 | 0.28 | 1.10 | 0.00053 | 0.30 | 1.16 | 1.2×10^-6 | 0.15 | 1.21 | 0.00015 | 0.29 | 1.04 | 0.56
rs10936632 | 3q26.2, EUR | A/C | 1.08 | 2.7×10^-5 | 0.22 | 0.51 | 1.11 | 1.0×10^-5 | 0.26 | 1.05 | 0.16 | 0.34 | 1.04 | 0.46 | 0.37 | 1.00 | 0.95
rs76925190 | 0.18/0.04/0.10 | A/C | 1.22 | 1.4×10^-13 | 0.15 | 0.81 | 1.28 | 1.8×10^-12 | 0.96 | 1.06 | 0.47 | 0.79 | 1.15 | 0.015 | 0.79 | 1.19 | 0.048
rs1894292 | 4q13.3, EUR | G/A | 1.05 | 0.002 | 0.3 | 0.52 | 1.07 | 0.006 | 0.68 | 1.03 | 0.42 | 0.66 | 1.08 | 0.032 | 0.64 | 0.96 | 0.55
rs4694176 | 0.53/0.04/0.55 | C/A | 1.07 | 3.9×10^-5 | 0.7 | 0.58 | 1.08 | 0.0017 | 0.79 | 1.10 | 0.0089 | 0.65 | 1.04 | 0.27 | 0.50 | 1.04 | 0.54
rs12500426 (i) | 4q22.3, EUR | A/C | 1.03 | 0.052 | 0.12 | 0.46 | 1.05 | 0.043 | 0.40 | 0.98 | 0.5 | 0.44 | 1.04 | 0.3 | 0.55 | 1.14 | 0.043
rs60063444 | 0.65/0.13/0.83 | T/C | 1.08 | 5.9×10^-5 | 0.79 | 0.40 | 1.07 | 0.0042 | 0.11 | 1.10 | 0.048 | 0.39 | 1.05 | 0.17 | 0.51 | 1.12 | 0.075
rs17021918 (i) | 4q22.3, EUR | C/T | 1.05 | 0.003 | 0.95 | 0.65 | 1.05 | 0.038 | 0.78 | 1.06 | 0.093 | 0.63 | 1.04 | 0.34 | 0.72 | 1.08 | 0.28
rs60063444 | 0.30/0.00/0.42 | T/C | 1.08 | 5.9×10^-5 | 0.79 | 0.40 | 1.07 | 0.0042 | 0.11 | 1.10 | 0.048 | 0.39 | 1.05 | 0.17 | 0.51 | 1.12 | 0.075
rs2242652 | 5p15.33, EUR | G/A | 1.12 | 5.9×10^-6 | 0.43 | 0.80 | 1.13 | 0.001 | 0.85 | 1.07 | 0.12 | 0.81 | 1.21 | 0.0035 | 0.85 | 1.15 | 0.22
rs7726159 | 0.43/0.59/0.27 | C/A | 1.11 | 1.7×10^-7 | 0.92 | 0.67 | 1.12 | 0.00013 | 0.81 | 1.11 | 0.0053 | 0.65 | 1.10 | 0.025 | 0.71 | 1.06 | 0.49
rs12653946 | 5p15.33, EUR | T/C | 1.13 | 6.8×10^-14 | 4.6×10^-6 | 0.42 | 1.10 | 8.2×10^-5 | 0.42 | 1.07 | 0.018 | 0.46 | 1.33 | 2.8×10^-15 | 0.49 | 1.02 | 0.76
rs4975758 | 0.82/0.49/0.96 | G/C | 1.15 | 2.2×10^-16 | 2.7×10^-5 | 0.47 | 1.11 | 5.0×10^-5 | 0.33 | 1.11 | 0.0012 | 0.46 | 1.34 | 1.3×10^-15 | 0.50 | 1.04 | 0.55
rs1933488 | 6q25.2, EUR | A/G | 1.03 | 0.082 | 0.28 | 0.58 | 1.06 | 0.02 | 0.56 | 0.99 | 0.81 | 0.19 | 1.05 | 0.26 | 0.58 | 0.97 | 0.67
rs13215045 | 0.70/0.32/0.32 | C/T | 1.06 | 0.0003 | 0.71 | 0.69 | 1.09 | 0.0016 | 0.56 | 1.05 | 0.081 | 0.38 | 1.03 | 0.37 | 0.72 | 1.06 | 0.45
rs6465657 | 7q21.3, EUR | C/T | 1.06 | 0.0045 | 0.15 | 0.46 | 1.09 | 0.00051 | 0.89 | 1.01 | 0.92 | 0.90 | 1.04 | 0.48 | 0.69 | 0.94 | 0.36
rs138101303 | 0.22/0.03/0.11 | A/G | 1.11 | 7.3×10^-5 | 0.12 | 0.71 | 1.17 | 8.2×10^-6 | 0.82 | 1.06 | 0.2 | 0.86 | 1.01 | 0.87 | 0.80 | 1.00 | 0.99
rs1512268 | 8p21.2, EUR | T/C | 1.17 | 2.5×10^-21 | 0.00012 | 0.43 | 1.10 | 6.8×10^-5 | 0.65 | 1.15 | 2.3×10^-6 | 0.38 | 1.34 | 1.3×10^-15 | 0.47 | 1.20 | 0.0037
rs1160267 | 0.99/0.64/0.99 | G/A | 1.18 | 4.5×10^-23 | 0.00017 | 0.43 | 1.10 | 4.3×10^-5 | 0.72 | 1.19 | 8.2×10^-8 | 0.38 | 1.34 | 1.2×10^-15 | 0.47 | 1.20 | 0.0041
rs817826 | 9q31.2, ASN | C/T | 1.04 | 0.11 | 0.36 | 0.14 | 1.02 | 0.58 | 0.30 | 1.07 | 0.027 | 0.06 | 0.97 | 0.64 | 0.11 | 0.94 | 0.51
rs1746824 | 0.02/0.38/0.25 | C/T | 1.06 | 0.00095 | 0.032 | 0.28 | 1.01 | 0.6 | 0.48 | 1.10 | 0.0012 | 0.29 | 1.03 | 0.44 | 0.31 | 1.22 | 0.0039
rs2252004 | 10q26.12, ASN | C/A | 1.08 | 0.00022 | 0.021 | 0.90 | 1.08 | 0.047 | 0.49 | 1.03 | 0.39 | 0.79 | 1.21 | 2.0×10^-5 | 0.70 | 1.05 | 0.48
rs77929344 | -----/0.00/0.48 | T/C | 1.31 | 9.3×10^-7 | 1 | 0 | - | - | <0.01 | - | - | 0.87 | 1.31 | 9.3×10^-7 | 0 | - | -
rs1938781 | 11q12.1, ASN | G/A | 1.08 | 1.2×10^-5 | 0.023 | 0.21 | 1.03 | 0.4 | 0.33 | 1.07 | 0.02 | 0.31 | 1.19 | 9.6×10^-6 | 0.24 | 1.13 | 0.091
rs12223473 | 0.02/0.16/0.50 | G/T | 1.18 | 4.6×10^-7 | 0.66 | <0.01 | - | - | 0.08 | 1.22 | 0.00012 | 0.19 | 1.15 | 0.0022 | 0.07 | 1.20 | 0.16
rs10875943 | 12q13.12, EUR | C/T | 1.06 | 0.0004 | 0.33 | 0.30 | 1.08 | 0.0049 | 0.63 | 1.02 | 0.45 | 0.81 | 1.13 | 0.0093 | 0.32 | 1.06 | 0.39
rs11168963 | 0.92/0.58/0.71 | G/A | 1.08 | 2.4×10^-5 | 0.91 | 0.29 | 1.07 | 0.0086 | 0.57 | 1.07 | 0.021 | 0.76 | 1.11 | 0.021 | 0.31 | 1.07 | 0.36
rs902774 | 12q13.13, EUR | A/G | 1.12 | 1.7×10^-5 | 0.0032 | 0.16 | 1.20 | 3.9×10^-8 | 0.08 | 0.97 | 0.55 | 0.06 | 1.21 | 0.19 | 0.14 | 1.00 | 0.97
rs55958994 | 0.82/0.01/----- | T/C | 1.21 | 2.8×10^-13 | 0.34 | 0.13 | 1.25 | 2.1×10^-10 | 0.14 | 1.17 | 0.00013 | 0 | - | - | 0.07 | 1.09 | 0.5
rs1270884 | 12q24.21, EUR | A/G | 1.08 | 2.3×10^-5 | 0.21 | 0.48 | 1.11 | 2.9×10^-5 | 0.20 | 1.03 | 0.47 | 0.20 | 1.03 | 0.49 | 0.38 | 1.14 | 0.046
rs10774740 | 0.41/0.21/0.59 | G/T | 1.10 | 2.2×10^-8 | 0.12 | 0.61 | 1.14 | 1.1×10^-7 | 0.32 | 1.06 | 0.062 | 0.31 | 1.04 | 0.27 | 0.53 | 1.14 | 0.039
rs9600079 | 13q22.1, ASN | T/G | 1.03 | 0.035 | 7.8×10^-5 | 0.44 | 1.02 | 0.41 | 0.53 | 0.97 | 0.22 | 0.39 | 1.19 | 1.1×10^-6 | 0.40 | 1.03 | 0.65
rs7327286 | 0.19/0.01/0.83 | G/A | 1.13 | 6.1×10^-10 | 0.19 | 0.18 | 1.09 | 0.0047 | 0.17 | 1.11 | 0.0071 | 0.40 | 1.20 | 3.3×10^-7 | 0.38 | 1.09 | 0.19
rs8008270 | 14q22.1, EUR | C/T | 1.11 | 3.0×10^-6 | 0.029 | 0.81 | 1.17 | 2.4×10^-7 | 0.72 | 1.04 | 0.17 | <0.01 | - | - | 0.89 | 1.06 | 0.56
rs62003517 | 0.98/0.01/1.00 | C/G | 1.15 | 2.0×10^-7 | 0.33 | 0.81 | 1.18 | 2.2×10^-7 | 0.95 | 1.05 | 0.44 | <0.01 | - | - | 0.90 | 1.15 | 0.18
rs11649743 | 17q12 A, EUR | G/A | 1.16 | 3.9×10^-12 | 0.2 | 0.80 | 1.15 | 3.2×10^-6 | 0.93 | 1.07 | 0.22 | 0.72 | 1.20 | 8.5×10^-6 | 0.83 | 1.30 | 0.0018
rs11658433 | 0.87/0.38/0.67 | A/C | 1.19 | 5.6×10^-14 | 0.55 | 0.79 | 1.16 | 1.9×10^-6 | 0.94 | 1.19 | 0.014 | 0.72 | 1.23 | 3.6×10^-6 | 0.82 | 1.30 | 0.0038
rs7501939 | 17q12 B, EUR | C/T | 1.16 | 2.5×10^-19 | 1.9×10^-6 | 0.61 | 1.20 | 1.5×10^-12 | 0.48 | 1.05 | 0.062 | 0.70 | 1.34 | 6.6×10^-14 | 0.68 | 1.04 | 0.52
rs11263763 | 0.71/0.18/0.91 | A/G | 1.20 | 3.9×10^-26 | 0.00031 | 0.54 | 1.22 | 2.9×10^-15 | 0.62 | 1.10 | 0.0028 | 0.70 | 1.35 | 4.6×10^-14 | 0.61 | 1.11 | 0.12
rs11650494 | 17q21.32, EUR | A/G | 1.10 | 0.00049 | 0.86 | 0.08 | 1.09 | 0.047 | 0.23 | 1.10 | 0.0035 | <0.01 | - | - | 0.06 | 1.02 | 0.86
rs111834333 | 0.87/0.18/----- | T/C | 1.15 | 6.0×10^-6 | 0.043 | 0.08 | 1.08 | 0.078 | 0.12 | 1.24 | 1.1×10^-6 | 0 | - | - | 0.06 | 0.97 | 0.84
rs1859962 | 17q24.3, EUR | G/T | 1.10 | 7.0×10^-9 | 9.8×10^-6 | 0.49 | 1.20 | 1.1×10^-13 | 0.29 | 1.00 | 0.91 | 0.27 | 1.01 | 0.79 | 0.61 | 1.14 | 0.041
rs6501436 | 0.96/0.08/0.94 | G/A | 1.14 | 1.5×10^-14 | 0.005 | 0.50 | 1.20 | 1.0×10^-13 | 0.22 | 1.13 | 0.00047 | 0.28 | 1.01 | 0.74 | 0.61 | 1.15 | 0.039
rs8102476 | 19q13.2, EUR | C/T | 1.07 | 4.5×10^-5 | 0.017 | 0.54 | 1.11 | 1.3×10^-5 | 0.76 | 1.09 | 0.015 | 0.37 | 0.97 | 0.34 | 0.50 | 1.07 | 0.28
rs11083450 | 0.77/0.58/0.51 | T/C | 1.09 | 7.7×10^-7 | 0.2 | 0.49 | 1.13 | 1.0×10^-6 | 0.67 | 1.06 | 0.04 | 0.31 | 1.03 | 0.38 | 0.42 | 1.03 | 0.63
rs9623117 | 22q13.1, EUR | C/T | 1.07 | 0.00079 | 0.13 | 0.21 | 1.09 | 0.0023 | 0.77 | 1.04 | 0.24 | 0.03 | 0.92 | 0.41 | 0.17 | 1.21 | 0.019
rs6001833 | 0.38/0.26/0.18 | A/G | 1.09 | 2.9×10^-5 | 0.86 | 0.25 | 1.08 | 0.0027 | 0.77 | 1.09 | 0.012 | <0.01 | - | - | 0.18 | 1.14 | 0.11
(a) The racial/ethnic population in which the index variant was initially identified.
(b) Pairwise correlation with the index variant in European (EUR), African (AFR) and Asian (ASN) populations from 1000 Genomes Project. Missing correlation as indicated by “-----” was due to at least one of the SNPs being monomorphic in the given population.
(c) Risk allele/reference allele. Risk allele is the allele associated with increased risk in the multiethnic sample.
(d) Per-allele odds ratio of the risk allele.
(e) P-value from the 1 d.f. Wald test of trend.
(f) P-value from 3 d.f. chi-squared test of heterogeneity across four populations.
(g) RAF, risk allele frequency in controls.
(h, i) Two index SNPs in the same region.
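The relationship between the population-specific columns of Table 1 and the multiethnic OR, P and Phet columns can be illustrated with a short sketch. It assumes, purely for illustration, a fixed-effect inverse-variance meta-analysis with Cochran's Q as the 3 d.f. heterogeneity statistic; the table itself does not state which estimator was used. Standard errors are not tabulated, so the sketch backs them out from the rounded OR and Wald P-values, which makes the results approximate. The function names (se_from_or_and_p, chi2_sf_3df, fixed_effect_meta) are hypothetical and not taken from the original analysis.

```python
import math
from statistics import NormalDist


def se_from_or_and_p(odds_ratio, p_value):
    """Recover the SE of log(OR) from a two-sided Wald p-value (approximate: inputs are rounded)."""
    beta = math.log(odds_ratio)
    if beta == 0.0:
        raise ValueError("an OR printed as 1.00 carries no usable SE information")
    # Use the lower tail so that very small p-values keep their precision.
    z = abs(NormalDist().inv_cdf(p_value / 2.0))
    return abs(beta) / z


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-squared variable with 3 d.f. (closed form)."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


def fixed_effect_meta(odds_ratios, p_values):
    """Inverse-variance pooling of four population-specific log odds ratios (assumed model)."""
    if len(odds_ratios) != 4 or len(p_values) != 4:
        raise ValueError("this sketch hard-codes four populations (3 d.f.)")
    betas = [math.log(o) for o in odds_ratios]
    weights = [1.0 / se_from_or_and_p(o, p) ** 2 for o, p in zip(odds_ratios, p_values)]
    total_w = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    z = pooled_beta * math.sqrt(total_w)
    p_meta = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided Wald p-value, 1 d.f.
    q = sum(w * (b - pooled_beta) ** 2 for w, b in zip(weights, betas))  # Cochran's Q
    return math.exp(pooled_beta), p_meta, q, chi2_sf_3df(q)


# rs1512268 (8p21.2): European, African, Japanese, Latino OR and P as printed in Table 1
pooled_or, p_meta, q, p_het = fixed_effect_meta(
    [1.10, 1.15, 1.34, 1.20], [6.8e-5, 2.3e-6, 1.3e-15, 0.0037]
)
print(f"pooled OR = {pooled_or:.2f}, P = {p_meta:.1e}, Q = {q:.1f}, P_het = {p_het:.1e}")
```

With the rounded entries for rs1512268, the pooled estimates come out close to the multiethnic row of Table 1, which is consistent with, but does not establish, a model of this kind. Rows in which a population is monomorphic (RAF of 0 or <0.01) or an OR is printed as 1.00 cannot be processed this way.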
References
Ahmadiyeh, N., Pomerantz, M. M., Grisanzio, C., et al. (2010). 8q24 prostate, breast, and
colon cancer risk loci show tissue-specific long-range interaction with MYC. Proc Natl
Acad Sci U S A 107, 9742-9746.
Akamatsu, S., Takata, R., Ashikawa, K., et al. (2010). A functional variant in NKX3.1
associated with prostate cancer susceptibility down-regulates NKX3.1 expression. Hum
Mol Genet 19, 4265-4272.
Akamatsu, S., Takata, R., Haiman, C. A., et al. (2012). Common variants at 11q12,
10q26 and 3p11.2 are associated with prostate cancer susceptibility in Japanese. Nat
Genet 44, 426-429, S421.
Al Olama, A. A., Kote-Jarai, Z., Berndt, S. I., et al. (2014). A meta-analysis of 87,040
individuals identifies 23 new susceptibility loci for prostate cancer. Nat Genet.
Al Olama, A. A., Kote-Jarai, Z., Giles, G. G., et al. (2009). Multiple loci on 8q24
associated with prostate cancer susceptibility. Nat Genet 41, 1058-1060.
Al Olama, A. A., Kote-Jarai, Z., Schumacher, F. R., et al. (2012). A meta-analysis of
genome-wide association studies to identify prostate cancer susceptibility loci associated
with aggressive and non-aggressive disease. Hum Mol Genet.
Amundadottir, L. T., Sulem, P., Gudmundsson, J., et al. (2006). A common variant
associated with prostate cancer in European and African populations. Nat Genet 38, 652-
658.
Andreu-Vieyra, C., Lai, J., Berman, B. P., et al. (2011). Dynamic nucleosome-depleted
regions at androgen receptor enhancers in the absence of ligand in prostate cancer cells.
Mol Cell Biol 31, 4648-4662.
Berndt, S. I., Sampson, J., Yeager, M., et al. (2011). Large-scale fine mapping of the
HNF1B locus and prostate cancer risk. Hum Mol Genet.
Boyle, A. P., Hong, E. L., Hariharan, M., et al. (2012). Annotation of functional variation
in personal genomes using RegulomeDB. Genome Res 22, 1790-1797.
Brawley, O. W. (2012). Prostate cancer epidemiology in the United States. World J Urol
30, 195-200.
Cheng, I., Chen, G. K., Nakagawa, H., et al. (2012). Evaluating genetic risk for prostate
cancer among Japanese and Latinos. Cancer Epidemiol Biomarkers Prev 21, 2048-2058.
Chung, C. C., Boland, J., Yeager, M., et al. (2012). Comprehensive resequence analysis
of a 123-kb region of chromosome 11q13 associated with prostate cancer. Prostate 72,
476-486.
Chung, C. C., Ciampa, J., Yeager, M., et al. (2011). Fine mapping of a region of
chromosome 11q13 reveals multiple independent loci associated with risk of prostate
cancer. Hum Mol Genet 20, 2869-2878.
Coetzee, G. A., Jia, L., Frenkel, B., et al. (2010). A systematic approach to understand
the functional consequences of non-protein coding risk regions. Cell Cycle 9, 256-259.
Coetzee, S. G., Rhie, S. K., Berman, B. P., Coetzee, G. A., and Noushmehr, H. (2012).
FunciSNP: an R/bioconductor tool integrating functional non-coding data sets with
genetic association studies to identify candidate regulatory SNPs. Nucleic Acids Res 40,
e139.
Conti, D. V., and Witte, J. S. (2003). Hierarchical modeling of linkage disequilibrium:
genetic structure and spatial relations. Am J Hum Genet 72, 351-363.
Cook, M. B., Wang, Z., Yeboah, E. D., et al. (2014). A genome-wide association study of
prostate cancer in West African men. Hum Genet 133, 509-521.
Cookson, W., Liang, L., Abecasis, G., Moffatt, M., and Lathrop, M. (2009). Mapping
complex disease traits with global gene expression. Nat Rev Genet 10, 184-194.
Diabetes Genetics Replication Meta-analysis Consortium, Asian Genetic Epidemiology
Network Type 2 Diabetes Consortium, South Asian Type 2 Diabetes Consortium, et al.
(2014). Genome-wide trans-ancestry meta-analysis provides insight into the genetic
architecture of type 2 diabetes susceptibility. Nat Genet 46, 234-244.
Duggan, D., Zheng, S. L., Knowlton, M., et al. (2007). Two genome-wide association
studies of aggressive prostate cancer implicate putative prostate tumor suppressor gene
DAB2IP. J Natl Cancer Inst 99, 1836-1844.
Eeles, R. A., Kote-Jarai, Z., Al Olama, A. A., et al. (2009). Identification of seven new
prostate cancer susceptibility loci through a genome-wide association study. Nat Genet
41, 1116-1121.
Eeles, R. A., Kote-Jarai, Z., Giles, G. G., et al. (2008). Multiple newly identified loci
associated with prostate cancer susceptibility. Nat Genet 40, 316-321.
Eeles, R. A., Olama, A. A., Benlloch, S., et al. (2013). Identification of 23 new prostate
cancer susceptibility loci using the iCOGS custom genotyping array. Nat Genet 45, 385-
391, 391e381-382.
Franceschini, N., van Rooij, F. J., Prins, B. P., et al. (2012). Discovery and fine mapping
of serum protein loci through transethnic meta-analysis. Am J Hum Genet 91, 744-753.
Grisanzio, C., Werner, L., Takeda, D., et al. (2012). Genetic and functional analyses
implicate the NUDT11, HNF1B, and SLC22A3 genes in prostate cancer pathogenesis.
Proc Natl Acad Sci U S A 109, 11252-11257.
Gudmundsson, J., Sulem, P., Gudbjartsson, D. F., et al. (2009). Genome-wide association
and replication studies identify four variants associated with prostate cancer
susceptibility. Nat Genet 41, 1122-1126.
Gudmundsson, J., Sulem, P., Rafnar, T., et al. (2008). Common sequence variants on
2p15 and Xp11.22 confer susceptibility to prostate cancer. Nat Genet 40, 281-283.
Haiman, C. A., Chen, G. K., Blot, W. J., et al. (2011a). Characterizing genetic risk at
known prostate cancer susceptibility loci in African Americans. PLoS Genet 7, e1001387.
Haiman, C. A., Chen, G. K., Blot, W. J., et al. (2011b). Genome-wide association study
of prostate cancer in men of African ancestry identifies a susceptibility locus at 17q21.
Nat Genet 43, 570-573.
Haiman, C. A., Patterson, N., Freedman, M. L., et al. (2007). Multiple regions within
8q24 independently affect risk for prostate cancer. Nat Genet 39, 638-644.
Han, Y., Signorello, L. B., Strom, S. S., et al. (2014). Generalizability of established
prostate cancer risk variants in men of African ancestry. Int J Cancer.
Hazelett, D. J., Rhie, S. K., Gaddis, M., et al. (2014). Comprehensive functional
annotation of 77 prostate cancer risk loci. PLoS Genet 10, e1004102.
Higgins, J. P., Thompson, S. G., Deeks, J. J., and Altman, D. G. (2003). Measuring
inconsistency in meta-analyses. BMJ 327, 557-560.
Howie, B. N., Donnelly, P., and Marchini, J. (2009). A flexible and accurate genotype
imputation method for the next generation of genome-wide association studies. PLoS
Genet 5, e1000529.
Huang, Q., Whitington, T., Gao, P., et al. (2014). A prostate cancer susceptibility allele at
6q22 increases RFX6 expression by modulating HOXB13 chromatin binding. Nat Genet
46, 126-135.
Jeggari, A., Marks, D. S., and Larsson, E. (2012). miRcode: a map of putative microRNA
target sites in the long non-coding transcriptome. Bioinformatics 28, 2062-2063.
Jia, L., Landan, G., Pomerantz, M., et al. (2009). Functional enhancers at the gene-poor
8q24 cancer-linked locus. PLoS Genet 5, e1000597.
Kote-Jarai, Z., Amin Al Olama, A., Leongamornlert, D., et al. (2011a). Identification of a
novel prostate cancer susceptibility variant in the KLK3 gene transcript. Hum Genet 129,
687-694.
Kote-Jarai, Z., Easton, D. F., Stanford, J. L., et al. (2008). Multiple novel prostate cancer
predisposition loci confirmed by an international study: the PRACTICAL Consortium.
Cancer Epidemiol Biomarkers Prev 17, 2052-2061.
Kote-Jarai, Z., Olama, A. A., Giles, G. G., et al. (2011b). Seven prostate cancer
susceptibility loci identified by a multi-stage genome-wide association study. Nat Genet
43, 785-791.
Kote-Jarai, Z., Saunders, E. J., Leongamornlert, D. A., et al. (2013). Fine-mapping
identifies multiple prostate cancer risk loci at 5p15, one of which associates with TERT
expression. Hum Mol Genet 22, 2520-2528.
Li, Q., Seo, J. H., Stranger, B., et al. (2013). Integrative eQTL-based analyses reveal the
biology of breast cancer risk loci. Cell 152, 633-641.
Li, Q., Stram, A., Chen, C., et al. (2014). Expression QTL-based analyses reveal
candidate causal genes and loci across five tumor types. Hum Mol Genet 23, 5294-5302.
Liu, C. T., Buchkovich, M. L., Winkler, T. W., et al. (2014). Multi-ethnic fine-mapping
of 14 central adiposity loci. Hum Mol Genet 23, 4738-4744.
Lou, H., Yeager, M., Li, H., et al. (2009). Fine mapping and functional analysis of a
common variant in MSMB on chromosome 10q11.2 associated with prostate cancer
susceptibility. Proc Natl Acad Sci U S A 106, 7933-7938.
Maurano, M. T., Humbert, R., Rynes, E., et al. (2012). Systematic localization of
common disease-associated variation in regulatory DNA. Science 337, 1190-1195.
Morris, A. P. (2011). Transethnic meta-analysis of genomewide association studies.
Genet Epidemiol 35, 809-822.
Nicolae, D. L., Gamazon, E., Zhang, W., Duan, S., Dolan, M. E., and Cox, N. J. (2010).
Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery
from GWAS. PLoS Genet 6, e1000888.
Onengut-Gumuscu, S., Chen, W. M., Burren, O., et al. (2015). Fine mapping of type 1
diabetes susceptibility loci and evidence for colocalization of causal variants with
lymphoid gene enhancers. Nat Genet 47, 381-386.
Ong, R. T., Wang, X., Liu, X., and Teo, Y. Y. (2012). Efficiency of trans-ethnic genome-
wide meta-analysis and fine-mapping. Eur J Hum Genet 20, 1300-1307.
Parikh, H., Wang, Z., Pettigrew, K. A., et al. (2011). Fine mapping the KLK3 locus on
chromosome 19q13.33 associated with prostate cancer susceptibility and PSA levels.
Hum Genet 129, 675-685.
Parker, S. C., Stitzel, M. L., Taylor, D. L., et al. (2013). Chromatin stretch enhancer
states drive cell-specific gene regulation and harbor human disease risk variants. Proc
Natl Acad Sci U S A 110, 17921-17926.
Pickrell, J. K. (2014). Joint analysis of functional genomic data and genome-wide
association studies of 18 human traits. Am J Hum Genet 94, 559-573.
Pomerantz, M. M., Shrestha, Y., Flavin, R. J., et al. (2010). Analysis of the 10q11 cancer
risk locus implicates MSMB and NCOA4 in human prostate tumorigenesis. PLoS Genet
6, e1001204.
Rhie, S. K., Hazelett, D. J., Coetzee, S. G., Yan, C., Noushmehr, H., and Coetzee, G. A.
(2014). Nucleosome positioning and histone modifications define relationships between
regulatory elements and nearby gene expression in breast epithelial cells. BMC Genomics
15, 331.
Schumacher, F. R., Berndt, S. I., Siddiq, A., et al. (2011). Genome-wide association
study identifies new prostate cancer susceptibility loci. Hum Mol Genet 20, 3867-3875.
Sharma, N. L., Massie, C. E., Ramos-Montoya, A., et al. (2013). The androgen receptor
induces a distinct transcriptional program in castration-resistant prostate cancer in man.
Cancer Cell 23, 35-47.
Sherry, S. T., Ward, M. H., Kholodov, M., et al. (2001). dbSNP: the NCBI database of
genetic variation. Nucleic Acids Res 29, 308-311.
Sun, J., Zheng, S. L., Wiklund, F., et al. (2009). Sequence variants at 22q13 are
associated with prostate cancer risk. Cancer Res 69, 10-15.
Sun, J., Zheng, S. L., Wiklund, F., et al. (2008). Evidence for two independent prostate
cancer risk-associated loci in the HNF1B gene at 17q12. Nat Genet 40, 1153-1155.
Takata, R., Akamatsu, S., Kubo, M., et al. (2010). Genome-wide association study
identifies five new susceptibility loci for prostate cancer in the Japanese population. Nat
Genet 42, 751-754.
Takayama, K. I., Suzuki, T., Fujimura, T., et al. (2014). CtBP2 Modulates the Androgen
Receptor to Promote Prostate Cancer Progression. Cancer Res.
Tan, P. Y., Chang, C. W., Chng, K. R., Wansa, K. D., Sung, W. K., and Cheung, E.
(2012). Integration of regulatory networks by NKX3-1 promotes androgen-dependent
prostate cancer survival. Mol Cell Biol 32, 399-414.
Teo, Y. Y., Ong, R. T., Sim, X., Tai, E. S., and Chia, K. S. (2010). Identifying candidate
causal variants via trans-population fine-mapping. Genet Epidemiol 34, 653-664.
Thurman, R. E., Rynes, E., Humbert, R., et al. (2012). The accessible chromatin
landscape of the human genome. Nature 489, 75-82.
Wakefield, J. (2007). A Bayesian measure of the probability of false discovery in genetic
epidemiology studies. Am J Hum Genet 81, 208-227.
Wakefield, J. (2009). Bayes factors for genome-wide association studies: comparison
with P-values. Genet Epidemiol 33, 79-86.
Wang, D., Garcia-Bassets, I., Benner, C., et al. (2011). Reprogramming transcription by
distinct classes of enhancers functionally defined by eRNA. Nature 474, 390-394.
Ward, L. D., and Kellis, M. (2012). HaploReg: a resource for exploring chromatin states,
conservation, and regulatory motif alterations within sets of genetically linked variants.
Nucleic Acids Res 40, D930-934.
Wellcome Trust Case Control Consortium, Maller, J. B., McVean, G., et al. (2012). Bayesian
refinement of association signals for 14 loci in 3 common diseases. Nat Genet 44, 1294-
1301.
Westra, H. J., and Franke, L. (2014). From genome to function by studying eQTLs.
Biochim Biophys Acta 1842, 1896-1902.
Wu, Y., Waite, L. L., Jackson, A. U., et al. (2013). Trans-ethnic fine-mapping of lipid
loci identifies population-specific signals and allelic heterogeneity that increases the trait
variance explained. PLoS Genet 9, e1003379.
Xu, J., Mo, Z., Ye, D., et al. (2012). Genome-wide association study in Chinese men
identifies two new prostate cancer risk loci at 9q31.2 and 19q13.4. Nat Genet 44, 1231-
1235.
Xu, X., Hussain, W. M., Vijai, J., et al. (2014). Variants at IRX4 as prostate cancer
expression quantitative trait loci. Eur J Hum Genet 22, 558-563.
Yeager, M., Chatterjee, N., Ciampa, J., et al. (2009). Identification of a new prostate
cancer susceptibility locus on chromosome 8q24. Nat Genet 41, 1055-1057.
Zaitlen, N., Pasaniuc, B., Gur, T., Ziv, E., and Halperin, E. (2010). Leveraging genetic
variability across populations for the identification of causal variants. Am J Hum Genet
86, 23-33.
Zheng, S. L., Stevens, V. L., Wiklund, F., et al. (2009). Two independent prostate cancer
risk-associated Loci at 11q13. Cancer Epidemiol Biomarkers Prev 18, 1815-1820.
Chapter 7 Statistical methodologies for multiethnic fine-mapping
7.1 Introduction
In the post-GWAS era, with the help of large-scale sequencing and imputation, fine-
mapping has been widely used to localize and prioritize candidate functional variants at
known risk regions for future experimental testing. As discussed in Chapter 6, combining
subjects from multiple populations presents both advantages and disadvantages for fine-
mapping. On the basis of a shared underlying causal allele across ethnicities, multiethnic
fine-mapping can take advantage of the naturally varying linkage disequilibrium (LD)
patterns to prioritize risk variants that generalize across populations. Multiethnic fine-
mapping has been demonstrated to increase power for uncovering the functional allele
and to reduce the number of candidate surrogates in simulation studies (Ong et al., 2012;
Teo et al., 2010; Zaitlen et al., 2010), likely due to leveraging different LD structures in
ethnically diverse populations. On the other hand, the heterogeneity of LD, allele
frequency, and potentially the underlying true genetic effect also makes it challenging to
perform trans-ethnic analyses and to statistically rank the candidate causal variants
surrounding a risk locus. Another challenge that researchers often face is that only study-
level summary statistics, rather than individual-level genotype and phenotype data, are
shared from studies conducted in multiple racial/ethnic populations.
To date, several methods for combining summary statistics have been proposed
and applied in multiethnic fine-mapping studies of disease (Amin Al Olama et al., 2015;
Diabetes Genetics Replication Meta-analysis Consortium et al., 2014; Han et al., 2015)
and quantitative traits (Franceschini et al., 2012; Liu et al., 2014; Wu et al., 2013). The
existing statistical methods include the traditional meta-analysis using fixed- and random-
effects models (DerSimonian and Laird, 1986; Stram, 1996), the modified random-effects
meta-analysis (Han and Eskin, 2011), and MANTRA in a Bayesian framework (Morris,
2011). In general, SNPs in a given region are first filtered by a certain degree of
correlation with the previously reported index SNP in the population where it was
initially identified, and then ranked based on p-value or Bayes’ Factor from one of the
above methods. A common feature of these methods is that only summary statistics are
required for combining data, which adds great convenience to large-scale collaborations.
In this chapter, we present a novel meta-analysis model that is tailored for
combining summary statistics from studies in multiethnic populations. Through
simulation, we compare the performance of the novel and existing methods in the
contexts of signal discovery and fine-mapping using multiethnic data. In addition, we
evaluate the influence of several key factors on fine-mapping, including causal allele
frequency, effect size, LD, sample size and multiethnic study design.
7.2 Methods
7.2.1 Quasi-simulation
We performed simulation to understand the influential factors for fine-mapping in a
conceptual region. Initially, genotypes of a fixed number of SNPs (n=20) were simulated
based on an autoregressive (AR-1) model, representing a realistic scenario that the
pairwise correlation between any two SNPs exponentially decays with distance. We
assumed that only one of the SNPs has a causal effect on the phenotype while all the
others are surrogates of the causal SNP. A binary outcome for each individual was
simulated from a logistic model based on the causal SNP’s genotype and effect size (refer
to section 7.2.3). The marginal association between each SNP and the outcome was fitted
in a log-additive model and then tested using a two-sided 1-d.f. Wald test. Through iterative
simulations (number of iterations=1,000), we measured the proportion of times that the
causal SNP is top-ranked based on p-value (referred to as fine-mapping “power”). A set
of parameters including the strength of LD (lag 1 autocorrelation coefficient [ρ], 0.3, 0.7,
0.9 and 0.99), risk allele frequency (RAF, 0.1, 0.2, 0.3, 0.4 and 0.5), sample size (N, 100,
500, 1000, 2000, 4000 and 8000), and effect size (odds ratio [OR], 1.1, 1.3, 1.5, 1.7 and
1.9) were varied in the simulation to understand how they influence fine-mapping power.
Intuitively, the number of surrogates in a region may also have an effect on the
fine-mapping power. We further performed quasi-simulation to understand this effect in
conjunction with the other parameters described above. Quasi-simulation means directly
sampling test statistics for all SNPs without simulating genotypes and outcomes. As
noted in previous papers (Han, Kang and Eskin, 2009; Zaitlen et al., 2010), the joint
distribution of Z statistics for all SNPs in a region is asymptotically multivariate normal
(MVN) specified by the mean vector and the variance-covariance matrix (see Appendix
0). The mean Z statistic for the causal SNP, denoted by its non-centrality parameter
(NCP), is based on OR, RAF and N (see Appendix 7.5.1) (Stram, 2013). The mean Z
statistic for any non-causal SNP is the product of its correlation coefficient (r) with the
causal SNP and the causal NCP (see Appendix 7.5.2). In terms of fine-mapping power,
quasi-simulation based on MVN generates results similar to those from genotype and
outcome simulation (data not shown), but in much less computational time. Here we
used a uniform correlation structure instead of AR-1 because the former does not depend
on the number of SNPs, which enables the independent evaluation of these parameters on
fine-mapping. The frequency of the causal SNP being top-ranked was calculated from 1,000
iterations with varying number of surrogates (n, 1, 3, 7, 15, 63, 127 and 255), pairwise
correlation coefficient (r, 0.3, 0.5, 0.7, 0.9 and 0.99), and NCP of the causal SNP (causal
NCP, 10, 20, 30, 40 and 50).
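The quasi-simulation described above can be illustrated with a minimal sketch. It is not the code used in this work: the function name and arguments are hypothetical, it takes the mean Z statistic of the causal SNP directly (the mapping from NCP to mean Z in Appendix 7.5.1 is not reproduced here), and it builds the uniform correlation structure from one shared normal draw plus a SNP-specific draw, which yields unit variances and pairwise correlation r.

```python
import math
import random


def quasi_sim_power(n_surrogates, r, causal_mean_z, n_iter=1000, seed=1):
    """Proportion of iterations in which the causal SNP has the largest |Z|.

    Z statistics have unit variance and a uniform pairwise correlation r
    (0 <= r < 1). The causal SNP has mean causal_mean_z; each surrogate has
    mean r * causal_mean_z.
    """
    rng = random.Random(seed)
    shared_sd = math.sqrt(r)       # loading on the component shared by all SNPs
    own_sd = math.sqrt(1.0 - r)    # loading on the SNP-specific component
    wins = 0
    for _ in range(n_iter):
        # One shared draw induces correlation r between every pair of SNPs.
        shared = rng.gauss(0.0, 1.0)
        z_causal = causal_mean_z + shared_sd * shared + own_sd * rng.gauss(0.0, 1.0)
        best_surrogate = max(
            abs(r * causal_mean_z + shared_sd * shared + own_sd * rng.gauss(0.0, 1.0))
            for _ in range(n_surrogates)
        )
        # Ranking by |Z| is equivalent to ranking by two-sided p-value.
        if abs(z_causal) > best_surrogate:
            wins += 1
    return wins / n_iter
```

For example, `quasi_sim_power(15, 0.3, 7.0)` and `quasi_sim_power(15, 0.99, 7.0)` contrast a weak-LD and a strong-LD region with the same signal strength.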
7.2.2 Genotype simulation
To simulate genotypes for individuals from multiethnic populations, we assumed that
there is one causal SNP per region and it is common (minor allele frequency [MAF] >
5%) in all populations. In order to better represent the complexity in real data, we
allowed the frequency and effect size of the causal allele to be different across
populations. Since the inter-population LD variation naturally differs from region to
region, a large number of regions are needed to evaluate the overall performance in
multiethnic fine-mapping. In the following, we randomly selected 100 causal variants on
autosomes (i.e., Chromosome 1-22) from the 1000 Genomes Project (1KGP) data that
meet the criteria: 1) MAF > 5% in European, African, and Asian populations in 1KGP; 2)
at least 100kb away from the edge of the chromosome; 3) at least 200kb away from any
other causal variant; 4) zero missing rate in 1KGP data. A region was defined as +/-25kb
from each causal variant, including most of the moderately to highly correlated
surrogates. The 100 regions contain a mean of 711 SNPs per region (range, 437-1,711).
We simulated genotypes for 10 studies, each having 1,000 individuals. Of the 10
studies, four are from European populations, three from African and three from Asian.
Within each study, genotypes of SNPs in the same region were simulated simultaneously
to retain the RAFs and LD structure from the corresponding population in 1KGP.
Specifically, for a set of $m$ SNPs, we sampled the haploid dosages jointly for all SNPs,
$X = (X_1, X_2, \ldots, X_m)$, from a multivariate normal (MVN) distribution
$$X \sim \mathrm{MVN}(\mu, \Sigma)$$
where $\mu = (\mu_1, \mu_2, \ldots, \mu_m)$ is the vector of RAFs and $\Sigma$ is the $m \times m$ correlation matrix in
which the $(i, j)$ entry is the correlation coefficient between the $i$th and $j$th SNPs. Both
$\mu$ and $\Sigma$ are estimated from the corresponding population in 1KGP. Then we converted
the haploid dosages to haploid genotypes, $G = (G_1, G_2, \ldots, G_m)$, by coding $G_i = 1$ if
$X_i > 0$ and $G_i = 0$ otherwise. To get diploid genotypes, we added up the vectors of
haploid genotypes from two independent runs of sampling and recoding (personal
communication with Zhao Yang).
7.2.3 Phenotype simulation and association testing
We assumed the phenotype of interest is a binary variable such as disease status (i.e., case
or control). The binary outcome of each individual was sampled from a binomial
distribution with the probability of being a case ($p$) calculated from the following logistic
model:
$$\log\frac{p}{1-p} = \alpha + \beta G$$
where $G$ is the genotype of the causal SNP, $\beta$ is an arbitrary log odds ratio of the causal
SNP, and $\alpha$ is an intercept largely reflecting case/control ratio. In our study, $\alpha$ was set to
zero in order to simulate a balanced number of cases and controls. Based on the simulated
outcomes, we examined the marginal association for each SNP in the region by fitting a
univariate logistic regression model. The regression coefficient and its standard error
were saved as summary statistics for further meta-analytical approaches. The entire
procedure was repeated 1,000 times by study and by region.
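As a rough illustration of one simulated study, the sketch below draws additive genotypes for a single SNP, samples a binary outcome from the logistic model with a zero intercept by default, and fits a univariate logistic regression by Newton-Raphson to obtain the regression coefficient, its standard error and a two-sided Wald p-value. All function names are hypothetical, and genotypes are drawn here from independent alleles at a given RAF rather than from the 1KGP-based procedure of section 7.2.2.

```python
import math
import random


def simulate_study(n, raf, odds_ratio, alpha=0.0, seed=1):
    """Simulate additive genotypes (0/1/2) and case/control status for n people."""
    rng = random.Random(seed)
    beta = math.log(odds_ratio)
    genotypes, outcomes = [], []
    for _ in range(n):
        g = (rng.random() < raf) + (rng.random() < raf)  # two independent alleles
        p_case = 1.0 / (1.0 + math.exp(-(alpha + beta * g)))
        genotypes.append(g)
        outcomes.append(1 if rng.random() < p_case else 0)
    return genotypes, outcomes


def fit_logistic(genotypes, outcomes, n_steps=25):
    """Newton-Raphson fit of logit(p) = a + b * g; returns (b, SE(b))."""
    a = b = 0.0
    var_b = float("nan")
    for _ in range(n_steps):
        # Score vector (u0, u1) and information matrix [[i00, i01], [i01, i11]].
        u0 = u1 = i00 = i01 = i11 = 0.0
        for g, y in zip(genotypes, outcomes):
            p = 1.0 / (1.0 + math.exp(-(a + b * g)))
            w = p * (1.0 - p)
            u0 += y - p
            u1 += (y - p) * g
            i00 += w
            i01 += w * g
            i11 += w * g * g
        det = i00 * i11 - i01 * i01
        a += (i11 * u0 - i01 * u1) / det
        b += (i00 * u1 - i01 * u0) / det
        # Variance of b from the inverse information; once the iteration has
        # converged this is evaluated at (numerically) the final estimates.
        var_b = i00 / det
    return b, math.sqrt(var_b)


def wald_p_value(b, se):
    """Two-sided p-value of the 1-d.f. Wald test of b = 0."""
    return math.erfc(abs(b / se) / math.sqrt(2.0))
```

The pair returned by `fit_logistic` plays the role of the per-study summary statistics that the meta-analytical approaches take as input.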
7.2.4 Meta-analytical approaches
We tested the marginal association between each SNP and the outcome through
combining the summary statistics from the 10 studies. In general, for any given SNP in
study $k$, the effect estimate $\hat{\beta}_k$ follows a normal distribution centering on the true effect
size $\beta_k$, if the sample size is sufficiently large. Specifically,
$$\hat{\beta}_k = \beta_k + \epsilon_k, \quad k = 1, 2, \ldots, K$$
where the error term $\epsilon_k \sim N(0, v_k)$. $v_k$ can be estimated by the sample variance of $\hat{\beta}_k$,
which is equal to the squared standard error $SE(\hat{\beta}_k)^2$. Based on the summary statistics
$(\hat{\beta}_k, SE(\hat{\beta}_k))$ instead of the original data, the $K$ studies can be combined by several
meta-analytical approaches. In this section, we present six (including two novel) methods with
brief statistics; detailed derivations can be found in Appendix 7.5.4.
7.2.4.1 Fixed-effect meta-analysis (FE)
FE assumes the underlying true effect is fixed and common across all studies
($\beta_k = \beta \ \forall k$). The maximum likelihood estimate (MLE) of the true effect size,
$$\hat{\beta} = \frac{\sum_{k=1}^{K} \hat{\beta}_k / v_k}{\sum_{k=1}^{K} 1 / v_k}$$
is an inverse-variance-weighted effect size estimator. The likelihood ratio test (LRT)
statistic for testing whether the true effect $\beta$ is zero,
$$S_{FE} = \frac{\left(\sum_{k=1}^{K} \hat{\beta}_k / v_k\right)^2}{\sum_{k=1}^{K} 1 / v_k}$$
follows a $\chi^2$ distribution with 1 degree of freedom (df).
7.2.4.2 Random-effects meta-analysis (RE)
RE assumes the true effect size in each study is sampled from a normal distribution with
mean $\mu$ and variance $\tau^2$, allowing it to be different across studies. As in FE, the
MLE of $\mu$ is an inverse-variance-weighted estimator
$$\hat{\mu} = \frac{\sum_{k=1}^{K} \hat{\beta}_k / (v_k + \hat{\tau}^2)}{\sum_{k=1}^{K} 1 / (v_k + \hat{\tau}^2)}$$
accounting for the additional between-study variance $\tau^2$, which can be estimated by the
method of moments (DerSimonian and Laird, 1986). The LRT statistic for testing
whether the mean effect $\mu$ is zero,
$$S_{RE} = \frac{\left(\sum_{k=1}^{K} \hat{\beta}_k / (v_k + \hat{\tau}^2)\right)^2}{\sum_{k=1}^{K} 1 / (v_k + \hat{\tau}^2)}$$
follows a $\chi^2$ distribution with 1 df.
7.2.4.3 Modified random-effects meta-analysis (RE2)
RE2 is a hybrid approach of FE and RE (Han and Eskin, 2011). Under the null
hypothesis, it adopts the assumptions of FE in which the true effect is zero with no
heterogeneity across studies ($\beta_k = 0 \ \forall k$); under the alternative hypothesis, it adopts the
assumptions of RE in which the study-specific effects are sampled from a probability
distribution ($\beta_k \sim N(\mu, \tau^2)$). The MLE of $\mu$ and $\tau^2$ can be obtained by an iterative
approach (Hardy and Thompson, 1996):
$$\hat{\mu}_{(n+1)} = \frac{\sum_{k=1}^{K} \hat{\beta}_k / (v_k + \hat{\tau}^2_{(n)})}{\sum_{k=1}^{K} 1 / (v_k + \hat{\tau}^2_{(n)})}$$
$$\hat{\tau}^2_{(n+1)} = \frac{\sum_{k=1}^{K} \left[(\hat{\beta}_k - \hat{\mu}_{(n+1)})^2 - v_k\right] / (v_k + \hat{\tau}^2_{(n)})^2}{\sum_{k=1}^{K} 1 / (v_k + \hat{\tau}^2_{(n)})^2}$$
Thus, the LRT statistic is
$$S_{RE2} = \sum_{k=1}^{K} \log\frac{v_k}{v_k + \hat{\tau}^2} + \sum_{k=1}^{K} \frac{\hat{\beta}_k^2}{v_k} - \sum_{k=1}^{K} \frac{(\hat{\beta}_k - \hat{\mu})^2}{v_k + \hat{\tau}^2}$$
which tests for the mean effect and heterogeneity jointly. When the number of studies is
large, the LRT statistic asymptotically follows an equal mixture of 1 df $\chi^2$ distribution
and 2 df $\chi^2$ distribution. When the number of studies is small, the asymptotic p-value is
overly conservative; instead, Han and Eskin provided tabulated p-values with reasonable
accuracy up to $10^{-8}$.
We used the open-source software METASOFT (Han and Eskin, 2011), available
for download from http://genetics.cs.ucla.edu/meta, to conduct FE, RE and RE2.
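For orientation, the iterative estimation and the RE2 statistic can be sketched as below. This is an illustrative reimplementation and not METASOFT: the function name, starting values, convergence rule and the truncation of the between-study variance at zero are assumptions, and the p-value (asymptotic mixture or tabulated) is deliberately not computed.

```python
import math


def re2_statistic(betas, variances, n_iter=200, tol=1e-10):
    """Iterative estimates of mu and tau^2, followed by the RE2 LRT statistic.

    betas: per-study effect estimates; variances: squared standard errors.
    """
    mu, tau2 = 0.0, 0.0
    for _ in range(n_iter):
        w = [1.0 / (v + tau2) for v in variances]
        mu_new = sum(wi * b for wi, b in zip(w, betas)) / sum(w)
        num = sum(wi * wi * ((b - mu_new) ** 2 - v)
                  for wi, b, v in zip(w, betas, variances))
        # Assumption: a negative variance update is truncated to zero.
        tau2_new = max(0.0, num / sum(wi * wi for wi in w))
        converged = abs(mu_new - mu) < tol and abs(tau2_new - tau2) < tol
        mu, tau2 = mu_new, tau2_new
        if converged:
            break
    stat = (sum(math.log(v / (v + tau2)) for v in variances)
            + sum(b * b / v for b, v in zip(betas, variances))
            - sum((b - mu) ** 2 / (v + tau2) for b, v in zip(betas, variances)))
    return mu, tau2, stat
```

If the estimated between-study variance is zero, the first term vanishes and the statistic reduces to the fixed-effect statistic, which reflects the hybrid nature of RE2.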
7.2.4.4 Heterogeneous fixed-effect meta-analysis (HFE)
We developed HFE based on FE, as they share the same assumption of $\beta_k = 0 \ \forall k$ under
the null hypothesis. Under the alternative hypothesis, HFE still assumes fixed effect sizes
(i.e., not sampled from a probability distribution as in random-effects models) but allows
for effect heterogeneity across studies. The LRT statistic for testing whether any of the
study-specific effects is non-zero,
$$S_{HFE} = \sum_{k=1}^{K} \frac{\hat{\beta}_k^2}{v_k}$$
follows a $\chi^2$ distribution with $K$ df.
7.2.4.5 Group-heterogeneous fixed-effect meta-analysis (GHFE)
We further developed GHFE based on HFE by modifying the assumptions under the
alternative hypothesis. GHFE assumes a common effect size for studies from the same
ethnic group while allowing for effect heterogeneity among different groups. Suppose the
$K$ studies are from a total of $C$ groups ($1 \le C \le K$), the tessellation assigning studies to
groups is then given by
$$T_{kc} = \begin{cases} 1 & \text{if study } k \text{ is in group } c \\ 0 & \text{otherwise} \end{cases}$$
The LRT statistic for testing whether any of the group-specific effects is non-zero,
$$S_{GHFE} = \sum_{c=1}^{C} \frac{\left(\sum_{k=1}^{K} T_{kc} \hat{\beta}_k / v_k\right)^2}{\sum_{k=1}^{K} T_{kc} / v_k}$$
follows a $\chi^2$ distribution with $C$ df. In the special case that $C = 1$, GHFE is equivalent to
FE; when $C = K$, GHFE is identical to HFE.
We implemented HFE and GHFE in the programming language R.
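The two statistics are simple enough to sketch in a few lines. The version below is written in Python purely for illustration and is not the R implementation mentioned above; the function names are hypothetical, and the tessellation is passed as a list of group labels (one per study) instead of an indicator matrix.

```python
def hfe_statistic(betas, variances):
    """HFE statistic (sum of squared per-study Z scores) and its df."""
    stat = sum(b * b / v for b, v in zip(betas, variances))
    return stat, len(betas)


def ghfe_statistic(betas, variances, groups):
    """GHFE statistic and its df; groups[k] is the group label of study k."""
    sums = {}
    for b, v, g in zip(betas, variances, groups):
        wb, w = sums.get(g, (0.0, 0.0))
        sums[g] = (wb + b / v, w + 1.0 / v)
    # One fixed-effect statistic per group, summed over groups.
    stat = sum(wb * wb / w for wb, w in sums.values())
    return stat, len(sums)
```

Passing a single label for all studies reproduces the FE statistic with 1 df, and passing a distinct label for every study reproduces the HFE statistic, matching the two special cases of GHFE.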
7.2.4.6 Meta-analysis of transethnic association studies (MANTRA)
Unlike the above frequentist methods, MANTRA was built in a Bayesian framework
(Morris, 2011). Given the same tessellation, MANTRA can be thought of as a Bayesian
implementation of GHFE due to their similar assumptions, with the only difference being
that the group-specific effects are fixed in GHFE but have a prior Gaussian distribution in
MANTRA. Let $M_0$ and $M_1$ denote the null and alternative model, respectively. The
evidence in favor of the alternative model can be measured by the Bayes’ Factor,
$$\Lambda = \frac{f(\hat{\beta}, v \mid M_1)}{f(\hat{\beta}, v \mid M_0)}$$
where $f(\hat{\beta}, v \mid M)$ denotes the marginal likelihood of the observed effects in $K$ studies
(i.e., $\hat{\beta} = (\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_K)$ and $v = (v_1, v_2, \ldots, v_K)$) under model $M$. As shown later, this
marginal likelihood can be evaluated through the joint posterior density of the unknown
model parameters $\theta$, which include the number of studies $K$, the number of groups $C$, the
group-specific effects $\beta = (\beta_1, \beta_2, \ldots, \beta_C)$, the mean and variance of the prior Gaussian
distribution for $\beta$. Under a specific model $M$, the posterior density of $\theta$ is proportional to
the likelihood of the observed effects given $\theta$ multiplied by the prior density of $\theta$,
$$f(\theta \mid \hat{\beta}, v, M) \propto f(\hat{\beta}, v \mid \theta) \, f(\theta \mid M)$$
and it can be approximated by a Metropolis–Hastings MCMC algorithm. The marginal
likelihood in Bayes’ Factor estimation is then given by integrating the posterior density
of $\theta$ over the parameter space
$$f(\hat{\beta}, v \mid M) \propto \int_{\theta} f(\hat{\beta}, v \mid \theta) \, f(\theta \mid M) \, d\theta \propto \int_{\theta} f(\theta \mid \hat{\beta}, v, M) \, d\theta$$
An estimate of the Bayes’ Factor can be obtained from two independent runs of the
MCMC algorithm, once each under model $M_0$ and $M_1$.
We used the MANTRA software (Morris, 2011) to determine the tessellation
based on allele frequency dissimilarity among studies and then to perform the meta-
analysis in a Bayesian framework.
7.2.5 Performance evaluation
We evaluated the above methods under five different scenarios of effect heterogeneity
using a model of $\beta_k = r_k \log(\mathrm{OR})$, where $\beta_k$ is the log odds ratio in study $k$, OR represents
the baseline odds ratio of the causal SNP, and the vector $\{r_k\} = (r_1, r_2, \ldots, r_{10})$ controls the
level of heterogeneity (Table 1). Assuming a baseline OR of 1.3, the variance of study-
specific ORs is equal to 0 under no heterogeneity, 0.008 under moderate inter-study
heterogeneity, 0.033 under extreme inter-study heterogeneity, 0.012 under moderate
inter-population heterogeneity, and 0.024 under extreme inter-population heterogeneity.
Table 1. Five different scenarios of effect heterogeneity.
Effect heterogeneity (r)     Study (k)
                             1     2     3     4     5     6     7     8     9     10
None                         1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0
Moderate inter-study         1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2   0.1
Extreme inter-study          1.0   0.9   0.8   0.7   0.6  -0.2  -0.3  -0.4  -0.5  -0.6
Moderate inter-population    1.0   1.0   1.0   1.0   0.6   0.6   0.6   0.2   0.2   0.2
Extreme inter-population     1.0   1.0   1.0   1.0   0.6   0.6   0.6  -0.2  -0.2  -0.2
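The heterogeneity model can be made concrete with a short sketch that converts each row of Table 1 into study-specific odds ratios and computes their variance. The names below are hypothetical, and the use of the sample variance (denominator n - 1) is an assumption about how the variances quoted in the text were obtained.

```python
import math
import statistics

# Multipliers r_k for studies 1..10, copied from Table 1.
SCENARIOS = {
    "none": [1.0] * 10,
    "moderate inter-study": [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
    "extreme inter-study": [1.0, 0.9, 0.8, 0.7, 0.6, -0.2, -0.3, -0.4, -0.5, -0.6],
    "moderate inter-population": [1.0, 1.0, 1.0, 1.0, 0.6, 0.6, 0.6, 0.2, 0.2, 0.2],
    "extreme inter-population": [1.0, 1.0, 1.0, 1.0, 0.6, 0.6, 0.6, -0.2, -0.2, -0.2],
}


def study_odds_ratios(r, baseline_or):
    """Study-specific ORs implied by beta_k = r_k * log(baseline OR)."""
    log_or = math.log(baseline_or)
    return [math.exp(rk * log_or) for rk in r]


def or_variance(r, baseline_or=1.3):
    """Sample variance of the study-specific ORs for one scenario."""
    return statistics.variance(study_odds_ratios(r, baseline_or))
```

Calling `or_variance(SCENARIOS[name])` for each scenario at the baseline OR of 1.3 gives values close to those quoted in the text.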
Under each scenario, we performed outcome simulation followed by association
testing in each study (refer to section 7.2.3), and then combined the summary statistics
from the 10 studies using the methods described in section 7.2.4. For the frequentist
methods (i.e., meta-analysis with different models), we examined the false positive rate
and the statistical power for signal detection. For all methods including MANTRA, we
compared their performance in multiethnic fine-mapping by looking at 1) the fine-
mapping power, which is defined as the proportion of times that the causal variant is top-
ranked among 100 iterations; 2) the mean rank of the causal variant; and 3) the mean
percentile rank of the causal variant. SNPs in the same region were ranked by decreasing
statistical significance, namely, increasing p-value in meta-analysis or decreasing Bayes’
Factor in MANTRA. Each of the above metrics was averaged over 100 regions.
7.2.6 Multiethnic study design
After comparing different statistical methods in multiethnic fine-mapping, we further
evaluated different study designs through quasi-simulation. Specifically, a fixed number
of individuals (N=12,000) from one single population or multiple ethnically diverse
populations were considered in the study design (Table 2). An equal number of cases and
controls were assumed in each population.
Table 2. Study designs of using multiple racial/ethnic populations in fine-mapping.
Study Design         Sample Size
                     ASN       EUR       AFR
ASN                  12,000    0         0
EUR                  0         12,000    0
AFR                  0         0         12,000
ASN + EUR            6,000     6,000     0
ASN + AFR            6,000     0         6,000
EUR + AFR            0         6,000     6,000
ASN + EUR + AFR      4,000     4,000     4,000
ASN = Asian; EUR = European; AFR = African.
To compare the fine-mapping performance of different study designs, we
performed quasi-simulation for the 100 causal SNPs described in section 7.2.2. For each
causal SNP, we assumed a common OR of 1.2 across populations, adopted the RAFs
from the corresponding populations in 1KGP, and considered all non-causal SNPs less
than 100kb away with LD estimated in 1KGP. For one single population, the NCP of the
causal SNP depends on OR, RAF and N; the NCPs of non-causal SNPs depend on LD
and the causal NCP (refer to section 7.2.1). The sampling of Z statistics from MVN was
repeated 1,000 times in each population. For study designs composed of more than one
population, data were combined within each iteration through a weighted sum of Z
statistics (Evangelou and Ioannidis, 2013)
$$Z_{meta} = \frac{\sum_i w_i Z_i}{\sqrt{\sum_i w_i^2}}$$
where the weight $w_i$ is the square root of the sample size of the $i$th population. This
approach is approximately equivalent to fixed-effect meta-analysis based on regression
coefficients and standard errors (de Bakker et al., 2008). SNPs in the same region were
ranked by decreasing absolute value of Z statistic, which is essentially the same as being
ranked by increasing p-value in meta-analysis.
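The sample-size-weighted combination of Z statistics can be sketched as follows; the function name is illustrative, and the weights are the square roots of the population sample sizes as described above.

```python
import math


def weighted_z_meta(z_scores, sample_sizes):
    """Combine per-population Z statistics using sqrt(sample size) weights."""
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator
```

For instance, two populations of 6,000 individuals that each yield Z = 2 combine to a Z statistic of 2 times the square root of 2, whereas a single population is returned unchanged.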
As before, fine-mapping power is defined as the proportion of times that the
causal variant was top-ranked over 1,000 iterations. Across 100 regions, we calculated
the mean power for each study design as well as the fraction of times that it achieves the
maximal power among all designs. In addition, we evaluated the average rank and
percentile rank of the causal variant to make a comprehensive comparison between the
study designs.
7.3 Results
7.3.1 Key factors that influence fine-mapping power
Through simulation, we evaluated how the effect size (OR) and risk allele frequency
(RAF) of the causal SNP, the sample size (N), the strength of LD, and the number of
surrogates in the same region would influence the frequency of causal SNP being top-
ranked (referred to as “fine-mapping power” throughout the text).
First, we fixed the number of SNPs (n=20) in a region and varied the other
parameters (Figure 1). Interestingly, given a certain LD structure, OR, RAF and N
appear to function jointly through NCP, which fully specifies the statistical power for
detecting the causal SNP and has a positive influence on causal SNP prioritization. The
fine-mapping power continuously grows with increasing NCP and reaches a plateau once
the NCP is above a certain threshold. This threshold depends on the strength of LD.
Compared to high LD, the growth curve reaches its plateau faster when SNPs are in weak
LD. Under the same NCP, weaker LD results in greater fine-mapping power, which
further suggests that the strength of LD has a negative influence on causal SNP
prioritization.
Figure 1. Influential factors on fine-mapping power.
[Plot: fine-mapping power (vertical axis, 0.0 to 1.0) against causal NCP (horizontal axis, 0 to 150), one curve per LD strength (ρ = 0.3, 0.7, 0.9 and 0.99).]
The vertical axis is the fine-mapping power, calculated as the proportion of times in 1,000 iterations that
the causal SNP is top-ranked among all SNPs (n=20) in the region. The ranking is based on statistical
significance (i.e., p-value) in association with the outcome. The horizontal axis is the non-centrality
parameter (NCP) of the causal SNP calculated based on risk allele frequency (RAF), effect size (OR) and
sample size (N). Closed circles are from combinations of varying RAF (0.1, 0.2, 0.3, 0.4 and 0.5) and OR
(1.1, 1.3, 1.5, 1.7 and 1.9) with N being fixed (N=2000). Open circles are from varying N (100, 500, 1000,
2000, 4000 and 8000) with RAF and OR both being fixed (RAF=0.4 and OR=1.5). The curves are drawn
through the open circles. Different strengths of LD (lag 1 autocorrelation coefficient in the AR-1 model [ρ],
0.3, 0.7, 0.9, 0.99) are color-coded.
Next, we added a level of complexity by varying the number of surrogates (i.e.,
non-causal SNPs) in the region. As noted above, OR, RAF and N influence fine-mapping
power through the causal NCP. Inspired by this observation, we performed quasi-
simulation (refer to section 7.2.1) with NCP, strength of LD, and number of surrogates as
varying parameters. It is not surprising to see that when NCP and LD are fixed, the
number of surrogates has a negative influence on causal SNP prioritization (Figure 2).
Interestingly, this influence can be attenuated by increasing NCP and/or reducing strength
of LD. As shown in Figure 2A, the number of surrogates has less effect on fine-mapping
power in low LD regions than in high LD regions. As NCP increases, the influence of the
number of surrogates dramatically decays in low LD (r=0.3) regions, but does not change
much in high LD (r=0.99) regions (Figure 2A-E). Furthermore, if the statistical power
for signal detection is sufficiently large (e.g., causal NCP=50) and the LD is sufficiently
weak (e.g., r=0.3), the causal SNP would be top-ranked almost 100% of the time
regardless of the number of surrogates in the same region (Figure 2E).
Figure 2. Interaction between key influential factors on fine-mapping power.
[Plots, panels A to E: fine-mapping power (vertical axis, 0.0 to 1.0) against number of surrogates (horizontal axis, 0 to 250), one curve per LD strength (r = 0.3, 0.5, 0.7, 0.9 and 0.99); panel titles: causal NCP = 10 (A), 20 (B), 30 (C), 40 (D) and 50 (E).]
In each panel, the vertical axis is fine-mapping power, calculated as the proportion of times in 1,000
iterations that the causal SNP is top-ranked based on p-value; the horizontal axis is the number of
surrogates (n, 1, 3, 7, 15, 63, 127 and 255) in the region. Different strengths of LD (pairwise correlation
coefficient [r], 0.3, 0.5, 0.7, 0.9 and 0.99) are color-coded. Across panels, the non-centrality parameters
(NCPs) of the causal SNP are (A) 10, (B) 20, (C) 30, (D) 40 and (E) 50, respectively.
In summary, assuming that the causal SNP is available in the data, fine-mapping
power improves with increasing statistical power, weakening LD, and decreasing number
of non-causal SNPs in the same region. These key factors do not function independently;
in fact, they interact with each other and therefore, a comprehensive consideration of
these factors in the study design phase is necessary in order to optimize the fine-mapping
power.
The simulation analysis was conducted under the assumptions of 1) one causal
SNP per region, 2) an AR-1 or uniform LD structure, and 3) a balanced number of cases
and controls in the sample. This framework can be easily extended to unbalanced designs
with flexible case/control ratio, as it can be accounted for in the NCP calculation. A
simplified correlation structure facilitates the quantification of LD strength and thus is
helpful for understanding its influence on fine-mapping; however, more realistic LD
structures are recommended for the evaluation of different fine-mapping methods or
study designs.
7.3.2 Comparison of different methods for combining summary statistics
7.3.2.1 False-Positive Rate
The false-positive rate is the probability of falsely identifying a signal that in fact does
not exist. It is estimated as the likelihood of being statistically significant at α=0.05 under
the null hypothesis. Specifically, for a conceptual SNP with MAF=0.3, we varied the
number of studies (n, 3, 5, 10, 20 and 50), each with a sample size of 1,000, and
simulated the outcome under the assumption of no associations (i.e., OR=1 in all studies).
We estimated the summary statistics in each study and conducted meta-analysis to obtain
the p-value. After 100,000 iterations, we estimated the false-positive rate as the
proportion of times that the p-value is less than 0.05.
Figure 3 shows that the false-positive rate is accurate regardless of the number of
studies for all the meta-analysis methods, except the random-effects model (RE). RE has
a false-positive rate noticeably lower than 0.05, suggesting that it is more conservative
than the other methods. This is because RE also needs to estimate the inter-study
heterogeneity ($\tau^2$), whose true value is exactly zero in the simulation setting of a common
OR equal to 1. The conservative nature of RE comes from the variance in estimating $\tau^2$,
and the conservativeness is reduced with increasing number of studies as illustrated in
Figure 3. Theoretically, with an infinite number of studies, the estimate of $\tau^2$ should
converge to zero and thus RE would have a false-positive rate of approximately 0.05.
Figure 3. False-positive rate of different meta-analysis methods.
[Plot: false-positive rate at α=0.05 (vertical axis, 0.00 to 0.07) against number of studies (3, 5, 10, 20 and 50) for FE, RE, RE2, HFE and GHFE.]
False-positive rate is estimated at threshold α=0.05 in varying number of studies (n, 3, 5, 10, 20 and 50)
using 100,000 iterations. Under the null hypothesis of OR=1 in all studies, five different models of meta-
analysis are evaluated: fixed-effect (FE), random-effects (RE), modified random-effects (RE2),
heterogeneous fixed-effects (HFE), and group-heterogeneous fixed-effects (GHFE). For GHFE, it is
assumed that there are three populations with roughly equal numbers of studies. For example, if there are a
total of 10 studies (i.e., n=10), we assume the first four studies are from the first population, the next three
from the second population, and the last three from the third population.
7.3.2.2 Statistical Power
The statistical power is the probability of successfully discovering a true signal. It is
estimated as the likelihood of being statistically significant at a genome-wide threshold
($\alpha = 10^{-7}$) under the alternative hypothesis. We used a similar simulation framework for a
conceptual SNP (MAF=0.3) in 10 studies with an equal sample size of 1,000. We
examined five different scenarios of effect heterogeneity (Table 1), and further varied the
baseline OR to expand the spectrum of effect size difference across studies. We repeated
this procedure 10,000 times and estimated the statistical power as the proportion of
the repeats with a p-value of less than $10^{-7}$.
Under no effect heterogeneity, the most powerful method is fixed-effect (FE)
meta-analysis (Figure 4A), which is expected since under the alternative hypothesis FE
assumes the same effect size across studies while the other methods do not. Interestingly,
the modified random-effects (RE2) model has a comparable yet slightly lower power.
This is probably because the RE2 statistic is equal to the FE statistic plus the statistic
testing for heterogeneity, and it follows a mixture distribution with slightly higher
degrees of freedom than the FE statistic (Han and Eskin, 2011).
Next, we introduced effect heterogeneity on two levels: 1) inter-study, each one of
the 10 studies has a different effect; 2) inter-population, there are three population-
specific effects but studies from the same population share the same effect. On each level,
we varied the degree of heterogeneity: moderate heterogeneity means the effects have a
consistent direction but different values; extreme heterogeneity means there are effects
with opposite directions (Table 1). When effect heterogeneity exists either between
studies or between populations, the group-heterogeneous fixed-effects (GHFE) model
outperforms the other methods, and the advantage of GHFE becomes greater as the
degree of heterogeneity increases (Figure 4B-E). It is striking that when studies from the
same population have different effects (i.e., under inter-study effect heterogeneity),
GHFE demonstrates robustness and it still has greater power than the heterogeneous
fixed-effects (HFE) model (Figure 4B and 4C), whose assumptions best fit this scenario.
HFE never achieves the maximal power, probably due to the high degrees of freedom in
the $\chi^2$ test. The power of FE reduces to nearly zero when the effect size is extremely
different across all studies (Figure 4C), because it tests for mean effect yet the positive
and negative effects cancel out under this scenario. It is noted that RE has the minimal
power when effect heterogeneity exists (Figure 4B-E), which further reflects its
conservative nature in signal discovery.
In addition, we simulated a situation in which the true effects for the 10 studies
were drawn at random from a distribution with varying between-study heterogeneity
(Supplementary Fig. 1). Under low to moderate heterogeneity, FE is more powerful
than the other methods. The power of FE gradually diminishes as the level of
heterogeneity increases, and after a certain point it is exceeded by the power of RE2.
Under moderate to large heterogeneity, RE2 is superior to the other methods, primarily
because it tests for the mean effect and heterogeneity at the same time. Since the effect
heterogeneity pattern is completely random across studies, the tessellation in GHFE is
also random and no longer reflects the effect similarity. Nonetheless, GHFE seems to be
robust against this violation; it has a steadily growing power as the between-study
heterogeneity increases.
[Figure 4, panels A-D (plots not reproduced): power at α=10⁻⁷ (vertical axis, 0.0 to 1.0) against baseline OR (horizontal axis, 1.1 to 1.6) for FE, RE, RE2, HFE, and GHFE. Panel titles: (A) No effect heterogeneity; (B) Moderate inter-study effect heterogeneity; (C) Extreme inter-study effect heterogeneity; (D) Moderate inter-population effect heterogeneity.]
Figure 4. Statistical power of different meta-analysis methods.
Statistical power is estimated at a genome-wide threshold α=10⁻⁷ using 10,000 iterations of 10 studies, each
having 1,000 individuals with a balanced number of cases and controls. Under the alternative hypothesis of
OR≠1, five different models of meta-analysis are evaluated: fixed-effect (FE), random-effects (RE),
modified random-effects (RE2), heterogeneous fixed-effects (HFE), and group-heterogeneous fixed-effects
(GHFE). We simulate under various settings of the baseline OR (ranging from 1.1 to 1.6 with a step of
0.05) and effect heterogeneity (Table 1). Each panel shows a different scenario of effect heterogeneity: (A)
no heterogeneity; (B) moderate inter-study heterogeneity; (C) extreme inter-study heterogeneity; (D)
moderate inter-population heterogeneity; (E) extreme inter-population heterogeneity.
[Figure 4, panel E (plot not reproduced): Extreme inter-population effect heterogeneity. Power at α=10⁻⁷ (vertical axis, 0.0 to 1.0) against baseline OR (horizontal axis, 1.1 to 1.6) for FE, RE, RE2, HFE, and GHFE.]
7.3.2.3 Fine-mapping performance
Besides the statistical power to detect a causal signal, we also assessed the prioritization
of the causal variant in multiethnic fine-mapping, which was measured by the fine-mapping
power and the average rank of the causal SNP. To account for the region-specific
heterogeneity of LD and RAF, we averaged the performance metrics over 100 regions.
Figure 5A shows that the random set of 100 common causal variants is sampled across
all MAFs above 0.05, with inter-population differences ranging from 0.02 to 0.43. The
inter-population MAF difference has a bell-shaped distribution (Figure 5B), indicating
that there are not as many SNPs with little or extreme heterogeneity of MAF as those
with moderate heterogeneity. Assuming that the causal SNP has a baseline OR of 1.3 and
under five different scenarios of effect heterogeneity (Table 1), we simulated 10 studies
from three populations (refer to sections 7.2.2 and 7.2.3) and evaluated the fine-mapping
performance of six meta-analytical approaches (refer to sections 7.2.4 and 7.2.5).
[Figure 5, panel A (plot not reproduced): minor allele frequency (vertical axis, 0.0 to 0.5) of the 100 causal variants (horizontal axis) in the EUR, AFR, and ASN populations.]
Figure 5. Minor allele frequency of the 100 causal variants.
(A) The vertical axis shows the minor allele frequency (MAF) of the 100 randomly selected causal variants
in European (EUR), African (AFR), and Asian (ASN) populations, respectively. The color-coded closed
circles represent population-specific MAFs. Each vertical segment connects the MAFs for a specific
variant. On the horizontal axis, the variants are sorted with increasing inter-population MAF difference. (B)
The histogram shows the distribution of the inter-population MAF difference among the 100 causal
variants, with the probability density represented by the blue dashed curve.
[Figure 5, panel B (plot not reproduced): histogram of the inter-population MAF difference (horizontal axis, 0.0 to 0.4) with density on the vertical axis (0 to 5).]
As one of the key performance metrics, fine-mapping power represents the
frequency with which the causal SNP is top-ranked. When the causal effect is the same across
studies, all of the methods have comparable fine-mapping power with HFE being the
least powerful; however, when the causal effect differs between studies or populations,
GHFE is superior to the other methods (Figure 6). Another performance metric, average
rank of the causal variant, suggests the minimum number of SNPs needed for functional
follow-up and thus the lower the better. On average, the causal SNP has a similar rank
using any method but HFE under no effect heterogeneity and is better prioritized by
GHFE than any other method when heterogeneity exists (Figure 7), which is consistent
with the findings based on fine-mapping power. Intuitively, the rank of the causal SNP
also depends on the total number of SNPs considered in the ranking, which potentially
overweights the regions containing more SNPs when taking an average. To correct for this
effect, we further examined the percentile rank of the causal SNP by dividing the
original rank by the number of SNPs in the same region. A similar trend is observed,
suggesting that GHFE has the best overall performance in causal SNP prioritization
(Figure 8). As a side note, the method performances cannot be compared between
different scenarios of effect heterogeneity, because the overall statistical power (i.e.,
causal NCP), which has a crucial influence on fine-mapping as illustrated in Figure 1, is
not the same under all scenarios.
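The two rank-based metrics described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the function names are hypothetical, and the tie-handling rule (ties with the causal SNP are resolved in its favor) is an assumption.

```python
def causal_rank(p_values, causal_index):
    """Rank of the causal SNP when SNPs are sorted by increasing p-value.

    Rank 1 means the causal SNP is the most significant SNP in the region.
    Ties are resolved in favor of the causal SNP (an assumption).
    """
    causal_p = p_values[causal_index]
    return 1 + sum(1 for p in p_values if p < causal_p)


def percentile_rank(p_values, causal_index):
    """Rank divided by the number of SNPs in the region, as a percentage."""
    return 100.0 * causal_rank(p_values, causal_index) / len(p_values)


# Toy region with four SNPs; the causal SNP (index 0) is second-best.
region_p_values = [1e-8, 1e-10, 0.2, 0.05]
rank = causal_rank(region_p_values, 0)
pct = percentile_rank(region_p_values, 0)
```

Averaging `pct` rather than `rank` across regions keeps regions with many SNPs from dominating the mean, which is the motivation given in the text.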
Figure 6. Mean fine-mapping power of six statistical methods by effect heterogeneity.
The frequency of causal variant being top-ranked among 100 iterations (i.e., fine-mapping power) is
averaged over 100 regions. We simulate 10 studies, each having 1,000 individuals, from three populations.
Assuming a baseline causal OR of 1.3, five different scenarios of effect heterogeneity are considered: none,
moderate inter-study, extreme inter-study, moderate inter-population, and extreme inter-population (see
Table 1). Summary statistics are combined using fixed-effect (FE), random-effects (RE), modified random-
effects (RE2), heterogeneous fixed-effects (HFE), group-heterogeneous fixed-effects (GHFE) meta-
analysis, and MANTRA. Note that the mean fine-mapping power cannot be compared between different
scenarios of effect heterogeneity, because the average causal OR is not the same under all scenarios.
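[Figure 6 bar chart not reproduced: mean fine-mapping power (vertical axis, 0.0 to 1.0) for FE, RE, RE2, HFE, GHFE, and MANTRA under Case 1 (none), Case 2 (moderate inter-study), Case 3 (extreme inter-study), Case 4 (moderate inter-population), and Case 5 (extreme inter-population) effect heterogeneity.]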
Figure 7. Mean rank of causal variant by statistical method and effect heterogeneity.
SNPs within the same region are ranked by decreasing statistical significance (i.e., increasing p-value in
meta-analysis or decreasing Bayes’ Factor in MANTRA). The mean rank of the causal variant among 100
iterations is averaged across 100 regions. Note that the mean rank of the causal variant cannot be compared
between different scenarios of effect heterogeneity, because the average causal OR is not the same under all
scenarios. Under extreme inter-study effect heterogeneity, for random-effects (RE) or modified random-
effects (RE2) meta-analysis, the mean rank of the causal SNP is beyond the maximum scale value of the
vertical axis and thus is not shown precisely in this bar plot.
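[Figure 7 bar chart not reproduced: mean rank of causal SNP (vertical axis, 0 to 60) for FE, RE, RE2, HFE, GHFE, and MANTRA under Case 1 (none), Case 2 (moderate inter-study), Case 3 (extreme inter-study), Case 4 (moderate inter-population), and Case 5 (extreme inter-population) effect heterogeneity.]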
Figure 8. Mean percentile rank of causal variant by statistical method and effect heterogeneity.
A percentile rank is defined as the original rank divided by the number of SNPs in the same region, to
correct for the varying total number of SNPs across regions. The mean percentile rank of the causal
variant among 100 iterations is averaged across 100 regions. Note that the mean percentile rank of the
causal variant cannot be compared between different scenarios of effect heterogeneity, because the average
causal OR is not the same under all scenarios. Under extreme inter-study effect heterogeneity, for random-
effects (RE) or modified random-effects (RE2) meta-analysis, the mean percentile rank of the causal SNP
is beyond the maximum scale value of the vertical axis and thus is not shown precisely in this bar plot.
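[Figure 8 bar chart not reproduced: mean percentile rank of causal SNP in % (vertical axis, 0 to 10) for FE, RE, RE2, HFE, GHFE, and MANTRA under Case 1 (none), Case 2 (moderate inter-study), Case 3 (extreme inter-study), Case 4 (moderate inter-population), and Case 5 (extreme inter-population) effect heterogeneity.]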
In summary, for signal discovery and fine-mapping based on summary statistics
from multiethnic samples, GHFE demonstrates a substantial advantage over the other
methods when causal effect heterogeneity exists. Under no heterogeneity, FE is
recommended as having the greatest power for signal discovery and a comparable
performance to GHFE in fine-mapping.
7.3.3 Comparison of different study designs using multiethnic populations
In the previous section, we compared different statistical methods in fine-mapping under
the study design of 4,000 European (EUR), 3,000 African (AFR), and 3,000 Asian (ASN)
individuals. In this section, we compare different combinations of these three ethnically
diverse populations with a fixed total sample size in the design phase. Specifically, we
examined a total of 12,000 individuals drawn either from multiple populations or from one
single population, which led to seven possible designs (Table 2). For any
study design involving multiple populations, the total sample size was split evenly among
the populations. Different study designs were evaluated through quasi-simulation (refer
to section 7.2.1) under the assumptions of a balanced number of cases and controls in
each population and a common OR of 1.2 for the causal variant. The quasi-simulation
was performed in 100 regions, where the causal RAFs and the LD structures were
estimated using 1KGP data. To combine summary statistics from multiple populations,
an analogue of FE (refer to section 7.2.6) was used since it has been demonstrated to be
the optimal method under no effect heterogeneity in the previous section.
First, we estimated the fine-mapping power, which is defined as the frequency with which
the causal variant is top-ranked. When an average is taken across 100 regions to
eliminate inter-region differences, the mean power is highest in the study design of AFR
only (Figure 9). Intuitively, this is expected because AFR has the sparsest LD on average
of the three populations. Accordingly, the multiethnic designs containing AFR
also have an elevated power compared to those without AFR, but never reach the power
of AFR only. This is likely because 1) part of the AFR sample is traded for a less
powerful sample from other populations, and 2) there is little gain in further breaking
down LD blocks by bringing in a non-AFR sample because SNPs highly correlated in AFR
are also highly correlated in other populations in most cases. On the other hand, when
two populations have similar power and a certain degree of non-overlap in LD, we would
expect an improved power by combining these populations. For example, the single-
population designs, EUR and ASN, both have relatively low power probably due to high
LD; however, the multiethnic design, EUR + ASN, has a slightly higher power than
either population.
Figure 9. Mean fine-mapping power of seven different study designs.
Every study design is composed of 12,000 individuals with a balanced number of cases and controls in
each population. The multiethnic designs split individuals evenly over the populations. The fine-mapping
power is defined as the frequency of causal variant being top-ranked among 1,000 iterations in
quasi-simulation. The mean power for each study design is obtained by taking an average over 100 regions.
[Bar chart not reproduced. Mean fine-mapping power by design: ASN 0.23; EUR 0.19; EUR+ASN 0.24; AFR 0.44; AFR+EUR 0.38; AFR+ASN 0.41; AFR+EUR+ASN 0.37.]
Next, we compared these study designs from another perspective but still based
on fine-mapping power. Across 100 regions, we calculated the fraction of times that each
study design achieves the maximal power. AFR is the most powerful design 59% of the
time, which is substantially more frequent than any other design (Figure 10). The trend is
similar to what we observed when evaluating mean power (Figure 9), except for an
exaggerated advantage of AFR and disadvantage of multiethnic designs containing EUR
+ ASN. The fundamental difference from Figure 9 is that Figure 10 only reflects the
likelihood of being the most powerful design rather than the actual power. In other words,
before being averaged over 100 regions, the power estimates of every design are substituted by
the binary codes indicating whether it achieves the maximal power for each region (i.e.,
yes=1, no=0). This information reduction inevitably introduces more variance in the
performance metric and further enlarges the gaps between AFR and the other designs.
For example, EUR + ASN has a power of more than 0.20 in 45% of the regions but it
rarely outperforms AFR, which may partially explain why it is undervalued in Figure 10.
In addition, among all designs, AFR has the largest variance of power estimates across
the 100 regions (data not shown), which further increases the chances of AFR popping up
as the optimal design.
184
Figure 10. Fraction of times that each study design achieves the maximal fine-mapping power.
The most powerful study design was first identified in each region. The frequency of each design achieving
the maximal power is then calculated based on 100 regions.
Furthermore, we examined the rank of the causal variant in each study design.
Unlike fine-mapping power, which indicates the likelihood of the causal variant being the
most significant SNP, the rank of the causal variant suggests the minimum number of
SNPs needed for functional follow-up and thus the lower the better. Based on this metric,
study designs involving multiple populations have a marked advantage over designs
involving one single population (Figure 11), which is likely due to the reduced number
of surrogates by leveraging diverse LD patterns across ethnicities. For example, in a
sample from the EUR population, the mean rank of the causal SNP is 13.4 on average,
which decreases to 7.3 if replacing half of the EUR sample with AFR sample as in the
study design of EUR + AFR.
[Figure 10 bar chart not reproduced. Frequency of achieving the maximal power by design: ASN 0.07; EUR 0.06; EUR+ASN 0.01; AFR 0.59; AFR+EUR 0.08; AFR+ASN 0.17; AFR+EUR+ASN 0.02.]
Figure 11. Mean and median rank of causal variant by study design.
SNPs within the same region are ranked by decreasing statistical significance (i.e., increasing p-value). For
each study design, the mean and median rank of the causal variant among 1,000 iterations is averaged over
100 regions by taking an arithmetic mean.
It is noted that the rank of the causal variant highly depends on the total number
of SNPs within the same region. After correcting for the number of SNPs, which largely
differs from region to region, we examined the percentile rank of the causal variant
(Figure 12). It demonstrates a similar trend but highlights the multiethnic design
involving three populations (AFR + EUR + ASN). In this design, the causal variant ranks
in the top 0.29% on average, which is statistically significantly lower than that in the
AFR design (0.47%; one-sided p=0.015) and is the lowest mean percentile rank among
all of the designs.
[Figure 11 bar chart not reproduced. Average rank of causal SNP by design (median / mean): ASN 9.7 / 14.1; EUR 8.8 / 13.4; EUR+ASN 7.2 / 11.3; AFR 3.9 / 9.9; AFR+EUR 3.8 / 7.3; AFR+ASN 3.7 / 8.3; AFR+EUR+ASN 4.0 / 7.8.]
Figure 12. Mean and median percentile rank of causal variant by study design.
A percentile rank is defined as the original rank divided by the number of SNPs in the same region, to
correct for the varying total number of SNPs across regions. For each study design, the mean and median
percentile rank of the causal variant among 1,000 iterations is averaged over 100 regions by taking an
arithmetic mean.
In summary, the study design of AFR population alone maximizes the likelihood
of the causal variant being top-ranked, while combining multiple ethnically diverse
populations improves the rank of the causal variant and therefore reduces the set of SNPs
for follow-up functional experiments. For fine-mapping of a particular region, further
investigation on the sampling ratio from each population is required to optimize the study
design.
[Figure 12 bar chart not reproduced. Average percentile rank of causal SNP in % by design (median / mean): ASN 0.97 / 1.41; EUR 0.76 / 1.13; EUR+ASN 0.45 / 0.70; AFR 0.21 / 0.47; AFR+EUR 0.18 / 0.32; AFR+ASN 0.17 / 0.34; AFR+EUR+ASN 0.16 / 0.29.]
7.4 Discussion
In this chapter, we conducted a comprehensive investigation of multiethnic fine-mapping
from three aspects: the key influential factors in the data, the statistical methods for
combining studies in the analysis phase, and the study designs involving one or multiple
racial/ethnic populations in the design phase.
It is generally considered that if a genetic variant is truly functional, it may be
associated with the disease with a similar magnitude of effect in multiple racial/ethnic
populations. In fact, the true effect size may vary for several reasons, including its
interaction with other genetic or environmental risk factors whose prevalence differs
across populations and the population-specific exposures that affect the disease pathways.
As a consequence, the assumptions in the traditional meta-analysis approaches are likely
to be violated when applied to multiethnic data. To address this issue, we developed a
novel method, group-heterogeneous fixed-effect (GHFE) meta-analysis, allowing the true
effects to vary across populations yet assuming fixed and homogeneous effects within
each population.
Under the scenarios of inter-study or inter-population heterogeneity, GHFE
outperforms the other meta-analytical approaches in both signal discovery and fine-
mapping. In the special case that all studies are from one single population, GHFE is
equivalent to fixed-effect (FE) meta-analysis; when each study is from a different
population, GHFE is identical to heterogeneous fixed-effect (HFE) meta-analysis.
Furthermore, GHFE is robust to random effects with varying between-study
heterogeneity; however, the recommended approaches in this case would be FE under
low-to-moderate heterogeneity and modified random-effects meta-analysis (RE2) under
moderate-to-high heterogeneity.
GHFE can be thought of as a frequentist implementation of the Bayesian method
MANTRA. More importantly, GHFE has a substantially improved computational
efficiency (~500× faster than MANTRA) and can be easily implemented in any
programming language. On the other hand, MANTRA utilizes a more flexible framework
of tessellation, the Bayesian partition model, in which the a priori knowledge on the number
of ethnic clusters can be updated by the observed data to achieve the optimal posterior
probability. A critical assumption of MANTRA is that the effect heterogeneity pattern is
consistent with the allele frequency dissimilarity across studies, as the latter is used as the
distance metric for clustering. If the effects are less likely to vary between closely related
populations, MANTRA might have better performance than GHFE by clustering the
studies more properly, but further investigation is required to evaluate this speculation.
Besides the methods presented in this chapter, other methods for combining
summary statistics include Bayesian meta-analysis for quantitative traits (Verzilli et al.,
2008) and binary outcomes (De Iorio et al., 2011), and Bayesian methods specifically
designed for fine-mapping such as XPEB (Coram et al., 2015), PAINTOR (Kichaev et
al., 2014), and CAVIARBF (Chen et al., 2015). XPEB leverages multiethnic evidence to
re-prioritize SNPs in minority populations that have been under-represented in GWAS,
with a constraint on there being a single causal variant per risk locus. In contrast, both
PAINTOR and CAVIARBF allow for multiple causal variants per risk locus. Moreover,
PAINTOR incorporates functional annotation in SNP ranking and further extends to a
multi-population fine-mapping framework (Kichaev and Pasaniuc, 2015). A comparison
of the existing fine-mapping approaches can be found in these two literature reviews (Li
and Keating, 2014; Spain and Barrett, 2015).
Under no heterogeneity, the traditional FE approach is recommended. Therefore,
we adopted an analogue of this approach to evaluate the study designs involving one or
more populations for fine-mapping. The results are consistent with previous findings that
suggest an advantage of leveraging multiethnic data to prioritize the causal SNP (Ong et
al., 2012; Teo et al., 2010; Zaitlen et al., 2010). In particular, the simulation setting in our
study is comparable to that in Zaitlen et al., with several differences as follows. 1) They
included rare causal SNPs (MAF<0.05), but we did not since it would be more
appropriate to analyze the rare SNPs separately from the common SNPs via a different
association test. 2) They defined each region to contain 20 additional SNPs from each
side of the causal SNP based on HapMap, spanning ~30kb on average. Fixing the number
of SNPs (n=41) across regions may add convenience to comparing the average rank of
the causal SNP between different methods; however, for some regions, such a small
number of SNPs might be insufficient to represent the complex LD structures. Therefore,
we defined each region by fixing the length: +/- 100 kb from the causal SNP based on
1KGP, which is expected to include the vast majority of the surrogates that are in
moderate to high LD with the causal SNP. Although the varying number of surrogates
across regions can be accounted for by examining the percentile rank of the causal SNP,
its potential interaction with the causal NCP and/or LD may still have some remaining
influence on fine-mapping. 3) Given the same sample size, they assumed a common NCP
for all causal SNPs and in every population, while we only assumed a common OR and
allowed the causal NCPs to be different due to varying RAF across populations. 4) They
simulated 10,000 regions, but we only simulated 100 regions, which might result in less
stable estimates of fine-mapping performance. In general, our study better approximated
the realistic situation of multiethnic fine-mapping on common signals, while Zaitlen et al.
included rare signals and also eliminated the influence of the number of surrogates by
holding it constant. These two complementary studies have consistent results, which
further confirms the advantage of multiethnic designs in fine-mapping.
The simulation framework for comparing different methods and study designs has
several limitations that point to possible extensions of the research. First, we assumed
that the causal variation in each region is one single SNP, which actually could be a
haplotype, a structural variant, or multiple independent SNPs. As a side point, an
approximate conditional and joint analysis method, based on summary statistics, has been
developed to identify additional independent risk variants within the same region (Yang
et al., 2012). Second, we assumed that the causal SNP is common in every population; in
fact, the causal SNP could be rare in one or more populations. Unlike common alleles,
rare alleles within the same gene are usually tested in aggregate through a burden test (Li
and Leal, 2008), score test, or SKAT (Wu et al., 2011) to increase power. Mensah-Ablorh
et al. compared the performance of several rare variant meta-analysis approaches under
multiethnic designs with varying ethnic sampling fraction (Mensah-Ablorh et al., 2016).
Last but not least, we only simulated dichotomous traits, but we reason that the same
conclusions also hold for quantitative traits, because the type of outcome does not affect
the summary statistics provided for the meta-analytical approaches that have been
evaluated in this chapter.
7.5 Appendix
7.5.1 NCP calculation for a causal SNP
Under a log-additive model relating individual disease status (case: $y = 1$; control: $y = 0$) to the number of risk alleles ($g \in \{0,1,2\}$) for a causal SNP, we have
\[ \log \frac{P(y \mid g)}{1 - P(y \mid g)} = \alpha + \beta g , \]
where $P(y \mid g) = P(y = 1 \mid g)$ is the probability of being a case conditioning on the causal SNP genotype, $\beta = \log \mathrm{OR}$ is the additive allele effect, and $\alpha = \log(\phi/(1-\phi))$ is the offset effect in which $\phi$ represents the disease prevalence. Here we assume a rare disease with $\phi = 10^{[\text{exponent illegible in source}]}$ in the population.
Given a certain genotype, the conditional probability of being a case is
\[ p_{\text{case}} = P(y = 1 \mid g) = \frac{\exp(\alpha + \beta g)}{1 + \exp(\alpha + \beta g)} . \]
The binary outcome $y$ follows a Bernoulli distribution: $P(Y = y \mid G = g) = p_{\text{case}}^{\,y} (1 - p_{\text{case}})^{1-y}$. Given an overall risk allele frequency $f$ in the population, the number of risk alleles $g$ carried by any individual follows a binomial distribution $B(2, f)$: $P(G = g) = \binom{2}{g} f^{g} (1 - f)^{2-g}$. According to Bayes' theorem, the conditional distribution of $g$ given $y$ is
\[ P(G = g \mid Y = y) = \frac{P(Y = y \mid G = g)\, P(G = g)}{\sum_{g'=0}^{2} P(Y = y \mid G = g')\, P(G = g')} , \]
which can be used to calculate the risk allele frequency among cases ($f_1$) and controls ($f_0$) separately. Note that we need to divide the total number of risk alleles by 2 because human somatic cells are diploid, containing two complete sets of chromosomes.
\[ f_1 = \frac{P(G = 1 \mid Y = 1) + 2 P(G = 2 \mid Y = 1)}{2} \]
\[ f_0 = \frac{P(G = 1 \mid Y = 0) + 2 P(G = 2 \mid Y = 0)}{2} \]
Based on the central limit theorem, the test statistic $Z$ for $H_0: \beta = 0$ (or equivalently $f_1 = f_0$) versus $H_1: \beta \neq 0$ (or equivalently $f_1 \neq f_0$),
\[ Z = \frac{f_1 - f_0}{\sqrt{\mathrm{Var}(f_1 - f_0)}} , \]
follows a standard normal distribution $N(0,1)$ under $H_0$. Let $N_1$ and $N_0$ denote the number of cases and controls, respectively. It thus follows that
\[ f = \frac{N_1 f_1 + N_0 f_0}{N_1 + N_0} . \]
If we have a balanced number of cases and controls, $N_1 = N_0 = N/2$, then $f = (f_1 + f_0)/2$. When $N$ is large, the total number of risk alleles $2Nf$ also follows a binomial distribution $B(2N, f)$ with a variance of $\mathrm{Var}(2Nf) = 2Nf(1-f)$, which further leads to $\mathrm{Var}(f_1 + f_0) = 2f(1-f)/N$. Because $f_1$ and $f_0$ are independent,
\[ \mathrm{Var}(f_1 - f_0) = \mathrm{Var}(f_1 + f_0) = \frac{2f(1-f)}{N} . \]
Subsequently, the test statistic can be formulated as
\[ Z = \frac{f_1 - f_0}{\sqrt{2f(1-f)/N}} . \]
Under $H_1$, $Z \sim N(\lambda, 1)$, where $\lambda$ is the non-centrality parameter (NCP) given by
\[ \lambda = \frac{f_1 - f_0}{\sqrt{2f(1-f)/N}} . \]
Given a certain disease prevalence $\phi$, the NCP only depends on the risk allele frequency $f$, the effect size $\beta$, and the total sample size $N$.
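The NCP calculation in this section can be traced numerically with a short Python sketch. It is a hedged illustration rather than the thesis code: the function name `causal_ncp` is hypothetical, a balanced case-control design is assumed, and the prevalence is left as a required argument because the value used in the study is not legible here (the 0.001 in the usage line is purely illustrative).

```python
import math


def causal_ncp(odds_ratio, raf, n_total, prevalence):
    """NCP of a causal SNP under a log-additive model, balanced design."""
    beta = math.log(odds_ratio)                       # additive allele effect
    alpha = math.log(prevalence / (1.0 - prevalence))  # offset effect

    # P(G = g) for g = 0, 1, 2 under a binomial(2, raf) model.
    p_g = [math.comb(2, g) * raf ** g * (1.0 - raf) ** (2 - g) for g in range(3)]
    # P(Y = 1 | G = g) from the logistic model.
    p_case = [1.0 / (1.0 + math.exp(-(alpha + beta * g))) for g in range(3)]

    def allele_freq(y):
        """Risk allele frequency among cases (y=1) or controls (y=0)."""
        lik = [pc if y == 1 else 1.0 - pc for pc in p_case]
        joint = [l * p for l, p in zip(lik, p_g)]
        total = sum(joint)
        post = [j / total for j in joint]             # Bayes' theorem
        return (post[1] + 2.0 * post[2]) / 2.0        # diploid: divide by 2

    f1, f0 = allele_freq(1), allele_freq(0)
    f_pooled = (f1 + f0) / 2.0                        # balanced cases/controls
    return (f1 - f0) / math.sqrt(2.0 * f_pooled * (1.0 - f_pooled) / n_total)


# Illustrative call: OR = 1.3, RAF = 0.3, 10,000 individuals, prevalence 0.001.
example_ncp = causal_ncp(1.3, 0.3, 10_000, 0.001)
```

Because the sample size enters only through the square root in the denominator, quadrupling `n_total` doubles the NCP.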
7.5.2 NCP calculation for non-causal SNPs
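Assuming that the outcome $y$ is a quantitative trait, we can fit a linear regression model,
\[ y_i = \alpha + \beta g_i + \varepsilon_i , \quad i = 1, 2, \ldots, N . \]
The error terms $\varepsilon_i$ are independent and identically distributed as Gaussian $N(0, \sigma^2)$. The ordinary least squares solution for this model is
\[ \hat{\beta} = \frac{\mathrm{Cov}(g, y)}{\mathrm{Var}(g)} , \qquad \mathrm{Var}(\hat{\beta}) = \frac{\sigma^2}{N\,\mathrm{Var}(g)} . \]
Based on the above estimates, an alternative way of constructing the test statistic that follows $N(\lambda, 1)$ is given by
\[ Z = \frac{\hat{\beta}}{\sqrt{\mathrm{Var}(\hat{\beta})}} . \]
We start with a simple region of one causal SNP and one non-causal SNP. Let $g$ denote the number of risk alleles for the causal SNP and $h$ for the non-causal SNP ($g, h \in \{0,1,2\}$). To evaluate marginal associations, two separate models are fit:
\[ y = \alpha_c + \beta_c g + \varepsilon_c \]
\[ y = \alpha_n + \beta_n h + \varepsilon_n \]
We assume that there is a linear relationship between $y$ and $g$ as well as $g$ and $h$ ($h = \gamma_0 + \gamma_1 g + \varepsilon_{hg}$). We also assume that the linear relationship between $y$ and $h$ is due to full mediation by $g$; in other words, the outcome is independent of the non-causal SNP when conditioning on the causal SNP: $\mathrm{Cov}(y, h \mid g) = 0$. In this case, the residual variances in the two models above are the same: $\sigma_c^2 = \sigma_n^2 = \sigma^2$. According to the law of total covariance, the covariance between $y$ and $h$ can be decomposed as
\[
\begin{aligned}
\mathrm{Cov}(y, h) &= E[\mathrm{Cov}(y, h \mid g)] + \mathrm{Cov}(E[y \mid g], E[h \mid g]) \\
&= 0 + \mathrm{Cov}(\alpha_c + \beta_c g,\; \gamma_0 + \gamma_1 g) \\
&= \beta_c \gamma_1 \mathrm{Var}(g) \\
&= \frac{\mathrm{Cov}(g, y)}{\mathrm{Var}(g)} \cdot \frac{\mathrm{Cov}(g, h)}{\mathrm{Var}(g)} \cdot \mathrm{Var}(g) \\
&= \frac{\mathrm{Cov}(g, y)\,\mathrm{Cov}(g, h)}{\mathrm{Var}(g)} .
\end{aligned}
\]
Therefore, the ratio of the test statistics for the non-causal and causal SNPs,
\[
\frac{Z_n}{Z_c}
= \frac{\hat{\beta}_n / \sqrt{\mathrm{Var}(\hat{\beta}_n)}}{\hat{\beta}_c / \sqrt{\mathrm{Var}(\hat{\beta}_c)}}
= \frac{\dfrac{\mathrm{Cov}(h, y)}{\mathrm{Var}(h)} \sqrt{\dfrac{N\,\mathrm{Var}(h)}{\sigma_n^2}}}{\dfrac{\mathrm{Cov}(g, y)}{\mathrm{Var}(g)} \sqrt{\dfrac{N\,\mathrm{Var}(g)}{\sigma_c^2}}}
= \frac{\mathrm{Cov}(g, h)}{\sqrt{\mathrm{Var}(g)\,\mathrm{Var}(h)}} ,
\]
is exactly the correlation coefficient ($r$) between these two SNPs. Consequently, the NCP of the non-causal SNP is the product of the NCP of the causal SNP and their correlation coefficient: $\lambda_n = r \lambda_c$. The result also holds when modeling a binary outcome through logistic regression (Hormozdiari et al., 2014).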
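The relationship between the two marginal statistics can be checked with a small simulation. The sketch below is illustrative only and uses the Python standard library; the haplotype model that induces LD between the two SNPs (allele frequencies 0.3 and 0.4, copy probability 0.8), the effect size, and all function names are assumptions made for this example.

```python
import math
import random


def cov(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def ols_z(x, y):
    """Marginal statistic beta_hat / sqrt(Var(beta_hat)) from simple OLS."""
    n = len(x)
    var_x = cov(x, x)
    beta_hat = cov(x, y) / var_x
    resid_var = cov(y, y) - beta_hat ** 2 * var_x
    return beta_hat / math.sqrt(resid_var / (n * var_x))


def simulate_ratio(n=100_000, beta=0.2, seed=1):
    """Return (Z_noncausal / Z_causal, sample correlation r) for one data set."""
    rng = random.Random(seed)
    g, h, y = [], [], []
    for _ in range(n):
        g_i = h_i = 0
        for _ in range(2):  # two haplotypes per individual
            causal = rng.random() < 0.3
            # Hypothetical LD model: copy the causal allele 80% of the time.
            linked = causal if rng.random() < 0.8 else (rng.random() < 0.4)
            g_i += causal
            h_i += linked
        g.append(g_i)
        h.append(h_i)
        # The trait depends on the causal SNP only (full mediation).
        y.append(beta * g_i + rng.gauss(0.0, 1.0))
    r = cov(g, h) / math.sqrt(cov(g, g) * cov(h, h))
    return ols_z(h, y) / ols_z(g, y), r


if __name__ == "__main__":
    ratio, r = simulate_ratio()
    print(f"Z ratio = {ratio:.3f}, correlation = {r:.3f}")
```

With a large sample the two printed numbers should be close, up to sampling noise and the small difference in residual variance between the two marginal models.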
7.5.3 MVN distribution of test statistics
By the multivariate central limit theorem, if $N$ is large, the joint distribution of the marginal association statistics for the two SNPs above asymptotically follows a multivariate normal (MVN) distribution,
\[
\begin{pmatrix} Z_c \\ Z_n \end{pmatrix}
\sim N\!\left( \begin{pmatrix} \lambda_c \\ r\lambda_c \end{pmatrix},\;
\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix} \right) .
\]
It can be generalized to a set of $M$ SNPs with the causal SNP denoted by an index $c$,
\[ \mathbf{Z} \sim \mathrm{MVN}(\lambda_c \mathbf{r}_c,\; \Sigma) \]
where $\mathbf{Z} = (Z_1, Z_2, \ldots, Z_M)$ is the vector of statistics for $M$ SNPs, $\lambda_c$ is the NCP for the causal SNP, $\mathbf{r}_c = (r_{c1}, r_{c2}, \ldots, r_{cM})$ is the vector of correlation coefficients between the causal SNP and every other SNP, and $\Sigma$ is the $M \times M$ covariance matrix in which the $(i, j)^{\text{th}}$ entry is $r_{ij}$ (i.e., the correlation coefficient between the $i^{\text{th}}$ and $j^{\text{th}}$ SNPs).
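As an illustration of how this MVN result can drive a simulation, the sketch below draws test statistics for a small region and estimates how often the causal SNP is top-ranked. It is a simplified stand-in written with the Python standard library, not the simulation code of this chapter; the function names, the three-SNP LD matrix, and the NCP value are hypothetical, and SNPs are ranked by absolute Z, which is equivalent to ranking by two-sided p-value.

```python
import math
import random


def cholesky(a):
    """Lower-triangular factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def fine_mapping_power(ld, causal, ncp, n_iter=2000, seed=7):
    """Fraction of draws from MVN(ncp * r_c, ld) in which the causal SNP wins."""
    rng = random.Random(seed)
    m = len(ld)
    low = cholesky(ld)
    mean = [ncp * ld[causal][j] for j in range(m)]   # lambda_c * r_c
    hits = 0
    for _ in range(n_iter):
        e = [rng.gauss(0.0, 1.0) for _ in range(m)]
        z = [mean[i] + sum(low[i][k] * e[k] for k in range(i + 1))
             for i in range(m)]
        top = max(range(m), key=lambda i: abs(z[i]))
        hits += (top == causal)
    return hits / n_iter


# Hypothetical region: SNP 0 is causal, SNP 1 is a strong surrogate (r = 0.9).
ld_matrix = [[1.0, 0.9, 0.2],
             [0.9, 1.0, 0.3],
             [0.2, 0.3, 1.0]]
power = fine_mapping_power(ld_matrix, causal=0, ncp=6.0)
```

Lowering the correlation between the causal SNP and its surrogate in `ld_matrix` increases the estimated power, which mirrors the role of LD discussed in the study-design comparison.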
7.5.4 Meta-analytic statistical inferences
Here we provide detailed derivations for the meta-analysis statistics in section 7.2.4, with an emphasis on the development of the two novel methods: HFE and GHFE.
For any given SNP in study $k$, the effect estimate $\hat{\beta}_k$ follows a normal distribution centering on the true effect size $\beta_k$, if the sample size is sufficiently large. Specifically,
\[ \hat{\beta}_k = \beta_k + \varepsilon_k , \quad k = 1, 2, \ldots, K \]
where the error term $\varepsilon_k \sim N(0, V_k)$. $V_k$ can be estimated by the sample variance of $\hat{\beta}_k$, which is equal to the squared standard error $\mathrm{SE}(\hat{\beta}_k)^2$.
Fixed-effect meta-analysis (FE)
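FE assumes $\beta_k = \beta \;\; \forall k$. Thus it follows that $\hat{\beta}_k \sim N(\beta, V_k)$ with the hypotheses below.
\[ H_0: \beta = 0 \]
\[ H_1: \beta \neq 0 \]
The MLE of $\beta$ is obtained by maximizing the likelihood $L$ over the parameter $\beta$.
\[ L = \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi V_k}} \exp\!\left( -\frac{(\hat{\beta}_k - \beta)^2}{2V_k} \right) \]
\[ \log L = \sum_{k=1}^{K} \left[ \log \frac{1}{\sqrt{2\pi V_k}} - \frac{(\hat{\beta}_k - \beta)^2}{2V_k} \right] \]
\[ \frac{\partial \log L}{\partial \beta} = \sum_{k=1}^{K} \frac{\hat{\beta}_k - \beta}{V_k} = \sum_{k=1}^{K} \frac{\hat{\beta}_k}{V_k} - \beta \sum_{k=1}^{K} \frac{1}{V_k} \]
Setting the first derivative to zero, we can get
\[ \hat{\beta} = \frac{\sum_{k=1}^{K} \hat{\beta}_k / V_k}{\sum_{k=1}^{K} 1 / V_k} . \]
The LRT statistic is constructed as follows:
\[
\begin{aligned}
\chi^2 &= -2 \log \frac{\sup_{H_0} L}{\sup_{H_1} L} \\
&= -2 \left( \log L \big|_{\beta = 0} - \log L \big|_{\beta = \hat{\beta}} \right) \\
&= -2 \left[ -\sum_{k=1}^{K} \frac{\hat{\beta}_k^2}{2V_k} + \sum_{k=1}^{K} \frac{(\hat{\beta}_k - \hat{\beta})^2}{2V_k} \right] \\
&= 2\hat{\beta} \sum_{k=1}^{K} \frac{\hat{\beta}_k}{V_k} - \hat{\beta}^2 \sum_{k=1}^{K} \frac{1}{V_k} .
\end{aligned}
\]
After plugging in $\hat{\beta}$, the LRT statistic is given by
\[ \chi^2_{\mathrm{FE}} = \frac{\left( \sum_{k=1}^{K} \hat{\beta}_k / V_k \right)^2}{\sum_{k=1}^{K} 1 / V_k} , \]
which follows a $\chi^2$ distribution with 1 df.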
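The closed-form FE result translates directly into code. The following is a minimal sketch using only the Python standard library; the function name `fe_meta` and the example inputs are hypothetical, and the 1-df p-value relies on the identity P(χ²₁ > x) = erfc(√(x/2)).

```python
import math


def fe_meta(betas, variances):
    """Inverse-variance fixed-effect meta-analysis.

    Returns the pooled estimate, the 1-df LRT statistic, and its p-value.
    """
    weights = [1.0 / v for v in variances]
    weighted_sum = sum(b * w for b, w in zip(betas, weights))
    weight_total = sum(weights)
    beta_hat = weighted_sum / weight_total
    stat = weighted_sum ** 2 / weight_total
    p_value = math.erfc(math.sqrt(stat / 2.0))   # chi-square, 1 df
    return beta_hat, stat, p_value


# Hypothetical per-study log odds ratios and squared standard errors.
beta_hat, stat, p_value = fe_meta([0.2, 0.1, 0.3], [0.01, 0.04, 0.0025])
```

The statistic can equivalently be written as the squared pooled estimate divided by its variance, 1 / Σ(1/V_k).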
Random-effects meta-analysis (RE)
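RE assumes $\beta_k = \beta + \eta_k$, where $\eta_k \sim N(0, \tau^2)$. Thus it follows that $\hat{\beta}_k \sim N(\beta, V_k + \tau^2)$ with the hypotheses below.
\[ H_0: \beta = 0 \]
\[ H_1: \beta \neq 0 \]
Similarly as shown above, the MLE of $\beta$ is
\[ \hat{\beta} = \frac{\sum_{k=1}^{K} \hat{\beta}_k / (V_k + \tau^2)}{\sum_{k=1}^{K} 1 / (V_k + \tau^2)} , \]
where $\tau^2$ is estimated by the method of moments (DerSimonian and Laird, 1986), and the LRT statistic is
\[ \chi^2_{\mathrm{RE}} = \frac{\left( \sum_{k=1}^{K} \hat{\beta}_k / (V_k + \tau^2) \right)^2}{\sum_{k=1}^{K} 1 / (V_k + \tau^2)} , \]
which follows a $\chi^2$ distribution with 1 df.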
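A compact sketch of the RE computation is given below. It assumes the standard DerSimonian-Laird moment estimator, based on Cochran's Q and truncated at zero, which may differ in detail from the implementation used in this chapter; the function names and example inputs are hypothetical.

```python
def dl_tau2(betas, variances):
    """DerSimonian-Laird moment estimate of the between-study variance."""
    w = [1.0 / v for v in variances]
    w_sum = sum(w)
    beta_fe = sum(b * wi for b, wi in zip(betas, w)) / w_sum
    q = sum(wi * (b - beta_fe) ** 2 for b, wi in zip(betas, w))  # Cochran's Q
    scale = w_sum - sum(wi ** 2 for wi in w) / w_sum
    return max(0.0, (q - (len(betas) - 1)) / scale)


def re_meta(betas, variances):
    """Random-effects pooled estimate and 1-df statistic."""
    tau2 = dl_tau2(betas, variances)
    w = [1.0 / (v + tau2) for v in variances]
    weighted_sum = sum(b * wi for b, wi in zip(betas, w))
    w_sum = sum(w)
    return weighted_sum / w_sum, weighted_sum ** 2 / w_sum, tau2


# Hypothetical estimates with opposite signs: tau^2 is large, so the
# statistic shrinks relative to a fixed-effect analysis of the same data.
beta_re, stat_re, tau2 = re_meta([0.5, -0.5, 0.4, -0.3], [0.01] * 4)
```

When the estimates are homogeneous, the moment estimate is truncated to zero and the RE result coincides with FE, which is consistent with the conservative behavior of RE described in the results.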
Heterogeneous fixed-effect meta-analysis (HFE)
We developed HFE based on FE by modifying the alternative hypothesis.
!
!
:!
!
= 0 ∀!
!
!
:!
!
≠ 0 for as least one !
The likelihood and log likelihood are given by
!=
1
2!!
!
exp −
!
!
−!
!
!
2!
!
!
!!!
log!= log
1
2!!
!
−
!
!
−!
!
!
2!
!
!
!!!
.
The LRT statistic is constructed as follows,
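$$\begin{aligned}
\chi^2 &= -2\log\frac{\sup_{H_0} L}{\sup_{H_1} L} = -2\left[\log L\big|_{\beta_k = 0 \; \forall k} - \log L\big|_{\beta_k = \hat{\beta}_k}\right] \\
&= -2\left[-\sum_{k=1}^{K}\frac{\hat{\beta}_k^2}{2V_k} + 0\right] \\
&= \sum_{k=1}^{K}\frac{\hat{\beta}_k^2}{V_k}.
\end{aligned}$$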
There are $K$ additional parameters, $(\beta_1, \beta_2, \ldots, \beta_K)$, to be estimated under $H_1$ compared to
$H_0$, hence
$$\chi^2_{\mathrm{HFE}} = \sum_{k=1}^{K} \frac{\hat{\beta}_k^2}{V_k}$$
follows a $\chi^2$ distribution with $K$ df.
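The HFE statistic is a sum of squared study-level z-scores, so it can be sketched in a few lines. The helper `chi2_sf` below is a hypothetical, standard-library-only routine for the upper tail of a $\chi^2$ distribution with integer df, built from the recurrence of the regularized incomplete gamma function; in practice a statistics library would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)             # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


def hfe_statistic(betas, variances):
    """HFE statistic (sum of squared z-scores), its df (number of studies), and p-value."""
    stat = sum(b * b / v for b, v in zip(betas, variances))
    df = len(betas)
    return stat, df, chi2_sf(stat, df)


# Hypothetical summary statistics with effects in opposite directions
stat, df, p_value = hfe_statistic([0.2, -0.1], [0.01, 0.01])
print(f"statistic={stat:.3f}, df={df}, p={p_value:.4f}")
```

Because each study contributes its own squared z-score, effects in opposite directions do not cancel as they would under FE, at the cost of spending one degree of freedom per study.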
Group-heterogeneous fixed-effect meta-analysis (GHFE)
We developed GHFE based on HFE by further modifying the alternative hypothesis.
GHFE clusters $K$ studies into $C$ distinct groups ($1 \le C \le K$), and assumes that studies
from the same group have the same effect size while allowing for heterogeneity between
groups. Let $\gamma_c$ denote the effect size for group $c$. The null and alternative hypotheses are
$$H_0: \gamma_c = 0 \;\; \forall c$$
$$H_1: \gamma_c \neq 0 \text{ for at least one } c$$
Let $T$ denote the tessellation assigning studies to groups,
$$T_{kc} = \begin{cases} 1 & \text{if study } k \text{ is in group } c \\ 0 & \text{otherwise} \end{cases}.$$
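The effect in study $k$ is then given by $\beta_k = \sum_{c=1}^{C} T_{kc}\gamma_c$. Thus it follows that
$\hat{\beta}_k \sim N\left(\sum_{c=1}^{C} T_{kc}\gamma_c, V_k\right)$. By setting the first derivative of the log likelihood to zero,
$$L = \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi V_k}} \exp\left(-\frac{\left(\hat{\beta}_k - \sum_{c=1}^{C} T_{kc}\gamma_c\right)^2}{2V_k}\right)$$
$$\log L = \sum_{k=1}^{K} \left[\log\frac{1}{\sqrt{2\pi V_k}} - \frac{\left(\hat{\beta}_k - \sum_{c=1}^{C} T_{kc}\gamma_c\right)^2}{2V_k}\right]$$
$$\frac{\partial \log L}{\partial \gamma_c} = \sum_{k=1}^{K} T_{kc}\frac{\hat{\beta}_k - \gamma_c}{V_k} = \sum_{k=1}^{K} \frac{T_{kc}\hat{\beta}_k}{V_k} - \gamma_c \sum_{k=1}^{K} \frac{T_{kc}}{V_k} = 0,$$
the MLE of $\gamma_c$ is obtained as
$$\hat{\gamma}_c = \frac{\sum_{k=1}^{K} T_{kc}\hat{\beta}_k / V_k}{\sum_{k=1}^{K} T_{kc} / V_k}.$$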
The LRT statistic is constructed as follows,
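$$\begin{aligned}
\chi^2 &= -2\log\frac{\sup_{H_0} L}{\sup_{H_1} L} = -2\left[\log L\big|_{\gamma_c = 0 \; \forall c} - \log L\big|_{\gamma_c = \hat{\gamma}_c}\right] \\
&= -2\left[-\sum_{k=1}^{K}\frac{\hat{\beta}_k^2}{2V_k} + \sum_{k=1}^{K}\frac{\left(\hat{\beta}_k - \sum_{c=1}^{C} T_{kc}\hat{\gamma}_c\right)^2}{2V_k}\right] \\
&= \sum_{k=1}^{K}\frac{\hat{\beta}_k^2}{V_k} - \sum_{k=1}^{K}\frac{\left(\hat{\beta}_k - \sum_{c=1}^{C} T_{kc}\hat{\gamma}_c\right)^2}{V_k} \\
&= \sum_{c=1}^{C}\sum_{k=1}^{K} T_{kc}\frac{\hat{\beta}_k^2}{V_k} - \sum_{c=1}^{C}\sum_{k=1}^{K} T_{kc}\frac{(\hat{\beta}_k - \hat{\gamma}_c)^2}{V_k} \\
&= 2\sum_{c=1}^{C}\hat{\gamma}_c\sum_{k=1}^{K} T_{kc}\frac{\hat{\beta}_k}{V_k} - \sum_{c=1}^{C}\hat{\gamma}_c^2\sum_{k=1}^{K} T_{kc}\frac{1}{V_k}.
\end{aligned}$$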
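After plugging in $\hat{\gamma}_c$, we get the LRT statistic, which follows a $\chi^2$ distribution with $C$ df.
$$\chi^2_{\mathrm{GHFE}} = \sum_{c=1}^{C} \frac{\left(\sum_{k=1}^{K} T_{kc}\hat{\beta}_k / V_k\right)^2}{\sum_{k=1}^{K} T_{kc} / V_k}$$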
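The GHFE statistic amounts to a fixed-effect statistic computed within each group and summed over groups. The sketch below encodes the tessellation as one group label per study rather than as an explicit 0/1 matrix; this encoding, the function name, and the example labels are assumptions made for illustration.

```python
def ghfe_statistic(betas, variances, groups):
    """GHFE statistic for a given grouping; groups[k] is the group label of study k."""
    numer = {}  # group -> sum_k T_kc * beta_k / V_k
    denom = {}  # group -> sum_k T_kc / V_k
    for b, v, g in zip(betas, variances, groups):
        numer[g] = numer.get(g, 0.0) + b / v
        denom[g] = denom.get(g, 0.0) + 1.0 / v
    gamma_hat = {g: numer[g] / denom[g] for g in numer}       # group-level MLEs
    stat = sum(numer[g] ** 2 / denom[g] for g in numer)       # sum of per-group statistics
    return gamma_hat, stat, len(numer)                        # df = number of groups C


# Hypothetical example: three studies, the first two assumed to share an effect
betas = [0.2, 0.2, 0.1]
variances = [0.01, 0.01, 0.01]
gamma_hat, stat, df = ghfe_statistic(betas, variances, ["A", "A", "B"])
print(gamma_hat, round(stat, 3), df)
```

Two boundary cases follow directly from the formula: placing all studies in a single group ($C = 1$) reproduces the FE statistic, and placing each study in its own group ($C = K$) reproduces the HFE statistic.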
Supplementary Materials
Supplementary Figure 1. The statistical power of different meta-analysis methods under the scenario
of random effects with varying between-study heterogeneity.
Statistical power is estimated at a genome-wide threshold $\alpha = 10^{-7}$ using 10,000 iterations of 10 studies, each
having 1,000 individuals with a balanced number of cases and controls. In each iteration, the study-specific
effects are randomly sampled from a normal distribution $N(\beta_0, (r\beta_0)^2)$, where $\beta_0 = \log(\mathrm{OR})$ is the mean effect
and $r$ controls the between-study heterogeneity. We set OR = 1.2 and vary $r$ from 0 to 1 with a step of 0.1.
Five different meta-analysis methods are evaluated: fixed-effect (FE), random-effects (RE), modified
random-effects (RE2), heterogeneous fixed-effects (HFE), and group-heterogeneous fixed-effects (GHFE).
[Figure: power at $\alpha = 10^{-7}$ (y-axis, 0.0 to 1.0) plotted against between-study heterogeneity (x-axis, 0.0 to 1.0) for FE, RE, RE2, HFE, and GHFE.]
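The effect-sampling step described in the caption of Supplementary Figure 1 can be sketched as follows. This is only an illustration of how study-specific effects are drawn from $N(\beta_0, (r\beta_0)^2)$; it does not reproduce the genotype simulation or the power calculation, and the function name and seed are hypothetical.

```python
import math
import random


def sample_study_effects(odds_ratio, r, n_studies, rng):
    """Draw study-specific log odds ratios from N(beta0, (r * beta0)^2)."""
    beta0 = math.log(odds_ratio)   # mean effect
    sd = abs(r * beta0)            # between-study standard deviation
    return [rng.gauss(beta0, sd) for _ in range(n_studies)]


rng = random.Random(2016)                # fixed seed for reproducibility
r_grid = [i / 10 for i in range(11)]     # r = 0, 0.1, ..., 1
for r in r_grid:
    effects = sample_study_effects(1.2, r, 10, rng)
    print(f"r={r:.1f}: min={min(effects):.3f}, max={max(effects):.3f}")
```

At $r = 0$ every study shares the same effect $\log(1.2)$, which corresponds to the FE assumption; at $r = 1$ the between-study standard deviation equals the mean effect, so a noticeable share of studies have effects near zero or in the opposite direction.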
Chapter 8 Summary and future directions
8.1 Introduction
This dissertation focuses on the identification and fine-mapping of genetic
susceptibility loci for prostate cancer. In Chapter 1, we reviewed previous efforts and
recent advances in understanding genetic predisposition to prostate cancer, highlighting
the success of GWAS in investigating this polygenic disease. In Chapters 2 and 3, we
identified 20+ novel risk loci for prostate cancer, which was accomplished by combining
samples from multiple studies and by expanding the set of variants explored through
exome-wide genotyping (Haiman et al., 2013) and genome-wide imputation (Al Olama et
al., 2014). In Chapters 4-6, we evaluated the known risk loci for prostate cancer through
generalizability analysis (Han et al., 2014) and fine-mapping (Han et al., 2016a). Using a
large multiethnic sample (described in Chapter 6), integration of statistical evidence and
functional annotation prioritized ~4 SNPs per region for future biological testing (Han et
al., 2015); coupled with an independent study in a European population (Amin Al Olama et
al., 2015), this was further reduced to ~2 SNPs per region (Hazelett et al., 2016). In Chapter 7,
we switched gears and discussed theoretical aspects of fine-mapping. In particular, we
developed a robust and powerful method for combining summary statistics from multiple
racial/ethnic populations, which is best suited to discovering and prioritizing causal SNPs when
effect heterogeneity exists.
In this chapter, I present some thoughts on future research and how the findings
may contribute to addressing healthcare problems.
8.2 Further extension of fine-mapping
In the GWAS context, fine-mapping explores all the available SNPs surrounding a
risk locus in search of the underlying causal variant, which is a crucial step in elucidating
the genetic basis of disease risk and revealing new targets for drug development.
Technically, establishing a causal relationship between a genetic variant and a trait of
interest based on observational data is quite challenging (Vansteelandt and Lange, 2012)
and is beyond the scope of this discussion. That being said, statistical fine-mapping,
especially in combination with functional annotation, has prioritized a reasonably small
number of candidates in some regions for further validation in cell lines or animal
models. For instance, at the prostate cancer risk locus 6q22.1, a functional study using
genome editing confirmed the potential causality of rs339331, which was among the top
set of SNPs in fine-mapping (Spisak et al., 2015). However, for most regions, further
validation is challenging primarily due to 1) the large number of candidates that are
statistically indistinguishable and/or 2) a lack of overlap with biofeatures.
The ultimate goal of fine-mapping is to identify genetic factors that are involved
in disease pathogenesis. Hence, future research can expand on the genetic factors and
clinical outcomes considered in fine-mapping, as well as how they are connected.
Previous studies have so far focused on common SNPs; however, the
biologically plausible genetic factors could be rare variants, haplotypes, or structural
variants (e.g., CNVs, insertions and deletions). To evaluate other forms of genetic
variation, high-density genotyping, imputation or sequencing needs to be conducted at
targeted regions. To statistically distinguish true signal from noise, sufficient sample size
is required, and leveraging differential correlation structures across populations may be
helpful. In addition to disease susceptibility, fine-mapping efforts on progression,
recurrence, and survival utilizing both germline and somatic data will also shed light on
the genetic components that are potentially involved in multiple or specific stages of
disease pathogenesis.
From genetic variation to clinical outcomes, it is always challenging to infer the
underlying biological mechanism based on fine-mapping results. The middle-layer
information should include gene expression (regulated via promoters, enhancers,
transcription factors, epigenetics, etc.), protein coding and modification, and levels of
relevant biomarkers. Therefore, incorporating association results for intermediate
phenotypes, as well as functional annotation data, may help connect the dots and generate
hypotheses for in vitro or in vivo testing. As more genomic feature data become
available, such as those from ENCODE and the NIH Roadmap Epigenomics Project, they will help
bridge the knowledge gaps and elucidate the genetic influence on the etiology of
disease. Further exploration of disease-associated variants corresponding to certain
functional groups (e.g., membrane proteins) or biological pathways may aid in the
discovery of novel druggable targets for disease prevention and treatment.
8.3 Genetic profiling for personalized medicine
A genetic profile, as an individual-specific “fingerprint”, plays a key role in personalized
medicine. Future progress in this area could be based on using the architecture of genetic
variation to predict health-related traits, disease susceptibility and clinical outcomes.
Predictive modeling differs fundamentally from exploratory analysis in that it
places more emphasis on prediction accuracy or discriminatory power than on teasing out
plausible variants. In other words, any variant should be included in the model as long as
it is predictive of the outcome either by itself or in combination with other variables,
regardless of its role in causality. The outcome of interest could be quantitative or
categorical in nature; for simplicity, we are going to focus on binary outcomes (such as
prostate cancer) in the following discussion.
In early attempts, we and other researchers selected the most-associated SNP
in each region after fine-mapping and then calculated the polygenic risk score for prostate
cancer in an additive way (Al Olama et al., 2014; Han et al., 2014). When a model of
disease risk is fitted on this cumulative score, the regression coefficient can be interpreted as the
log odds ratio of developing disease per additional risk allele carried at any of the
selected SNPs. Similarly, polygenic models based on GWAS SNPs have been utilized for
risk prediction (Wang et al., 2014). The performance depends on the total heritability of
disease, the underlying effect-size distribution, and the significance threshold for
selecting SNPs (Chatterjee et al., 2013). In the future, large-scale collaborations will
make larger sample sizes and more SNPs available, which is likely to boost the discovery
of risk variants with small effect sizes. By including newly identified risk-associated
SNPs, the performance of risk prediction models is expected to improve. It is worth noting
that modeling efforts have so far focused on additive genetic effects. In spite of the
great convenience of implementation and interpretation, performance might be
compromised by ignoring potential interactions and non-linear relationships between
genetic variants. In addition, important variants are likely to be omitted when
SNPs are selected based on marginal associations rather than classification performance. As more
SNP data become available through the wide application of genomic sequencing, how to
better use such information to improve risk prediction becomes an important question.
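The additive score and its interpretation can be illustrated with a small sketch. It assumes an unweighted score (a simple count of risk alleles) and fits a one-covariate logistic regression by Newton-Raphson; the function names and the toy counts are hypothetical and are not taken from the studies cited above.

```python
import math


def polygenic_risk_score(risk_allele_counts):
    """Unweighted additive score: total risk alleles (0, 1, or 2 per SNP) across SNPs."""
    return sum(risk_allele_counts)


def fit_logistic(scores, labels, n_iter=25):
    """Newton-Raphson fit of logit P(case) = intercept + slope * score."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w               # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical individual carrying 0, 1, and 2 risk alleles at three selected SNPs
print(polygenic_risk_score([0, 1, 2]))

# Hypothetical cohort: 20 cases / 40 controls with score 2, 30 cases / 30 controls with score 3
scores = [2] * 60 + [3] * 60
labels = [1] * 20 + [0] * 40 + [1] * 30 + [0] * 30
intercept, slope = fit_logistic(scores, labels)
print(f"log OR per additional risk allele: {slope:.3f}")
```

In this toy cohort the odds of disease double from 0.5 to 1.0 when the score increases by one allele, so the fitted slope converges to $\log 2 \approx 0.693$, the per-allele log odds ratio.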
The high dimensionality of genomic data poses a major challenge for building
classification models: when the number of variants is close to or larger than the
sample size, parameter estimation algorithms may fail to converge, indicating
perfect but meaningless separation of cases and controls in the training data due to
overfitting.
Machine learning, originally developed to find hidden patterns in
massive data, provides an alternative approach to genetic risk modeling that can
potentially improve prediction performance and reduce the sample size requirement.
It has been widely used in many areas of genomics and has been particularly successful in genomic
annotation (Libbrecht and Noble, 2015). For binary classification problems, each
individual is represented by a data point in a high-dimensional space, and the basic idea
is to find a hyperplane that best separates cases and controls. Supervised learning
algorithms such as penalized logistic regression, support vector machines (SVM), and
random forest are promising for achieving this goal. The core objective is to let the computer
learn from labeled data and make predictions without explicitly specified models. After
training on sufficient cases and controls along with their genetic data, the algorithm is
expected to build and optimize a classifier that predicts the risk for any new individual
given the genetic data. An ensemble classifier of multiple diverse algorithms usually has
improved performance. To overcome the high dimensionality issue described above,
one can either pre-select variants in high-risk regions or across the whole genome with a
liberal threshold, or use dimensionality reduction techniques such as principal component
analysis (PCA) and multidimensional scaling. Several features of machine learning
algorithms might be especially helpful for genome-based prediction. For example, noise
variants would be automatically down-weighted or even eliminated from the model
through regularization. Moreover, non-linear relationships and gene-gene interactions
would be automatically considered in random forest or SVM with radial basis function
kernel. Most importantly, hyper-parameters in the model would be tuned via cross-
validation to avoid overfitting to the training data and thus to reduce the prediction error
when generalized to new data. Assessing variable importance in predictive models may,
in turn, shed light on further investigation of causal variants and biomarkers.
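As a rough illustration of regularization and cross-validated tuning, the sketch below fits an L2-penalized (ridge) logistic regression by gradient descent and selects the penalty by 3-fold cross-validation on held-out log-loss. The choice of ridge penalty, the optimizer settings, and the simulated genotypes (five SNPs, only the first of which affects risk) are all assumptions made for illustration and do not correspond to any analysis in this dissertation.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, xi):
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))


def fit_ridge_logistic(X, y, lam, n_iter=200, lr=0.1):
    """Gradient descent on mean log-loss + (lam / 2) * ||w||^2; intercept w[0] unpenalized."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err / n
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj / n
        for j in range(1, p + 1):
            grad[j] += lam * w[j]          # shrinkage from the ridge penalty
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def log_loss(X, y, w):
    total = 0.0
    for xi, yi in zip(X, y):
        pr = min(max(predict(w, xi), 1e-12), 1.0 - 1e-12)
        total -= yi * math.log(pr) + (1 - yi) * math.log(1.0 - pr)
    return total / len(y)


def cross_validate(X, y, lambdas, n_folds=3):
    """Return the penalty with the lowest average held-out log-loss, plus all scores."""
    folds = [list(range(i, len(y), n_folds)) for i in range(n_folds)]
    scores = {}
    for lam in lambdas:
        losses = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(y)) if i not in held]
            w = fit_ridge_logistic([X[i] for i in train], [y[i] for i in train], lam)
            losses.append(log_loss([X[i] for i in held_out], [y[i] for i in held_out], w))
        scores[lam] = sum(losses) / n_folds
    return min(scores, key=scores.get), scores


# Simulated toy data: genotypes coded 0/1/2 with allele frequency 0.3
rng = random.Random(0)
n, p = 150, 5
X = [[sum(rng.random() < 0.3 for _ in range(2)) for _ in range(p)] for _ in range(n)]
y = [int(rng.random() < sigmoid(-0.6 + 1.0 * xi[0])) for xi in X]
lambdas = [0.0, 0.01, 0.1, 1.0]
best_lam, cv_scores = cross_validate(X, y, lambdas)
print("selected penalty:", best_lam)
```

The penalty is chosen using only data the model was not trained on, which is the mechanism by which cross-validation guards against overfitting; larger penalties shrink the SNP coefficients toward zero, down-weighting variants that carry little predictive signal.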
Over the past decades, a growing amount of genetic, biomarker, and medical record
data from many patient populations has become electronically accessible, which opens
up new opportunities for tailoring disease prognosis and treatment to specific individuals.
Empowering supervised learning models with rich genetic information, as discussed
above, is likely to enable substantial progress in disease risk counseling and therapeutic
outcome prediction. In addition, integration of genomic profiles and clinical biomarkers
has great potential to improve patient stratification, which is crucial for targeted medical
treatment. To stratify patients into subgroups, unsupervised learning algorithms such as
k-means clustering, hierarchical clustering, and PCA can be used to reveal heterogeneity
patterns. Patients from non-trivial clusters may react differently to the same treatment,
which should be considered in the design and interpretation of clinical trials. As
genomic sequencing becomes more affordable, the ultimate goal of personalized
medicine based on genetic profiling is to make improved disease prevention, early
diagnosis, risk stratification, and customized treatment accessible to every individual.
Bibliography
Ahmadiyeh, N., Pomerantz, M. M., Grisanzio, C., et al. (2010). 8q24 prostate, breast, and colon cancer risk
loci show tissue-specific long-range interaction with MYC. Proc Natl Acad Sci U S A 107, 9742-9746.
Ahn, J., Moslehi, R., Weinstein, S. J., Snyder, K., Virtamo, J., and Albanes, D. (2008). Family history of
prostate cancer and prostate cancer risk in the Alpha-Tocopherol, Beta-Carotene Cancer Prevention
(ATBC) Study. Int J Cancer 123, 1154-1159.
Akamatsu, S., Takata, R., Ashikawa, K., et al. (2010). A functional variant in NKX3.1 associated with
prostate cancer susceptibility down-regulates NKX3.1 expression. Hum Mol Genet 19, 4265-4272.
Akamatsu, S., Takata, R., Haiman, C. A., et al. (2012). Common variants at 11q12, 10q26 and 3p11.2 are
associated with prostate cancer susceptibility in Japanese. Nat Genet 44, 426-429, S421.
Akbari, M. R., Trachtenberg, J., Lee, J., et al. (2012). Association between germline HOXB13 G84E
mutation and risk of prostate cancer. J Natl Cancer Inst 104, 1260-1262.
Al Olama, A. A., Kote-Jarai, Z., Berndt, S. I., et al. (2014). A meta-analysis of 87,040 individuals identifies
23 new susceptibility loci for prostate cancer. Nat Genet.
Al Olama, A. A., Kote-Jarai, Z., Giles, G. G., et al. (2009). Multiple loci on 8q24 associated with prostate
cancer susceptibility. Nat Genet 41, 1058-1060.
Al Olama, A. A., Kote-Jarai, Z., Schumacher, F. R., et al. (2012). A meta-analysis of genome-wide
association studies to identify prostate cancer susceptibility loci associated with aggressive and non-
aggressive disease. Hum Mol Genet.
American Cancer Society (2014). Cancer Facts & Figures 2014. Atlanta: American Cancer Society.
Amin Al Olama, A., Dadaev, T., Hazelett, D. J., et al. (2015). Multiple novel prostate cancer susceptibility
signals identified by fine-mapping of known risk loci among Europeans. Hum Mol Genet 24, 5589-5602.
Amundadottir, L. T., Sulem, P., Gudmundsson, J., et al. (2006). A common variant associated with prostate
cancer in European and African populations. Nat Genet 38, 652-658.
Andreu-Vieyra, C., Lai, J., Berman, B. P., et al. (2011). Dynamic nucleosome-depleted regions at androgen
receptor enhancers in the absence of ligand in prostate cancer cells. Mol Cell Biol 31, 4648-4662.
Berndt, S. I., Sampson, J., Yeager, M., et al. (2011). Large-scale fine mapping of the HNF1B locus and
prostate cancer risk. Hum Mol Genet.
Bock, C. H., Schwartz, A. G., Ruterbusch, J. J., et al. (2009). Results from a prostate cancer admixture
mapping study in African-American men. Hum Genet 126, 637-642.
Bouchard, L., Faucher, G., Tchernof, A., et al. (2009). Association of OSBPL11 gene polymorphisms with
cardiovascular disease risk factors in obesity. Obesity (Silver Spring) 17, 1466-1472.
Boyle, A. P., Hong, E. L., Hariharan, M., et al. (2012). Annotation of functional variation in personal
genomes using RegulomeDB. Genome Res 22, 1790-1797.
Brawley, O. W. (2012). Prostate cancer epidemiology in the United States. World J Urol 30, 195-200.
Breyer, J. P., Avritt, T. G., McReynolds, K. M., Dupont, W. D., and Smith, J. R. (2012). Confirmation of
the HOXB13 G84E germline mutation in familial prostate cancer. Cancer Epidemiol Biomarkers Prev 21,
1348-1353.
Chang, B. L., Spangler, E., Gallagher, S., et al. (2011). Validation of genome-wide prostate cancer
associations in men of African descent. Cancer Epidemiol Biomarkers Prev 20, 23-32.
Chatterjee, N., Wheeler, B., Sampson, J., Hartge, P., Chanock, S. J., and Park, J. H. (2013). Projecting the
performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet
45, 400-405, 405e401-403.
Chen, F., Stram, D. O., Le Marchand, L., et al. (2010). Caution in generalizing known genetic risk
markers for breast cancer across all ethnic/racial populations. Eur J Hum Genet 19, 243-245.
Chen, W., Larrabee, B. R., Ovsyannikova, I. G., et al. (2015). Fine Mapping Causal Variants with an
Approximate Bayesian Method Using Marginal Test Statistics. Genetics 200, 719-736.
Chen, Z. J., Zhao, H., He, L., et al. (2011). Genome-wide association study identifies susceptibility loci for
polycystic ovary syndrome on chromosome 2p16.3, 2p21 and 9q33.3. Nat Genet 43, 55-59.
Cheng, I., Chen, G. K., Nakagawa, H., et al. (2012). Evaluating genetic risk for prostate cancer among
Japanese and Latinos. Cancer Epidemiol Biomarkers Prev 21, 2048-2058.
Cheng, I., Kocarnik, J. M., Dumitrescu, L., et al. (2013). Pleiotropic effects of genetic risk variants for
other cancers on colorectal cancer risk: PAGE, GECCO and CCFR consortia. Gut.
Chung, C. C., Boland, J., Yeager, M., et al. (2012). Comprehensive resequence analysis of a 123-kb region
of chromosome 11q13 associated with prostate cancer. Prostate 72, 476-486.
Chung, C. C., Ciampa, J., Yeager, M., et al. (2011a). Fine mapping of a region of chromosome 11q13
reveals multiple independent loci associated with risk of prostate cancer. Hum Mol Genet 20, 2869-2878.
Chung, S., Nakagawa, H., Uemura, M., et al. (2011b). Association of a novel long non-coding RNA in
8q24 with prostate cancer susceptibility. Cancer Sci 102, 245-252.
Coetzee, G. A., Jia, L., Frenkel, B., et al. (2010). A systematic approach to understand the functional
consequences of non-protein coding risk regions. Cell Cycle 9, 256-259.
Coetzee, S. G., Rhie, S. K., Berman, B. P., Coetzee, G. A., and Noushmehr, H. (2012). FunciSNP: an
R/bioconductor tool integrating functional non-coding data sets with genetic association studies to identify
candidate regulatory SNPs. Nucleic Acids Res 40, e139.
Conde, L., Halperin, E., Akers, N. K., et al. (2010). Genome-wide association study of follicular lymphoma
identifies a risk locus at 6p21.32. Nat Genet 42, 661-664.
Conti, D. V., and Witte, J. S. (2003). Hierarchical modeling of linkage disequilibrium: genetic structure and
spatial relations. Am J Hum Genet 72, 351-363.
Cook, M. B., Wang, Z., Yeboah, E. D., et al. (2014). A genome-wide association study of prostate cancer
in West African men. Hum Genet 133, 509-521.
Cookson, W., Liang, L., Abecasis, G., Moffatt, M., and Lathrop, M. (2009). Mapping complex disease
traits with global gene expression. Nat Rev Genet 10, 184-194.
Coram, M. A., Candille, S. I., Duan, Q., et al. (2015). Leveraging Multi-ethnic Evidence for Mapping
Complex Traits in Minority Populations: An Empirical Bayes Approach. Am J Hum Genet 96, 740-752.
Cropp, C. D., Simpson, C. L., Wahlfors, T., et al. (2011). Genome-wide linkage scan for prostate cancer
susceptibility in Finland: evidence for a novel locus on 2q37.3 and confirmation of signal on 17q21-q22.
Int J Cancer 129, 2400-2407.
Crowther-Swanepoel, D., Broderick, P., Di Bernardo, M. C., et al. (2010). Common variants at 2q37.3,
8q24.21, 15q21.3 and 16q24.1 influence chronic lymphocytic leukemia risk. Nat Genet 42, 132-136.
Dastani, Z., Hivert, M. F., Timpson, N., et al. (2012). Novel loci for adiponectin levels and their influence
on type 2 diabetes and metabolic traits: a multi-ethnic meta-analysis of 45,891 individuals. PLoS Genet 8,
e1002607.
de Bakker, P. I., Ferreira, M. A., Jia, X., Neale, B. M., Raychaudhuri, S., and Voight, B. F. (2008).
Practical aspects of imputation-driven meta-analysis of genome-wide association studies. Hum Mol Genet
17, R122-128.
De Iorio, M., Newcombe, P. J., Tachmazidou, I., Verzilli, C. J., and Whittaker, J. C. (2011). Bayesian
semiparametric meta-analysis for genetic association studies. Genet Epidemiol 35, 333-340.
DerSimonian, R., and Laird, N. (1986). Meta-analysis in clinical trials. Control Clin Trials 7, 177-188.
Diabetes Genetics Replication Meta-analysis Consortium, Asian Genetic Epidemiology Network Type 2
Diabetes Consortium, South Asian Type 2 Diabetes Consortium, et al. (2014). Genome-wide trans-
ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat
Genet 46, 234-244.
Dickson, S. P., Wang, K., Krantz, I., Hakonarson, H., and Goldstein, D. B. (2010). Rare variants create
synthetic genome-wide associations. PLoS Biol 8, e1000294.
Duggan, D., Zheng, S. L., Knowlton, M., et al. (2007). Two genome-wide association studies of aggressive
prostate cancer implicate putative prostate tumor suppressor gene DAB2IP. J Natl Cancer Inst 99, 1836-
1844.
Easton, D. F., Pooley, K. A., Dunning, A. M., et al. (2007). Genome-wide association study identifies
novel breast cancer susceptibility loci. Nature 447, 1087-1093.
Eeles, R., Goh, C., Castro, E., et al. (2014). The genetic epidemiology of prostate cancer and its clinical
implications. Nat Rev Urol 11, 18-31.
Eeles, R. A., Kote-Jarai, Z., Al Olama, A. A., et al. (2009). Identification of seven new prostate cancer
susceptibility loci through a genome-wide association study. Nat Genet 41, 1116-1121.
Eeles, R. A., Kote-Jarai, Z., Giles, G. G., et al. (2008). Multiple newly identified loci associated with
prostate cancer susceptibility. Nat Genet 40, 316-321.
Eeles, R. A., Olama, A. A., Benlloch, S., et al. (2013). Identification of 23 new prostate cancer
susceptibility loci using the iCOGS custom genotyping array. Nat Genet 45, 385-391, 391e381-382.
Eeles, R. A., Olama, A. A. A., Benlloch, S., et al. (2012). Identification of 23 novel prostate cancer
susceptibility loci using a custom array (the iCOGS) in an international consortium, PRACTICAL. In
review for Nature Genetics.
Egan, K. M., Thompson, R. C., Nabors, L. B., et al. (2011). Cancer susceptibility variants and the risk of
adult glioma in a US case-control study. J Neurooncol 104, 535-542.
Elston, R. C., Satagopan, J. M., and Sun, S. (2012). Statistical Human Genetics: Methods and Protocols:
Humana Press.
Evangelou, E., and Ioannidis, J. P. (2013). Meta-analysis methods for genome-wide association studies and
beyond. Nat Rev Genet 14, 379-389.
Ewing, C. M., Ray, A. M., Lange, E. M., et al. (2012). Germline mutations in HOXB13 and prostate-cancer
risk. N Engl J Med 366, 141-149.
Fitzgerald, L. M., Kumar, A., Boyle, E. A., et al. (2013). Germline missense variants in the BTNL2 gene
are associated with prostate cancer susceptibility. Cancer Epidemiol Biomarkers Prev 22, 1520-1528.
Franceschini, N., van Rooij, F. J., Prins, B. P., et al. (2012). Discovery and fine mapping of serum protein
loci through transethnic meta-analysis. Am J Hum Genet 91, 744-753.
Freedman, M. L., Haiman, C. A., Patterson, N., et al. (2006). Admixture mapping identifies 8q24 as a
prostate cancer risk locus in African-American men. Proc Natl Acad Sci U S A 103, 14068-14073.
Goode, E. L., Chenevix-Trench, G., Song, H., et al. (2010). A genome-wide association study identifies
susceptibility loci for ovarian cancer at 2q31 and 8q24. Nat Genet 42, 874-879.
Gordon, I. L., Byth, D. E., and Balaam, L. N. (1972). Variance of heritability ratios estimated from
phenotypic variance components. Biometrics 28, 401-415.
Grisanzio, C., Werner, L., Takeda, D., et al. (2012). Genetic and functional analyses implicate the
NUDT11, HNF1B, and SLC22A3 genes in prostate cancer pathogenesis. Proc Natl Acad Sci U S A 109,
11252-11257.
Gudmundsson, J., Besenbacher, S., Sulem, P., et al. (2010). Genetic correction of PSA values using
sequence variants associated with PSA levels. Sci Transl Med 2, 62ra92.
Gudmundsson, J., Sulem, P., Gudbjartsson, D. F., et al. (2009). Genome-wide association and replication
studies identify four variants associated with prostate cancer susceptibility. Nat Genet 41, 1122-1126.
Gudmundsson, J., Sulem, P., Gudbjartsson, D. F., et al. (2012). A study based on whole-genome
sequencing yields a rare variant at 8q24 associated with prostate cancer. Nat Genet 44, 1326-1329.
Gudmundsson, J., Sulem, P., Rafnar, T., et al. (2008). Common sequence variants on 2p15 and Xp11.22
confer susceptibility to prostate cancer. Nat Genet 40, 281-283.
Haiman, C. A., Chen, G. K., Blot, W. J., et al. (2011a). Characterizing genetic risk at known prostate
cancer susceptibility loci in African Americans. PLoS Genet 7, e1001387.
Haiman, C. A., Chen, G. K., Blot, W. J., et al. (2011b). Genome-wide association study of prostate cancer
in men of African ancestry identifies a susceptibility locus at 17q21. Nat Genet 43, 570-573.
Haiman, C. A., Chen, G. K., Vachon, C. M., et al. (2011c). A common variant at the TERT-CLPTM1L
locus is associated with estrogen receptor-negative breast cancer. Nat Genet 43, 1210-1214.
Haiman, C. A., Han, Y., Feng, Y., et al. (2013). Genome-wide testing of putative functional exonic variants
in relationship with breast and prostate cancer risk in a multiethnic population. PLoS Genet 9, e1003419.
Haiman, C. A., Patterson, N., Freedman, M. L., et al. (2007). Multiple regions within 8q24 independently
affect risk for prostate cancer. Nat Genet 39, 638-644.
Han, B., and Eskin, E. (2011). Random-effects model aimed at discovering associations in meta-analysis of
genome-wide association studies. Am J Hum Genet 88, 586-598.
Han, B., Kang, H. M., and Eskin, E. (2009). Rapid and accurate multiple testing correction and power
estimation for millions of correlated markers. PLoS Genet 5, e1000456.
Han, Y., Hazelett, D. J., Wiklund, F., et al. (2015). Integration of multiethnic fine-mapping and genomic
annotation to prioritize candidate functional SNPs at prostate cancer susceptibility regions. Hum Mol
Genet.
Han, Y., Rand, K. A., Hazelett, D. J., et al. (2016a). Prostate Cancer Susceptibility in Men of African
Ancestry at 8q24. J Natl Cancer Inst 108.
Han, Y., Rand, K. A., Hazelett, D. J., et al. (2016b). Prostate Cancer Susceptibility in Men of African
Ancestry at 8q24. J Natl Cancer Inst 108.
Han, Y., Signorello, L. B., Strom, S. S., et al. (2014). Generalizability of established prostate cancer risk
variants in men of African ancestry. Int J Cancer.
Hardy, R. J., and Thompson, S. G. (1996). A likelihood approach to meta-analysis with random effects.
Stat Med 15, 619-629.
Hauck, W., and Donner, A. (1977). Wald's Test as Applied to Hypotheses in Logit Analysis. JASA 72, 851-
853.
Hazelett, D. J., Conti, D. V., Han, Y., et al. (2016). Reducing GWAS Complexity. Cell Cycle 15, 22-24.
Hazelett, D. J., Rhie, S. K., Gaddis, M., et al. (2014). Comprehensive functional annotation of 77 prostate
cancer risk loci. PLoS Genet 10, e1004102.
Hemminki, K. (2012). Familial risk and familial survival in prostate cancer. World J Urol 30, 143-148.
Higgins, J. P., Thompson, S. G., Deeks, J. J., and Altman, D. G. (2003). Measuring inconsistency in meta-
analyses. BMJ 327, 557-560.
Hindorff, L. A., Sethupathy, P., Junkins, H. A., et al. (2009). Potential etiologic and functional implications
of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106, 9362-9367.
Hjelmborg, J. B., Scheike, T., Holst, K., et al. (2014). The heritability of prostate cancer in the nordic twin
study of cancer. Cancer Epidemiol Biomarkers Prev 23, 2303-2310.
Hoggart, C. J., Shriver, M. D., Kittles, R. A., Clayton, D. G., and McKeigue, P. M. (2004). Design and
analysis of admixture mapping studies. Am J Hum Genet 74, 965-978.
Hollink, I. H., van den Heuvel-Eibrink, M. M., Arentsen-Peters, S. T., et al. (2011). NUP98/NSD1
characterizes a novel poor prognostic group in acute myeloid leukemia with a distinct HOX gene
expression pattern. Blood 118, 3645-3656.
Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B., and Eskin, E. (2014). Identifying causal variants at
loci with multiple signals of association. Genetics 198, 497-508.
Howie, B. N., Donnelly, P., and Marchini, J. (2009). A flexible and accurate genotype imputation method
for the next generation of genome-wide association studies. PLoS Genet 5, e1000529.
Huang, Q., Whitington, T., Gao, P., et al. (2014). A prostate cancer susceptibility allele at 6q22 increases
RFX6 expression by modulating HOXB13 chromatin binding. Nat Genet 46, 126-135.
Hunter, D. J., Riboli, E., Haiman, C. A., et al. (2005). A candidate gene approach to searching for low-
penetrance breast and prostate cancer genes. Nat Rev Cancer 5, 977-985.
Jeggari, A., Marks, D. S., and Larsson, E. (2012). miRcode: a map of putative microRNA target sites in the
long non-coding transcriptome. Bioinformatics 28, 2062-2063.
Jemal, A., Bray, F., Center, M. M., Ferlay, J., Ward, E., and Forman, D. (2011). Global cancer statistics.
CA Cancer J Clin 61, 69-90.
Jia, L., Landan, G., Pomerantz, M., et al. (2009). Functional enhancers at the gene-poor 8q24 cancer-linked
locus. PLoS Genet 5, e1000597.
Kichaev, G., and Pasaniuc, B. (2015). Leveraging Functional-Annotation Data in Trans-ethnic Fine-
Mapping Studies. Am J Hum Genet 97, 260-271.
Kichaev, G., Yang, W. Y., Lindstrom, S., et al. (2014). Integrating functional data to prioritize causal
variants in statistical fine-mapping studies. PLoS Genet 10, e1004722.
Kim, T., Cui, R., Jeon, Y. J., et al. (2014). Long-range interaction and correlation between MYC enhancer
and oncogenic long noncoding RNA CARLo-5. Proc Natl Acad Sci U S A 111, 4173-4178.
King, M. C., Marks, J. H., and Mandell, J. B. (2003). Breast and ovarian cancer risks due to inherited
mutations in BRCA1 and BRCA2. Science 302, 643-646.
Klein, R. J., Hallden, C., Cronin, A. M., et al. (2010). Blood biomarker levels to aid discovery of cancer-
related single-nucleotide polymorphisms: kallikreins and prostate cancer. Cancer Prev Res (Phila) 3, 611-
619.
Kolonel, L. N., Henderson, B. E., Hankin, J. H., et al. (2000). A multiethnic cohort in Hawaii and Los
Angeles: baseline characteristics. Am J Epidemiol 151, 346-357.
Kote-Jarai, Z., Amin Al Olama, A., Leongamornlert, D., et al. (2011a). Identification of a novel prostate
cancer susceptibility variant in the KLK3 gene transcript. Hum Genet 129, 687-694.
Kote-Jarai, Z., Easton, D. F., Stanford, J. L., et al. (2008). Multiple novel prostate cancer predisposition
loci confirmed by an international study: the PRACTICAL Consortium. Cancer Epidemiol Biomarkers
Prev 17, 2052-2061.
Kote-Jarai, Z., Olama, A. A., Giles, G. G., et al. (2011b). Seven prostate cancer susceptibility loci
identified by a multi-stage genome-wide association study. Nat Genet 43, 785-791.
Kote-Jarai, Z., Saunders, E. J., Leongamornlert, D. A., et al. (2013). Fine-mapping identifies multiple
prostate cancer risk loci at 5p15, one of which associates with TERT expression. Hum Mol Genet 22, 2520-
2528.
Landi, M. T., Chatterjee, N., Yu, K., et al. (2009). A genome-wide association study of lung cancer
identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. Am J Hum Genet 85,
679-691.
Lange, E. M., Gillanders, E. M., Davis, C. C., et al. (2003). Genome-wide scan for prostate cancer
susceptibility genes using families from the University of Michigan prostate cancer genetics project finds
evidence for linkage on chromosome 17 near BRCA1. Prostate 57, 326-334.
Laramie, J. M., Wilk, J. B., Williamson, S. L., et al. (2008). Polymorphisms near EXOC4 and LRGUK on
chromosome 7q32 are associated with Type 2 Diabetes and fasting glucose; the NHLBI Family Heart
Study. BMC Med Genet 9, 46.
Leek, R. D., Lewis, C. E., Whitehouse, R., Greenall, M., Clarke, J., and Harris, A. L. (1996). Association of
macrophage infiltration with angiogenesis and prognosis in invasive breast carcinoma. Cancer Res 56,
4625-4629.
Li, B., and Leal, S. M. (2008). Methods for detecting associations with rare variants for common diseases:
application to analysis of sequence data. Am J Hum Genet 83, 311-321.
Li, Q., Seo, J. H., Stranger, B., et al. (2013). Integrative eQTL-based analyses reveal the biology of breast
cancer risk loci. Cell 152, 633-641.
Li, Q., Stram, A., Chen, C., et al. (2014). Expression QTL-based analyses reveal candidate causal genes
and loci across five tumor types. Hum Mol Genet 23, 5294-5302.
Li, Y. R., and Keating, B. J. (2014). Trans-ethnic genome-wide association studies: advantages and
challenges of mapping in diverse populations. Genome Med 6, 91.
Libbrecht, M. W., and Noble, W. S. (2015). Machine learning applications in genetics and genomics. Nat
Rev Genet 16, 321-332.
Lichtenstein, P., Holm, N. V., Verkasalo, P. K., et al. (2000). Environmental and heritable factors in the
causation of cancer--analyses of cohorts of twins from Sweden, Denmark, and Finland. N Engl J Med 343,
78-85.
Lin, L. L., Huang, H. C., and Juan, H. F. (2012). Revealing the molecular mechanism of gastric cancer
marker annexin A4 in cancer cell proliferation using exon arrays. PLoS One 7, e44615.
Lin, X., Qu, L., Chen, Z., et al. (2013). A novel germline mutation in HOXB13 is associated with prostate
cancer risk in Chinese men. Prostate 73, 169-175.
Lindstrom, S., Schumacher, F., Siddiq, A., et al. (2011). Characterizing associations and SNP-environment
interactions for GWAS-identified prostate cancer risk markers--results from BPC3. PLoS One 6, e17142.
Lindstrom, S., Schumacher, F. R., Campa, D., et al. (2012). Replication of five prostate cancer loci
identified in an Asian population--results from the NCI Breast and Prostate Cancer Cohort Consortium
(BPC3). Cancer Epidemiol Biomarkers Prev 21, 212-216.
Liu, C. T., Buchkovich, M. L., Winkler, T. W., et al. (2014). Multi-ethnic fine-mapping of 14 central
adiposity loci. Hum Mol Genet 23, 4738-4744.
Liu, M., Wang, J., Xu, Y., Wei, D., Shi, X., and Yang, Z. (2012). Risk loci on chromosome 8q24 are
associated with prostate cancer in northern Chinese men. J Urol 187, 315-321.
Lou, H., Yeager, M., Li, H., et al. (2009). Fine mapping and functional analysis of a common variant in
MSMB on chromosome 10q11.2 associated with prostate cancer susceptibility. Proc Natl Acad Sci U S A
106, 7933-7938.
Mao, X., Bigham, A. W., Mei, R., et al. (2007). A genomewide admixture mapping panel for
Hispanic/Latino populations. Am J Hum Genet 80, 1171-1178.
Matise, T. C., Ambite, J. L., Buyske, S., et al. (2011). The Next PAGE in understanding complex traits:
design for the analysis of Population Architecture Using Genetics and Epidemiology (PAGE) Study. Am J
Epidemiol 174, 849-859.
Matsuyama, T., Ishikawa, T., Mogushi, K., et al. (2010). MUC12 mRNA expression is an independent
marker of prognosis in stage II and stage III colorectal cancer. Int J Cancer 127, 2292-2299.
Maurano, M. T., Humbert, R., Rynes, E., et al. (2012). Systematic localization of common disease-
associated variation in regulatory DNA. Science 337, 1190-1195.
Mensah-Ablorh, A., Lindstrom, S., Haiman, C. A., et al. (2016). Meta-Analysis of Rare Variant
Association Tests in Multiethnic Populations. Genet Epidemiol 40, 57-65.
Montana, G., and Pritchard, J. K. (2004). Statistical tests for admixture mapping with case-control and
cases-only data. Am J Hum Genet 75, 771-789.
Morris, A. P. (2011). Transethnic meta-analysis of genomewide association studies. Genet Epidemiol 35,
809-822.
Morrissey, C., True, L. D., Roudier, M. P., et al. (2008). Differential expression of angiogenesis associated
genes in prostate cancer bone, liver and lymph node metastases. Clin Exp Metastasis 25, 377-388.
Morton, N. E. (1955). Sequential tests for the detection of linkage. Am J Hum Genet 7, 277-318.
Nair, R. P., Stuart, P. E., Nistor, I., et al. (2006). Sequence and haplotype analysis supports HLA-C as the
psoriasis susceptibility 1 gene. Am J Hum Genet 78, 827-851.
Neale, M., Cardon, L., and NATO Scientific Affairs Division (1992). Methodology for Genetic Studies of Twins
and Families: Springer.
Neefjes, J., Jongsma, M. L., Paul, P., and Bakke, O. (2011). Towards a systems understanding of MHC
class I and MHC class II antigen presentation. Nat Rev Immunol 11, 823-836.
Nicolae, D. L., Gamazon, E., Zhang, W., Duan, S., Dolan, M. E., and Cox, N. J. (2010). Trait-associated
SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet 6,
e1000888.
Onengut-Gumuscu, S., Chen, W. M., Burren, O., et al. (2015). Fine mapping of type 1 diabetes
susceptibility loci and evidence for colocalization of causal variants with lymphoid gene enhancers. Nat
Genet 47, 381-386.
Ong, R. T., Wang, X., Liu, X., and Teo, Y. Y. (2012). Efficiency of trans-ethnic genome-wide meta-
analysis and fine-mapping. Eur J Hum Genet 20, 1300-1307.
Page, W. F., Braun, M. M., Partin, A. W., Caporaso, N., and Walsh, P. (1997). Heredity and prostate
cancer: a study of World War II veteran twins. Prostate 33, 240-245.
Parikh, H., Wang, Z., Pettigrew, K. A., et al. (2011). Fine mapping the KLK3 locus on chromosome
19q13.33 associated with prostate cancer susceptibility and PSA levels. Hum Genet 129, 675-685.
Parker, S. C., Stitzel, M. L., Taylor, D. L., et al. (2013). Chromatin stretch enhancer states drive cell-
specific gene regulation and harbor human disease risk variants. Proc Natl Acad Sci U S A 110, 17921-
17926.
Patterson, N., Hattangadi, N., Lane, B., et al. (2004). Methods for high-density admixture mapping of
disease genes. Am J Hum Genet 74, 979-1000.
Peters, U., Hutter, C. M., Hsu, L., et al. (2012). Meta-analysis of new genome-wide association studies of
colorectal cancer risk. Hum Genet 131, 217-234.
Petersen, G. M., Amundadottir, L., Fuchs, C. S., et al. (2010). A genome-wide association study identifies
pancreatic cancer susceptibility loci on chromosomes 13q22.1, 1q32.1 and 5p15.33. Nat Genet 42, 224-
228.
Pi, M., Parrill, A. L., and Quarles, L. D. (2010). GPRC6A mediates the non-genomic effects of steroids. J
Biol Chem 285, 39953-39964.
Pi, M., and Quarles, L. D. (2012). GPRC6A regulates prostate cancer progression. Prostate 72, 399-409.
Pickrell, J. K. (2014). Joint analysis of functional genomic data and genome-wide association studies of 18
human traits. Am J Hum Genet 94, 559-573.
Pomerantz, M. M., Shrestha, Y., Flavin, R. J., et al. (2010). Analysis of the 10q11 cancer risk locus
implicates MSMB and NCOA4 in human prostate tumorigenesis. PLoS Genet 6, e1001204.
Prensner, J. R., Chen, W., Iyer, M. K., et al. (2014). PCAT-1, a long noncoding RNA, regulates BRCA2
and controls homologous recombination in cancer. Cancer Res 74, 1651-1660.
Prensner, J. R., Iyer, M. K., Balbin, O. A., et al. (2011). Transcriptome sequencing across a prostate cancer
cohort identifies PCAT-1, an unannotated lincRNA implicated in disease progression. Nat Biotechnol 29,
742-749.
Price, A. L., Patterson, N., Yu, F., et al. (2007). A genomewide admixture map for Latino populations. Am
J Hum Genet 80, 1024-1036.
Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. (2006).
Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38,
904-909.
Purcell, S., Neale, B., Todd-Brown, K., et al. (2007). PLINK: a tool set for whole-genome association and
population-based linkage analyses. Am J Hum Genet 81, 559-575.
Rafnar, T., Sulem, P., Stacey, S. N., et al. (2009). Sequence variants at the TERT-CLPTM1L locus
associate with many cancer types. Nat Genet 41, 221-227.
Rand, K. A., Rohland, N., Tandon, A., et al. (2016). Whole-exome sequencing of over 4100 men of
African ancestry and prostate cancer risk. Hum Mol Genet 25, 371-381.
Rhie, S. K., Hazelett, D. J., Coetzee, S. G., Yan, C., Noushmehr, H., and Coetzee, G. A. (2014).
Nucleosome positioning and histone modifications define relationships between regulatory elements and
nearby gene expression in breast epithelial cells. BMC Genomics 15, 331.
Rothman, N., Garcia-Closas, M., Chatterjee, N., et al. (2010). A multi-stage genome-wide association
study of bladder cancer identifies multiple susceptibility loci. Nat Genet 42, 978-984.
Sakoda, L. C., Jorgenson, E., and Witte, J. S. (2013). Turning of COGS moves forward findings for
hormonally mediated cancers. Nat Genet 45, 345-348.
Schumacher, F. R., Berndt, S. I., Siddiq, A., et al. (2011). Genome-wide association study identifies new
prostate cancer susceptibility loci. Hum Mol Genet 20, 3867-3875.
Schumacher, F. R., Feigelson, H. S., Cox, D. G., et al. (2007). A common 8q24 variant in prostate and
breast cancer from a large nested case-control study. Cancer Res 67, 2951-2956.
Sharma, N. L., Massie, C. E., Ramos-Montoya, A., et al. (2013). The androgen receptor induces a distinct
transcriptional program in castration-resistant prostate cancer in man. Cancer Cell 23, 35-47.
Sherry, S. T., Ward, M. H., Kholodov, M., et al. (2001). dbSNP: the NCBI database of genetic variation.
Nucleic Acids Res 29, 308-311.
Shete, S., Hosking, F. J., Robertson, L. B., et al. (2009). Genome-wide association study identifies five
susceptibility loci for glioma. Nat Genet 41, 899-904.
Sirota, M., Schaub, M. A., Batzoglou, S., Robinson, W. H., and Butte, A. J. (2009). Autoimmune disease
classification by inverse association with SNP alleles. PLoS Genet 5, e1000792.
Skibola, C. F., Bracci, P. M., Halperin, E., et al. (2009). Genetic variants at 6p21.33 are associated with
susceptibility to follicular lymphoma. Nat Genet 41, 873-875.
Smith, M. W., and O'Brien, S. J. (2005). Mapping by admixture linkage disequilibrium: advances,
limitations and guidelines. Nat Rev Genet 6, 623-632.
Smith, M. W., Patterson, N., Lautenberger, J. A., et al. (2004). A high-density admixture map for disease
gene discovery in African Americans. Am J Hum Genet 74, 1001-1013.
Sotelo, J., Esposito, D., Duhagon, M. A., et al. (2010). Long-range enhancers on 8q24 regulate c-Myc.
Proc Natl Acad Sci U S A 107, 3001-3005.
Spain, S. L., and Barrett, J. C. (2015). Strategies for fine-mapping complex traits. Hum Mol Genet.
Spisak, S., Lawrenson, K., Fu, Y., et al. (2015). CAUSEL: an epigenome- and genome-editing pipeline for
establishing function of noncoding GWAS variants. Nat Med.
Stram, D. (2013). Design, Analysis, and Interpretation of Genome-Wide Association Scans: Springer New
York.
Stram, D. O. (1996). Meta-analysis of published data using a linear mixed-effects model. Biometrics 52,
536-544.
Sun, J., Zheng, S. L., Wiklund, F., et al. (2009). Sequence variants at 22q13 are associated with prostate
cancer risk. Cancer Res 69, 10-15.
Sun, J., Zheng, S. L., Wiklund, F., et al. (2008). Evidence for two independent prostate cancer risk-
associated loci in the HNF1B gene at 17q12. Nat Genet 40, 1153-1155.
Takata, R., Akamatsu, S., Kubo, M., et al. (2010). Genome-wide association study identifies five new
susceptibility loci for prostate cancer in the Japanese population. Nat Genet 42, 751-754.
Takayama, K. I., Suzuki, T., Fujimura, T., et al. (2014). CtBP2 Modulates the Androgen Receptor to
Promote Prostate Cancer Progression. Cancer Res.
Takeuchi, F., Isono, M., Nabika, T., et al. (2011). Confirmation of ALDH2 as a major locus of drinking
behavior and of its variants regulating multiple metabolic phenotypes in a Japanese population. Circ J 75,
911-918.
Tan, P. Y., Chang, C. W., Chng, K. R., Wansa, K. D., Sung, W. K., and Cheung, E. (2012). Integration of
regulatory networks by NKX3-1 promotes androgen-dependent prostate cancer survival. Mol Cell Biol 32,
399-414.
Teo, Y. Y., Ong, R. T., Sim, X., Tai, E. S., and Chia, K. S. (2010). Identifying candidate causal variants via
trans-population fine-mapping. Genet Epidemiol 34, 653-664.
Thurman, R. E., Rynes, E., Humbert, R., et al. (2012). The accessible chromatin landscape of the human
genome. Nature 489, 75-82.
Tian, C., Hinds, D. A., Shigeta, R., Kittles, R., Ballinger, D. G., and Seldin, M. F. (2006). A genomewide
single-nucleotide-polymorphism panel with high ancestry information for African American admixture
mapping. Am J Hum Genet 79, 640-649.
Turnbull, C., Rapley, E. A., Seal, S., et al. (2010). Variants near DMRT1, TERT and ATF7IP are
associated with testicular germ cell cancer. Nat Genet 42, 604-607.
Vansteelandt, S., and Lange, C. (2012). Causation and causal inference for genetic effects. Hum Genet 131,
1665-1676.
Verzilli, C., Shah, T., Casas, J. P., et al. (2008). Bayesian meta-analysis of genetic association studies with
different sets of markers. Am J Hum Genet 82, 859-872.
Visscher, P. M., Hill, W. G., and Wray, N. R. (2008). Heritability in the genomics era--concepts and
misconceptions. Nat Rev Genet 9, 255-266.
Wakefield, J. (2007). A Bayesian measure of the probability of false discovery in genetic epidemiology
studies. Am J Hum Genet 81, 208-227.
Wakefield, J. (2009). Bayes factors for genome-wide association studies: comparison with P-values. Genet
Epidemiol 33, 79-86.
Wang, D., Garcia-Bassets, I., Benner, C., et al. (2011). Reprogramming transcription by distinct classes of
enhancers functionally defined by eRNA. Nature 474, 390-394.
Wang, X., Oldani, M. J., Zhao, X., Huang, X., and Qian, D. (2014). A review of cancer risk prediction
models with genetic variants. Cancer Inform 13, 19-28.
Ward, L. D., and Kellis, M. (2012). HaploReg: a resource for exploring chromatin states, conservation, and
regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res 40, D930-934.
Wellcome Trust Case Control Consortium, Maller, J. B., McVean, G., et al. (2012). Bayesian refinement of
association signals for 14 loci in 3 common diseases. Nat Genet 44, 1294-1301.
Westra, H. J., and Franke, L. (2014). From genome to function by studying eQTLs. Biochim Biophys Acta
1842, 1896-1902.
Willeit, P., Willeit, J., Mayr, A., et al. (2010). Telomere length and risk of incident cancer and cancer
mortality. JAMA 304, 69-75.
Witte, J. S., Mefford, J., Plummer, S. J., et al. (2013). HOXB13 mutation and prostate cancer: studies of
siblings and aggressive disease. Cancer Epidemiol Biomarkers Prev 22, 675-680.
Witte, J. S., Visscher, P. M., and Wray, N. R. (2014). The contribution of genetic variants to disease
depends on the ruler. Nat Rev Genet.
Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011). Rare-variant association testing for
sequencing data with the sequence kernel association test. Am J Hum Genet 89, 82-93.
Wu, Y., Waite, L. L., Jackson, A. U., et al. (2013). Trans-ethnic fine-mapping of lipid loci identifies
population-specific signals and allelic heterogeneity that increases the trait variance explained. PLoS Genet
9, e1003379.
Xu, J., Kibel, A. S., Hu, J. J., et al. (2009). Prostate cancer risk associated loci in African Americans.
Cancer Epidemiol Biomarkers Prev 18, 2145-2149.
Xu, J., Mo, Z., Ye, D., et al. (2012). Genome-wide association study in Chinese men identifies two new
prostate cancer risk loci at 9q31.2 and 19q13.4. Nat Genet 44, 1231-1235.
Xu, X., Hussain, W. M., Vijai, J., et al. (2014). Variants at IRX4 as prostate cancer expression quantitative
trait loci. Eur J Hum Genet 22, 558-563.
Yang, J., Benyamin, B., McEvoy, B. P., et al. (2010). Common SNPs explain a large proportion of the
heritability for human height. Nat Genet 42, 565-569.
Yang, J., Ferreira, T., Morris, A. P., et al. (2012). Conditional and joint multiple-SNP analysis of GWAS
summary statistics identifies additional variants influencing complex traits. Nat Genet 44, 369-375, S361-
363.
Yang, L., Lin, C., Jin, C., et al. (2013). lncRNA-dependent mechanisms of androgen-receptor-regulated
gene activation programs. Nature 500, 598-602.
Yao, L., Shen, H., Laird, P. W., Farnham, P. J., and Berman, B. P. (2015). Inferring regulatory element
landscapes and transcription factor networks from cancer methylomes. Genome Biol 16, 105.
Yeager, M., Chatterjee, N., Ciampa, J., et al. (2009). Identification of a new prostate cancer susceptibility
locus on chromosome 8q24. Nat Genet 41, 1055-1057.
Yeager, M., Orr, N., Hayes, R. B., et al. (2007). Genome-wide association study of prostate cancer
identifies a second risk locus at 8q24. Nat Genet 39, 645-649.
Zaitlen, N., Pasaniuc, B., Gur, T., Ziv, E., and Halperin, E. (2010). Leveraging genetic variability across
populations for the identification of causal variants. Am J Hum Genet 86, 23-33.
Zaitlen, N., Pasaniuc, B., Sankararaman, S., et al. (2014). Leveraging population admixture to characterize
the heritability of complex traits. Nat Genet.
Zhang, J., and Stram, D. O. (2014). The role of local ancestry adjustment in association studies using
admixed populations. Genet Epidemiol 38, 502-515.
Zheng, S. L., Stevens, V. L., Wiklund, F., et al. (2009). Two independent prostate cancer risk-associated
loci at 11q13. Cancer Epidemiol Biomarkers Prev 18, 1815-1820.
Abstract
This dissertation focuses on the identification and fine-mapping of genetic susceptibility loci for prostate cancer. In Chapter 1, we reviewed previous efforts and recent advances in understanding genetic predisposition to prostate cancer, highlighting the success of GWAS in investigating this polygenic disease. In Chapters 2 and 3, we identified 20+ novel risk loci for prostate cancer by combining samples from multiple studies and by expanding the set of variants explored through exome-wide genotyping and genome-wide imputation. In Chapters 4–6, we evaluated the known risk loci for prostate cancer through generalizability analysis and fine-mapping. Using a large multiethnic sample, integration of statistical evidence and functional annotation prioritized ~4 SNPs per region for future biological testing.
Asset Metadata
Creator
Han, Ying (author)
Core Title
Identification and fine-mapping of genetic susceptibility loci for prostate cancer and statistical methodology for multiethnic fine-mapping
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
07/14/2016
Defense Date
02/04/2016
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
genome-wide association study, multiethnic fine-mapping, OAI-PMH Harvest, prostate cancer, Statistical Genetics
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Haiman, Christopher (committee chair), Stram, Daniel (committee chair), Coetzee, Gerhard (committee member), Conti, David (committee member), Schumacher, Fredrick (committee member)
Creator Email
yinghan@usc.edu,yinghan1128@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-268529
Unique identifier
UC11281007
Identifier
etd-HanYing-4546.pdf (filename),usctheses-c40-268529 (legacy record id)
Legacy Identifier
etd-HanYing-4546.pdf
Dmrecord
268529
Document Type
Dissertation
Rights
Han, Ying
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA