POLYGENIC ANALYSES OF COMPLEX TRAITS IN
COMPLEX POPULATIONS
by
Fang Chen
_________________________________________________________
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)
December 2012
Copyright 2012 Fang Chen
Acknowledgements
I would like to thank my dissertation committee members, especially Drs. Daniel O.
Stram and Christopher A. Haiman for their guidance in the completion of this dissertation.
I would also like to thank members of our research group: Dr. Gary K. Chen, Alex Stram,
Grace Sheng and Peggy Wan for their inspiration and assistance in computational skills,
data collection and data cleaning. Finally I thank the men and women who participated in
the epidemiological studies described below and contribute to a better understanding of
the effects of polygenes in the etiology of complex traits.
Table of Contents
Acknowledgements
List of Tables
List of Figures
Abstract
Introduction
Chapter One: Replication study of breast cancer GWAS in the Multiethnic Cohort (MEC)
Chapter Two: Testing for pleiotropic effects of known type 2 diabetes (T2D) and obesity risk variants on breast cancer
Chapter Three: Fine-mapping of known breast cancer risk loci in African Americans
Chapter Four: Methodological considerations in genome-wide assessment of heritability
Conclusion
References
Appendix
List of Tables
Table 1-1: The summary associations of validated breast cancer risk variants in diverse populations
Table 2-1: Association of known T2D and obesity risk alleles with breast cancer risk by race/ethnicity
Table 2-2: Association of known T2D and obesity risk alleles with breast cancer risk in ethnicity-pooled analysis
Table 3-1: Associations with common variants at known breast cancer risk regions in African Americans
Table 3-2: The association of the total risk score with breast cancer risk in African Americans
Table 4-1: Coverage of the 1000 genomes project SNPs on chromosome 21 by different numbers of SNPs
Table 4-2: Power and type I error of the omnibus approach and variance components approach with different number of markers
List of Figures
Figure 3-1: Risk allele frequencies in Europeans and African Americans
Figure 3-2: –Log P plots for common alleles at 8 breast cancer risk loci in African Americans
Figure 4-1: Proportion of phenotypic variation explained by different number of SNPs in an independent sample (k<0.025)
Figure 4-2: A comparison of phenotypic variation explained by different number of SNPs in independent samples and close relatives
Figure 4-3: Simulation results of heritability explained by SNPs only weakly correlated with causal SNPs
Abstract
Genome-wide association scans (GWAS) have identified numerous common variants
associated with hundreds of complex diseases. In this dissertation, I investigated the
properties of the GWAS-identified common SNPs in multiple populations, and estimated
their aggregate effects on complex diseases. In the first Chapter, I assessed the
generalizability of a risk score derived from 12 SNPs known to be associated with breast
cancer risk in European or Asian populations in the Multiethnic Cohort (MEC). I
performed a case-control study with 2,224 cases and 2,827 controls nested in the MEC
and found that when viewed as a summary risk score, the total number of risk alleles
carried by women was significantly associated with breast cancer risk overall (OR per
allele: 1.09; 95% CI: 1.06-1.12; $p = 2.0 \times 10^{-10}$) and in all populations except African
Americans, in which no significant association was observed (OR, 1.03; 95% CI, 0.98-
1.08). These results emphasized the need for large-scale association studies in multiple
racial/ethnic groups, especially in populations of African ancestry.
Since body mass index and type 2 diabetes (T2D) are established risk factors for (post-
menopausal) breast cancer, I tested for the pleiotropic effects of 31 common variants for
T2D and obesity in a case-control study of 1,915 breast cancer cases and 2,884 controls
nested within the MEC. However, following adjustment for multiple tests, I found no
significant association between any variant and breast cancer risk, as shown in Chapter
Two.
In Chapter Three I analyzed a large study of breast cancer in African American women
(3,016 cases and 2,745 controls), where I tested 19 known risk variants identified by
GWAS and replicated associations (P<0.05) with only 4 variants. Through fine-mapping,
markers that better capture the association with breast cancer risk in African
Americans were identified in 4 regions (2q35, 5q11, 10q26 and 19p13). Statistically
significant associations were also identified with markers in 4 separate regions (8q24,
10q22, 11q13 and 16q12) that are independent of the index signals and may represent
putative novel risk variants. This detailed analysis of the known breast cancer risk loci
has validated and improved upon markers of risk that better characterize their association
with breast cancer in women of African ancestry.
In the last Chapter, I genotyped and analyzed 966,578 autosomal SNPs across the entire
genome using a variance components approach in a large sample of African ancestry
(N=14,419), and estimated an additive heritability of 44.7% (se: 3.7%) for height
in a sample of apparently unrelated individuals. Using simulation, I concluded
that the additive heritability estimate is not necessarily the heritability proportion directly
explained by the genotyped SNPs. I then explored the performance of the variance
components approach in an unrelated sample and found that the approach fails when a
large number of independent variables are included, indicating that some relatedness
between subjects is required for the method to perform well using large number of SNPs.
In two samples of close relatives defined by probability of identical-by-descent (IBD)
alleles sharing (Pr (IBD=1)>=0.3, n=1,415 and Pr (IBD=1)>=0.4, n=575), the additive
heritability estimate increased to 76.5% (se: 11.7%) and 75.1% (se: 13.3%), respectively,
which is consistent with the view (Zuk et al PNAS 2011) that the additive component of
genetic variation for height may have been overestimated in earlier studies (80%) and the
proportion could also include variation due to epistatic effects.
This dissertation contributes to the polygenic analyses of complex traits in two aspects:
first, it emphasizes the necessity of using genetic markers that are specific to the
populations of interest for disease prediction in different populations. Second, analyses
performed in this dissertation add to the investigation of the “missing heritability”
problem. I concluded from this dissertation that the hypothesis that common variants
explain a large proportion of heritability remains unproven, and that studies of additional
genetic markers such as rare variants, and investigations of non-linear effects of genetic
markers including epistatic effects are needed in order to gain a better understanding of
the genetic characteristics of complex traits.
Introduction
1. The idea of linkage disequilibrium (LD) – based association testing.
The DNA sequence varies among individuals within a species, and this variation underlies the
proportion of phenotypic diversity that is due to genetics. Some of these variations can be
used as genetic markers to tag the functional signal of variations nearby, no matter
whether or not these markers themselves impose influences on the phenotype. Compared
to that of other species, the human genome has limited variability: it was reported that
two copies of the human genome differ in DNA sequence by at most about 0.1%, which is
one order of magnitude lower than the variation in Drosophila
populations (Li and Sadler, 1991). The low variability of the human DNA sequence is
believed to be due to the relatively young age of the human population and its small
effective population size in the past (Alaya et al., 1994; Li and Sadler, 1991). The most common
type of genome variation, the single nucleotide variant (SNV), is the difference among
chromosomes in the base at a particular site in the DNA sequence. For example, some
chromosomes may have a “C” at a particular site while others have an “A”. Usually an
SNV has two different alleles (cases of three-allele SNVs are extremely rare), and an SNV
with frequencies of both alleles >=1% is called a single nucleotide polymorphism (SNP)
(Wang et al., 1998). By 2008, the Human Genome Project (Lander et al., 2001), the SNP
Consortium (Sachidanandam et al., 2001), and the International HapMap Project (Frazer
et al., 2007) had together documented about 10 million common SNPs in the human
genome. With the recent improvement of sequencing technologies, the 1000 Genomes
Project, which is based on next-generation sequencing (NGS) technology, has greatly
added to the number of SNPs identified (The 1000 Genomes Project Consortium, 2010).
As of October, 2011, the online SNP database (dbSNP), which includes SNP submissions
from the 1000 Genomes Project, listed 52 million SNPs in the human genome, which
is about 1 SNP per 60 base pairs (bp)
(http://www.ncbi.nlm.nih.gov/mailman/pipermail/dbsnp-announce/2011q4/000108.html).
The mutation rate is very low (of the order of $10^{-8}$ per site per generation) relative to the
number of generations since the most recent common ancestor of any two human individuals
(of the order of $10^{4}$) (The International HapMap Consortium, 2003). As a result, the
variation at each site of the human genome is likely to be due to a single historical
mutation. In other words, mutations usually do not occur twice at the same nucleotide site,
and only on very rare occasions does a SNP have more than two alleles. For the same
reason, each new allele is associated with the surrounding alleles on the same
chromosome segment that happened to be present when the new allele arose. The set of
these associated alleles that are located on a single chromosome or a chromosome
segment is called a haplotype. New haplotypes are formed when mutation or
recombination events occur; otherwise, a haplotype is transmitted from one ancestor to
descendants over generations with an unaltered pattern. The association between different
loci on a haplotype is called linkage disequilibrium (LD). Alleles in LD with each other
tend to be inherited together, which greatly facilitates the association testing of the
genome, since knowing the pattern of some “tag” SNPs would allow us to infer the
association profile of others that are in LD with these tag SNPs. While searching the
entire genome requires sequencing more than 3.4 billion base pairs that contain 20,000 to
25,000 genes, the use of tag SNPs reduces this to a much smaller number,
depending on the targeted minor allele frequency (MAF) range, the genome-wide coverage,
and the population of interest (The International HapMap Consortium, 2003). GWAS were
designed on this rationale. Nowadays, genome-wide scans are performed on microarray
chips that genotype up to 5 million common SNPs, targeting variants with MAFs down to
0.01.
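As a rough illustration of how the LD between two biallelic SNPs can be quantified, the sketch below computes the LD coefficient D and the squared correlation r2 from phased haplotypes. The function name `ld_stats` and the toy haplotypes are hypothetical and are not taken from any study described here; a tag SNP with r2 close to 1 carries almost the same information as the SNP it tags.

```python
def ld_stats(haplotypes):
    """Return (D, r2) for two biallelic SNPs.

    haplotypes: list of (a, b) pairs, one per phased chromosome, with each
    allele coded 0 or 1 (toy input, assumed for illustration only).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at SNP B
    # frequency of the haplotype carrying allele 1 at both SNPs
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Perfect LD: allele 1 at SNP A always travels with allele 1 at SNP B.
perfect = [(1, 1)] * 4 + [(0, 0)] * 4
# No LD: all four haplotypes are equally frequent.
independent = [(1, 1), (1, 0), (0, 1), (0, 0)] * 2

d_perfect, r2_perfect = ld_stats(perfect)
d_indep, r2_indep = ld_stats(independent)
```

In the first toy set either SNP could serve as a tag for the other; in the second, genotyping one SNP says nothing about the other.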
Recombination is the strongest force that breaks linkage between genetic variants and
forms new haplotypes. The distribution of recombination rate is extremely heterogeneous
along chromosomes (McVean et al., 2004; Myers et al., 2005). Genetic variants are
transmitted together from parent to offspring over generations within regions known as
LD blocks, where there is no evidence of historical recombination. As was discussed above, LD
blocks are broken and separated by sites of recombination. With the accumulation of
recombination effects, the patterns of LD blocks change over time. Generally, ancient
populations which have existed for more generations have shorter and more scattered LD
blocks than relatively new populations, since the chance of recombination accumulates
over time. Examples can be found in the samples genotyped in the International
HapMap Project. The first two phases of HapMap genotyped individuals from four
populations (Caucasians in Utah (CEU), Han Chinese (CHB), Japanese (JPT) and
Yorubans in Nigeria (YRI)); in phase III, samples from 11 populations were included. In
general, populations of African ancestry, including the YRI population and the ASW
population in phase III, have shorter LD blocks than other populations (The International
HapMap Consortium, 2003). Due to the variation in LD structures across populations,
the efficiency with which the same genetic markers capture functional signals may vary
across populations.
Recent isolation is another factor that may alter the LD pattern of a population, since the
LD structure of an isolated population may evolve differently.
Although isolation alone does not necessarily increase the correlation between variants,
and therefore does not guarantee that tag SNPs capture signals more efficiently,
isolated populations of small and constant size have been shown by a few studies to
have higher LD than expanded populations (Bosch et al., 2009; Laan and
Paabo, 1997). The so-called bottleneck effect, where only a limited number of founders
are present in an isolated population, leads to a reduction in genetic variability, as
newly arising variation in the DNA sequence is rare. It may take thousands of
generations for the isolated population to regain its variability (Hanfling and Brandl,
1998; Nei et al., 1975), and until then the isolated population has a higher level of LD.
Larger isolated populations drift more slowly in genotype frequencies and LD patterns,
but may form subgroups with different LD structures. Recent mixing of populations also
introduces additional features to LD patterns.
As was mentioned before, differences in LD structure across populations limit the
generalizability of genetic markers, since the original GWAS are usually
performed in populations of European ancestry, although more GWAS in non-European
populations have been performed in recent years (Ruiz-Narvaez et al., 2010; Stacey et al., 2010;
Zheng et al., 2009a). Chapter One of the dissertation describes a replication study of
breast cancer GWAS in the Multiethnic Cohort (MEC), where a risk score comprised of
known GWAS hits was replicated in all populations except African Americans (Chen et
al., 2010).
The development of association studies has greatly facilitated the discovery of common
variants with mild to moderate effect sizes by comparing the risk allele frequencies of the
variants in unrelated (or only distantly related) subjects. Unrelated subjects are much easier
to collect than family data of comparable sample sizes, which is clearly an advantage of
association studies; however, association studies are susceptible to complications in
data structure such as hidden population structure. For diseases that vary in incidence among
populations, association studies may identify spurious associations with markers whose
allele frequencies differ between these populations, creating false positive hits.
For this reason, ethnicity is often considered to be a potential confounder in association
studies and is often controlled for by including a race/ethnicity variable in the statistical model.
Obviously this method does not apply when the population structure is
unknown or hidden. Methods used to control for hidden population structure include the
genomic control method (Devlin and Roeder, 1999; Devlin et al., 2001), methods that use
ancestry informative markers (AIMs) (Pritchard et al., 2000a; Pritchard et al.,
2000b), and principal components analysis (PCA) of random SNPs, if there are enough of
them (Price et al., 2006). The genomic control method corrects for over-
dispersion of the test statistic due to hidden population structure by dividing the test statistic
by an over-dispersion factor estimated from the data, which is often the ratio of the
mean (or median) of the $\chi^2$ statistics computed from the data to that of the central $\chi^2$
distribution (Devlin and Roeder, 1999; Devlin et al., 2001). In 2000, Pritchard et al
developed a method of using AIMs to estimate the ancestry of individuals and correct for
population stratification (Pritchard et al., 2000a; Pritchard et al., 2000b). Another method,
which is most commonly used today, is the principal components analysis (PCA) method
described by Price et al (2006). The authors proposed to perform a
principal components analysis on the genotypes of the sampled individuals to infer
continuous axes of genetic variation. The variation in the phenotype and in the
genotypes of interest that is attributable to ancestry can then be removed by adjusting
for these axes. This method was shown to be more effective than genomic control in correcting
for inflated type I error at highly differentiated SNPs (Price et al., 2006).
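The genomic control correction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the statistics are 1-degree-of-freedom chi-square tests, the median version of the inflation factor is used, and the toy statistics are made up. The function name `genomic_control` is hypothetical, and flooring the factor at 1 is a common convention assumed here rather than something taken from this text.

```python
from statistics import median

# Median of the central chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Return (inflation factor, adjusted statistics) for 1-df chi-square tests."""
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    # Assumed convention: never inflate the statistics when lambda < 1.
    lam = max(lam, 1.0)
    return lam, [x / lam for x in chi2_stats]


# Made-up statistics: a null-like set, then the same set inflated by 1.5
# to mimic the over-dispersion caused by hidden population structure.
null_like = [0.05, 0.2, 0.4549364, 1.1, 2.7]
inflated = [1.5 * x for x in null_like]

lam, adjusted = genomic_control(inflated)
```

Because every statistic is divided by the same factor, the ranking of SNPs is unchanged; only the calibration of the p-values is affected, which is one reason the method is less effective at highly differentiated SNPs.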
In addition to hidden population structure, admixture of ancestry between two or more
populations may also cause inflated type I error in association studies. Many modern
populations, including African Americans, Native Americans, Native Hawaiians and Latinos,
are recently admixed. The presence of admixture brings both advantages and
disadvantages to association scans: on the one hand, admixture scans facilitate the search
for new genetic risk variants in admixed groups, as fewer markers are required to extract
the ancestry information, and newly admixed populations may harbor novel genetic variants
that contribute to the risk (Price et al., 2007; Rosenberg et al., 2010; Seldin et al., 2011;
Smith et al., 2004; Tian et al., 2007; Tian et al., 2006). On the other hand, like hidden
structure, population structure in admixed populations could seriously confound
association testing and needs to be controlled (Aldrich et al., 2009; Zhang et al., 2008).
Ancestry estimation methods have been implemented in software packages such as
STRUCTURE, ADMIXMAP and ANCESTRYMAP (Falush et al., 2003; Pritchard et al.,
2000b). For association studies performed in admixed populations, population structure is
often controlled using principal components (PCs) or AIMs. In some studies, local
ancestry is considered a potential confounder and is controlled as well (Chen et al., 2011;
Haiman et al., 2011a; Haiman et al., 2011b).
2. GWAS and their findings.
In the 1990s, population geneticists proposed a hypothesis that common variants
contribute to a significant portion of the risk for common diseases, known as the common
disease, common variants (CD-CV) hypothesis (Lander, 1996). At that time linkage
studies had been successful in identifying the genetic basis for some Mendelian diseases,
such as Alzheimer’s disease, Huntington’s disease, Cystic Fibrosis, etc, most of which
are believed to be caused by a single causal gene with high penetrance. Diseases that do
not follow a Mendelian pattern of inheritance, known as complex diseases, are believed
to be associated with multiple genes, each with moderate effects. Risch and Merikangas
(1996) showed that association studies are more powerful than linkage studies in
identifying genetic variants that have moderate (<1.5 in genetic relative risk, GRR)
effects. Although small effect sizes limit the
power that can be achieved in an association study, if the causal alleles are common, as
is assumed in the CD-CV hypothesis, the sample size needed to identify these common
alleles is achievable even after adjusting for multiple comparisons. For example, the
number of unrelated subjects needed in an association study to detect risk alleles with
GRR=1.5 and MAF=0.10 is about 2000 allowing for 1,000,000 independent tests, while
linkage studies need more than 67,000 families to detect the same effect (Risch and
Merikangas, 1996).
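To make the multiple-comparisons adjustment concrete, the sketch below computes a Bonferroni-style per-test significance threshold for 1,000,000 independent tests. The family-wise error rate of 0.05 is an assumption made for illustration, and `bonferroni_threshold` is a hypothetical helper name.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumed family-wise error rate of 0.05 spread over one million independent tests.
threshold = bonferroni_threshold(0.05, 1_000_000)
```

A threshold of this size is what makes sample size, rather than the number of markers, the limiting factor for detecting common alleles with modest effects.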
In 2005, the first GWAS was carried out by Klein et al, who reported an association between
age-related macular degeneration (ARMD) and a polymorphism in the complement factor
H (CFH) gene (Klein et al., 2005). Since then many GWAS have been performed,
finding numerous risk-associated loci for various diseases and traits. Most GWAS
target common variants with MAF greater than 0.05 and usually test 500,000 to
1,000,000 SNPs across the whole genome. Some GWAS also incorporate SNPs
with MAF down to 0.01 (Cichon et al., 2009). As of September, 2012, GWAS have
reported more than 7,000 SNPs that are associated with over 200 complex diseases and
traits (http://www.genome.gov/gwastudies/), including cardiovascular disease, breast
cancer, prostate cancer, lung cancer, diabetes, obesity, lipids, nicotine, etc. As is true for
many other epidemiological investigations, most GWAS are performed in
populations of European ancestry, and results can be readily combined among different
research groups. Meanwhile, since the coverage of genetic markers is portable among
samples from the same origin, it becomes possible to select a standardized panel of SNPs
to be genotyped. A number of replication studies were then performed in different
populations to evaluate the generalizability of GWAS-identified SNPs across ethnic groups.
These studies have reported different degrees of replication in different populations, and
the variants are often least replicable in populations of African origin. It is also found
that the degree of generalizability across populations varies among diseases. For
example, while many studies have reported a lack of replication of GWAS-identified
breast cancer risk variants, especially in subjects of African origin, risk alleles associated
with prostate cancer appear to be more replicable (Chen et al., 2011; Chen et al., 2010;
Haiman et al., 2011a; Ruiz-Narvaez et al., 2010; Stacey et al., 2010; Waters et al., 2009;
Zheng et al., 2009a).
There are a few reasons why different patterns of association are observed in different
populations. It is certainly possible that the functional variants have different effects
in different populations; however, the underlying biological mechanisms for this
hypothesis have not been thoroughly investigated. Studies have focused more on the
following aspects. First, both SNP frequency and disease prevalence vary across
populations. As a result, the power for detecting the effect of a particular SNP varies.
Moreover, differences in the frequencies of both the tagging SNPs that are genotyped and
tested and the functional alleles modify the tagging efficiency. Second, LD patterns
differ among ethnic groups. Therefore, the risk-associated tagging SNP in one
population may or may not be correlated with the functional allele in another population.
In populations of African origin, where LD blocks are shorter than those in other
populations, it is less likely that the risk-associated tagging SNPs can capture the signal
as efficiently as in other populations.
Given that the inheritance patterns of most complex diseases and traits in different
populations are not exactly the same, researchers began to consider the possibility of
conducting GWAS in diverse populations. GWAS in non-European populations face a
few challenges: 1) properly designed studies with biological samples readily available are
fewer than in European populations; 2) the SNP panel used in European populations may
not have similar coverage in other populations; 3) admixture needs to be controlled for in
the analysis of many populations, especially those in the Americas. Recently a few GWAS have
been conducted in non-European populations, including East Asian populations, natives
of the Pacific island of Kosrae, and African Americans (Barbalic et al., 2011; Lowe et al.,
2009; Ng et al., 2011; Yoon et al., 2011; Zheng et al., 2009b).
The growth of GWAS has led to the identification of numerous risk-associated
common variants that contribute to the genetic basis of various diseases, which further
confirms the polygenic inheritance of complex diseases. Some of the complex
diseases and traits studied by GWAS are highly heritable, including various types of cancer,
diabetes and height. The CD-CV hypothesis assumes that a large proportion of
common disease risk is attributable to common variants, which implies that the common
variants identified in the GWAS performed so far should explain a large
proportion of heritability. As the cost of GWAS falls, the availability of samples
increases, and collaboration between research groups becomes more common, large-scale
GWAS with higher power, capable of detecting variants with smaller effect
sizes and lower allele frequencies, become possible. Researchers have begun to seek an
answer to the question of how much of the heritability can be explained by GWAS-
identified common variants. This is one of the focuses of our study and will be discussed
in detail in Chapter Four.
3. Polygenic analyses.
The genetic basis of complex traits that do not follow a Mendelian pattern of
inheritance is believed to be multiple genes with low penetrance functioning together.
Results from GWAS also indicate a polygenic pattern of genetic causes. As was
mentioned before, GWAS have discovered more than 7,000 risk-associated common
variants. What are the aggregate effects of these variants? In other words, how much of
the heritability can be explained by the common variants discovered so far? Common
variants are expected to explain a large proportion of heritability according to the CD-CV
hypothesis, which is considered to be the primary assumption underlying GWAS.
Usually, we can divide the observed phenotypic variation ($V_P$) into components that are
attributable to genotypic and non-genotypic (or environmental) risk factors (Falconer and
Mackay, 1996), i.e.
$$V_P = V_G + V_E$$
The genotypic component of variation can be further divided into additive effects (A) and
dominance deviation (D) for simple variants, and the interaction component is called
interaction deviation or epistatic deviation (I) when multiple genetic loci are under
consideration, i.e.
$$V_P = V_A + V_D + V_I + V_E$$
The quantity $V_A/V_P$, or narrow-sense heritability, describes the fraction of phenotypic
variation that is determined by the additive effect of genes transmitted from parents. The
quantity $V_G/V_P$, or heritability in the broad sense, describes the total genetic component
of the phenotypic variation, which could be due to dominant and epistatic effects of genes,
in addition to the additive effect (Falconer and Mackay, 1996). For the reasons mentioned
above, we are particularly interested in the narrow-sense heritability. We refer to the
narrow-sense heritability as “heritability” throughout this dissertation and denote it with
$h^2$.
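As a small worked example of the variance decomposition above, the sketch below computes the narrow-sense and broad-sense heritability from a set of variance components. The component values are invented purely for illustration, and the function name `heritability` is hypothetical.

```python
def heritability(v_a, v_d, v_i, v_e):
    """Return narrow- and broad-sense heritability from variance components.

    v_a: additive, v_d: dominance, v_i: epistatic (interaction),
    v_e: environmental variance.
    """
    v_g = v_a + v_d + v_i   # total genotypic variance
    v_p = v_g + v_e         # total phenotypic variance
    return {"narrow": v_a / v_p, "broad": v_g / v_p}


# Invented components: 40 additive, 10 dominance, 10 epistatic, 40 environmental.
h = heritability(v_a=40.0, v_d=10.0, v_i=10.0, v_e=40.0)
```

With these invented values the narrow-sense heritability is 0.4 and the broad-sense heritability is 0.6; the gap between the two is the share of phenotypic variance due to dominance and epistasis.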
While it is established that epistasis exists between genes and markers and modifies
gene function, in polygenic studies these effects are often ignored. Although evidence of
epistatic effects has been found in studies of deleterious genes and the development of
disease (Azevedo et al., 2006; Chang and Noor, 2010; Cicila et al., 2009; Falconer and
Mackay, 1996; Long et al., 1995; Orr and Irving, 2005; Shrimpton and Robertson, 1988a,
b; Spickett and Thoday, 1966), in polygenic analyses involving smaller effects,
epistasis or dominance effects are often not detected (Greenberg and Crow, 1960; Laurie
et al., 2004). Crow (2010) concluded in a review that epistasis among genetic
variants with small effects is likely to contribute only a small proportion of the genetic
variance, and that it is the additive action of multiple alleles that mostly
determines the heritability of complex diseases. Therefore, in analyses of complex diseases
involving many genes with small effects, epistatic effects are often ignored.
However, Zuk et al (2012) indicated in their recent paper that epistatic interaction might
in fact play an important role in the genetics of complex diseases (Zuk et al., 2012). They
claim that narrow-sense heritability has been overestimated in earlier studies, specifically
among close relatives. Since complex interactions are inherently much harder to detect
and examine, this would impose a constraint on how much heritability could be identified
using GWAS. This matter is discussed in detail using GWAS data of African ancestry in
Chapter Four.
There are several ways of testing for polygenic effects, and we will describe three of the
most commonly used methods below.
1) Omnibus test.
An intuitive method to test for polygenic effects is the omnibus (global) F-test of multiple
variants, which is widely applied in multivariate analysis.
The omnibus (global) analysis of the additive effects of all alleles at once can be used, but
its power and accuracy decay as the number of markers approaches the sample size, even if
the non-centrality parameter increases with the number of variants used. Current
polygenic studies often involve the evaluation of several hundred thousand to millions of
SNPs simultaneously, which exceeds the sample sizes of most of these studies (thousands
to $10^4$). Genome-wide polygenic analysis of markers is therefore hard to
accomplish using traditional multiple regression.
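The loss of power can be seen from the form of the test statistic. As a sketch, assume a linear model for a quantitative phenotype in $n$ subjects with $m$ SNPs entered as additive covariates, compared against an intercept-only null model; the symbols $RSS_0$ and $RSS_1$ (residual sums of squares under the null and full models) are notation introduced here for illustration.

```latex
F \;=\; \frac{(RSS_0 - RSS_1)/m}{RSS_1/(n - m - 1)}
\;\sim\; F_{m,\; n-m-1} \quad \text{under } H_0 .
```

The denominator degrees of freedom, $n - m - 1$, shrink toward zero as $m$ approaches $n$, and the statistic is undefined once $m \ge n - 1$, which is the situation in genome-wide data where the number of SNPs far exceeds the number of subjects.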
Below is an example of polygenic analysis using the omnibus approach. By writing out
the joint distribution of multiple common variants and the distribution of relative risk and
using previously reported effect size of known breast cancer risk variants, Gail (2008)
calculated the discrimination power of seven GWAS-reported breast cancer SNPs. The
author also evaluated a hypothetical 14-SNP model that assumed the presence of an
additional seven risk-associated SNPs with the same properties as the known seven SNPs.
Although some improvement was observed with the 14-SNP model, neither model was
satisfactory in terms of breast cancer prediction. The author showed by calculation that it
could require as many as hundreds of common SNPs with modest effects to provide a
prediction power that can be used in clinical practice (Gail, 2008).
2) Risk score approach
The risk score approach is one of the alternative approaches that have been developed for
situations where the omnibus test has very low or no power. A score statistic is defined
as the total number of risk alleles carried by an individual, weighted or unweighted by
their effect sizes. The risk score approach assumes that all the risk alleles in the
model are independent, while the omnibus test accounts for the correlation structure among
the tested variants. In 2010, Lango Allen et al used the risk score approach to estimate the
proportion of heritability of height explained by common variants. They performed a
two-stage meta-analysis of 46 and 15 GWAS and selected 180 SNPs that reached
genome-wide significance in the meta-analysis of the two stages. They then constructed a
risk score from these SNPs, weighted by their effect sizes, and tested the score in a validation set.
The phenotypic variance attributable to the SNPs in the score was estimated by the $R^2$ value
of the linear model. These 180 SNPs were estimated to explain 10% of the
phenotypic variance of human height (Lango Allen et al., 2010). Using the method
described by Park et al (2010), the authors estimated the number of undiscovered
common variants that have effect sizes similar to those of the known risk-associated loci (Park et
al., 2010). Together, the known common variants and those not yet discovered were estimated to explain
16% of the phenotypic variance, or 20% of the heritability (Lango Allen et al., 2010).
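The score statistic described above can be illustrated with a minimal sketch. The SNP identifiers, genotypes and effect sizes below are hypothetical and are not taken from any of the cited studies; the function only shows how an unweighted and a weighted score differ.

```python
# Illustrative only: SNP names, genotypes and weights are made up.

def risk_score(genotypes, weights=None):
    """Return the risk score for one individual.

    genotypes: dict mapping SNP id -> number of risk alleles carried (0, 1 or 2).
    weights:   optional dict mapping SNP id -> per-allele effect size
               (for example a log odds ratio or a regression coefficient).
               If omitted, the score is the plain count of risk alleles.
    """
    if weights is None:
        return float(sum(genotypes.values()))
    return sum(weights[snp] * count for snp, count in genotypes.items())


person = {"snp_a": 2, "snp_b": 0, "snp_c": 1}          # hypothetical genotypes
betas = {"snp_a": 0.10, "snp_b": 0.25, "snp_c": 0.05}  # hypothetical effect sizes

unweighted = risk_score(person)         # 2 + 0 + 1 risk alleles
weighted = risk_score(person, betas)    # 2*0.10 + 0*0.25 + 1*0.05
```

Both versions sum over SNPs as if their contributions were independent, which is the assumption noted above for the risk score approach.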
3) Variance components approach
The variance of a phenotypic value can be divided into genetic and non-genetic
components, as was described above. One way to estimate the genetic component of
variance is to estimate how much of the phenotypic similarity among individuals is
attributable to their genetic similarity, the so-called kernel machine approach.
The variance components approach described by Yang et al. (2010) in their estimation of the
genetic component of variance of human height is one example of the kernel machine
approach (Yang et al., 2010). The genetic component of variance was estimated by the
regression coefficient of the variance-covariance matrix of the phenotype vector on a
genetic similarity matrix that was estimated from ~300,000 common variants in 3,925
unrelated Australian adolescents (Yang et al., 2010). The authors reported that about 45%
of phenotypic variance can be explained by using all the SNPs simultaneously. In a
second paper published in 2011, the authors further compared the effects of
intergenic and genic SNPs, and concluded that genic SNPs explain a
larger proportion of heritability than intergenic SNPs (32.8% vs. 12.6% of heritability
explained by genic vs. intergenic SNPs, where the genome-wide coverage was 49.4% vs.
50.6%). By performing the regression on multiple genetic similarity matrices, each estimated
from the SNPs on one chromosome, they reported that the proportion of heritability
explained is proportional to chromosome length, which indicates that the more common
SNPs are included, the larger the proportion of heritability that can be explained by the model (Yang
et al., 2011b).
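The idea of regressing phenotypic similarity on genetic similarity can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Yang et al., who fitted their model by restricted maximum likelihood; the sketch swaps in the simpler moment-based Haseman-Elston regression, in which the cross-products of standardized phenotypes for each pair of individuals are regressed on the off-diagonal entries of the genetic similarity matrix. The function names, genotypes and heights are hypothetical, and four individuals are far too few for a meaningful estimate.

```python
def genetic_similarity(genotypes):
    """Build a genetic similarity matrix from allele counts.

    genotypes: list of per-person lists of allele counts (0, 1 or 2),
               one entry per SNP, with no missing values.
    Each SNP is centred at 2p and scaled by sqrt(2p(1-p)), where p is the
    sample allele frequency. Monomorphic SNPs are skipped; at least one
    polymorphic SNP is assumed.
    """
    n = len(genotypes)
    columns = []
    for k in range(len(genotypes[0])):
        p = sum(person[k] for person in genotypes) / (2.0 * n)
        if 0.0 < p < 1.0:
            sd = (2.0 * p * (1.0 - p)) ** 0.5
            columns.append([(person[k] - 2.0 * p) / sd for person in genotypes])
    m = len(columns)
    return [[sum(col[i] * col[j] for col in columns) / m for j in range(n)]
            for i in range(n)]


def haseman_elston_slope(phenotype, similarity):
    """Slope from regressing z_i * z_j on A_ij over all pairs i < j,
    where z is the standardized phenotype and A the similarity matrix."""
    n = len(phenotype)
    mean = sum(phenotype) / n
    sd = (sum((y - mean) ** 2 for y in phenotype) / (n - 1)) ** 0.5
    z = [(y - mean) / sd for y in phenotype]
    x = [similarity[i][j] for i in range(n) for j in range(i + 1, n)]
    y = [z[i] * z[j] for i in range(n) for j in range(i + 1, n)]
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    return s_xy / s_xx


# Hypothetical toy data: 4 individuals, 3 SNPs, heights in cm.
toy_genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1]]
toy_heights = [170.0, 165.0, 180.0, 175.0]

grm = genetic_similarity(toy_genotypes)
slope = haseman_elston_slope(toy_heights, grm)
```

In a real analysis the slope is interpreted as the proportion of phenotypic variance captured by the SNPs used to build the similarity matrix; with the toy data it only demonstrates the mechanics.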
However, we think there are a few caveats in Yang’s paper. First, we are concerned about
the authors’ interpretation of the variance coefficient estimate. Is it an estimator of the
proportion of variance explained by the particular set of SNPs considered in the study, or
by distant genetic relatedness of the subjects that is estimated by the genetic similarity
matrix? It is possible that the proportion of heritability is actually explained by an
underlying genetic similarity that is captured by the set of ~300,000 SNPs that they
studied. Second, how does the variance component model perform when more and more
SNPs are added to the model? It is known that when the number of independent variables
is large, the omnibus test becomes unstable since there are not enough degrees of freedom
left to estimate the error. Does the variance component approach have the same property,
or does it still provide a stable estimate of the genetic variance component?
In order to answer these questions, we performed a study in African Americans to
estimate how much of the phenotypic variation in height can be explained by common
variants. In this dissertation, I will first review some polygenic analyses that I previously
conducted on breast cancer, and then address the issues emerging from current studies on
height heritability. I will then describe a study on height heritability estimation, and
related methodological considerations. I will close the dissertation with a discussion of
current opinions on future directions in uncovering additional genetic variations that
underlie phenotypic diversity, and how my work fits into the big picture.
Chapter One: Replication study of breast cancer GWAS in the Multiethnic Cohort
(MEC)
A brief introduction:
Breast cancer is a non-Mendelian, complex disease, the genetic etiology of which is
believed to involve multiple genetic factors functioning together. So far, GWAS of breast
cancer have identified 19 loci that are related to breast cancer risk (12 when this paper
was published), among which 18 were discovered in populations of European origin and
1 in a Chinese population.
We have discussed previously that the generalizability of GWAS-reported risk variants to
other populations is of great interest, since subsequent studies largely rely on whether or
not these markers can be used to capture the risk alleles in non-GWAS populations. In
this study we genotyped and tested 12 GWAS-reported breast cancer risk loci in the MEC.
The MEC is a prospective cohort that was initiated between 1993 and 1996 and consists
mainly of five populations: European Americans, African Americans, Native Hawaiians,
Japanese and Latinos. Through December 31, 2005, the breast cancer case-control study
nested in the cohort had assembled 2,224 breast cancer cases and 2,827 controls.
We created a breast cancer risk score by summing over the number of breast cancer risk
alleles carried by each individual, and tested the score in each population separately as
well as in the pooled sample for association with breast cancer risk. The risk score was
highly significantly associated with disease risk in the pooled analysis (P = 2.0×10⁻¹⁰),
and in all the populations (P < 4.0×10⁻³) except in African Americans (P = 0.23). Since
populations of African origin generally have shorter LD blocks, we do expect a lower
tagging efficiency in these populations, which might be one explanation of the lack of
association in African Americans. We concluded in this study that caution is needed in
generalizing the GWAS-identified risk-associated polygene panel to other populations,
especially to populations of African origin.
This work was published in European Journal of Human Genetics in Feb., 2011 (PMID:
21102626).
Caution in generalizing known genetic risk markers for breast cancer across all
ethnic/racial populations.
Fang Chen¹, Daniel O. Stram¹, Loïc Le Marchand², Kristine R. Monroe¹, Laurence N.
Kolonel², Brian E. Henderson¹, Christopher A. Haiman¹
¹Department of Preventive Medicine, Keck School of Medicine, University of Southern
California/Norris Comprehensive Cancer Center, Los Angeles, California, 90089
²Epidemiology Program, Cancer Research Center of Hawaii, University of Hawaii,
Honolulu, Hawaii, 96813
Abstract
Genome-wide association studies (GWAS) have identified common variants associated
with breast cancer risk among women of European and Asian ancestries. To assess the
generalizability across ethnic/racial populations of a risk score derived from genotyping
12 highly replicated breast cancer GWAS hits, we performed a case-control study (2,224
cases and 2,827 controls) nested in the Multiethnic Cohort (MEC) study, which was
initiated in 1993-1996 and consists of subjects mainly from European American, African
American, Native Hawaiian, Japanese and Latino populations. When viewed as a
summary risk score, the total number of risk alleles carried by women was significantly
associated with breast cancer risk overall (OR per allele: 1.09; 95% CI: 1.06-1.12;
p=2.0×10⁻¹⁰) and in all populations except African Americans, in which no significant
association was observed (OR, 1.03; 95% CI, 0.98-1.08). In aggregate, the number of risk
alleles is strongly associated with breast cancer risk in all populations studied except
African-Americans. These results emphasize the need for large-scale association studies
of multiple racial/ethnic groups for discovery and characterization of risk alleles relevant
to all populations in the U.S.
Introduction
Genome-wide association studies (GWAS) of breast cancer have substantiated a role for
common low-risk alleles in disease susceptibility (Ahmed et al., 2009; Easton et al., 2007;
Stacey et al., 2007; Stacey et al., 2008; Thomas et al., 2009; Zheng et al., 2009b). To date,
discovery and replication efforts have been limited primarily to populations of European
ancestry, which traditionally have been the focus of most genetic studies of cancer.
Prediction of individual risk on the basis of multiple risk markers is one potential utility
of this new genetic information, although it has been persuasively argued (Gail, 2008;
Kraft et al., 2009; Pharoah et al., 2002; Wacholder et al., 2010) that many more common
low penetrance genetic markers would need to be identified before they would have,
individually or as a group, any public health utility. Nevertheless, private companies have
already begun to market genetic tests for the currently known risk markers (such as
deCODE BreastCancer™) for women of European ancestry before the causal alleles
underlying the marker associations have been identified. Whether these tests have any
relation to risk at all in non-White populations, which make up a third of the U.S.
population, is not known.
Associations with cancer risk alleles may not be consistent across populations for a
number of reasons (Ioannidis, 2009), including differences by race/ethnicity of the
linkage disequilibrium patterns relating a risk marker to the causal variant(s), and/or
context dependency of the association resulting from genetic and environmental
modifiers that vary in frequency across populations. Studies in multiple populations
(Waters et al., 2009; Xu et al., 2009; Zheng et al., 2009a) are needed to examine the
generalizability of these markers before their potential for public health utility can be
applied to populations of non-European ancestry. Here we report on the association of 12
validated risk alleles identified in breast cancer GWAS conducted primarily in
populations of European ancestry, among European American, African American, Native
Hawaiian, Japanese American, and Latino breast cancer cases and controls from the
Multiethnic Cohort (MEC) study (Kolonel et al., 2000a).
Materials and Methods
Study population: the Multiethnic Cohort
The MEC study is a prospective cohort study initiated between 1993 and 1996. The study
consists of 215,251 adult men and women living in Hawaii and California (mainly Los
Angeles County), drawn primarily from the following populations: European American, African
American, Native Hawaiian, Japanese American and Latino. Drivers’ license files were
used as a primary source to identify the study subjects. Participants entered the cohort
study by completing and returning a self-administered questionnaire that asked
for information on general demographic characteristics as well as known breast cancer
risk factors. Cases were identified through cohort linkage to population-based cancer
Surveillance, Epidemiology and End Results (SEER) registries in California and Hawaii
(Kolonel et al., 2000a). Through December 31, 2005, the breast cancer case-control study
nested in the MEC assembled for genetic studies included 2,224 cases and 2,827 controls
frequency-matched on race/ethnicity and age. For this study, we included additional
African American controls to allow for more precise risk estimation. The median ages of
cases and controls were 66 and 65 years, respectively, and ranged from 44 to 87. This
study was approved by the Institutional Review Boards at the University of Southern
California and at the University of Hawaii.
Laboratory Assays
We genotyped 12 SNPs from GWAS of breast cancer (Ahmed et al., 2009; Easton et al.,
2007; Stacey et al., 2007; Stacey et al., 2008; Thomas et al., 2009; Zheng et al., 2009b).
We also tested one additional variant in FGFR2 (African Americans only) that was revealed
by a fine-mapping effort in which the MEC participated (Udler et al.,
2009). Genotyping was performed using the TaqMan allelic discrimination assay (Lee et
al., 1993). We substituted rs10483813 for rs999737 (14q24) as genotyping of the latter
failed; the two SNPs are perfectly correlated in European Americans (r²=1). The overall
genotyping call rate ranged from 94.6% to 100.0% (average, 97.9%) for the 13 variants.
For blinded duplicates, the mismatch rate was < 2% for all 13 SNPs (average < 1%).
Hardy-Weinberg Equilibrium (HWE) testing was conducted for each variant in each
population using a 1-df chi-square test, and all 13 variants were consistent with HWE
using a criterion of p>0.01 in controls (Supplemental Table 1-1).
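For readers unfamiliar with the 1-df chi-square test of Hardy-Weinberg equilibrium mentioned above, a minimal sketch follows. It assumes a biallelic variant with both alleles observed, uses hypothetical genotype counts rather than data from this study, and is not the code used for the analysis, which was carried out in SAS.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """1-df chi-square test of Hardy-Weinberg equilibrium.

    Arguments are the observed counts of the three genotypes.
    Returns (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2.0 * n)   # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical control genotype counts for one variant in one population.
chi2, p_value = hwe_chi_square(30, 40, 30)   # chi2 = 4.0, p about 0.046
consistent_with_hwe = p_value > 0.01         # criterion used in the text
```

With these hypothetical counts the variant would be considered consistent with HWE under the p>0.01 criterion, even though it would be flagged at a nominal 0.05 level.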
Statistical Analysis
In each of the five racial/ethnic groups constituting the MEC, we examined the
distribution and breast cancer risk associated with an unweighted summary score, taken
as the number of risk alleles for 12 variants (1p11, rs11249433; 2q35, rs13387042; 3p24,
rs4973768; 5p12, rs10941679; 5q11, rs889312; 6q25, rs2046210; 8q24, rs13281615;
10q26, rs2981582; 11p15, rs3817198; 14q24, rs10483813; 16q12, rs3803662; 17q23,
rs6504950) (Supplemental Table 1-2) in order to determine their combined contribution
to breast cancer risk. Analysis was conducted on 2,171 cases and 2,795 controls after
excluding individuals missing genotypes for four or more SNPs (53 (2.4%)
cases and 32 (1.1%) controls). For the remaining individuals, a missing genotype was assigned
the mean score for that locus within the individual’s population. Odds ratios were estimated for this risk score,
which ranged from 4 to 18 risk alleles per individual, with a median of 11, over all 5
racial/ethnic groups. Odds ratios were adjusted for age (quartiles) and race. Since
ancestry may differ by case-control status, SNPs may be associated with risk simply
because they vary in frequency across racial/ethnic groups. Although we adjust for self-
reported ethnicity, several of the populations we consider here are known to be admixed
between two or more ancestral groups. We used principal components analysis (Price et
al., 2006) to control for hidden population stratification (including admixture) that could
otherwise cause confounding of unreported ethnicity (or ethnic mixture) with SNP effects.
Specifically we computed the first ten eigenvectors for principal components analysis
using a panel of >1,300 SNPs from previous studies not linked to the 12 markers of
interest here (Haiman et al., 2008). These were included as adjustment variables in all
models. All statistical analyses were performed using SAS 9.1 (SAS Institute Inc.,
Cary, North Carolina).
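The construction of the unweighted summary score, including the exclusion and imputation rules described above, can be sketched as follows. The analysis itself was performed in SAS; this Python sketch uses hypothetical genotypes for a single population, and it assumes that locus means are computed among the individuals retained after exclusion, a detail the text does not specify.

```python
def summary_scores(genotypes, max_missing=3):
    """Unweighted risk-allele scores for one population.

    genotypes:   list of per-person lists; each entry is the number of risk
                 alleles (0, 1 or 2) at one SNP, or None if the genotype is missing.
    max_missing: individuals with more missing genotypes than this are
                 excluded (3 corresponds to excluding those missing 4 or more).
    A missing genotype is replaced by the mean risk-allele count at that
    locus among the retained individuals (an assumption of this sketch).
    Every locus is assumed to be observed in at least one retained person.
    """
    kept = [g for g in genotypes
            if sum(x is None for x in g) <= max_missing]
    n_snps = len(genotypes[0])
    locus_means = []
    for k in range(n_snps):
        observed = [g[k] for g in kept if g[k] is not None]
        locus_means.append(sum(observed) / len(observed))
    return [sum(locus_means[k] if g[k] is None else g[k] for k in range(n_snps))
            for g in kept]


# Hypothetical population with 5 SNPs (the study used 12).
population = [
    [2, 1, 0, 1, 2],
    [1, None, 1, 2, 0],           # one missing genotype: imputed
    [None, None, None, None, 1],  # four missing genotypes: excluded
    [0, 1, 2, 1, 1],
]
scores = summary_scores(population)
```

In the study this score was then entered into a logistic regression adjusted for age, race and the first ten eigenvectors; that modeling step is not reproduced here.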
Results
Using the unweighted summary score, we observed a highly significant association with
breast cancer risk in an ethnic-pooled analysis (per allele: OR, 1.09; 95% CI, 1.06-1.12;
p=2.0×10⁻¹⁰) (Table 1-1), with women in the upper quintile having a 1.6-fold greater risk
than women in the bottom quintile (>12 alleles vs <9 alleles; 95% CI, 1.32-1.97;
p=3.0×10⁻⁶) (Supplemental Table 2-3) and women in the highest decile having a 2.0-fold
greater risk of breast cancer, compared to those in the bottom decile (>13 alleles vs <8
alleles; 95% CI, 1.52-2.73; p=2.3×10⁻⁶).
However, significant racial/ethnic heterogeneity was noted (p=0.030, 4 df test).
Specifically, the summary score variable was found to be positively and significantly
associated with breast cancer risk (p ≤3.9×10⁻³) with effects per allele of ~1.10 in all
populations, except in African Americans (per allele: OR=1.03, 95% CI=0.98-1.08,
p=0.23; Table 1-1). The apparent lack of an association of breast cancer risk either
with this aggregate allele count variable or with many of the individual SNPs
(Supplemental Table 2-4) in African Americans suggests that few of these variants are
likely to be markers of risk in the African American population.
Given that several of the validated risk alleles have been more strongly associated with
ER positive disease (Garcia-Closas et al., 2008; Stacey et al., 2007; Stacey et al., 2008;
Thomas et al., 2009), we tested for heterogeneity of the risk score by ER status in each
population. As expected, the score was more informative for ER positive disease than
ER negative disease (Supplemental Table 2-5). We observed the same pattern of
association in ER positive disease as in the overall pooled analysis, with the summary
risk score being significantly associated with disease risk in most populations, and no
significant association observed in African Americans (Supplemental Table 2-5).
We tested for dominant and recessive effects of single SNPs and observed no significant
evidence of a better model fit when the genotypes for each SNP were modeled in
combination with the summary risk score (Supplemental Table 2-6). Odds ratios
estimated over the range of observed allele counts were also inspected and were found to
be very consistent with the assumption of linear allelic effect (Supplemental Table 2-3).
We also tested for pairwise gene by gene interactions and observed 7 nominally
significant interactions; however none remained statistically significant after Bonferroni
correction for multiple comparisons. We also constructed a second summary score,
weighting each SNP by its published log OR (Supplemental Table 1-2). This score
measure was highly correlated with the unweighted score (r=0.88), and was not superior
to the unweighted score when included in the same model.
Discussion
An ultimate public health goal of mapping risk alleles is to predict individual risk so that
we can identify those at greater risk, among whom targeted intervention and preventive
measures may be applied. An understanding of the polygenic component to breast cancer
risk would undoubtedly add significantly to risk prediction (Kraft et al., 2009; Pharoah et
al., 2002) and to the efficacy of population-based programs for prevention and early
detection. Nevertheless, as noted by Gail (2008), a much larger number of modestly
penetrant risk variants may be needed to make a significant impact on the problem of
breast cancer risk prediction. Replicating aggregate allele counts, or other summary
variables, thus, is much more important to this problem than the replication of any one
specific risk marker. Our sample sizes were too small to fully replicate, in any single
racial/ethnic group, the modest risks that each of the 12 validated markers has shown in
whites (Supplemental Tables 1-2 and 1-4); however, we did have very good power to
detect the effect of the aggregate variable in any specific racial/ethnic group assuming
homogeneity of effect by ethnic group.
Possible explanations for the lack of association with the aggregate variable in African
Americans are that the majority of true risk alleles underlying the marker associations are
rare in African Americans and/or that linkage disequilibrium does not extend as far in persons
of African ancestry. Both possibilities emphasize the need to conduct full-scale
high-density association studies to identify racial/ethnic-specific risk markers or to further
refine the association signals in the regions containing these risk alleles in racial/ethnic
groups. For example, fine-mapping of FGFR2 has revealed a stronger marker of risk in
African Americans (rs2981578) (Udler et al., 2009). This marker made some
improvement to risk prediction with the aggregate score in this population (per allele:
OR=1.05, 95% CI=1.00-1.10, p=0.054), which emphasizes the value to be gained from
comprehensively surveying genetic variation across all risk loci in all populations.
Another possible source of the difference in association among ethnic groups could be
environmental exposures that vary in frequency across populations, and which may
modify the effect of these variants. However, recent studies provide little support for
known breast cancer risk factors serving as modifiers of the associations with these
alleles (Travis et al., 2010).
In summary, in this multiethnic study, we evaluated the generalizability of breast cancer
risk markers identified by GWAS to other populations. We observed strong evidence that,
in aggregate, the 12 published risk variants are strongly associated with breast cancer risk
in the majority of, but not all, populations considered. However, even for populations of
non-African origin, it is clear that many more variants will be needed for this risk score to
be informative in predicting breast cancer risk. Larger studies that include even more
diverse populations, aimed at discovery, validation and fine-mapping are needed to
identify an accurate and more complete set of risk alleles which could better determine
the contribution of these genetic regions to breast cancer risk in various populations,
especially for women of African ancestry.
Table 1-1. The summary associations of validated breast cancer risk variants in
diverse populations.

| | European Americans | African Americans | Native Hawaiians | Japanese | Latinos | Pooled | p_intᵇ |
|---|---|---|---|---|---|---|---|
| Cases/controlsᵃ | 534/541 | 539/1022 | 146/287 | 561/550 | 391/395 | 2171/2795 | |
| Per allele ORᶜ (95% CI) | 1.11 (1.05-1.18) | 1.03 (0.98-1.08) | 1.21 (1.09-1.35) | 1.10 (1.03-1.17) | 1.12 (1.05-1.20) | 1.09 (1.06-1.12) | ----- |
| p-value | 1.6×10⁻⁴ | 0.23 | 4.2×10⁻⁴ | 3.9×10⁻³ | 8.5×10⁻⁴ | 2.0×10⁻¹⁰ | 0.030 |

ᵃ Number of cases/controls.
ᵇ p-value for interaction between the summary score and ethnic group (4 df).
ᶜ OR adjusted for age (quartiles), the first 10 eigenvectors from principal component analysis, and race (in pooled
analysis).
Chapter Two: Testing for pleiotropic effects of known type 2 diabetes (T2D) and
obesity risk variants on breast cancer.
A brief introduction:
Body mass index (BMI) is an established risk factor for several common diseases,
including postmenopausal breast cancer and T2D. Epidemiological studies have also
shown an association between T2D and breast cancer independent of BMI. In addition,
pathway studies indicate that breast cancer shares some biological markers with diabetes
and obesity. The aim of this study was to explore the shared genetic etiology of breast cancer
with T2D and obesity by testing for pleiotropic effects of 31 established risk variants for
T2D (n=18) and obesity (n=13) in a study of 1,915 breast cancer cases and 2,884
controls from the Multiethnic Cohort (MEC). We tested for the individual effect of
each SNP on breast cancer risk; in addition, we constructed three risk scores, comprising
the T2D risk variants, the obesity risk variants, and all the risk variants together, which we
used to test for aggregate effects.
We found no association between the T2D and obesity risk variants and breast cancer.
None of the three risk scores was associated with breast cancer risk (T2D SNPs:
OR=1.01, P=0.31; obesity SNPs: OR=1.02, P=0.24; All SNPs: OR=1.01, P=0.14),
indicating that the shared biological pathway of the diseases was not due to pleiotropic
effects of the GWAS risk variants.
This work was published in Cancer Epidemiology, Biomarkers & Prevention in May,
2011 (PMID: 21357383).
No association of risk variants for diabetes and obesity with breast cancer: the
Multiethnic Cohort and PAGE studies
Fang Chen¹, Lynne R. Wilkens², Kristine R. Monroe¹, Daniel O. Stram¹, Laurence N.
Kolonel², Brian E. Henderson¹, Loïc Le Marchand², Christopher A. Haiman¹
¹Department of Preventive Medicine, Keck School of Medicine, University of Southern
California/Norris Comprehensive Cancer Center, Los Angeles, California
²Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii.
Abstract
Body mass index is an established risk factor for post-menopausal breast cancer.
Epidemiologic studies have also reported a positive association between type 2 diabetes
(T2D) and breast cancer risk. To investigate a genetic basis linking these common
phenotypes with breast cancer, we tested 31 common variants for T2D and obesity in a
case-control study of 1,915 breast cancer cases and 2,884 controls nested within the
Multiethnic Cohort (MEC) study. Following adjustment for multiple tests, we found no
significant association between any variant and breast cancer risk. Summary scores
comprised of the numbers of risk alleles for T2D and/or obesity were also not found to be
significantly associated with breast cancer risk. Our findings provide no evidence for
association between established T2D and/or obesity risk variants and breast cancer risk
among women of various ethnicities. These results suggest that the potential for a shared
biology between T2D/obesity and breast cancer is not due to pleiotropic effects of these
risk variants.
Introduction
Obesity is a risk factor for many common chronic diseases, including breast cancer in
postmenopausal women (Morimoto et al., 2001; Petrelli et al., 2002) and type 2 diabetes
(T2D). Many epidemiologic studies have also reported diabetics to have a greater risk of
breast cancer than non-diabetics, independent of body weight (Grote et al., 2010;
Novosyadlyy et al., 2010). Biological markers of obesity and diabetes, such as insulin
and insulin-like growth factors (IGFs), have been associated with breast cancer risk
(Endogenous Hormones and Breast Cancer Collaborative Group, 2010; Kaaks, 2004;
Verheus et al., 2006), which suggests that there may be shared biological processes in the
etiology of these common phenotypes. We further explored the hypothesis of shared
etiologic pathways for obesity, T2D and breast cancer, by testing for pleiotropic effects
of 31 established risk variants for T2D (n=18) and obesity (n=13) in a study of 1,915
breast cancer cases and 2,884 breast cancer controls from the Multiethnic Cohort (MEC).
Methods
Study subjects
The MEC is a prospective cohort study consisting of 215,251 adult men and women
living in Hawaii and California (Kolonel et al., 2000b) predominantly of five populations:
European Americans, African Americans, Native Hawaiians, Japanese and Latinos.
Through 2005, the breast cancer case-control study in the MEC included 1,915 invasive
cases and 2,014 controls. Cases were identified through cohort linkage to population-
based cancer Surveillance, Epidemiology and End Results (SEER) registries in California
and Hawaii. We also included an additional 870 controls with no history of breast cancer
from a colorectal cancer (CRC) case-control study in the MEC.
Genotyping
Genotyping of the 31 SNPs was performed using the allelic discrimination assay. The
genotype completion rate for each SNP was >95.0% for cases and controls (average
99.1%). Hardy-Weinberg Equilibrium (HWE) was assessed for each allele in each
racial/ethnic group and of the 155 tests, 7 were significant while 8 were expected.
Statistical analysis
We tested for log-additive effects of the 31 variants with odds ratios estimated using
unconditional logistic regression adjusted for age (quartiles), body mass index (quartiles),
self-reported diabetes and race/ethnicity (in pooled analysis). To account for multiple
hypothesis testing, an α ≤ 0.0016 (0.05/31 tests) was used. To examine the combined
contribution of all variants on breast cancer risk, we constructed summary risk scores,
taken as the number of risk alleles for the 18 validated T2D SNPs, for the 13 validated
obesity SNPs and for the total of 31 T2D/obesity SNPs, respectively. Individuals missing
genotypes were given the mean score for that variant within each population of the same
race/ethnicity. We excluded from the analysis 32 (1.7%) cases and 78 (2.7%) controls
with missing genotypes for ≥ 10 SNPs. Analysis for overall breast cancer risk was
conducted on 1,883 cases and 2,806 controls. We also conducted analyses stratified by
ER status (ER+ cases, n=1,217; ER- cases, n=299). The statistical analysis was
performed using the SAS 9.2 package, SAS Institute Inc., Cary, North Carolina.
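The multiple-testing arithmetic quoted in this chapter can be checked with a few lines. This is only a worked calculation; it assumes a nominal 0.05 level for the HWE tests, which is what the expectation of about 8 significant results among 155 tests implies, and the variable names are illustrative.

```python
n_snps = 31          # T2D and obesity variants tested
n_populations = 5    # racial/ethnic groups in the MEC
alpha = 0.05         # nominal significance level

# Bonferroni-corrected threshold for the single-SNP tests (0.05/31).
bonferroni_threshold = alpha / n_snps

# Number of nominally significant SNPs expected by chance in one analysis.
expected_nominal_hits = alpha * n_snps

# HWE was assessed per variant and per group: 31 x 5 tests.
n_hwe_tests = n_snps * n_populations
expected_hwe_rejections = alpha * n_hwe_tests
```

These values agree, after rounding, with the threshold of 0.0016 and the expected counts of 1.6 and 8 quoted in the text.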
Results
The mean age of the breast cancer cases (65.3 years at diagnosis) was only slightly higher
than that of controls (64.9 years at the time of blood draw).
We observed a nominally significant association (p<0.05) with only 1 variant in the
pooled analysis (Tables 2-1 and 2-2) whereas 1.6 were expected. The most significant
findings included inverse associations with rs5219 (KCNJ11) among all cases (OR=0.89,
P=0.012) and ER- cases (OR=0.73, P=0.0031), as well as with rs864745 (JAZF1) in ER-
cases (OR=0.75, P=0.0020). These associations, however, were no longer significant
after adjusting for multiple comparisons (Supplemental Table 2-1). Results were similar
when limiting the analysis to postmenopausal women (n=1,197 cases and 1,731 controls).
We also did not find any significant association of the aggregate risk scores comprised of
the T2D, obesity or T2D/obesity risk alleles with breast cancer risk (T2D SNPs: OR=1.01,
P=0.31; obesity SNPs: OR=1.02, P=0.24; All SNPs: OR=1.01, P=0.14).
Discussion
We found no strong evidence that the validated risk variants for T2D and obesity are
associated with breast cancer risk among women of various ethnicities. Neither did we
find any significant association between a summary risk score comprised of the risk
alleles for these variants and breast cancer risk. We had adequate statistical power (80%)
to detect an OR of 1.21 for SNPs with a MAF of 0.10, and an OR of 1.15 for SNPs with a
MAF of 0.20. However, power may be lower as most of these markers of T2D and
obesity risk were identified in GWAS among men and women of European ancestry and
may not be strongly correlated with the functional alleles in all populations.
In conclusion, while obesity and, to a lesser extent, T2D are risk factors for breast cancer,
we found no evidence that the known risk variants for T2D or obesity are associated with
breast cancer risk in a multiethnic population. These data suggest that the potential for a
shared biology between T2D/obesity and breast cancer is not due to pleiotropic effects of
these risk variants.
Table 2-1. Association of known T2D and obesity risk alleles with breast cancer risk
by race/ethnicity.
Each population cell gives the OR (95% CI)ᵃ followed by the risk allele frequency in controls (ca, cases; co, controls).

| SNP | Allele testedᵇ | Chr. | Nearest gene | European Americans (503 ca/633 co) | African Americans (381 ca/542 co) | Native Hawaiians (135 ca/344 co) | Japanese Americans (509 ca/782 co) | Latinos (355 ca/505 co) | P_hetᶜ |
|---|---|---|---|---|---|---|---|---|---|
| Type 2 Diabetes SNPs | | | | | | | | | |
| rs10923931 | T | 1 | NOTCH2 | 1.13 (0.85-1.49); 0.097 | 0.87 (0.71-1.07); 0.33 | 0.96 (0.49-1.86); 0.055 | 0.97 (0.58-1.63); 0.025 | 1.28 (0.93-1.76); 0.095 | 0.32 |
| rs7578597 | T | 2 | THADA | 1.05 (0.88-1.48); 0.89 | 1.13 (0.90-1.41); 0.75 | 1.04 (0.54-2.00); 0.95 | 1.47 (0.69-3.15); 0.99 | 0.71 (0.48-1.03); 0.94 | 0.27 |
| rs1801282 | C | 3 | PPARG | 1.14 (0.88-1.48); 0.87 | 0.70 (0.39-1.26); 0.98 | 1.06 (0.55-2.04); 0.94 | 0.83 (0.54-1.28); 0.97 | 1.15 (0.84-1.57); 0.89 | 0.41 |
| rs4607103 | C | 3 | ADAMTS9 | 0.86 (0.71-1.04); 0.74 | 0.99 (0.81-1.22); 0.71 | 0.89 (0.64-1.22); 0.74 | 0.95 (0.80-1.13); 0.65 | 0.99 (0.80-1.22); 0.68 | 0.81 |
| rs4402960 | T | 3 | IGF2BP2 | 0.91 (0.76-1.08); 0.34 | 1.07 (0.88-1.30); 0.49 | 0.89 (0.64-1.23); 0.28 | 0.91 (0.76-1.08); 0.33 | 1.16 (0.92-1.45); 0.26 | 0.30 |
| rs10010131 | G | 4 | WFS1 | 1.01 (0.86-1.20); 0.59 | 0.91 (0.74-1.12); 0.67 | 0.86 (0.60-1.23); 0.80 | 1.14 (0.68-1.89); 0.98 | 0.96 (0.78-1.19); 0.70 | 0.82 |
| rs7754840 | C | 6 | CDKAL1 | 1.02 (0.85-1.20); 0.31 | 1.10 (0.91-1.34); 0.54 | 1.04 (0.78-1.38); 0.52 | 1.06 (0.90-1.25); 0.43 | 1.00 (0.81-1.23); 0.32 | 0.97 |
| rs864745 | A | 7 | JAZF1 | 0.93 (0.78-1.09); 0.53 | 0.87 (0.71-1.07); 0.74 | 0.81 (0.59-1.10); 0.72 | 1.02 (0.84-1.23); 0.78 | 1.01 (0.83-1.24); 0.62 | 0.68 |
| rs13266634 | C | 8 | SLC30A8 | 1.00 (0.83-1.20); 0.71 | 0.78 (0.57-1.07); 0.90 | 0.95 (0.70-1.28); 0.63 | 1.05 (0.89-1.24); 0.63 | 1.04 (0.82-1.31); 0.75 | 0.60 |
| rs2383208 | A | 9 | CDKN2B | 1.01 (0.81-1.25); 0.81 | 1.06 (0.83-1.34); 0.80 | 0.98 (0.71-1.35); 0.74 | 1.07 (0.91-1.26); 0.57 | 0.96 (0.73-1.27); 0.86 | 0.92 |
| rs1111875 | G | 10 | HHEX | 1.01 (0.85-1.20); 0.60 | 1.16 (0.93-1.46); 0.75 | 1.09 (0.79-1.52); 0.30 | 1.20 (1.00-1.44); 0.26 | 0.94 (0.76-1.16); 0.63 | 0.48 |
| rs7903146 | T | 10 | TCF7L2 | 1.06 (0.89-1.28); 0.29 | 1.00 (0.81-1.23); 0.27 | 1.16 (0.77-1.73); 0.14 | 1.05 (0.72-1.55); 0.045 | 1.00 (0.79-1.27); 0.22 | 0.98 |
| rs12779790 | C | 10 | CDC123 | 0.88 (0.71-1.10); 0.17 | 0.97 (0.74-1.26); 0.14 | 1.15 (0.80-1.64); 0.18 | 1.21 (0.97-1.49); 0.16 | 0.98 (0.76-1.27); 0.17 | 0.33 |
| rs2237895ᵈ | C | 11 | KCNQ1 | 0.99 (0.83-1.18); 0.42 | 0.98 (0.77-1.25); 0.21 | 0.94 (0.68-1.31); 0.36 | 1.06 (0.88-1.32); 0.35 | 1.00 (0.80-1.27); 0.42 | 0.96 |
| rs2237897ᵈ | C | 11 | KCNQ1 | 1.01 (0.68-1.49); 0.95 | 1.13 (0.80-1.59); 0.92 | 1.13 (0.77-1.67); 0.78 | 0.92 (0.75-1.13); 0.62 | 0.97 (0.74-1.26); 0.78 | 0.016 |
Table 2-1 (Continued)
Each population cell gives the OR (95% CI)ᵃ followed by the risk allele frequency in controls (ca, cases; co, controls).

| SNP | Allele testedᵇ | Chr. | Nearest gene | European Americans (503 ca/633 co) | African Americans (381 ca/542 co) | Native Hawaiians (135 ca/344 co) | Japanese Americans (509 ca/782 co) | Latinos (355 ca/505 co) | P_hetᶜ |
|---|---|---|---|---|---|---|---|---|---|
| rs5219 | T | 11 | KCNJ11 | 1.07 (0.90-1.28); 0.35 | 0.77 (0.55-1.08); 0.10 | 1.01 (0.75-1.37); 0.38 | 0.81 (0.68-0.95); 0.39 | 0.77 (0.63-0.96); 0.39 | 0.082 |
| rs7961581 | C | 12 | TSPAN8 | 0.96 (0.79-1.17); 0.27 | 0.95 (0.75-1.20); 0.21 | 1.10 (0.80-1.51); 0.28 | 1.01 (0.82-1.24); 0.20 | 1.27 (1.00-1.60); 0.20 | 0.39 |
| rs8050136 | A | 16 | FTO | 0.86 (0.72-1.02); 0.42 | 0.89 (0.73-1.08); 0.44 | 0.63 (0.43-0.93); 0.24 | 1.07 (0.88-1.30); 0.21 | 0.96 (0.77-1.19); 0.28 | 0.21 |
| Obesity SNPs | | | | | | | | | |
| rs2815752 | A | 1 | NEGR1 | 1.00 (0.85-1.19); 0.63 | 1.10 (0.90-1.34); 0.54 | 1.04 (0.70-1.56); 0.83 | 0.85 (0.64-1.14); 0.93 | 1.06 (0.85-1.31); 0.71 | 0.63 |
| rs10913469 | C | 1 | SEC16B | 0.99 (0.80-1.23); 0.18 | 0.87 (0.70-1.08); 0.28 | 0.57 (0.38-0.85); 0.21 | 1.20 (1.00-1.44); 0.23 | 0.89 (0.70-1.14); 0.20 | 0.016 |
| rs6548238 | T | 2 | TMEM18 | 1.08 (0.87-1.34); 0.17 | 1.07 (0.80-1.44); 0.11 | 0.59 (0.31-1.13); 0.079 | 1.15 (0.89-1.49); 0.10 | 1.17 (0.88-1.56); 0.13 | 0.40 |
| rs925946ᵉ | T | 11 | BDNF | 0.92 (0.76-1.12); 0.29 | 1.01 (0.81-1.26); 0.28 | 1.73 (1.19-2.53); 0.18 | 1.12 (0.70-1.77); 0.030 | 0.90 (0.70-1.15); 0.23 | 0.24 |
| rs6265ᵉ | C | 11 | BDNF | 1.33 (1.06-1.68); 0.79 | 1.32 (0.85-2.05); 0.94 | 0.70 (0.51-0.97); 0.68 | 0.92 (0.78-1.09); 0.60 | 0.91 (0.70-1.19); 0.85 | 0.049 |
| rs10838738 | G | 11 | MTCH2 | 0.99 (0.83-1.19); 0.34 | 1.16 (0.86-1.56); 0.11 | 0.96 (0.70-1.33); 0.32 | 0.99 (0.83-1.17); 0.34 | 1.17 (0.96-1.43); 0.35 | 0.58 |
| rs7138803 | A | 12 | BCDIN3D | 0.93 (0.78-1.10); 0.39 | 0.97 (0.77-1.23); 0.19 | 0.97 (0.68-1.38); 0.19 | 1.06 (0.90-1.26); 0.34 | 0.97 (0.78-1.21); 0.26 | 0.90 |
| rs7498665 | G | 16 | SH2B1 | 0.82 (0.69-0.97); 0.40 | 1.03 (0.84-1.26); 0.27 | 1.03 (0.75-1.41); 0.28 | 1.00 (0.79-1.27); 0.13 | 0.87 (0.72-1.06); 0.45 | 0.47 |
| rs8050136 | A | 16 | FTO | 0.87 (0.74-1.03); 0.42 | 0.91 (0.75-1.10); 0.44 | 0.60 (0.41-0.86); 0.24 | 1.04 (0.86-1.26); 0.21 | 0.96 (0.78-1.19); 0.28 | 0.21 |
| rs17782313 | C | 18 | MC4R | 0.97 (0.80-1.19); 0.22 | 0.98 (0.80-1.20); 0.29 | 1.00 (0.65-1.52); 0.13 | 0.95 (0.79-1.15); 0.23 | 0.92 (0.68-1.23); 0.13 | 0.98 |
| rs11084753 | G | 19 | KCTD15 | 0.89 (0.75-1.07); 0.68 | 1.12 (0.92-1.36); 0.64 | 1.04 (0.78-1.39); 0.42 | 1.03 (0.87-1.22); 0.29 | 0.96 (0.79-1.18); 0.67 | 0.53 |
rs29941
G
19
KCTD15
0.90
(0.75-1.08)
0.69
1.11
(0.86-1.42)
0.82
0.87
(0.65-1.17)
0.40
1.05
(0.87-1.27)
0.21
0.81
(0.66-1.00)
0.66
0.26
a
OR adjusted for age(quartiles), BMI(quartiles), diabetes status (self-report) and ethnicity (in pooled analysis).
b
NCBI
build 36 (forward strand).
c
P value for interaction between risk allele and ethnic groups (4-df).
d
Rs2237895 and
rs2237897 adjusted for each other.
e
Rs925946 and rs6265 adjusted for each other.
Table 2-2. Association of known T2D and obesity risk alleles with breast cancer risk
in ethnicity-pooled analysis.
Pooled analysis: 1,883 cases, 2,806 controls.

SNP           Allele    Chr.  Nearest   Pooled              Risk allele frequency  P for trend
              tested^b        gene      OR (95% CI)^a       in controls            (pooled)
Type 2 Diabetes SNPs
rs10923931    T         1     NOTCH2    1.01 (0.88-1.15)    0.12                   0.94
rs7578597     T         2     THADA     1.03 (0.89-1.20)    0.90                   0.69
rs1801282     C         3     PPARG     1.04 (0.88-1.23)    0.93                   0.63
rs4607103     C         3     ADAMTS9   0.94 (0.86-1.03)    0.70                   0.20
rs4402960     T         3     IGF2BP2   0.98 (0.89-1.07)    0.34                   0.61
rs10010131    G         4     WFS1      0.97 (0.87-1.07)    0.76                   0.55
rs7754840     C         6     CDKAL1    1.05 (0.96-1.14)    0.42                   0.29
rs864745      A         7     JAZF1     0.94 (0.86-1.03)    0.68                   0.17
rs13266634    C         8     SLC30A8   1.00 (0.90-1.10)    0.72                   0.91
rs2383208     A         9     CDKN2B    1.03 (0.93-1.13)    0.74                   0.62
rs1111875     G         10    HHEX      1.07 (0.98-1.18)    0.50                   0.15
rs7903146     T         10    TCF7L2    1.03 (0.92-1.15)    0.19                   0.59
rs12779790    C         10    CDC123    1.03 (0.92-1.15)    0.16                   0.63
rs2237895^c   C         11    KCNQ1     1.00 (0.90-1.10)    0.35                   0.93
rs2237897^c   C         11    KCNQ1     1.00 (0.89-1.13)    0.80                   1.00
rs5219        T         11    KCNJ11    0.89 (0.81-0.97)    0.32                   0.012
rs7961581     C         12    TSPAN8    0.94 (0.76-1.15)    0.23                   0.60
rs8050136     A         16    FTO       0.91 (0.83-1.00)    0.32                   0.055
Obesity SNPs
rs2815752     A         1     NEGR1     1.02 (0.92-1.13)    0.73                   0.70
rs10913469    C         1     SEC16B    0.96 (0.86-1.06)    0.22                   0.38
rs6548238     T         2     TMEM18    1.08 (0.95-1.22)    0.12                   0.23
rs10938397    G         4     GNPDA2    0.97 (0.89-1.06)    0.32                   0.50
rs925946^d    T         11    BDNF      1.02 (0.91-1.14)    0.19                   0.78
Table 2-2 (Continued)
Pooled analysis: 1,883 cases, 2,806 controls.

SNP           Allele    Chr.  Nearest   Pooled              Risk allele frequency  P for trend
              tested^b        gene      OR (95% CI)^a       in controls            (pooled)
Obesity SNPs
rs6265^d      C         11    BDNF      1.01 (0.91-1.12)    0.76                   0.88
rs10838738    G         11    MTCH2     1.05 (0.95-1.15)    0.29                   0.36
rs7138803     A         12    BCDIN3D   0.99 (0.90-1.09)    0.29                   0.83
rs7498665     G         16    SH2B1     0.92 (0.83-1.01)    0.30                   0.073
rs8050136     A         16    FTO       0.92 (0.84-1.00)    0.32                   0.055
rs17782313    C         18    MC4R      0.98 (0.88-1.08)    0.21                   0.63
rs11084753    G         19    KCTD15    1.00 (0.91-1.09)    0.53                   0.97
rs29941       G         19    KCTD15    0.94 (0.85-1.03)    0.54                   0.18

^a OR adjusted for age (quartiles), BMI (quartiles), diabetes status (self-report) and ethnicity (in pooled analysis).
^b NCBI build 36 (forward strand).
^c rs2237895 and rs2237897 adjusted for each other.
^d rs925946 and rs6265 adjusted for each other.
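As a rough consistency check on Table 2-2, a two-sided P value can be approximated from a reported OR and its 95% CI under a Wald (normal) approximation on the log-odds scale. The sketch below is illustrative only: the function name is hypothetical, the original analysis fit logistic models to individual-level data, and rounding of the published OR and CI limits the precision of any back-calculation.

```python
import math

def wald_p_from_or_ci(or_est, ci_low, ci_high, z_crit=1.959964):
    """Approximate SE, z and two-sided P implied by an OR and its 95% CI.

    Assumes the CI was formed as exp(beta +/- z_crit * SE) on the log scale.
    """
    # Width of the CI on the log scale equals 2 * z_crit * SE.
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(or_est) / se
    # Two-sided standard normal tail probability.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p

# rs5219 (KCNJ11), pooled analysis: OR 0.89 (0.81-0.97), reported P for trend 0.012.
se, z, p = wald_p_from_or_ci(0.89, 0.81, 0.97)
print(f"SE={se:.4f}, z={z:.2f}, P={p:.4f}")
```

For this row the back-calculated P value falls close to the tabulated one; small differences are expected because the table entries are rounded to two decimals.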
Chapter Three: Fine-mapping of known breast cancer risk loci in African
Americans
A brief introduction:
Studies in populations of African origin have failed to replicate many of the
GWAS-reported risk variants for breast cancer. Shorter LD blocks and differences in risk
allele frequencies may help explain why GWAS results are not readily portable to these
populations. A panel of markers selected specifically for populations of African origin is
therefore required to study the disease in them. Fine-mapping of the established risk
regions not only provides a chance to find new markers, but also narrows down the
possible location of the functional alleles. In this paper we fine-mapped 19 known breast
cancer risk loci in 5,761 African American breast cancer cases and controls.
Through fine-mapping we identified markers in 4 regions that appear to better capture the
functional allele. We also found 4 novel markers that appear to be independent of the
index signal and to better capture the association. We constructed two risk scores, one
using the GWAS variants and one using the markers identified in this study that better
characterize breast cancer risk in African Americans; the latter was much more strongly
associated with disease risk (per allele OR=1.18, P=2.8×10^-24 vs. OR=1.04,
P=6.1×10^-5). Through fine-mapping of the known breast cancer risk loci, we found
markers that better characterize genetic risk of breast cancer in African Americans.
This work was published in Human Molecular Genetics in Nov., 2011 (PMID:
21852243).
Fine-mapping of breast cancer susceptibility loci characterizes genetic risk in
African Americans
Fang Chen^1, Gary K. Chen^1, Robert C. Millikan^2, Esther M. John^3, Christine B.
Ambrosone^4, Leslie Bernstein^5, Wei Zheng^6, Jennifer J. Hu^7, Regina G. Ziegler^8,
Sandra L. Deming^6, Elisa V. Bandera^9, Sarah Nyante^2, Julie R. Palmer^10, Timothy R.
Rebbeck^11, Sue A. Ingles^1, Michael F. Press^12, Jorge L. Rodriguez-Gil^7, Stephen J.
Chanock^8, Loïc Le Marchand^13, Laurence N. Kolonel^13, Brian E. Henderson^1, Daniel O.
Stram^1, Christopher A. Haiman^1
^1 Department of Preventive Medicine, Keck School of Medicine and Norris
Comprehensive Cancer Center, University of Southern California, Los Angeles, CA,
USA
^2 Department of Epidemiology, Gillings School of Global Public Health, and Lineberger
Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC, USA
^3 Northern California Cancer Center, Fremont, CA and Stanford University School of
Medicine and Stanford Cancer Center, Stanford, CA, USA
^4 Department of Cancer Prevention and Control, Roswell Park Cancer Institute, Buffalo,
NY, USA
^5 Division of Cancer Etiology, Department of Population Science, Beckman Research
Institute, City of Hope, CA, USA
^6 Division of Epidemiology, Department of Medicine, Vanderbilt Epidemiology Center,
and Vanderbilt-Ingram Cancer Center, Vanderbilt University School of Medicine,
Nashville, TN, USA
^7 Sylvester Comprehensive Cancer Center and Department of Epidemiology and Public
Health, University of Miami Miller School of Medicine, Miami, FL, USA
^8 Epidemiology and Biostatistics Program, Division of Cancer Epidemiology and
Genetics, National Cancer Institute, Bethesda, MD, USA
^9 The Cancer Institute of New Jersey, New Brunswick, NJ, USA
^10 Slone Epidemiology Center at Boston University, Boston, MA, USA
^11 University of Pennsylvania School of Medicine, Philadelphia, PA, USA
^12 Department of Pathology, Keck School of Medicine and Norris Comprehensive Cancer
Center, University of Southern California, Los Angeles, CA, USA
^13 Epidemiology Program, Cancer Research Center, University of Hawaii, Honolulu, HI,
USA
Abstract
Genome-wide association studies (GWAS) have revealed 19 common genetic variants
that are associated with breast cancer risk. Testing of the index signals found through
GWAS and fine-mapping of each locus in diverse populations will be necessary for
characterizing the role of these risk regions in contributing to inherited susceptibility. In
this large study of breast cancer in African American women (3,016 cases and 2,745
controls), we tested the 19 known risk variants identified by GWAS and replicated
associations (P<0.05) with only 4 variants. Through fine-mapping, we identified markers
in 4 regions that better capture the association with breast cancer risk in African
Americans as defined by the index signal (2q35, 5q11, 10q26 and 19p13). We also
identified statistically significant associations with markers in 4 separate regions (8q24,
10q22, 11q13 and 16q12) that are independent of the index signals and may represent
putative novel risk variants. In aggregate the more informative markers found in the study
enhance the association of these risk regions with breast cancer in African Americans
(per allele OR=1.18, P=2.8×10^-24 vs. OR=1.04, P=6.1×10^-5). In this detailed analysis of
the known breast cancer risk loci, we have validated and improved upon markers of risk
that better characterize their association with breast cancer in women of African ancestry.
Introduction
Genome-wide association studies (GWAS) of breast cancer have identified at least 19
chromosomal regions that harbor common alleles that contribute to genetic susceptibility
(Ahmed et al., 2009; Antoniou et al., 2010; Easton et al., 2007; Fletcher et al., 2011;
Hunter et al., 2007; Stacey et al., 2007; Stacey et al., 2008; Thomas et al., 2009; Turnbull
et al., 2010; Zheng et al., 2009b). These discoveries have improved our
understanding of genetic risk for this common cancer, although it is argued that many
more markers will be needed both to explain disease heritability and to predict disease
in the clinical setting (Gail, 2008; Pepe and Janes, 2008; Pharoah et al., 2002). Except
for the breast cancer risk locus at 6q25 identified in a GWAS of Chinese women, the risk
loci for breast cancer have been revealed in studies in women of European ancestry. We
have recently shown in a multiethnic study that a summary score comprised of the index
variants at many of these risk loci is statistically significantly associated with breast
cancer risk in multiple populations (odds ratio (OR) per allele of >1.10), but not in
African Americans (Chen et al., 2010). Similar studies in African American women have
also reported lack of replication with many of the reported index signals (Ruiz-Narvaez et
al., 2010; Stacey et al., 2010; Zheng et al., 2009a). Limited statistical power of these
initial reports as well as variation in both allele frequency and patterns of linkage
disequilibrium across populations may be contributing factors as to why the associations
found in the GWAS populations may not be generalizable to African Americans.
Association testing of the risk variants as well as fine-mapping in a sufficiently large
sample of African Americans will be needed to identify and localize the subset of
markers that best define risk of the functional allele(s) within known risk regions.
In the present study, we tested common genetic variation at the breast cancer risk loci
identified in women of European and Asian descent in a large sample comprised of 3,016
African American breast cancer cases and 2,745 controls to identify markers of risk that
are relevant to this population. More specifically, we examined the index variants and
conducted fine-mapping of each locus both to improve the current set of risk markers in
African Americans and to identify new risk variants for breast cancer. We then
applied this information to model breast cancer risk in African American women in
an attempt to characterize the spectrum of genetic risk in this population defined by
common variants at the known risk loci.
Results
The ages of cases and controls ranged from 22 to 87 years and 23 to 86 years,
respectively, with cases and controls having similar mean ages (55 and 58 years,
respectively; Supplemental Table 3-1).
We tested 19 validated breast cancer risk variants (referred to as “index variants”
throughout the paper) at 1p11, 2q35, 3p24, 5p12, 5q11, 6q25, 8q24, 9p21, 9q31, 10p15,
10q21, 10q22, 10q26, 11p15, 11q13, 14q24, 16q12, 17q23 and 19p13 in models adjusted
for age, study, global ancestry (the first 10 eigenvectors) and local ancestry (Table 3-1,
Supplemental Table 3-2) (Ahmed et al., 2009; Antoniou et al., 2010; Easton et al., 2007;
Fletcher et al., 2011; Hunter et al., 2007; Stacey et al., 2007; Stacey et al., 2008; Thomas
et al., 2009; Turnbull et al., 2010; Zheng et al., 2009b); 17 SNPs were directly genotyped,
while 2 were imputed (r^2>0.98; see Methods). All 19 variants were common (frequency
≥0.05) in African Americans, with 11 variants being more common in Europeans than
African Americans (Table 3-1, Figure 3-1). In previous GWAS, the index signals had modest
odds ratios (1.05-1.29 per copy of the risk allele) and our sample size provided ≥70%
statistical power to detect the reported effects for 12 of the 19 variants (at P<0.05;
Supplemental Table 3-2).
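To give a feel for how power depends on effect size and allele frequency at this sample size, the sketch below uses a common large-sample approximation for a two-sided per-allele (log-additive) test. It is not the power calculation used in the study: the function names are hypothetical, the risk allele frequency of 0.30 is an assumed illustrative value, and the formula assumes Hardy-Weinberg proportions and a small effect.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def approx_power(or_per_allele, raf, n_cases, n_controls, z_crit=1.959964):
    """Approximate power of a two-sided per-allele Wald test at alpha=0.05.

    Uses the near-null information for the log OR: N * phi * (1 - phi) * Var(G),
    with Var(G) = 2 * raf * (1 - raf) under Hardy-Weinberg proportions.
    """
    n = n_cases + n_controls
    phi = n_cases / n                      # fraction of subjects who are cases
    geno_var = 2.0 * raf * (1.0 - raf)     # variance of the 0/1/2 genotype count
    ncp = abs(math.log(or_per_allele)) * math.sqrt(n * phi * (1.0 - phi) * geno_var)
    return norm_cdf(ncp - z_crit) + norm_cdf(-ncp - z_crit)

# Range of per-allele ORs reported in previous GWAS, at an assumed RAF of 0.30.
for odds_ratio in (1.05, 1.10, 1.29):
    power = approx_power(odds_ratio, 0.30, n_cases=3016, n_controls=2745)
    print(f"OR={odds_ratio:.2f}: approximate power={power:.2f}")
```

Under this approximation power rises steeply between the smallest and largest reported effects, which is in line with the statement that power was ≥70% for only a subset of the index variants.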
We observed positive associations with 11 of the 19 variants (OR>1); however, only 4
were statistically significant (P<0.05 at 2q35, 9q31, 10q26 and 19p13; Table 3-1). Of the
15 variants that were not replicated at P<0.05, statistical power was <70% for only 7 of
the variants. Although power was more limited, we also evaluated associations by
estrogen receptor (ER) status as some risk variants have been found to be more strongly
associated with ER-positive (ER+) or ER-negative (ER-) breast cancer (Garcia-Closas et
al., 2008; Stacey et al., 2007). We observed positive associations with 12 variants (2 at
P<0.05) for ER+ disease (n=1,520) and with 9 variants for ER- (3 at P<0.05; n=988)
(Supplemental Table 3-3). For only one variant did we observe statistically significant
risk heterogeneity by ER status (rs13387042 at 2q35, P=0.013) (Supplemental Table 3-3).
Local ancestry was included in all models as it was found to be associated with breast
cancer risk in many regions (Supplemental Table 3-4). We observed nominally
significant associations between local ancestry and overall breast cancer, ER+ or ER-
disease risk at 5 loci (5p12, 6q25, 8q24, 10p15, 10q26). The most statistically significant
association was between European ancestry and ER+ breast cancer risk at 6q25 (OR per
European chromosome=1.19, P=6.2×10^-3). The inverse association observed
between European ancestry and ER+ disease risk at 10q26 (OR per European
chromosome=0.85, P=0.011) is consistent with previous reports of over-representation of
African ancestry at this locus in many of these same cases (Fejerman et al., 2009;
Pasaniuc et al., 2011).
Aside from statistical power, the lack of a statistically significant association with an
index variant (OR>1 and p<0.05) suggests that the particular variant revealed in the
GWAS populations may not be adequately correlated with the biologically relevant allele
in African Americans. In an attempt to identify a better genetic marker of risk in African
Americans we conducted fine-mapping across all risk regions using SNPs genotyped on
the Illumina 1M array and SNPs imputed to Phase 2 HapMap populations (see Methods).
If a marker associated with risk in African Americans represents the same signal as that
reported in the initial GWAS, then it should be correlated to some degree with the index
signal in the GWAS population. Using HapMap data for the populations in which the risk
variant was identified (Utah residents with ancestry from northern and western Europe
(CEU), or Han Chinese in Beijing, China (CHB)), we catalogued and tested all SNPs that
were correlated (r^2≥0.2) with the index signal (within 250kb), applying an α_a of
3.2×10^-3, which was estimated as 0.05/the average number of tags needed to capture
(r^2≥0.8) the common risk alleles correlated with the index allele in each region in the
Yoruba HapMap population (in Ibadan, Nigeria (YRI); Supplemental Table 3-5). We also
tested for novel independent associations, focusing on SNPs that were uncorrelated with
the index signal in the initial GWAS populations. Here, we applied a Bonferroni correction
for defining novel associations as statistically significant in each region, with α_b
estimated as 0.05/the total number of tags needed to capture (r^2≥0.8) all common risk
alleles in the 19 regions in the YRI population (α_b=1.0×10^-5; similar to the
genome-wide-type correction of 5×10^-8, which accounts for the number of tags needed to
capture all common alleles in the genome; Supplemental Table 3-5). For each region,
stepwise logistic regression was used with SNPs kept in the final model based on α_a or
α_b (results
for each model are provided in Supplemental Tables 3-6 and 3-7). These procedures were
applied to all cases and controls as well as in hypothesis-generating analyses stratified by
ER status.
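The two significance thresholds described above are both Bonferroni-type corrections of the form 0.05 divided by a number of tag SNPs. The sketch below back-calculates the tag counts implied by the stated thresholds; these implied counts are derived here for illustration and are not quoted from Supplemental Table 3-5, and the function names are hypothetical.

```python
def bonferroni_alpha(n_tags, family_alpha=0.05):
    """Per-test threshold when family_alpha is split evenly over n_tags tests."""
    return family_alpha / n_tags

def implied_tags(alpha, family_alpha=0.05):
    """Number of tests implied by a stated per-test threshold."""
    return family_alpha / alpha

# Thresholds stated in the text.
alpha_a = 3.2e-3   # correlated-SNP analysis (average tags per region)
alpha_b = 1.0e-5   # novel-association analysis (total tags over 19 regions)
alpha_gw = 5e-8    # conventional genome-wide threshold

print(f"alpha_a implies about {implied_tags(alpha_a):.1f} tags per region on average")
print(f"alpha_b implies about {implied_tags(alpha_b):.0f} tags across the 19 regions")
print(f"genome-wide threshold implies about {implied_tags(alpha_gw):.0f} tags")
```

Read this way, α_a corresponds to roughly 16 tags per region and α_b to roughly 5,000 tags in total, compared with about one million for the genome-wide threshold.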
At 9 loci we detected variants that were statistically significantly associated with breast
cancer risk in African Americans. These regions include 9q31 where the sole marker of
risk was the index signal (rs865686: OR=1.08; P=0.034, Table 3-1). In 5 of these 9
regions, the index marker itself was not statistically significantly associated with disease
risk. Through fine-mapping we revealed markers in four regions that were more
significantly associated with risk than the index signal (>1 order of magnitude change in
the p-value) and are likely capturing the same signal (2q35, 5q11, 10q26 and 19p13). We
also identified markers in four regions that are not correlated with the index signal in the
GWAS populations (8q24, 10q22, 11q13 and 16q12) and may represent putative novel
risk variants, with one being specific for ER+ disease (8q24) (Table 3-1, Figure 3-2 and
Supplemental Table 3-8). These regions are discussed below.
Risk variants that better define the index signal in African Americans
2q35
The index signal at 2q35 was statistically significantly associated with risk of overall
breast cancer (rs13387042: OR=1.12, P=7.5×10^-3; Table 3-1) and ER+ disease (OR=1.22,
P=2.6×10^-4; Supplemental Table 3-3). However, we found stronger associations with two
markers that are each modestly correlated with the index signal in CEU and YRI:
rs13000023 with overall breast cancer (OR=1.20, P=5.8×10^-4) and rs12998806 with ER+
disease (OR=1.39, P=3.3×10^-6) (Table 3-1 and Supplemental Table 3-8). As shown in
Supplemental Figure 3-1, the signal in this region appeared limited to ER+ breast cancer,
which is consistent with the initial report of this risk locus (Stacey et al., 2007) but not
with subsequent large-scale replication efforts in European populations (Milne et al., 2009).
5q11
We found a positive but non-significant association with the index signal at 5q11, which
is located 79 kb centromeric of the MAP3K1 gene (rs889312: OR=1.07, P=0.084; Table 3-1).
Fine-mapping revealed statistically significant associations with two markers,
rs16886165 for overall breast cancer (OR=1.15, P=6.5×10^-4) and rs832529 for ER- disease
(OR=1.22, P=1.3×10^-3; Table 3-1 and Supplemental Table 3-8). These SNPs show greater
correlation with the index signal in Europeans (CEU, r^2=0.40 and 0.46) than in Africans
(YRI, r^2<0.01 and r^2=0.09), which suggests that they may be better markers of the
biologically functional variant in African Americans (Table 3-1, Figure 3-2).
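The r^2 values quoted for CEU and YRI measure pairwise linkage disequilibrium between a marker and the index SNP. A minimal sketch of one common estimator, the squared Pearson correlation of genotype counts, is shown below. It is an assumption that this is comparable to the HapMap-based values in the text, which are typically computed from phased haplotypes; the function name and the toy genotype vectors are hypothetical.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2.

    Returns 0.0 when either SNP is monomorphic, a convention adopted here
    because the correlation is undefined in that case.
    """
    n = len(g1)
    if n == 0 or n != len(g2):
        raise ValueError("genotype vectors must be non-empty and of equal length")
    m1 = sum(g1) / n
    m2 = sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        return 0.0
    return (cov * cov) / (v1 * v2)

# Toy genotypes for an index SNP and two candidate markers (illustrative only).
index_snp = [0, 0, 2, 2]
linked_marker = [0, 0, 2, 2]       # carries the same information as the index SNP
unlinked_marker = [0, 2, 0, 2]     # varies independently of the index SNP

print(genotype_r2(index_snp, linked_marker))
print(genotype_r2(index_snp, unlinked_marker))
```

Applying the same function to genotypes from two reference panels would give the kind of population-specific contrast described above, where a marker tracks the index SNP in one population but not in the other.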
10q26
Both the index signal, rs2981582 (OR=1.11, P=8.6×10^-3; Table 3-1), and rs2981578, which
was identified previously through fine-mapping in African Americans (an effort to which
some of these studies contributed) (Udler et al., 2009), were statistically significantly
associated with risk (rs2981578: OR=1.24, P=1.7×10^-4; Table 3-1). Variant rs2981578 was
the most strongly associated marker in the region for overall breast cancer and for ER+
disease, which is consistent with previous reports of variation in this region being more
strongly associated with ER+ breast cancer (Supplemental Table 3-8) (Garcia-Closas et al.,
2008). In fine-mapping the locus we observed a suggestive association between a correlated
marker and ER- disease (rs2912774: OR=1.19, P=2.1×10^-3; Supplemental Table 3-8);
however, the association was also noted with ER+ disease (OR=1.10, P=0.041; Supplemental
Table 3-9) and is likely capturing the same signal as rs2981578.
19p13
19p13 was the first risk locus reported to harbor a variant that may be specific for ER-
disease (Antoniou et al., 2010). In African Americans, the index variant was statistically
significantly associated with risk of overall breast cancer (rs2363956: OR=1.14,
P=8.0×10^-4), as well as ER+ (OR=1.12, P=0.016) and ER- disease (OR=1.14, P=0.018;
Table 3-1 and Supplemental Table 3-3). The most significant association in the region
for overall breast cancer and ER+ disease was with rs3745185 (P=3.7×10^-5 and
P=8.2×10^-4, respectively), which is likely to be capturing the same functional variant
(r^2=0.57 in CEU and 0.19 in YRI; Table 3-1 and Supplemental Table 3-8). The most
significant marker for ER- breast cancer was correlated with both rs2363956 and
rs3745185 (rs11668840: OR=1.25, P=5.1×10^-5; Supplemental Tables 3-8 and 3-10).
Novel risk-associated markers at breast cancer susceptibility loci
8q24
Given the importance of the 8q24 locus in cancer, we conducted association testing
across the entire cancer risk region (126.0 Mb-130.0 Mb) (Al Olama et al., 2009;
Crowther-Swanepoel et al., 2010; Freedman, 2006; Ghoussaini et al., 2008; Goode et al.,
2010; Gudmundsson et al., 2009; Gudmundsson et al., 2007; Haiman et al., 2007; Jia et
al., 2009; Kiemeney et al., 2008; Salinas et al., 2008; Yeager et al., 2009; Yeager et al.,
2007). The index signal (rs13281615) was not statistically significantly associated with
risk in African Americans (Table 3-1 and Supplemental Table 3-3), nor did we identify
significant associations with correlated SNPs. However, we did detect a significant
association between rs16902056 and ER+ breast cancer (risk allele frequency, 0.95;
P=6.7×10^-6; ER-: P=0.66; Supplemental Table 3-8). This SNP is located 78 kb
centromeric of the index variant and is not correlated with the index variant (r^2<0.01 in
CEU and r^2=0.027 in YRI). No statistically significant associations were observed with
variants found previously in association with cancers of the bladder and ovary, or
leukemia (rs9642880: OR=1.03, P=0.58; rs10088218: OR=1.02, P=0.62; rs2456449:
OR=1.07, P=0.14) (Crowther-Swanepoel et al., 2010; Goode et al., 2010; Kiemeney et al.,
2008). Of the known risk variants for prostate cancer (Al Olama et al., 2009;
Gudmundsson et al., 2009; Gudmundsson et al., 2007; Haiman et al., 2007; Salinas et al.,
2008; Yeager et al., 2009; Yeager et al., 2007) we found a single nominally significant
(P<0.05) association with the same risk allele of rs1016343 (P=0.015) which is
located >260 kb centromeric of the breast cancer risk region and is not correlated with
rs13281615 or rs16902056.
10q22
We observed no association with the index signal at 10q22 (rs704010), which is located in
intron 1 of the gene ZMIZ1, or with any correlated markers. However, we did detect
strong evidence of a second signal located 215 kb telomeric in intron 12 of ZMIZ1
(rs12355688: OR=1.24, P=6.8×10^-6). As shown in Table 3-1 and Figure 3-2, this
putative novel risk variant is not correlated with the index variant in the CEU or YRI
populations (r^2<0.01).
11q13
No positive association was noted with the index variant at 11q13. However, we did
detect evidence of a second independent signal (rs609275: OR=1.20, P=1.0×10^-5),
located 74 kb telomeric of the index signal and 53 kb centromeric of CCND1. The variant
is monomorphic in the CEU population and therefore uncorrelated with the index signal
there; its r^2 with the index signal in the YRI population is <0.01 (Table 3-1).
16q12
As in previous studies of African Americans, we were not able to replicate the association
signal defined by the index variant rs3803662 (Table 3-1) (Ruiz-Narvaez et al., 2010;
Zheng et al., 2009a). A recent study of African Americans reported a suggestive
association with SNP rs3104746, which is located 15 kb telomeric of rs3803662
(Ruiz-Narvaez et al., 2010). This SNP has a minor allele frequency of 0.04 in the HapMap
CEU population and 0.19 in our African American controls, and is modestly correlated with
rs3803662 in Africans (r^2=0.31 in YRI), but not in Europeans (r^2=0.038; Supplemental
Table 3-10). Fine-mapping around this putative signal revealed a perfect proxy (r^2=1) for
rs3104746, rs3112572, which is significantly associated with breast cancer risk in
African Americans (OR=1.18, P=3.9×10^-4), with the association noted to be stronger for
ER+ breast cancer (OR=1.27, P=3.1×10^-5; Table 3-1 and Supplemental Table 3-8).
For index SNPs found to be nominally associated with breast cancer risk, as well as risk-
associated markers identified through fine-mapping, we also tested for associations by
genotype. Results from the genotype-specific model were consistent with log-additive
associations (Supplemental Tables 3-9 and 3-11). Risk variants at 2q35 and 8q24 were
also found to have significantly stronger associations with ER+ breast cancer than ER-
disease (Supplemental Table 3-7) which is consistent with previous studies (Garcia-
Closas et al., 2008; Stacey et al., 2007).
We observed no statistically significant associations with common variation at 10 risk
loci on 1p11, 3p24, 5p12, 6q25, 9p21, 10p15, 10q21, 11p15, 14q24 and 17q23
(Supplemental Figure 3-2). We also could not replicate the association with the recently
identified SNP rs9397435 at 6q25 that was found through fine-mapping in European,
African and Asian population samples (Stacey et al., 2010) (P=0.26 for overall breast
cancer, P=0.71 for ER+ and P=0.36 for ER- tumor subtypes). Neither could we replicate
the association with SNP rs4784227 at 16q12, which was identified by a recent multi-
stage GWAS in women of Asian ancestry (Long et al., 2010) in our African American
sample (P=0.51 overall, P=0.35 and P=0.65 for ER+ and ER- subtypes, respectively).
Risk modeling
We next estimated the cumulative effect of all breast cancer risk variants, and compared a
summary risk score comprised of unweighted counts of all GWAS-reported risk variants
to a risk score that included variants we identified as being associated with risk in African
Americans (Table 3-2). Using the 19 index signals from GWAS (see Methods), the OR
per allele was 1.04 (95% CI, 1.02-1.06; P=6.1×10^-5) and individuals in the top quintile of
the risk allele distribution were at 1.4-fold greater risk (P=7.4×10^-5) of breast cancer
compared to those in the lowest quintile (Table 3-2). As expected, the risk score was
improved when utilizing the markers that we identified at the known risk loci as being
more relevant to African Americans (8 markers for overall breast cancer: 2q35, 5q11,
9q31, 10q22, 10q26, 11q13, 16q12 and 19p13; OR=1.18; 95% CI, 1.14-1.22;
P=2.8×10^-24), with risk for those in the top quintile being 2.2 times that observed in the
lowest quintile (P=3.6×10^-17). This score was significantly associated with risk of both
ER+ (OR=1.20, P=1.7×10^-19) and ER- (OR=1.15, P=2.8×10^-9) disease (P_het=0.12)
(Supplemental Table 3-12).
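The risk modeling described above can be sketched as follows: an unweighted score counts risk alleles across markers, a logistic regression of case status on the score gives the per-allele OR, and subjects are grouped into quintiles of the score. This is a minimal illustration, not the study's code. All function names are hypothetical, the model omits the covariates used in the paper (age, study, global and local ancestry), and the rule for choosing quintile cut points is an assumption because the text does not specify it.

```python
import math

def risk_score(genotypes):
    """Unweighted score: total number of risk alleles (each marker coded 0/1/2)."""
    return sum(genotypes)

def fit_logistic_1d(x, y, n_iter=25):
    """Newton-Raphson fit of logit P(y=1) = b0 + b1*x.

    Returns (b0, b1, se_b1); exp(b1) is the per-allele OR when x is a risk score.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # score for the intercept
            g1 += (yi - p) * xi       # score for the slope
            h00 += w                  # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} g for the 2x2 information matrix H.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard error from the information matrix of the final iteration.
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1

def quintile_cutpoints(scores):
    """Four cut points splitting the sorted scores into five groups (needs >= 5 scores)."""
    s = sorted(scores)
    n = len(s)
    return [s[(n * k) // 5 - 1] for k in (1, 2, 3, 4)]

def quintile_of(score, cutpoints):
    """Quintile index 0 (lowest) to 4 (highest) for a score."""
    return sum(score > c for c in cutpoints)

# Toy check with a binary predictor: the fitted OR equals the 2x2 cross-product ratio.
x = [0] * 30 + [1] * 30
y = [1] * 10 + [0] * 20 + [1] * 20 + [0] * 10
b0, b1, se = fit_logistic_1d(x, y)
print(f"OR={math.exp(b1):.3f}, SE(log OR)={se:.3f}")
```

In practice the quintile comparison would be made by fitting the same kind of model with indicator variables for quintile membership, using the lowest quintile as the reference group.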
Stratifying by first-degree family history of breast cancer differentiated risk further:
those with a family history and in the top quintile of the risk score distribution (4% of
the population) had a 3.4-fold greater risk (P=9.9×10^-14) compared to those without a
family history and in the lowest quintile of the risk score (Table 3-2).
In hypothesis-generating analyses, we also developed risk scores for ER+ and ER- breast
tumor subtypes utilizing the most informative markers revealed through fine-mapping of
each phenotype. These phenotype-specific scores were highly significant (ER+: OR=1.30,
P=6.0×10^-18; ER-: OR=1.20, P=2.3×10^-10), with statistically significant heterogeneity
noted when the scores were applied to the other subtype (P_het=1.7×10^-5 and 5.0×10^-3
for ER+ and ER- scores, respectively) (Supplemental Table 3-12).
Discussion
In this large study of breast cancer in African American women we were able to replicate
associations with 4 of the 19 index variants (at P<0.05). Through fine-mapping, we
observed that overall breast cancer risk was statistically significantly associated with
markers in 4 regions which are likely to capture the GWAS-reported signal and to serve
as better markers of the functional allele and risk in African Americans. We also detected
putative novel associations that are independent of the index signals in 3 regions for
overall breast cancer (10q22, 11q13 and 16q12) and in one region for ER+ disease (8q24).
In 10 of the risk regions, however, we were not able to replicate the GWAS index signals,
nor did we detect statistically significant associations of common SNPs with breast
cancer risk at the levels of statistical significance we set for fine-mapping. The inability
to replicate associations with the index signals despite adequate statistical power (>70%
power for 12 of 19 variants) suggests that they are unlikely to be functional variants, or
that they do not capture the functional variants as efficiently in this population. Our
ability to find
associated markers in 5 regions where index signals were not significantly associated
with risk also demonstrates the value of testing common variation at GWAS-identified
risk loci in additional populations (Chen et al., 2010; Ruiz-Narvaez et al., 2010; Stacey et
al., 2010; Udler et al., 2009; Waters et al., 2009; Waters et al., 2010).
In four regions we observed risk markers that are correlated and in the same LD block
with the index markers in CEU (rs13000023 at 2q35, rs16886165 at 5q11, rs2981578 at
10q26 and rs3745185 at 19p13). It is likely that these risk markers capture the same
signal as defined by the index markers, based on the r^2 values between these markers
and the index markers (≥0.35). We cannot rule out the possibility, though, that some of
them may represent a second, independent signal in the same region.
In the four regions where we observed independent signals, the risk alleles (rs16902056
at 8q24, rs12355688 at 10q22, rs609275 at 11q13 and rs3112572 at 16q12) were
uncorrelated with and not in the same LD block as the index variant in Europeans (CEU,
r^2<0.04; distances from the index signal ranged from 14kb at 16q12 to 215kb at 10q22)
(Supplemental Figure 3-3). Therefore, these variants are likely to pick up a novel signal
independent of the index signal. However, because of different LD patterns in European
and African-ancestry populations they may each mark the same functional variant and if
the functional variant is less common it may not be well captured by either common
marker alone. At 10q22, both the index SNP and the novel variant are located within
introns of the ZMIZ1 gene. ZMIZ1 encodes zinc finger MIZ-type containing 1, which
regulates the activity of various transcription factors including the androgen receptor,
Smad3/4, and p53 (Lee et al., 2007; Li et al., 2006; Sharma et al., 2003). At 11q13,
rs609275 lies 74 kb telomeric of the index signal and in closer proximity to a number of
candidate genes including CCND1 (encoding cyclin D1, a protein crucial for cell cycle
control), ORAOV1 (encoding oral cancer overexpressed 1) and FGF19 (encoding
fibroblast growth factor 19). The association at 16q12 confirms the findings of a previous,
smaller study of African Americans (Ruiz-Narvaez et al., 2010), and is consistent with a
previous fine-mapping study suggesting that African Americans may harbor a separate
causal variant in this region (Udler et al., 2010). Whether this variant is influencing the
same genes/pathways as the index variant rs3803662 is not known; however, the stronger
associations noted for both variants with ER+ disease (Garcia-Closas et al., 2008; Stacey
et al., 2007) suggest that they may affect the same biological process.
Notably, at region 19p13, which was originally reported in association with ER- breast
cancer (Antoniou et al., 2010), the index signal was statistically significantly associated
with both ER+ and ER- subtypes in African Americans. In addition, we found a stronger
marker in this region (rs3745185) for ER+ as well as overall breast cancer risk (Table 3-1
and Supplemental Table 3-8). We also found stronger associations with ER+ than ER-
disease for variants in many regions, including 2q35, 8q24, 10q26 and 16q12, which is
consistent with previous reports (Garcia-Closas et al., 2008; Stacey et al., 2007). In this
study we also found strong signals for ER- disease in regions 5q11, 10q26 and 19p13. It
is possible that these signals may explain some of the excess risk for ER- disease in
African Americans, since these risk alleles have higher frequencies in this population
than they do in European-ancestry populations. However, their
contribution to racial and ethnic differences in disease incidence can only be determined
once the functional variants have been identified and tested across populations.
Unfortunately we were not able to assess associations with triple negative (ER/PR/HER2
negative) breast cancer, since HER2 status was available for only a limited number of
cases. However, in a large study of women of European ancestry that tested many of
these same index variants, further stratification on tumor subtype using HER2 status was
not additionally informative for ER/PR negative breast cancer (Broeks et al., 2011).
The observation of secondary signals at many loci, and of associations of variants with
different tumor subtypes that have not yet been reported in European-ancestry
populations, could indicate a different genetic architecture of breast cancer across
populations. For example, the index signal at TNRC9 does not replicate in African
Americans, but there appears to be a second risk variant that is unique to this population.
At FGFR2, which was originally reported to be associated with ER+ disease in women of
European ancestry, we found a signal for ER- disease with the marker correlated with the
index variant. Similarly, for chromosome 19p13, which was reported as an ER- locus, we
observed an association with ER+ breast cancer. However, these findings and their
implications require further validation.
We investigated local ancestry as a potential confounding factor in the analysis of each
risk locus. At 5 loci we observed nominally significant evidence of association between
local ancestry and breast cancer risk, with the most statistically significant association
observed at 6q25 between European ancestry and ER+ breast cancer risk. While the
association of local ancestry and breast cancer risk needs to be validated in additional
large studies, the inability to identify a risk variant that is differentiated in frequency
between populations of European and African ancestry implies that the
association with local ancestry at many regions is a false positive signal and/or that we have
not tested an adequate surrogate of the functional alleles.
The majority of the variants identified by GWAS for common cancers are of low risk
(relative risks <1.30) and in aggregate are not yet informative for risk prediction (Gail,
2008; Pepe and Janes, 2008; Pharoah et al., 2002). Until the functional alleles at each
susceptibility locus are identified and their effects are accurately estimated, modeling of
the genetic risk will rely on markers that best capture risk for a given population. Many
of the markers we identified at these risk loci appear to have stronger associations with
breast cancer risk compared to the GWAS-identified variants in African American
women. The risk score for overall breast cancer was also equally efficient for ER+ and
ER- tumors. However, our hypothesis-generating model suggests that identification of
tumor subtype-specific variants will improve the fit of these models.
While this is the largest study of African Americans to date to investigate genetic risk at
known breast cancer susceptibility loci, statistical power was still limited. We had only
35% power to detect an OR of 1.10 for a risk allele with a frequency of 0.10, which may account
for our inability to replicate GWAS signals or risk-associated markers in 10 of the
regions. While attempting to apply a strict threshold for declaring significance through
fine-mapping, we did not take into account testing for multiple phenotypes (overall breast
cancer as well as ER+ and ER- disease). As a result, the α levels used as selection criteria may
be too liberal. However, our risk modeling focused on the variants revealed for overall
breast cancer, whereas we consider the associations observed for markers identified for
ER+ or ER- disease and used in the subtype-specific risk modeling as hypothesis-
generating. Since all of the cases and controls used for fine-mapping/discovery were also
included in the risk modeling, the risk model is likely to over-estimate the level of
association due to winner’s curse. Instead of partitioning the sample into test and
validation sets, we felt it was necessary to use all of the subjects in the association testing
of known variants and in fine-mapping to increase the statistical power to detect
associations in each region. Therefore, other studies with reasonable power in African
Americans must be performed in the future to test the model presented.
In summary, through fine-mapping of the breast cancer susceptibility regions in a large
sample of African American women, we identified markers with enhanced association
with breast cancer in this population. Validation and augmentation of this model is
needed before risk modeling based on genetic variants of low risk can be implemented in
the clinical setting.
Materials and Methods
Ethics Statement
The Institutional Review Board at the University of Southern California approved the
study protocol.
Study Populations
This study included 9 epidemiological studies of breast cancer among African American
women, which comprise a total of 3,153 cases and 2,831 controls. Sample size and
selected characteristics for these studies are summarized in Supplemental Table 3-1.
Below is a brief description of these studies.
The Multiethnic Cohort Study (MEC): The MEC is a prospective cohort study of 215,000
men and women in Hawaii and Los Angeles (Kolonel et al., 2000a) between the ages of
45 and 75 years at baseline (1993-1996). Through December 31, 2007, a nested breast
cancer case-control study in the MEC included 556 African American cases (544
invasive and 12 in situ) and 1,003 African American controls. An additional 178 African
American breast cancer cases (ages: 50-84) diagnosed between June 1, 2006 and
December 31, 2007 in Los Angeles County (but outside of the MEC) were included in
the study.
The Los Angeles component of The Women’s Contraceptive and Reproductive
Experiences (CARE) Study: The CARE Study is a large multi-center population-based
case-control study that was designed to examine the effects of oral contraceptive (OC)
use on invasive breast cancer risk among African American women and white women
ages 35-64 years in five U.S. locations (Marchbanks et al., 2002). Cases in Los Angeles
County were diagnosed from July 1, 1994 through April 30, 1998, and controls were
sampled by random-digit dialing (RDD) from the same population and time period; 380
African American cases and 224 African American controls were included in the study.
The Women’s Circle of Health Study (WCHS): The WCHS is an ongoing case-control
study of breast cancer among European women and African American women in the
New York City boroughs and in seven counties in New Jersey (Ambrosone et al., 2009).
Eligible cases included women with invasive breast cancer between 20 and 74 years of
age; controls were identified through RDD. The WCHS contributed 272 invasive African
American cases and 240 African American controls.
The San Francisco Bay Area Breast Cancer Study (SFBCS): The SFBCS is a population-
based case-control study of invasive breast cancer in Hispanic, African American and
non-Hispanic White women conducted between 1995 and 2003 in the San Francisco Bay
Area (John et al., 2007). African American cases, ages 35-79 years, were diagnosed
between April 1, 1995 and April 30, 1999, with controls identified through RDD.
Included from this study were 172 invasive African American cases and 231 African
American controls.
The Northern California Breast Cancer Family Registry (NC-BCFR): The NC-BCFR is a
population-based family study conducted in the Greater San Francisco Bay Area, and one
of 6 sites of the Breast Cancer Family Registry (BCFR) (John et al., 2004). African
American breast cancer cases in NC-BCFR were diagnosed after January 1, 1995 and
between the ages of 18 and 64 years; population controls were identified through RDD.
Genotyping was conducted for 440 invasive African American cases and 53 African
American controls.
The Carolina Breast Cancer Study (CBCS): The CBCS is a population-based case-control
study conducted between 1993 and 2001 in 24 counties of central and eastern North
Carolina (Newman et al., 1995). Cases were identified by a rapid case ascertainment
system in cooperation with the North Carolina Central Cancer Registry, and controls were
selected from the North Carolina Division of Motor Vehicles and United States Health
Care Financing Administration beneficiary lists. Participants’ ages ranged from 20 to 74
years. DNA samples were provided from 656 African American cases with invasive
breast cancer and 608 African American controls.
The Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) Cohort:
PLCO, coordinated by the U.S. National Cancer Institute (NCI) in 10 U.S. centers,
enrolled approximately 155,000 men and women, aged 55-74 years, during 1993-2001
in a randomized, two-arm trial to evaluate the efficacy of screening for these four cancers
(Prorok et al., 2000). A total of 64 African American invasive breast cancer cases and
133 African American controls contributed to this study.
The Nashville Breast Health Study (NBHS): The NBHS is a population-based case-
control study of incident breast cancer conducted in Tennessee (Zheng et al., 2009a). The
study was initiated in 2001 to recruit patients with invasive breast cancer or ductal
carcinoma in situ, and controls, recruited through RDD between the ages of 25 and 75
years. NBHS contributed 310 African American cases (57 in situ), and 186 African
American controls.
Wake Forest University Breast Cancer Study (WFBC): African American breast cancer
cases and controls in WFBC were recruited at Wake Forest University Health Sciences
from November 1998 through December 2008 (Smith et al., 2008). Controls were
recruited from the patient population receiving routine mammography at the Breast
Screening and Diagnostic Center. Age range of participants was 30-86 years. WFBC
contributed 125 cases (116 invasive and 9 in situ) and 153 controls to the analysis.
Genotyping and Quality Control
Genotyping in stage 1 was conducted using the Illumina Human1M-Duo BeadChip. Of
the 5,984 samples from these studies (3,153 cases and 2,831 controls), we attempted
genotyping of 5,932, removing samples (n=52) with DNA concentrations <20 ng/µl.
Following genotyping, we removed samples based on the following exclusion criteria: 1)
unknown replicates (≥98.9% genetically identical) that we were able to confirm (only one
of each duplicate was removed, n=15); 2) unknown replicates that we were not able to
confirm through discussions with study investigators (pair or triplicate removed, n=14); 3)
samples with call rates <95% after a second attempt (n=100); 4) samples with ≤5%
African ancestry (n=36) (discussed below); and 5) samples with <15% mean
heterozygosity of SNPs in the X chromosome and/or similar mean allele intensities of
SNPs on the X and Y chromosomes (n=6) (these are likely to be males).
In the analysis, we removed SNPs with <95% call rates (n=21,732) or minor allele
frequencies (MAFs) <1% (n=80,193). To assess genotyping reproducibility we included
138 replicate samples; the average concordance rate was 99.95% (>99.93% for all pairs).
We also eliminated SNPs with genotyping concordance rates <98% based on the
replicates (n=11,701). The final analysis dataset included 1,043,036 SNPs genotyped on
3,016 cases (1,520 ER+, 988 ER- and the remaining 508 cases with unknown ER status)
and 2,745 controls, with an average SNP call rate of 99.7% and average sample call rate
of 99.8%.
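The sample accounting in this section can be checked with simple arithmetic. The sketch below uses only the counts reported above; the variable and key names are ours and are introduced purely for illustration.

```python
# Sample counts reported in the Genotyping and Quality Control section.
cases_total, controls_total = 3153, 2831
low_dna = 52  # samples with DNA concentration <20 ng/ul, not genotyped

# Post-genotyping sample exclusions, keyed by the criterion number in the text.
exclusions = {
    "1_confirmed_replicates": 15,
    "2_unconfirmed_replicates": 14,
    "3_call_rate_below_95pct": 100,
    "4_african_ancestry_5pct_or_less": 36,
    "5_likely_male": 6,
}

attempted = cases_total + controls_total - low_dna
final_n = attempted - sum(exclusions.values())

# Final analysis set as reported: cases by ER status, plus controls.
er_positive, er_negative, er_unknown = 1520, 988, 508
final_cases = er_positive + er_negative + er_unknown
final_controls = 2745

print(attempted)                              # 5932
print(final_n, final_cases + final_controls)  # 5761 5761
```

The two totals agree, so the five exclusion criteria account for every sample removed between genotyping and the final analysis dataset.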
Statistical Analysis
Ancestry Estimation. We used principal components analysis (PCA) (Price et al., 2006) to
estimate global ancestry among the 5,761 individuals using 2,546 ancestry informative
markers. Eigenvector 1 was highly correlated (ρ=0.997, p<1×10⁻¹⁶) with percentage of
European ancestry, estimated in HAPMIX (Price et al., 2009), and accounted for 10.1%
of the variation between subjects; subsequent eigenvectors accounted for no more than
0.5%. At each locus and for each participant, we also estimated local ancestry (i.e. the
number of European chromosomes (continuous between 0-2) carried by the participant)
using the HAPMIX program (Price et al., 2009). To summarize local ancestry at each
region, for each individual we averaged across all local ancestry estimates that were
within the start and end points of the region (Supplemental Table 3-5). We observed an
inverse association between European ancestry at 10q26 and risk of ER+ breast cancer
(OR per European chromosome=0.85, P=0.011; Supplemental Table 3-4) which is
consistent with previous reports of over-representation of African ancestry at this locus in
these cases (Fejerman et al., 2009; Pasaniuc et al., 2011). We also observed a positive
association between European ancestry at 6q25 and risk of ER+ disease (OR per European
chromosome=1.19, P=6.2×10⁻³; Supplemental Table 3-4). To address the potential for
confounding by genetic ancestry, we adjusted for both global and local ancestry in all
analyses.
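The regional summary of local ancestry described above (averaging each individual's local ancestry estimates that fall within a region's start and end points) can be sketched as follows. The function name, the input layout, and the example positions and values are hypothetical; the actual HAPMIX output format is not reproduced here.

```python
def regional_local_ancestry(estimates, start, end):
    """Average one individual's local ancestry estimates within [start, end].

    estimates: list of (position, european_copies) pairs, where
    european_copies is the estimated number of European chromosomes (0 to 2).
    """
    in_region = [copies for pos, copies in estimates if start <= pos <= end]
    if not in_region:
        raise ValueError("no local ancestry estimates inside the region")
    return sum(in_region) / len(in_region)


# Hypothetical estimates for one participant (positions in base pairs).
example = [(100, 0.0), (200, 1.0), (300, 2.0), (900, 2.0)]
regional_mean = regional_local_ancestry(example, start=100, end=300)
print(regional_mean)  # 1.0; the estimate at position 900 lies outside the region
```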
SNP Imputation. In order to generate a dataset suitable for fine-mapping, we carried out
genome-wide imputation using the software MACH (Li et al., 2009). Phased haplotype
data from the founders of the CEU and YRI HapMap Phase 2 samples were used to infer
LD patterns in order to impute ungenotyped markers. The r² metric, defined as the
observed variance divided by the expected variance, provides a measure of the quality of
the imputation at any SNP, and was used as a threshold in determining which SNPs to
filter from analysis (r²<0.3). Of the 1,539,328 common SNPs (MAF ≥0.05) in the YRI
population in HapMap Phase 2, we could impute 1,392,294 (90%) with r²≥0.8. For all
imputed SNPs presented in the Results and Tables reported herein, the average r² was
0.92 (estimated in MACH).
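As an illustration of the imputation quality metric, the sketch below computes the ratio of the observed variance of imputed allele dosages to an expected variance. It assumes that the expected variance is the Hardy-Weinberg value 2p(1-p), with p estimated from the dosages, and that the population (not sample) variance is used; these details and the example dosages are our assumptions, and MACH's own implementation may differ.

```python
from statistics import pvariance


def imputation_r2(dosages):
    """Observed dosage variance divided by the variance expected under
    Hardy-Weinberg equilibrium, 2p(1-p), with p estimated from the dosages."""
    p = sum(dosages) / (2 * len(dosages))
    expected = 2 * p * (1 - p)
    if expected == 0:
        return 0.0  # monomorphic SNP: imputation carries no information
    return pvariance(dosages) / expected


# Hypothetical dosages: confidently imputed genotypes...
confident = [0, 1, 2, 1, 0, 1, 1, 2]
# ...and the same genotypes shrunk halfway toward the mean dosage (uncertain calls).
uncertain = [1 + 0.5 * (d - 1) for d in confident]

r2_confident = imputation_r2(confident)
r2_uncertain = imputation_r2(uncertain)
keep_uncertain = r2_uncertain >= 0.3  # filter threshold used in this study
```

Uncertain imputation pulls dosages toward their mean, which shrinks the observed variance and therefore lowers the ratio; this is why a low r² flags poorly imputed SNPs.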
Association Testing. For each typed and imputed SNP, odds ratios (OR) and 95%
confidence intervals (95% CI) were estimated using unconditional logistic regression
adjusting for age at diagnosis (or age at the reference date for controls), study, the first 10
eigenvectors and local ancestry. For each SNP, we tested for allele dosage effects through
a 1 d.f. Wald chi-square trend test.
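The relationship between a logistic regression coefficient, the per-allele OR, its 95% CI and the 1 d.f. Wald test can be sketched as below. The function name is ours. As a rough check, the coefficient and standard error are backed out from the rounded estimate reported for rs12355688 in Table 3-1, so the resulting P value is only expected to agree with the published one in order of magnitude.

```python
import math


def per_allele_summary(beta, se):
    """Per-allele OR, 95% CI and 1 d.f. Wald chi-square P value from a
    logistic regression coefficient (log OR per allele) and its standard error."""
    z = beta / se
    return {
        "or": math.exp(beta),
        "ci": (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)),
        "chi2": z * z,
        # The upper tail of a 1 d.f. chi-square equals the two-sided normal tail.
        "p": math.erfc(abs(z) / math.sqrt(2)),
    }


# Back out beta and SE from the rounded values OR 1.24 (1.13-1.36).
beta = math.log(1.24)
se = (math.log(1.36) - math.log(1.13)) / (2 * 1.96)
result = per_allele_summary(beta, se)
print(result)
```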
We fine-mapped each risk locus using the combined genotyped and imputed SNPs in
search of 1) a SNP that is more strongly associated with risk in African Americans than the index
signal; and 2) a novel signal that is independent of the index signal. As some risk loci
have been found to be more strongly associated with breast cancer subtypes, we
investigated three outcomes: 1) overall breast cancer, 2) ER+ breast cancer, and 3) ER-
breast cancer, with the latter two being hypothesis-generating. These analyses included
SNPs (genotyped and imputed) spanning 250 kb upstream and 250 kb downstream of
each index signal. If the index signal was contained within an LD block (based on the D′
statistic) of >250 kb, then the region was extended to include the entire region of LD.
Stepwise regression was performed by region to select the most informative risk variants
as discussed below, in models adjusted for age, study, global ancestry (the first 10
eigenvectors) and local ancestry. In the stepwise regression we preserved the original
sample size by using the mean genotype of typed subjects in place of “no-calls” for SNPs
with <100% genotyping completion rate.
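The mean-genotype substitution used to preserve sample size in the stepwise regression can be sketched as follows. Coding missing calls as `None` and the function name are illustrative assumptions, not the software actually used.

```python
def fill_no_calls(genotypes):
    """Replace missing genotype calls (None) with the mean allele count
    among subjects who were successfully typed at this SNP."""
    typed = [g for g in genotypes if g is not None]
    if not typed:
        raise ValueError("SNP has no successful genotype calls")
    mean_genotype = sum(typed) / len(typed)
    return [mean_genotype if g is None else g for g in genotypes]


# Hypothetical allele counts for five subjects, one of them a no-call.
filled = fill_no_calls([0, 1, None, 2, 1])
print(filled)  # [0, 1, 1.0, 2, 1]
```

Because the substituted value equals the SNP mean, a filled subject contributes nothing to the estimated slope for that SNP but remains in the model for every other covariate.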
Within each known risk locus, it is expected that markers that are associated with risk in
African Americans will be correlated with the index signal reported in Europeans. Thus,
we identified and tested SNPs that are correlated (r²>0.2) with the index signals in the
GWAS populations (HapMap CEU or CHB for 6q25). For each region we determined the
number of tags needed to capture all the SNPs correlated with the index signal in the YRI
population (Phase 2 HapMap). The average number of tags in each region was then used
as the correction factor for Bonferroni correction. An α level of 0.05 divided by the average
number of tags needed in each region was applied in the stepwise regression process. For all of the
remaining markers that were not correlated with the index signal (in Europeans), we
applied a more stringent α level for defining statistical significance. In each risk region,
we determined the number of tag SNPs needed to capture all common alleles (MAF>0.05,
with r²>0.8) in the YRI HapMap population. The total number of tags across the 19
regions was then used as a correction factor, as they define the number of independent
tests in each region. An α of 0.05 divided by the number of tags was applied to assess statistical
significance for any putative novel, independent signal in each region. For correlated
SNPs that were selected to be better markers, we also assessed phase to ensure that the
new risk allele is on the same haplotype as the GWAS-reported risk allele in the HapMap
CEU population.
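The two significance thresholds described above can be sketched as below. The tag counts are hypothetical placeholders for three made-up regions (the actual counts for the 19 regions are not given in this section), and the names are ours.

```python
def bonferroni_alpha(n_tests, alpha=0.05):
    """Bonferroni-corrected significance level for n_tests independent tests."""
    return alpha / n_tests


# Hypothetical numbers of tag SNPs per region (NOT the study's values).
# Tags capturing SNPs correlated with the index signal:
correlated_tags = {"region_A": 4, "region_B": 6, "region_C": 5}
# Tags capturing all common alleles (MAF>0.05, r2>0.8) in YRI:
common_tags = {"region_A": 120, "region_B": 95, "region_C": 140}

# Threshold for markers correlated with the index signal:
# 0.05 / average number of tags per region.
average_tags = sum(correlated_tags.values()) / len(correlated_tags)
alpha_correlated = bonferroni_alpha(average_tags)

# More stringent threshold for putative novel, independent signals:
# 0.05 / total number of tags across regions.
alpha_novel = bonferroni_alpha(sum(common_tags.values()))

print(alpha_correlated, alpha_novel)
```

The design reflects prior evidence: markers correlated with a known index signal face a lighter penalty than markers proposed as new, independent signals.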
Risk Modeling. We modeled the cumulative genetic risk of breast cancer using the risk
variants reported in previous GWAS (total=19). We compared the results to a model of
the SNPs found to be significantly associated with risk in African Americans, which
included SNPs identified from the stepwise procedures at all loci for overall breast cancer
risk (presented in Table 3-1). More specifically, in each case we summed the number of
risk alleles for each individual and estimated the odds ratio per allele for this aggregate
unweighted allele count variable as an approximate risk score appropriate for unlinked
variants with independent effects of approximately the same magnitude for each allele.
We then applied this risk score to overall breast cancer as well as ER+/ ER- breast cancer
subtypes. We also constructed risk scores based on risk alleles for ER+ and ER- tumor
subtypes separately, and, as hypothesis-generating, applied both risk scores to overall and
ER+/ER- breast cancer subtypes.
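The unweighted risk score can be sketched as a simple count of risk alleles. In the example below the genotypes are hypothetical; the marker names are the best markers listed in Table 3-1, and the per-allele OR of 1.18 is the estimate reported in Table 3-2. The function names are ours.

```python
def risk_allele_count(genotypes):
    """Unweighted risk score: total number of risk alleles carried,
    summed over SNPs (each genotype is a risk allele count of 0, 1 or 2)."""
    return sum(genotypes.values())


def odds_ratio_for_difference(per_allele_or, extra_alleles):
    """Under a log-additive model with a common per-allele effect, the OR
    comparing two individuals who differ by extra_alleles risk alleles."""
    return per_allele_or ** extra_alleles


# Hypothetical genotypes for one woman at the best markers in Table 3-1.
woman = {
    "rs13000023": 2, "rs16886165": 1, "rs12355688": 0, "rs2981578": 2,
    "rs609275": 1, "rs3112572": 0, "rs3745185": 2,
}
score = risk_allele_count(woman)

# Per-allele OR for the best-marker score from Table 3-2.
or_four_more = odds_ratio_for_difference(1.18, 4)
print(score, round(or_four_more, 2))
```

Giving every allele the same weight is only reasonable when the variants are unlinked and have independent effects of similar magnitude, which is the approximation stated in the text.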
Table 3-1. Associations with common variants at known breast cancer risk regions
in African Americans (3,016 cases, 2,745 controls).

Index SNP from GWAS
Chr. | Nearest gene | Marker | Position, alleles (risk/reference) | RAF in CEU/AAᵃ | OR (95% CI) | Ptrend
1p11 | | rs11249433 | 120982136, G/A | 0.43/0.13 | 1.01 (0.90-1.14) | 0.84
2q35 | | rs13387042 | 217614077, A/G | 0.56/0.72 | 1.12 (1.03-1.21) | 7.5×10⁻³
3p24 | NEK10 | rs4973768 | 27391017, T/C | 0.44/0.36 | 1.04 (0.96-1.13) | 0.32
5p12 | MRPS30 | rs4415084 | 44698272, T/C | 0.38/0.63 | 1.02 (0.95-1.11) | 0.54
5q11 | MAP3K1 | rs889312 | 56067641, C/A | 0.30/0.34 | 1.07 (0.99-1.18) | 0.084
6q25 | C6orf97 | rs2046210ᶜ,ᵈ | 151990059, A/G | 0.38/0.60 | 1.00 (0.93-1.09) | 0.88
8q24 | | rs13281615 | 128424800, G/A | 0.45/0.43 | 1.05 (0.97-1.13) | 0.20
9p21 | CDKN2B | rs1011970 | 22052134, T/G | 0.17/0.33 | 1.05 (0.97-1.14) | 0.24
9q31 | | rs865686 | 109928199, T/G | 0.61/0.52 | 1.08 (1.01-1.17) | 0.034
10p15 | ANKRD16 | rs2380205 | 5926740, C/T | 0.52/0.42 | 0.98 (0.91-1.06) | 0.60
10q21 | ZNF365 | rs10995190 | 63948688, G/A | 0.87/0.83 | 0.97 (0.88-1.08) | 0.57
10q22 | ZMIZ1 | rs704010 | 80511154, T/C | 0.43/0.11 | 0.99 (0.87-1.12) | 0.83
10q26 | FGFR2 | rs2981582 | 123342307, A/G | 0.46/0.46 | 1.11 (1.03-1.19) | 8.6×10⁻³
11p15 | LSP1 | rs3817198 | 1865582, C/T | 0.33/0.17 | 0.98 (0.88-1.08) | 0.63
11q13 | | rs614367 | 69037945, T/C | 0.18/0.13 | 0.96 (0.86-1.07) | 0.45
14q24 | RAD51L1 | rs999737 | 68104435, T/C | 0.26/0.051 | 0.98 (0.82-1.17) | 0.80
16q12 | TNRC9 | rs3803662 | 51143842, A/G | 0.25/0.51 | 0.99 (0.92-1.08) | 0.85
17q23 | COX11 | rs6504950ᶜ | 50411470, G/A | 0.70/0.66 | 1.05 (0.97-1.14) | 0.19
19p13 | ANKLE1 | rs2363956 | 17255124, T/G | 0.45/0.49 | 1.14 (1.05-1.22) | 8.0×10⁻⁴

Best marker in African Americans
Chr. | Marker | Position, alleles (risk/reference) | RAF in CEU/AAᵃ | OR (95% CI) | Ptrend from stepwise analysis | r² with index in CEU/YRIᵇ
2q35 | rs13000023ᶜ | 217632639, G/A | 0.82/0.83 | 1.20 (1.09-1.33) | 5.8×10⁻⁴ | 0.35/0.53
5q11 | rs16886165 | 56058840, G/T | 0.16/0.31 | 1.15 (1.06-1.25) | 6.5×10⁻⁴ | 0.40/<0.01
10q22 | rs12355688 | 80725632, T/C | 0.090/0.20 | 1.24 (1.13-1.36) | 6.8×10⁻⁶ | <0.01/<0.01
10q26 | rs2981578ᶜ | 123330301, C/T | 0.46/0.81 | 1.24 (1.11-1.39) | 1.7×10⁻⁴ | 0.66/0.059
11q13 | rs609275ᶜ | 69112096, C/T | 1.00/0.59 | 1.20 (1.11-1.30) | 1.0×10⁻⁵ | NA/<0.01
16q12 | rs3112572 | 51157948, A/G | 0.020/0.20 | 1.18 (1.08-1.30) | 3.9×10⁻⁴ | 0.038/0.31
19p13 | rs3745185 | 17245267, G/A | 0.52/0.75 | 1.20 (1.10-1.32) | 3.7×10⁻⁵ | 0.57/0.19

SNP positions are based on NCBI build 36.
ORs are per allele odds ratios adjusted for age, study, the first 10 eigenvectors and local ancestry at each risk locus.
Ptrend values are based on a test of trend (1 d.f.).
ᵃ RAF, risk allele frequencies in the original GWAS population (HapMap CEU, or CHB for rs2046210) and AA (African American) controls in this study. Risk allele is the allele associated with increased risk in previous GWAS.
ᵇ Pairwise correlations (r²) between the index signal and the best marker are from the CEU (CHB for rs2046210) and YRI populations in the 1000 Genomes Project (March 2010 release).
ᶜ Imputed SNPs.
ᵈ Index signal reported in Han Chinese. RAFs based on HapMap CHB and r² based on CHB in the 1000 Genomes Project (March 2010 release).
Table 3-2. The association of the total risk score with breast cancer risk in African
Americans.

Columns: (1) index markers from GWAS (19 markers); (2) risk-associated best markers in African Americansᵃ (8 markers), all subjects; (3) best markers, first-degree family history negativeᵇ; (4) best markers, first-degree family history positiveᵇ.

 | (1) Index markers | (2) Best markers | (3) Family history negative | (4) Family history positive
Mean no. of risk alleles in controls (range) | 15.7 (6-25) | 8.4 (3-14) | |
Per allele OR (95% CI) | 1.04 (1.02-1.06) | 1.18 (1.14-1.22) | |
Ptrend | 6.1×10⁻⁵ | 2.8×10⁻²⁴ | |
Subjects, n cases / n controls | 3,016/2,745 | 3,016/2,745 | 2,387/2,349 | 554/303
Risk quintilesᶜ
Q1, n cases / n controls | 536/549 | 352/462 | 281/387 | 62/57
Q1, OR (95% CI) | 1.00 (ref) | 1.00 (ref) | 1.00 (ref) | 1.58 (1.06-2.37)
Q1, P-value | ---- | ---- | ---- | 0.025
Q2, n cases / n controls | 722/742 | 430/505 | 344/437 | 77/47
Q2, OR (95% CI) | 0.99 (0.84-1.16) | 1.17 (0.96-1.42) | 1.15 (0.93-1.43) | 2.18 (1.46-3.26)
Q2, P-value | 0.88 | 0.11 | 0.18 | 1.5×10⁻⁴
Q3, n cases / n controls | 435/382 | 632/625 | 503/549 | 115/53
Q3, OR (95% CI) | 1.15 (0.96-1.39) | 1.37 (1.14-1.64) | 1.31 (1.07-1.60) | 3.14 (2.17-4.53)
Q3, P-value | 0.14 | 7.2×10⁻⁴ | 8.0×10⁻³ | 1.2×10⁻⁹
Q4, n cases / n controls | 753/669 | 665/566 | 517/476 | 132/75
Q4, OR (95% CI) | 1.16 (0.98-1.36) | 1.56 (1.30-1.87) | 1.51 (1.24-1.86) | 2.52 (1.81-3.52)
Q4, P-value | 0.080 | 2.3×10⁻⁶ | 6.2×10⁻⁵ | 4.0×10⁻⁸
Q5, n cases / n controls | 570/403 | 937/587 | 742/500 | 168/71
Q5, OR (95% CI) | 1.44 (1.20-1.72) | 2.16 (1.80-2.58) | 2.11 (1.73-2.56) | 3.44 (2.47-4.77)
Q5, P-value | 7.4×10⁻⁵ | 3.6×10⁻¹⁷ | 1.3×10⁻¹³ | 9.9×10⁻¹⁴

ORs are adjusted for age, study and the first 10 eigenvectors.
Ptrend values are based on a test of trend (1 d.f.).
ᵃ The most significant markers from the stepwise analysis for overall breast cancer in each region from Table 3-1.
ᵇ Information about first-degree family history of breast cancer is available on 97.5% of cases and 96.6% of controls.
ᶜ Based on distribution in controls. (Cut points for index markers aggregate: 13.3, 15, 16, 18; cut points for best markers aggregate: 7, 8, 9, 10).
Figure 3-1. Risk allele frequencies in Europeans and African Americans.
Figure 3-2. –Log P plots for common alleles at 8 breast cancer risk loci in African
Americans.
Chapter Four: Methodological considerations in genome-wide assessment of
heritability
A brief introduction:
The common disease-common variant (CD-CV) hypothesis assumes that a large
proportion of genetic risk for common diseases is attributable to common variants.
Although GWAS have been successful in identifying common risk markers for common
diseases, the common variants that have been identified so far can only explain a small fraction
of the heritability of common diseases. In this chapter we addressed the question of how
much common variants contribute to the heritability of complex diseases by studying
human height. Height is a highly heritable phenotype with an extremely polygenic pattern
of inheritance; previous family studies have reported that the additive heritability of
height is around 80% (Fisher, 1918; Macgregor et al., 2006; Silventoinen et al., 2003). In
2010, Lango Allen et al. studied the aggregate effect of hundreds of common variants
from 180 height risk loci and concluded that only 10% of the phenotypic variation of
height was attributable to the additive effects of these common variants (Lango Allen et
al., 2010). However, according to a study by Yang et al. in a sample of Australian
adolescents, common SNPs could explain a large fraction (45%) of the phenotypic
variation of height (Yang et al., 2010).
In this chapter, we estimated the heritability of height that can be explained by common
variants in a large sample (N=14,419) of people of African ancestry using a variance
component approach first described by Yang et al. (Yang et al., 2010). We
genotyped and analyzed 966,578 autosomal SNPs across the entire genome and estimated
an additive heritability of 44.7% (se: 3.7%) for this phenotype in a sample of evidently
unrelated individuals. However, through an in-depth investigation of the variance
components method, we pointed out that the estimated heritability in these studies might
not be due to the common variants included in the model, and/or the SNPs in LD with
them alone. The amount of heritability that was estimated using this method was more
likely a measurement of the effect of relatedness among the subjects on height which was
captured by the genotyped SNPs. In addition, we examined the performance of the
method using a simulation study and concluded that some population structure seems to
be needed in order for the variance components approach to perform well.
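To make concrete how the genotyped SNPs enter the variance components model, the sketch below builds a genetic relationship matrix (GRM) from allele counts. It follows our understanding of the standardization described by Yang et al. (2010), in which each SNP is centered at 2p and scaled by the square root of 2p(1-p) using the sample allele frequency p; the function name and the toy genotypes are hypothetical. The model then takes var(y) = A·σg² + I·σe², with heritability σg²/(σg² + σe²), and the restricted maximum likelihood fitting step is not shown.

```python
def genetic_relationship_matrix(genotypes):
    """GRM from an n_subjects x n_snps list of allele counts (0, 1 or 2).

    Each SNP is standardized with its sample allele frequency p, and the
    relationship between two subjects is the average product of their
    standardized genotypes over polymorphic SNPs.
    """
    n, m = len(genotypes), len(genotypes[0])
    z = [[0.0] * m for _ in range(n)]
    used = 0
    for k in range(m):
        p = sum(row[k] for row in genotypes) / (2 * n)
        if p == 0.0 or p == 1.0:
            continue  # monomorphic SNP: leave its column at zero
        sd = (2 * p * (1 - p)) ** 0.5
        for i in range(n):
            z[i][k] = (genotypes[i][k] - 2 * p) / sd
        used += 1
    if used == 0:
        raise ValueError("no polymorphic SNPs")
    return [[sum(z[i][k] * z[j][k] for k in range(m)) / used
             for j in range(n)] for i in range(n)]


# Toy data: 4 subjects x 3 SNPs (hypothetical allele counts).
toy = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1]]
A = genetic_relationship_matrix(toy)
```

The off-diagonal entries of A are what the model uses to explain phenotypic similarity, which is consistent with the interpretation above that the SNPs act as a measure of relatedness between subjects and need not be causal themselves.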
We also examined the heritability of height in close relatives defined by probability of
identical-by-descent (IBD) allele sharing (Pr(IBD=1) ≥0.3, n=1,415 and
Pr(IBD=1) ≥0.4, n=575), where we found that the additive heritability estimate increased
to 76.5% (se: 11.7%) and 75.1% (se: 13.3%), respectively. These results indicated that the
additive component of genetic variation for height may have been overestimated in
earlier studies (80%). The proportion of heritability estimated in close relatives could
include variation due to epistatic effects.
This work has been submitted to PLOS Genetics.
Methodological considerations related to a genome-wide assessment of height
heritability among people of African ancestry
Fang Chen¹, Gary K. Chen¹, Venetta Thomas², Christine B. Ambrosone³, Elisa V. Bandera⁴, Sonja I. Berndt⁵, Leslie Bernstein⁶, William J. Blot⁷,⁸, Qiuyin Cai⁸, John Carpten⁹, Graham Casey¹, Stephen J. Chanock⁵, Iona Cheng¹⁰, Lisa Chu¹¹, Sandra L. Deming⁸, W. Ryan Driver¹², Phyllis Goodman¹³, Richard B. Hayes¹⁴, Anselm J. M. Hennis¹⁵,¹⁶, Ann W. Hsing⁵, Jennifer J. Hu², Sue A. Ingles¹, Esther M. John¹¹,¹⁷, Rick A. Kittles¹⁸, Suzanne Kolb¹³, M. Cristina Leske¹⁶, Robert C. Millikan¹⁹, Kristine R. Monroe¹, Adam Murphy²⁰, Barbara Nemesure¹⁶, Christine Neslund-Dudas²¹, Sarah Nyante⁵, Elaine A. Ostrander²², Michael F. Press²³, Jorge L. Rodriguez-Gil², Ben A. Rybicki²¹, Fredrick Schumacher¹, Janet L. Stanford¹³, Lisa B. Signorello²⁴, Sara S. Strom²⁵, Michael J. Thun¹², David Van Den Berg¹, Zhaoming Wang⁵, John S. Witte²⁶, Suh-Yuh Wu¹⁶, Yuko Yamamura²⁶, Wei Zheng⁸, Regina G. Ziegler⁵, Laurence N. Kolonel¹⁰, Loïc Le Marchand¹⁰, Brian E. Henderson¹, Christopher A. Haiman¹, Daniel O. Stram¹
¹ Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, USA
² Sylvester Comprehensive Cancer Center and Department of Epidemiology and Public Health, University of Miami Miller School of Medicine, Miami, FL, USA
³ Department of Cancer Prevention and Control, Roswell Park Cancer Institute, Buffalo, NY, USA
⁴ The Cancer Institute of New Jersey, New Brunswick, NJ, USA
⁵ Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
⁶ Division of Cancer Etiology, Department of Population Science, Beckman Research Institute, City of Hope, CA, USA
⁷ International Epidemiology Institute, Rockville, MD, USA
⁸ Division of Epidemiology, Department of Medicine, Vanderbilt Epidemiology Center, Vanderbilt University and the Vanderbilt-Ingram Cancer Center, Nashville, TN, USA
⁹ The Translational Genomics Research Institute, Phoenix, AZ, USA
¹⁰ Epidemiology Program, Cancer Research Center, University of Hawaii, Honolulu, HI, USA
¹¹ Cancer Prevention Institute of California, Fremont, CA, USA
¹² Epidemiology Research Program, American Cancer Society, Atlanta, GA, USA
¹³ Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
¹⁴ Division of Epidemiology, Department of Environmental Medicine, New York University Langone Medical Center, New York, NY, USA
¹⁵ Chronic Disease Research Centre and Faculty of Medical Sciences, University of the West Indies, Bridgetown, Barbados
¹⁶ Department of Preventive Medicine, Stony Brook University, Stony Brook, NY, USA
¹⁷ Division of Epidemiology, Department of Health Research & Policy, Stanford University School of Medicine and Stanford Cancer Institute, Stanford, CA, USA
¹⁸ Department of Medicine, University of Illinois at Chicago, Chicago, IL, USA
¹⁹ Department of Epidemiology, Gillings School of Global Public Health, and Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC, USA
²⁰ Department of Urology, Northwestern University, Chicago, IL, USA
²¹ Department of Biostatistics and Research Epidemiology, Henry Ford Hospital, Detroit, MI, USA
²² Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
²³ Department of Pathology, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, USA
²⁴ Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA
²⁵ Department of Epidemiology, The University of Texas M.D. Anderson Cancer Center, Houston, TX, USA
²⁶ Institute for Human Genetics, Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA
Abstract
Height has an extremely polygenic pattern of inheritance. Genome-wide association
studies (GWAS) have revealed hundreds of common variants that are associated with
human height at genome-wide levels of significance. However, only a small fraction of
phenotypic variation can be explained by the aggregate of these common variants. In this
large study of African-American men and women (n=14,419), we genotyped and
analyzed 966,578 autosomal SNPs across the entire genome using a variance components
approach, and estimated an additive heritability of 44.7% (se: 3.7%) for this phenotype
in a sample of evidently unrelated individuals. Using simulation, we concluded that the
additive heritability estimate is not necessarily the heritability proportion directly
explained by the genotyped SNPs; in our interpretation the measured SNPs included in
the variance components analysis may best be regarded as instruments used to measure
distant relatedness between individuals rather than being either directly causal or in LD
with directly causal variants. We then explored the performance of the variance
components approach in an unrelated sample and found that the approach fails when a
large number of independent variables are included, indicating that some relatedness
between subjects is required for the method to perform well using a large number of SNPs.
However, whenever such relatedness exists, there is the potential for misattribution of the
effects of measured SNPs compared with the effects of unmeasured and untagged causal
variants.
In two samples of close relatives defined by the probability of identical-by-descent (IBD)
allele sharing (Pr(IBD=1)>=0.3, n=1,415 and Pr(IBD=1)>=0.4, n=575), the additive
heritability estimate increased to 76.5% (se: 11.7%) and 75.1% (se: 13.3%), respectively,
which is consistent with the view (Zuk et al PNAS 2011) that the additive component of
genetic variation for height may have been overestimated in earlier studies (80%) and that
the proportion could also include variation due to epistatic effects.
Introduction
Hundreds of GWAS have reported over 5,000 common variants that were found to be
associated with over 200 diseases and traits since 2005
(http://www.genome.gov/gwastudies/) (Klein et al., 2005). According to the common
disease-common variant (CD-CV) hypothesis discussed by Risch and Merikangas in
1996, common variants should contribute a large fraction of risk to common diseases
(Risch and Merikangas, 1996). There is doubt, however, about the potential of the
GWAS-identified common variants in screening and predicting common diseases. For
example, Gail (2008) showed that a joint model with seven SNPs discovered by GWAS
for breast cancer had lower discriminatory accuracy (AUC=0.574) than a model based on
age at menarche, age at first live birth and family history of breast cancer (AUC=0.607)
(Gail, 2008), and that about 150 common variants that are mildly associated with breast
cancer risk are required to explain the relative risk observed among siblings, in contrast
to the approximately 22 risk loci that have so far been discovered by GWAS (Ahmed et
al., 2009; Antoniou et al., 2010; Easton et al., 2007; Fletcher et al., 2011; Gail, 2008;
Haiman et al., 2011c; Hunter et al., 2007; Kim et al., 2012; Long et al., 2012; Stacey et al.,
2007; Stacey et al., 2008; Thomas et al., 2009; Turnbull et al., 2010; Zheng et al., 2009b).
In addition, Dickson et al (2010) questioned the CD-CV hypothesis, arguing that it was
the rare variants that were responsible for most genetic risk associated with common
diseases, and that the rare variants created synthetic associations between common
variants and common diseases, which were detected by GWAS (Dickson et al., 2010).
With simulation, they illustrated how these synthetic associations arise and showed that
they were not only possible but could be ubiquitous (Dickson et al., 2010). The
proportion of disease risk that common variants can possibly contribute to is also
controversial. Two research groups, Lango Allen et al (2010) and Yang et al (2010),
estimated the heritability of height that is explained by common variants and reached
different conclusions (Lango Allen et al., 2010; Yang et al., 2010). We will describe these
studies in detail below.
The phenotypic variation of a disease is usually attributable to both genetic and
non-genetic risk factors, each of them contributing to a component of variance:
$$V_P = V_G + V_E$$
The proportion of phenotypic variance that is explained by genetic variation, i.e.,
$V_G/V_P$, or broad-sense heritability ($H^2$), measures the extent to which an
individual’s phenotype is determined by his/her genotype (Falconer and Mackay, 1996).
Further decomposition of the genetic proportion of variance includes a combination of the
additive ($V_A$), dominant ($V_D$) and epistatic (interactive, $V_I$) proportion of
variance, i.e.,
$$V_P = V_A + V_D + V_I + V_E$$
The value $V_A/V_P$ describes the extent to which the additive effect of genes inherited
from one’s parents determines his/her phenotype, and is defined as additive (narrow-sense)
heritability ($h^2$) (Falconer and Mackay, 1996). For complex phenotypes such as cancer,
diabetes, obesity, and height, association studies including GWAS have been focused on
the additive effects of risk alleles, as it was generally accepted that ignoring the
dominance and epistatic effects of risk alleles had little effect on the analysis of traits that
are caused by polygenes with mild effects (Crow, 2010).
However, Zuk et al (2012) indicated in their recent paper that epistatic interaction might
in fact play an important role in the genetics of complex diseases and could produce a
large amount of “phantom” additive heritability (Zuk et al., 2012). They claim that
narrow-sense heritability has been overestimated in earlier studies; specifically, they
argue that epistatic effects lead to an overestimation of additive heritability when
heritability is estimated among close relatives. Their main point seems to be that
interactions could account for the missing heritability discussed at length by
Manolio et al (2009) (Manolio et al., 2009). Since complex interactions are inherently
much harder to detect and examine, this would impose a constraint on how much
heritability could be identified using GWAS.
On the other hand, it appears that one can use “unrelated” (only very distantly related)
individuals and still estimate the additive component of variation, recognizing that the
contribution of epistatic interactions decays faster as relationships become more distant.
In GWAS that include some close relatives, it may also be possible to estimate
both these and more general variance components for highly heritable phenotypes (e.g.
height).
When we test for additive effects using GWAS data, analysis restricted to nearly
unrelated individuals should estimate the additive component cleanly. If a higher fraction
of variance is explained when examining close relatives, then this would give us direct
information about the “inflation” of additive heritability due to epistatic components.
Human height is highly heritable and appears to have an extremely polygenic pattern of
inheritance. GWAS including meta-analysis of height have so far identified over 200
common SNPs that are associated with human height (Carty et al., 2012; Cho et al., 2009;
Estrada et al., 2009; Gudbjartsson et al., 2008; Johansson et al., 2009; Kim et al., 2010;
Lango Allen et al., 2010; Lettre et al., 2008; Liu et al., 2010; N'Diaye et al., 2011; Okada
et al., 2010; Sanna et al., 2008; Soranzo et al., 2009; Tonjes et al., 2009; Weedon et al.,
2008; Weedon et al., 2007). Earlier family studies conducted in twins have indicated that
the overall heritability of adult height is around 80% (Fisher, 1918; Macgregor et al.,
2006; Silventoinen et al., 2003). Estimation of heritability that could be explained by
GWAS-identified common variants, however, varies from study to study and lacks a
definite conclusion. For example, Lango Allen et al (2010) showed in a large meta-
analysis of 133,653 individuals of European ancestry that hundreds of variants in 180
genetic loci can only explain 10% of the phenotypic variance of height. Accounting for
potential undiscovered common risk variants of similar effect size increases the explained
proportion of phenotypic variance to 16% (Lango Allen et al., 2010). In contrast,
Yang et al (2010) reported that 45% of the total phenotypic variance could be explained
by considering 294,831 common SNPs simultaneously in an Australian adolescent
sample of 3,925 subjects (Yang et al., 2010). The two studies employed different
statistical approaches when estimating the proportion of phenotypic variation explained
by common variants, namely a score statistic approach and a variance components
approach, respectively, which we will describe in detail below.
In the present study, we assembled a large sample of men and women of African ancestry
from 21 studies in order to explore how much of the heritable component of height is
attributable to common variants in this population. According to the pair-wise IBD
estimation, most of the individuals recruited are only distantly related, but there are also a
few close relatives in the sample (see Methods). Using the IBD sharing information, we
were able to construct subsets of “unrelated” individuals and closer relatives from the
sample and estimate additive heritability of height in both subsets. Although we refer to
additive heritability as “heritability” throughout the article, one should be aware that the
heritability estimation could include both additive and epistatic components, especially in
close relatives, for the reasons stated above.
There are a few different approaches used in polygenic analysis. Commonly used
approaches include the omnibus (global) analysis, the score statistics approach, and
kernel machine approaches (Lango Allen et al., 2010; Liu et al., 2008; Liu et al., 2007;
Purcell et al., 2009; Schaid, 2010a, b). The variance components approach used in Yang
et al is one special case of kernel machine approaches, which projects the similarity in
height on a genetic similarity matrix (G) equivalent to the Balding-Nichols matrix (Astle
and Balding, 2009; Yang et al., 2010).
The method features a mixed model
$$y = \mu + X\alpha + Zu + \varepsilon$$
where $\alpha$ is a vector of fixed effects and $Z$ is a genotype matrix of common variants
that is column-standardized to zero mean and unit variance. The risk coefficients
associated with these variants are treated as random effects with effect sizes $u$,
$u \sim N(0, \sigma_u^2 I)$. The residual terms are distributed as multivariate normal
$N(0, \sigma_e^2 I)$. The polygene term $Zu$, or
$$\sum_{i=1}^{p} z_{ij} u_i$$
in the $j$th subject, where $p$ is the number of markers in the model, has variance
$\sigma_g^2 = p\sigma_u^2$, as genotypes are standardized and assumed to be independent.
Therefore we have
$$\mathrm{var}(y) = \sigma_u^2 ZZ' + \sigma_e^2 I = \sigma_g^2 G + \sigma_e^2 I$$
where $G = ZZ'/p$, or, written in terms of the allele counts $z_{ik}$ (0, 1 or 2) before
standardization,
$$G_{ij} = \frac{1}{p}\sum_{k=1}^{p} \frac{(z_{ik} - 2f_k)(z_{jk} - 2f_k)}{2f_k(1 - f_k)}$$
and $f_k$ is the allele frequency of the $k$th marker. Note that the G matrix is 2 times the
Balding-Nichols matrix, which has the following form
$$\hat{K}_{ij} = \frac{1}{p}\sum_{k=1}^{p} \frac{(x_{ik} - 2f_k)(x_{jk} - 2f_k)}{4f_k(1 - f_k)}$$
This matrix was commonly used for estimation of relatedness (Astle and Balding, 2009),
such as in the principal components analysis (PCA) described by Price et al (2006) (Price
et al., 2006) and the Efficient Mixed-Model Association eXpedited (EMMAX) method
described by Kang et al (2010) (Kang et al., 2010), both of which are widely employed
by researchers to control for hidden population structure.
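As an illustration of the two matrices defined above, the following R sketch builds G and the corresponding Balding-Nichols-type matrix from a small simulated matrix of allele counts. It assumes that genotypes are coded as allele counts (0, 1 or 2) and that the allele frequencies $f_k$ are estimated from the sample itself; the function name `grm_from_counts` and the toy data are hypothetical and are not taken from GCTA or from the analysis reported here.

```r
# Sketch: genetic relationship matrix G = ZZ'/p from allele counts (0/1/2).
# Assumptions (not from the original analysis): allele frequencies are
# estimated from the sample itself and monomorphic markers are dropped.
grm_from_counts <- function(x) {
  f <- colMeans(x) / 2                       # sample allele frequency f_k
  keep <- f > 0 & f < 1                      # drop monomorphic markers
  x <- x[, keep, drop = FALSE]
  f <- f[keep]
  # column-standardize: (x_ik - 2 f_k) / sqrt(2 f_k (1 - f_k))
  z <- sweep(x, 2, 2 * f, "-")
  z <- sweep(z, 2, sqrt(2 * f * (1 - f)), "/")
  G <- tcrossprod(z) / ncol(z)               # G_ij = (1/p) sum_k z_ik z_jk
  list(G = G, K = G / 2)                     # K: Balding-Nichols form
}

set.seed(1)
n <- 20; p <- 500                            # toy dimensions
f_true <- runif(p, 0.05, 0.5)
x <- sapply(f_true, function(fk) rbinom(n, 2, fk))  # n x p allele counts
res <- grm_from_counts(x)
round(mean(diag(res$G)), 2)                  # average diagonal element
```

In this toy setting the subjects are simulated independently, so the off-diagonal elements of G scatter around zero and the diagonal elements scatter around one.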
Since the variance components method relies on genetic relationships between subjects,
the existence of subpopulations may influence the results using this method. Browning
and Browning (2011) pointed out that the estimate of heritability using such an approach
might be biased under certain circumstances, including the existence of population
substructure and correlation between environmental factors and the phenotype. They also
showed that including principal components (PCs) in the model did not fully control for
inflation due to population structure (Browning and Browning, 2011). This implies that
distant relatedness among study subjects, rather than the particular set of SNPs studied
and the “causal” information they are capturing, may underlie estimates of heritability.
That is, heritability estimates may simply be capturing genetic similarity.
We investigate these concerns by answering the following questions: 1) Does the
coverage of different subsets of SNPs, e.g., how well they predict common variants in the
1000 Genomes Project, correspond to the estimation of heritability? For example, how
much heritability can be explained by 100,000 SNPs, which are likely to provide a poor
genome-wide coverage, compared to heritability estimated from 1,000,000 SNPs?
Similarly, if we exclude SNPs in high LD with the causal SNPs in a simulation, do we
still observe a large amount of heritability explained by the remaining SNPs? 2) If the
approach is sensitive to signals from causal variants which only comprise a small subset
of all the SNPs tested, is it also very sensitive to weak hidden population structure? 3)
How does the variance components approach behave technically when a) there is no hidden
population structure and b) there is hidden structure in the population? We will also
investigate how well the approach performs in a nearly unrelated (distantly related) sample.
Methods
Ethics Statement
The Institutional Review Board at the University of Southern California approved the
study protocol.
Study population
This study includes women and men of African ancestry from 9 epidemiological studies
of breast cancer and 12 epidemiological studies of prostate cancer, which comprise a total
sample size of 15,032 (5,984 women and 9,048 men). Please refer to Supporting
Information for a brief description of each of the studies.
Genotyping and quality control
Genotyping was conducted on the Illumina 1M-Duo BeadChip. Of the 15,032 DNA samples
available, we excluded 613 with low DNA concentrations, unexpected sex chromosome
heterozygosity (i.e. conflicting or ambiguous sex based on the X chromosome genotypes
when compared to reported sex), call rates <95%, or <=5% African ancestry. The resulting
dataset included genotypes for a total of 14,419 participants. Starting with 1,153,397
SNPs, we removed SNPs on sex chromosomes (n=34,504) and SNPs with <95% call rate,
MAF <1%, or a P value for Hardy-Weinberg equilibrium (HWE) $<1\times10^{-7}$ (n=142,339).
Randomly selected SNPs that are not located within the known risk loci of height were used
to infer the principal components (PCs) (n=9,976, see below) and were excluded from the
analysis. The final analysis included 966,578 autosomal SNPs among 14,419 subjects.
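The sample and SNP counts above can be checked with simple bookkeeping. The short R sketch below only re-adds the numbers quoted in this section; the variable names are ours and purely illustrative.

```r
# Sample- and SNP-level counts quoted in the text
samples_start <- 15032
samples_excluded <- 613
snps_start <- 1153397
snps_removed <- c(sex_chromosomes   = 34504,
                  call_rate_maf_hwe = 142339,
                  reserved_for_pcs  = 9976)

samples_final <- samples_start - samples_excluded
snps_final <- snps_start - sum(snps_removed)
c(samples = samples_final, snps = snps_final)
```

The result reproduces the 14,419 subjects and 966,578 autosomal SNPs reported for the final analysis.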
Statistical analysis
Controlling for population structure
We used the principal components approach to control for population structure. We used
9,976 SNPs randomly selected from the genome excluding known risk loci for height to
infer PCs. Known risk loci for height were identified from a catalog of published GWAS
(http://www.genome.gov/gwastudies/) listed on the webpage of the National Human
Genome Research Institute, including GWAS scan results from Lango Allen et al (2010),
and a GWAS scan in individuals of African ancestry by N'Diaye et al (2011) (Lango
Allen et al., 2010; N'Diaye et al., 2011). We included the first 10 PCs in all models to
control for population structure.
Estimating heritability explained by common variants
We employed a variance components approach that was described by Yang et al (2010)
to estimate the heritability explained by common SNPs from an additive polygenic model
(Yang et al., 2010).
Using the method, we fit several mixed linear models using the Genome-wide Complex
Trait Analysis tool (GCTA). Fixed effects included age, sex, study site and the first 10
PCs.
Estimating additive and epistatic components of variance: We used the --genome option in
PLINK (Purcell et al., 2007) (http://pngu.mgh.harvard.edu/purcell/plink/) to estimate the
pair-wise IBD sharing probabilities (Pr(IBD=0), or z0; Pr(IBD=1), or z1; Pr(IBD=2), or z2)
for all pairs of subjects in the sample, using genotype information from all the SNPs.
We then constructed an unrelated (only distantly related) sample by dropping one subject
from each pair of related subjects (kinship coefficient (k) >=0.025). This left 12,488
unrelated subjects. As mentioned above, we argue that the additive component of variance
can be estimated cleanly in an unrelated sample as long as epistatic effects decay faster
than additive effects as relationships become more distant. In addition, by selecting
pairs with higher IBD sharing, we constructed two related samples (z1>=0.3, n=1,415;
z1>=0.4, n=575) and again estimated the heritability of height using the variance
components approach. We argue that the resulting estimates include both additive and
epistatic components.
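The text does not state how one member of each related pair was chosen for removal. The R sketch below shows one possible rule, offered only as an assumption: a greedy procedure that repeatedly drops the subject involved in the most related pairs. The data frame `pairs`, its column names, the function `prune_related`, and the conversion from (z1, z2) to a kinship coefficient as z1/4 + z2/2 are all hypothetical choices made for illustration.

```r
# Sketch: drop one subject from each related pair (kinship >= threshold).
# `pairs` is a hypothetical data frame with columns id1, id2, z1, z2, standing
# in for pair-wise IBD estimates; the kinship formula z1/4 + z2/2 and the
# greedy rule below are assumptions, not the procedure reported in the text.
prune_related <- function(pairs, threshold = 0.025) {
  kin <- pairs$z1 / 4 + pairs$z2 / 2
  related <- pairs[kin >= threshold, c("id1", "id2")]
  dropped <- character(0)
  while (nrow(related) > 0) {
    # remove the subject involved in the largest number of related pairs
    counts <- table(c(related$id1, related$id2))
    worst <- names(counts)[which.max(counts)]
    dropped <- c(dropped, worst)
    related <- related[related$id1 != worst & related$id2 != worst, , drop = FALSE]
  }
  dropped
}

pairs <- data.frame(
  id1 = c("A", "A", "B", "D"),
  id2 = c("B", "C", "C", "E"),
  z1  = c(0.50, 0.50, 0.25, 0.02),
  z2  = c(0.25, 0.00, 0.00, 0.00),
  stringsAsFactors = FALSE
)
dropped <- prune_related(pairs)
dropped
```

In this toy example subjects A, B and C are mutually related, so two of them are dropped, while the weakly related pair D and E is kept intact.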
Genomic partitioning: In addition to estimating the heritability explained by all 966,578
common SNPs, we fit the model on several subsets of these SNPs to further investigate
the properties of the approach. We randomly dropped 90%, 70%, 50% and 20% of the
entire set of 966,578 SNPs and estimated the heritability of height explained by the
remaining SNPs.
Genome-wide coverage estimation
Genome-wide coverage was estimated using the imputation software MACH (Li et al., 2009;
Li et al., 2010). We used a 10% random sample of the SNPs on the chip on chromosome 21
(n=1,372) to impute the common SNPs (MAF>=0.05) on chromosome 21 that are documented in
the 1000 Genomes Project (http://www.1000genomes.org/, Feb 2012 release). We removed from
the reference panel SNPs violating HWE ($p<1\times10^{-6}$), rare SNPs (MAF<0.05), SNPs
having more than 2 alleles, and SNPs with a low genotype call rate (<95%). After quality
control, there were 121,039 common SNPs and 246 individuals of African ancestry in the
1000 Genomes Project reference panel. For comparison purposes, we also used all chromosome
21 SNPs on the chip (n=13,442) to impute the same set of 1000 Genomes Project SNPs. We
then obtained the $r^2$ value for imputing each of the missing SNPs.
Simulation studies
Heritability estimation excluding causal SNPs
We pruned SNPs from the whole genome and kept only those with a multiple-tagging
correlation coefficient ($R^2$) less than 0.25 using PLINK (Purcell et al., 2007)
(http://pngu.mgh.harvard.edu/purcell/plink/). Following that, we randomly selected 1,064
SNPs from the pruned SNP set and treated them as causal SNPs. A random effect model
was assumed where the effect size u was sampled from N(0,1). Upon calculating the
genetic score g, residual effect e was sampled from $N(0, \sigma_g^2(1-h^2)/h^2)$ to ensure
that heritability of the model equals $h^2$. Height data were then generated by summing the
additive genetic effects and residual effects.
We then excluded the causal SNPs from the pruned SNP set and estimated heritability of
height from the remaining SNPs, which ensured that all SNPs used in the estimation were
only weakly correlated with causal SNPs ($r^2 < 0.25$).
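The phenotype-generating step described above can be sketched in R as follows. This is a minimal illustration under stated assumptions rather than the code used for the reported simulations: the genotype matrix is a toy matrix of simulated, standardized allele counts, the dimensions and allele frequency are arbitrary, and the function name `simulate_phenotype` is hypothetical.

```r
# Sketch: simulate a phenotype with target heritability h2 from causal markers.
# Z is a toy matrix of standardized genotypes; the dimensions are hypothetical.
simulate_phenotype <- function(Z, h2) {
  u <- rnorm(ncol(Z), mean = 0, sd = 1)      # effect sizes u ~ N(0, 1)
  g <- as.vector(Z %*% u)                    # genetic score
  var_g <- var(g)                            # empirical sigma_g^2
  # residual variance sigma_g^2 (1 - h2) / h2 gives heritability h2
  e <- rnorm(nrow(Z), 0, sqrt(var_g * (1 - h2) / h2))
  list(y = g + e, g = g, e = e)
}

set.seed(2012)
n <- 2000; m <- 100
Z <- scale(matrix(rbinom(n * m, 2, 0.3), n, m))  # standardized allele counts
sim <- simulate_phenotype(Z, h2 = 0.5)
var(sim$g) / var(sim$y)                      # realized genetic share of variance
```

The realized ratio differs slightly from the target because the residuals are a finite random sample and are not exactly uncorrelated with the genetic score.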
Performance of variance components approach when p gets large
Consider the linear model that was mentioned in the Introduction
$$y_j = a + \sum_{i=1}^{p} b_i z_{ij} + e_j$$
where $z_{ij}$ is the genotype of SNP i in individual j, standardized to zero mean and unit
variance, and $\mathrm{var}(e_j)$ equals $\sigma_e^2$. We assume in this calculation that
$\sigma_e^2$ is known, so that the value SSM (the between-group sum of squares)/$\sigma_e^2$
follows a non-central $\chi^2$ distribution with p df. Note that this is proportional to the
numerator of the F-test in ANOVA. Since we assume that $\sigma_e^2$ is known, we can use a
$\chi^2$ test as the omnibus test rather than use an F test.
The non-centrality parameter (ncp) in this case is
$$\mathrm{ncp} = \frac{(Zb)^T(Zb)}{\sigma_e^2}$$
As we included more independent markers into the model, we adjusted the non-centrality
parameter (numerator of the F-test in ANOVA times its df) to fix the power at 80%. In
order to do this, we used the following R code to get the empirical relationship between
the number of markers in the model (p) and ncp:
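power_chisq = function(p, ncp)
{
  # critical value of the central chi-square test with p df at alpha = 0.05
  crit = qchisq(0.05, p, lower.tail = FALSE)
  # power: probability that a non-central chi-square (p df, ncp) exceeds crit
  pchisq(crit, p, ncp, lower.tail = FALSE)
}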
Using this function we were able to get a set of ncp values, one for each value of p, that
fix the power of the omnibus test (with $\sigma_e^2$ known) at 80%. Fitting a linear model
of the ncp values on the square root of p results in an $R^2$ of nearly 1 with the
following model:
$$\mathrm{ncp}(p, \mathrm{power}=0.8) = 4.426 + 3.5569 \times \sqrt{p}$$
which we find approximates the ncp at which a $\chi^2$ test with p df has power of 0.8 with
type I error equal to 0.05.
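The fitting step can be sketched in R as below. The grid of p values, the root-finding interval, and the helper `ncp_for_power` are our own illustrative choices and are not given in the text, so the fitted coefficients need not match 4.426 and 3.5569 exactly.

```r
# Power of a chi-square omnibus test with p df and noncentrality ncp
power_chisq <- function(p, ncp, alpha = 0.05) {
  crit <- qchisq(alpha, p, lower.tail = FALSE)
  pchisq(crit, p, ncp, lower.tail = FALSE)
}

# Solve power_chisq(p, ncp) = target for ncp by root finding.
# The search interval is an assumption chosen wide enough for this grid.
ncp_for_power <- function(p, target = 0.8) {
  uniroot(function(ncp) power_chisq(p, ncp) - target,
          interval = c(0, 50 + 10 * sqrt(p)))$root
}

p_grid <- c(1, 5, 10, 50, 100, 500, 1000)    # illustrative grid of marker counts
ncp_grid <- sapply(p_grid, ncp_for_power)

# Linear model of ncp on sqrt(p), as described in the text
fit <- lm(ncp_grid ~ sqrt(p_grid))
coef(fit)
summary(fit)$r.squared
```

A different grid of p values would give slightly different coefficients, which is why the published intercept and slope are best read as an empirical approximation.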
To simplify things we start by assuming that the p markers are independent of each
other and that each has the same effect size, b; then the ncp is:
$$\mathrm{ncp} = \frac{(Zb)^T(Zb)}{\sigma_e^2} = \frac{b^T Z^T Z b}{\sigma_e^2} = \frac{npb^2}{\sigma_e^2}$$
which leads to
$$b = \sqrt{\frac{4.426 + 3.5569\sqrt{p}}{np}} \times \sigma_e$$
It is interesting to consider how much heritability is expected in this model in order for
the effects to be detectable (power=0.8). Based on the model, the following formula gives
the heritability:
$$h^2 = \frac{\sum_{i=1}^{p} b_i^2}{\sum_{i=1}^{p} b_i^2 + \sigma_e^2} = \frac{pb^2}{pb^2 + \sigma_e^2} = \frac{\frac{4.426 + 3.5569\sqrt{p}}{n}}{\frac{4.426 + 3.5569\sqrt{p}}{n} + 1} = \frac{4.426 + 3.5569\sqrt{p}}{4.426 + 3.5569\sqrt{p} + n}$$
which increases to 1 as p increases when n is fixed.
A plot of the heritability in relation to p, when fixing n at 1000 and power at 0.8, is
shown in Supplemental Figure 5-1. With n=1000, a heritability of around 52% is needed
in order for the global effect of the 100,000 markers to be detected at power of 0.8, which
is a reasonable requirement for phenotypes like height. If we increase the sample size n,
the heritability requirement is lowered as more information is included.
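The heritability requirement can be evaluated directly from the closed form above. The R sketch below plugs in the fitted coefficients quoted in the text; the function name `required_h2` and the additional sample sizes are illustrative assumptions. Under this fitted approximation the value for n=1000 and p=100,000 comes out at about 0.53, close to the figure of around 52% quoted above.

```r
# Heritability implied by the fitted relationship ncp = 4.426 + 3.5569 * sqrt(p)
# for an omnibus test with 80% power (coefficients taken from the text).
required_h2 <- function(p, n) {
  ncp <- 4.426 + 3.5569 * sqrt(p)
  ncp / (ncp + n)
}

h2_n1000 <- required_h2(p = 1e5, n = 1000)
h2_by_n <- required_h2(p = 1e5, n = c(1000, 5000, 14419))  # decreases as n grows
round(c(h2_n1000, h2_by_n), 3)
```

For fixed p the required heritability falls as n increases, and for fixed n it rises toward 1 as p increases, which matches the behaviour described in the text.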
We simulated p independent genotypes for each individual from the standard normal
distribution, and applied to each genotype the effect size (b), which is determined by p.
The parameter $\sigma_e^2$ was fixed at 1.
To compare the power of the omnibus test and the variance components approach, we
used ANOVA and PROC MIXED in SAS to test the genetic effect, respectively. In the
latter case, we fit a linear structured covariance matrix model which can be interpreted as
a mixed model with random effects for each SNP (see Supporting Information)
(Anderson, 1973).
We carried out the simulation with n=1000 subjects and different numbers of SNPs (p,
from 1 to $1\times10^6$), and each trial was replicated 1000 times.
Results
Heritability estimation among unrelated and related individuals
While we did not identify any monozygotic (MZ) twins or duplicate samples (z2=1), we
found evidence of first-degree relatives (parent-offspring pairs and siblings) in the sample
(Supplemental Figure 5-2). Our independent sample (k<0.025) should have excluded
possible parent-offspring pairs (z1=1), siblings or half siblings (z1=0.5) and cousins
(z1=0.25). In the independent sample the 966,578 SNPs as a whole explain 44.7% (se:
3.7%) of the total phenotypic variance (Figure 5-1), which is close to the fraction that
Yang et al (2010) reported as being explained by 294,831 common SNPs in a sample of
Australian adolescents (44.9%; se: 8.4%; N=3,925) (Yang et al., 2010).
A significant increase in the estimation of genetic variance component was noted,
however, when considering closely related subjects. In the sample consisting of pairs of
subjects with pair-wise z1>=0.3, we found that 76.5% (se: 11.7%) of total phenotypic
variance could be explained by the set of 966,578 SNPs, while in the sample where pair-
wise z1>=0.4, the fraction was 75.1% (se: 13.3%) (Figure 5-2). This is close to the 80%
heritability of height estimated from twin studies (Fisher, 1918; Macgregor et al., 2006;
Silventoinen et al., 2003). In fact, if we subtract the additive component of variance that
we obtained from estimation in unrelated subjects from the total of genetic component
from close relatives, we are able to calculate the epistatic (or epistatic plus dominant)
component of variance which is around 30% assuming no shared environmental
exposures.
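The subtraction described above can be written out explicitly. The R sketch below uses the estimates quoted in this section for the unrelated sample and for the related sample with z1>=0.3. The standard error of the difference is computed under the assumption that the two estimates are independent, which is our own simplification and may not hold if the two samples share subjects.

```r
# Estimates quoted in the text (proportions of phenotypic variance)
h2_unrelated <- c(est = 0.447, se = 0.037)
h2_related   <- c(est = 0.765, se = 0.117)   # sample with z1 >= 0.3

# Non-additive (epistatic or epistatic plus dominant) component by subtraction
diff_est <- h2_related[["est"]] - h2_unrelated[["est"]]

# Standard error under the ASSUMPTION that the two estimates are independent;
# the two samples may share subjects, so this is only a rough guide.
diff_se <- sqrt(h2_related[["se"]]^2 + h2_unrelated[["se"]]^2)
round(c(difference = diff_est, se = diff_se), 3)
```

The difference is about 0.32, in line with the figure of around 30% given above, and its approximate standard error of about 0.12 shows that this component is estimated with considerable uncertainty.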
Genomic partitioning of variance
We found that in the sample that excluded close relatives a random set of about 10% of
the SNPs (n=97,035) of the whole genome explained about 26.7% (se: 2.6%) of the
phenotypic variance (Figure 5-1). In other words, 10% of the SNPs explained more than
half of variation that the entire set of SNPs can explain. As more SNPs were included in
the model, the proportion of variance explained by the model showed an increasing
pattern, which appeared to tail off when the number of SNPs reached a certain point
(Figure 4-1). For example, with roughly half of the SNPs (n=483,913), the proportion of
phenotypic variation explained already reached 39.4% (se: 3.4%), which means half of
the SNPs could do almost as well as one million SNPs in modeling genetic variation.
Genome-wide coverage estimation
Imputation analysis on chromosome 21 showed that using 10% of SNPs on the chip, the
mean $r^2$ was only 0.28 for the 1000 Genomes Project common SNPs, as compared to a
mean $r^2$ of 0.89 when using all the SNPs on chromosome 21. In particular, only 8% of
1000 Genomes Project common SNPs could be imputed with $r^2 > 0.8$ when using 10% of
the SNPs, while in the case of using all the SNPs, 84% of them could be well-imputed
($r^2 > 0.8$) (Table 4-1). Imputation results indicate that although 10% of SNPs could
explain more than half of the heritability that was explained by all one million SNPs, the
genomic coverage was far less than half of the coverage from all one million SNPs.
Estimating heritability from non-causal SNPs
We pruned the entire set of SNPs using a multiple tagging process, leaving only SNPs
with $r^2$ less than 0.25 when regressed on all the other SNPs within a 50-SNP window.
After that, 454,598 SNPs remained in the data set. A random set of 1,064 SNPs was
selected from the independent set as causal SNPs. The genomic inflation factor ($\lambda$)
for the genome-wide scan on the simulated phenotype was 1.08 after adjusting for the first
10 PCs, age and study, which was smaller than what was obtained from the genome-wide
scan on real data (1.11) (Supplemental Figures 5-3 and 5-4). Note that some
over-dispersion of the Q-Q plots is implied by a highly polygenic inheritance pattern of
height (Yang et al., 2011c).
Under the setting of $h^2$ = 50%, the entire set of pruned SNPs, including causal SNPs, was
estimated to explain 50.1% (se: 5.7%) of the phenotypic variance, whereas the reduced
set excluding causal SNPs was estimated to explain 37.8% (se: 6.2%) of total variation in
height (Figure 4-3). Note that the entire set was pruned and SNPs in this set only
correlated with others at $r^2 < 0.25$. Therefore the reduced set of SNPs could be expected
to explain less than 50.1% × 0.25 = 12.5% of the total variation if the genetic component
of variance estimated were due to the effects of causal SNPs. Similar results were noted
when $h^2$ varied from 0.2 to 0.4 (Figure 4-3). We had concerns that controlling for
population structure using PCs could over-adjust the model, since we generated data in
such a way that genotypes of causal SNPs were the only risk factors that contribute to
phenotypic variation; however, results obtained from unadjusted models were reasonably
similar to those from full models (results not shown).
Performance of the variance components approach as p gets large
We constrained the sample size (n) to 1000, and simulated situations where the number of
markers in the model (p) ranged from 1 to $1\times10^5$. In the simulation we fixed the
power to detect SNP effects using an omnibus approach with known error variance
$\sigma_e^2$ at 80% (see Methods). Upon generating the phenotype data, we applied both the
omnibus approach (ANOVA) and the variance components approach to the data for power
estimation. The omnibus approach was initially at least as powerful as the variance
components approach, but, as expected, its power gradually decreased as p increased and
dropped steeply as p approached 1000. On the other hand, the power of the variance
components approach stabilized between 60% and 70% as p varied from 1 to 1,000. As p
continued to increase, the power started to decrease, reaching 47% when p was as large as
10,000 with a sample size of 1,000. The type I errors ranged from 0.038 to 0.072, but in
most cases they were around 0.05 (Table 4-2).
Our results indicate that the variance components approach was quite robust against
increasing the number of independent variables in the model. In our case of a sample size
of 1,000, as long as the number of genetic markers did not exceed 10,000, the variance
components approach remained reliable. As was stated above, we were particularly
concerned that in an independent sample, the G matrix on which we regress the phenotypic
variance ($\mathrm{var}(y) = \sigma_g^2 G + \sigma_e^2 I$, see Introduction and Methods)
would approach I as we increase the number of genetic markers used to calculate G, which
may cause convergence problems or unstable estimation of variance components. Although in
the specific setting here the variance components approach appeared to be more robust than
the omnibus approach and was found to be a valid method for multivariate analysis when
the number of independent variables is less than 10,000 with a sample size of 1,000,
simulation showed that the approach began to “collapse” when p reached a certain limit.
With 50,000 independent markers in the model (50 times the sample size), power
dropped to only 26%, and with 100,000 independent markers power was only 17%
(Table 4-2). A closer look at the results indicated that with 100,000 independent markers,
the approach failed to converge in about half of the simulations (Table 4-2).
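The concern that G approaches I in a truly independent sample can be illustrated numerically. The R sketch below draws independent standardized genotypes from N(0, 1), as in the simulation described in Methods, and reports the largest absolute off-diagonal element of G = ZZ'/p for two marker counts. The sample size of 50, the two values of p, and the function name `max_offdiag` are illustrative assumptions chosen to keep the example small.

```r
# Sketch: for independent subjects, off-diagonal elements of G = ZZ'/p shrink
# toward 0 (so G approaches I) as the number of independent markers p grows.
# Genotypes are drawn from N(0, 1) as in the simulation described in Methods;
# the sample size and marker counts here are illustrative only.
max_offdiag <- function(n, p) {
  Z <- matrix(rnorm(n * p), n, p)
  G <- tcrossprod(Z) / p
  max(abs(G[upper.tri(G)]))
}

set.seed(42)
few  <- max_offdiag(n = 50, p = 100)
many <- max_offdiag(n = 50, p = 10000)
c(p_100 = few, p_10000 = many)
```

Each off-diagonal element has a standard deviation of roughly $1/\sqrt{p}$, so with many independent markers there is little variation in G left on which to regress the phenotypic covariance.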
Discussion
Using a variance components approach, we estimated the heritability of human height in
a sample of 14,419 subjects of African ancestry. We found that around 44.7% of
phenotypic variance of height was attributable to the additive effect of genetic variants in
populations of African ancestry. In related samples, our results showed that the genetic
component of variance explained about 75% of phenotypic variation. We also noticed
that, in the earlier sample where all individuals were unrelated, losing 90% of the SNPs
on the 1M array only resulted in a loss of less than half of their contribution to
phenotypic variation, which seemed to imply that it was unlikely to be the effect of
captured causal variants alone that explained the heritability. With simulation studies, we
were able to further confirm that SNPs that were only weakly correlated with causal
SNPs ($r^2 < 0.25$) explained a larger fraction of variation than expected. Finally,
simulation studies showed that as the number of independent variables increased, the
variance components approach failed to perform well beyond a certain point in a sample of
truly independent subjects.
Family studies of twins have estimated that heritability for height is around 80% (Fisher,
1918; Macgregor et al., 2006; Silventoinen et al., 2003). Although GWAS of height have
discovered hundreds of common risk variants, little evidence has been found to suggest
that these common variants could explain as much as 80% of phenotypic variation. For
example, Lango Allen et al estimated that approximately 10% of variation in adult height
could be explained by common variants from 180 risk loci (Lango Allen et al., 2010).
Yang et al (2010) employed a variance components approach and concluded that 45% of
phenotypic variance in height could be explained by common variants (Yang et al., 2010),
which, as we argue in the present paper, might be an overstatement of the genetic effects
of common variants, and in particular, an overestimate of the effects of the specific set of
294,831 common SNPs measured in their study. First of all, we noted that although 10%
of the entire one million SNPs have poor genome-wide coverage for the common SNPs
(MAF>=0.05) in the 1000 Genomes Project compared to the coverage from all the one
million SNPs (8% vs. 84% with imputation $r^2$ > 0.8), they could nonetheless explain more
than half of the variation that was explained by the latter. If it were the effects of captured
causal SNPs that contributed to the estimated genetic variation, we might expect 10% of
the SNPs to explain a much smaller proportion of the variance, since they are only 1/10
as effective as all the one million SNPs in terms of capturing common variants in the
genome. In addition, through our simulation study we were able to confirm that even
when causal SNPs and SNPs in strong correlation ($r^2$ >= 0.25) with them were excluded
from the model, the remaining SNPs could still explain more than half of the variation.
Again, since the two sets of SNPs are only weakly correlated, we would expect little
information of the causal SNPs to be captured by the non-causal SNPs in the model, yet
the non-causal SNPs seemed to capture a fairly large amount of phenotypic variation.
Evidence we obtained so far indicates that the estimated amount of phenotypic variation
explained by the model may not solely be due to the genetic effects of causal variants that
are tagged by the markers.
As was mentioned before, the G matrix that was used to model genetic distance is really
the Balding-Nichols matrix multiplied by a factor of two (see Introduction) (Astle and
Balding, 2009). The Balding-Nichols matrix is an unbiased estimator of the kinship
matrix K if the allele frequencies are known (Astle and Balding, 2009). Under simple
genetic models, the accuracy in estimating relatedness among subjects using this G
matrix will increase as more and more independent markers are included. In this case, the
G matrix has expected value equal to the kinship matrix when allele frequencies of the
markers are known. In an unrelated population, the kinship matrix is a diagonal matrix
which has the diagonal elements $K_{ii} = (1+F_i)/2$, with $F_i$ being the inbreeding coefficient of
individual i. In an outbred and unrelated population, the kinship matrix will have all its
diagonal elements equal to 1/2 which is just a factor (1/2) times the identity matrix I. It
became apparent then that as more and more independent markers are included in the
model, the G matrix would approach a constant matrix which is not dependent on the
particular set of genetic markers or the frequencies or even the type of genetic markers
(which could be SNPs, copy number variations (CNVs), etc). The estimate of the genetic
component of variance ($\sigma_g^2$) would approach a scalar constant in this case. In our analysis,
the genetic component of variance increased as more SNPs were included in the model,
and reached a plateau at about 500,000 SNPs, which complied with our expectation that
the estimation tends to approach a constant (Figure 4-1). In samples of closely related
subjects, it takes fewer SNPs to reach the plateau (Figure 4-2) due to higher genetic
resemblance among closer relatives.
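The convergence argument above can be illustrated with a small simulation. The sketch below is not the analysis code used in this chapter. It assumes unrelated, outbred subjects, independent biallelic markers in Hardy-Weinberg equilibrium with known allele frequencies, and a relationship matrix built from standardized genotypes (so that its expectation is twice the kinship matrix, i.e., the identity matrix). All function names, the seed, and the sample sizes are illustrative choices.

```python
import random

def simulate_genotypes(n_subjects, freqs, rng):
    """Draw 0/1/2 genotypes for unrelated subjects under Hardy-Weinberg equilibrium."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n_subjects)]

def relationship_matrix(genotypes, freqs):
    """G_ij = mean over markers of (x_i - 2p)(x_j - 2p) / (2p(1 - p))."""
    m = len(freqs)
    # Standardize each genotype so that every marker has mean 0 and variance 1.
    z = [[(x - 2 * p) / (2 * p * (1 - p)) ** 0.5 for x, p in zip(row, freqs)]
         for row in genotypes]
    n = len(z)
    return [[sum(a * b for a, b in zip(z[i], z[j])) / m for j in range(n)]
            for i in range(n)]

def offdiag_rms(G):
    """Root-mean-square of the off-diagonal elements of G."""
    n = len(G)
    vals = [G[i][j] ** 2 for i in range(n) for j in range(n) if i != j]
    return (sum(vals) / len(vals)) ** 0.5

def mean_diag(G):
    """Average of the diagonal elements of G."""
    return sum(G[i][i] for i in range(len(G))) / len(G)

rng = random.Random(2012)   # fixed seed so the run is reproducible
n_subjects = 30             # kept small so the pure-Python loops stay fast
results = {}
for m in (50, 500, 5000):
    freqs = [rng.uniform(0.05, 0.5) for _ in range(m)]
    G = relationship_matrix(simulate_genotypes(n_subjects, freqs, rng), freqs)
    results[m] = (mean_diag(G), offdiag_rms(G))
    print(m, round(results[m][0], 3), round(results[m][1], 3))
```

Under these assumptions the diagonal stays near 1 while the off-diagonal scatter shrinks roughly like 1/sqrt(m), so G drifts toward the identity matrix whatever the particular marker set is.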
We did note that, when more SNPs were used in the model, the heritability estimate
was almost always higher. Intuitively, higher heritability results from a more complete
set of captured risk loci; however, our results suggest that heritability estimation from the
variance components approach is not likely to be due to genetic effects of causal alleles
only, but more likely due to the relatedness captured by these SNPs; we know that the G
matrix estimates relatedness more accurately when more independent markers are used. It
has long been recognized that errors of measurement lead to attenuated estimation of
regression coefficients which cause them to be biased to the null (Cochran, 1968; Fuller
and Hidiroglou, 1978; Walker and Lev, 1953). Fitting the model $\mathrm{var}(y) = \sigma_g^2 G + \sigma_e^2 I$ is
basically equivalent to a weighted regression of the cross products of the outcome
$(y - E(y))(y - E(y))'$ on the elements of G and I (Harville, 1977; Yang et al., 2010). We argue
that measurement error, or inaccurate measurement of relatedness in this case, is the main
cause of the lower heritability estimation with smaller number of SNPs in the model.
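The attenuation effect referred to above can be shown in its simplest form. The following sketch covers only the classical scalar case, in which a predictor x is observed with independent additive error; it is an analogy for noisy relatedness estimates, not a simulation of the variance components model itself. The sample size, error standard deviations, and names are assumptions made for illustration. In this setting the expected slope is the true slope multiplied by var(x) / (var(x) + var(error)).

```python
import random

def ols_slope(x, y):
    """Ordinary least squares slope of y regressed on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(1968)   # fixed seed so the run is reproducible
n = 20000
beta = 1.0                  # true slope
x = [rng.gauss(0, 1) for _ in range(n)]
y = [beta * xi + rng.gauss(0, 1) for xi in x]

slopes = {}
for error_sd in (0.0, 0.5, 1.0):
    # w is the error-prone measurement of x that the analyst actually observes.
    w = [xi + rng.gauss(0, error_sd) for xi in x]
    slopes[error_sd] = ols_slope(w, y)
    expected = beta / (1 + error_sd ** 2)   # var(x) = 1 in this simulation
    print(error_sd, round(slopes[error_sd], 3), round(expected, 3))
```

The estimated slope moves toward zero as the measurement error grows, which is the direction of bias argued for heritability estimates based on fewer SNPs.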
Our results showed that, with 1,000 truly independent subjects, the power of the approach
dropped drastically when the number of markers reached 50,000 or more. One
possible explanation for the reduced power is that, as mentioned before, as more independent
markers were included, the G matrix gradually approached the kinship matrix, which is
proportional to the identity matrix in this situation. This implies a lack of identifiability
in the model between $\sigma_g^2$ and $\sigma_e^2$, resulting in a failure of maximum
likelihood estimation (MLE) to distinguish the two parameters.
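The identifiability problem can be written out directly. The statement below is a short worked step under the idealized assumption that G equals the identity matrix exactly; in finite samples G only approximates I, so the two components are weakly rather than completely unidentified.

```latex
\[
\operatorname{var}(y) = \sigma_g^2 G + \sigma_e^2 I
\;\longrightarrow\;
(\sigma_g^2 + \sigma_e^2)\, I
\qquad \text{as } G \to I .
\]
% Any two pairs (\sigma_g^2, \sigma_e^2) with the same sum give the same
% covariance matrix and therefore the same likelihood, so only the sum
% \sigma_g^2 + \sigma_e^2 can be estimated in the limit.
```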
It has been controversial how important a role common variants play in contributing to
heritable variation of common diseases or traits, which has led to various studies aimed at
estimating the amount of variability that common variants are able to explain. Our study
showed that results from a variance components approach could overestimate the
proportion of heritability explained by common variants; the approach could instead be
estimating the extent to which heritability is determined by relatedness. Our results
indicate that the additive heritability of human height is around 40% to 50%, which is
lower than the 80% heritability reported by earlier twin studies; even so, this heritability is
not necessarily due to additive effects of common variants alone, especially those in strong LD with
the measured markers. Larger GWAS/meta-analyses are being performed, aimed at
revealing common risk variants that remain undiscovered due to lack of power. However,
it is unclear how much heritability could be explained by those common variants whose
effect sizes are expected to be progressively smaller than those already discovered
(Manolio et al., 2009). Cis-interactions between common variations, or multi-SNP
haplotypes could contribute to missing heritability as well. Finally, although studies
employing the variance components approach concluded that common variants might
explain a large proportion of heritability of human height, we are inclined to regard that
conclusion as unproven. Further exploration is required to characterize the genetic risk
structure of human height and to resolve the missing heritability problem.

Table 4-1. Coverage of the 1000 genomes project SNPs on chromosome 21 by
different numbers of SNPs^a.

                               1/10 of the SNPs on the chip   All the SNPs on the chip
                               (n = 1,372)                    (n = 13,442)
Mean $r^2$                     0.28                           0.89
Proportion that $r^2$ > 0.8    8%                             84%

^a A total of 121,039 common SNPs (MAF>=0.05) included in the reference panel. $r^2$ values are from MACH output.

Table 4-2. Power and type I error of the omnibus approach and variance
components approach with different number of markers^a.

               Omnibus approach         Variance components approach
# predictors   Type I error   Power     Type I error   Power
1              0.038          0.79      0.072          0.57
25             0.062          0.78      0.064          0.67
50             0.058          0.79      0.048          0.68
100            0.042          0.74      0.042          0.67
200            0.036          0.69      0.043          0.64
300            0.076          0.67      0.064          0.66
400            0.046          0.59      0.054          0.64
500            0.054          0.56      0.054          0.65
700            0.038          0.41      0.046          0.65
900            0.056          0.19      0.046          0.62
1000           NA             NA        0.056          0.59
3000           NA             NA        0.054          0.56
5000           NA             NA        0.052          0.52
8000           NA             NA        0.038          0.49
1×10^4         NA             NA        0.040          0.47
5×10^4         NA             NA        0.062          0.26
1×10^5         NA             NA        0.058          0.17

^a Number of subjects (n) is 1000.

Figure 4-1. Proportion of phenotypic variation explained by different number of
SNPs in an independent sample (k<0.025).
Figure 4-2. A comparison of phenotypic variation explained by different number of
SNPs in independent samples and close relatives.
Figure 4-3. Simulation results of heritability explained by SNPs only weakly
correlated with causal SNPs.
Conclusion
An ultimate goal of genome-wide polygenic analysis is disease prediction from the
genotypes of a set of risk markers. In order for this goal to be achieved, the genetic
markers selected should efficiently capture the functional alleles associated with the
disease in a specific population; meanwhile, a reasonably complete catalogue of markers
is needed in order to have enough discrimination power to predict the particular disease.
In Chapter One, I assessed the generalizability of common breast cancer risk-associated
SNPs that were identified by GWAS in multiple populations. Through careful
examination of the associations in different ethnic groups, we concluded that the known
risk-associated SNPs, which were identified in European and/or Asian populations, might
not serve as valid genetic markers for breast cancer in other populations, especially in
African Americans. As is known, the LD blocks in people of African ancestry are
noticeably shorter than in other populations, which means lower tagging efficiency for
genetic markers. Different allele frequencies might be another obstacle that hinders the
generalization of GWAS-identified risk markers across populations. Following up these
results, we fine-mapped the risk regions for breast cancer in African Americans, and
found new markers in several regions that more efficiently capture the signals, as is
described in Chapter Three. In Chapter Two we evaluated the pleiotropic effects on breast
cancer of SNPs that are known to be associated with type 2 diabetes and/or obesity risk,
since there are established biological interrelations among the risks of these
diseases/traits.
It is generally accepted that the common risk-associated variants identified by GWAS so
far can only explain a small proportion of variation for common diseases (Gail, 2008;
Lango Allen et al., 2010; Manolio et al., 2009). More well-powered GWAS are being
conducted (in multiple populations) to look for variants with smaller effect sizes or lower
minor allele frequencies to complete the risk set. So, how complete is complete? As
mentioned earlier, Gail (2008) pointed out that it takes hundreds of common variants with
modest effect sizes to improve the prediction of breast cancer risk. Below we use a
score statistic approach to calculate the number of common variants needed to explain the
familial relative risk (FRR) of a given complex disease.
FRR is commonly used in clinical practice as a description of the inheritable proportion
of a disease that has a binary outcome (diseased versus non-diseased). It is defined as the
ratio of the conditional probability of disease given that a relative is diseased to the disease
rate in the general population, i.e.,
$$\mathrm{FRR} = \frac{P(D_O = 1 \mid D_F = 1)}{P(D_O = 1)}$$
For the derivations conducted below we assume an additive model, i.e., the heritable
component of the disease is due to additive genetic effects only. Starting from the
simplifying assumption that all the functional alleles have similar risk allele frequencies
and are all in linkage equilibrium, the number of independent functional alleles (k) that
are needed to explain an FRR of R can be approximated by the following formula:
$$k = \frac{\log(R)}{2p(1-p)\,\rho\,\beta^2}$$
where p is the risk allele frequency, β is the effect size of the risk allele, and ρ is the
relatedness between the relatives considered, which equals 0.5 for first-degree relatives.
The derivation using a risk score approach is shown in the appendix.
The formula indicates that the number of common variants that underlie an FRR of R is
determined by the allele frequency and effect size of the variants. This number could be
huge for rare variants. For example, the number of variants with MAF=0.001 needed to
achieve the same FRR is 48 times the number of variants with MAF=0.05 assuming that
they have the same effect size.
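The formula for k can be evaluated directly. The sketch below is a plain transcription of it; the function name, the FRR of 2, and the per-allele effect size of log(1.1) are illustrative assumptions, not values taken from the analyses in this dissertation. Because the effect size is held fixed, the ratio of the two k values depends only on p(1-p).

```python
import math

def variants_needed(frr, p, beta, rho=0.5):
    """Approximate number of independent risk alleles: k = log(R) / (2 p (1 - p) rho beta^2)."""
    return math.log(frr) / (2 * p * (1 - p) * rho * beta ** 2)

beta = math.log(1.1)   # illustrative per-allele log relative risk (assumption)
k_common = variants_needed(frr=2.0, p=0.05, beta=beta)
k_rare = variants_needed(frr=2.0, p=0.001, beta=beta)
ratio = k_rare / k_common   # equals 0.05 * 0.95 / (0.001 * 0.999)
print(round(k_common), round(k_rare), round(ratio, 1))
```

The ratio comes out just under 48, in line with the roughly 48-fold figure quoted above.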
The FRR of a particular disease is often known. For example, prostate cancer is believed
to have an approximately two-fold first-degree FRR (Bruner et al., 2003; Johns and
Houlston, 2003; Zeegers et al., 2003). For complex diseases where multiple risk-
associated loci are discovered, the relative risk between the carriers of the top quantile of
the risk score and the bottom quantile of the risk score is often used to assess the effect of
the polygenes. If we consider the top vs. bottom quintiles of a risk score comprising k
SNPs, each with allele frequency p and effect size β, the risk score is approximately
distributed as N(2kp, 2kp(1-p)). Taking the medians of the top and bottom quintiles,
i.e., the 90th and 10th percentiles of the normal distribution, each lies 1.28 standard
deviations from the mean. Therefore the relative risk between the top and bottom quintiles is
$$RR = e^{\beta \times 2 \times 1.28 \times \sqrt{2kp(1-p)}} = e^{2.56\,\beta \sqrt{2p(1-p) \times \frac{\log(R)}{2p(1-p)\rho\beta^2}}} = e^{2.56 \sqrt{\log(R)/\rho}}$$
According to this formula, a relative risk of ~10 between people in the top and
bottom quintiles of the risk score is expected for a disease with an FRR of 1.5, assuming an
additive model. We clearly have not yet achieved this level of discrimination for most, if not all,
complex diseases studied by GWAS so far. For example, the relative risk of breast
cancer between women within the top and bottom quintiles of a risk score consisting of 19
known risk-associated SNPs is 1.4 in African Americans, which is much weaker than the
expected 10-fold relative risk (Chen et al., 2011). In other words, many causal variants
are still missing from the complete set of genetic markers that contribute to breast cancer
risk.
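The ~10-fold figure can be checked numerically. The sketch below evaluates the relative risk in two ways: from k, p, and β through the normal approximation of the risk score, and from the closed form that depends only on the FRR and ρ. The function names and the values of p and β are illustrative assumptions; the closed form implies the result should not depend on them.

```python
import math

def variants_needed(frr, p, beta, rho=0.5):
    """k = log(R) / (2 p (1 - p) rho beta^2)."""
    return math.log(frr) / (2 * p * (1 - p) * rho * beta ** 2)

def quintile_rr_from_k(k, p, beta, z=1.28):
    """RR = exp(beta * 2z * sqrt(2kp(1-p))) for a score with variance 2kp(1-p)."""
    return math.exp(beta * 2 * z * math.sqrt(2 * k * p * (1 - p)))

def quintile_rr_from_frr(frr, rho=0.5, z=1.28):
    """Closed form after substituting k: RR = exp(2z * sqrt(log(R) / rho))."""
    return math.exp(2 * z * math.sqrt(math.log(frr) / rho))

p, beta = 0.2, math.log(1.1)   # illustrative values (assumption)
k = variants_needed(1.5, p, beta)
rr_direct = quintile_rr_from_k(k, p, beta)
rr_closed = quintile_rr_from_frr(1.5)
print(round(rr_direct, 2), round(rr_closed, 2))
```

Both routes give a relative risk of about 10 for an FRR of 1.5 with first-degree relatives, matching the statement above.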
Heritability is also used to describe the inheritance property of a phenotype, especially in
the case of quantitative traits, but also has been extended to binary traits using a threshold
liability model (Yang et al., 2011a). So far, the GWAS-identified common SNPs can only
explain a small proportion of heritability for most complex diseases. Although Yang et al
concluded in their paper that common variants could explain a large proportion of
heritability for human height (Yang et al., 2010), we consider that the heritability
estimate is not due to the effect of common variants alone, for the reasons stated in
Chapter Four. Still more well-powered GWAS are being performed today, uncovering
common risk variants with even milder effect sizes. Because of the progressively smaller
effect sizes, the number of SNPs expected to fully explain the heritability becomes even
larger.
It has been nearly ten years since the first GWAS was launched (Klein et al., 2005). This
flourishing era of GWAS has led to the discovery of more than 5,000 common variants
associated with over 200 common diseases (http://www.genome.gov/gwastudies/).
Although GWAS have provided some insights in the biological pathways for common
diseases, the results have been unsatisfying in terms of the proportion of phenotypic
variation explained by the variants discovered. Does the unexplained heritability hide in
common variants with milder effect sizes that remain undiscovered due to lack of power
(as is described above)? Do common variants play as essential a role in the genetic
architecture of common diseases as we have thought? Different views have been raised on
whether we should continue to pursue GWAS on the current genotyping platforms,
or whether a completely new class of genetic markers (e.g. rare variants) is needed to fully explain
the heritability. I will close my dissertation with a review of the current major opinions
on the missing heritability issue to provide some perspectives, followed by a description
of how our work contributes to the literature.
Zuk et al pointed out that current estimates of heritability can be seriously inflated because
they rest on the erroneous assumption that all genetic effects are additive (Zuk et al., 2012).
Using the limiting pathway (LP) model, the authors showed that gene by gene
interactions could constitute the major fraction of missing heritability (Zuk et al., 2012).
In the debate over whether we are on the right path to find all the genetic risk factors
with the current GWAS method, Zuk’s opinion seems to be the most pessimistic, calling
for a halt to the association testing of additive genetic effects. Instead, the authors
argued that the focus of GWAS should be placed on detecting genetic interactions. Over
the past years few significant signals have been found in the examination of pair-wise
gene by gene interactions using GWAS data. The major obstacle for detecting genetic
interaction seems to be limited power: on the one hand, the effect sizes of genetic
interactions are expected to be much smaller than those of linear effects; on the other hand, the
number of hypotheses would be enormous if we consider all possible interactions,
including pair-wise, three-way, four-way, etc. interactions among SNPs in the genome. Even if
we assume that the interaction effect sizes are similar to additive effects, for example, for
SNPs with an MAF of 0.2 and a genetic interaction OR of 1.2, it takes a sample size of ~36,000 to
achieve 80% power for testing genome-wide pair-wise interactions in a case-control
design (assuming the main effects are additive) (Gauderman, 2002; Gauderman and
Morrison, 2006). With collaboration among research groups, such as data sharing,
this sample size seems achievable. But again, as effect size drops, the sample size would
increase proportionally to the inverse of the square of effect sizes (Zuk et al., 2012),
making the pursuit of genome-wide testing for genetic interactions very difficult, if not
impossible (and we are only considering pair-wise interactions here).
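The inverse-square scaling mentioned above gives a rough sense of how quickly the required sample size grows. The sketch below is a back-of-the-envelope extrapolation, not a power calculation: it assumes that "effect size" means the log odds ratio, that the reference point is the ~36,000 subjects at an interaction OR of 1.2 quoted above, and that everything else (MAF, significance threshold, design) is held fixed. The function name is hypothetical.

```python
import math

def scaled_sample_size(n_ref, or_ref, or_new):
    """Scale a reference sample size by the inverse square of the log odds ratio."""
    return n_ref * (math.log(or_ref) / math.log(or_new)) ** 2

n_ref = 36000   # reference sample size at OR = 1.2, taken from the text above
for odds_ratio in (1.2, 1.15, 1.1, 1.05):
    print(odds_ratio, round(scaled_sample_size(n_ref, 1.2, odds_ratio)))
```

Under these assumptions, halving the log odds ratio quadruples the required sample size. A dedicated power program would be needed for an exact figure.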
In their 2010 paper, Dickson et al seem to take a different view on why GWAS do not
fully explain the heritability. They showed that signals detected by GWAS could be a
result of “synthetic association” created by less common “causal” variants (Dickson et al.,
2010). These indirect associations could be produced by one or multiple rare
variants in LD with measured common SNPs. The paper indicates that the
probability of creating a synthetic association increases as the number of rare variants
increases, and that rare variants could produce multiple independent synthetic association
signals (Dickson et al., 2010). Although most common variants identified by GWAS only
have moderate effect size, discovery of these variants can lead to the uncovering of
underlying high-penetrance rare variants. According to the conclusions from Dickson et
al, re-sequencing and association testing of rare variants seem to be the direction to solve
the missing heritability problem. As of today, however, evidence that individual rare
variants make a large contribution to missing heritability is limited. For example, a
two-stage exome sequencing study of schizophrenia suggested that individual coding
variants that are moderately rare (1%~5%) with large effect sizes (RR greater than 2)
play a modest role in the risk of this disease (Need et al., 2012).
Visscher’s group, however, holds to the CD-CV hypothesis that the major fraction of
heritability lies in common variants, and that it is just a matter of time before more common
SNPs with smaller effect sizes are uncovered to fill the gap between the
expected and discovered heritability. In their 2010 paper, Yang et al concluded
that 294,831 SNPs could explain 45% of the phenotypic variation of height; it was also
suggested in the paper that the discrepancy between the previous estimate of 80% height
heritability and their 45% estimate was due to incomplete LD between the measured
SNPs and causal variants. In contrast to Zuk’s view, Visscher’s idea seems to be that 1)
additive heritability still constitutes a major fraction of total heritability; 2) the genome-
wide coverage of current GWA genotyping platforms is enough to capture most variants
that contribute to the heritability of common diseases. Therefore, through genotyping of
common SNPs that are currently being studied in larger data sets, more well-powered
GWAS would be able to identify risk-associated common SNPs that explain most of the
missing heritability.
We agree with Visscher’s first view that the study of additive effects of genetic markers is
important. In our study, the additive component of variation constituted 45% of the total
phenotypic variation, which is very close to the heritability of height estimated by Yang
et al (Yang et al., 2010). However, in our interpretation, this estimate of 45% additive
heritability is not necessarily attributable to the set of common SNPs included in the
model; it could be due to the distant relatedness that was captured by the SNPs in use.
Therefore, other types of variations could also contribute to a substantial proportion of
heritability. In addition, we agree with Zuk et al that the traditional estimate of
heritability is significantly inflated by epistatic effects, since the heritability estimate of
height was much higher (around 75%) in close relatives. Since the currently known SNPs
only explain 10% of the variation of height (Lango Allen et al., 2010), the proportion of
additive variation remaining unexplained is around 35% (not 70% as was previously
seen), which could be attributable to other classes of variations.
Some researchers believe that epigenetics could also contribute to heritability (Marian,
2012). Epigenetic effects are often transient since modification to DNA and the histone
proteins are usually cleared during mitosis. However, it is found in model animals that
certain epigenetic modifications are transgenerational (de Assis et al., 2012; Gu et al.,
2012), which are believed to be regulated by microRNAs (Sato et al., 2011). Some
epigenetic modifications can last for two to three generations and contribute to short-term
heritability (Marian, 2012). It is also noted that environmental risk factors can contribute
to heritability, either via their impact on epigenetic modifications or via gene by
environment interactions (Marian, 2012; Skinner et al., 2010).
Since the CD-CV hypothesis was first proposed in 1996 (Lander, 1996), the genetic
decomposition of complex diseases has made significant progress. GWAS have
discovered numerous risk-associated SNPs, and provided valuable hints concerning
biological pathways and identification of disease-causing genes. Today, the study of
complex diseases has come to a post-GWAS era, with groups of researchers combining to
form large consortia to evaluate the aggregate effects of already discovered common
variation, and to begin the study of rare variation. In our study of height in a large sample
of African ancestry in Chapter Four, we found considerable evidence for as yet
unexplained additive heritability; however, we do not necessarily attribute all of this
additive heritability to the common SNPs that are actually genotyped, and other classes of
genetic variants may be involved in the additive genetic component of complex diseases.
With the development of sequencing technology, especially low cost next-generation
sequencing (NGS) methods, extensive investigation of rare genetic variants becomes
possible. It seems most likely that studies of the cumulative additive effects of common
variants (each individually with weak effects) or rare genetic variants (with stronger
effects), as well as other types of (non-additive) genetic effects, such as gene by gene
interactions, gene by environment interactions, and transgenerational epigenetic
modifications will be needed to shed light on the missing heritability problem, and
therefore contribute to a more complete definition of the genetic architecture of complex
diseases.
References
Ahmed, S., Thomas, G., Ghoussaini, M., Healey, C.S., Humphreys, M.K., Platte, R., Morrison, J.,
Maranian, M., Pooley, K.A., Luben, R., et al. (2009). Newly discovered breast cancer
susceptibility loci on 3p24 and 17q23.2. Nat Genet 41, 585-590.
Al Olama, A.A., Kote-Jarai, Z., Giles, G.G., Guy, M., Morrison, J., Severi, G., Leongamornlert,
D.A., Tymrakiewicz, M., Jhavar, S., Saunders, E., et al. (2009). Multiple loci on 8q24 associated
with prostate cancer susceptibility. Nat Genet 41, 1058-1060.
Alaya, F.J., Escalante, A., O'Huigin, C., and Klein, J. (1994). Molecular genetics of speciation
and human origins. Proc Natl Acad Sci 91, 6787-6794.
Aldrich, M.C., Selvin, S., Hansen, H.M., Barcellos, L.F., Wrensch, M.R., Sison, J.D., Kelsey,
K.T., Buffler, P.A., Quesenberry, C.P., Jr., Seldin, M.F., et al. (2009). CYP1A1/2 haplotypes and
lung cancer and assessment of confounding by population stratification. Cancer Res 69, 2340-
2348.
Ambrosone, C.B., Ciupak, G.L., Bandera, E.V., Jandorf, L., Bovbjerg, D.H., Zirpoli, G., Pawlish,
K., Godbold, J., Furberg, H., Fatone, A., et al. (2009). Conducting Molecular Epidemiological
Research in the Age of HIPAA: A Multi-Institutional Case-Control Study of Breast Cancer in
African-American and European-American Women. J Oncol 2009, 871250.
Anderson, T.W. (1973). Asymptotically Efficient Estimation of Covariance Matrices with Linear
Structure. Ann Stat 1, 135-141.
Antoniou, A.C., Wang, X., Fredericksen, Z.S., McGuffog, L., Tarrell, R., Sinilnikova, O.M.,
Healey, S., Morrison, J., Kartsonaki, C., Lesnick, T., et al. (2010). A locus on 19p13 modifies
risk of breast cancer in BRCA1 mutation carriers and is associated with hormone receptor-
negative breast cancer in the general population. Nat Genet 42, 885-892.
Astle, W., and Balding, D.J. (2009). Population Structure and Cryptic Relatedness in Genetic
Association Studies. Statistical Science 24, 451-471.
Azevedo, L., Suriano, G., van Asch, B., Harding, R.M., and Amorim, A. (2006). Epistatic
interactions: how strong in disease and evolution? Trends Genet 22, 581-585.
Barbalic, M., Reiner, A.P., Wu, C., Hixson, J.E., Franceschini, N., Eaton, C.B., Heiss, G., Couper,
D., Mosley, T., and Boerwinkle, E. (2011). Genome-wide association analysis of incident
coronary heart disease (CHD) in African Americans: a short report. PLoS Genet 7, e1002199.
Bosch, E., Laayouni, H., Morcillo-Suarez, C., Casals, F., Moreno-Estrada, A., Ferrer-Admetlla,
A., Gardner, M., Rosa, A., Navarro, A., Comas, D., et al. (2009). Decay of linkage disequilibrium
within genes across HGDP-CEPH human samples: most population isolates do not show
increased LD. BMC Genomics 10, 338.
Broeks, A., Schmidt, M.K., Sherman, M.E., Couch, F.J., Hopper, J.L., Dite, G.S., Apicella, C.,
Smith, L.D., Hammet, F., Southey, M.C., et al. (2011). Low penetrance breast cancer
susceptibility loci are associated with specific breast tumor subtypes: findings from the Breast
Cancer Association Consortium. Hum Mol Genet.
Browning, S.R., and Browning, B.L. (2011). Population structure can inflate SNP-based
heritability estimates. Am J Hum Genet 89, 191-193; author reply 193-195.
Bruner, D.W., Moore, D., Parlanti, A., Dorgan, J., and Engstrom, P. (2003). Relative risk of
prostate cancer for men with affected relatives: systematic review and meta-analysis. Int J Cancer
107, 797-803.
Carty, C.L., Johnson, N.A., Hutter, C.M., Reiner, A.P., Peters, U., Tang, H., and Kooperberg, C.
(2012). Genome-wide association study of body height in African Americans: the Women's
Health Initiative SNP Health Association Resource (SHARe). Hum Mol Genet 21, 711-720.
Chang, A.S., and Noor, M.A. (2010). Epistasis modifies the dominance of loci causing hybrid
male sterility in the Drosophila pseudoobscura species group. Evolution 64, 253-260.
Chen, F., Chen, G.K., Millikan, R.C., John, E.M., Ambrosone, C.B., Bernstein, L., Zheng, W.,
Hu, J.J., Ziegler, R.G., Deming, S.L., et al. (2011). Fine-mapping of breast cancer susceptibility
loci characterizes genetic risk in African Americans. Hum Mol Genet 20, 4491-4503.
Chen, F., Stram, D.O., Le Marchand, L., Monroe, K.R., Kolonel, L.N., Henderson, B.E., and
Haiman, C.A. (2010). Caution in generalizing known genetic risk markers for breast cancer
across all ethnic/racial populations. Eur J Hum Genet 19, 243-245.
Cho, Y.S., Go, M.J., Kim, Y.J., Heo, J.Y., Oh, J.H., Ban, H.J., Yoon, D., Lee, M.H., Kim, D.J.,
Park, M., et al. (2009). A large-scale genome-wide association study of Asian populations
uncovers genetic factors influencing eight quantitative traits. Nat Genet 41, 527-534.
Cichon, S., Craddock, N., Daly, M., Faraone, S.V., Gejman, P.V., Kelsoe, J., Lehner, T.,
Levinson, D.F., Moran, A., Sklar, P., et al. (2009). Genomewide association studies: history,
rationale, and prospects for psychiatric disorders. Am J Psychiatry 166, 540-556.
Cicila, G.T., Morgan, E.E., Lee, S.J., Farms, P., Yerga-Woolwine, S., Toland, E.J., Ramdath, R.S.,
Gopalakrishnan, K., Bohman, K., Nestor-Kalinoski, A.L., et al. (2009). Epistatic genetic
determinants of blood pressure and mortality in a salt-sensitive hypertension model. Hypertension
53, 725-732.
Cochran, W.G. (1968). Errors of Measurement in Statistics. Technometrics 10, 637-&.
Crow, J.F. (2010). On epistasis: why it is unimportant in polygenic directional selection. Philos
Trans R Soc Lond B Biol Sci 365, 1241-1244.
Crowther-Swanepoel, D., Broderick, P., Di Bernardo, M.C., Dobbins, S.E., Torres, M., Mansouri,
M., Ruiz-Ponte, C., Enjuanes, A., Rosenquist, R., Carracedo, A., et al. (2010). Common variants
at 2q37.3, 8q24.21, 15q21.3 and 16q24.1 influence chronic lymphocytic leukemia risk. Nat Genet
42, 132-136.
de Assis, S., Warri, A., Cruz, M.I., Laja, O., Tian, Y., Zhang, B., Wang, Y., Huang, T.H., and
Hilakivi-Clarke, L. (2012). High-fat or ethinyl-oestradiol intake during pregnancy increases
mammary cancer risk in several generations of offspring. Nat Commun 3, 1053.
Devlin, B., and Roeder, K. (1999). Genomic control for association studies. Biometrics 55, 997-
1004.
Devlin, B., Roeder, K., and Wasserman, L. (2001). Genomic control, a new approach to genetic-
based association studies. Theor Popul Biol 60, 155-166.
Dickson, S.P., Wang, K., Krantz, I., Hakonarson, H., and Goldstein, D.B. (2010). Rare variants
create synthetic genome-wide associations. PLoS Biol 8, e1000294.
Easton, D.F., Pooley, K.A., Dunning, A.M., Pharoah, P.D., Thompson, D., Ballinger, D.G.,
Struewing, J.P., Morrison, J., Field, H., Luben, R., et al. (2007). Genome-wide association study
identifies novel breast cancer susceptibility loci. Nature 447, 1087-1093.
Endogenous Hormones and Breast Cancer Collaborative Group (2010). Insulin-like growth factor
1 (IGF1), IGF binding protein 3 (IGFBP3), and breast cancer risk: pooled individual data analysis
of 17 prospective studies. Lancet Oncol 11, 530-542.
Estrada, K., Krawczak, M., Schreiber, S., van Duijn, K., Stolk, L., van Meurs, J.B., Liu, F.,
Penninx, B.W., Smit, J.H., Vogelzangs, N., et al. (2009). A genome-wide association study of
northwestern Europeans involves the C-type natriuretic peptide signaling pathway in the etiology
of human height variation. Hum Mol Genet 18, 3516-3524.
Falconer, D.S., and Mackay, T.F. (1996). Introduction to Quantitative Genetics, Fourth edition,
Chapter 8.
Falush, D., Stephens, M., and Pritchard, J.K. (2003). Inference of population structure using
multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164, 1567-1587.
Fejerman, L., Haiman, C.A., Reich, D., Tandon, A., Deo, R.C., John, E.M., Ingles, S.A.,
Ambrosone, C.B., Bovbjerg, D.H., Jandorf, L.H., et al. (2009). An admixture scan in 1,484
African American women with breast cancer. Cancer Epidemiol Biomarkers Prev 18, 3110-3117.
Fisher, R.A. (1918). The correlation between relatives on the supposition of Mendelian
inheritance. Trans R Soc Edinb 52, 399-433.
Fletcher, O., Johnson, N., Orr, N., Hosking, F.J., Gibson, L.J., Walker, K., Zelenika, D., Gut, I.,
Heath, S., Palles, C., et al. (2011). Novel breast cancer susceptibility locus at 9q31.2: results of a
genome-wide association study. J Natl Cancer Inst 103, 425-435.
Frazer, K.A., Ballinger, D.G., Cox, D.R., Hinds, D.A., Stuve, L.L., Gibbs, R.A., Belmont, J.W.,
Boudreau, A., Hardenbol, P., Leal, S.M., et al. (2007). A second generation human haplotype
map of over 3.1 million SNPs. Nature 449, 851-861.
Freedman, M.L. (2006). Admixture mapping identifies 8q24 as a prostate cancer risk locus in
African-American men. Proc Natl Acad Sci 103, 14068-14073.
Fuller, W.A., and Hidiroglou, M.A. (1978). Regression Estimation after Correcting for
Attenuation. J Am Stat Assoc 73, 99-104.
Gail, M.H. (2008). Discriminatory accuracy from single-nucleotide polymorphisms in models to
predict breast cancer risk. J Natl Cancer Inst 100, 1037-1041.
Garcia-Closas, M., Hall, P., Nevanlinna, H., Pooley, K., Morrison, J., Richesson, D.A., Bojesen,
S.E., Nordestgaard, B.G., Axelsson, C.K., Arias, J.I., et al. (2008). Heterogeneity of breast cancer
associations with five susceptibility loci by clinical and pathological characteristics. PLoS Genet
4, e1000054.
Gauderman, W.J. (2002). Sample size requirements for association studies of gene-gene
interaction. Am J Epidemiol 155, 478-484.
Gauderman, W.J., and Morrison, J.M. (2006). QUANTO 1.1: A computer program for power and
sample size calculations for genetic-epidemiology studies, http://hydra.usc.edu/gxe.
Ghoussaini, M., Song, H., Koessler, T., Al Olama, A.A., Kote-Jarai, Z., Driver, K.E., Pooley,
K.A., Ramus, S.J., Kjaer, S.K., Hogdall, E., et al. (2008). Multiple loci with different cancer
specificities within the 8q24 gene desert. J Natl Cancer Inst 100, 962-966.
Goode, E.L., Chenevix-Trench, G., Song, H., Ramus, S.J., Notaridou, M., Lawrenson, K.,
Widschwendter, M., Vierkant, R.A., Larson, M.C., Kjaer, S.K., et al. (2010). A genome-wide
association study identifies susceptibility loci for ovarian cancer at 2q31 and 8q24. Nat Genet 42,
874-879.
Greenberg, R., and Crow, J.F. (1960). A Comparison of the Effect of Lethal and Detrimental
Chromosomes from Drosophila Populations. Genetics 45, 1153-1168.
Grote, V., Becker, S., and Kaaks, R. (2010). Diabetes Mellitus Type 2 – An Independent Risk
Factor for Cancer? Experimental and Clinical Endocrinology & Diabetes 118, 4-8.
Gu, S.G., Pak, J., Guang, S., Maniar, J.M., Kennedy, S., and Fire, A. (2012). Amplification of
siRNA in Caenorhabditis elegans generates a transgenerational sequence-targeted histone H3
lysine 9 methylation footprint. Nat Genet 44, 157-164.
Gudbjartsson, D.F., Walters, G.B., Thorleifsson, G., Stefansson, H., Halldorsson, B.V.,
Zusmanovich, P., Sulem, P., Thorlacius, S., Gylfason, A., Steinberg, S., et al. (2008). Many
sequence variants affecting diversity of adult human height. Nat Genet 40, 609-615.
Gudmundsson, J., Sulem, P., Gudbjartsson, D.F., Blondal, T., Gylfason, A., Agnarsson, B.A.,
Benediktsdottir, K.R., Magnusdottir, D.N., Orlygsdottir, G., Jakobsdottir, M., et al. (2009).
Genome-wide association and replication studies identify four variants associated with prostate
cancer susceptibility. Nat Genet 41, 1122-1126.
Gudmundsson, J., Sulem, P., Manolescu, A., Amundadottir, L.T., Gudbjartsson, D., Helgason, A.,
Rafnar, T., Bergthorsson, J.T., Agnarsson, B.A., Baker, A., et al. (2007). Genome-wide
association study identifies a second prostate cancer susceptibility variant at 8q24. Nat Genet 39,
631-637.
Haiman, C.A., Chen, G.K., Blot, W.J., Strom, S.S., Berndt, S.I., Kittles, R.A., Rybicki, B.A.,
Isaacs, W.B., Ingles, S.A., Stanford, J.L., et al. (2011a). Characterizing genetic risk at known
prostate cancer susceptibility loci in African Americans. PLoS Genet 7, e1001387.
Haiman, C.A., Chen, G.K., Blot, W.J., Strom, S.S., Berndt, S.I., Kittles, R.A., Rybicki, B.A.,
Isaacs, W.B., Ingles, S.A., Stanford, J.L., et al. (2011b). Genome-wide association study of
prostate cancer in men of African ancestry identifies a susceptibility locus at 17q21. Nat Genet 43,
570-573.
Haiman, C.A., Chen, G.K., Vachon, C.M., Canzian, F., Dunning, A., Millikan, R.C., Wang, X.,
Ademuyiwa, F., Ahmed, S., Ambrosone, C.B., et al. (2011c). A common variant at the TERT-
CLPTM1L locus is associated with estrogen receptor-negative breast cancer. Nat Genet 43, 1210-
1214.
Haiman, C.A., Hsu, C., de Bakker, P.I., Frasco, M., Sheng, X., Van Den Berg, D., Casagrande,
J.T., Kolonel, L.N., Le Marchand, L., Hankinson, S.E., et al. (2008). Comprehensive association
testing of common genetic variation in DNA repair pathway genes in relationship with breast
cancer risk in multiple populations. Hum Mol Genet 17, 825-834.
Haiman, C.A., Patterson, N., Freedman, M.L., Myers, S.R., Pike, M.C., Waliszewska, A.,
Neubauer, J., Tandon, A., Schirmer, C., McDonald, G.J., et al. (2007). Multiple regions within
8q24 independently affect risk for prostate cancer. Nat Genet 39, 638-644.
Hanfling, B., and Brandl, R. (1998). Genetic variability, population size and isolation of distinct
populations in the freshwater fish Cottus gobio L. Molecular Ecology 7, 1625-1632.
Harville, D.A. (1977). Maximum Likelihood Approaches to Variance Component Estimation and
to Related Problems. J Am Stat Assoc 72, 320-338.
Hunter, D.J., Kraft, P., Jacobs, K.B., Cox, D.G., Yeager, M., Hankinson, S.E., Wacholder, S.,
Wang, Z., Welch, R., Hutchinson, A., et al. (2007). A genome-wide association study identifies
alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet 39,
870-874.
Ioannidis, J.P. (2009). Population-wide generalizability of genome-wide discovered associations.
J Natl Cancer Inst 101, 1297-1299.
Jia, L., Landan, G., Pomerantz, M., Jaschek, R., Herman, P., Reich, D., Yan, C., Khalid, O.,
Kantoff, P., Oh, W., et al. (2009). Functional enhancers at the gene-poor 8q24 cancer-linked
locus. PLoS Genet 5, e1000597.
Johansson, A., Marroni, F., Hayward, C., Franklin, C.S., Kirichenko, A.V., Jonasson, I., Hicks,
A.A., Vitart, V., Isaacs, A., Axenovich, T., et al. (2009). Common variants in the JAZF1 gene
associated with height identified by linkage and genome-wide association analysis. Hum Mol
Genet 18, 373-380.
John, E.M., Hopper, J.L., Beck, J.C., Knight, J.A., Neuhausen, S.L., Senie, R.T., Ziogas, A.,
Andrulis, I.L., Anton-Culver, H., Boyd, N., et al. (2004). The Breast Cancer Family Registry: an
infrastructure for cooperative multinational, interdisciplinary and translational studies of the
genetic epidemiology of breast cancer. Breast Cancer Res 6, R375-389.
John, E.M., Schwartz, G.G., Koo, J., Wang, W., and Ingles, S.A. (2007). Sun exposure, vitamin D
receptor gene polymorphisms, and breast cancer risk in a multiethnic population. Am J Epidemiol
166, 1409-1419.
Johns, L.E., and Houlston, R.S. (2003). A systematic review and meta-analysis of familial
prostate cancer risk. BJU Int 91, 789-794.
Kaaks, R. (2004). Nutrition, insulin, IGF-1 metabolism and cancer risk: a summary of
epidemiological evidence. Novartis Found Symp 262, 247-260; discussion 260-268.
Kang, H.M., Sul, J.H., Service, S.K., Zaitlen, N.A., Kong, S.Y., Freimer, N.B., Sabatti, C., and
Eskin, E. (2010). Variance component model to account for sample structure in genome-wide
association studies. Nat Genet 42, 348-354.
Kiemeney, L.A., Thorlacius, S., Sulem, P., Geller, F., Aben, K.K., Stacey, S.N., Gudmundsson, J.,
Jakobsdottir, M., Bergthorsson, J.T., Sigurdsson, A., et al. (2008). Sequence variant on 8q24
confers susceptibility to urinary bladder cancer. Nat Genet 40, 1307-1312.
Kim, H.C., Lee, J.Y., Sung, H., Choi, J.Y., Park, S.K., Lee, K.M., Kim, Y.J., Go, M.J., Li, L.,
Cho, Y.S., et al. (2012). A genome-wide association study identifies a breast cancer risk variant
in ERBB4 at 2q34: results from the Seoul Breast Cancer Study. Breast Cancer Res 14, R56.
Kim, J.J., Lee, H.I., Park, T., Kim, K., Lee, J.E., Cho, N.H., Shin, C., Cho, Y.S., Lee, J.Y., Han,
B.G., et al. (2010). Identification of 15 loci influencing height in a Korean population. J Hum
Genet 55, 27-31.
Klein, R.J., Zeiss, C., Chew, E.Y., Tsai, J.Y., Sackler, R.S., Haynes, C., Henning, A.K.,
SanGiovanni, J.P., Mane, S.M., Mayne, S.T., et al. (2005). Complement factor H polymorphism
in age-related macular degeneration. Science 308, 385-389.
Kolonel, L.N., Henderson, B.E., Hankin, J.H., Nomura, A.M., Wilkens, L.R., Pike, M.C., Stram,
D.O., Monroe, K.R., Earle, M.E., and Nagamine, F.S. (2000a). A multiethnic cohort in Hawaii
and Los Angeles: baseline characteristics. Am J Epidemiol 151, 346-357.
Kolonel, L.N., Henderson, B.E., Hankin, J.H., Nomura, A.M., Wilkens, L.R., Pike, M.C., Stram,
D.O., Monroe, K.R., Earle, M.E., and Nagamine, F.S. (2000b). A Multiethnic Cohort in Hawaii
and Los Angeles: Baseline Characteristics. Am J Epidemiol 151, 346-357.
Kraft, P., Wacholder, S., Cornelis, M.C., Hu, F.B., Hayes, R.B., Thomas, G., Hoover, R., Hunter,
D.J., and Chanock, S. (2009). Beyond odds ratios -- communicating disease risk based on genetic
profiles. Nat Rev Genet 10, 264-269.
Laan, M., and Paabo, S. (1997). Demographic history and linkage disequilibrium in human
populations. Nat Genet 17, 435-438.
Lander, E.S. (1996). The new genomics: global views of biology. Science 274, 536-539.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar,
K., Doyle, M., FitzHugh, W., et al. (2001). Initial sequencing and analysis of the human genome.
Nature 409, 860-921.
Lango Allen, H., Estrada, K., Lettre, G., Berndt, S.I., Weedon, M.N., Rivadeneira, F., Willer, C.J.,
Jackson, A.U., Vedantam, S., Raychaudhuri, S., et al. (2010). Hundreds of variants clustered in
genomic loci and biological pathways affect human height. Nature 467, 832-838.
Laurie, C.C., Chasalow, S.D., LeDeaux, J.R., McCarroll, R., Bush, D., Hauge, B., Lai, C., Clark,
D., Rocheford, T.R., and Dudley, J.W. (2004). The genetic architecture of response to long-term
artificial selection for oil concentration in the maize kernel. Genetics 168, 2141-2155.
Lee, J., Beliakoff, J., and Sun, Z. (2007). The novel PIAS-like protein hZimp10 is a
transcriptional co-activator of the p53 tumor suppressor. Nucleic Acids Res 35, 4523-4534.
Lee, L.G., Connell, C.R., and Bloch, W. (1993). Allelic discrimination by nick-translation PCR
with fluorogenic probes. Nucleic Acids Res 21, 3761-3766.
Lettre, G., Jackson, A.U., Gieger, C., Schumacher, F.R., Berndt, S.I., Sanna, S., Eyheramendy, S.,
Voight, B.F., Butler, J.L., Guiducci, C., et al. (2008). Identification of ten loci associated with
height highlights new biological pathways in human growth. Nat Genet 40, 584-591.
Li, W.H., and Sadler, L.A. (1991). Low nucleotide diversity in man. Genetics 129, 513-523.
Li, X., Thyssen, G., Beliakoff, J., and Sun, Z. (2006). The novel PIAS-like protein hZimp10
enhances Smad transcriptional activity. J Biol Chem 281, 23748-23756.
Li, Y., Willer, C., Sanna, S., and Abecasis, G. (2009). Genotype imputation. Annu Rev Genomics
Hum Genet 10, 387-406.
Li, Y., Willer, C.J., Ding, J., Scheet, P., and Abecasis, G.R. (2010). MaCH: using sequence and
genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 34, 816-834.
Liu, D., Ghosh, D., and Lin, X. (2008). Estimation and testing for the effect of a genetic pathway
on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC
Bioinformatics 9, 292.
Liu, D., Lin, X., and Ghosh, D. (2007). Semiparametric regression of multidimensional genetic
pathway data: least-squares kernel machines and linear mixed models. Biometrics 63, 1079-1088.
Liu, J.Z., Medland, S.E., Wright, M.J., Henders, A.K., Heath, A.C., Madden, P.A., Duncan, A.,
Montgomery, G.W., Martin, N.G., and McRae, A.F. (2010). Genome-wide association study of
height and body mass index in Australian twin families. Twin Res Hum Genet 13, 179-193.
Long, A.D., Mullaney, S.L., Reid, L.A., Fry, J.D., Langley, C.H., and Mackay, T.F. (1995). High
resolution mapping of genetic factors affecting abdominal bristle number in Drosophila
melanogaster. Genetics 139, 1273-1291.
Long, J., Cai, Q., Shu, X.O., Qu, S., Li, C., Zheng, Y., Gu, K., Wang, W., Xiang, Y.B., Cheng, J.,
et al. (2010). Identification of a functional genetic variant at 16q12.1 for breast cancer risk:
results from the Asia Breast Cancer Consortium. PLoS Genet 6, e1001002.
Long, J., Cai, Q., Sung, H., Shi, J., Zhang, B., Choi, J.Y., Wen, W., Delahanty, R.J., Lu, W., Gao,
Y.T., et al. (2012). Genome-wide association study in east Asians identifies novel susceptibility
loci for breast cancer. PLoS Genet 8, e1002532.
Lowe, J.K., Maller, J.B., Pe'er, I., Neale, B.M., Salit, J., Kenny, E.E., Shea, J.L., Burkhardt, R.,
Smith, J.G., Ji, W., et al. (2009). Genome-wide association studies in an isolated founder
population from the Pacific Island of Kosrae. PLoS Genet 5, e1000365.
Macgregor, S., Cornes, B.K., Martin, N.G., and Visscher, P.M. (2006). Bias, precision and
heritability of self-reported and clinically measured height in Australian twins. Hum Genet 120,
571-580.
Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B., Hindorff, L.A., Hunter, D.J., McCarthy,
M.I., Ramos, E.M., Cardon, L.R., Chakravarti, A., et al. (2009). Finding the missing heritability
of complex diseases. Nature 461, 747-753.
Marchbanks, P.A., McDonald, J.A., Wilson, H.G., Burnett, N.M., Daling, J.R., Bernstein, L.,
Malone, K.E., Strom, B.L., Norman, S.A., Weiss, L.K., et al. (2002). The NICHD Women's
Contraceptive and Reproductive Experiences Study: methods and operational results. Ann
Epidemiol 12, 213-221.
Marian, A.J. (2012). Elements of 'missing heritability'. Curr Opin Cardiol 27, 197-201.
McVean, G.A., Myers, S.R., Hunt, S., Deloukas, P., Bentley, D.R., and Donnelly, P. (2004). The
fine-scale structure of recombination rate variation in the human genome. Science 304, 581-584.
Milne, R.L., Benitez, J., Nevanlinna, H., Heikkinen, T., Aittomaki, K., Blomqvist, C., Arias, J.I.,
Zamora, M.P., Burwinkel, B., Bartram, C.R., et al. (2009). Risk of estrogen receptor-positive and
-negative breast cancer and single-nucleotide polymorphism 2q35-rs13387042. J Natl Cancer Inst
101, 1012-1018.
Morimoto, L.M., White, E., Chen, Z., Chlebowski, R.T., Hays, J., Kuller, L., Lopez, A.M.,
Manson, J., Margolis, K.L., Muti, P.C., et al. (2001). Obesity, body size, and risk of
postmenopausal breast cancer: the Women's Health Initiative (United States). Cancer Causes
Control 13, 741-751.
Myers, S., Bottolo, L., Freeman, C., McVean, G., and Donnelly, P. (2005). A fine-scale map of
recombination rates and hotspots across the human genome. Science 310, 321-324.
N'Diaye, A., Chen, G.K., Palmer, C.D., Ge, B., Tayo, B., Mathias, R.A., Ding, J., Nalls, M.A.,
Adeyemo, A., Adoue, V., et al. (2011). Identification, replication, and fine-mapping of loci
associated with adult height in individuals of African ancestry. PLoS Genet 7, e1002298.
Need, A.C., McEvoy, J.P., Gennarelli, M., Heinzen, E.L., Ge, D., Maia, J.M., Shianna, K.V., He,
M., Cirulli, E.T., Gumbs, C.E., et al. (2012). Exome sequencing followed by large-scale
genotyping suggests a limited role for moderately rare risk factors of strong effect in
schizophrenia. Am J Hum Genet 91, 303-312.
Nei, M., Maruyama, T., and Chakraborty, R. (1975). The bottleneck effect and genetic variability
in populations. Evolution 29, 1-10.
Newman, B., Moorman, P.G., Millikan, R., Qaqish, B.F., Geradts, J., Aldrich, T.E., and Liu, E.T.
(1995). The Carolina Breast Cancer Study: integrating population-based epidemiology and
molecular biology. Breast Cancer Res Treat 35, 51-60.
Ng, M.C., Hester, J.M., Wing, M.R., Li, J., Xu, J., Hicks, P.J., Roh, B.H., Lu, L., Divers, J.,
Langefeld, C.D., et al. (2011). Genome-Wide Association of BMI in African Americans. Obesity
(Silver Spring).
Novosyadlyy, R., Lann, D.E., Vijayakumar, A., Rowzee, A., Lazzarino, D.A., Fierz, Y., Carboni,
J.M., Gottardis, M.M., Pennisi, P.A., Molinolo, A.A., et al. (2010). Insulin-Mediated
Acceleration of Breast Cancer Development and Progression in a Nonobese Model of Type 2
Diabetes. Cancer Research 70, 741-751.
Okada, Y., Kamatani, Y., Takahashi, A., Matsuda, K., Hosono, N., Ohmiya, H., Daigo, Y.,
Yamamoto, K., Kubo, M., Nakamura, Y., et al. (2010). A genome-wide association study in 19
633 Japanese subjects identified LHX3-QSOX2 and IGF1 as adult height loci. Hum Mol Genet
19, 2303-2312.
Orr, H.A., and Irving, S. (2005). Segregation distortion in hybrids between the Bogota and USA
subspecies of Drosophila pseudoobscura. Genetics 169, 671-682.
Park, J.H., Wacholder, S., Gail, M.H., Peters, U., Jacobs, K.B., Chanock, S.J., and Chatterjee, N.
(2010). Estimation of effect size distribution from genome-wide association studies and
implications for future discoveries. Nat Genet 42, 570-575.
Pasaniuc, B., Zaitlen, N., Lettre, G., Chen, G., Tandon, A., Kao, L., Ruczinski, I., Fornage, M.,
Siscovick, D., Zhu, X., et al. (2011). Enhanced statistical tests for GWAS in admixed populations:
assessment using African Americans from CARe and a breast cancer consortium. Under review.
Pepe, M.S., and Janes, H.E. (2008). Gauging the performance of SNPs, biomarkers, and clinical
factors for predicting risk of breast cancer. J Natl Cancer Inst 100, 978-979.
Petrelli, J.M., Calle, E.E., Rodriguez, C., and Thun, M.J. (2002). Body mass index, height, and
postmenopausal breast cancer mortality in a prospective cohort of US women. Cancer Causes
Control 13, 325-332.
Pharoah, P.D., Antoniou, A., Bobrow, M., Zimmern, R.L., Easton, D.F., and Ponder, B.A. (2002).
Polygenic susceptibility to breast cancer and implications for prevention. Nat Genet 31, 33-36.
Price, A.L., Patterson, N., Yu, F., Cox, D.R., Waliszewska, A., McDonald, G.J., Tandon, A.,
Schirmer, C., Neubauer, J., Bedoya, G., et al. (2007). A genomewide admixture map for Latino
populations. Am J Hum Genet 80, 1024-1036.
Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., and Reich, D. (2006).
Principal components analysis corrects for stratification in genome-wide association studies. Nat
Genet 38, 904-909.
Price, A.L., Tandon, A., Patterson, N., Barnes, K.C., Rafaels, N., Ruczinski, I., Beaty, T.H.,
Mathias, R., Reich, D., and Myers, S. (2009). Sensitive detection of chromosomal segments of
distinct ancestry in admixed populations. PLoS Genet 5, e1000519.
Pritchard, J.K., Stephens, M., and Donnelly, P. (2000a). Inference of population structure using
multilocus genotype data. Genetics 155, 945-959.
Pritchard, J.K., Stephens, M., Rosenberg, N.A., and Donnelly, P. (2000b). Association mapping
in structured populations. Am J Hum Genet 67, 170-181.
Prorok, P.C., Andriole, G.L., Bresalier, R.S., Buys, S.S., Chia, D., Crawford, E.D., Fogel, R.,
Gelmann, E.P., Gilbert, F., Hasson, M.A., et al. (2000). Design of the Prostate, Lung, Colorectal
and Ovarian (PLCO) Cancer Screening Trial. Control Clin Trials 21, 273S-309S.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar,
P., de Bakker, P.I., Daly, M.J., et al. (2007). PLINK: a tool set for whole-genome association and
population-based linkage analyses. Am J Hum Genet 81, 559-575.
Purcell, S.M., Wray, N.R., Stone, J.L., Visscher, P.M., O'Donovan, M.C., Sullivan, P.F., and
Sklar, P. (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar
disorder. Nature 460, 748-752.
Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases.
Science 273, 1516-1517.
Rosenberg, N.A., Huang, L., Jewett, E.M., Szpiech, Z.A., Jankovic, I., and Boehnke, M. (2010).
Genome-wide association studies in diverse populations. Nat Rev Genet 11, 356-366.
Ruiz-Narvaez, E.A., Rosenberg, L., Cozier, Y.C., Cupples, L.A., Adams-Campbell, L.L., and
Palmer, J.R. (2010). Polymorphisms in the TOX3/LOC643714 locus and risk of breast cancer in
African-American women. Cancer Epidemiol Biomarkers Prev 19, 1320-1327.
Sachidanandam, R., Weissman, D., Schmidt, S.C., Kakol, J.M., Stein, L.D., Marth, G., Sherry, S.,
Mullikin, J.C., Mortimore, B.J., Willey, D.L., et al. (2001). A map of human genome sequence
variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928-933.
Salinas, C.A., Kwon, E., Carlson, C.S., Koopmeiners, J.S., Feng, Z., Karyadi, D.M., Ostrander,
E.A., and Stanford, J.L. (2008). Multiple independent genetic variants in the 8q24 region are
associated with prostate cancer risk. Cancer Epidemiol Biomarkers Prev 17, 1203-1213.
Sanna, S., Jackson, A.U., Nagaraja, R., Willer, C.J., Chen, W.M., Bonnycastle, L.L., Shen, H.,
Timpson, N., Lettre, G., Usala, G., et al. (2008). Common variants in the GDF5-UQCC region
are associated with variation in human height. Nat Genet 40, 198-203.
Sato, F., Tsuchiya, S., Meltzer, S.J., and Shimizu, K. (2011). MicroRNAs and epigenetics. FEBS
J 278, 1598-1609.
Schaid, D.J. (2010a). Genomic Similarity and Kernel Methods I: Advancements by Building on
Mathematical and Statistical Foundations. Hum Hered 70, 109-131.
Schaid, D.J. (2010b). Genomic Similarity and Kernel Methods II: Methods for Genomic
Information. Hum Hered 70, 132-140.
Seldin, M.F., Pasaniuc, B., and Price, A.L. (2011). New approaches to disease mapping in
admixed populations. Nat Rev Genet 12, 523-528.
Sharma, M., Li, X., Wang, Y., Zarnegar, M., Huang, C.-Y., Palvimo, J.J., Lim, B., and Sun, Z.
(2003). hZimp10 is an androgen receptor co-activator and forms a complex with SUMO-1 at
replication foci. EMBO J 22, 6101-6114.
Shrimpton, A.E., and Robertson, A. (1988a). The isolation of polygenic factors controlling bristle
score in Drosophila melanogaster. I. Allocation of third chromosome sternopleural bristle effects
to chromosome sections. Genetics 118, 437-443.
Shrimpton, A.E., and Robertson, A. (1988b). The Isolation of Polygenic Factors Controlling
Bristle Score in Drosophila Melanogaster. II. Distribution of Third Chromosome Bristle Effects
within Chromosome Sections. Genetics 118, 445-459.
Silventoinen, K., Sammalisto, S., Perola, M., Boomsma, D.I., Cornes, B.K., Davis, C., Dunkel, L.,
De Lange, M., Harris, J.R., Hjelmborg, J.V., et al. (2003). Heritability of adult body height: a
comparative study of twin cohorts in eight countries. Twin Res 6, 399-408.
Skinner, M.K., Manikkam, M., and Guerrero-Bosagna, C. (2010). Epigenetic transgenerational
actions of environmental factors in disease etiology. Trends Endocrinol Metab 21, 214-222.
Smith, M.W., Patterson, N., Lautenberger, J.A., Truelove, A.L., McDonald, G.J., Waliszewska,
A., Kessing, B.D., Malasky, M.J., Scafe, C., Le, E., et al. (2004). A high-density admixture map
for disease gene discovery in African Americans. Am J Hum Genet 74, 1001-1013.
Smith, T.R., Levine, E.A., Freimanis, R.I., Akman, S.A., Allen, G.O., Hoang, K.N., Liu-Mares,
W., and Hu, J.J. (2008). Polygenic model of DNA repair genetic polymorphisms in human breast
cancer risk. Carcinogenesis 29, 2132-2138.
Soranzo, N., Rivadeneira, F., Chinappen-Horsley, U., Malkina, I., Richards, J.B., Hammond, N.,
Stolk, L., Nica, A., Inouye, M., Hofman, A., et al. (2009). Meta-analysis of genome-wide scans
for human adult stature identifies novel loci and associations with measures of skeletal frame
size. PLoS Genet 5, e1000445.
Spickett, S.G., and Thoday, J.M. (1966). Regular responses to selection. 3. Interaction between
located polygenes. Genet Res 7, 96-121.
Stacey, S.N., Manolescu, A., Sulem, P., Rafnar, T., Gudmundsson, J., Gudjonsson, S.A., Masson,
G., Jakobsdottir, M., Thorlacius, S., Helgason, A., et al. (2007). Common variants on
chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor-positive breast cancer.
Nat Genet 39, 865-869.
Stacey, S.N., Manolescu, A., Sulem, P., Thorlacius, S., Gudjonsson, S.A., Jonsson, G.F.,
Jakobsdottir, M., Bergthorsson, J.T., Gudmundsson, J., Aben, K.K., et al. (2008). Common
variants on chromosome 5p12 confer susceptibility to estrogen receptor-positive breast cancer.
Nat Genet 40, 703-706.
Stacey, S.N., Sulem, P., Zanon, C., Gudjonsson, S.A., Thorleifsson, G., Helgason, A., Jonasdottir,
A., Besenbacher, S., Kostic, J.P., Fackenthal, J.D., et al. (2010). Ancestry-shift refinement
mapping of the C6orf97-ESR1 breast cancer susceptibility locus. PLoS Genet 6, e1001029.
The 1000 Genomes Project Consortium (2010). A map of human genome variation from
population-scale sequencing. Nature 467, 1061-1073.
The International HapMap Consortium (2003). The International HapMap Project. Nature 426,
789-796.
Thomas, G., Jacobs, K.B., Kraft, P., Yeager, M., Wacholder, S., Cox, D.G., Hankinson, S.E.,
Hutchinson, A., Wang, Z., Yu, K., et al. (2009). A multistage genome-wide association study in
breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nat Genet 41,
579-584.
Tian, C., Hinds, D.A., Shigeta, R., Adler, S.G., Lee, A., Pahl, M.V., Silva, G., Belmont, J.W.,
Hanson, R.L., Knowler, W.C., et al. (2007). A genomewide single-nucleotide-polymorphism
panel for Mexican American admixture mapping. Am J Hum Genet 80, 1014-1023.
Tian, C., Hinds, D.A., Shigeta, R., Kittles, R., Ballinger, D.G., and Seldin, M.F. (2006). A
genomewide single-nucleotide-polymorphism panel with high ancestry information for African
American admixture mapping. Am J Hum Genet 79, 640-649.
Tonjes, A., Koriath, M., Schleinitz, D., Dietrich, K., Bottcher, Y., Rayner, N.W., Almgren, P.,
Enigk, B., Richter, O., Rohm, S., et al. (2009). Genetic variation in GPR133 is associated with
height: genome wide association study in the self-contained population of Sorbs. Hum Mol Genet
18, 4662-4668.
Travis, R.C., Reeves, G.K., Green, J., Bull, D., Tipper, S.J., Baker, K., Beral, V., Peto, R., Bell, J.,
Zelenika, D., et al. (2010). Gene-environment interactions in 7610 women with breast cancer:
prospective evidence from the Million Women Study. Lancet 375, 2143-2151.
Turnbull, C., Shahana, A., Morrison, J., Pernet, D., Renwick, A., Maranian, M., Seal, S.,
Ghoussaini, M., Hines, S., Healey, C.S., et al. (2010). Genome-wide association study identifies
five new breast cancer susceptibility loci. Nat Genet 42, 504-507.
Udler, M.S., Ahmed, S., Healey, C.S., Meyer, K., Struewing, J., Maranian, M., Kwon, E.M.,
Zhang, J., Tyrer, J., Karlins, E., et al. (2010). Fine scale mapping of the breast cancer 16q12 locus.
Hum Mol Genet 19, 2507-2515.
Udler, M.S., Meyer, K.B., Pooley, K.A., Karlins, E., Struewing, J.P., Zhang, J., Doody, D.R.,
MacArthur, S., Tyrer, J., Pharoah, P.D., et al. (2009). FGFR2 variants and breast cancer risk:
fine-scale mapping using African American studies and analysis of chromatin conformation. Hum
Mol Genet 18, 1692-1703.
Verheus, M., Peeters, P.H.M., Rinaldi, S., Dossus, L., Biessy, C., Olsen, A., Tjønneland, A.,
Overvad, K., Jeppesen, M., Clavel-Chapelon, F., et al. (2006). Serum C-peptide levels and breast
cancer risk: Results from the European prospective investigation into cancer and nutrition (EPIC).
International Journal of Cancer 119, 659-667.
Wacholder, S., Hartge, P., Prentice, R., Garcia-Closas, M., Feigelson, H.S., Diver, W.R., Thun,
M.J., Cox, D.G., Hankinson, S.E., Kraft, P., et al. (2010). Performance of common genetic
variants in breast-cancer risk models. N Engl J Med 362, 986-993.
Walker, H.M., and Lev, J. (1953). Statistical Inference, New York: Holt, Rinehart & Winston.
Wang, D.G., Fan, J.B., Siao, C.J., Berno, A., Young, P., Sapolsky, R., Ghandour, G., Perkins, N.,
Winchester, E., Spencer, J., et al. (1998). Large-scale identification, mapping, and genotyping of
single-nucleotide polymorphisms in the human genome. Science 280, 1077-1082.
Waters, K.M., Le Marchand, L., Kolonel, L.N., Monroe, K.R., Stram, D.O., Henderson, B.E., and
Haiman, C.A. (2009). Generalizability of associations from prostate cancer genome-wide
association studies in multiple populations. Cancer Epidemiol Biomarkers Prev 18, 1285-1289.
Waters, K.M., Stram, D.O., Hassanein, M.T., Le Marchand, L., Wilkens, L.R., Maskarinec, G.,
Monroe, K.R., Kolonel, L.N., Altshuler, D., Henderson, B.E., et al. (2010). Consistent association
of type 2 diabetes risk variants found in Europeans in diverse racial and ethnic groups. PLoS
Genet 6.
Weedon, M.N., Lango, H., Lindgren, C.M., Wallace, C., Evans, D.M., Mangino, M., Freathy,
R.M., Perry, J.R., Stevens, S., Hall, A.S., et al. (2008). Genome-wide association analysis
identifies 20 loci that influence adult height. Nat Genet 40, 575-583.
Weedon, M.N., Lettre, G., Freathy, R.M., Lindgren, C.M., Voight, B.F., Perry, J.R., Elliott, K.S.,
Hackett, R., Guiducci, C., Shields, B., et al. (2007). A common variant of HMGA2 is associated
with adult and childhood height in the general population. Nat Genet 39, 1245-1250.
Xu, J., Kibel, A.S., Hu, J.J., Turner, A.R., Pruett, K., Zheng, S.L., Sun, J., Isaacs, S.D., Wiley,
K.E., Kim, S.T., et al. (2009). Prostate cancer risk associated loci in African Americans. Cancer
Epidemiol Biomarkers Prev 18, 2145-2149.
Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders, A.K., Nyholt, D.R., Madden, P.A.,
Heath, A.C., Martin, N.G., Montgomery, G.W., et al. (2010). Common SNPs explain a large
proportion of the heritability for human height. Nat Genet 42, 565-569.
Yang, J., Lee, S.H., Goddard, M.E., and Visscher, P.M. (2011a). GCTA: a tool for genome-wide
complex trait analysis. Am J Hum Genet 88, 76-82.
Yang, J., Manolio, T.A., Pasquale, L.R., Boerwinkle, E., Caporaso, N., Cunningham, J.M., de
Andrade, M., Feenstra, B., Feingold, E., Hayes, M.G., et al. (2011b). Genome partitioning of
genetic variation for complex traits using common SNPs. Nat Genet 43, 519-525.
Yang, J., Weedon, M.N., Purcell, S., Lettre, G., Estrada, K., Willer, C.J., Smith, A.V., Ingelsson,
E., O'Connell, J.R., Mangino, M., et al. (2011c). Genomic inflation factors under polygenic
inheritance. Eur J Hum Genet 19, 807-812.
Yeager, M., Chatterjee, N., Ciampa, J., Jacobs, K.B., Gonzalez-Bosquet, J., Hayes, R.B., Kraft, P.,
Wacholder, S., Orr, N., Berndt, S., et al. (2009). Identification of a new prostate cancer
susceptibility locus on chromosome 8q24. Nat Genet 41, 1055-1057.
Yeager, M., Orr, N., Hayes, R.B., Jacobs, K.B., Kraft, P., Wacholder, S., Minichiello, M.J.,
Fearnhead, P., Yu, K., Chatterjee, N., et al. (2007). Genome-wide association study of prostate
cancer identifies a second risk locus at 8q24. Nat Genet 39, 645-649.
Yoon, D., Kim, Y.J., Cui, W.Y., Van der Vaart, A., Cho, Y.S., Lee, J.Y., Ma, J.Z., Payne, T.J., Li,
M.D., and Park, T. (2011). Large-scale genome-wide association study of Asian population
reveals genetic factors in FRMD4A and other loci influencing smoking initiation and nicotine
dependence. Hum Genet.
Zeegers, M.P., Jellema, A., and Ostrer, H. (2003). Empiric risk of prostate carcinoma for relatives
of patients with prostate carcinoma: a meta-analysis. Cancer 97, 1894-1903.
Zhang, Q., Lewis, C.E., Wagenknecht, L.E., Myers, R.H., Pankow, J.S., Hunt, S.C., North, K.E.,
Hixson, J.E., Jeffrey Carr, J., Shimmin, L.C., et al. (2008). Genome-wide admixture mapping for
coronary artery calcification in African Americans: the NHLBI Family Heart Study. Genet
Epidemiol 32, 264-272.
Zheng, W., Cai, Q., Signorello, L.B., Long, J., Hargreaves, M.K., Deming, S.L., Li, G., Li, C.,
Cui, Y., and Blot, W.J. (2009a). Evaluation of 11 breast cancer susceptibility loci in African-
American women. Cancer Epidemiol Biomarkers Prev 18, 2761-2764.
Zheng, W., Long, J., Gao, Y.T., Li, C., Zheng, Y., Xiang, Y.B., Wen, W., Levy, S., Deming, S.L.,
Haines, J.L., et al. (2009b). Genome-wide association study identifies a new breast cancer
susceptibility locus at 6q25.1. Nat Genet 41, 324-328.
Zuk, O., Hechter, E., Sunyaev, S.R., and Lander, E.S. (2012). The mystery of missing heritability:
Genetic interactions create phantom heritability. Proc Natl Acad Sci U S A 109, 1193-1198.
Appendix
A simple approximation of the number of SNPs needed to explain a certain FRR
Let $s_O$ and $s_F$ denote the total numbers of risk alleles in the offspring and the parent,
respectively. Assume that in total $k$ SNPs are associated with the disease, and let
$n(1), n(2), \ldots, n(k)$ denote the numbers of risk alleles at the $k$ loci. For
simplicity, we assume that these $k$ SNPs have the same risk allele frequency $p$, are
mutually independent, and have the same effect size. Under these assumptions,
$$
n(1), n(2), \ldots, n(k) \overset{\text{iid}}{\sim} \mathrm{Bin}(2, p)
$$
Then the sum of risk alleles, i.e., the aggregate risk score, is asymptotically normally
distributed:
$$s_O \sim \mathrm{Bin}(2k, p) \xrightarrow{\;L\;} N(\mu, \sigma^2),$$
where
$$\mu = 2kp, \qquad \sigma^2 = 2kp(1-p).$$
It is easy to prove that $s_O$ and $s_F$ are asymptotically bivariate normal variates with
covariance matrix
$$\Sigma = \begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix},$$
where $\rho$ is the expected proportion of alleles shared identical by descent (IBD) by the
relatives considered, and $\rho = 0.5$ for first-degree relatives.
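For a parent-offspring pair, the covariance claim can be checked locus by locus. The sketch below is an illustration under added assumptions (random mating, and the auxiliary symbols $a_1$, $a_2$, $t$, $m$ and $I$, which are not part of the original notation): the parent carries alleles $a_1, a_2$, transmits one of them ($t$) with probability 1/2 each, and the other parent contributes an independent allele $m$.

```latex
% Per-locus counts: parent n_F = a_1 + a_2, offspring n_O = t + m,
% with a_1, a_2, m iid Bernoulli(p) and t = I a_1 + (1 - I) a_2, I ~ Bernoulli(1/2).
\begin{aligned}
\operatorname{Cov}(n_O, n_F)
  &= \operatorname{Cov}(t, a_1) + \operatorname{Cov}(t, a_2)
   = \tfrac{1}{2}p(1-p) + \tfrac{1}{2}p(1-p) = p(1-p), \\
\operatorname{Var}(n_O) &= \operatorname{Var}(n_F) = 2p(1-p), \\
\operatorname{Cov}(s_O, s_F)
  &= k\,p(1-p) = \tfrac{1}{2}\cdot 2kp(1-p) = \rho\sigma^2
  \quad\text{with } \rho = \tfrac{1}{2}.
\end{aligned}
```

The last line sums the per-locus covariance over the $k$ independent loci, which recovers the off-diagonal entry of the covariance matrix for first-degree relatives.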
The definition of FRR is
$$\mathrm{FRR} = \frac{P(D_O = 1 \mid D_F = 1)}{P(D_O = 1)}. \tag{1}$$
We try to write the numerator and denominator in terms of the parameters of the
distribution.
Under a logistic regression model, the conditional probability of disease is
$$P(D = 1 \mid S = x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)} \approx \exp(\alpha + \beta x)$$
when $\exp(\alpha + \beta x)$ is small.
Then
$$P(D_O = 1) = \int P(D_O = 1 \mid s_O)\, P(s_O)\, ds_O = \int \exp(\alpha + \beta s_O)\, P(s_O)\, ds_O.$$
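Using the normal moment generating function (MGF), we have
$$\begin{aligned}
P(D_O = 1) &= \int \exp(\alpha + \beta s_O)\, P(s_O)\, ds_O = \exp(\alpha) \int \exp(\beta s_O)\, P(s_O)\, ds_O \\
&= \exp(\alpha)\, E(\exp(\beta s_O)) = \exp\!\left(\alpha + \mu\beta + \frac{\sigma^2\beta^2}{2}\right).
\end{aligned} \tag{2}$$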
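As a rough numerical check of the step from the binomial score to the normal MGF, the following sketch compares the exact binomial MGF of a Bin(2k, p) variable with the normal approximation on the log scale. The function names and the parameter values (k = 100, p = 0.3, β = 0.1) are illustrative assumptions and are not taken from the analyses in this dissertation.

```python
import math


def exact_log_mgf(beta, k, p):
    """Log of E[exp(beta * s)] for s ~ Bin(2k, p), from the exact binomial MGF."""
    return 2 * k * math.log(1 - p + p * math.exp(beta))


def normal_log_mgf(beta, k, p):
    """Log of the normal-approximation MGF: mu*beta + sigma^2*beta^2/2."""
    mu = 2 * k * p
    sigma2 = 2 * k * p * (1 - p)
    return mu * beta + sigma2 * beta ** 2 / 2


# Illustrative values only (assumed for this sketch).
k, p, beta = 100, 0.3, 0.1
gap = exact_log_mgf(beta, k, p) - normal_log_mgf(beta, k, p)
print(f"log-scale gap between exact and approximate MGF: {gap:.4f}")
```

For small per-allele effects the gap is driven by the third cumulant of the binomial, so the approximation would be expected to degrade as β grows or as p moves away from 0.5.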
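The numerator is
$$\begin{aligned}
P(D_O = 1 \mid D_F = 1) &= \int P(D_O = 1 \mid s_O, D_F = 1)\, P(s_O \mid D_F = 1)\, ds_O \\
&= \int P(D_O = 1 \mid s_O)\, P(s_O \mid D_F = 1)\, ds_O \\
&= \int \exp(\alpha + \beta s_O)\, P(s_O \mid D_F = 1)\, ds_O,
\end{aligned} \tag{3}$$
with
$$P(s_O \mid D_F = 1) = \int P(s_O \mid s_F, D_F = 1)\, P(s_F \mid D_F = 1)\, ds_F = \int P(s_O \mid s_F)\, P(s_F \mid D_F = 1)\, ds_F,$$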
where
$$P(s_F \mid D_F = 1) = \frac{P(s_F, D_F = 1)}{P(D_F = 1)} = \frac{P(D_F = 1 \mid s_F)\, P(s_F)}{P(D_F = 1)}.$$
Therefore,
$$P(s_O \mid D_F = 1) = \frac{\int P(s_O \mid s_F)\, P(D_F = 1 \mid s_F)\, P(s_F)\, ds_F}{P(D_F = 1)} = \frac{\int P(s_O, s_F) \exp(\alpha + \beta s_F)\, ds_F}{P(D_F = 1)}, \tag{4}$$
where $P(D_F = 1)$ is a constant,
$$P(D_F = 1) = \exp\!\left(\alpha + \mu\beta + \frac{\sigma^2\beta^2}{2}\right).$$
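From (2)-(4), and letting $C = \exp\!\left(\alpha + \mu\beta + \frac{\sigma^2\beta^2}{2}\right)$, we have
$$P(D_O = 1 \mid D_F = 1) = \frac{\iint \exp(\alpha + \beta s_O) \exp(\alpha + \beta s_F)\, P(s_O, s_F)\, ds_O\, ds_F}{C}. \tag{5}$$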
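Using the MGF of the bivariate normal distribution, with $X = (s_O, s_F)'$,
$\boldsymbol{\mu} = (\mu, \mu)'$ and $\boldsymbol{\beta} = (\beta, \beta)'$, expression (5) becomes
$$\begin{aligned}
P(D_O = 1 \mid D_F = 1) &= \frac{\exp(2\alpha)\, E(\exp(\boldsymbol{\beta}' X))}{C} = \frac{\exp(2\alpha) \exp\!\left(\boldsymbol{\mu}'\boldsymbol{\beta} + \tfrac{1}{2}\boldsymbol{\beta}'\Sigma\boldsymbol{\beta}\right)}{C} \\
&= \frac{\exp(2\alpha) \exp\!\left(2\mu\beta + (1 + \rho)\sigma^2\beta^2\right)}{C}.
\end{aligned} \tag{6}$$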
From (1), (2) and (6), and letting $R$ denote FRR, we finally have
$$R = \exp(\rho\sigma^2\beta^2) = \exp\!\left(2kp(1-p)\rho\beta^2\right),$$
and therefore
$$k = \frac{\log(R)}{2p(1-p)\rho\beta^2}.$$
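The closed form above is easy to evaluate directly. The sketch below implements it together with its inverse; the function names and the example inputs (FRR of 2, risk allele frequency 0.3, per-allele odds ratio 1.1) are assumptions chosen for illustration only and are not estimates reported in this dissertation.

```python
import math


def snps_needed(frr, p, beta, rho=0.5):
    """k = log(R) / (2 p (1 - p) rho beta^2) under the appendix's assumptions:
    k independent SNPs with a common risk allele frequency p and a common
    per-allele log odds ratio beta; rho = 0.5 for first-degree relatives."""
    if frr <= 1 or not 0 < p < 1 or beta == 0 or rho <= 0:
        raise ValueError("require frr > 1, 0 < p < 1, beta != 0 and rho > 0")
    return math.log(frr) / (2 * p * (1 - p) * rho * beta ** 2)


def frr_from_snps(k, p, beta, rho=0.5):
    """Inverse relation: R = exp(2 k p (1 - p) rho beta^2)."""
    return math.exp(2 * k * p * (1 - p) * rho * beta ** 2)


# Hypothetical inputs: FRR = 2, p = 0.3, per-allele OR = 1.1.
k = snps_needed(frr=2.0, p=0.3, beta=math.log(1.1))
print(f"approximate number of SNPs needed: {k:.0f}")
```

Because k scales with 1/β², halving the per-allele log odds ratio quadruples the number of SNPs required to explain the same FRR.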
Supplemental Table 1-1. Genotyping efficiency: genotype call rates, Hardy-Weinberg Equilibrium tests.
Each cell gives the genotype call rates (cases/controls), followed by the HWE p-value (cases/controls).

| SNP | European Americans 538/551¹ | African Americans 576/1031 | Native Hawaiians 149/291 | Japanese 565/555 | Latinos 396/399 |
|---|---|---|---|---|---|
| rs11249433 | 97.0%/97.6%; 0.42/0.25 | 95.1%/100.0%; 0.38/0.86 | 98.0%/98.6%; 0.53/0.11 | 99.3%/99.1%; 0.93/0.92 | 97.0%/97.8%; 0.83/0.26 |
| rs13387042 | 99.4%/99.8%; 0.80/0.65 | 99.8%/99.8%; 0.49/0.64 | 99.3%/98.3%; 0.43/0.20 | 99.1%/99.8%; 1.00/0.95 | 98.7%/99.8%; 0.25/0.21 |
| rs4973768 | 97.0%/96.6%; 0.33/0.32 | 99.3%/100.0%; 0.10/0.63 | 96.0%/98.2%; 0.06/0.67 | 96.3%/98.6%; 0.68/0.46 | 97.0%/97.8%; 0.31/0.63 |
| rs10941679 | 99.6%/98.7%; 0.69/0.65 | 95.2%/98.5%; 0.94/0.14 | 97.3%/99.0%; 0.65/0.25 | 99.0%/99.4%; 0.03/0.12 | 99.2%/99.5%; 0.99/0.98 |
| rs889312 | 98.0%/98.4%; 0.31/0.62 | 99.6%/99.6%; 0.37/0.44 | 94.6%/96.5%; 0.98/0.72 | 97.5%/98.6%; 0.28/0.62 | 97.0%/99.0%; 0.54/0.57 |
| rs2046210 | 98.0%/97.5%; 0.79/0.88 | 98.6%/95.1%; 0.75/0.041 | 96.0%/96.2%; 0.22/0.03 | 96.1%/96.8%; 0.20/0.82 | 96.7%/97.5%; 0.28/0.33 |
| rs13281615 | 98.9%/98.5%; 0.34/0.98 | 99.0%/99.2%; 0.75/0.95 | 95.0%/98.6%; 0.58/0.22 | 98.4%/97.7%; 0.81/0.13 | 97.2%/99.2%; 0.07/0.12 |
| rs2981582 | 96.8%/97.5%; 0.16/0.15 | 99.7%/99.9%; 0.62/0.44 | 95.0%/98.2%; 0.64/0.45 | 98.6%/98.6%; 0.52/0.37 | 97.7%/97.8%; 0.90/0.39 |
| rs2981578 | 96.8%/98.4%; 0.15/0.28 | 95.9%/98.3%; 0.42/0.29 | 95.3%/97.6%; 0.36/0.18 | 96.8%/98.4%; 0.15/0.28 | 95.2%/95.5%; 0.21/0.99 |
| rs3817198 | 95.4%/96.6%; 0.07/0.86 | 99.5%/99.7%; 0.57/0.84 | 95.1%/96.2%; 0.16/0.44 | 97.0%/96.8%; 0.43/0.80 | 98.0%/97.0%; 0.13/0.63 |
| rs10483813 | 97.2%/96.4%; 0.35/0.47 | 96.7%/98.4%; 0.31/0.72 | 98.7%/97.9%; 0.71/0.74 | 98.4%/97.7%; 0.71/0.67 | 98.2%/97.3%; 0.63/0.92 |
| rs3803662 | 98.9%/98.7%; 0.83/0.13 | 99.5%/99.9%; 0.90/0.84 | 98.0%/99.3%; 0.28/0.20 | 97.5%/98.4%; 0.03/0.72 | 98.0%/99.5%; 0.56/0.63 |
| rs6504950 | 98.9%/98.0%; 0.14/0.23 | 95.7%/98.4%; 0.79/0.28 | 96.0%/98.6%; 0.007/0.27 | 99.3%/98.9%; 0.82/0.82 | 98.2%/97.8%; 0.19/0.53 |

¹ Number of cases/controls (total: 2,224 cases/2,827 controls in the study).
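To make the HWE column concrete, the sketch below computes a 1-d.f. chi-square goodness-of-fit p-value from genotype counts. This is an assumption about the form of the test: the table does not state whether a chi-square or an exact test was used, and the function name and example counts are hypothetical.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """1-d.f. chi-square test of Hardy-Weinberg proportions from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts, not taken from the study data.
print(hwe_chisq_pvalue(280, 220, 50))
```

With small genotype counts (for example, rare homozygotes in the smaller populations) an exact test would usually be preferred over this approximation.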
Supplemental Table 1-2. Published effect sizes and frequencies of validated breast cancer risk variants.

| SNP (position¹) | Chromosome band | Nearest genes | Allele² tested (frequency in controls) | Per allele OR | Reference |
|---|---|---|---|---|---|
| rs11249433 (120982136) | 1p11.2 | | G (0.43) | 1.16³ | 5 |
| rs13387042 (217614077) | 2q35 | | A (0.50) | 1.20 | 2 |
| rs4973768 (27391017) | 3p24.1 | NEK10/SLC4A7 | T (0.46) | 1.11 | 6 |
| rs10941679 (44742255) | 5p12 | MRPS30 | G (0.24) | 1.19 | 3 |
| rs889312 (56067641) | 5q11.2 | MAP3K1 | C (0.28) | 1.13 | 1 |
| rs2046210 (151990059) | 6q25.1 | C6orf97 | A (0.35) | 1.29 | 4 |
| rs13281615 (128424800) | 8q24.2 | | G (0.40) | 1.08 | 1 |
| rs2981582 (123342307) | 10q26.1 | FGFR2 | A (0.38) | 1.26 | 1 |
| rs3817198 (1865582) | 11p15.5 | LSP1 | C (0.30) | 1.07 | 1 |
| rs999737⁴ (68104435) | 14q24.1 | RAD51L1 | C (0.76) | 1.06³ | 5 |
| rs3803662 (51143842) | 16q12.1 | TNRC9 | A (0.25) | 1.20 | 1 |
| rs6504950 (50411470) | 17q23.2 | COX11 | G (0.73) | 1.05 | 6 |

¹ Position from NCBI genome build 36.
² Alleles are expressed in the forward (+) strand of the reference human genome sequence (NCBI build 36).
³ OR for heterozygotes.
⁴ r² = 1 with rs10483813 (position: 68101037) in HapMap CEU (phase II).
Supplemental Table 1-3. Population-specific linear effects of the risk allele summary score.
Cells give OR (95% CI)¹ by quintile of the summary score.

| Quintile | European Americans 534/541² | African Americans 539/1022 | Native Hawaiians 146/287 | Japanese 561/550 | Latinos 391/395 | Pooled 2171/2795 |
|---|---|---|---|---|---|---|
| 1 | 1.00 (ref.) | 1.00 (ref.) | 1.00 (ref.) | 1.00 (ref.) | 1.00 (ref.) | 1.00 (ref.) |
| 2 | 1.61 (1.02-2.56) | 0.89 (0.58-1.36) | 1.70 (0.77-3.75) | 0.77 (0.50-1.18) | 1.09 (0.60-2.00) | 1.06 (0.85-1.32) |
| 3 | 1.63 (1.04-2.56) | 0.80 (0.54-1.17) | 2.12 (0.97-4.64) | 1.01 (0.67-1.53) | 1.33 (0.76-2.33) | 1.16 (0.94-1.43) |
| 4 | 2.01 (1.35-2.97) | 0.89 (0.63-1.26) | 2.03 (1.02-4.06) | 1.34 (0.72-1.95) | 1.40 (0.85-2.32) | 1.37 (1.14-1.65) |
| 5 | 2.00 (1.31-3.05) | 1.09 (0.76-1.56) | 4.60 (2.10-10.07) | 1.33 (0.86-2.05) | 1.90 (1.14-3.16) | 1.61 (1.32-1.97) |
| p for trend¹ | 7.3×10⁻⁴ | 0.48 | 5.0×10⁻⁴ | 7.3×10⁻³ | 5.1×10⁻³ | 2.4×10⁻⁸ |

¹ OR and p for trend adjusted for age (quartiles), the first 10 eigenvectors from principal component analysis, and race (in pooled analysis).
² Number of cases/controls.
Supplemental Table 1-4. Frequencies of risk alleles and associations with breast cancer risk by population in the MEC.
Each population cell gives the OR (95% CI)², followed by the risk allele frequency in controls.

| SNP | Chr. band | Allele¹ tested | European Americans 534/541³ | African Americans 539/1022 | Native Hawaiians 146/287 | Japanese 561/550 | Latinos 391/395 | Pooled 2171/2795 | p-value | p_int⁴ |
|---|---|---|---|---|---|---|---|---|---|---|
| rs11249433 | 1p11.2 | G | 0.99 (0.83-1.19); 0.41 | 0.87 (0.69-1.10); 0.15 | 1.67 (1.13-2.48); 0.19 | 0.97 (0.59-1.60); 0.031 | 0.99 (0.78-1.25); 0.25 | 1.01 (0.90-1.12); 0.20 | 0.92 | 0.14 |
| rs13387042 | 2q35 | A | 1.22 (1.02-1.45); 0.52 | 1.11 (0.94-1.31); 0.71 | 0.87 (0.60-1.26); 0.27 | 1.10 (0.85-1.43); 0.12 | 1.17 (0.95-1.44); 0.38 | 1.13 (1.03-1.24); 0.46 | 0.0073 | 0.58 |
| rs4973768 | 3p24.1 | T | 1.06 (0.88-1.26); 0.48 | 1.01 (0.86-1.18); 0.37 | 1.24 (0.87-1.78); 0.21 | 1.07 (0.87-1.33); 0.19 | 1.07 (0.87-1.31); 0.44 | 1.04 (0.95-1.14); 0.37 | 0.38 | 0.95 |
| rs10941679 | 5p12 | G | 1.21 (1.00-1.47); 0.25 | 1.10 (0.91-1.32); 0.20 | 1.31 (0.95-1.80); 0.46 | 1.14 (0.96-1.36); 0.53 | 1.15 (0.93-1.42); 0.33 | 1.16 (1.06-1.27); 0.32 | 0.0013 | 0.87 |
| rs889312 | 5q11.2 | C | 1.03 (0.86-1.23); 0.33 | 1.07 (0.92-1.26); 0.32 | 1.18 (0.86-1.62); 0.57 | 1.11 (0.94-1.32); 0.57 | 1.11 (0.91-1.37); 0.42 | 1.09 (1.00-1.18); 0.41 | 0.050 | 0.94 |
| rs2046210 | 6q25.1 | A | 1.06 (0.89-1.28); 0.35 | 1.02 (0.87-1.18); 0.58 | 1.15 (0.84-1.56); 0.50 | 1.14 (0.94-1.38); 0.26 | 0.97 (0.77-1.23); 0.27 | 1.07 (0.98-1.16); 0.42 | 0.14 | 0.74 |
| rs13281615 | 8q24.2 | G | 0.99 (0.83-1.18); 0.44 | 1.10 (0.95-1.28); 0.43 | 1.62 (1.18-2.22); 0.37 | 1.07 (0.90-1.26); 0.61 | 1.22 (1.00-1.49); 0.59 | 1.11 (1.02-1.21); 0.48 | 0.012 | 0.092 |
| rs2981582 | 10q26.1 | A | 1.40 (1.05-1.86); 0.37 | 0.88 (0.75-1.03); 0.47 | 0.89 (0.56-1.41); 0.23 | 1.10 (0.87-1.39); 0.26 | 1.11 (0.78-1.57); 0.40 | 1.02 (0.91-1.14); 0.37 | 0.76 | 0.071 |
| rs2981578 | 10q26.1 | C | 1.05 (0.79-1.40); 0.50 | 1.48 (1.17-1.87); 0.81 | 1.45 (0.94-2.23); 0.38 | 1.10 (0.90-1.35); 0.52 | 1.06 (0.75-1.48); 0.49 | 1.23 (1.10-1.38); 0.60 | 3.4×10⁻⁴ | 0.27 |
| rs3817198 | 11p15.5 | C | 1.17 (0.97-1.41); 0.29 | 1.13 (0.93-1.37); 0.18 | 1.07 (0.73-1.55); 0.19 | 0.90 (0.71-1.15); 0.16 | 1.11 (0.87-1.42); 0.20 | 1.09 (0.99-1.21); 0.20 | 0.080 | 0.59 |
| rs10483813 | 14q24.1 | T | 1.09 (0.89-1.35); 0.77 | 0.98 (0.75-1.27); 0.91 | 1.07 (0.59-1.94); 0.92 | 1.14 (0.74-1.76); 0.96 | 1.18 (0.90-1.55); 0.81 | 1.10 (0.97-1.25); 0.88 | 0.14 | 0.83 |
| rs3803662 | 16q12.1 | A | 1.25 (1.04-1.50); 0.30 | 0.89 (0.77-1.04); 0.50 | 1.03 (0.76-1.40); 0.44 | 1.07 (0.91-1.27); 0.51 | 1.27 (1.03-1.56); 0.37 | 1.07 (0.99-1.16); 0.44 | 0.10 | 0.034 |
| rs6504950 | 17q23.2 | G | 0.93 (0.76-1.13); 0.73 | 1.08 (0.92-1.26); 0.64 | 1.11 (0.77-1.58); 0.69 | 1.25 (0.85-1.83); 0.94 | 0.92 (0.71-1.20); 0.82 | 1.02 (0.92-1.13); 0.75 | 0.69 | 0.45 |

¹ Alleles are expressed in the forward (+) strand of the reference human genome sequence (NCBI build 36).
² OR adjusted for age (quartiles), the first 10 eigenvectors from principal component analysis, and race (in pooled analysis).
³ Number of cases/controls.
⁴ p-value for interaction between each SNP and ethnic group (4 df).
Supplemental Table 1-5. The summary associations of validated breast cancer risk variants in diverse populations by ER status.

| | European Americans | African Americans | Native Hawaiians | Japanese | Latinos | Pooled¹ | p_int² |
|---|---|---|---|---|---|---|---|
| ER positive: No. of cases/controls | 370/541 | 223/1022 | 113/287 | 428/550 | 215/395 | 1349/2795 | ---- |
| ER positive: OR³ (95% CI) | 1.13 (1.06-1.19) | 1.03 (0.96-1.10) | 1.18 (1.05-1.34) | 1.09 (1.02-1.17) | 1.11 (1.03-1.21) | 1.10 (1.06-1.13) | ---- |
| ER positive: p³ | 5.3×10⁻⁵ | 0.49 | 6.1×10⁻³ | 0.011 | 8.8×10⁻³ | 5.9×10⁻⁹ | 0.12 |
| ER negative: No. of cases/controls | 78/541 | 91/1022 | 22/287 | 68/550 | 67/395 | 326/2795 | ---- |
| ER negative: OR³ (95% CI) | 0.97 (0.87-1.08) | 1.02 (0.92-1.14) | 1.17 (0.91-1.50) | 1.05 (0.92-1.21) | 1.12 (0.98-1.27) | 1.05 (0.99-1.11) | ---- |
| ER negative: p³ | 0.56 | 0.67 | 0.23 | 0.45 | 0.092 | 0.11 | 0.39 |
| p_het⁴ | 9.0×10⁻³ | 0.35 | 0.83 | 0.88 | 0.87 | 0.18 | ---- |

¹ ER information was available on 1675 out of 2171 cases (77%).
² p value for interaction between the summary score and ethnic group (4 df).
³ OR and p value adjusted for age (quartiles), the first 10 eigenvectors from principal component analysis, and race (in pooled analysis).
⁴ p value for heterogeneity between ER positive and ER negative cases.
Supplemental Table 1-6. Testing for deviations in gene dosage effects (i.e. dominant and recessive effects) in the aggregate risk score model.

| SNP | Allele¹ tested | p-value² | p-value for heterozygotes | p-value for homozygotes |
|---|---|---|---|---|
| rs11249433 | G | 0.18 | 0.69 | 0.06 |
| rs13387042 | A | 0.57 | 0.54 | 0.29 |
| rs4973768 | T | 0.51 | 0.49 | 0.25 |
| rs10941679 | G | 0.41 | 0.41 | 0.19 |
| rs889312 | C | 0.77 | 0.54 | 0.99 |
| rs2046210 | A | 0.67 | 0.80 | 0.53 |
| rs13281615 | G | 0.89 | 0.98 | 0.68 |
| rs2981582 | A | 0.31 | 0.25 | 0.15 |
| rs3817198 | C | 0.48 | 0.58 | 0.35 |
| rs10483813 | T | 0.86 | 0.59 | 0.61 |
| rs3803662 | A | 0.73 | 0.43 | 0.67 |
| rs6504950 | G | 0.47 | 0.89 | 0.46 |

¹ Alleles are expressed in the forward (+) strand of the reference human genome sequence (NCBI build 36).
² p-value from a Wald test (2 df, SNP genotypes coded as dummy variables).
Supplemental Table 3-1. Descriptive characteristics of the 9 studies of breast cancer in African Americans.

| Characteristic | MEC | CARE | WCHS | SFBCS | NC-BCFR | CBCS | PLCO | NBHS | WFBC | TOTAL |
|---|---|---|---|---|---|---|---|---|---|---|
| No. Cases/No. Controlsᵃ | 694/990 | 357/215 | 261/239 | 165/220 | 424/50 | 635/589 | 56/116 | 304/182 | 120/144 | 3,016/2,745 |
| Age, median yearsᵇ | 67/68 | 49/48 | 51/51 | 54/54 | 51/50 | 50/50 | 68/67 | 54/52 | 54/55 | 55/58 |
| First degree family history of breast cancer, %: Yes | 21/13 | 11/8 | 16/8 | 9/12 | 31/12 | 15/11 | 14/8 | 19/10 | 17/10 | 17/11 |
| First degree family history of breast cancer, %: No | 74/81 | 84/90 | 84/92 | 91/88 | 69/88 | 82/85 | 84/89 | 80/90 | 83/90 | 80/87 |
| Stage (based on SEER or AJCC coding), n(%): Localized | 458(66) | 177(50) | 11(4) | 105(64) | 232(55) | 341(54) | 12(21) | 113(37) | 36(30) | 1485(49) |
| Stage (based on SEER or AJCC coding), n(%): Advanced | 215(31) | 177(50) | 12(5) | 56(34) | 138(33) | 258(41) | 8(14) | 67(22) | 24(20) | 955(22) |
| ER status, n(%): Positive | 408(59) | 183(51) | 131(51) | 84(51) | 219(52) | 272(43) | 14(25) | 143(47) | 66(55) | 1520(50) |
| ER status, n(%): Negative | 176(25) | 130(36) | 80(31) | 50(30) | 121(29) | 317(50) | 6(11) | 65(21) | 43(36) | 988(33) |
| PR status, n(%): Positive | 292(42) | 144(40) | 105(40) | 76(46) | 196(46) | 252(40) | 13(23) | 113(37) | 48(40) | 1239(41) |
| PR status, n(%): Negative | 221(32) | 120(34) | 107(41) | 58(35) | 141(33) | 335(53) | 7(13) | 94(31) | 61(51) | 1144(38) |

Values separated by a slash are cases/controls.
ᵃ Numbers of cases and controls included in the analysis.
ᵇ For PLCO, age is determined based on the midpoint of 5-year age groups.
Note: not all values add up to 100% due to missing values.
Supplemental Table 3-2. Statistical power to detect associations with the known risk variants in African Americans.

| SNP | Chr./Nearest gene | Per allele OR from GWAS [reference] | Allele tested/Risk allele frequency in African Americans | Power: all cases/controls (3,016/2,745) | Power: ER+ cases/controls (1,520/2,745) | Power: ER- cases/controls (988/2,745) |
|---|---|---|---|---|---|---|
| rs11249433 | 1p11 | 1.16ᵃ (61) | G/0.13 | 78% | 62% | 50% |
| rs13387042 | 2q35 | 1.12 (89) | A/0.72 | 77% | 60% | 86% |
| rs4973768 | 3p24/NEK10 | 1.11 (62) | T/0.36 | 77% | 60% | 48% |
| rs4415084 | 5p12/MRPS30 | 1.16 (60) | T/0.63 | 97% | 88% | 77% |
| rs889312 | 5q11/MAP3K1 | 1.13 (58) | C/0.34 | 88% | 73% | 60% |
| rs2046210 | 6q25/C6orf97 | 1.29 (35) | A/0.60 | >99% | >99% | >99% |
| rs13281615 | 8q24 | 1.08 (58) | G/0.43 | 53% | 39% | 31% |
| rs1011970 | 9p21/CDKN2BAS | 1.09ᵇ (83) | T/0.33 | 59% | 43% | 34% |
| rs865686 | 9q31/KLF4 | 1.12 (85) | T/0.52 | 86% | 70% | 57% |
| rs2380205 | 10p15/ANKRD16 | 1.06ᵇ (83) | T/0.58 | 34% | 24% | 19% |
| rs10995190 | 10q21/ZNF365 | 1.16ᵇ (83) | G/0.83 | 83% | 66% | 54% |
| rs704010 | 10q22/ZMIZ1 | 1.07ᵇ (83) | T/0.11 | 21% | 16% | 13% |
| rs2981582 | 10q26/FGFR2 | 1.26 (58) | A/0.46 | >99% | >99% | >99% |
| rs3817198 | 11p15/LSP1 | 1.07 (58) | C/0.17 | 28% | 20% | 16% |
| rs614367 | 11q13 | 1.15ᵇ (83) | T/0.13 | 73% | 56% | 45% |
| rs999737 | 14q24/RAD51L1 | 1.06ᵃ (61) | T/0.051 | 11% | 9% | 8% |
| rs3803662 | 16q12/TNRC9 | 1.20 (58) | A/0.51 | >99% | 98% | 93% |
| rs6504950 | 17q23/COX11 | 1.05 (62) | G/0.66 | 23% | 17% | 14% |
| rs2363956 | 19p13/ANKRD41 | 1.20ᶜ (84) | T/0.49 | >99% | 98% | 93% |

Statistical power to detect the OR reported in GWAS based on the RAF in African Americans. Power estimated using Quanto (http://hydra.usc.edu/gxe/).
ᵃ OR reported for heterozygotes.
ᵇ OR reported from stage 2 of GWAS.
ᶜ OR reported for ER- breast cancer.
Supplemental Table 3-3. Associations of common variants at known breast cancer risk loci by ER status in African Americans.

| Chr., Marker | Position, Risk/ref allele | RAF (CEU/AA)ᵃ | 1,520 ER+ cases, 2,745 controls: OR (95% CI), P_trend | 988 ER- cases, 2,745 controls: OR (95% CI), P_trend | P_het |
|---|---|---|---|---|---|
| 1p11, rs11249433 | 120982136, G/A | 0.43/0.13 | 1.01 (0.88-1.17), 0.85 | 1.09 (0.92-1.30), 0.30 | 0.89 |
| 2q35, rs13387042 | 217614077, A/G | 0.56/0.72 | 1.22 (1.10-1.35), 2.6×10⁻⁴ | 1.00 (0.88-1.12), 1.00 | 0.013 |
| 3p24, rs4973768 | 27391017, T/C | 0.44/0.36 | 1.09 (0.99-1.20), 0.076 | 0.98 (0.88-1.10), 0.74 | 0.092 |
| 5p12, rs4415084 | 44698272, T/C | 0.38/0.63 | 1.03 (0.94-1.13), 0.54 | 0.97 (0.87-1.08), 0.60 | 0.45 |
| 5q11, rs889312 | 56067641, C/A | 0.30/0.34 | 1.09 (0.98-1.19), 0.12 | 1.06 (0.94-1.19), 0.34 | 0.64 |
| 6q25, rs2046210ᵇ,ᶜ | 151990059, A/G | 0.38/0.60 | 0.97 (0.88-1.06), 0.46 | 1.07 (0.96-1.20), 0.21 | 0.09 |
| 8q24, rs13281615 | 128424800, G/A | 0.45/0.43 | 1.05 (0.95-1.15), 0.34 | 1.03 (0.92-1.15), 0.59 | 0.77 |
| 9p21, rs1011970 | 22052134, T/G | 0.17/0.33 | 1.06 (0.96-1.17), 0.23 | 0.94 (0.84-1.05), 0.28 | 0.10 |
| 9q31, rs865686 | 109928199, T/G | 0.61/0.52 | 1.06 (0.97-1.16), 0.22 | 1.15 (1.03-1.29), 0.011 | 0.17 |
| 10p15, rs2380205 | 5926740, C/T | 0.52/0.42 | 0.95 (0.86-1.04), 0.27 | 0.98 (0.88-1.09), 0.68 | 0.51 |
| 10q21, rs10995190 | 63948688, G/A | 0.87/0.83 | 1.00 (0.88-1.12), 1.00 | 0.92 (0.80-1.06), 0.28 | 0.33 |
| 10q22, rs704010 | 80511154, T/C | 0.43/0.11 | 0.99 (0.85-1.16), 0.93 | 1.01 (0.83-1.22), 0.95 | 0.71 |
| 10q26, rs2981582 | 123342307, A/G | 0.46/0.46 | 1.05 (0.96-1.15), 0.27 | 1.14 (1.02-1.27), 0.020 | 0.24 |
| 11p15, rs3817198 | 1865582, C/T | 0.33/0.17 | 0.99 (0.88-1.12), 0.89 | 0.90 (0.78-1.05), 0.17 | 0.16 |
| 11q13, rs614367 | 69037945, T/C | 0.18/0.13 | 1.00 (0.87-1.14), 1.00 | 0.91 (0.77-1.07), 0.25 | 0.44 |
| 14q24, rs999737 | 68104435, T/C | 0.26/0.051 | 1.01 (0.83-1.26), 0.86 | 0.91 (0.70-1.19), 0.49 | 0.44 |
| 16q12, rs3803662 | 51143842, A/G | 0.25/0.51 | 0.98 (0.89-1.08), 0.63 | 1.00 (0.89-1.11), 0.95 | 0.63 |
| 17q23, rs6504950ᵇ | 50411470, G/A | 0.70/0.66 | 1.02 (0.93-1.12), 0.64 | 1.07 (0.96-1.20), 0.22 | 0.54 |
| 19p13, rs2363956 | 17255124, T/G | 0.45/0.49 | 1.12 (1.02-1.22), 0.016 | 1.14 (1.02-1.27), 0.018 | 0.99 |

SNP positions are based on NCBI build 36.
ORs are adjusted for age, study, the first 10 eigenvectors and local ancestry at each risk locus. P_trend, P-value from test of trend (1-d.f.).
P_het, P-value for heterogeneity by ER status from case-only test.
ᵃ RAF, risk allele frequency in original GWAS population (HapMap CEU, CHB for rs2046210), and in African Americans (AA) in this study. This is the allele associated with increased risk in previous GWAS.
ᵇ Imputed SNPs.
ᶜ Index signal reported in Han Chinese. RAF based on HapMap CHB population.
Supplemental Table 3-4. The association of local ancestry surrounding the index signal at each risk locus and breast cancer risk.
Cells give the OR per European chromosome (P-value).

| Chr. | 3,016 cases, 2,745 controls | 1,520 ER+ cases, 2,745 controls | 988 ER- cases, 2,745 controls |
|---|---|---|---|
| 1p11 | 1.00 (0.95) | 0.96 (0.60) | 1.12 (0.23) |
| 2q35 | 1.01 (0.91) | 0.99 (0.94) | 1.06 (0.48) |
| 3p24 | 0.94 (0.22) | 0.92 (0.17) | 0.93 (0.32) |
| 5p12 | 0.89 (0.04) | 0.88 (0.086) | 0.90 (0.23) |
| 5q11 | 0.91 (0.11) | 0.93 (0.34) | 0.87 (0.11) |
| 6q25 | 1.11 (0.056) | 1.19 (6.2×10⁻³) | 0.98 (0.77) |
| 8q24 | 0.88 (0.035) | 0.88 (0.058) | 0.92 (0.32) |
| 9p21 | 1.00 (0.93) | 1.02 (0.79) | 1.04 (0.63) |
| 9q31 | 1.04 (0.41) | 1.04 (0.49) | 1.06 (0.44) |
| 10p15 | 1.11 (0.034) | 1.07 (0.25) | 1.16 (0.045) |
| 10q21 | 1.08 (0.27) | 1.00 (0.98) | 1.21 (0.074) |
| 10q22 | 0.93 (0.19) | 0.94 (0.39) | 0.91 (0.25) |
| 10q26 | 0.92 (0.11) | 0.85 (0.011) | 1.06 (0.44) |
| 11p15 | 1.05 (0.35) | 1.02 (0.81) | 1.05 (0.53) |
| 11q13 | 1.11 (0.13) | 1.19 (0.035) | 1.05 (0.63) |
| 14q24 | 0.98 (0.72) | 1.01 (0.86) | 0.87 (0.065) |
| 16q12 | 0.94 (0.25) | 0.97 (0.63) | 0.96 (0.60) |
| 17q23 | 0.91 (0.083) | 0.92 (0.20) | 0.86 (0.055) |
| 19p13 | 0.98 (0.68) | 0.93 (0.28) | 1.04 (0.59) |

Local ancestry was estimated at each risk locus using HAPMIX (see Methods). ORs for local ancestry and P-values are adjusted for study, age and the first 10 eigenvectors.
Supplemental Table 3-5. Information about the 19 regions fine-mapped.

| Chr. | Index SNP | Position (NCBI Build 36) | Start | Stop | Sizeᵃ (Kb) | No. of common SNPs (MAF ≥0.05) correlated with index at r² ≥0.2ᵇ | No. of tagsᶜ for correlated SNPs (r² ≥0.8) | No. of common SNPs (MAF ≥0.05) in Phase 2 HapMap (YRI) | No. of tagsᶜ for all common SNPs (MAF ≥0.05) (r² ≥0.8) | No. of typed and imputed SNPs tested in cases and controls |
|---|---|---|---|---|---|---|---|---|---|---|
| 1p11 | rs11249433 | 120982136 | 120732136 | 121232136 | 500 | 27 | 8 | 67 | 19 | 75 |
| 2q35 | rs13387042 | 217614077 | 217364077 | 217864077 | 500 | 75 | 24 | 522 | 250 | 582 |
| 3p24 | rs4973768 | 27391017 | 27013344 | 27641017 | 628 | 188 | 36 | 571 | 117 | 743 |
| 5p12 | rs4415084 | 44698272 | 44432110 | 45432110 | 1000 | 177 | 16 | 467 | 88 | 610 |
| 5q11 | rs889312 | 56067641 | 55817641 | 56317641 | 500 | 120 | 30 | 557 | 232 | 680 |
| 6q25 | rs2046210ᵇ | 151990059 | 151740059 | 152240059 | 500 | 46 | 14 | 519 | 206 | 588 |
| 8q24 | rs13281615 | 128424800 | 126000000 | 130000000 | 4000 | 74 | 20 | 3157 | 1278 | 5144 |
| 9p21 | rs1011970 | 22052134 | 21802134 | 22302134 | 500 | 20 | 7 | 457 | 200 | 561 |
| 9q31 | rs865686 | 109928299 | 109678299 | 110178299 | 500 | 60 | 29 | 672 | 114 | 827 |
| 10p15 | rs2380205 | 5926740 | 5676740 | 6176740 | 500 | 64 | 20 | 553 | 252 | 775 |
| 10q21 | rs10995190 | 63948688 | 63698688 | 64198688 | 500 | 8 | 3 | 486 | 187 | 614 |
| 10q22 | rs704010 | 80511154 | 80261154 | 80761154 | 500 | 40 | 17 | 607 | 347 | 778 |
| 10q26 | rs2981582 | 123342307 | 123092307 | 123592307 | 500 | 13 | 9 | 550 | 286 | 653 |
| 11p15 | rs3817198 | 1865582 | 1615582 | 2150173 | 535 | 5 | 5 | 349 | 208 | 526 |
| 11q13 | rs614367 | 69037945 | 68787945 | 69287945 | 500 | 22 | 9 | 490 | 260 | 632 |
| 14q24 | rs999737 | 68104435 | 67854435 | 68354435 | 500 | 56 | 18 | 626 | 321 | 800 |
| 16q12 | rs3803662 | 51143842 | 50893842 | 51393842 | 500 | 25 | 6 | 467 | 177 | 576 |
| 17q23 | rs6504950 | 50411470 | 50161470 | 50661470 | 500 | 143 | 21 | 551 | 202 | 694 |
| 19p13 | rs2363956 | 17255124 | 17000704 | 17505124 | 504 | 14 | 7 | 253 | 130 | 359 |

Significance thresholds (one value spanning all regions in the original table): α = 0.05/average No. of tags in each region = 3.2×10⁻³ for SNPs correlated with the index; α = 0.05/total No. of tags in 19 regions = 1.0×10⁻⁵ for all common SNPs.
ᵃ Target region was 250kb on either side of the index SNP. The region was expanded in cases where the LD block containing the index SNP was >250kb.
ᵇ LD estimates based on HapMap CEU population (Phase 2, release 21) except for 6q25, where LD estimates were based on HapMap CHB population (Phase 2, release 21).
ᶜ Multi-allele tagging using Tagger (120) as implemented in Haploview version 4.1 (121).
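The two α levels in this table can be reproduced from the tag counts. The sketch below does so with the per-region tag counts copied from the two tag columns above, in row order from 1p11 to 19p13; the variable names are illustrative.

```python
# Tags for SNPs correlated with the index signal (r^2 >= 0.2), one per region.
tags_correlated = [8, 24, 36, 16, 30, 14, 20, 7, 29, 20,
                   3, 17, 9, 5, 9, 18, 6, 21, 7]
# Tags for all common SNPs in each region.
tags_all = [19, 250, 117, 88, 232, 206, 1278, 200, 114, 252,
            187, 347, 286, 208, 260, 321, 177, 202, 130]

# Threshold for correlated markers: 0.05 / average number of tags per region.
alpha_correlated = 0.05 / (sum(tags_correlated) / len(tags_correlated))
# Threshold for all other markers: 0.05 / total number of tags in the 19 regions.
alpha_all = 0.05 / sum(tags_all)

print(f"{alpha_correlated:.1e}", f"{alpha_all:.1e}")
```

Both values round to the thresholds stated in the table, which are the same cut-offs applied in the stepwise procedure.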
Supplemental Table 3-6. Results of the stepwise procedure.
3,016 cases, 2,745 controls

| Region | SNPs Selected | Correlated with Index (r² >0.2) | P_trend |
|---|---|---|---|
| 2q35 | rs13000023 | Yes | 5.8×10⁻⁴ |
| 5q11 | rs16886165 | Yes | 6.5×10⁻⁴ |
| 10q22 | rs12355688 | No | 6.8×10⁻⁶ |
| 10q26 | rs2981578 | Yes | 1.7×10⁻⁴ |
| 11q13 | rs609275 | No | 1.0×10⁻⁵ |
| 16q12 | rs3112572ᵇ | No | 3.9×10⁻⁴ |
| 19p13 | rs3745185 | Yes | 3.7×10⁻⁵ |

P_trend, P-value based on test of trend (1-d.f.) from logistic regression models adjusted for age, study, the first 10 eigenvectors and local ancestry. In each region, markers correlated with the index signal were kept in the model if P_trend <3.2×10⁻³; all other markers were kept if P_trend <1.0×10⁻⁵ (see Table S4 and Methods).
Supplemental Table 3-7. Results of the stepwise procedure by ER status.

| Region | ER+ (1,520 cases, 2,745 controls): SNPs Selected | Correlated with Index (r² >0.2) | P_trend | ER- (988 cases, 2,745 controls): SNPs Selected | Correlated with Index (r² >0.2) | P_trend |
|---|---|---|---|---|---|---|
| 2q35 | rs12998806 | Yes | 3.3×10⁻⁶ | ---- | | |
| 5q11 | ---- | | | rs832529 | Yes | 1.3×10⁻³ |
| 8q24 | rs16902056 | No | 6.7×10⁻⁶ | ---- | | |
| 10q26 | rs2981578 | Yes | 2.9×10⁻⁴ | rs2912774 | Yes | 2.7×10⁻³ |
| 16q12 | rs3112572 | No | 3.1×10⁻⁵ | ---- | | |
| 19p13 | rs3745185 | Yes | 8.2×10⁻⁴ | rs11668840 | Yes | 5.1×10⁻⁵ |

P_trend, P-value based on test of trend (1-d.f.) from logistic regression models adjusted for age, study, the first 10 eigenvectors and local ancestry. In each region, markers correlated with the index signal were kept in the model if P_trend <3.2×10⁻³; all other markers were kept if P_trend <1.0×10⁻⁵ (see Table S4 and Methods).
Supplemental Table 3-8. Associations with common variants at known breast cancer risk regions in African Americans by ER status.

Index SNP from GWAS (3,016 cases, 2,745 controls)

| Chr. | Nearest gene | Marker | Position | Risk/ref alleles | RAF (CEU/AA)ᵃ | OR (95% CI) | P_trend |
|---|---|---|---|---|---|---|---|
| 1p11 | | rs11249433 | 120982136 | G/A | 0.43/0.13 | 1.01 (0.90-1.14) | 0.84 |
| 2q35 | | rs13387042 | 217614077 | A/G | 0.56/0.72 | 1.12 (1.03-1.21) | 7.5×10⁻³ |
| 3p24 | NEK10 | rs4973768 | 27391017 | T/C | 0.44/0.36 | 1.04 (0.96-1.13) | 0.32 |
| 5p12 | MRPS30 | rs4415084 | 44698272 | T/C | 0.38/0.63 | 1.02 (0.95-1.11) | 0.54 |
| 5q11 | MAP3K1 | rs889312 | 56067641 | C/A | 0.30/0.34 | 1.07 (0.99-1.18) | 0.084 |
| 6q25 | C6orf97 | rs2046210ᵈ,ᵉ | 151990059 | A/G | 0.38/0.60 | 1.00 (0.93-1.09) | 0.88 |
| 8q24 | | rs13281615 | 128424800 | G/A | 0.45/0.43 | 1.05 (0.97-1.13) | 0.20 |
| 9p21 | CDKN2B | rs1011970 | 22052134 | T/G | 0.17/0.33 | 1.05 (0.97-1.14) | 0.24 |
| 9q31 | | rs865686 | 109928199 | T/G | 0.61/0.52 | 1.08 (1.01-1.17) | 0.034 |
| 10p15 | ANKRD16 | rs2380205 | 5926740 | C/T | 0.52/0.42 | 0.98 (0.91-1.06) | 0.60 |
| 10q21 | ZNF365 | rs10995190 | 63948688 | G/A | 0.87/0.83 | 0.97 (0.88-1.08) | 0.57 |
| 10q22 | ZMIZ1 | rs704010 | 80511154 | T/C | 0.43/0.11 | 0.99 (0.87-1.12) | 0.83 |
| 10q26 | FGFR2 | rs2981582 | 123342307 | A/G | 0.46/0.46 | 1.11 (1.03-1.19) | 8.6×10⁻³ |
| 11p15 | LSP1 | rs3817198 | 1865582 | C/T | 0.33/0.17 | 0.98 (0.88-1.08) | 0.63 |
| 11q13 | | rs614367 | 69037945 | T/C | 0.18/0.13 | 0.96 (0.86-1.07) | 0.45 |
| 14q24 | RAD51L1 | rs999737 | 68104435 | T/C | 0.26/0.051 | 0.98 (0.82-1.17) | 0.80 |
| 16q12 | TNRC9 | rs3803662 | 51143842 | A/G | 0.25/0.51 | 0.99 (0.92-1.08) | 0.85 |
| 17q23 | COX11 | rs6504950ᵈ | 50411470 | G/A | 0.70/0.66 | 1.05 (0.97-1.14) | 0.19 |
| 19p13 | ANKLE1 | rs2363956 | 17255124 | T/G | 0.45/0.49 | 1.14 (1.05-1.22) | 8.0×10⁻⁴ |

Best marker in African Americans (ER+: 1,520 cases, 2,745 controls; ER-: 988 cases, 2,745 controls)

| Chr. | Case group | Marker | Position | Risk/ref alleles | RAF (CEU/AA)ᵃ | OR (95% CI)ᵇ | P_trendᵇ | r² with index in CEU/YRIᶜ |
|---|---|---|---|---|---|---|---|---|
| 2q35 | ER+ | rs12998806ᵈ | 217602008 | G/A | 0.84/0.85 | 1.39 (1.20-1.59) | 3.3×10⁻⁶ | 0.29/0.53 |
| 5q11 | ER- | rs832529 | 56265045 | C/T | 0.37/0.67 | 1.22 (1.08-1.37) | 1.3×10⁻³ | 0.46/0.092 |
| 8q24 | ER+ | rs16902056ᵈ | 128346821 | A/C | 0.91/0.95 | 1.69 (1.34-2.12) | 6.7×10⁻⁶ | <0.01/0.027 |
| 10q26 | ER+ | rs2981578ᵈ | 123330301 | C/T | 0.46/0.81 | 1.30 (1.13-1.49) | 2.9×10⁻⁴ | 0.66/0.059 |
| 10q26 | ER- | rs2912774ᵈ | 123338652 | T/G | 0.46/0.54 | 1.19 (1.06-1.33) | 2.7×10⁻³ | 0.98/0.47 |
| 16q12 | ER+ | rs3112572 | 51157948 | A/G | 0.020/0.20 | 1.27 (1.13-1.42) | 3.1×10⁻⁵ | 0.038/0.31 |
| 19p13 | ER+ | rs3745185 | 17245267 | G/A | 0.52/0.75 | 1.20 (1.08-1.35) | 8.2×10⁻⁴ | 0.57/0.19 |
| 19p13 | ER- | rs11668840 | 17260625 | T/C | 0.55/0.57 | 1.25 (1.12-1.39) | 5.1×10⁻⁵ | 0.82/0.50 |

SNP positions are based on NCBI build 36.
ORs are adjusted for age, study, the first 10 eigenvectors and local ancestry at each risk locus. P_trend, P-value from test of trend (1-d.f.).
ᵃ RAF, risk allele frequency in original GWAS population (HapMap CEU, or CHB for rs2046210), and in African American (AA) controls in this study. This is the allele associated with increased risk in previous GWAS.
ᵇ From stepwise analysis.
ᶜ Pairwise correlation between the index signal and the best marker in CEU and YRI in 1000 Genomes Project (March 2010 release).
ᵈ Imputed SNPs.
ᵉ Index signal reported in Han Chinese. RAFs based on HapMap CHB and r² based on CHB in 1000 Genomes Project.
Supplemental Table 3-9. Associations by genotype for SNPs nominally associated with risk in African Americans in known breast cancer risk regions by ER status.
ER+ columns: 1,520 ER+ cases, 2,745 controls. ER- columns: 988 ER- cases, 2,745 controls. All ORs are shown with 95% CI.

| Chr. | Marker | Position, Risk/ref alleles | RAFᵃ in AA | ER+ per allele OR | ER+ Het OR | ER+ Hom OR | ER- per allele OR | ER- Het OR | ER- Hom OR | P_het |
|---|---|---|---|---|---|---|---|---|---|---|
| 2q35 | rs13387042 | 217614077, A/G | 0.72 | 1.22 (1.10-1.35) | 1.19 (0.90-1.56) | 1.46 (1.12-1.91) | 1.00 (0.88-1.12) | 0.99 (0.73-1.34) | 1.00 (0.74-1.34) | 0.013 |
| | rs12998806ᵇ | 217602008, G/A | 0.85 | 1.39 (1.20-1.59) | 1.29 (0.77-2.18) | 1.82 (1.09-3.03) | 0.99 (0.85-1.16) | 1.09 (0.64-1.86) | 1.07 (0.63-1.80) | 3.1×10⁻⁴ |
| | rs13000023ᵇ | 217632639, G/A | 0.83 | 1.35 (1.18-1.54) | 1.13 (0.71-1.78) | 1.55 (0.99-2.42) | 1.07 (0.92-1.24) | 1.03 (0.62-1.70) | 1.11 (0.68-1.81) | 0.011 |
| 5q11 | rs16886165 | 56058840, G/T | 0.31 | 1.16 (1.05-1.28) | 1.15 (1.00-1.32) | 1.35 (1.09-1.68) | 1.07 (0.96-1.20) | 1.03 (0.87-1.21) | 1.20 (0.93-1.56) | 0.18 |
| | rs832529 | 56265045, C/T | 0.67 | 1.08 (0.98-1.19) | 1.07 (0.86-1.34) | 1.16 (0.92-1.45) | 1.22 (1.08-1.37) | 1.06 (0.81-1.39) | 1.37 (1.05-1.80) | 0.16 |
| 8q24 | rs16902056 | 128346821, A/C | 0.95 | 1.69 (1.34-2.12) | 2.07 (0.58-7.40) | 3.36 (0.96-11.79) | 1.06 (0.84-1.34) | 1.38 (0.43-4.37) | 1.40 (0.45-4.35) | 4.7×10⁻⁴ |
| 9q31 | rs865686 | 109928299, T/G | 0.52 | 1.06 (0.97-1.16) | 0.97 (0.82-1.14) | 1.11 (0.93-1.33) | 1.15 (1.03-1.29) | 1.10 (0.96-1.34) | 1.31 (1.06-1.63) | 0.17 |
| 10q22 | rs12355688 | 80725632, T/C | 0.20 | 1.26 (1.12-1.41) | 1.31 (1.14-1.51) | 1.39 (1.01-1.92) | 1.24 (1.09-1.41) | 1.21 (1.03-1.42) | 1.64 (1.17-2.29) | 0.77 |
| 10q26 | rs2981582 | 123342307, A/G | 0.46 | 1.05 (0.96-1.15) | 1.03 (0.89-1.20) | 1.11 (0.92-1.33) | 1.14 (1.02-1.27) | 1.28 (1.07-1.54) | 1.28 (1.03-1.60) | 0.51 |
| | rs2981578ᵇ | 123330301, C/T | 0.81 | 1.27 (1.11-1.45) | 1.26 (0.81-1.98) | 1.61 (1.03-2.49) | 1.15 (0.98-1.35) | 1.12 (0.66-1.88) | 1.29 (0.78-2.16) | 0.17 |
| | rs2912774ᵇ | 123338652, T/G | 0.54 | 1.10 (1.00-1.21) | 1.07 (0.90-1.27) | 1.19 (0.98-1.43) | 1.19 (1.06-1.33) | 1.11 (0.91-1.36) | 1.38 (1.11-1.72) | 0.39 |
| 11q13 | rs609275ᵇ | 69112096, C/T | 0.59 | 1.16 (1.05-1.29) | 1.26 (1.04-1.51) | 1.38 (1.13-1.69) | 1.18 (1.06-1.33) | 1.21 (0.97-1.49) | 1.42 (1.13-1.79) | 0.58 |
| 16q12 | rs3104746ᵇ | 51158601, A/T | 0.19 | 1.25 (1.12-1.40) | 1.09 (0.98-1.21) | 1.30 (1.00-1.68) | 1.13 (0.99-1.29) | 1.21 (1.03-1.43) | 1.03 (0.69-1.54) | 0.095 |
| | rs3112572 | 51157948, A/G | 0.20 | 1.27 (1.13-1.42) | 1.35 (1.18-1.55) | 1.34 (0.98-1.84) | 1.14 (1.00-1.30) | 1.19 (1.01-1.40) | 1.18 (0.82-1.71) | 0.087 |
| 19p13 | rs2363956 | 17255124, T/G | 0.45 | 1.12 (1.02-1.22) | 1.27 (1.08-1.49) | 1.26 (1.05-1.51) | 1.14 (1.02-1.27) | 1.24 (1.02-1.50) | 1.30 (1.05-1.61) | 0.99 |
| | rs3745185 | 17245267, G/A | 0.77 | 1.20 (1.08-1.35) | 1.25 (0.92-1.70) | 1.50 (1.11-2.03) | 1.15 (1.01-1.31) | 1.16 (0.81-1.66) | 1.33 (0.94-1.89) | 0.53 |
| | rs11668840 | 17260625, T/C | 0.57 | 1.10 (1.00-1.20) | 1.19 (1.00-1.42) | 1.23 (1.02-1.48) | 1.25 (1.12-1.39) | 1.47 (1.18-1.84) | 1.65 (1.31-2.07) | 0.061 |

SNP positions are based on NCBI build 36.
ORs are adjusted for age, study, the first 10 eigenvectors and local ancestry.
P_het, P-value for heterogeneity by ER status from case-only test.
ᵃ RAF, risk allele frequency in African Americans (AA).
ᵇ For imputed SNPs, the probability of the number of risk alleles from MACH was converted to genotype groups (i.e. <0.5 = homozygous reference; 0.5-1.5 = heterozygous; and >1.5 = homozygous variant).
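The conversion of imputed dosages to genotype groups described in the footnote can be written as a simple threshold rule. This is a minimal sketch of that rule; the function name and the 0/1/2 return coding are assumptions, and the handling of values exactly at 0.5 and 1.5 follows the inclusive "0.5-1.5" range as stated.

```python
def dosage_to_genotype_group(dosage):
    """Map an imputed risk-allele dosage in [0, 2] to a genotype group:
    0 = homozygous reference (< 0.5), 1 = heterozygous (0.5-1.5),
    2 = homozygous variant (> 1.5)."""
    if not 0.0 <= dosage <= 2.0:
        raise ValueError("dosage must lie between 0 and 2")
    if dosage < 0.5:
        return 0
    if dosage <= 1.5:
        return 1
    return 2


# Hypothetical dosages for illustration.
print([dosage_to_genotype_group(d) for d in (0.1, 0.5, 1.2, 1.5, 1.9)])
```

Hard-calling in this way discards imputation uncertainty, which is why it is used here only for the genotype-specific (heterozygote and homozygote) odds ratios.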
Supplemental Table 3-10. Pairwise correlations of index SNPs and the most significant markers found in fine-mapping.

| Region | Index marker | SNP pairs | r² in CEU/YRI |
|---|---|---|---|
| 2q35 | rs13387042 | rs13387042 and rs12998806 | 0.29/0.53 |
| | | rs13387042 and rs13000023 | 0.35/0.53 |
| | | rs12998806 and rs13000023 | 0.83/1.0 |
| 5q11 | rs889312 | rs889312 and rs16886165 | 0.15/0.009 |
| | | rs889312 and rs832529 | 0.46/0.092 |
| | | rs16886165 and rs832529 | 0.29/0.09 |
| 8q24 | rs13281615 | rs13281615 and rs16902056 | 0/0.027 |
| 10q22 | rs704010 | rs704010 and rs12355688 | 0.007/0.002 |
| 10q26 | rs2981582 | rs2981582 and rs2912774 | 0.98/0.47 |
| | | rs2981582 and rs2981578 | 0.66/0.059 |
| | | rs2912774 and rs2981578 | 0.72/0.10 |
| 11q13 | rs614367 | rs614367 and rs609275 | 0.024/0.001 |
| 16q12 | rs3803662 | rs3803662 and rs3112572 | 0.038/0.31 |
| | | rs3803662 and rs3104746 | 0.038/0.31 |
| | | rs3104746 and rs3112572 | 1.0/1.0 |
| 19p13 | rs2363956 | rs2363956 and rs3745185 | 0.57/0.19 |
| | | rs2363956 and rs11668840 | 0.82/0.50 |
| | | rs3745185 and rs11668840 | 0.68/0.32 |

r², pairwise correlation based on CEU and YRI populations in the 1000 Genomes Project (March 2010 release).
Supplemental Table 3-11. Associations by genotype for SNPs associated with risk in African Americans in known breast cancer risk regions.
3,016 cases, 2,745 controls

| Chr. | Marker | Position, Risk/ref alleles | RAFᵃ in AA | Per allele OR (95% CI) | Heterozygotes OR (95% CI) | Homozygotes OR (95% CI) |
|---|---|---|---|---|---|---|
| 2q35 | rs13387042 | 217614077, A/G | 0.72 | 1.12 (1.03-1.21) | 1.13 (0.91-1.41) | 1.27 (1.03-1.57) |
| | rs13000023ᵇ | 217632639, G/A | 0.83 | 1.20 (1.09-1.33) | 1.09 (0.76-1.55) | 1.33 (0.94-1.88) |
| 5q11 | rs16886165 | 56058840, G/T | 0.31 | 1.15 (1.06-1.25) | 1.12 (1.01-1.26) | 1.36 (1.13-1.63) |
| 9q31 | rs865686 | 109928299, T/G | 0.52 | 1.08 (1.01-1.17) | 1.03 (0.90-1.17) | 1.17 (1.01-1.36) |
| 10q22 | rs12355688 | 80725632, T/C | 0.20 | 1.24 (1.13-1.36) | 1.23 (1.10-1.38) | 1.55 (1.20-2.01) |
| 10q26 | rs2981582 | 123342307, A/G | 0.46 | 1.11 (1.03-1.19) | 1.16 (1.02-1.31) | 1.22 (1.05-1.42) |
| | rs2981578ᵇ | 123330301, C/T | 0.81 | 1.24 (1.11-1.39) | 1.18 (0.83-1.68) | 1.45 (1.03-2.05) |
| 11q13 | rs609275ᵇ | 69112096, C/T | 0.59 | 1.20 (1.11-1.30) | 1.28 (1.10-1.49) | 1.47 (1.25-1.73) |
| 16q12 | rs3104746ᵇ | 51158601, A/T | 0.19 | 1.17 (1.06-1.29) | 1.24 (1.11-1.39) | 1.11 (0.84-1.47) |
| | rs3112572 | 51157948, A/G | 0.20 | 1.18 (1.08-1.30) | 1.24 (1.11-1.39) | 1.23 (0.94-1.59) |
| 19p13 | rs2363956 | 17255124, T/G | 0.45 | 1.14 (1.05-1.22) | 1.25 (1.10-1.43) | 1.30 (1.12-1.50) |
| | rs3745185 | 17245267, G/A | 0.77 | 1.20 (1.10-1.32) | 1.29 (1.00-1.65) | 1.53 (1.20-1.95) |

SNP positions are based on NCBI build 36.
ORs are adjusted for age, study, the first 10 eigenvectors and local ancestry.
ᵃ RAF, risk allele frequency in African Americans (AA).
ᵇ For imputed SNPs, the probability of the number of risk alleles from MACH was converted to genotype groups (i.e. <0.5 = homozygous reference; 0.5-1.5 = heterozygous; and >1.5 = homozygous variant).
Supplemental Table 3-12. Risk summary scores in association with breast cancer by ER status.
Measure | All cases (3,016 cases, 2,745 controls) | ER+ cases (1,520 ER+ cases, 2,745 controls) | ER- cases (988 ER- cases, 2,745 controls) | P het

Summary score of index markers (19 markers)
Average number of risk alleles in controls (range): 15.7 (6-25)
Per allele OR | 1.04 (1.02-1.06) | 1.04 (1.01-1.07) | 1.03 (1.00-1.06) | 0.40
P trend | 6.1×10^-5 | 1.7×10^-3 | 0.081 |
Quintile OR (95% CI)
Q1 | 1.00 (ref) | 1.00 (ref) | 1.00 (ref) | ----
Q2 | 0.99 (0.84-1.16) | 0.86 (0.70-1.05) | 1.12 (0.89-1.40) |
Q3 | 1.15 (0.96-1.39) | 1.19 (0.95-1.49) | 1.10 (0.84-1.45) |
Q4 | 1.16 (0.98-1.36) | 1.12 (0.92-1.36) | 1.19 (0.95-1.51) |
Q5 | 1.44 (1.20-1.72) | 1.38 (1.11-1.71) | 1.29 (0.99-1.67) |

Summary score of risk-associated best markers in African Americans for all cases vs controls (8 markers)^a
Average number of risk alleles in controls (range): 8.4 (3-14)
Per allele OR | 1.18 (1.14-1.22) | 1.20 (1.15-1.25) | 1.15 (1.10-1.20) | 0.12
P trend | 2.8×10^-24 | 1.7×10^-19 | 2.8×10^-9 |
Quintile OR (95% CI)
Q1 | 1.00 (ref) | 1.00 (ref) | 1.00 (ref) | ----
Q2 | 1.17 (0.96-1.42) | 1.17 (0.91-1.49) | 1.03 (0.77-1.36) |
Q3 | 1.37 (1.14-1.64) | 1.36 (1.08-1.71) | 1.29 (1.00-1.67) |
Q4 | 1.56 (1.30-1.87) | 1.63 (1.29-2.04) | 1.26 (0.97-1.64) |
Q5 | 2.16 (1.80-2.58) | 2.31 (1.86-2.88) | 1.81 (1.40-2.33) |

Summary score of risk-associated best markers in African Americans for ER+ cases vs controls (5 markers)^a
Average number of risk alleles in controls (range): 7.1 (2-10)
Per allele OR | 1.22 (1.16-1.28) | 1.30 (1.23-1.38) | 1.11 (1.03-1.19) | 1.7×10^-5
P trend | 2.0×10^-15 | 6.0×10^-18 | 3.1×10^-3 |
Quintile OR (95% CI)
Q1 | 1.00 (ref) | 1.00 (ref) | 1.00 (ref) | ----
Q2 | 1.19 (0.97-1.46) | 1.34 (1.03-1.75) | 1.04 (0.78-1.38) |
Q3 | 1.51 (1.23-1.84) | 1.70 (1.32-2.20) | 1.12 (0.84-1.48) |
Q4 | 1.52 (1.19-1.95) | 2.25 (1.66-3.06) | 0.82 (0.57-1.20) |
Q5 | 1.80 (1.48-2.18) | 2.23 (1.74-2.86) | 1.35 (1.03-1.76) |

Summary score of risk-associated best markers in African Americans for ER- cases vs controls (4 markers)^a
Average number of risk alleles in controls (range): 4.7 (0-8)
Per allele OR | 1.13 (1.08-1.17) | 1.09 (1.04-1.14) | 1.20 (1.13-1.27) | 5.0×10^-3
P trend | 1.7×10^-9 | 5.9×10^-4 | 2.3×10^-10 |
Quintile OR (95% CI)
Q1 | 1.00 (ref) | 1.00 (ref) | 1.00 (ref) | ----
Q2 | 1.10 (0.93-1.29) | 0.98 (0.80-1.19) | 1.22 (0.95-1.56) |
Q3 | 1.23 (1.05-1.05) | 1.12 (0.93-1.36) | 1.43 (1.13-1.81) |
Q4 | 1.47 (1.24-1.73) | 1.30 (1.06-1.59) | 1.81 (1.42-2.31) |
Q5 | 1.60 (1.30-1.99) | 1.33 (1.02-1.72) | 2.12 (1.58-2.86) |

P het: P-value for heterogeneity from case-only testing of ER+ vs. ER- cases.
ORs are adjusted for age, study and the first 10 eigenvectors.
P trend: P values from test of trend (1-d.f.).
a: Summing over the most significant markers from the stepwise analysis in each region from Table 3-1, Supplemental Table 3-3 and Supplemental Table 3-7.
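To make the per-allele odds ratios in Supplemental Table 3-12 concrete, the following R sketch shows how a summary score could be computed as a count of risk alleles and how a per-allele OR scales under a log-additive model. The function names and the toy genotype matrix are illustrative assumptions, not the code used for the analysis; only the per-allele OR of 1.18 (8-marker score, all cases) is taken from the table.

```r
# Hypothetical helper: summary risk score = total number of risk alleles carried
# across the selected markers (each genotype coded as 0, 1 or 2 risk alleles).
summary_score <- function(genotypes) rowSums(genotypes)

# Under a log-additive model, the OR for carrying k additional risk alleles
# is the per-allele OR raised to the power k.
or_for_k_alleles <- function(per_allele_or, k) per_allele_or^k

# Toy genotype matrix (made-up values): 3 women x 4 markers
geno <- matrix(c(0, 1, 2, 1,
                 2, 2, 1, 0,
                 1, 0, 0, 0), nrow = 3, byrow = TRUE)
scores <- summary_score(geno)
print(scores)

# Per-allele OR of 1.18 is the table's estimate for the 8-marker score
print(or_for_k_alleles(1.18, 2))
```

Under this model a woman carrying two more risk alleles than another would have 1.18^2, or about 1.39 times, the odds of breast cancer. The quintile ORs in the table are estimated separately and need not follow this curve exactly.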
Supplemental Figure Legends
Chapter 3
Supplemental Figure 3-1. –Log P plots for common alleles in 19 breast cancer risk
regions in African Americans. Pairwise correlations (r^2) are shown in relation to GWAS
index variants in HapMap CEU (JPT+CHB for 6q25). –Log P values modeling overall,
ER+ and ER- breast cancer risk are shown separately in each panel.
Supplemental Figure 3-2. –Log P plots for common alleles in breast cancer risk regions
in African Americans. Pairwise correlations (r^2) are shown in relation to the strongest
signal in each region based on HapMap CEU (YRI for 11q13). –Log P values modeling
overall, ER+ and ER- breast cancer risk are shown separately in each panel.
Supplemental Figure 3-3. Linkage disequilibrium plots of breast cancer risk regions in
the GWAS population (HapMap CEU, JPT+CHB for 6q25) and HapMap YRI.
Chapter 4
Supplemental Figure 4-1. Relationship between expected heritability and number of
predictors in order for the effects to be detectable with 80% power in a simplified linear
model.
Supplemental Figure 4-2. A plot of z1 against z0 to show relatedness in the sample.
Supplemental Figure 4-3. A Q-Q plot for a genome-wide scan of height using simulated
data (h^2 = 0.5), after adjusting for the first 10 PCs.
Supplemental Figure 4-4. A Q-Q plot for a genome-wide scan of height using real data,
after adjusting for the first 10 PCs.
Supplemental Figure 3-1. [Figure not reproduced. Panels A-S: 1p11, 2q35, 3p24, 5p12, 5q11, 6q25, 8q24, 9p21, 9q31, 10p15, 10q21, 10q22, 10q26, 11p15, 11q13, 14q24, 16q12, 17q23, 19p13. Each panel has three plots: all cases vs. controls, ER+ cases vs. controls, and ER- cases vs. controls.]
Supplemental Figure 3-2. [Figure not reproduced. Panels A-H: 2q35, 5q11, 8q24, 10q22, 10q26, 11q13, 16q12, 19p13. Each panel has three plots: all cases vs. controls, ER+ cases vs. controls, and ER- cases vs. controls.]
Supplemental Figure 3-3. [Figure not reproduced. Panels A-H: 2q35, 5q11, 8q24, 10q22, 10q26, 11q13, 16q12, 19p13.]
Supplemental Figures 4-1 to 4-4. [Figures not reproduced.]
Chapter 5
5-1 Information of participating studies.
This study includes African American women and men from 9 epidemiological studies of breast
cancer and 12 epidemiological studies of prostate cancer, which comprise a total sample size of
15,032. Below is a brief description of each of the studies.
The Multiethnic Cohort Study (MEC): The MEC is a prospective cohort study of 215,000 men
and women in Hawaii and Los Angeles (1) between the ages of 45 and 75 years at baseline
(1993-1996). Through December 31, 2007, a nested breast cancer case-control study in the MEC
included 556 African American cases (544 invasive and 12 in situ) and 1,003 African American
controls. An additional 178 African American breast cancer cases (ages: 50-84) diagnosed
between June 1, 2006 and December 31, 2007 in Los Angeles County (but outside of the MEC)
were included in the study. Through January 1, 2008 the African American case-control study in
the MEC included 1,094 prostate cancer cases and 1,096 controls. An additional 746 prostate
cancer cases from the MEC diagnosed after January 1, 2009 and 656 controls with GWAS data
were included. Altogether, the MEC contributed 5,329 subjects to the height study.
The Los Angeles component of The Women’s Contraceptive and Reproductive Experiences
(CARE) Study: The CARE Study is a large multi-center population-based case-control study that
was designed to examine the effects of oral contraceptive (OC) use on invasive breast cancer risk
among African American women and white women ages 35-64 years in five U.S. locations (2).
Cases in Los Angeles County were diagnosed from July 1, 1994 through April 30, 1998, and
controls were sampled by random-digit dialing (RDD) from the same population and time period;
380 African American cases and 224 African American controls were included in the study.
The Women’s Circle of Health Study (WCHS): The WCHS is an ongoing case-control study of
breast cancer among European women and African American women in the New York City
boroughs and in seven counties in New Jersey (3). Eligible cases included women with invasive
breast cancer between 20 and 74 years of age; controls were identified through RDD. The
WCHS contributed 272 invasive African American cases and 240 African American controls.
The San Francisco Bay Area Breast Cancer Study (SFBCS): The SFBCS is a population-based
case-control study of invasive breast cancer in Hispanic, African American and non-Hispanic
White women conducted between 1995 and 2003 in the San Francisco Bay Area (4). African
American cases, ages 35-79 years, were diagnosed between April 1, 1995 and April 30, 1999,
with controls identified through RDD. Included from this study were 172 invasive African
American cases and 231 African American controls.
The Northern California Breast Cancer Family Registry (NC-BCFR): The NC-BCFR is a
population-based family study conducted in the Greater San Francisco Bay Area, and one of 6
sites of the Breast Cancer Family Registry (BCFR) (5). African American breast cancer cases in
NC-BCFR were diagnosed after January 1, 1995 and between the ages of 18 and 64 years;
population controls were identified through RDD. Genotyping was conducted for 440 invasive
African American cases and 53 African American controls.
The Carolina Breast Cancer Study (CBCS): The CBCS is a population-based case-control study
conducted between 1993 and 2001 in 24 counties of central and eastern North Carolina (6).
Cases were identified by rapid case ascertainment system in cooperation with the North Carolina
Central Cancer Registry and controls were selected from the North Carolina Division of Motor
Vehicle and United States Health Care Financing Administration beneficiary lists. Participants’
ages ranged from 20 to 74 years. DNA samples were provided from 656 African American cases
with invasive breast cancer and 608 African American controls.
The Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) Cohort: The
Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (7) is a randomized, two-arm
trial among men and women aged 55-74 years to determine whether screening reduces mortality
from these cancers. Male participants randomized to the intervention arm underwent prostate
specific antigen (PSA) screening at baseline and annually for 5 years and digital rectal
examination at baseline and annually for 3 years. Sequential blood samples were collected from
participants assigned to the screening arm; participation was 93% at the baseline blood draw
(1993-2001). Buccal cell samples were collected from participants in the control arm of the trial;
participation was about 85% for this component. A total of 64 African American invasive breast
cancer cases and 133 African American controls, as well as 286 African American prostate
cancer cases and 269 controls without a history of prostate cancer contributed to this study.
The Nashville Breast Health Study (NBHS): The NBHS is a population-based case-control study
of incident breast cancer conducted in Tennessee (8). The study was initiated in 2001 to recruit
patients with invasive breast cancer or ductal carcinoma in situ, and controls identified through
RDD, all between the ages of 25 and 75 years. NBHS contributed 310 African American cases (57
in situ), and 186 African American controls.
Wake Forest University Breast Cancer Study (WFBC): African American breast cancer cases
and controls in WFBC were recruited at Wake Forest University Health Sciences from
November 1998 through December 2008 (9). Controls were recruited from the patient population
receiving routine mammography at the Breast Screening and Diagnostic Center. Age range of
participants was 30-86 years. WFBC contributed 125 cases (116 invasive and 9 in situ) and 153
controls to the analysis.
The Southern Community Cohort Study (SCCS): The SCCS is a prospective cohort of African
and non-African Americans which during 2002-2009 enrolled approximately 86,000 residents
aged 40-79 years across 12 southern states (10). Recruitment occurred mainly at community
health centers, institutions providing basic health services primarily to the medically uninsured,
so that the cohort includes many adults of lower income and educational status. Each study
participant completed a detailed baseline questionnaire, and nearly 90% provided a biologic
specimen (approximately 45% a blood sample and 45% buccal cells). Follow-up of the cohort is
conducted by linkage to national mortality registers and to state cancer registries. Included in this
study are 212 incident African American prostate cancer cases and a matched stratified random
sample of 419 African American male cohort members without prostate cancer at the index date
selected by incidence density sampling. We included an additional 51 incident and prevalent
cases from the SCCS (incident cases diagnosed after June 1, 2006) and 104 controls with GWAS
data.
The Cancer Prevention Study II Nutrition Cohort (CPS-II): The CPS-II Nutrition Cohort includes
over 86,000 men and 97,000 women from 21 US states who completed a mailed questionnaire in
1992 (aged 40-92 years at baseline) (11). Starting in 1997, follow-up questionnaires were sent to
surviving cohort members every other year to update exposure information and to ascertain
occurrence of new cases of cancer; a >90% response rate has been achieved for each follow-up
questionnaire. From 1998-2001, blood samples were collected in a subgroup of 39,376 cohort
members. To further supplement the DNA resources, during 2000-2001, buccal cell samples
were collected by mail from an additional 70,000 cohort members. Incident cancers are verified
through medical records, or through state cancer registries or death certificates when the medical
record cannot be obtained. Genomic DNA from 76 African American prostate cancer cases and
152 age-matched controls was included in stage 1 of the scan.
Prostate Cancer Case-Control Studies at MD Anderson (MDA): Participants in this study were
identified from epidemiological prostate cancer studies conducted at the University of Texas
M.D. Anderson Cancer Center in the Houston Metropolitan area since 1996. Cases were accrued
from six institutions in the Houston Medical Center and were not restricted with respect to
Gleason score, stage or PSA. Controls were identified via random-digit-dialing or among
hospital visitors and they were frequency matched to cases on age and race. Lifestyle,
demographic, and family history data were collected using a standardized questionnaire. These
studies contributed 543 African American cases and 474 controls to this study (12).
The Los Angeles Study of Aggressive Prostate Cancer (LAAPC): The LAAPC is a population-based
case-control study of aggressive prostate cancer among African Americans in Los Angeles
County (13). Cases were identified through the Los Angeles County Cancer Surveillance
Program rapid case ascertainment system and eligible cases included African American men
diagnosed with a first primary prostate cancer between January 1, 1999 and December 31, 2003.
Eligible cases also had either tumor extension outside the prostate, metastatic prostate cancer in
sites other than prostate, or needle biopsy of the prostate with Gleason grade 8 or higher, or
Gleason grade 7 and tumor in more than 2/3 of the biopsy cores. Controls were identified by a
neighborhood walk algorithm and were men never diagnosed with prostate cancer, and were
frequency matched to cases on age (±5 years). For this study, genomic DNA was included for
296 cases and 140 controls. We also included an additional 163 African American controls from
the MEC that were frequency matched to cases on age.
Prostate Cancer Genetics Study (CaP Genes): The African-American component of this study
population comprised 160 men: 75 cases diagnosed with more aggressive prostate cancer and 85
age-matched controls (13). All subjects were recruited and frequency-matched on the major
medical institutions in Cleveland, Ohio (i.e., the Cleveland Clinic, University Hospitals of
Cleveland, and their affiliates) between 2001 and 2004. The cases were newly diagnosed with
histologically confirmed disease: Gleason score ≥7; tumor stage ≥T2c; or a prostate-specific antigen
level >10 ng/ml at diagnosis. Controls were men without a prostate cancer diagnosis who
underwent standard annual medical examinations at the collaborating medical institutions.
Case-Control Study of Prostate Cancer among African Americans in Washington, DC (DCPC):
Unrelated men self-described as African American were recruited for several case-control
studies on genetic risk factors for prostate cancer between the years 2001 and 2005 from the
Division of Urology at Howard University Hospital (HUH) in Washington, DC. Control subjects
unrelated to the cases and matched for age (± 5 years) were also ascertained from the prostate
cancer screening population of the Division of Urology at HUH (14). These studies included 292
cases and 359 controls.
King County (Washington) Prostate Cancer Studies (KCPCS): The study population consists of
participants from one of two population-based case-control studies among residents of King
County, Washington (15, 16). Incident Caucasian and African American cases with
histologically confirmed prostate cancer were ascertained from the Seattle-Puget Sound SEER
cancer registry during two time periods, 1993-1996 and 2002-2005. Age-matched (5-year age
groups) controls were men without a self-reported history of being diagnosed with prostate
cancer and were identified using one-step random digit telephone dialing. Controls were
ascertained during the same time periods as the cases. A total of 145 incident African American
cases and 81 African American controls were included from these studies.
The Gene-Environment Interaction in Prostate Cancer Study (GECAP): The Henry Ford Health
System (HFHS) recruited cases diagnosed with adenocarcinoma of the prostate of Caucasian or
African-American race, less than 75 years of age, and living in the metropolitan Detroit tri-
county area (17). Controls were randomly selected from the same HFHS population base from
which cases were drawn. The control sample was frequency matched at a ratio of 3 enrolled
cases to 1 control based on race and five-year age stratum. In total, 637 cases and 244 controls
were enrolled between January 2002 and December 2004. Of study enrollees, DNA for 234
African American cases and 92 controls was included in stage 1 of the scan.
Prostate Cancer in a Black Population (PCBP): The PCBP is a population-based case-control
study of prostate cancer conducted in Barbados, West Indies (18). The study (2002-2011)
included all incident, histologically-confirmed cases of prostate cancer ascertained from the
Pathology Department of the Queen Elizabeth Hospital, Bridgetown, the only institution on the
island where specimens are evaluated. Controls were randomly selected from a national database
and frequency matched (by 5-year age groups) to the cases. We included 238 prostate cancer
cases and 231 controls with GWAS data.
Selenium and Vitamin E Cancer Prevention Trial (SELECT): SELECT is a phase III, placebo-
controlled trial that tested whether selenium and vitamin E might reduce the risk of developing
prostate cancer (19). A total of 35,534 men 55 and older (50 years and older for African
Americans) without a history of prostate cancer were enrolled between 2001 and 2004. About 12%
of the SELECT participants are African American. A case-cohort study has been established in
SELECT and, as of December 31, 2009, includes 217 African American prostate cancer cases
and 222 African American non-cases, all of whom are included in the analysis.
5-2 R and SAS codes used in generating independent data and investigating the
performance of the variance components approach.
R codes:
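# Simulate n subjects with p independent standard-normal predictors that share
# one effect size, then write the files read by the SAS PROC MIXED step.
data_gen <- function(n, p) {
  sigmae <- 1                        # residual standard deviation
  z <- matrix(rnorm(n * p), n, p)    # n x p matrix of independent predictors
  zzp <- z %*% t(z)                  # n x n matrix ZZ' for the random effect
  # Common per-predictor effect; total explained variance is
  # (3.557 * sqrt(p) + 4.426) * sigmae^2 / n
  b <- rep(sqrt((3.557 * sqrt(p) + 4.426) * sigmae^2 / (n * p)), p)
  y <- z %*% b + sigmae * rnorm(n)
  myanova <- anova(lm(y ~ z))        # F test of all p predictors jointly
  # Append the F-test P value to a results file named after n and p
  write.table(myanova[["Pr(>F)"]][1], paste("g.n", n, ".p", p, ".txt", sep = ""),
              col.names = FALSE, row.names = FALSE, append = TRUE)
  write.table(cbind(y, z), "yz.dat", sep = " ", col.names = FALSE, row.names = FALSE)
  write.table(zzp, "zzp.dat", sep = " ", col.names = FALSE, row.names = FALSE)
  invisible()
}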
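Because the predictors in data_gen() are independent standard normal variables with a common effect b, the variance they explain is p*b^2 = (3.557*sqrt(p) + 4.426)*sigmae^2/n, so the simulated heritability follows directly from n and p. The helper below is a hypothetical addition, not part of the original code; it is only a sketch for checking which heritability a given (n, p) setting implies.

```r
# Hypothetical helper (not in the original appendix): heritability implied by
# the effect sizes that data_gen() assigns to p independent N(0, 1) predictors.
implied_h2 <- function(n, p, sigmae = 1) {
  # Variance explained by all predictors together: p * b^2, with b as in data_gen()
  var_g <- (3.557 * sqrt(p) + 4.426) * sigmae^2 / n
  # Heritability = explained variance / total phenotypic variance
  var_g / (var_g + sigmae^2)
}

# Example setting matching the SAS step, which reads z1-z10 and col1-col1000
# (that is, p = 10 predictors and n = 1000 subjects)
h2 <- implied_h2(n = 1000, p = 10)
print(round(h2, 4))
```

For a fixed number of predictors the implied heritability shrinks as the sample size grows, since the explained variance is proportional to 1/n.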
SAS MIXED PROCEDURE:
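%macro mixed;
/* Phenotype y and the 10 predictors written by data_gen() */
data yz;
infile "yz.dat" lrecl=500;
input y z1-z10;
id=_n_;
run;
/* 1000 x 1000 matrix ZZ', with the parm and row variables read via LDATA= */
data zzp;
infile "zzp.dat" lrecl=50000;
input col1-col1000;
row=_n_;
parm=1;
run;
ods listing close;
/* Intercept-only model with one variance component whose covariance is proportional to ZZ' */
proc mixed data=yz method=reml covtest noprofile itdetails;
class id;
model y=;
random id/type=lin(1) ldata=zzp;
ods output lrt=lrt;
run;
/* Accumulate the likelihood ratio test results across runs */
proc append base=lrt_all data=lrt force;
run;
%mend;
/* Call the macro 1,000 times */
data _null_;
i=1;
do while (i le 1000);
call execute ('%mixed');
i=i+1;
end;
run;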
References
1 Kolonel, L.N., Henderson, B.E., Hankin, J.H., Nomura, A.M., Wilkens, L.R., Pike, M.C., Stram, D.O.,
Monroe, K.R., Earle, M.E. and Nagamine, F.S. (2000) A multiethnic cohort in Hawaii and Los Angeles:
baseline characteristics. Am J Epidemiol, 151, 346‐357.
2 Marchbanks, P.A., McDonald, J.A., Wilson, H.G., Burnett, N.M., Daling, J.R., Bernstein, L., Malone,
K.E., Strom, B.L., Norman, S.A., Weiss, L.K. et al. (2002) The NICHD Women's Contraceptive and
Reproductive Experiences Study: methods and operational results. Ann Epidemiol, 12, 213‐221.
3 Ambrosone, C.B., Ciupak, G.L., Bandera, E.V., Jandorf, L., Bovbjerg, D.H., Zirpoli, G., Pawlish, K.,
Godbold, J., Furberg, H., Fatone, A. et al. (2009) Conducting Molecular Epidemiological Research in the
Age of HIPAA: A Multi‐Institutional Case‐Control Study of Breast Cancer in African‐American and
European‐American Women. J Oncol, 2009, 871250.
4 John, E.M., Schwartz, G.G., Koo, J., Wang, W. and Ingles, S.A. (2007) Sun exposure, vitamin D
receptor gene polymorphisms, and breast cancer risk in a multiethnic population. Am J Epidemiol, 166,
1409‐1419.
5 John, E.M., Hopper, J.L., Beck, J.C., Knight, J.A., Neuhausen, S.L., Senie, R.T., Ziogas, A., Andrulis,
I.L., Anton‐Culver, H., Boyd, N. et al. (2004) The Breast Cancer Family Registry: an infrastructure for
cooperative multinational, interdisciplinary and translational studies of the genetic epidemiology of
breast cancer. Breast Cancer Res, 6, R375‐389.
6 Newman, B., Moorman, P.G., Millikan, R., Qaqish, B.F., Geradts, J., Aldrich, T.E. and Liu, E.T.
(1995) The Carolina Breast Cancer Study: integrating population‐based epidemiology and molecular
biology. Breast Cancer Res Treat, 35, 51‐60.
7 Gohagan, J.K., Prorok, P.C., Hayes, R.B. and Kramer, B.S. (2000) The Prostate, Lung, Colorectal
and Ovarian (PLCO) Cancer Screening Trial of the National Cancer Institute: history, organization, and
status. Control Clin Trials, 21, 251S‐272S.
8 Zheng, W., Cai, Q., Signorello, L.B., Long, J., Hargreaves, M.K., Deming, S.L., Li, G., Li, C., Cui, Y.
and Blot, W.J. (2009) Evaluation of 11 breast cancer susceptibility loci in African‐American women.
Cancer Epidemiol Biomarkers Prev, 18, 2761‐2764.
9 Smith, T.R., Levine, E.A., Freimanis, R.I., Akman, S.A., Allen, G.O., Hoang, K.N., Liu‐Mares, W. and
Hu, J.J. (2008) Polygenic model of DNA repair genetic polymorphisms in human breast cancer risk.
Carcinogenesis, 29, 2132‐2138.
10 Signorello, L.B., Hargreaves, M.K., Steinwandel, M.D., Zheng, W., Cai, Q., Schlundt, D.G.,
Buchowski, M.S., Arnold, C.W., McLaughlin, J.K. and Blot, W.J. (2005) Southern community cohort study:
establishing a cohort to investigate health disparities. J Natl Med Assoc, 97, 972‐979.
11 Calle, E.E., Rodriguez, C., Jacobs, E.J., Almon, M.L., Chao, A., McCullough, M.L., Feigelson, H.S.
and Thun, M.J. (2002) The American Cancer Society Cancer Prevention Study II Nutrition Cohort:
rationale, study design, and baseline characteristics. Cancer, 94, 2490‐2501.
12 Strom, S.S., Gu, Y., Zhang, H., Troncoso, P., Babaian, R.J., Pettaway, C.A., Shete, S., Spitz, M.R.
and Logothetis, C.J. (2004) Androgen receptor polymorphisms and risk of biochemical failure among
prostatectomy patients. Prostate, 60, 343‐351.
13 Ingles, S.A., Coetzee, G.A., Ross, R.K., Henderson, B.E., Kolonel, L.N., Crocitto, L., Wang, W. and
Haile, R.W. (1998) Association of prostate cancer with vitamin D receptor haplotypes in African‐
Americans. Cancer Res, 58, 1620‐1623.
14 Robbins, C., Torres, J.B., Hooker, S., Bonilla, C., Hernandez, W., Candreva, A., Ahaghotu, C.,
Kittles, R. and Carpten, J. (2007) Confirmation study of prostate cancer risk variants at 8q24 in African
Americans identifies a novel risk locus. Genome Res, 17, 1717‐1722.
15 Agalliu, I., Salinas, C.A., Hansten, P.D., Ostrander, E.A. and Stanford, J.L. (2008) Statin use and
risk of prostate cancer: results from a population‐based epidemiologic study. Am J Epidemiol, 168, 250‐
260.
16 Stanford, J.L., Wicklund, K.G., McKnight, B., Daling, J.R. and Brawer, M.K. (1999) Vasectomy and
risk of prostate cancer. Cancer Epidemiol Biomarkers Prev, 8, 881‐886.
17 Rybicki, B.A., Neslund‐Dudas, C., Nock, N.L., Schultz, L.R., Eklund, L., Rosbolt, J., Bock, C.H. and
Monaghan, K.G. (2006) Prostate cancer risk from occupational exposure to polycyclic aromatic
hydrocarbons interacting with the GSTP1 Ile105Val polymorphism. Cancer Detect Prev, 30, 412‐422.
18 Nemesure, B., Wu, S.Y., Hennis, A. and Leske, M.C. (2012) Central adiposity and Prostate Cancer
in a Black Population. Cancer Epidemiol Biomarkers Prev, 21, 851‐858.
19 Lippman, S.M., Goodman, P.J., Klein, E.A., Parnes, H.L., Thompson, I.M., Jr., Kristal, A.R., Santella,
R.M., Probstfield, J.L., Moinpour, C.M., Albanes, D. et al. (2005) Designing the Selenium and Vitamin E
Cancer Prevention Trial (SELECT). J Natl Cancer Inst, 97, 94‐102.
Abstract
Genome-wide association scans (GWAS) have identified numerous common variants associated with hundreds of complex diseases. In this dissertation, I investigated the properties of the GWAS-identified common SNPs in multiple populations, and estimated their aggregate effects on complex diseases. In the first Chapter, I assessed the generalizability of a risk score derived from 12 SNPs known to be associated with breast cancer risk in European or Asian populations in the Multiethnic Cohort (MEC). I performed a case-control study with 2,224 cases and 2,827 controls nested in the MEC and found that when viewed as a summary risk score, the total number of risk alleles carried by women was significantly associated with breast cancer risk overall (OR per allele: 1.09
Asset Metadata
Creator: Chen, Fang (author)
Core Title: Polygenic analyses of complex traits in complex populations
School: Keck School of Medicine
Degree: Doctor of Philosophy
Degree Program: Biostatistics
Publication Date: 05/29/2013
Defense Date: 10/05/2012
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: common variants, GWAS, heritability, OAI-PMH Harvest, polygenes
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Stram, Daniel O. (committee chair), Coetzee, Gerhard (Gerry) A. (committee member), Gauderman, William James (committee member), Haiman, Christopher A. (committee member), Lewinger, Juan Pablo (committee member)
Creator Email: fangchen@usc.edu, silencecf@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-124501
Unique identifier: UC11291979
Identifier: usctheses-c3-124501 (legacy record id)
Legacy Identifier: etd-ChenFang-1362.pdf
Dmrecord: 124501
Document Type: Dissertation
Rights: Chen, Fang
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA