EXTENDING GENOME-WIDE ASSOCIATION STUDY METHODS IN
AFRICAN AMERICAN DATA
by
Chi Song
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)
December 2015
Copyright 2015 Chi Song
Acknowledgments
First, I would like to acknowledge my committee members: Drs. Daniel O. Stram,
Wendy Cozen, Christopher A. Haiman, Juan Pablo Lewinger, and Gerhard A. Coetzee for their
generous support and invaluable suggestions. Special thanks go to Dr. Stram who guided and
walked me through the entire pursuit of my PhD. I feel extremely fortunate to have him as my
advisor and couldn’t have asked for a better mentor. I am thankful to be introduced to the
Multiple Myeloma Study by Dr. Cozen, whose dedication to research has been a constant
inspiration.
During different stages of my life, I have had the support, encouragement and company
from myriads of friends. It is impossible to list all of the names, but I will never forget the
kindness of everyone. You all made me a better person.
I am deeply indebted to my family for their unconditional love. I could not have achieved
what I have today without you. I miss my grandparents and parents most when we are thousands
of miles apart.
Last, I would like to thank my significant other, W. Y. Lau. Your love, understanding,
and patience keep pushing me forward. All words pale in comparison to your love.
Table of Contents
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 OVERVIEW
1.1 Genome-wide Association Studies (GWAS)
1.2 Multiethnic fine-mapping
1.2.1 Population stratification and admixture
1.2.2 Adjusting for population structure
1.3 Haplotype Analysis and Genotype Imputation
1.3.1 Haplotype Analysis
1.3.2 Genotype Imputation
1.4 Multiple Testing
1.5 Overall Summary
1.6 Chapter 1 References
2 A GENOME-WIDE SCAN FOR BREAST CANCER RISK HAPLOTYPES AMONG AFRICAN AMERICAN WOMEN
2.1 Abstract
2.2 Introduction
2.3 Materials and Methods
2.3.1 Ethics statement
2.3.2 Study population
2.3.3 Genotyping and Quality Control
2.3.4 Statistical Analysis
2.4 Results
2.5 Discussion
2.6 Chapter 2 References
3 MULTIPLE MYELOMA SUSCEPTIBILITY LOCI EXAMINED IN AFRICAN AND EUROPEAN ANCESTRY POPULATIONS
3.1 Abstract
3.2 Introduction
3.3 Methods
3.3.1 Ethics statement
3.3.2 European Ancestry Study Participants
3.3.3 African Ancestry Study Participants
3.3.4 Multiethnic Fine-Mapping Analysis
3.4 Results
3.5 Discussion
3.6 Chapter 3 References
4 EXPANDING GWAS THROUGH THE REUSE OF EXISTING GENOTYPE DATA
4.1 Combining External Data Helps Enhance the Study Power
4.2 Caveats in Combining External Genotype Data
4.3 Complications in the Example of the AAMM Study
4.4 Statistical Power of Association Tests While Controlling for Population Structure
4.4.1 The Bourgain Test: a Retrospective Approach
4.4.2 The Power of the Bourgain Test for Studies of Completely Admixed Populations
4.4.3 The Power of the Bourgain Test for Studies of Incompletely Admixed Populations
4.4.4 Comparing the Bourgain test with PCA and Genomic Control
4.5 Additional Issues in the Reuse of Existing Data
4.6 Chapter 4 references
5 EVALUATING INFLATION IN TEST STATISTICS ARISING FROM DIFFERENT IMPUTATION STRATEGIES
5.1 Motivation
5.2 Genomic Coverage Assessments for the 188k Genotyped SNPs
5.3 Quantifying Imputation Quality Using the Info Score and R2
5.4 Imputations and the Inflation Factors in Test Statistics for Case-Control Studies
5.4.1 Imputation to KGP Based on the 188k Overlapping SNPs
5.4.2 Imputation Based on All Genotyped SNPs for Cases and Controls Separately
5.4.3 Imputation for Cases and Controls Separately While Based on the Same 188k
5.4.4 Potential differential imputation errors within controls
5.4.5 Imputation to the Full 260k SNPs Genotyped in Cases
5.5 Summary
5.6 Chapter 5 References
6 FINAL REMARKS
Chapter 6 References
COMPREHENSIVE REFERENCES
List of Tables
Table 2.1 Fitting the minimum p-values from 1,000 permutations of chromosome 22 data to theoretical beta distributions beta(a,b).
Table 2.2 The most significant individual haplotypes identified in the extended regions on chromosomes 1, 4 and 18.
Table 2.3 The most significant individual haplotypes in 10p15 and 14q24.
Table 3.1 Replication of reported index SNPs and most significant associations for each region from the multiethnic meta-analysis.
Table 4.1 Proportions of Estimated African ancestry for individuals in AAMM, AABC and AAPC by study site.
Table 5.1 Genomic coverage of the 188k SNPs by minor allele frequency and info score.
Table 5.2 Genomic coverage estimates between HumanCore and other arrays.
Table 5.3 Inflation in test statistics by info score and MAF cutoffs when cases and controls were imputed together based on the 188k.
Table 5.4 Inflation in test statistics by info score and MAF cutoffs when cases and controls were imputed separately based on all genotypes.
Table 5.5 Comparing the inflation in test statistics derived from three different imputation strategies (chromosome 1 data only).
Table 5.6 Inflation in a hypothetical case-control analysis, AABC vs. AAPC (chromosome 1 data only)
Table 5.7 Inflation for the “TA” SNPs when the genotypes of cases were combined with the imputed dosages for controls
List of Figures
Figure 1.1 (From Pulit et al. 2014) Correlation between study sample size and number of loci identified in various common human complex diseases
Figure 1.2 (From Stram 2014) Illustration of the rationale of haplotype association
Figure 1.3 (Adapted from Li et al. 2009) Illustration of genotype imputation procedures
Figure 2.1 Comparison of the permutation minimum p-values to theoretical beta distributions.
Figure 2.2 Comparison of the significance of individual haplotypes with the most significant SNPs in three regions on chromosomes 1, 4 and 18.
Figure 2.3 Comparison of the significance of individual haplotypes with imputed SNPs in regions on chromosomes 1, 4 and 18.
Figure 2.4 Two known breast cancer risk regions 10p15 and 14q24 exhibit putative haplotype effects.
Figure 2.5 Comparison of the significance of individual haplotypes with imputed SNPs in 10p15 and 14q24.
Figure 3.1 Genomic annotation of the region around rs4487645 on chromosome 7.
Figure 4.1 Distributions of the estimated ancestral proportions by study center. From top to bottom: AAMM, AABC, and AAPC. Red, African (YRI); green, European (CEU); yellow, Asian (JPT). Horizontal dotted lines mark the 80% and 20% of ancestry fraction, respectively.
Figure 4.2 (Modified from Chen et al.) Probability density plots of beta distributions with different levels of heterogeneity. Solid line: 0.8 (population 1) and dashed line: 0.75 (population 2) for fractions of African ancestry in two incompletely admixed populations.
Figure 4.3 (Modified from Chen et al.) Non-centrality parameters of the Bourgain test in a case-control study where cases and controls are from two populations with varying degrees of admixture.
Figure 4.4 (Adapted from Stram 2014) a, QQ plot of p-values for 2100 non-causal SNPs. Over-dispersion λ’s are equal to 1.93 (Armitage trend test without correction, black solid line), 1.06 (adjusting for the 1st PC, blue dashed line), 1 (first PC+GC, green solid line), and .999 (retrospective model, red solid line).
Figure 4.5 (From Sebastiani et al.) a (Top) the Manhattan plot shows a large number of significantly associated SNPs across the genome; b (Bottom) the QQ plot reveals an inflation of association tests. Neither is quite plausible in a calibrated GWAS that has eliminated systematic allele frequency differences.
Figure 4.6 QQ plot of the Armitage trend test p-values for 188,376 overlapping SNPs genotyped on both chips (λ = 1.02)
Figure 5.1 Box plots comparing the distributions of the info score and the r2. A. Distributions of info scores (white boxes) for each subset of imputed variants. TO, SNPs typed in controls only; TA, SNPs typed in cases only; AFR, all KGP variants with corresponding MAF estimated in AFR samples; [0.05-0.50], [0.01-0.05), and (min-0.01) are ranges of MAFs. B. r2 (gray boxes) is the squared Pearson correlation between posterior probability and actual genotypes for each variant; info scores were obtained from IMPUTE2.
Figure 5.2 Linear correlation between the info scores and r2. A. Subplots by SNP type and MAF bracket: TO, SNPs typed in controls only; TA, SNPs typed in cases only; [0.05-0.50] and [0.01-0.05) are ranges of MAFs. B. Overlay of the left four subplots.
Figure 5.3 Inflation in test statistics by info score and MAF cutoffs when cases and controls were imputed together based on the 188k. The colors are for different minimum info score cutoffs: blue (>0.8), magenta (>0.9), dark green (>0.95), red (>0.98), orange (>0.99) and light green (typed SNPs); x axis denotes MAF cutoffs: 1%, 2%, 5%, 7%, and 10%.
Figure 5.4 Inflation in test statistics by info score and MAF cutoffs when cases and controls were imputed separately based on all available genotypes. The colors are for different minimum info score cutoffs: blue (>0.8), magenta (>0.9), dark green (>0.95), red (>0.98), and orange (>0.99); x axis denotes MAF cutoffs: 1%, 2%, 5%, 7%, and 10%.
Figure 5.5 Comparing the inflation factors in test statistics derived from the three different imputation strategies (showing chromosome 1 data only). Imputation bases were 260k+1M SNPs (dark green plus signs), 188k SNPs imputed separately (blue circles) and 188k SNPs imputed together (magenta triangles); x axis denotes MAF cutoffs: 1%, 2%, 5%, 7%, and 10%.
Figure 5.6 Inflation in a hypothetical case-control analysis, AABC vs. AAPC (chromosome 1 data only); x axis denotes MAF cutoffs: 1%, 2%, 5%, 7%, and 10%.
Figure 5.7 Inflation for the “TA” SNPs when the genotypes of cases were combined with the imputed dosages for controls; x axis denotes MAF cutoffs: 1%, 2%, 5%, 7%, and 10%.
Abstract
African Americans have been known to have higher risks for a wide range of complex
diseases, such as breast and prostate cancer, multiple myeloma, and many others. Differences in
genetic architecture across diverse racial/ethnic groups may explain the elevated risks. However,
research resources for African Americans have been limited compared to those for the European-
derived populations, a major impediment to elucidating the genetic risk factors
disproportionately affecting African Americans. The possibility of improving the study power is
therefore of primary scientific interest. This dissertation discusses the approaches to addressing
the problem as follows: 1) integrating haplotypes with single SNPs in testing for breast cancer
risk in African Americans; 2) fine-mapping of risk loci for multiple myeloma in a meta-analysis
with both African Americans and European Americans capitalizing on the LD differences; 3)
evaluating the feasibility and practicality of reusing existing genotype/imputation data to cost-
effectively enlarge the sample size in studies with diversely admixed populations and also
different genotyping platforms. These approaches show a potential to enhance the power of
studies in African Americans although many difficulties remain to be solved.
1 Overview
1.1 Genome-wide Association Studies (GWAS)
The past decade has witnessed an overwhelming success of GWAS in identifying
thousands of common genetic susceptibility loci for a wide range of complex phenotypes,
e.g., breast cancer, prostate cancer, diabetes, obesity, Alzheimer’s disease, etc. [1-3]. As of
March 2015, more than 22,000 variants from 2,100 publications have been cataloged in the
GWAS Catalog maintained jointly by the National Human Genome Research Institute and the
European Bioinformatics Institute (http://www.ebi.ac.uk/gwas/home) and the number is still
rapidly growing. Large sample sizes in study designs, recent advances in genotyping
technologies and a plethora of state-of-the-art statistical methods all together contribute to the
thriving of GWAS. GWAS adopt an agnostic approach where no prior hypothesis is favored
irrespective of genic or intergenic regions and hence hundreds of thousands or even millions of
variants across the genome are tested simultaneously for associations. This is in stark contrast to
the so-called “candidate gene studies” which heavily rely on prior biological knowledge of gene
functionality and genomic locations to select “candidate genes” to test for associations with
disease risk. In fact, most genetic variants identified in GWAS have not been found in or
adjacent to any genes [4]; many of these findings could not have been implicated in elevated
disease risks by performing candidate gene studies only.
It has been well recognized by the community that susceptibility to common complex
diseases is likely to be the consequence of a large number of low-penetrance variants acting en
masse, each conveying only a modest relative risk (RR). The vast majority of genetic markers identified
so far have supported such a belief (RR in the range of 1.1-1.5 for binary traits) [5]; GWAS with
an adequately large sample size are required to detect associations at this scale. Thanks to the
establishment of international consortia around the globe aimed to consolidate resource, GWAS
with unprecedentedly large study samples have become appreciably powered to study risk
variants for diseases of high, medium, or low prevalence. A landmark example is the Wellcome
Trust Case Control Consortium (WTCCC) study in which seven diseases were investigated using
a total of 14,000 cases and 3,000 shared controls, identifying 24 independent loci in association
with one of the seven diseases ($p < 5 \times 10^{-7}$) [6]. More recent GWAS, some of which are meta- or
mega-analyses of multiple independent GWAS, utilizing even larger sample sizes have
discovered proportionately more risk variants. Examples include a study of Crohn’s disease
identifying an additional 30 new loci at $p < 10^{-8}$ (22,027 cases / 29,082 controls) [7] and a study
of Schizophrenia identifying 108 independent risk loci at $p < 10^{-8}$ (36,989 cases / 113,075
controls) [8]. A summary of recent GWAS findings reveals a roughly linear relationship between
the number of identified association loci and the study sample size (Figure 1.1), advocating
future investments in even larger scale studies [9].
Figure 1.1 (From Pulit et al. 2014) Correlation between study sample size and number of loci
identified in various common human complex diseases
Despite the tremendous success of GWAS in filling in our knowledge gaps about
complex diseases, GWAS are not immune to criticisms: (1) the vast majority of genetic variants
identified thus far have been associated with small to modest effect sizes (Odds Ratio (OR): 1.1-
1.4 for binary traits; variance explained < 0.5% for quantitative traits), suggesting individual
low-penetrance variants are neither necessary nor sufficient to cause the disease by themselves.
(2) In aggregate, genome-wide significant variants identified to date have only accounted for a
small proportion of the phenotypic variation of various highly heritable diseases (~10% for
Crohn’s disease and type 2 diabetes, and ~1% for schizophrenia) [10], an indication that a large
fraction of disease heritability remains uncharted. (3) Many GWAS hits identified in one
racial/ethnic group (mostly of European descent) have failed to be replicated in other groups
(African Americans, Hispanics, or Asians), further complicating the attempts to localize the true
causal variants and obfuscating the biological and clinical relevance of identified risk variants
[11-13].
In response to the above criticisms, several explanations and developments have been
discussed in the GWAS literature in recent years, many of which are unified around the main
issue of study power.
First of all, a range of parameters determine the power (quantified by the non-centrality
parameter, $\lambda$) of a given GWAS, including the study sample size $N$ as discussed above, minor
allele frequency (MAF, $p$) and effect size $\beta$ of the risk variant, as well as the linkage
disequilibrium (LD) between the actually genotyped SNP and the true underlying causal variant,
$r^2$, formulated as $\lambda \propto N \beta^2 p(1-p) r^2$ assuming an additive genetic effects model [9]. While a
sample of a few thousand cases and controls may be sufficient to detect ORs greater than 1.3 for
SNPs with MAF > 5% in populations of European ancestry, the sample size required to achieve
similar power would have to be considerably enlarged if (a) the effect sizes of the variants are
smaller (e.g., OR for a deleterious allele < 1.3); (b) the SNPs or other forms of genetic variation
under consideration are less frequent (MAF: 1-5%) or rare (< 1%) (e.g., in a sample of 5,000
individuals, a SNP with MAF = 0.1% translates to merely 10 copies of the test allele); or (c) the
genotyped SNPs cannot effectively “tag” the true causal variant because of low degree of LD
between them or the rarity of the causal variant making LD too weak. Therefore, it is evident that
although ever-increasing available sample sizes can surely boost study power, an array of
challenges still needs to be tackled before desired, often enormous, sample sizes become
economically achievable.
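The dependence of power on sample size, allele frequency, effect size and LD described above can be made concrete with a small calculation. The following is a minimal sketch rather than part of the original analysis. It assumes that the non-centrality parameter is proportional to N * beta^2 * p(1 - p) * r^2 under an additive model and that the effect size beta is the log odds ratio; all function names are illustrative.

```python
import math


def expected_allele_copies(n_individuals, maf):
    """Expected number of copies of the minor allele among 2N chromosomes."""
    return 2 * n_individuals * maf


def relative_ncp(n, odds_ratio, maf, r2=1.0):
    """Quantity proportional to the non-centrality parameter.

    Assumption of this sketch: NCP is proportional to
    N * beta^2 * p(1 - p) * r^2, with beta = log(odds ratio).
    """
    beta = math.log(odds_ratio)
    return n * beta ** 2 * maf * (1.0 - maf) * r2


def sample_size_multiplier(reference, scenario):
    """Factor by which N must grow for `scenario` to match the NCP of `reference`.

    Each argument is a dict with keys 'odds_ratio', 'maf' and 'r2'.
    """
    return relative_ncp(1.0, **reference) / relative_ncp(1.0, **scenario)


# Hypothetical scenarios chosen only to illustrate the three cases (a)-(c).
reference = {"odds_ratio": 1.3, "maf": 0.05, "r2": 1.0}
rarer = {"odds_ratio": 1.3, "maf": 0.01, "r2": 1.0}
poorly_tagged = {"odds_ratio": 1.3, "maf": 0.05, "r2": 0.5}

print(expected_allele_copies(5000, 0.001))              # copies of a 0.1% allele in 5,000 people
print(sample_size_multiplier(reference, rarer))          # cost of a lower MAF
print(sample_size_multiplier(reference, poorly_tagged))  # cost of weaker tagging
```

Under these assumptions, lowering the MAF from 5% to 1% at a fixed OR requires roughly 4.8 times as many subjects, and halving $r^2$ doubles the required sample size.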
Second, past GWAS were underpowered to investigate genetic susceptibility attributable
to rare variants. If rare variants indeed explain a good deal of disease heritability, then the so-
called “missing heritability” may lie in rare variants that have not been fully interrogated due to
technological and statistical difficulties. The cornerstone upon which the GWAS paradigm was
built is the “common disease-common variant” (CDCV) hypothesis, which essentially states that
most of the heritability of common diseases can be explained by common genetic variants in the
genome. Although a large number of significant SNPs have been uncovered through GWAS,
the fraction of total heritability they account for is far from satisfactory. Under simplifying
assumptions of neutral selection and constant effective population sizes, across the allele
frequency spectrum there would be as many variants with MAF in the range of (0.1-1%) as there
are in the range of (5-50%), the latter of which constitute the majority of GWAS successes to
date [14]. Additionally, in order for rare variants to play a similar role in explaining as much
heritability as do common variants, considerably larger ORs would be expected. For example, an
OR of 5.6 is required for a risk variant with MAF = 0.5% to explain comparable phenotypic
variance to a common variant with MAF = 10% and OR = 1.5. However, the hypothesis of rare
variants conveying larger effects than common variants has not been fully supported.
Alternatively, the “infinitesimal model” hypothesizes that a myriad of variants, common or rare,
each with infinitesimally small effect on the trait, might have been neglected in searching for
genome-wide significant associations with an extremely stringent type 1 error rate [15].
Evidence of rare variants carrying either large or small effects exists [16]. Further
developments, such as the emerging Next-Generation Sequencing (NGS) data, are needed to dissect the
missing heritability attributable to rare variants. Furthermore, there is another layer of
complexity in the discussion of missing heritability: epistasis (GxG) and gene-environment
interactions (GxE) [17]. It is generally believed that complex human diseases are more likely to
reflect the interplay of a large number of genes and environmental factors than a simplistic
manifestation of a single high-penetrance gene as seen in Mendelian diseases. The estimation
and interpretation of heritability in the context of GxG and GxE can be profoundly flawed, as
pointed out by Zuk et al. [18]. They argued through mathematical derivation that the “total”
heritability inferred from population data could be overestimated and thus the proportion of
heritability explained is erroneously underestimated. The so-called “phantom heritability”, the
overestimation in “total” heritability, is more pronounced when interactions between multiple
pathways and between genes and environmental factors are commonplace. Although challenging
both in theory and in application, this argument provides a unique perspective from which to
tackle the missing heritability conundrum.
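The figure quoted earlier in this discussion, an OR of about 5.6 for a variant with MAF = 0.5% to match a common variant with MAF = 10% and OR = 1.5, can be checked with a short calculation. The sketch below is illustrative only. It assumes a common approximation in which the variance explained by a variant is proportional to 2p(1 - p) times the squared log odds ratio; the function names are hypothetical.

```python
import math


def variance_proxy(maf, odds_ratio):
    """Approximate variance contribution: 2p(1 - p) * (log OR)^2 (assumed model)."""
    return 2.0 * maf * (1.0 - maf) * math.log(odds_ratio) ** 2


def equivalent_odds_ratio(maf_target, maf_ref, odds_ratio_ref):
    """OR needed at `maf_target` to match the variance proxy of the reference variant."""
    target = variance_proxy(maf_ref, odds_ratio_ref)
    log_or = math.sqrt(target / (2.0 * maf_target * (1.0 - maf_target)))
    return math.exp(log_or)


rare_or = equivalent_odds_ratio(maf_target=0.005, maf_ref=0.10, odds_ratio_ref=1.5)
print(round(rare_or, 1))  # 5.6
```

The approximation reproduces the stated value, which suggests that the comparison in the text rests on a calculation of this kind, although the exact model used by the cited source is not given here.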
Third, the vast majority of GWAS have been conducted in European-derived populations,
owing in part to the advanced research infrastructure and also to the relative homogeneity in
ancestry compared to other ethnicities [19]. Commonly used commercial genotyping
platforms by Illumina and Affymetrix have been primarily designed to have optimal coverage of
genetic variation measured in the HapMap/1000 Genomes EUR populations, which in turn
facilitates investigations with an emphasis on populations of European ancestry, a somewhat
circular situation. In contrast, African-derived populations, older than their counterparts on the
other continents, tend to have comparatively deeper ancestry, more genetic variation and
shorter-range LD. Common SNPs observed in European populations may not necessarily be as common in
Africans; a genotyped SNP and the true causal variant that are in LD in European populations may
reside in different LD blocks in Africans. Large-scale GWAS of African Americans, Latinos,
and Asians have recently been published and many more are underway [20-22]. These studies will
certainly assist in dissecting such disparities across different populations.
1.2 Multiethnic fine-mapping
As outlined in Section 1.1, genetic architecture varying across diverse populations limits
the generalizability of risk variants identified in one group of people to another. Besides, it has long
been known that different populations show dramatic differences in disease incidence,
aggressiveness, mortality, and survival. Population differences can on the one hand be associated
with disease susceptibility directly, and on the other be a surrogate for a wide range of
confounders not taken into account [23]. However, once properly adjusted for, population
differences can be exploited to localize causal variants at higher resolution by
leveraging different LD patterns in a multiethnic study setting [24]. Udler et al., through fine
mapping of the FGFR2 gene in an African American sample, localized a strongly associated SNP,
rs2981578 (OR = 1.20; 95% CI = 1.03-1.43; p = 0.02), implicated in breast cancer risk after six
neighboring SNPs had been found potentially causal for breast cancer only in populations of
European and Asian ancestry [25]. They also validated the association in a multiethnic sample
consisting of African Americans, Europeans and Asians. Their findings narrowed down the risk
region to a 25 kilobase (kb) LD block almost wholly within intron 2 of FGFR2, suggesting that
differential expression of FGFR2 may be involved in the association with breast cancer risk.
Following the same fine-mapping strategy, I first tested multiple myeloma
(MM) risk loci in a large African American sample, and then combined this sample with the
European American samples of the other co-first author, Dr. Kristin Rand. Through a meta-analysis of 8407
African Americans (AA: 1329 cases/7078 controls) and 2760 European Americans (EA: 1274
cases/1486 controls), we identified SNPs that better capture the association signals in regions
implicated in excess MM risk in both AAs and EAs (more in Chapter 3).
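To illustrate how per-population association results can be pooled in a multiethnic analysis, the following is a minimal sketch of a fixed-effect, inverse-variance weighted meta-analysis. This section does not state which meta-analysis model was used, so the choice of a fixed-effect model is an assumption, and the effect estimates and standard errors in the example are hypothetical numbers, not results from the AA or EA samples.

```python
import math


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect meta-analysis of per-study log odds ratios.

    Returns the pooled estimate, its standard error and a two-sided
    p-value from the normal approximation.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, p_value


# Hypothetical per-population log odds ratios and standard errors for one SNP.
pooled_beta, pooled_se, p_value = inverse_variance_meta([0.2, 0.1], [0.1, 0.1])
print(pooled_beta, pooled_se, p_value)
```

With equal standard errors the two studies receive equal weight, so the pooled estimate is simply the average of the two log odds ratios, and the pooled standard error shrinks by a factor of the square root of two.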
1.2.1 Population stratification and admixture
Initially, the focus on homogeneous populations was beneficial to the discovery of various
genetic susceptibility loci. Yet it is rarely true that the population composition of a GWAS is
homogeneous given the large sample size, geographical diversity, and complex human history
[26]. Population stratification and admixture arise from uneven makeups of subpopulations
within a large population due to nonrandom mating (such as isolation) or from the formation of
contemporary populations from two or more historically separated ancestral populations (e.g.,
through mass migration). In many modern populations that are “recently” admixed (e.g., African
Americans, Hispanics, and Pacific Islanders), extensive LD between apparently unlinked
loci has been introduced into the genome and could lead to spurious
associations insofar as the allele frequencies differ in the original populations [27, 28].
Accurately adjusting for population structure has become indispensable to modern-day
large-scale GWAS.
1.2.2 Adjusting for population structure
Several methods to control for population structure in genetic association testing have
been proposed and thoroughly evaluated [29, 30]. A conceptually simple and widely used
method is computing the genomic control parameter $\lambda$, defined as the median of the observed
Armitage trend test statistics ($\chi^2$) divided by the median of the theoretical $\chi^2$ distribution with 1
degree of freedom (d.f.) [31]. If there exists a different yet unknown population structure
between cases and controls, many false-positive associations would be observed and an inflated
$\lambda$ ($\lambda > 1$) can easily be visualized on a quantile-quantile (QQ) plot. Genomic control would be sufficient to
correct the inflation if it is driven by recent genetic drift as long as the extent to which allele
frequencies deviate is random. Conversely, genomic control would be less useful in light of a
distinguishable pattern of allele frequency differences propelled by natural selection.
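The definition of the genomic control parameter above can be illustrated with a minimal sketch. It assumes the per-SNP trend test statistics are already available as a plain list; the function names are hypothetical and are not taken from any GWAS package. The median of a 1 d.f. chi-square distribution is obtained as the square of the 75th percentile of the standard normal (about 0.455).

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 d.f.: if Z is standard normal,
# P(Z^2 <= m) = 0.5 exactly when sqrt(m) is the 75th percentile of Z.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control_lambda(chi2_stats):
    """Observed median test statistic divided by the theoretical median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats):
    """Divide each statistic by lambda (illustrative choice: never inflate,
    so a lambda below 1 is treated as 1)."""
    lam = max(genomic_control_lambda(chi2_stats), 1.0)
    return [s / lam for s in chi2_stats]
```

In this sketch a value of `genomic_control_lambda` near 1 suggests little overall inflation, whereas a value clearly above 1 suggests that the statistics should be rescaled or that a more explicit correction is needed.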
Another prevailing approach is the principal components analysis (PCA), which entails
inferring continuous axes (termed “principal components”) of genetic variation and fitting top
principal components as covariates in the regression model [32]. Explicit modeling of ancestral
differences via principal components provides a sensitive correction for allele frequency
differences owing to population stratification, and has better control of spurious associations and
greater power than the genomic control method. Loosely speaking, tens of thousands of
randomly selected common SNPs (MAF > 5%) across the genome with low LD ($r^2 < 0.2$) are
sufficient to fully control for population stratification within diverse European-derived
populations (Wright’s $F_{ST} = 0.01$) and many more are necessary for more closely related
subpopulations ($F_{ST} < 0.01$) [32]. However, PCA’s utility is restricted in the presence of family
structure or cryptic relatedness among nominally unrelated samples, because a lesser degree of
relatedness generally yields weaker LD that is dwarfed by larger population structure and therefore
can hardly be characterized by any leading PCs [29].
The variance components approach explicitly models population structure and cryptic
relatedness via partitioning the covariance matrix of phenotype into fixed and random effects
components as follows (equations 1.1-1.3),
$$y = X\beta + u + e \tag{1.1}$$

$$Var(y) = \sigma_g^2 K + \sigma_e^2 I \tag{1.2}$$

$$K_{ij} = \frac{1}{M} \sum_{k=1}^{M} \frac{(n_{ik} - 2\hat{p}_k)(n_{jk} - 2\hat{p}_k)}{2\hat{p}_k(1 - \hat{p}_k)} \tag{1.3}$$

where $y$ is the vector of phenotype; $X$ is the matrix of fixed effects covariates (e.g., age, sex,
principal components, etc.) with coefficients $\beta$; $u$ is the random genetic effects with distribution
$u \sim N(0, \sigma_g^2 K)$; $K$ is the genetic relationship matrix (GRM) with each element $K_{ij}$ estimated as the
correlation between subject $i$ and subject $j$, each carrying $n_{ik}$ and $n_{jk}$ copies of allele at SNP $k$
with estimated frequency $\hat{p}_k$, across a total of $M$ SNPs; $e$ is the non-heritable component of
random noise with distribution $e \sim N(0, \sigma_e^2 I)$. Variance components $\sigma_g^2$ and $\sigma_e^2$ can be estimated
using the restricted maximum likelihood (REML); the genetic effect of interest is tested in a
Wald test provided the total number of subjects is reasonably large. Estimation of variance
components involves inversion of large matrices and presents an enormous computational
burden, although efficient implementation has been realized by running programs such as
EMMAX [33] and GCTA [34].
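To make the GRM estimator in equation 1.3 concrete, the following is a small sketch in plain Python. It is not the EMMAX or GCTA implementation. It assumes genotypes are coded as allele counts (0, 1, 2) in a list of per-subject lists, estimates allele frequencies from the same sample, and drops monomorphic SNPs to avoid division by zero (so M here is the number of SNPs actually used). The function name is hypothetical.

```python
def genetic_relationship_matrix(genotypes):
    """genotypes[i][k] is the allele count (0, 1 or 2) of subject i at SNP k.

    Returns the N x N matrix K of equation 1.3 as a list of lists.
    """
    n_subjects = len(genotypes)
    n_snps = len(genotypes[0])
    # Sample allele frequency at each SNP: mean allele count divided by 2.
    freqs = [sum(row[k] for row in genotypes) / (2 * n_subjects)
             for k in range(n_snps)]
    # Assumption of this sketch: monomorphic SNPs carry no information
    # and are skipped.
    used = [k for k in range(n_snps) if 0.0 < freqs[k] < 1.0]
    if not used:
        raise ValueError("no polymorphic SNPs available")
    K = [[0.0] * n_subjects for _ in range(n_subjects)]
    for i in range(n_subjects):
        for j in range(i, n_subjects):
            total = 0.0
            for k in used:
                p = freqs[k]
                total += ((genotypes[i][k] - 2 * p) * (genotypes[j][k] - 2 * p)
                          / (2 * p * (1 - p)))
            K[i][j] = K[j][i] = total / len(used)
    return K
```

The double loop is quadratic in the number of subjects and linear in the number of SNPs, which hints at why dedicated software is used for samples of realistic size.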
Within the scope of this dissertation, genomic control has been widely used to assess the
overall distribution of association test statistics before and after applying other correction
methods. Leading principal components derived from PCA have been consistently included in
models to acknowledge the fact that our main sample – African Americans – is a largely
admixed population. Variance components or mixed effects models are explored in the
methodology development section.
1.3 Haplotype Analysis and Genotype Imputation
The preceding sections have discussed one way to boost the power of GWAS through
increasing the study sample size, i.e., genotyping more cases and controls or efficient reuse of
existing data (highlighted in the WTCCC study and discussed in depth in Chapters 3 and
4). Approached from a different perspective, the association analysis can be expanded by
including a larger set of test hypotheses through haplotype analysis and genotype imputation. Haplotype
analysis complements the single SNP association analysis by inferring haplotypes using a small
number of neighboring genotyped SNPs so that the untyped SNPs residing in between are likely
to be captured by these haplotypes. Genotype imputation capitalizes on external LD information
available in a reference panel to statistically impute untyped variants such that a greater number
of variants than originally genotyped on a GWAS chip can be tested for associations. Both
haplotypes and imputed SNPs shed light on unmeasured genetic variants and they are identical in
certain respects.
1.3.1 Haplotype Analysis
Haplotypes are the alignment of SNPs or other forms of genetic variants on the same
homologous chromosome inherited from a single parent with little evidence of contemporary
recombination [35]. Haplotypes are of importance to investigators for several reasons:

Haplotypes can be in higher LD with the true causal variant than any genotyped SNPs. For illustrative
purposes, consider a simplified example of three haplotypes h1, h2 and h3 constituted by three
SNPs 1-3 (Figure 1.2). Suppose SNP 2 is the causal SNP but not genotyped whereas both SNPs 1
and 3 are typed. Simply testing single SNP associations for SNPs 1 or 3 would not be sufficient
to capture the risk allele of SNP 2. When haplotypes can be constructed from SNPs 1 and 3, i.e.,
(1 1), (0 1), and (0 0), it is evident that the (0 1) haplotype is perfectly correlated with SNP 2. By
leveraging the LD between correlated SNPs (in the same “haplotype block”), the likelihood of
identifying unmeasured yet potentially causal SNPs could be improved.

On the other hand, haplotypes themselves can be the causal variants. Three haplotypes
consisting of SNPs in the COMT gene-coding region have been found in association with pain
sensitivity [36]. Functional analysis suggested these haplotypes had an impact on the mRNA
secondary structure and ultimately the COMT enzymatic activity, while individual SNPs were
not implicated in the same function of altering the RNA structure. This provides a direct and
perhaps more powerful perspective to identify the true culprit for the disease [37, 38].

Figure 1.2 (From Stram 2014) Illustration of the rationale of haplotype association
In diploid genomes, haplotypes are not directly observed but inferred from genotypes.
Even if genotypes can be called with 100% accuracy, haplotypes cannot. Common haplotype
phasing programs employ an expectation-maximization (E-M) algorithm to estimate haplotypes
and their frequencies. The expectation step entails estimating the haplotype dosage $\delta_h(H)$,
counting the number of copies for each possible haplotype $h$, carried by an individual with the
true haplotype pair $H = \{h_1, h_2\}$. For each individual $i$ with the observed genotype $G_i$, starting
with an initial vector of haplotype frequencies $p^0 = (p^0_{h_1}, p^0_{h_2}, \ldots, p^0_{h_{2^m}})$ indicating $2^m$ possible
haplotypes for a total of $m$ SNPs, the expected haplotype dosage variable $\delta_{h,i}$ conditional on $G_i$ is
computed as follows,

$$E(\delta_{h,i} \mid G_i) = \frac{\sum_{H \sim G_i} \delta_h(H)\, p^k_{h_1} p^k_{h_2}}{\sum_{H \sim G_i} p^k_{h_1} p^k_{h_2}} \tag{1.4}$$

where $\sum_{H \sim G_i}$ denotes that the summation is over all haplotype pairs $H = \{h_1, h_2\}$ compatible with
$G_i$; $p^k_{h_1}$ and $p^k_{h_2}$ denote the estimated haplotype frequencies for $h_1$ and $h_2$ updated in the k-th
iteration; $\delta_h(H)$ can take possible values 0, 1, or 2. The following maximization step updates
the estimated haplotype frequency $p^{k+1}_h$ by averaging over all subjects ($i = 1, \ldots, N$)

$$p^{k+1}_h = \frac{1}{2N} \sum_{i=1}^{N} E(\delta_{h,i} \mid G_i) \tag{1.5}$$
The E-M algorithm repeats this process by alternating between equations 1.4 and 1.5 until the
haplotype dosage estimates $E(\delta_{h,i} \mid G_i)$ converge. These dosage variables can be substituted in the
downstream association tests to account for haplotype uncertainty. Haplotype effects can be
modeled either together for all $h$’s within a haplotype block in an omnibus test with multiple d.f.
as below,

$$g[E(Y_i)] = \mu + \sum_{h} \beta_h \hat{\delta}_{h,i} \tag{1.6}$$

or individually for a specific haplotype $h$ in a one d.f. test,

$$g[E(Y_i)] = \mu + \beta_h \hat{\delta}_{h,i} \tag{1.7}$$

where $g[\cdot]$ is the link function in the generalized linear model; $\mu$ is the mean of the outcome
variable; $\hat{\delta}_{h,i}$ is the estimated dosage of haplotype $h$ carried by subject $i$ from the above E-M
algorithm. This “expectation-substitution” method has reasonable statistical properties under the
null hypothesis of no haplotype effects and yields acceptable results under the alternative as well
unless the true magnitudes of haplotype effects are quite large [39, 40].
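The alternation between the expectation step (equation 1.4) and the maximization step (equation 1.5) can be sketched for a handful of biallelic SNPs as follows. This is a toy illustration under stated assumptions, not the algorithm of any particular phasing program: genotypes are tuples of allele counts, all 2^m haplotypes are enumerated explicitly (feasible only for small m), the starting frequencies are uniform, and haplotype pairs are treated as ordered so that heterozygous pairs are counted twice. All function names are hypothetical.

```python
from itertools import product


def compatible_pairs(genotype, haplotypes):
    """Ordered pairs (h1, h2) whose allele counts add up to the genotype."""
    return [(h1, h2) for h1, h2 in product(haplotypes, repeat=2)
            if all(a + b == g for a, b, g in zip(h1, h2, genotype))]


def em_haplotypes(genotypes, n_snps, n_iter=100, tol=1e-10):
    """Return (haplotype frequencies, per-subject expected dosages)."""
    haplotypes = list(product((0, 1), repeat=n_snps))        # 2^m haplotypes
    freqs = {h: 1.0 / len(haplotypes) for h in haplotypes}    # uniform start
    pairs = [compatible_pairs(g, haplotypes) for g in genotypes]
    dosages = []
    for _ in range(n_iter):
        # E-step (eq. 1.4): expected copies of each haplotype per subject,
        # weighting every compatible pair by the product of its frequencies.
        dosages = []
        for pair_list in pairs:
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pair_list]
            total = sum(weights)
            expected = {h: 0.0 for h in haplotypes}
            for (h1, h2), w in zip(pair_list, weights):
                expected[h1] += w / total
                expected[h2] += w / total
            dosages.append(expected)
        # M-step (eq. 1.5): average the expected dosages over 2N chromosomes.
        new_freqs = {h: sum(d[h] for d in dosages) / (2 * len(genotypes))
                     for h in haplotypes}
        change = max(abs(new_freqs[h] - freqs[h]) for h in haplotypes)
        freqs = new_freqs
        if change < tol:
            break
    return freqs, dosages
```

For example, with genotypes `[(2, 2), (0, 0), (1, 1)]` at two SNPs, the two homozygous subjects are unambiguous, and the iterations resolve the double heterozygote towards the (1 1)/(0 0) pair. The returned per-subject dosages are the quantities that would be substituted into equations 1.6 and 1.7.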
1.3.2 Genotype Imputation
Commercial GWAS chips are designed in a cost-effective way such that common genetic
variation in the human genome is captured through probes of known SNPs (called “tag SNPs”),
rather than by attempting to genotype every variant. As illustrated in Figure 1.3, the dots represent the
untyped SNPs in a GWAS sample that need to be imputed. A much denser reference panel
such as the HapMap or, more recently, the 1000 Genomes Project (1KGP) data contains many
more genetic variants, the linkage information between which can be leveraged to predict the
unmeasured SNPs in the study sample. Chromosomal regions shared between the study sample
and the subjects in the reference panel by identity-by-descent (IBD) are identified through
stretching a short distance around the observed SNP (boxed haplotypes in Figure 1.3(b)); the
haplotypes of each individual in the study samples are now constructed as the “mosaic” of
haplotypes in the reference panel (Figure 1.3(c)). By matching the same haplotypes, missing
genotypes can be predicted using the known genotypes available in the reference haplotypes.
Behind many commonly implemented programs such as IMPUTE2 [41] and MaCH [42] is
the Hidden Markov Model (HMM) that makes fast and reliable imputation for a large scale of
SNPs possible. The HMM explores a myriad of unobserved possible ways of recombination that
are consistent with the observed genotypes in the study sample and gives imputed genotype
dosages at untyped loci with associated probabilities. One method to evaluate the quality of
imputation is to compute the correlation between true genotype count and imputed dosage
variable. In practice, we rarely know the true genotypes, although we can mask a fraction of
known genotypes to assess the accuracy for these already genotyped SNPs, which is again not
necessarily informative about the unmeasured SNPs of primary interest. The correlation is
instead computed as the fraction of variance explained assuming Hardy-Weinberg equilibrium
for the imputed SNPs, i.e., $\frac{Var(E(n_A \mid G_i))}{2\hat{p}(1 - \hat{p})}$, where $n_A$ is the allele count with estimated allele
frequency $\hat{p}$; $E(n_A \mid G_i)$ is the imputed dosage variable; $\hat{p}$ is empirically estimated as one half of
the average of the dosages, i.e., $\hat{p} = \frac{1}{2N} \sum_i E(n_A \mid G_i)$. Note that this fraction, denoted as $R^2$ in
MaCH and similar to the info score metrics in IMPUTE2, in turn relies on the accuracy of
imputed dosage variables $E(n_A \mid G_i)$. Large $R^2$ values, i.e., close to 1, generally imply accurate
imputation; inclusion of only well imputed SNPs (e.g., $R^2 > 0.8$ for imputed SNPs with an
estimated MAF > 1%) is advised in downstream analyses.
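The variance-ratio quality measure described above can be computed directly from a vector of imputed dosages. The sketch below is only an illustration of the formula, not the code used by MaCH or IMPUTE2. It assumes the variance is the population variance (divisor N), and the function names and the default cutoffs in `passes_qc` are hypothetical choices mirroring the thresholds quoted in the text.

```python
def imputation_r2(dosages):
    """Variance of imputed dosages over the HWE variance 2 * p * (1 - p)."""
    n = len(dosages)
    mean = sum(dosages) / n
    p_hat = mean / 2.0                      # one half of the average dosage
    if not 0.0 < p_hat < 1.0:
        raise ValueError("estimated allele frequency must lie in (0, 1)")
    variance = sum((d - mean) ** 2 for d in dosages) / n
    return variance / (2.0 * p_hat * (1.0 - p_hat))


def passes_qc(dosages, min_r2=0.8, min_maf=0.01):
    """Keep an imputed SNP only if it is well imputed and not too rare."""
    p_hat = sum(dosages) / (2.0 * len(dosages))
    maf = min(p_hat, 1.0 - p_hat)
    return imputation_r2(dosages) > min_r2 and maf > min_maf
```

With certain genotypes in Hardy-Weinberg proportions, e.g. dosages `[0, 1, 1, 2]`, the ratio equals 1. When the dosages are shrunk towards their mean, e.g. `[0.5, 1, 1, 1.5]`, it drops to 0.25, reflecting the loss of information from uncertain imputation.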
Figure 1.3 (Adapted from Li et al. 2009) Illustration of genotype imputation procedures
(a) Phase genotypes into haplotypes for study samples and download phased haplotypes from HapMap or 1KGP.
(b) Identify the same haplotypes in the reference panel as in the basis (e.g., haplotypes AAA and GCA).
(c) Fill in imputed genotypes based on the information on the reference panel.
Further, imputation plays an important role in combining data that are genotyped on
different platforms or chip versions. Newer chips generally carry more genetic variants than old
ones. Imputation of genotypes obtained from a variety of sources to a common reference panel
facilitates the reuse of preexisting genotype data and makes more SNPs available for
association testing than restricting the analysis to the smaller number of SNPs that overlap between different
chips. Some properties and issues arising from imputation based on genotypes of different
origins will be discussed in Chapter 5.
1.4 Multiple Testing
Another issue that has plagued the power of GWAS is the burden of multiple testing of
millions of markers while having to control the type I error appropriately (family-wise error rate
(FWER) = 0.05). Consider a GWAS testing 1 million SNPs: 50,000 of them (5%) are
expected to have p < 0.05 merely by chance even under the null hypothesis of no association.
Several recommendations have been made on the threshold at which researchers should declare genome-wide
significance [43-45]. Simulation studies based on HapMap Encyclopedia of DNA elements
(ENCODE) examining all common SNPs (MAF > 0.05) in European populations reported a
threshold of $5 \times 10^{-8}$, coincidentally equal to a naïve application of Bonferroni ($0.05/10^6$) [43].
Similar results were obtained by simulating sequence level data for as many as 5,000
cases and 5,000 controls, suggesting $3 \times 10^{-8}$ for Europeans and East Asians and $1.5 \times 10^{-8}$ for West
Africans [44].
Permutation tests randomly shuffle the phenotypes of study samples because under the
null hypothesis the outcome is unrelated to the genotypes. By permuting the phenotype
sufficiently many times, an empirical distribution of test statistics can be obtained under the null
hypothesis while the correlation structure among the genotype data is still preserved. Then the test
statistic observed in the real data is compared to the permutation-based empirical distribution,
from which a statistical inference can be formally made. Dudbridge proposed an efficient
modification to standard permutation tests by fitting analytic distributions to the minimum
p-value obtained through each permutation [46]. The rationale is simple: for $n$ independent and
identically distributed variables $x_i$ following a standard uniform distribution, $x_i \sim Unif(0,1)$, the
minimum of all $x_i$’s would follow a beta distribution, i.e., $x_{(1)} \sim beta(1, n)$. The two shape
parameters $\alpha$ and $\beta$ can be solved iteratively by maximum likelihood estimation (equation 1.8).

$$\ell(\alpha, \beta) = \sum_{perm} \left[ (\alpha - 1) \log p_{(1)} + (\beta - 1) \log(1 - p_{(1)}) + \log \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \right] \tag{1.8}$$

where $p_{(1)}$ is the minimum p-value recorded from each replicate; the log likelihood is summed
over all replicates. If the estimate of $\alpha$ is very close to 1, then the estimate of $\beta$ can be interpreted
as the “effective” number of tests and then be substituted in the denominator of the Bonferroni
correction. Dudbridge’s method uses the actual minimum p-values from the permuted data to fit an analytic
distribution, rather than just counting how often the test statistics in permutation replicates are greater than
the observed one. In this sense, more information from permutations is incorporated than
standard permutation tests. The efficiency of fitting minimum p-values to beta distributions was
more than ten times higher than that of standard permutations, evaluated by comparing the
number of permutations required to achieve the same length of confidence intervals [46].
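As a rough illustration of the beta-fitting idea, the sketch below evaluates the log likelihood of equation 1.8 and, as a deliberate simplification, estimates the second shape parameter with the first fixed at 1. In that case the likelihood reduces to R log(beta) + (beta - 1) * sum(log(1 - p_(1))) over R replicates, and the maximizer has the closed form beta = -R / sum(log(1 - p_(1))). The full two-parameter fit described in the text requires an iterative optimizer and is not reproduced here; the function names are hypothetical.

```python
import math


def beta_loglik(alpha, beta, min_pvalues):
    """Log likelihood of equation 1.8, summed over permutation replicates."""
    const = (math.lgamma(alpha + beta)
             - math.lgamma(alpha) - math.lgamma(beta))
    return sum(const
               + (alpha - 1.0) * math.log(p)
               + (beta - 1.0) * math.log1p(-p)
               for p in min_pvalues)


def effective_tests_alpha_one(min_pvalues):
    """MLE of beta under the simplifying assumption alpha = 1."""
    return -len(min_pvalues) / sum(math.log1p(-p) for p in min_pvalues)


def adjusted_threshold(min_pvalues, fwer=0.05):
    """Bonferroni-style threshold using the estimated effective number of tests."""
    return fwer / effective_tests_alpha_one(min_pvalues)
```

Under this simplification, the estimate returned by `effective_tests_alpha_one` plays the role of the "effective" number of tests; whether fixing the first shape parameter at 1 is reasonable should be checked against the two-parameter fit for the data at hand.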
While correctly maintaining the FWER at desired level in the presence of dependent
tests, permutation testing is well known for being computationally intensive even with the aid of
high-performance computing clusters. In principle, the derived significance levels can only be
applied to the underlying data, unless similarity can be assumed prior to extrapolating results to
other data.
In my completed work, permutation testing was employed to derive a chromosome-wide
significance level for chromosome 22 that was used as a reference in search of putative
haplotype effects across the whole genome adopting a sliding window framework (Chapter 2).
Region-specific significance levels for both typed and imputed SNPs were determined through
permutations according to their LD with the regional index markers identified in previous studies
(correlated SNP: $r^2 > 0.5$; uncorrelated SNP: $r^2 < 0.5$, estimated in the 1000 Genomes EUR
populations) (Chapter 3).
1.5 Overall Summary
This dissertation consists of six chapters, aimed at discussing the expansion of the scope
of GWAS in African Americans (AA) in three ways: 1) expanding data through the reuse of
existing genotyping data; 2) expanding the total number of tests by examining haplotype effects
in addition to single marker associations; and 3) expanding the number of variants through genotype
imputation using the much denser 1KGP reference panels.
Chapter 1 has introduced the power of current large-scale GWAS in identifying novel
risk variants for a wide range of human diseases and the challenges to studies with unknown
population structure, as well as commonly used techniques to address those problems.
Chapter 2 expands association tests by investigating haplotype effects across the genome
for breast cancer risk among AA women. Haplotypes inferred from neighboring SNPs can
capture more LD information and can be more powerful in identifying novel risk loci.
Chapter 3 describes a multiethnic fine-mapping of risk loci for multiple myeloma (MM)
capitalizing on the LD differences between the European and African populations. Replication of
risk signals in both EAs and AAs and localization of causal variants facilitate understanding of
disease etiology.
Chapter 4 demonstrates both the challenges and opportunities in reusing existing control
data with admixture through evaluating the behavior of non-centrality parameter for the
“Bourgain” test. Reusing controls with large admixture heterogeneity is feasible.
Chapter 5 evaluates differential errors arising from various imputation strategies and the
impact on association testing in the AAMM GWAS. Using more stringent thresholds, e.g. higher
MAF and/or imputation quality score, may assist in ameliorating the overall inflation in test
statistics to varying degrees, although it does not eradicate the false positives and still suffers
from loss of power to detect true associations due to a much reduced number of remaining
variants.
Chapter 6 concludes the dissertation by revisiting some key characteristics of GWAS in
African American populations, summarizing the major approaches outlined throughout the
preceding chapters, and providing recommendations for future studies concerning the reuse of
existing samples genotyped on a different platform.
1.6 Chapter 1 References
1. Welter, D., et al., The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids
Research, 2014. 42(D1).
2. Hindorff, L.A., et al., Potential etiologic and functional implications of genome-wide association loci for
human diseases and traits. Proceedings of the National Academy of Sciences of the United States of
America, 2009. 106(23): p. 9362-7.
3. McCarthy, M.I., et al., Genome-wide association studies for complex traits: consensus, uncertainty and
challenges. Nature reviews. Genetics, 2008. 9(5): p. 356-69.
4. McClellan, J. and M.-C. King, Genetic heterogeneity in human disease. Cell, 2010. 141(2): p. 210-217.
5. Visscher, Peter M., et al., Five Years of GWAS Discovery. The American Journal of Human Genetics, 2012.
90(1): p. 7-24.
6. The Wellcome Trust Case and Control Consortium, Genome-wide association study of 14,000 cases of
seven common diseases and 3,000 shared controls. Nature, 2007. 447(7145): p. 661-78.
7. Franke, A., et al., Genome-wide meta-analysis increases to 71 the number of confirmed Crohn's disease
susceptibility loci. Nature genetics, 2010. 42(12): p. 1118-25.
8. Ripke, S., et al., Biological insights from 108 schizophrenia-associated genetic loci. Nature, 2014. 511: p.
421-427.
9. Pulit, S.L., et al., Association claims in the sequencing era. Genes, 2014. 5(1): p. 196-213.
10. Witte, J.S., P.M. Visscher, and N.R. Wray, The contribution of genetic variants to disease depends on the
ruler. Nature Reviews Genetics, 2014. 15(11): p. 765-776.
11. Chen, F., et al., A genome-wide association study of breast cancer in women of African ancestry. Human
genetics, 2012.
12. Feng, Y., et al., A comprehensive examination of breast cancer risk loci in African American women.
Human molecular genetics, 2014: p. 1-9.
13. Han, Y., et al., Generalizability of established prostate cancer risk variants in men of African ancestry. Int
J Cancer, 2015. 136(5): p. 1210-7.
14. Haiman, C.A., et al., Genome-Wide Testing of Putative Functional Exonic Variants in Relationship with
Breast and Prostate Cancer Risk in a Multiethnic Population. PLoS Genetics, 2013. 9(3).
15. Gibson, G., Hints of hidden heritability in GWAS. 2010. p. 558-560.
16. Marjoram, P., A. Zubair, and S.V. Nuzhdin, Post-GWAS: where next? More samples, more SNPs or more
biology? Heredity, 2014. 112(1): p. 79-88.
17. Cordell, H.J., Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans.
Human Molecular Genetics, 2002. 11(20): p. 2463-2468.
18. Zuk, O., et al., The mystery of missing heritability: Genetic interactions create phantom heritability. Proc
Natl Acad Sci U S A, 2012. 109(4): p. 1193-8.
19. Haiman, C.A. and D.O. Stram, Exploring genetic susceptibility to cancer in diverse populations. Current
opinion in genetics & development, 2010. 20(3): p. 330-5.
20. Cai, Q., et al., Genome-wide association study identifies breast cancer risk variant at 10q21.2: Results from
the Asia Breast Cancer Consortium. Human Molecular Genetics, 2011. 20(24): p. 4991-4999.
21. Haiman, C.A., et al., Characterizing genetic risk at known prostate cancer susceptibility loci in African
Americans. PLoS Genetics, 2011. 7(5).
22. Cheng, I., et al., Evaluating genetic risk for prostate cancer among Japanese and Latinos. Cancer
epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research,
cosponsored by the American Society of Preventive Oncology, 2012. 21(11): p. 2048-58.
23. Henderson, B.E., et al., The influence of race and ethnicity on the biology of cancer. 2012. p. 648-653.
24. Haiman, C.A., et al., Multiple regions within 8q24 independently affect risk for prostate cancer. Nature
genetics, 2007. 39: p. 638-644.
25. Udler, M.S., et al., FGFR2 variants and breast cancer risk: fine-scale mapping using African American
studies and analysis of chromatin conformation. Human molecular genetics, 2009. 18(9): p. 1692-703.
26. Campbell, M.C. and S.A. Tishkoff, African genetic diversity: implications for human demographic history,
modern human origins, and complex disease mapping. Annual review of genomics and human genetics,
2008. 9: p. 403-433.
27. Chakraborty, R. and K.M. Weiss, Admixture as a tool for finding linked genes and detecting that difference
from allelic association between loci. Proceedings of the National Academy of Sciences of the United
States of America, 1988. 85(23): p. 9119-9123.
28. Freedman, M.L., et al., Assessing the impact of population stratification on genetic association studies.
Nature genetics, 2004. 36(4): p. 388-393.
29. Price, A.L., et al., New approaches to population stratification in genome-wide association studies. Nature
reviews. Genetics, 2010. 11(7): p. 459-63.
30. Yang, J., et al., Advantages and pitfalls in the application of mixed-model association methods. Nature
genetics, 2014. 46(2): p. 100-6.
31. Devlin, B. and K. Roeder, Genomic control for association studies. Biometrics, 1999. 55(4): p. 997-1004.
32. Price, A.L., et al., Principal components analysis corrects for stratification in genome-wide association
studies. Nat Genet, 2006. 38: p. 904-909.
33. Kang, H.M., et al., Variance component model to account for sample structure in genome-wide association
studies. Nature genetics, 2010. 42(4): p. 348-54.
34. Yang, J., et al., GCTA: a tool for genome-wide complex trait analysis. American journal of human genetics,
2011. 88(1): p. 76-82.
35. Gabriel, S.B., et al., The structure of haplotype blocks in the human genome. Science (New York, N.Y.),
2002. 296(5576): p. 2225-9.
36. Nackley, A.G., et al., Human catechol-O-methyltransferase haplotypes modulate protein expression by
altering mRNA secondary structure. Science (New York, N.Y.), 2006. 314(5807): p. 1930-1933.
37. Mouaffak, F., et al., Association of an UCP4 (SLC25A27) haplotype with ultra-resistant schizophrenia.
Pharmacogenomics, 2011. 12(2): p. 185-193.
38. Cheong, H.S., et al., Association of RANBP1 haplotype with smooth pursuit eye movement abnormality.
American Journal of Medical Genetics, Part B: Neuropsychiatric Genetics, 2011. 156(1): p. 67-71.
39. Kraft, P. and D.O. Stram, Re: the use of inferred haplotypes in downstream analysis. American journal of
human genetics, 2007. 81(4): p. 863-5; author reply 865-6.
40. Hu, Y.J. and D.Y. Lin, Analysis of untyped SNPs: maximum likelihood and imputation methods. Genetic
epidemiology, 2010. 34(8): p. 803-15.
41. Howie, B.N., P. Donnelly, and J. Marchini, A flexible and accurate genotype imputation method for the
next generation of genome-wide association studies. PLoS Genet, 2009. 5(6): p. e1000529.
42. Li, Y., et al., MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes.
Genetic Epidemiology, 2010. 34(8): p. 816-834.
43. Pe'er, I., et al., Estimation of the multiple testing burden for genomewide association studies of nearly all
common variants. Genetic epidemiology, 2008. 32(4): p. 381-5.
44. Hoggart, C.J., et al., Genome-wide significance for dense SNP and resequencing data. Genetic
epidemiology, 2008. 32(2): p. 179-85.
45. Dudbridge, F. and A. Gusnanto, Estimation of significance thresholds for genomewide association scans.
Genetic epidemiology, 2008. 32: p. 227-34.
46. Dudbridge, F. and B.P.C. Koeleman, Efficient computation of significance levels for multiple associations
in large studies of correlated data, including genomewide association studies. American journal of human
genetics, 2004. 75(3): p. 424-35.
2 A Genome-wide Scan for Breast Cancer Risk Haplotypes among African American
Women
Chi Song¹, Gary K. Chen¹, Robert C. Millikan², Christine B. Ambrosone³, Esther M. John⁴,
Leslie Bernstein⁵, Wei Zheng⁶, Jennifer J. Hu⁷, Regina G. Ziegler⁸, Sarah Nyante², Elisa V.
Bandera⁹, Sue A. Ingles¹, Michael F. Press¹⁰, Sandra L. Deming⁶, Jorge L. Rodriguez-Gil⁷,
Stephen J. Chanock⁸, Peggy Wan¹, Xin Sheng¹, Loreall C. Pooler¹, David J. Van Den Berg¹,¹¹,
Loic Le Marchand¹², Laurence N. Kolonel¹², Brian E. Henderson¹, Chris A. Haiman¹, Daniel O.
Stram¹*
¹ Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive
Cancer Center, University of Southern California, Los Angeles, CA, USA
² Department of Epidemiology, Gillings School of Global Public Health, and Lineberger
Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC, USA
³ Department of Cancer Prevention and Control, Roswell Park Cancer Institute, Buffalo, NY,
USA
⁴ Cancer Prevention Institute of California, Fremont, CA and Stanford University School of
Medicine and Stanford Cancer Institute, Stanford, CA, USA
⁵ Division of Cancer Etiology, Department of Population Science, Beckman Research Institute,
City of Hope, Duarte, CA, USA
⁶ Division of Epidemiology, Department of Medicine, Vanderbilt Epidemiology Center, and
Vanderbilt-Ingram Cancer Center, Vanderbilt University School of Medicine, Nashville, TN,
USA
⁷ Sylvester Comprehensive Cancer Center and Department of Epidemiology and Public Health,
University of Miami Miller School of Medicine, Miami, FL, USA
⁸ Epidemiology and Biostatistics Program, Division of Cancer Epidemiology and Genetics,
National Cancer Institute, Bethesda, MD, USA
⁹ The Cancer Institute of New Jersey, New Brunswick, NJ, USA
¹⁰ Department of Pathology, Keck School of Medicine and Norris Comprehensive Cancer Center,
University of Southern California, Los Angeles, CA, USA
¹¹ Epigenome Center, Norris Comprehensive Cancer Center, University of Southern California,
Los Angeles, CA, USA
¹² Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, USA
*To whom the correspondence should be addressed: stram@usc.edu
2.1 Abstract
Genome-wide association studies (GWAS) simultaneously investigating hundreds of
thousands of single nucleotide polymorphisms (SNPs) have become a powerful tool in the
investigation of new disease susceptibility loci. Haplotypes are sometimes thought to be superior
to SNPs and are promising in genetic association analyses. The application of genome-wide
haplotype analysis, however, is hindered by the complexity of haplotypes themselves and
sophistication in computation. We systematically analyzed the haplotype effects for breast cancer
risk among 5,761 African American women (3,016 cases and 2,745 controls) using a sliding
window approach on the genome-wide scale. Three regions on chromosomes 1, 4 and 18
exhibited moderate haplotype effects. Furthermore, among 21 breast cancer susceptibility loci
previously established in European populations, 10p15 and 14q24 are likely to harbor novel
haplotype effects. We also proposed a heuristic for determining the significance level and the
effective number of independent tests through permutation analysis of the chromosome 22 data. The analysis
suggests that the effective number was approximately half of the total (7,794 out of 15,645); thus,
half the total number of tests could serve as a quick reference for evaluating genome-wide significance if a
similar sliding window approach to haplotype analysis is adopted in similar populations with
similar genotype density.
Key words: Genome-wide association study; genome-wide significance; haplotype analysis;
sliding window; multiple testing correction; breast cancer; African American women.
2.2 Introduction
Genome-wide association studies (GWAS) have been demonstrated to have the power to
detect modest to small effects of genetic variants on various common diseases [1]. A large
number of novel SNPs have been identified and successfully replicated in associations with
complex diseases, such as cancers, diabetes, and cardiovascular disease [2]. Meanwhile,
haplotype analysis has become a prominent example of multilocus genetic association studies
and has assisted in finding new disease susceptibility loci [3-8]. Haplotypes consist of SNPs or
other genetic markers on the same chromosome that are inherited together with little
contemporary recombination [9]. Haplotype information may aid GWAS in identifying new
marker-phenotype associations for several reasons [10]. First, haplotypes characterize the exact
organization of alleles along the chromosome. Although D’ and $r^2$ are useful in capturing the
linkage disequilibrium (LD) pattern between a pair of markers, they are hard to extend to
higher orders of dependency among markers. As a result, LD analysis based on underlying
haplotypes can be more accurate [11]. Second, by constructing haplotype blocks from SNPs,
more information can be incorporated into the association tests, especially when haplotypes
themselves are in closer LD with the causal variant than any single genotyped SNP [12].
Haplotype analysis has been reported to be superior to analysis based on individual SNPs by
simulation [13] and empirical studies [14, 15].
Although haplotype analysis is seemingly appealing, its implementation on the genome-wide
scale is unwieldy given the uncertainty and complexity of haplotypes [16], as well as the
difficulty of adjusting for multiple testing when hundreds of thousands of hypotheses are being
tested simultaneously. For instance, there is no consensus on the exact definition of haplotype
blocks, which makes the boundaries of haplotype blocks ambiguous [17]. One definition is
based on D’ among neighboring SNPs which needs to exceed a pre-specified cutoff value [18];
another commonly implemented method requires a reduced haplotype diversity on a
chromosomal segment [19]. Unfortunately, no method is uniformly better than the others in
application [15]. We favor a sliding window framework since haplotypes can be quickly
constructed and all genotyped SNPs are incorporated [20]. Fixed window sizes are
computationally easier and more efficient in practice than varying window sizes. Mathias
et al. [21] successfully identified five asthma susceptibility loci on chromosome 11 in African
Americans via the sliding window approach, in which the window sizes were 2 – 6 SNPs.
Lambert et al. [22] adopted a similar approach where 10 consecutive haplotype tagging SNPs
(htSNPs) were defined as a sliding window and found a haplotype residing in FRMD4A gene at
10p13 with increased risk for Alzheimer’s disease. In this paper, we scanned throughout the 22
autosomes to search for significant haplotype effects for breast cancer risk among 5,761 African
American women using the sliding window approach of 5 contiguous SNPs. The haplotype
effects were then compared with individual SNP effects including genotyped and imputed SNPs
at the same chromosomal position. To determine a valid significance level, 1,000 permutations
were performed using the chromosome 22 data. The permutation-based chromosome-wide
significance level for chromosome 22 and the effective number of independent tests were
computed from the empirical distribution of the minimum p-values. The genome-wide
significance level can then be readily determined through Bonferroni correction by substituting
the effective number of tests for the total number of tests. While globally significant results were
not obtained, closer attention should be paid to the regions revealed by the most significant
haplotypes on chromosomes 1, 4 and 18. We also scrutinized 21 known breast cancer risk
regions [23] for potential haplotype effects and found 10p15 and 14q24 may possess novel
haplotype effects.
2.3 Materials and Methods
2.3.1 Ethics statement
The Institutional Review Board at the University of Southern California approved the
study protocol. All participants gave informed written consent at the time of blood draw.
2.3.2 Study population
There were a total of 5,984 African American women included in this study, of which
3,153 were cases with breast cancer and 2,831 were controls. The entire sample was derived
from nine epidemiological studies: (i) The Multiethnic Cohort Study (MEC) [24]: 734 cases and
1,003 controls; (ii) The Los Angeles component of the Women’s Contraceptive and
Reproductive Experiences Study (CARE) [25]: 380 cases and 224 controls; (iii) The Women’s
Circle of Health Study (WCHS) [26]: 272 cases and 240 controls; (iv) The San Francisco Bay
Area Breast Cancer Study (SFBCS) [27]: 172 cases and 231 controls; (v) The Northern
California Breast Cancer Family Registry (NC-BCFR) [28]: 440 cases and 53 controls; (vi) The
Carolina Breast Cancer Study (CBCS) [29]: 656 cases and 608 controls; (vii) The Prostate, Lung,
Colorectal, and Ovarian Cancer Screening Trial (PLCO) Cohort [30]: 64 cases and 133 controls;
(viii) The Nashville Breast Health Study (NBHS) [31]: 310 cases and 186 controls; (ix) Wake
Forest University Breast Cancer Study (WFBC) [32]: 125 cases and 153 controls. All cases were
African American women diagnosed with invasive or in situ breast cancer. Controls were mainly
recruited through random digit dialing. A more detailed description of the characteristics of each
study is available in Supplementary Table S1 and elsewhere [23].
2.3.3 Genotyping and Quality Control
Genotyping was performed using the Illumina Human 1M-Duo chip. Individuals whose
samples had low DNA concentrations (< 20 ng/μl) were removed (n = 52). We also removed
unexpectedly related individuals (n = 29), individuals with call rates < 95% (n = 100) or African ancestry < 5% (n =
36), and individuals of ambiguous sex (n = 6). We excluded SNPs with call rate < 95% (n =
21,732) and minor allele frequency (MAF) < 1% (n = 80,193). SNPs with a concordance rate
lower than 98% were removed too (n = 11,701). The average concordance rate of the sample was
99.95%. Hardy-Weinberg equilibrium (HWE) was not imposed as one of the quality control
criteria given that African Americans are known as an admixed population [33]. Except for a
SNP on chromosome 5 showing significant deviation from HWE (discussion follows in the
Results section), none of the other SNPs included in the following analyses were severely out of
HWE (exact test p-value > $1\times10^{-6}$) [34]. The total number of SNPs remaining in the analysis was
1,006,480 in 5,761 subjects (3,016 cases and 2,745 controls).
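The sample accounting in Sections 2.3.2 and 2.3.3 can be checked with a few lines of arithmetic. The sketch below simply re-adds the per-study counts and the sample-level exclusions quoted above; the dictionary keys are shorthand labels chosen here for illustration and are not identifiers used in the study.

```python
# Per-study (cases, controls) counts as listed in Section 2.3.2.
studies = {
    "MEC": (734, 1003),
    "CARE": (380, 224),
    "WCHS": (272, 240),
    "SFBCS": (172, 231),
    "NC-BCFR": (440, 53),
    "CBCS": (656, 608),
    "PLCO": (64, 133),
    "NBHS": (310, 186),
    "WFBC": (125, 153),
}

# Individuals removed at each sample-level QC step (Section 2.3.3).
removed = {
    "low DNA concentration": 52,
    "unexpected relatedness": 29,
    "call rate < 95%": 100,
    "African ancestry < 5%": 36,
    "ambiguous sex": 6,
}

total_cases = sum(cases for cases, _ in studies.values())
total_controls = sum(controls for _, controls in studies.values())
total_before_qc = total_cases + total_controls
total_after_qc = total_before_qc - sum(removed.values())

print(total_cases, total_controls, total_before_qc, total_after_qc)
```

The totals agree with the 3,153 cases, 2,831 controls, 5,984 enrolled women, and 5,761 analyzed subjects reported in the text.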
2.3.4 Statistical Analysis
Sliding window size. The sliding window approach was adopted to define haplotype blocks
throughout 22 autosomes for its maximum coverage of genotyped SNPs given the exploratory
nature of the present study. The choice of the 5-SNP window was mostly in agreement with the
average block size for the HapMap Yoruba population [in Ibadan, Nigeria (YRI), HapMap Phase
II] (Supplementary Table S2). Wang et al. [35] showed that, based on Gabriel’s definition of
haplotype blocks [9, 36], 57% of LD blocks in the YRI population were shorter than 10kb and
37% of the blocks were between 10kb and 50kb. For our AABC data, the universal 5-SNP
windows across 22 chromosomes achieved a comparable distribution of haplotype block sizes,
i.e., 55% of the 5-SNP windows shorter than 10kb and 44% between 10kb and 50kb long. The
distributions of haplotype block sizes defined by sliding windows did not differ greatly by
chromosome (Supplementary Figure S1), indicating that the 5-SNP sliding windows do not give
disproportionately poor coverage of exceptionally long or short blocks on any chromosome.
Admittedly, the universal 5-SNP windows across 22 autosomes or throughout
approximately 1 million SNPs may not comprehensively capture individual haplotype block size
variations at specific loci. It is nonetheless deemed a fairly good approximation with some
theoretical basis.
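As a minimal sketch of the windowing scheme described above, the following code enumerates overlapping windows of five consecutive SNPs and tabulates how many span less than 10kb or between 10kb and 50kb. The positions are toy values invented for illustration, and the function names `sliding_windows` and `span_shares` are hypothetical; the actual analysis was run in PLINK and is not reproduced here.

```python
def sliding_windows(positions, size=5):
    """Return (first_index, last_index, span_bp) for every window of `size` consecutive SNPs."""
    windows = []
    for first in range(len(positions) - size + 1):
        last = first + size - 1
        windows.append((first, last, positions[last] - positions[first]))
    return windows


def span_shares(windows, short_bp=10_000, long_bp=50_000):
    """Fractions of windows shorter than short_bp, and between short_bp and long_bp."""
    n = len(windows)
    short = sum(1 for _, _, span in windows if span < short_bp)
    medium = sum(1 for _, _, span in windows if short_bp <= span <= long_bp)
    return short / n, medium / n


# Toy base-pair positions (hypothetical), sorted along one chromosome.
positions = [1_000, 3_000, 4_500, 7_000, 9_000, 30_000, 31_000, 33_000]
windows = sliding_windows(positions)
short_share, medium_share = span_shares(windows)
print(len(windows), short_share, medium_share)
```

With a step of one SNP, a chromosome carrying n genotyped SNPs yields n − 4 overlapping 5-SNP windows, so every genotyped SNP is covered by at least one window.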
Haplotype inference. The haplotype frequencies within each haplotype block defined by the
sliding windows were estimated using the Expectation-Maximization (E-M) algorithm outlined
by Excoffier and Slatkin [37], and Stram [38, 39]. Let $\delta_h(H)$ count the true, yet generally
unknown, number of copies of a haplotype $h$, with frequency $p_h$, contained in the haplotype pair
$H$ carried by a given individual, i.e., $\delta_h(H)$ takes possible values of 0, 1 or 2, meaning 0, 1 or 2
copies of such haplotype $h$ in haplotype pair $H$ are inherited from parents; let $E(\delta_h(H) \mid G_i)$
denote the expected number of copies of each possible haplotype $h$ given the individual’s
observed genotype $G_i$. These expectations are computed iteratively as

$$E(\delta_h(H) \mid G_i) = \frac{\sum_{H \sim G_i} \delta_h(H)\, p_{h_1} p_{h_2}}{\sum_{H \sim G_i} p_{h_1} p_{h_2}} \tag{2.1}$$

with

$$p_h = \frac{1}{2N} \sum_{i=1}^{N} E(\delta_h(H) \mid G_i) \tag{2.2}$$

where $H \sim G_i$ indicates the summation is over the haplotype pairs, $H = (h_1, h_2)$, compatible with the
observed genotype, $G_i$. The algorithm starts with initial haplotype frequencies, $p_h^{(0)}$, and updates
them iteratively. Equation (2.1) is the expectation step and (2.2) is the maximization step of the
E-M algorithm.
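The two steps can be illustrated with a short, self-contained sketch. It assumes biallelic SNPs coded as 0, 1 or 2 copies of one allele, no missing genotypes, uniform starting frequencies, and a fixed number of iterations; none of these choices is taken from the study, and the function names `compatible_pairs` and `em_haplotypes` are hypothetical. Haplotype pairs are enumerated in both orders, which leaves the ratio in (2.1) unchanged because the same factor appears in the numerator and the denominator for a given individual.

```python
from itertools import product


def compatible_pairs(genotype):
    """Enumerate ordered haplotype pairs (h1, h2) consistent with an unphased genotype."""
    per_site = []
    for g in genotype:
        if g == 0:
            per_site.append([(0, 0)])
        elif g == 2:
            per_site.append([(1, 1)])
        else:  # heterozygous site: phase is unknown, so keep both orientations
            per_site.append([(0, 1), (1, 0)])
    pairs = []
    for combo in product(*per_site):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def em_haplotypes(genotypes, n_iter=50):
    """Return (haplotype frequencies, per-individual expected haplotype dosages)."""
    pairs_by_person = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pairs_by_person for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform start (an assumption)
    n = len(genotypes)
    dosages = []
    for _ in range(n_iter):
        # Expectation step, equation (2.1): expected copies of each haplotype.
        dosages = []
        for pairs in pairs_by_person:
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            expected = dict.fromkeys(haps, 0.0)
            for (h1, h2), w in zip(pairs, weights):
                expected[h1] += w / total
                expected[h2] += w / total
            dosages.append(expected)
        # Maximization step, equation (2.2): average dosage over 2N chromosomes.
        freq = {h: sum(d[h] for d in dosages) / (2 * n) for h in haps}
    return freq, dosages


# Toy data (hypothetical): two SNPs, four individuals, one double heterozygote.
genotypes = [(0, 0), (0, 0), (2, 2), (1, 1)]
freq, dosages = em_haplotypes(genotypes)
```

In this toy example the double heterozygote is resolved almost entirely as the pair of haplotypes already observed in the homozygous individuals, which is the behavior expected when adjacent SNPs are in high LD.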
Association testing. The inferred haplotype dosage estimates, $E(\delta_h(H) \mid G_i)$, abbreviated as $\hat{\delta}_h$,
can be used individually in a 1-degree-of-freedom (d.f.) test in testing for haplotype-specific
associations with the disease using model (2.3),

$$\operatorname{logit} \Pr(Y_i = 1 \mid \hat{\delta}_h, \mathbf{X}) = \beta_0 + \beta_h \hat{\delta}_h + \boldsymbol{\beta}_x \mathbf{X} \tag{2.3}$$

or a global test simultaneously fitting all haplotypes $\hat{\boldsymbol{\delta}} = (\hat{\delta}_1, \hat{\delta}_2, \ldots, \hat{\delta}_{|H|-1})$ within the haplotype
block defined by a sliding window using model (2.4),

$$\operatorname{logit} \Pr(Y_i = 1 \mid \hat{\boldsymbol{\delta}}, \mathbf{X}) = \beta_0 + \sum_{h=1}^{|H|-1} \beta_h \hat{\delta}_h + \boldsymbol{\beta}_x \mathbf{X} \tag{2.4}$$

where $|H|$ denotes the total number of possible haplotypes within that block and the degrees of
freedom of the global test in model (2.4) are therefore $|H| - 1$. In both models, $\mathbf{X}$ is the vector of
covariates, including age, study, and the top ten eigenvectors of ancestral information estimated
by principal components analysis [40] to adjust for global ancestry differences. The eigenvectors
are included in the model to control for potential confounding due to population stratification and
admixture. In haplotype association analysis, a large fraction of the inferred haplotypes can be
very rare, with frequency close to zero [41]. It is customary to discard rare haplotypes that are
less than 1% frequent to reduce the total d.f. of the model so that the power to detect risk effects
of relatively common haplotypes can be well preserved. Suppose that there are $|H'|$ haplotypes
with frequency greater than 1%, where $|H'| \ll |H|$ holds true in many cases; the d.f. of the global
test then reduces from $|H| - 1$ to $|H'| - 1$, as indicated in model (2.5)

$$\operatorname{logit} \Pr(Y_i = 1 \mid \hat{\boldsymbol{\delta}}, \mathbf{X}) = \beta_0 + \sum_{h=1}^{|H'|-1} \beta_h \hat{\delta}_h + \boldsymbol{\beta}_x \mathbf{X} \tag{2.5}$$
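The reduction in degrees of freedom from discarding rare haplotypes can be made concrete with a small sketch. The haplotype labels and frequencies below are invented for illustration, and `global_test_df` is a hypothetical helper rather than part of any software used in the study.

```python
def global_test_df(freqs, min_freq=0.01):
    """Keep haplotypes with frequency above min_freq; the global test then has (kept - 1) d.f."""
    kept = [hap for hap, p in freqs.items() if p > min_freq]
    return kept, max(len(kept) - 1, 0)


# Toy estimated frequencies for one 5-SNP window (hypothetical values summing to 1).
window_freqs = {
    "AACGT": 0.500,
    "GACGT": 0.240,
    "GGCGT": 0.200,
    "GGTGT": 0.055,
    "AATGA": 0.004,
    "GGTAA": 0.001,
}

kept, df_reduced = global_test_df(window_freqs)
df_full = len(window_freqs) - 1
print(kept, df_full, df_reduced)
```

Here six inferred haplotypes would give a 5 d.f. global test under model (2.4), whereas only four exceed the 1% cutoff, so the test under model (2.5) has 3 d.f.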
We started with applying the global test throughout the whole genome to agnostically
search for haplotype effects following the 5-SNP sliding window framework, while the 1 d.f. test
of individual haplotype-specific effects was performed only when a potentially significant region
was detected by the global test. For visualization purposes, haplotype effects were compared to
the effects of the constituent SNPs at the same chromosomal region by an overlaid Manhattan
plot showing the statistical significance, presented as $-\log_{10}$(p-value), of both haplotypes and
SNPs. Haplotype effects would become interesting only if a noticeable haplotype effect peak was
not accompanied by a similar significance peak involving the constituent SNPs. Regions
exhibiting considerable haplotype effects were further extended both upstream and
downstream by half of the original width to include more flanking SNPs and haplotypes, making
the extended regions twice as long (Supplementary Table S3). All possible individual haplotypes
composed of 2 up to 10 SNPs (or the maximum number of genotyped SNPs contained in the
extended region, whichever is smaller) with haplotype frequency > 1% were investigated
exhaustively to single out the particular haplotype(s) explaining the significant global test. The
top individual haplotypes were further verified by a likelihood ratio (LR) test comparing the
model with both the top haplotype and the best single SNP contained (model 2.6) to the nested
model with the same best SNP only (model 2.7),

$$\operatorname{logit} \Pr(Y_i = 1 \mid \hat{\delta}_h, g_i, \mathbf{X}) = \beta_0 + \beta_h \hat{\delta}_h + \beta_g g_i + \boldsymbol{\beta}_x \mathbf{X} \tag{2.6}$$

$$\operatorname{logit} \Pr(Y_i = 1 \mid g_i, \mathbf{X}) = \beta_0 + \beta_g g_i + \boldsymbol{\beta}_x \mathbf{X} \tag{2.7}$$

where $g_i$ denotes the genotype of the SNP carried by individual $i$ and an additive
effect of each risk allele on the disease is assumed. The novelty of the haplotype effects
compared to the SNP effects was assessed using an LR test with 1 d.f. We were also interested in
whether the haplotype effects could be otherwise captured by genotype imputation in the same
region. The genotype imputation was performed by Mendel-GPU [42] using the 1000 Genomes
Project (1KGP) data as the reference panel [43]. The much denser 1KGP has a better genomic
coverage of rare and low frequency markers and is reported to be capable of providing more
statistical power to identify the underlying associations [44]. The superiority of haplotype
analysis to SNP imputation could be highlighted by the presence of haplotype signals where
significant genotyped or imputed SNPs are absent. In regions with the strongest haplotype
effects, we also inferred and adjusted for the local ancestry information for each marker residing
near the haplotypes of interest (±250kb). The local ancestry characterizes the proportions of
European and African ancestry, represented by the posterior probabilities of carrying 0, 1, and 2
copies of a European allele at each SNP. The local ancestry was computed by HAPMIX [45]
with 240 HapMap EUR + YRI phased founder haplotypes per chromosome as input. The top
haplotype effect was further adjusted for the inferred local ancestry in addition to adjustment for
global ancestry (i.e. using the leading principal components), age, and study as described above.
This additional adjustment for local ancestry could help eliminate false positive haplotype effects
that were confounded by local ancestry [46].
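The 1 d.f. likelihood ratio comparison of the nested models (2.6) and (2.7) described earlier in this section reduces to a simple calculation once the two maximized log-likelihoods are available. The sketch below assumes those log-likelihoods have already been obtained from a logistic regression fit; the numeric values are placeholders chosen so that the statistic sits at the conventional 5% critical value, and `lr_test_1df` is a hypothetical helper. The p-value uses the identity that a chi-square variable with 1 d.f. is the square of a standard normal variable.

```python
import math


def lr_test_1df(loglik_full, loglik_nested):
    """Likelihood ratio statistic and chi-square (1 d.f.) p-value for nested models."""
    stat = max(2.0 * (loglik_full - loglik_nested), 0.0)
    # P(chi2_1 > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)) for standard normal Z.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Placeholder log-likelihoods (hypothetical): haplotype + SNP model vs. SNP-only model.
loglik_haplotype_and_snp = -100.0
loglik_snp_only = -101.92072941034706

stat, p_value = lr_test_1df(loglik_haplotype_and_snp, loglik_snp_only)
print(round(stat, 3), round(p_value, 3))
```

A small p-value indicates that the haplotype dosage carries information about disease status beyond what the best single SNP already explains.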
In addition, haplotype effects in the neighborhood of known breast cancer risk SNPs
identified predominantly in European populations were investigated especially carefully.
Twenty-one regions (1p11, 2q35, 3p24, 5p12, 5q11, 6q14, 6q25, 8q24, 9p21, 9q31, 10p15,
10q21, 10q22, 10q26, 11p15, 11q13, 14q24, 16q12, 17q22, 19p13, and 20q11) and their
associated SNPs were of primary interest. Regions with potential of harboring unknown
haplotype effects were scrutinized by inferring all possible individual haplotypes of frequency >
1% consisting of 2- 10 consecutive SNPs in the neighborhood of ±250kb of known breast cancer
risk hits (except for 8q24, where ±2Mb was used [47-49]). As before, the important haplotype
effects were compared with the significance of genotyped as well as 1KGP-imputed
SNPs in the same region. The independence of these haplotype-disease associations was further
verified by LR tests adjusting for the SNP effects from both the regionally best SNP and the
known breast cancer risk SNP. Notable haplotypes residing in proximity to the known breast
cancer risk hits were again corrected for local ancestry inferred from the same region to
eliminate potential confounding due to local genetic ancestry admixture.
PLINK [50] was the primary software to conduct the association analyses. All regression
models were adjusted for age, study, and global ancestry. For important haplotypes identified
through association analyses, local ancestry was additionally adjusted for.
Permutation test. In order to obtain a valid significance threshold for the global test of haplotype
analysis, 1,000 replicates of chromosome 22 data were generated by randomly shuffling the
case-control status for each individual in the sample while maintaining the same numbers of
cases and controls as in the original data. Each replicate was analyzed using the same global test
logistic regression model to test the overall significance of haplotype blocks defined by the same
5-SNP sliding window (model 2.5). The same covariates were adjusted for as well, i.e., age, study
and global ancestry, but not local ancestry. The minimum p-values of the global tests for
haplotype block effects from 1,000 permutations were recorded and sorted in ascending order,
and the fifth percentile of the 1,000 minimum p-values was taken as the permutation-based
significance level so that the chromosome-wide type I error rate equals 0.05. Following Dudbridge et al.
[51], we replaced the total number of tests with the effective number of independent tests, $n_{\mathrm{eff}}$.
If $n_{\mathrm{eff}}$ exists, then it can be inferred from the beta distribution of the minimum p-values with
parameters $(1, n_{\mathrm{eff}})$ [52]:

$$\Pr(\min P \le \alpha) = 1 - (1 - \alpha)^{n_{\mathrm{eff}}}$$

The probability density function of the beta distribution with parameters $(a, b)$ is

$$f(x; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1-x)^{b-1}$$

where $\Gamma(\cdot)$ is the gamma function and the two parameters satisfy $a, b > 0$. Therefore beta distributions
were fitted to the minimum p-values from the 1,000 permutation replicates in two scenarios: (i)
the parameter $a$ of the beta distribution is set equal to 1; (ii) both parameters $a$ and $b$ are free to
vary. In the second scenario, the minimum p-values are consistent with the theoretical beta
distribution if the null hypothesis $a = 1$ is not rejected; $b$ can thus be interpreted as the effective
number of independent sliding windows, $n_{\mathrm{eff}}$. The parameters in the beta distribution were
estimated using the maximum likelihood estimation (MLE) method. Quantile-Quantile (QQ) plots
were generated to evaluate the goodness-of-fit of these beta distributions. The aforementioned
analysis was implemented in SAS version 9.1.2 (SAS Institute, Cary, NC).
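For the constrained scenario with $a = 1$, the maximum likelihood estimate of $b$ has a closed form, $\hat{b} = -n / \sum_i \log(1 - p_i)$, which follows from setting the derivative of the Beta(1, b) log-likelihood to zero. The sketch below illustrates that estimator on simulated minimum p-values; it is not the SAS code used in the study, the simulated data and the seed are arbitrary, and the function names are hypothetical. It also shows how a significance threshold follows from an effective number of tests, both by inverting the formula for Pr(min P ≤ α) and by the simpler division α / n_eff.

```python
import math
import random


def fit_beta_b_given_a1(min_pvalues):
    """Closed-form MLE of b for Beta(1, b): b_hat = -n / sum(log(1 - p_i))."""
    log_sum = sum(math.log1p(-p) for p in min_pvalues)
    return -len(min_pvalues) / log_sum


def threshold_from_neff(n_eff, alpha=0.05):
    """Return (t solving 1 - (1 - t)**n_eff = alpha, Bonferroni-style alpha / n_eff)."""
    exact = -math.expm1(math.log1p(-alpha) / n_eff)
    return exact, alpha / n_eff


rng = random.Random(2015)  # fixed seed so the sketch is reproducible
true_neff = 7_823          # illustrative value only
# The minimum of k independent uniform p-values follows Beta(1, k).
simulated_min_p = [rng.betavariate(1, true_neff) for _ in range(1_000)]

b_hat = fit_beta_b_given_a1(simulated_min_p)
exact_t, bonferroni_t = threshold_from_neff(b_hat)
print(round(b_hat), exact_t, bonferroni_t)
```

With 1,000 replicates the estimate of $b$ carries a relative standard error of roughly 3%, which gives a sense of how precisely an effective number of tests can be pinned down from a permutation run of this size.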
2.4 Results
The minimum p-values from the 1,000 permutations of chromosome 22 data containing 15,649
genotyped SNPs ranged between $1.54\times10^{-7}$ and $9.44\times10^{-4}$, with the fifth percentile being
$5.58\times10^{-6}$. So the permutation-based effective number of tests for chromosome 22 was simply
$0.05 / (5.58\times10^{-6}) \approx 8{,}963$. The maximum likelihood estimates of the beta distribution parameters were
$\hat{a} = 0.95$ and $\hat{b} = 7{,}426$; or $\hat{b} = 7{,}794$ if $a$ was constrained at 1. Although the null hypothesis of
equality $a = 1$ was nominally rejected in the former two-parameter case (p < 0.01), $\hat{a} = 0.95$ was
close to 1 and the QQ plot comparing it to the Beta(1,7426) distribution showed the majority of
the data points fell on the diagonal line, suggesting the lack of fit was not severe (Figure 2.1A).
When setting $a = 1$ and experimenting with different values of $b$, i.e. $7{,}400 \le b \le 8{,}300$, goodness-of-fit
tests based on empirical distribution functions (EDF) statistics (Kolmogorov-Smirnov, Cramer-
von Mises and Anderson-Darling statistics) did not reject the null hypothesis at the 0.10
significance level, implying that the minimum p-values followed the designated beta
distributions satisfactorily (Table 2.1). The range of the effective numbers of tests, 7,400 – 8,300,
included half the number of total sliding windows ($15{,}645 / 2 = 7{,}822.5$). The corresponding
significance level under this approximation was $0.05 / 7{,}823 = 6.39\times10^{-6}$, corresponding to the 5.7th
percentile of the minimum p-values from 1,000 permutations. The QQ plot for those minimum p-
values compared to Beta(1,7823) distribution indicated the fit was reasonably good (Figure 2.2B)
and none of the goodness-of-fit tests were rejected (p > 0.25). We proceeded with the effective
number of independent tests equal to half of the total number of overlapping haplotype blocks as
a quick reference for spotting potentially significant haplotype effects. The genome-wide
significance level was therefore derived as $p_G = 0.05 / (1{,}006{,}480 / 2) = 9.94\times10^{-8}$, in contrast to the
Bonferroni corrected genome-wide significance level $p_{BG} = 0.05 / 1{,}006{,}480 = 4.97\times10^{-8}$.
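The thresholds quoted in this paragraph follow from a few divisions, which the sketch below reproduces from the figures given in the text. Variable names are chosen here for illustration. Note that dividing 0.05 by the rounded percentile $5.58\times10^{-6}$ gives roughly 8,961 rather than the reported 8,963; the small difference presumably reflects rounding of the percentile, although that is an assumption.

```python
import math

alpha = 0.05

# Chromosome 22: genotyped SNPs and overlapping 5-SNP windows.
n_snps_chr22 = 15_649
window_size = 5
n_windows_chr22 = n_snps_chr22 - window_size + 1

# Permutation-based effective number of tests from the (rounded) fifth percentile.
p05_chr22 = 5.58e-6
n_eff_permutation = alpha / p05_chr22

# "Half the number of windows" approximation, rounded up to a whole number of tests.
n_eff_half = math.ceil(n_windows_chr22 / 2)
p_chr22_half = alpha / n_eff_half

# Genome-wide levels: half-the-tests approximation vs. plain Bonferroni.
n_snps_genome = 1_006_480
p_genome_half = alpha / (n_snps_genome / 2)
p_genome_bonferroni = alpha / n_snps_genome

print(n_windows_chr22, round(n_eff_permutation), n_eff_half)
print(f"{p_chr22_half:.2e} {p_genome_half:.2e} {p_genome_bonferroni:.2e}")
```

By construction, the half-the-tests approximation doubles the Bonferroni threshold, so it is the less conservative of the two genome-wide levels.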
In search of haplotype peaks where significant SNPs were absent on the Manhattan plots,
a region on chromosome 5 exhibited a distinct haplotype effect compared with individual SNP
associations at the same chromosomal region (Supplementary Figure S2). There were five
overlapping haplotype blocks defined by 5-SNP sliding windows with global test p-values (p =
$1.70\times10^{-8}$, $3.16\times10^{-8}$, $1.85\times10^{-7}$, $1.45\times10^{-6}$, and $3.38\times10^{-6}$, respectively) less than any single
SNP’s p-value within the same region. However, the most significant SNP, rs6882564 (p =
$1.14\times10^{-4}$), was contained in all the significant haplotypes and was noted to be severely out of HWE (p <
$1\times10^{-7}$). A review of the intensity plots for this SNP showed that rs6882564 was clearly
miscalled by the genotyping algorithm, and thus we dropped from consideration all haplotypes
that contain rs6882564, leaving no other haplotypes in the same region genome-wide significant.
No other haplotype blocks throughout the genome had a global p-value less than $10^{-6}$. The top
10 independent genomic regions with haplotype global test p-value between $1.60\times10^{-6}$ and
$1.51\times10^{-5}$ are summarized in Supplementary Table S3. After visual examination of the
Manhattan plots contrasting the haplotype-specific effects with the individual SNP effects, the
remaining most significant regions unlikely to be explained solely by SNPs were chr1:
8,309,317-8,318,147, chr4:122,325,743 - 122,363,114, and chr18: 35,670,316 - 35,683,522.
Notably, on chromosome 1, the 5-SNP haplotype AGCTG (Position: 8309317 - 8318147;
frequency = 0.24) (Figure 2.2; Table 2.2), comprising SNPs rs9628987, rs2289731,
rs12711517, rs2305016, and rs7535752, had a p-value three orders of magnitude less than that of
the most significant SNP contained in the haplotype, rs12711517 (haplotype p= $5.09\times10^{-6}$ vs.
SNP p= $9.88\times10^{-3}$). When conditioning on this locally most significant SNP, the haplotype effect
stayed almost unchanged (adjusted OR= 0.82; 95% CI = 0.74 - 0.91) and remained the most
significant haplotype, although the adjusted haplotype specific association p-value was less
significant than that without adjustment for the best SNP (unadjusted haplotype p= $5.09\times10^{-6}$
vs. adjusted haplotype p= $1.36\times10^{-4}$). On chromosome 4, a 2-SNP haplotype AG (Position:
122340944-122346258; frequency = 0.64) was close to two orders of magnitude more significant
than its best individual SNP, rs13116936 ($3.37\times10^{-7}$ vs. $1.09\times10^{-5}$) (Figure 2.2B) and the
unadjusted haplotype specific effect was among the most significant in all top 10 independent
regions. After adjusting for the best SNP, the haplotype effect remained significant at p=
$7.54\times10^{-4}$. A potentially interesting finding was on chromosome 18 (Figure 2.2C) where a much
rarer 6-SNP haplotype AACGTT (Position: 35670316-35684521; frequency = 0.03) showed an
improvement of haplotype significance with the adjusted p-value of $2.42\times10^{-5}$ in contrast to the
unadjusted p-value of $6.96\times10^{-5}$. The haplotype specific effect did not alter meaningfully before
and after the adjustment for the best SNP (unadjusted OR= 1.72, 95% CI = 1.32-2.25; adjusted
OR= 1.79, 95% CI = 1.36-2.34). Carriers of one copy of this haplotype had 1.79 times the
breast cancer risk of women who did not carry it, a much stronger effect than that of the best SNP
rs47995220 alone (OR=1.23; 95% CI = 1.11-1.45). These three novel haplotypes found on
chromosomes 1, 4 and 18 were further verified with comparison to the imputed SNPs based on
the 1000 Genomes Project released data within the same chromosomal regions. None of the
aforementioned novel haplotype-specific associations could have been revealed by imputed
SNPs (Figure 2.3A-C). As shown in the Manhattan plots contrasting the haplotype effects with
those of the imputed SNPs, the most significant haplotypes were independent of the neighboring
clusters of imputed SNPs; no adjacent SNPs achieved the level of significance that the top
haplotypes did. These novel haplotypes were not confounded by local ancestry inferred from
neighboring SNPs either (Supplementary Table S5). The test statistics stayed largely unchanged
after further adjusting for the local ancestry in addition to the global ancestry for a finer
correction for population admixture. Among the remainder of the top 10 independent regions
with haplotype global test p-values less than $1.51\times10^{-5}$, the significance levels of the top
individual haplotypes and SNPs were very close for chromosomes 3, 5 and 10, implying that the
noticeable haplotype effects shown on the Manhattan plots can be mostly credited to the
genotyped SNPs (Supplementary Figures S3.A-C). On the rest of the chromosomes, the top
SNPs were more significant than any inferred haplotypes, so that the haplotypes did not
contribute more information towards genetic association tests in those regions than SNPs
themselves.
As noted by Chen et al [23], the endeavor to replicate the significance of the known
GWAS hits using the AABC data was largely unsuccessful, implying the risk loci for breast
cancer found in other GWAS, predominantly of European ancestries, may not be the same as in
African Americans. For four of the known GWAS SNPs the associations in our African
American breast cancer data had a nominally significant p-value less than 0.05 (Supplemental
Table S4), namely rs13387042 at 2q35 (OR=0.89; 95% CI= 0.82–0.97; p= 0.00713), rs865686 at
9q31 (OR= 0.92; 95% CI= 0.85–0.99; p= 0.0287), rs2981582 at 10q26 (OR= 1.11; 95% CI=
1.03–1.19; p= 0.0087), and rs2363956 at 19p13 (OR= 0.88; 95% CI= 0.82–0.95; p= $8.1\times10^{-4}$).
They are all common variants of modest effects in this study with minor allele frequency
between 0.07 and 0.49. Across these 21 regions with known breast cancer risk, 10p15 and 14q24
showed potential haplotype effects with the global test p-value less than $1.0\times10^{-4}$, albeit not
genome-wide significant. When scrutinizing all possible inferred individual haplotypes of 2-10
SNPs in length in the vicinity of the known markers, a 3-SNP haplotype at 10p15, CTC (Position:
5705780-5712025; frequency= 0.22) constituted by rs17141741, rs2386661 and rs4414128 was
three orders of magnitude more significant than the most significant individual SNP contained in
the haplotype, rs4414128 (unadjusted haplotype p-value= $5\times10^{-6}$ vs. best SNP p-value =
$7.08\times10^{-3}$) (Table 2.3). This haplotype was associated with a 20% reduced risk per copy for
breast cancer relative to the women not carrying it. The haplotype-specific effect was almost
unchanged after adjustment for both the best contained SNP (rs4414128) and the index marker
(rs2380205) (adjusted haplotype OR=0.81, 95% CI=0.72-0.91, p= $2.16\times10^{-4}$). The haplotype
signal was two or three orders of magnitude more significant than any of the remaining
individual SNPs adjacent to that haplotype, as shown from the leftmost haplotype signal peak in
Figure 2.4A. When further compared to the 1KGP imputed SNPs in the same region, this CTC
haplotype was still independent of the imputed SNPs (Figure 2.5A). The imputed SNPs residing
within close proximity had significance levels similar to those of the genotyped SNPs (Figure
2.4A vs. Figure 2.5A), which emphasized that the haplotype effect was unlikely to be explained by
SNP imputation either. Another 3-SNP haplotype GAG (Position: 6042374-6043841; frequency=
0.60) was stronger than any genotyped SNPs. However, we found an imputed SNP (rs3181152;
risk allele: G; frequency: 0.45; p= $4.72\times10^{-5}$) that fell on this haplotype and was an even stronger
predictor of risk. The analysis of individual haplotype effects also identified a new region at
14q24 containing the known hit rs999737, where the most significant haplotype was CGCAGC
(Position: 68033499-68045127; frequency = 0.05) with the unadjusted haplotype p-value over
three orders of magnitude less than that of the best contained SNP, rs10132579 (unadjusted
haplotype p= $1.69\times10^{-6}$ vs. best SNP p= $9.55\times10^{-3}$) (Figure 2.4B). It was also noted that this
haplotype effect was stable after additional adjustment for rs10132579 and rs999737 (unadjusted
OR= 0.60, 95% CI= 0.48-0.74 and the adjusted OR= 0.60 with 95% CI= 0.47-0.77), suggesting
approximately a 40% decreased breast cancer risk per copy was associated with this CGCAGC
haplotype among the carriers. Taking local ancestry into account did not change the results for
either the CTC haplotype on 10p15 or the CGCAGC haplotype on 14q24 (Table S5). There were
numerous other individual haplotypes with unadjusted significance between $10^{-6}$ and $10^{-5}$ on
8q24 and 19p13. However, these top haplotype effects were indistinguishable from the top SNPs.
Once adjusted for the best SNP contained, these haplotypes became insignificant (p> 0.05)
(Supplemental Figures S4.A-D).
2.5 Discussion
We implemented a genome-wide haplotype association analysis searching for breast
cancer risk susceptibility loci in African American women. To quickly narrow down to potential
risk regions, a 5-SNP sliding window approach was applied throughout 22 autosomes. Among
approximately 1 million windows, none achieved the genome-wide significance determined by
an approximation to the beta distribution of the minimum p-values through 1,000 permutations
($p_G = 9.94\times10^{-8}$). Only 10 independent chromosomal regions had the haplotype global test
p-value less than $1.5\times10^{-5}$. The haplotype AGCTG at chromosome 1: 8,309,317-8,318,147 showed
a moderate haplotype effect that was otherwise not captured by association analyses focusing on
SNPs. This region overlaps a solute carrier family 45 member 1 gene (SLC45A1, position:
8,306,977-8,326,814) that is predominantly expressed in brain tissues and is frequently
deleted in brain tumor cells, suggesting a putative role as a tumor suppressor [53]; however, a
clear picture of its biological mechanism is far from complete. The 2-SNP haplotype AG on
chromosome 4: 122,340,944-122,346,258 had a stronger association with the disease than any
SNPs in the same region. About 30kb upstream of it resides the TNIP3 gene (Homo sapiens
TNFAIP3 interacting protein 3). Both TNIP and TNFAIP proteins were reported to be overexpressed
in human carcinoma cells and to suppress the activation of nuclear factor kappa B (NF-κB) [54].
The haplotype AACGTT at chromosome 18: 35,670,316-35,684,521 was associated with
increased risk for breast cancer and the haplotype effect was independent of individual SNP
effects, although no known genes are found nearby. Therefore, these regions revealed by
haplotype analysis are candidates for fine-mapping to locate the causal variants as a first step
towards deciphering the true biological functions. Among the 21 known breast cancer risk
regions revealed by previous GWAS, 10p15 and 14q24 seem most likely to harbor unknown risk
loci based on the suggestive haplotype associations described above.
In previous work, Chen et al. [23] and Siddiq et al. [55] have shown that the genome-
wide significance for the 21 known breast cancer SNPs did not replicate in African Americans.
The majority of those SNPs were discovered predominantly in European populations, with the
exception of rs2046219 at 6q25 found in the Han Chinese population. Chen et al. have also
shown that many of the index risk variants for breast cancer are significant in multiple
populations except for African Americans [56]. We confirmed that the most significant SNPs
within each known region are all different from the known breast cancer risk SNPs
(Supplemental Table S4). All evidence underscores the different risk association patterns
between African Americans and European populations, and limits the generalizability of the
previously established significant GWAS hits as well as presents new challenges in the
investigation of breast cancer susceptibility loci specifically for African Americans.
The haplotype association tests were based on haplotype dosage estimates inferred by the
E-M algorithm from unphased genotypes for unrelated subjects under the assumption of HWE
(the estimation step). We substituted the expected haplotype dosages for the unknown true
haplotypes and fit these continuous dosage variables into conventional logistic regression models
($\hat{\delta}_h$ in models 2.3 through 2.7) (the substitution step). Even though the haplotype inference from
diploid genotypes is not free from uncertainty, the use of these continuous dosages largely
corrects for the uncertainty derived from haplotype inference and the predictability of haplotypes
is quite high, especially when adjacent SNPs are in high LD, a condition that is often satisfied in
analyses focused on haplotype blocks [38]. This simple expectation-substitution approach [57]
has been shown to have a proper control of the type I error rate for the association test when we
believe the haplotype dosage estimates have no differential errors between cases and controls
[58]. In other words, case-control status is unrelated to the errors in haplotype dosage estimation,
which is generally valid when haplotypes are inferred by pooling both cases and controls and the
null hypothesis of no significant association between haplotype and disease is true. Several
concerns arise under the alternative hypothesis, when a few assumptions no longer hold. First,
if haplotype frequencies are associated with the disease status,
failure to account for haplotype uncertainty can lead to estimates biased toward the null [59, 60].
Second, even though the SNPs are in HWE in the general population, this is not necessarily
so in the case-enriched case-control sample, so haplotype dosages may not
be accurately inferred from the sample’s genotypes. To address these issues, Lin
et al. [61, 62] proposed a maximum likelihood (ML) method that simultaneously infers
haplotype frequencies and regression parameters in the same model. In simulations across a
variety of settings under the alternative hypothesis, their method yielded less biased estimates,
and the confidence intervals of the regression coefficients had better coverage of the true value.
We note, however, that the superiority of the ML method over the expectation-substitution
approach applies only to scenarios where the true magnitude of association is very large, i.e.,
β=0.9 (OR= 2.5). Such large effects seem to be rare in GWAS of either common SNPs or
common haplotypes studied here. Another simulation analysis [59] also verified that in practical
settings where a haplotype block formed by a small number of SNPs with limited haplotype
diversity, the bias was minimal and the empirical confidence intervals had appropriate coverage
of the true value. More importantly, the performances of the maximum likelihood method and
the expectation-substitution were almost indistinguishable, implying the expectation-substitution
is robust to reasonable departure from the assumptions. Therefore, substituting the inferred
haplotype dosages in the regression model still retains good statistical properties in most
practical contexts of haplotype association tests. If haplotypes with greater risk effect were of
interest, the simultaneous maximum likelihood method would be preferable.
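As a minimal sketch of the expectation-substitution step discussed above (not the implementation used in this study), the expected haplotype dosage for one subject can be computed from the posterior probabilities of the haplotype pairs compatible with that subject's genotypes. The function name `expected_dosages`, the input layout, and the two-SNP example values are hypothetical and chosen only for illustration; in practice the posterior probabilities would come from an E-M fit.

```python
from collections import defaultdict


def expected_dosages(pair_posteriors):
    """Return the expected number of copies (0 to 2) of each haplotype.

    pair_posteriors maps a haplotype pair (hap1, hap2) to its posterior
    probability for one subject, given that subject's unphased genotypes.
    The layout is an assumption made for this sketch.
    """
    total = sum(pair_posteriors.values())  # renormalise in case of rounding
    dosages = defaultdict(float)
    for (hap1, hap2), prob in pair_posteriors.items():
        weight = prob / total
        # Each member of the pair contributes one copy; a homozygous
        # pair therefore contributes two copies of the same haplotype.
        dosages[hap1] += weight
        dosages[hap2] += weight
    return dict(dosages)


# Hypothetical subject heterozygous at two SNPs: phase is ambiguous,
# so two haplotype pairs are compatible with the observed genotypes.
subject = {("AC", "GT"): 0.9, ("AT", "GC"): 0.1}
dosages = expected_dosages(subject)
```

The resulting continuous dosages would then be entered as covariates in the regression model in place of the unobserved haplotype counts.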
We may not have had enough statistical power to identify significant rare haplotypes or
modest to weak haplotype effects despite our large sample size. Haplotypes of less than 1%
frequency were not addressed in our analyses, mainly because of the intrinsic difficulty and
unreliability of inferring such rare haplotypes. Unusually short or long haplotypes in the
genome limit the ability of our fixed 5-SNP sliding windows to identify them in the haplotype
global test. It is possible that constructing a larger window would capture more haplotype variety
so that some rare haplotypes could be taken into account. Nonetheless, concerns about computing
efficiency arise as the number of SNPs increases. For example, if the total number of
heterozygous SNPs in each haplotype block is $m$, there could be $2^m$ possible haplotypes and thus
$(2^m+1)2^{m-1}$ possible haplotype pairs being summed over in the E-M algorithm for each subject.
The number grows exponentially, undermining the feasibility of implementing the algorithm.
Even though in reality the number of possible haplotypes may be just a fraction of $2^m$, the same
idea still applies. Qin et al. [63] proposed the partition-ligation E-M algorithm, which breaks up a
sequence of SNPs into smaller pieces, each including 5 – 10 markers. In our study, in order to
maximize the coverage of all genotyped SNPs, a 5-SNP window was adopted to construct
haplotype blocks, and the haplotype global test was applied within each block. Arguably, varying
window sizes can reflect varying degrees of LD in the data [64, 65]: more SNPs should be
included in the same haplotype block when they are in regions of extensive LD, and fewer SNPs
should be grouped together where LD is limited. However, it is difficult to identify
regions of high and low LD and alter the window sizes accordingly across the entire genome
with high precision. Mathias et al. [21] recommended that smaller window sizes be run
prior to larger windows. We employed a strategy that both quickly narrows down to potentially
important regions through the universal 5-SNP sliding windows and permits the flexibility of
detecting underlying haplotypes 2-10 SNPs long residing in those regions. The choice of a
5-SNP window roughly agrees with the overall average haplotype block size for people of African
ancestry, among whom the total number of haplotype blocks longer than 10 SNPs (~25kb) is
not expected to be large [9]. Larger windows may improve the ability to identify unknown
haplotype effects. However, if a haplotype effect existed in a 10-SNP block, it would have been
at least partly captured by several of the overlapping 5-SNP windows. Note that this window size
should not be treated as a one-size-fits-all solution, since the SNP density, underlying haplotype
diversity, and populations under investigation can differ fundamentally from study to study. A
similar exploration of the average window size is suggested before applying the sliding-window
approach to other groups with different LD patterns.
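The trade-off between window size and computing cost described above can be illustrated with a short sketch. It assumes that haplotype pairs are counted as unordered pairs (including homozygous pairs) among $2^m$ haplotypes, which gives $2^m(2^m+1)/2$ pairs; the function names and the SNP labels are hypothetical and are not taken from the software used in this study.

```python
def n_haplotype_pairs(m):
    """Upper bound on the haplotype pairs for m SNPs.

    Counts unordered pairs, including homozygous pairs, among the
    2**m possible haplotypes: 2**m * (2**m + 1) / 2.
    """
    n_haplotypes = 2 ** m
    return n_haplotypes * (n_haplotypes + 1) // 2


def sliding_windows(snps, size=5):
    """Return every run of `size` consecutive SNPs, stepping one SNP at a time."""
    return [snps[i:i + size] for i in range(len(snps) - size + 1)]


# Growth of the search space with window size.
pairs_by_size = {m: n_haplotype_pairs(m) for m in (2, 5, 10)}

# Overlapping 5-SNP windows over a hypothetical stretch of seven SNPs.
windows = sliding_windows(["snp1", "snp2", "snp3", "snp4", "snp5", "snp6", "snp7"])
```

Under this counting, doubling the window from 5 to 10 SNPs raises the bound from 528 to 524,800 pairs, while adjacent 5-SNP windows share four of their five SNPs, which is why their test statistics are strongly correlated.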
One drawback of overlapping sliding windows is the difficulty of drawing correct
inferences about the type I error rate. Overlapping windows are clearly not independent, so a
naïve application of the Bonferroni adjustment would yield overly conservative significance levels
and compromise the power to find true positive associations. Permutation tests
can derive the significance threshold directly from the
experimental data [66] and serve as the gold standard for comparing the performance of
various multiple-testing adjustments [67]. Nevertheless, permutation tests are computationally
intensive and time-consuming. One thousand permutations in a genome-wide haplotype
analysis can take weeks to months to finish, given the large sample size and the cost of haplotype
inference and association testing. Numerous innovative approaches [67-71] have been proposed and
each has its own merits. One category among those approaches computes
the effective number of independent tests, $n_{\mathrm{eff}}$, and uses $n_{\mathrm{eff}}$ in the
Bonferroni correction. $n_{\mathrm{eff}}$ can be
inferred from the beta distribution of the minimum p-values from permutation replicates. We
conjectured that in our African American sample the true number of effective tests for
chromosome 22 lies somewhere between 7,400 and 8,300, roughly half the total number of
overlapping windows. The permutation test therefore implies that approximately 50% of all sliding
windows can be considered independent, so a modified Bonferroni correction can be
readily applied.
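The link between the minimum p-values and the effective number of tests can be sketched as follows. If n independent tests are performed under the null, the minimum p-value follows a beta(1, n) distribution, so fitting beta(1, b) to the permutation minima gives an estimate of $n_{\mathrm{eff}}$. The sketch below uses the closed-form maximum likelihood estimate of b with a fixed at 1; this is an assumption made for illustration and differs from the goodness-of-fit comparison reported in Table 2.1. The function names and the synthetic input are hypothetical.

```python
import math


def estimate_n_eff(min_pvalues):
    """Maximum likelihood estimate of b for beta(1, b) fitted to minimum p-values.

    For beta(1, b) the log-likelihood is k*log(b) + (b - 1)*sum(log(1 - p)),
    which is maximised at b = -k / sum(log(1 - p)).
    """
    log_sum = sum(math.log1p(-p) for p in min_pvalues)
    return -len(min_pvalues) / log_sum


def bonferroni_threshold(alpha, n_eff):
    """Per-test significance level after correcting for n_eff effective tests."""
    return alpha / n_eff


def beta1_quantiles(b, k):
    """Evenly spaced quantiles of beta(1, b), standing in for permutation minima."""
    # Inverse CDF of beta(1, b): p = 1 - (1 - u)**(1 / b)
    return [-math.expm1(math.log1p(-(i + 0.5) / k) / b) for i in range(k)]


# Synthetic stand-in for 1,000 permutation minima (not data from this study).
synthetic_minima = beta1_quantiles(7800.0, 1000)
n_eff = estimate_n_eff(synthetic_minima)
threshold = bonferroni_threshold(0.05, n_eff)
```

With real permutation output, the list of per-replicate minimum p-values would replace `synthetic_minima`, and the fitted value would be checked against the data, for example with the goodness-of-fit tests used in Table 2.1.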
In summary, we applied a 5-SNP sliding window approach to perform genome-wide
haplotype association analysis and identified three novel regions of potential interest for
further investigation and validation. Two of 21 known breast cancer risk regions established in
previous GWAS, namely 10p15 and 14q24, exhibited moderate haplotype effects and warrant
additional replication work to confirm their significance in African American women.
Acknowledgements
We thank the women who volunteered to participate in each study. We also thank Christopher
Edlund, Madhavi Eranti, Andrea Holbrook, Paul Poznaik and David Wong from the University
of Southern California for their technical support. We would also like to acknowledge co-
investigators from the WCHS study: Dana H. Bovbjerg (University of Pittsburgh); Lina Jandorf
(Mount Sinai School of Medicine); and Gregory Ciupak, Warren Davis, Gary Zirpoli, Song Yao,
and Michelle Roberts from Roswell Park Cancer Institute.
Author Contributions
Conceived and designed the experiments: RCM CBA EMJ LB WZ JJH RGZ EVB SJC LCP
DJV LLM LNK BEH CAH DOS. Performed the experiments: CS GKC RCM CBA EMJ LB
WZ JJH RGZ EVB SLD JLR SJC PW CAH DOS. Analyzed the data: CS GKC XS CAH DOS.
Contributed reagents/materials/analysis tools: CS GKC RCM CBA EMJ LB WZ JJH RGZ SN
EVB SAI MFP SJC BEH CAH DOS. Wrote the paper: CS DOS.
Table 2.1 Fitting the minimum p-values from 1,000 permutations of chromosome 22 data to theoretical beta distributions beta(a,b).
Parameter       Goodness of fit (p-value)
a      b        Kolmogorov-Smirnov   Cramér-von Mises   Anderson-Darling
0.95   7426     >0.25                >0.25              >0.25
1      7794     >0.25                >0.25              >0.25
1      8500     0.044                0.061              0.039
1      8400     0.088                0.123              0.089
1      8300     0.164                0.236              0.188
1      7600     >0.25                >0.25              >0.25
1      7400     >0.25                0.16               0.116
1      7300     0.147                0.049              0.047
Table 2.2 The most significant individual haplotypes identified in the extended regions on chromosomes 1, 4 and 18.
                                                                      Unadjusted for SNP effect    Adjusted for SNP effect
Chromosome  Constituent SNPs                    Haplotype  Frequency  OR    95% CI       Hap P     OR    95% CI       Hap P^a
1           rs9628987, rs2289731, rs12711517,   AGCTG      0.24       0.81  (0.74-0.89)  5.09E-06  0.82  (0.74-0.91)  1.36E-04^c
            rs2305016, rs7535752
            SNP adjusted^b: rs12711517; T, 0.36; 1.11 (1.03-1.20); p=9.88E-03
4           rs17435444, rs13116936              AG         0.64       1.23  (1.13-1.33)  3.37E-07  1.74  (1.26-2.39)  7.54E-04^c
            SNP adjusted^b: rs13116936; T, 0.34; 0.84 (0.77-0.91); p=1.09E-05
18          rs7233920, rs4799278, rs12605634,   AACGTT     0.03       1.72  (1.32-2.25)  6.96E-05  1.79  (1.36-2.34)  2.42E-05
            rs4799520, rs7238528, rs17702736
            SNP adjusted^b: rs4799520; A, 0.09; 1.23 (1.11-1.45); p=3.66E-04
^a The p-value of the LR test of the haplotype-specific effect after adjustment for the best SNP contained in that haplotype.
^b The rs number, risk allele and its frequency, odds ratio and 95% CI, and the p-value of the SNP that is adjusted for in the LR test are presented.
^c There are no individual haplotypes significant at 1.0E-4 in this region after adjustment for the best contained SNP. Instead, the most significant
haplotype is reported for the sake of completeness.
Table 2.3 The most significant individual haplotypes in 10p15 and 14q24.
                                                                      Unadjusted for SNP effect    Adjusted for SNP effect
Chromosome  SNPs                                Haplotype  Frequency  OR    95% CI       Hap P     OR    95% CI       Hap P^a
10p15       rs17141741, rs2386661, rs4414128    CTC        0.22       0.79  (0.72-0.88)  5.00E-06  0.81  (0.72-0.91)  2.16E-04
            Known risk SNP adjusted^b: rs2380205; C, 0.42; 0.98 (0.91-1.06); p=0.5945
            Best SNP adjusted^b: rs4414128; T, 0.38; 1.11 (1.03-1.21); p=0.007084
14q24       rs765899, rs737387, rs2842347,      CGCAGC     0.05       0.6   (0.48-0.74)  1.69E-06  0.6   (0.47-0.77)  4.27E-05
            rs757369, rs10132579, rs2842346
            Known risk SNP adjusted^b: rs999737; T, 0.05; 0.98 (0.82-1.17); p=0.7994
            Best SNP adjusted^b: rs10132579; G, 0.37; 0.89 (0.82-0.97); p=0.009551
^a The p-value of the LR test of the haplotype-specific effect after adjustment for both the known breast cancer risk SNP and the best SNP contained in that
haplotype.
^b The rs number, risk allele and its frequency, odds ratio and 95% CI, and the p-value for the SNP adjusted in the LR test are presented. For the regions
with known breast cancer risk hits, both the known hit and the locally best SNP were adjusted for in the LR test for the independence of haplotype signals.
Figure 2.1 Comparison of the permutation minimum p-values to theoretical beta distributions.
(A). Quantile-Quantile plot comparing the minimum p-values from 1,000 permutations on chromosome 22 data to beta(1,7426). (B).
Quantile-Quantile plot comparing the minimum p-values to beta(1,7823).
Figure 2.2 Comparison of the significance of individual haplotypes with the most significant SNPs in three regions on chromosomes
1, 4 and 18.
These three regions, namely, (A) chr1: 8,309,317-8,318,147; (B) chr4: 122,325,743-122,363,114; and (C) chr18: 35,670,316-
35,683,522 were identified by the genome-wide haplotype association analysis using 5-SNP sliding windows. The regions were
further extended both upstream and downstream by half of the original width to explore underdetected effects. Black circles denote
individual haplotypes, the sizes of which are proportional to their haplotype frequencies. Red dots denote genotyped SNPs within the
same region. Blue dot shows the most significant SNP.
Figure 2.3 Comparison of the significance of individual haplotypes with imputed SNPs in regions on chromosomes 1, 4 and 18.
Contrast of the haplotype effects with the effects of the 1000 Genome Project imputed SNPs in these three regions, namely, (A) chr1:
8,309,317-8,318,147; (B) chr4: 122,325,743-122,363,114; and (C) chr18: 35,670,316-35,683,522 are shown. Black circles denote
individual haplotypes, the sizes of which are proportional to their haplotype frequencies. Red dots denote imputed SNPs within the
same region. Blue dot shows the most significant imputed SNP.
Figure 2.4 Two known breast cancer risk regions 10p15 and 14q24 exhibit putative haplotype effects.
(A) 5.67-6.17Mb region at 10p15; (B) 67.84-68.34Mb region at 14q24. Black circles denote individual haplotypes, the sizes of which
are proportional to their haplotype frequencies. Red dots denote genotyped SNPs within the same region. Blue dot shows the most
significant SNP. Cyan dot denotes the known breast cancer risk SNP identified by previous GWAS.
Figure 2.5 Comparison of the significance of individual haplotypes with imputed SNPs in 10p15 and 14q24.
(A) 5.67-6.17Mb region in 10p15; (B) 67.84-68.34Mb region in 14q24. Black circles denote individual haplotypes, the sizes of which
are proportional to their haplotype frequencies. Red dots denote imputed SNPs within the same region. Blue dot shows the most
significant imputed SNP. Cyan dot denotes the known breast cancer risk SNP identified by previous GWAS.
Figure S1. The distributions of 5-SNP sliding window sizes shown in cumulative density.
Each colored line denotes the 5-SNP sliding window sizes on each chromosome, shown as
cumulative density of window sizes from the smallest to the biggest. The black curve shows the
average cumulative density across 22 autosomes. The 1st, 25th, 50th, 75th, 90th, and 99th percentiles of the
average window size are 1, 5, 9, 14, 20 and 32kb, respectively.
Figure S2. Comparison of the significance of haplotype blocks and SNPs on chromosome 5
(97.3Mb – 97.8Mb). Black dots represent the significance of haplotype blocks by the global test.
Red circles denote the significance of SNPs in the same chromosomal region. All top five
haplotype blocks overlap each other and contain SNP rs6882564, which was severely out of
Hardy-Weinberg Equilibrium.
Figure S3. Comparison of the significance of individual haplotypes with the most
significant SNPs in three regions on chromosomes 3, 5 and 10. These three regions, namely,
(A) chr3: 7,220,000-7,280,000; (B) chr5:142,326,000- 142,371,000; and (C) chr10: 115,105,000-
115,124,000 had small p-values in the genome-wide haplotype association analysis. Black circles
denote individual haplotypes, the sizes of which are proportional to their haplotype frequencies.
Red dots denote genotyped SNPs within the same region. Blue dot shows the most significant
SNP. The observed top individual haplotype effects were mostly due to the top SNPs.
Figure S4. Comparison of the significance of individual haplotypes with genotyped and
imputed SNPs in 8q24 and 19p13. (A),(C) 125.99-129.99Mb region in 8q24; (B),(D) 17.00-
17.50Mb region in 19p13. Black circles denote individual haplotypes, the sizes of which are
proportional to their haplotype frequencies. Red dots denote genotyped SNPs in (A),(B) and
imputed SNPs in (C),(D) within the same region of haplotypes. Blue dot shows the most
significant genotyped SNP in (A),(B) and the most significant imputed SNP in (C),(D). Cyan dot
in (C),(D) denotes the known breast cancer risk SNP identified by previous GWAS.
2.6 Chapter 2 References
1. Freedman, M.L., et al., Principles for the post-GWAS functional characterization of cancer risk loci.
Nature Genetics, 2011. 43(6): p. 513-8.
2. Hindorff, L.A., et al., Potential etiologic and functional implications of genome-wide association loci for
human diseases and traits. Proceedings of the National Academy of Sciences of the United States of
America, 2009. 106(23): p. 9362-7.
3. Meng, Z., et al., Selection of genetic markers for association analyses, using linkage disequilibrium and
haplotypes. American journal of human genetics, 2003. 73(1): p. 115-30.
4. Schaid, D.J., et al., Score tests for association between traits and haplotypes when linkage phase is
ambiguous. American journal of human genetics, 2002. 70(2): p. 425-34.
5. Zaboli, G., et al., Haplotype analysis confirms association of the serotonin transporter (5-HTT) gene with
schizophrenia but not with major depression. American journal of medical genetics Part B
Neuropsychiatric genetics the official publication of the International Society of Psychiatric Genetics,
2008. 147(3): p. 301-307.
6. Poduslo, S.E., R. Huang, and A. Spiro, A genome screen of successful aging without cognitive decline
identifies LRP1B by haplotype analysis. American journal of medical genetics. Part B, Neuropsychiatric
genetics : the official publication of the International Society of Psychiatric Genetics, 2010. 153B(1): p.
114-9.
7. Zhang, C., et al., A whole genome long-range haplotype (WGLRH) test for detecting imprints of positive
selection in human populations. Bioinformatics, 2006. 22(17): p. 2122-2128.
8. Schaid, D.J., Evaluating associations of haplotypes with traits. Genetic Epidemiology, 2004. 27(4): p. 348-
364.
9. Gabriel, S.B., et al., The structure of haplotype blocks in the human genome. Science (New York, N.Y.),
2002. 296(5576): p. 2225-9.
10. Liu, N., K. Zhang, and H. Zhao, Haplotype-association analysis. Advances in Genetics, 2008. 60(07): p.
335-405.
11. Daly, M.J., et al., High-resolution haplotype structure in the human genome. Nature Genetics, 2001. 29(2):
p. 229-32.
12. Lorenz, P., et al., The ancient mammalian KRAB zinc finger gene cluster on human chromosome 8q24.3
illustrates principles of C2H2 zinc finger evolution associated with unique expression profiles in human
tissues. BMC Genomics. 11: p. 206.
13. Morris, R.W. and N.L. Kaplan, On the advantage of haplotype analysis in the presence of multiple disease
susceptibility alleles. Genetic epidemiology, 2002. 23(3): p. 221-33.
14. Trégouët, D.-A., et al., Genome-wide haplotype association study identifies the SLC22A3-LPAL2-LPA gene
cluster as a risk locus for coronary artery disease. Nature Genetics, 2009. 41(3): p. 2008-2010.
15. Shim, H., et al., Genome-wide association studies using single-nucleotide polymorphisms versus
haplotypes: an empirical comparison with data from the North American Rheumatoid Arthritis
Consortium. BMC Proceedings, 2009. 3(Suppl 7): p. S35-S35.
16. Zhao, H., R. Pfeiffer, and M.H. Gail, Haplotype analysis in population genetics and association
studies.(Brief article), in Pharmacogenomics. 2003. p. 171(8)-171(8).
17. Cardon, L.R. and G.R. Abecasis, Using haplotype blocks to map human complex trait loci. Trends in
Genetics, 2003. 19: p. 135-140.
18. Reich, D.E., et al., Linkage disequilibrium in the human genome. Nature, 2001. 411(6834): p. 199-204.
19. Patil, N., et al., Blocks of limited haplotype diversity revealed by high-resolution scanning of human
chromosome 21. Science, 2001. 294(5547): p. 1719-1723.
20. Durrant, C., et al., Linkage disequilibrium mapping via cladistic analysis of single-nucleotide
polymorphism haplotypes. The American Journal of Human Genetics, 2004. 75(1): p. 35-43.
21. Mathias, R.A., et al., A graphical assessment of p-values from sliding window haplotype tests of association
to identify asthma susceptibility loci on chromosome 11q. BMC genetics, 2006. 7: p. 38-38.
22. Lambert, J.C., et al., Genome-wide haplotype association study identifies the FRMD4A gene as a risk locus
for Alzheimer's disease. Molecular psychiatry, 2012(November 2011): p. 1-10.
23. Chen, F., et al., Fine-mapping of breast cancer susceptibility loci characterizes genetic risk in African
Americans. Human molecular genetics, 2011. 20(22).
24. Kolonel, L.N., et al., A multiethnic cohort in Hawaii and Los Angeles: baseline characteristics. Am J
Epidemiol, 2000. 151(4): p. 346-57.
25. Marchbanks, P.A., et al., The NICHD Women's Contraceptive and Reproductive Experiences Study:
methods and operational results. Annals of Epidemiology, 2002. 12(4): p. 213-221.
26. Ambrosone, C.B., et al., Conducting Molecular Epidemiological Research in the Age of HIPAA: A Multi-
Institutional Case-Control Study of Breast Cancer in African-American and European-American Women.
Journal of oncology, 2009. 2009: p. 871250.
27. John, E.M., et al., Sun exposure, vitamin D receptor gene polymorphisms, and breast cancer risk in a
multiethnic population. American Journal of Epidemiology, 2007. 166(12): p. 1409-1419.
28. John, E.M., et al., The Breast Cancer Family Registry: an infrastructure for cooperative multinational,
interdisciplinary and translational studies of the genetic epidemiology of breast cancer. Breast Cancer
Research, 2004. 6(4): p. R375-R389.
29. Newman, B., et al., The Carolina Breast Cancer Study: integrating population-based epidemiology and
molecular biology. Breast Cancer Research and Treatment, 1995. 35(1): p. 51-60.
30. Prorok, P.C., et al., Design of the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial.
2000. p. 273S-309S.
31. Zheng, W., et al., Evaluation of 11 breast cancer susceptibility loci in African-American women. Cancer
epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research,
cosponsored by the American Society of Preventive Oncology, 2009. 18(10): p. 2761-4.
32. Smith, T.R., et al., Polygenic model of DNA repair genetic polymorphisms in human breast cancer risk.
Carcinogenesis, 2008. 29(11): p. 2132-2138.
33. Parra, E.J., et al., Estimating African American admixture proportions by use of population-specific alleles.
The American Journal of Human Genetics, 1998. 63(6): p. 1839-1851.
34. Chen, F., et al., A genome-wide association study of breast cancer in women of African ancestry. Human
genetics, 2012.
35. Wang, Y., et al., Increased gene coverage and Alu frequency in large linkage disequilibrium blocks of the
human genome. Genetics and molecular research : GMR, 2007. 6(4): p. 1131-41.
36. Barrett, J.C., et al., Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics
(Oxford, England), 2005. 21: p. 263-5.
37. Excoffier, L. and M. Slatkin, Maximum-likelihood estimation of molecular haplotype frequencies in a
diploid population. Molecular Biology and Evolution, 1995. 12(5): p. 921-927.
38. Stram, D.O., et al., Modeling and E-M estimation of haplotype-specific relative risks from genotype data
for a case-control study of unrelated individuals. Human heredity, 2003. 55(4): p. 179-90.
39. D.O. Stram, and V.E. Seshan, Statistical Human Genetics, R.C. Elston, J.M. Satagopan, and S. Sun,
Editors. 2012, Humana Press: Totowa, NJ. p. 423-452.
40. Price, A.L., et al., Principal components analysis corrects for stratification in genome-wide association
studies. Nat Genet, 2006. 38: p. 904-909.
41. Costas, J., et al., Human genome-wide screen of haplotype-like blocks of reduced diversity. Gene, 2005.
349: p. 219-225.
42. Chen, G.K., et al., Mendel-GPU : Haplotyping and genotype imputation on Graphics Processing Units.
2012. 2: p. 2009-2010.
43. Durbin, R.M., et al., A map of human genome variation from population-scale sequencing. Nature, 2010.
467(7319): p. 1061-1073.
44. Sung, Y.J., et al., Genotype imputation for African Americans using data from HapMap phase II versus
1000 genomes projects. Genetic epidemiology, 2012. 36(5): p. 508-16.
45. Price, A.L., et al., Sensitive detection of chromosomal segments of distinct ancestry in admixed populations.
PLoS genetics, 2009. 5(6): p. e1000519-e1000519.
46. Wang, X., et al., Adjustment for local ancestry in genetic association analysis of admixed populations.
2010: p. 1-9.
47. Jia, L., et al., Functional enhancers at the gene-poor 8q24 cancer-linked locus. PLoS Genet, 2009. 5: p.
e1000597.
48. Ghoussaini, M., et al., Multiple loci with different cancer specificities within the 8q24 gene desert. J Natl
Cancer Inst, 2008. 100: p. 962-966.
49. Freedman, M.L., et al., Admixture mapping identifies 8q24 as a prostate cancer risk locus in African-
American men. Proceedings of the National Academy of Sciences of the United States of America, 2006.
103: p. 14068-14073.
50. Purcell, S., et al., PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage
Analyses. The American Journal of Human Genetics, 2007. 81(3): p. 559-575.
51. Dudbridge, F. and B.P.C. Koeleman, Efficient computation of significance levels for multiple associations
in large studies of correlated data, including genomewide association studies. American journal of human
genetics, 2004. 75(3): p. 424-35.
52. Šidák, Z., Rectangular confidence regions for the means of multivariate normal distributions. Journal of
the American Statistical Association, 1967. 62(318): p. 626-633.
53. Amler, L.C., et al., Identification and characterization of novel genes located at the t(1;15)(p36.2;q24)
translocation breakpoint in the neuroblastoma cell line NGP. Genomics, 2000. 64: p. 195-202.
54. Huang, L., et al., ABINs inhibit EGF receptor-mediated NF-kappaB activation and growth of EGF
receptor-overexpressing tumour cells. Oncogene, 2008. 27(47): p. 6131-40.
55. Siddiq, A., et al., A meta-analysis of genome-wide association studies of breast cancer identifies two novel
susceptibility loci at 6q14 and 20q11. Human molecular genetics, 2012. 21(24): p. 5373-5384.
56. Chen, F., et al., Caution in generalizing known genetic risk markers for breast cancer across all
ethnic/racial populations. European journal of human genetics : EJHG, 2011. 19(2): p. 243-5.
57. Zaykin, D.V., et al., Testing association of statistically inferred haplotypes with discrete and continuous
traits in samples of unrelated individuals. Human heredity, 2002. 53(2): p. 79-91.
58. Xie, R. and D.O. Stram, Asymptotic equivalence between two score tests for haplotype-specific risk in
general linear models. Genetic epidemiology, 2005. 29(2): p. 166-70.
59. Kraft, P. and D.O. Stram, Re : The Use of Inferred Haplotypes. Journal of Human Genetics, 2007.
81(October): p. 863-868.
60. Kraft, P., et al., Accounting for haplotype uncertainty in matched association studies: a comparison of
simple and flexible techniques. Genetic epidemiology, 2005. 28(3): p. 261-72.
61. Lin, D.Y. and B.E. Huang, The Use of Inferred Haplotypes. 2007. 80(March 2006): p. 2006-2008.
62. Hu, Y.J. and D.Y. Lin, Analysis of untyped SNPs: maximum likelihood and imputation methods. Genetic
epidemiology, 2010. 34(8): p. 803-15.
63. Qin, Z.S., T. Niu, and J.S. Liu, Partition-ligation--expectation-maximization algorithm for haplotype
inference with single-nucleotide polymorphisms. The American Journal of Human Genetics, 2002. 71(5): p.
1242-1242.
64. Guo, Y., et al., Gains in power for exhaustive analyses of haplotypes using variable-sized sliding window
strategy: a comparison of association-mapping strategies. European journal of human genetics EJHG,
2009. 17(6): p. 785-792.
65. Tang, R., et al., A variable-sized sliding-window approach for genetic association studies via principal
component analysis. Annals of Human Genetics, 2009. 73(Pt 6): p. 631-637.
66. Cheverud, J.M., A simple correction for multiple comparisons in interval mapping genome scans. Heredity,
2001. 87(Pt 1): p. 52-58.
67. Gao, X., et al., Avoiding the high Bonferroni penalty in genome-wide association studies. Genetic
Epidemiology, 2010. 34(1): p. 100-105.
68. Li, J. and L. Ji, Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation
matrix. Heredity, 2005. 95(3): p. 221-227.
69. Zaykin, D.V., et al., Truncated product method for combining P-values. Genetic Epidemiology, 2002.
22(2): p. 170-185.
70. Dudbridge, F. and B.P.C. Koeleman, Rank truncated product of P-values, with application to genomewide
association scans. Genetic Epidemiology, 2003. 25(4): p. 360-366.
71. Moskvina, V. and K.M. Schmidt, On multiple-testing correction in genome-wide association studies.
Genetic Epidemiology, 2008. 32(6): p. 567-573.
3 Multiple myeloma susceptibility loci examined in African and European ancestry
populations
Authors: Rand KA^{1,*}, Song C^{1,*}, Dean E^{2}, Serie D^{3}, Curtin K^{4}, Hazelett D^{1}, Hwang AE^{1}, Sheng X^{1},
Stram A^{1}, Van Den Berg DJ^{1}, Edlund CK^{1}, Hu D^{2}, Huff CA^{5}, Bernal-Mizrachi L^{6}, Tomasson MH^{7},
Ailawadhi S^{3}, Singhal S^{9}, Pawlish K^{10}, Peters E^{11}, Bock CH^{12}, Mohrbacher A^{1}, Conti DV^{1}, Colditz G^{7},
Zimmerman T^{13}, Huntsman S^{2}, Graff J^{14}, Berndt SI^{15}, Blot WJ^{16,17}, Carpten J^{18}, Casey G^{1},
Chanock SJ^{15}, Chu L^{19,20}, Diver WR^{21}, Stevens VL^{21}, Lieber M^{1}, Goodman P^{22}, Hennis AJM^{23,24},
Hsing AW^{19,20}, Mehta J^{9}, Kittles RA^{14}, Kolb S^{25}, Klein EA^{26}, Leske C^{23}, Murphy AB^{9}, Nemesure B^{23},
Neslund-Dudas C^{27}, Pettaway C^{28}, Rodriguez-Gil JL^{29}, Rybicki BA^{27}, Stanford JL^{25}, Signorello LB^{30,31},
Nooka A^{6}, Strom SS^{28}, Janakiraman N^{27}, Tolebero H^{25}, Witte JS^{32}, Xu J^{31}, Zheng SL^{33}, Wu SY^{23},
Yamamura Y^{28}, Ambrosone CB^{34}, John EM^{19,20}, Bernstein L^{35}, Zheng W^{36}, Olshan AF^{37}, Hu JJ^{29},
Ziegler RG^{15}, Nyante S^{37}, Bandera EV^{38}, Birmann BM^{#}, Ingles SA^{1}, Press MF^{1}, Deming SL^{17},
Brooks-Wilson AR^{39}, Rajkumar V^{40}, Brown EE^{41}, Raskin L^{36}, Kolonel L^{42}, Slager S^{40}, Henderson BE^{1},
Giles GG^{43,44,45}, Spinelli JJ^{46}, Chiu B^{13}, Munshi N^{30}, Anderson KC^{30}, Zonder J^{12}, Orlowski RZ^{28},
Lonial S^{6}, Camp N^{4,+}, Vachon C^{40,+}, Ziv E^{2,+}, Stram DO^{1,+}, Haiman CA^{1,+,**}, Cozen W^{1,+,**}
*co-equal authors
+co-senior authors
**corresponding authors
Correspondence to:
Wendy Cozen, Norris Topping Tower, 1551 Eastlake Ave, Room 4451A, Los Angeles, CA 90033,
Telephone: (323) 865-0447, Fax: (323) 865-0141, email: wcozen@usc.edu
or
Christopher A. Haiman, Harlyne Norris Research Tower, 1450 Biggy Street, Room 1504, Los Angeles,
CA 90033, Telephone: (323) 442-7755, Fax: (323) 442-7749, email: haiman@usc.edu
Affiliations:
1
Keck School of Medicine of USC and Norris Comprehensive Cancer Center, University of Southern
California, Los Angeles, CA, USA
2
Department of Medicine, University of California San Francisco, San Francisco, California, USA
3
Mayo Clinic, Jacksonville, FL, USA
4
Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, Utah, USA
5
Johns Hopkins School of Medicine, Johns Hopkins University, Baltimore, MD, USA
6
Emory University School of Medicine, Emory University, Atlanta, GE, USA
61
7
Alvin J. Siteman Cancer Center, Washington University School of Medicine, Washington University, St.
Louis, MO, USA
9
Robert H. Lurie Cancer Center, Northwestern University, Chicago, IL, USA
10
New Jersey State Cancer Registry, New Jersey Department of Health, Trenton, New Jersey
11
Louisiana State University School of Public Health, Louisiana State University, New Orleans, LA, USA
12
Karmanos Cancer Institute and Department of Oncology, Wayne State University of Medicine, Detroit,
MI, USA
13 University of Chicago, Chicago, IL
14 Rutgers-Robert Wood Johnson Medical School, Rutgers State University of New Jersey, New Brunswick, NJ, USA
15 Division of Cancer Epidemiology and Genetics, National Cancer Institute, US National Institutes of Health, Bethesda, Maryland, USA
16 International Epidemiology Institute, Rockville, MD, USA
17 Division of Epidemiology, Department of Medicine, Vanderbilt Epidemiology Center, Vanderbilt University School of Medicine, Nashville, TN, USA
18 The Translational Genomics Research Institute, Phoenix, AZ, USA
19 Cancer Prevention Institute of California, Fremont, CA, USA
20 Stanford University School of Medicine and Stanford Cancer Institute, Palo Alto, CA, USA
21 American Cancer Society, Atlanta, GA, USA
22 SWOG Statistical Center, Seattle, Washington, USA
23 Department of Preventive Medicine, Stony Brook University, Stony Brook, NY, USA
24 Chronic Disease Research Centre and Faculty of Medical Sciences, University of the West Indies, Bridgetown, Barbados
25 Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
26 Glickman Urologic and Kidney Institute, Cleveland Clinic, Cleveland, OH
27 Henry Ford Hospital, Detroit, MI, USA
28 The University of Texas MD Anderson Cancer Center, The University of Texas, Houston, TX, USA
29 Sylvester Comprehensive Cancer Center and Department of Epidemiology and Public Health, University of Miami Miller School of Medicine, Miami, FL 33136, USA
30 Harvard Medical School, Dana-Farber Cancer Institute, Harvard University, Boston, MA, USA
31 Harvard School of Public Health, Harvard University, Boston, Massachusetts, USA
32 Institute for Human Genetics, University of California, San Francisco, San Francisco, California, USA
33 Center for Cancer Genomics, Wake Forest School of Medicine, Wake Forest University, Winston-Salem, North Carolina, USA
34 Department of Cancer Prevention and Control, Roswell Park Cancer Institute, Buffalo, New York, USA
35 Population Sciences/Cancer Etiology, City of Hope, Duarte, CA, USA
36 Vanderbilt Epidemiology Center, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
37 Department of Epidemiology, Gillings School of Global Public Health, and Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC, USA
38 The Cancer Institute of New Jersey, New Brunswick, NJ, USA
39 Genome Sciences Centre, BC Cancer Agency, Vancouver, BC V5Z 1L3, Canada; Department of Biomedical Physiology and Kinesiology, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
40 Mayo Clinic, 200 First St SW, Rochester, MN, USA
41 Department of Medicine, University of Alabama at Birmingham, Birmingham, Alabama, USA; Department of Epidemiology, University of Alabama at Birmingham, Birmingham, Alabama, USA
42 University of Hawaii Cancer Center, University of Hawaii, Honolulu, HI, USA
43 Cancer Epidemiology Centre, Cancer Council of Victoria, Melbourne, Australia
44 School of Population and Global Health, Centre for Epidemiology and Biostatistics, University of Melbourne, Melbourne, VIC, Australia
45 Department of Epidemiology and Preventive Medicine, Monash University, Melbourne, Australia
46 Cancer Control Research, BC Cancer Agency, Vancouver, BC, Canada; School of Population and Public Health, University of British Columbia, Vancouver, BC, Canada
3.1 Abstract
Genome-wide association studies (GWAS) of multiple myeloma (MM) in Northern
European populations have identified associations with common genetic variation in seven
regions. We investigated these regions in samples from two North American populations: 1,274
MM patients and 1,486 controls of European ancestry (EA) and 1,049 MM patients and 7,080
controls of African ancestry (AA). We observed directionally consistent effects for all seven loci
in both populations, with four significantly associated (p<0.05) with risk in EAs (3p22.1, 7p15.3,
17p11.2, 22q13.1), and two significantly associated with risk in AAs (7p15.3 and 22q13.1). In a
multiethnic fine-mapping analysis, variation in five regions (2p23.3, 3p22.1, 7p15.3, 17p11.2,
22q13.1) was statistically significantly associated with risk, and in only one region was the index
variant the most associated signal (rs4487645 at 7p15.3; OR=1.30, p=8.7x10^-8). A large fraction
of the most associated variants identified in the multiethnic analyses are potentially functional in
etiologically relevant CD20+ lymphoid B cells, including a missense variant (rs34562254,
Pro251Leu, 17p11.2) in TNFRSF13B, which encodes a lymphocyte-specific protein in the tumor
necrosis factor receptor family that interacts with the NF-κB pathway. Our study shows that these
MM susceptibility regions contain risk variants that are important across populations and further
supports the use of multiple ethnic groups in genetic studies to enhance the localization and
identification of functional alleles.
3.2 Introduction
Multiple myeloma (MM), a neoplasm of malignant plasma cells arising in bone marrow,
comprises 1.9% of all cancer deaths and 20% of all hematological cancer deaths. MM is
uncommon, with an age-adjusted incidence rate of 7.7/100,000 in males and 4.9/100,000 in
females in the U.S. between 2007 and 2011 (www.seer.cancer.gov). Clinical manifestations
range from asymptomatic (smoldering) myeloma to active symptomatic disease, usually with
metastasis to bone [1]. There is a 2- to 3-fold higher risk of disease in African Americans
compared to Whites and a similar increased risk in relatives of MM cases [2, 3], suggesting a
heritable component to this disease.
A genome-wide association study (GWAS) of 1675 cases and 5903 controls from a
Northern European population identified two novel loci associated with MM risk at 3p22.1
(rs1052501) and 7p15.3 (rs4487645), as well as a marginally significant association (p~10^-7) at
2p23.3 (rs6746082) [4]. In a second GWAS of 4,692 cases and 10,990 controls from the United
Kingdom and Germany, four additional risk loci were identified at 3q26.2 (rs10936599), 6p21.33
(rs2285803), 17p11.2 (rs4273077), and 22q13.1 (rs877529) [5]. For these common risk variants
the per allele odds ratios (OR) and risk allele frequencies (RAF) ranged from 1.19 to 1.38, and
0.11 to 0.76, respectively.
For common susceptibility alleles that are shared across populations, underlying genetic
differences in linkage disequilibrium (LD) across ethnicities can be exploited to more precisely
localize markers of disease risk. In the present study, we examined the known susceptibility
regions for MM in populations of African (AA) and European (EA) ancestry and conducted
multiethnic GWAS plus imputation-based fine-mapping in an attempt to identify putative functional
variants that better capture risk at the known loci in these populations.
3.3 Methods
3.3.1 Ethics statement
All work has been performed under national and international guidelines. Written consent
was obtained from all participants at the time of blood/saliva collection. The Institutional Review
Boards at all participating institutions approved the study protocol. The participants in this study
are from two separate GWAS which are described in detail below.
3.3.2 European Ancestry Study Participants
There were 1,264 MM cases and 1,479 controls of European origin from four
genotyping centers: University of Southern California (USC), University of California at San
Francisco (UCSF), Mayo Clinic (Mayo), and University of Utah (UU) (Supplementary
Methods). The USC GWAS consisted of four case-control studies (Los Angeles SEER [6],
Seattle/Detroit SEER [7], University of British Columbia, University of Alabama) and two
cohort studies (the Multiethnic Cohort Study (MEC) [8] and the Melbourne Collaborative
Cohort Study [9]). The Mayo Clinic GWAS included cases and controls from Mayo Clinic and
Washington University [10]. UCSF [11] and UU each included only one study sample of cases
and controls.
Genotyping and Imputation: Each genotyping center genotyped both cases and controls
using the same array (although arrays used differed by center) and imputation was performed at
each center using IMPUTE2 [12] or Beagle [13] with the 1000 Genomes Project (1KGP) March
2012 release as the reference panel. A detailed description for each genotyping center is provided
in the Supplementary Methods.
Statistical Analysis: Each genotyping center analyzed data separately using unconditional
logistic regression, adjusting for age, sex, and principal components (PC; Supplementary
Methods) [14]. Summary statistics were meta-analyzed using a fixed effects model weighted by
the inverse standard error in METAL [15].
A description of each of the EA studies, genotyping platforms and methods, as well as
imputation and quality control procedures are provided in the Supplementary Methods.
3.3.3 African Ancestry Study Participants
This study consists of 1,049 AA patients enrolled to date (enrollment ongoing) from 10
clinical centers (Winship Cancer Institute and Grady Memorial Hospital at Emory University,
MD Anderson Cancer Center at University of Texas, Robert H. Lurie Comprehensive Cancer
Center at Northwestern University, Sidney Kimmel Cancer Institute at Johns Hopkins
University, Karmanos Cancer Institute at Wayne State University, University of Chicago
Comprehensive Cancer Center, Siteman Cancer Center at Washington University, St. John
Providence Health System, and Henry Ford Health System) and four NCI SEER cancer registries
(California, Detroit, New Jersey, and Louisiana). USC serves as the data coordinating center that
receives, processes and maintains all de-identified clinical and questionnaire data and
biospecimens. English speaking AA patients diagnosed with active or smoldering MM at age 20
years or older without allogeneic stem cell transplant were eligible for enrollment.
Genotyping data from 7,080 AA controls were used, which included 4,448 male
controls from the African Ancestry Prostate Cancer GWAS Consortium (AAPC, consisting of 13
independent studies) and 2,632 female controls from a breast cancer GWAS of AA women
(consisting of nine independent studies), both of which have been previously described [16, 17].
Genotyping and Imputation: DNA from buffy coat or saliva samples from AA patients
was genotyped using the Illumina HumanCore GWAS array and controls were previously
genotyped using the Illumina 1M-Duo (Illumina Inc., San Diego, California, USA). Quality
control steps for the controls are described elsewhere [16, 17]. Among cases, 37,046 SNPs and
11 samples with a call rate < 98% were removed. Cases were further excluded for the following
criteria: (i) unexpected replicates (n=14); (ii) first- or second-degree relatives (n=2); (iii) reported
sex conflicting with sex estimated from X chromosome heterozygosity, or XXY sex chromosome
aneuploidy (n=6). A small subset of controls (n=100) were genotyped on both arrays for QC
purposes; any SNP that was discordant between the two platforms was removed (n=3,134). Only
SNPs genotyped in both cases and controls were included for imputation (n=188,835). Prior to
merging the case and control genotype data, variant alleles were flipped to forward strand and
base pair positions were mapped to GRCh37/hg19. SNPs with discrepant alleles for the same
variant between cases and controls were set to missing. Imputation to 1KGP (March 2012
release) was conducted for 500 Kb regions around the seven previously identified SNPs. SNPs
with INFO >0.80 and a MAF > 1% were included in the analysis.
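The final inclusion rule for imputed SNPs can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the `ImputedSnp` class, the function name, and the three example SNPs are hypothetical, and only the two thresholds (INFO > 0.80, MAF > 1%) come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImputedSnp:
    """Hypothetical record for one imputed SNP."""
    rsid: str
    info: float  # imputation quality score (INFO)
    maf: float   # minor allele frequency


def passes_imputation_filters(snp, min_info=0.80, min_maf=0.01):
    """Keep a SNP only if INFO > min_info and MAF > min_maf (both strict)."""
    return snp.info > min_info and snp.maf > min_maf


# Made-up SNPs used only to exercise the filter.
snps = [
    ImputedSnp("snp_a", info=0.95, maf=0.20),
    ImputedSnp("snp_b", info=0.80, maf=0.20),   # fails: INFO is not above 0.80
    ImputedSnp("snp_c", info=0.99, maf=0.005),  # fails: MAF is below 1%
]
kept = [s.rsid for s in snps if passes_imputation_filters(s)]
```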
Statistical Analysis: The goal of our statistical analysis was twofold: first, to enhance, by
use of multiethnic fine-mapping, the localization of previously defined signals of association (i.e.
the index SNPs found to be genome-wide significant in the previous studies of Europeans), and
second, to search for new associations in regions near the previously defined signals.
Reflecting these two goals, within each of the seven regions of interest, SNPs (both typed and
imputed) were classified into two groups: Group A SNPs (r2≥0.20 with the index SNP, estimated in 1KGP
EUR populations) and Group B SNPs (r2<0.20). We were less stringent in our choice of criteria for
statistical significance for the Group A SNPs (the type I error rate or alpha level) than for Group
B SNPs because of the better prior knowledge of association of risk with the more strongly
correlated Group A SNPs than with Group B SNPs. Specifically, we controlled for multiple
comparisons separately for each region (demanding region-wide significance) when testing the
Group A SNPs but required more stringent experiment-wide significance across all regions for
the Group B SNPs.
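The Group A / Group B split described above amounts to thresholding each SNP's r2 with the index SNP. The sketch below assumes the r2 values (estimated in 1KGP EUR) are already available in a dictionary. The function name and SNP identifiers are hypothetical.

```python
def classify_snps(r2_with_index, threshold=0.20):
    """Split SNPs into Group A (r2 >= threshold) and Group B (r2 < threshold)."""
    group_a = sorted(s for s, r2 in r2_with_index.items() if r2 >= threshold)
    group_b = sorted(s for s, r2 in r2_with_index.items() if r2 < threshold)
    return group_a, group_b


# Made-up r2 values with the index SNP of one region.
r2_with_index = {"snp_1": 0.96, "snp_2": 0.20, "snp_3": 0.19, "snp_4": 0.02}
group_a, group_b = classify_snps(r2_with_index)
```

Group A SNPs would then be tested at the region-wide level and Group B SNPs at the stricter experiment-wide level.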
Alpha levels for each region were separately derived for the two groups of SNPs using
permutation testing. To achieve numerically stable results, 1000 replicates randomly shuffling
the case/control status of all samples while preserving the original case/control ratio were
generated for Group A and Group B SNPs within each region. For each replicate, we recorded the
minimum p-value of all tested SNPs and regarded the 5th percentile of the 1000 minimum p-values
as the permutation-based significance level for the Group A SNPs in that particular region.
The minimum alpha level for all Group A SNPs across the seven regions was α=1.5x10^-3. By
contrast, the significance levels for Group B SNPs were found at the 5/7=0.71th percentile, a
Bonferroni correction accounting for a total of seven regions of unrelated SNPs under
investigation. The significance levels for both groups across the seven regions are presented in
Supplementary Table 3. PCs were calculated in EIGENSTRAT [14] using 19,070 common
SNPs (MAF>0.05) with low pairwise LD (r2<0.2) selected from the 188,835 overlapping
genotyped SNPs. Unconditional logistic regression was performed adjusting for age (at
diagnosis for cases and at blood draw for controls), sex, and PC1-5. The dosage effects of the
risk allele assuming an additive genetic model were analyzed in a one degree-of-freedom
likelihood ratio test implemented in SNPTEST [18]. For the purpose of fine-mapping, we
searched among SNPs correlated with index at r2>0.20 (measured in 1KGP EUR) for more
significant SNPs than the index itself in the same region. All r2 values presented in the results are
calculated using European and African populations from 1KGP.
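The permutation procedure for deriving alpha levels can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used here. It replaces the covariate-adjusted logistic regression with a simple 1-df trend test (N times the squared correlation between dosage and case status), uses simulated genotypes, and runs 200 rather than 1000 replicates to keep it fast. For brevity both percentiles are read from the same toy SNP set, whereas the study permuted Group A and Group B SNPs separately. All function and variable names are hypothetical.

```python
import math
import random


def trend_test_p(dosages, is_case):
    """P-value of a 1-df trend test (N * r^2) between allele dosage and case status."""
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(is_case) / n
    sgy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, is_case))
    sgg = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((y - mean_y) ** 2 for y in is_case)
    if sgg == 0 or syy == 0:
        return 1.0  # monomorphic SNP or a single phenotype class: nothing to test
    chi2 = n * sgy * sgy / (sgg * syy)
    return math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df


def permuted_min_p(snp_dosages, is_case, n_perm, seed):
    """Sorted minimum p-values over all SNPs for n_perm shuffles of case status."""
    rng = random.Random(seed)
    labels = list(is_case)
    minima = []
    for _ in range(n_perm):
        rng.shuffle(labels)  # a shuffle preserves the case/control ratio
        minima.append(min(trend_test_p(d, labels) for d in snp_dosages))
    return sorted(minima)


def percentile_threshold(sorted_min_p, percent):
    """Nearest-rank (rounded down) empirical percentile of the null minimum p-values."""
    rank = int(len(sorted_min_p) * percent / 100.0)
    return sorted_min_p[max(rank - 1, 0)]


# Toy data: 20 hypothetical SNPs, 30 cases and 70 controls, no true association.
rng = random.Random(1)
is_case = [1] * 30 + [0] * 70
snp_dosages = [[rng.choice((0, 1, 2)) for _ in is_case] for _ in range(20)]

null_min_p = permuted_min_p(snp_dosages, is_case, n_perm=200, seed=2)
alpha_group_a = percentile_threshold(null_min_p, 5.0)        # region-wide level
alpha_group_b = percentile_threshold(null_min_p, 5.0 / 7.0)  # Bonferroni over 7 regions
```

Because the thresholds are read from the distribution of the minimum p-value, they account for the correlation among SNPs in a region without assuming the tests are independent.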
3.3.4 Multiethnic Fine-Mapping Analysis
Summary statistics were meta-analyzed using a fixed effects model weighted by the
inverse standard error in METAL [15]. Region-specific alpha-levels defined in the AA analysis
were applied to the multiethnic meta-analysis, as they are the most conservative in this group.
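As a rough sketch of the pooling step, a fixed-effects estimate can be formed by weighting each population's log odds ratio by the inverse of its variance, which is the usual fixed-effects estimator and is assumed here to correspond to the standard-error-based scheme in METAL. The code below is not METAL itself. The input estimates are hypothetical, and the heterogeneity p-value is implemented only for the two-study case.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance pooled log-OR, its SE, two-sided p-value, and Cochran's Q."""
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    p_value = math.erfc(abs(beta / se) / math.sqrt(2.0))  # two-sided normal p-value
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    return beta, se, p_value, q


def heterogeneity_p(q, n_studies):
    """P-value for Cochran's Q; only the two-study (1 df) case is handled here."""
    if n_studies != 2:
        raise ValueError("this sketch covers the two-study case only")
    return math.erfc(math.sqrt(q / 2.0))


# Hypothetical per-population estimates (log odds ratios and standard errors).
beta_ea, se_ea = math.log(1.20), 0.06
beta_aa, se_aa = math.log(1.40), 0.10
beta, se, p_value, q = fixed_effects_meta([beta_ea, beta_aa], [se_ea, se_aa])
pooled_or = math.exp(beta)
p_het = heterogeneity_p(q, n_studies=2)
```

The pooled odds ratio always lies between the two population-specific estimates and sits closer to the one with the smaller standard error.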
Genomic Annotation
To integrate chromatin biofeature annotations with our genotyping data in each region,
we used the R package FunciSNP, available at Bioconductor.org [19]. We identified all SNPs
that were correlated (r2> 0.8) with the most significant SNP in a 500 Kb region, or the top 10
most significant SNPs, whichever was larger. We selected publicly available datasets relevant
to the development of the B-cell lineage, most closely representing MM pathogenesis. The
following ENCODE datasets were employed to filter correlated SNPs that lie within putative
enhancer regions with Gene Expression Omnibus (GEO) accession IDs: B-cells CD20+
RO01778 DGF Peaks (GSM1014525), B-cells CD20+ RO01778 DNaseI HS Peaks
(GSM1024765, GSM1024766), B-cells CD20+ RO01794 HS Peaks (GSM1008588), CD20+
(RO 01778) H3K4me3 Histone Mod ChIP-seq Peaks (GSM945229), CD20+ RO01794
H3K27ac Histone Mods by ChIP-seq Peaks (GSM1003459), CD20+ (RO 01794) H3K4me3
Histone Mod ChIP-seq Peaks (GSM945198), CD20+ CTCF Histone Mods by ChIP-seq Peaks
(GSM1003474), CD20+ H2A.Z Histone Mods by ChIP-seq Peaks (GSM1003476), CD20+
H3K4me2 Histone Mods by ChIP-seq Peaks (GSM1003471). To define other physical map
features (transcription start sites, 5’ UTR, 3’ UTR) we downloaded annotations from the
February 2009 release of the human genome (GRCh37/hg19) available from the UCSC
genome browser [20]. Finally, we used the highly conserved set of predicted targets of
microRNA targeting at mircode.org (miRcode 11, June 2012 release [21]), and conserved high-quality
microRNA target species from microRNA.org (June 2010 release) [22]. The specific
settings used in FunciSNP are described in the Supplementary Methods.
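The rule for choosing which SNPs to annotate (all SNPs with r2 > 0.8 with the most significant SNP, or the top 10 most significant SNPs, whichever set is larger) can be sketched as below. This is one reading of that rule, not the FunciSNP implementation: the function name and inputs are hypothetical, the top SNP is counted as part of its own correlated set, and ties go to the top-10 list.

```python
def select_for_annotation(p_values, r2_with_top, r2_min=0.8, top_n=10):
    """Return the larger of: SNPs correlated with the top SNP, or the top_n SNPs by p-value."""
    ranked = sorted(p_values, key=p_values.get)  # most significant first
    top_snp = ranked[0]
    correlated = [
        s for s in ranked
        if s == top_snp or r2_with_top.get(s, 0.0) > r2_min
    ]
    top_hits = ranked[:top_n]
    return correlated if len(correlated) > len(top_hits) else top_hits


# Made-up region with 12 SNPs; only one SNP is strongly correlated with the top SNP.
p_values = {f"snp{i}": (i + 1) * 1e-4 for i in range(12)}
r2_with_top = {"snp1": 0.95, "snp5": 0.50}
selected = select_for_annotation(p_values, r2_with_top)
```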
3.4 Results
Ethnic-specific Replication of Known Risk Variants: Among EAs, we observed
directionally consistent effects (OR>1) for all seven risk variants, with four variants replicating
at p<0.05 (3p22.1, p=4x10^-3; 7p15.3, p=7x10^-4; 17p11.2, p=2x10^-4; 22q13.1, p=4x10^-4, Table
3.1). We had ≥90% power to detect the originally reported effect for four SNPs (all four were
nominally significant) and 39-54% power for the other three SNPs (one was nominally
significant). We also observed directionally consistent results for the seven published risk
variants among AAs (OR>1), with two replicating at p<0.05 (7p15.3, p=5.5x10^-5; 22q13.1,
p=3x10^-2, Table 3.1). All previously reported risk variants were common among AAs (MAF
ranging from 0.07 to 0.47). In the AA sample, we had ≥90% power to detect the effect observed
in EAs for three SNPs (one was nominally significant), and 22-84% power for the other four
SNPs (one was nominally significant; Table 3.1). There were no statistically significant
associations for any SNPs and MM risk using Group B SNP alpha levels, although a marginally
significant association (p=1.86x10^-6) was observed in the 6p21.33 region in AAs only
(Supplementary Table 2). Ethnic-specific results for all regions are provided in
Supplementary Tables 1 and 2.
Multiethnic Fine-Mapping: The directionally consistent effects for each index SNP in
both populations suggest that there is likely to be a single underlying common functional variant
in each region that is shared across populations. In an attempt to better localize the functional
variant, summary statistics from the EA and AA studies were meta-analyzed using a fixed effects
model for six of the seven known risk regions. The HLA region on chromosome 6p21.33 was
excluded from the meta-analysis because of extreme sensitivity of the region to population
stratification due to ethnic-specific extended haplotypes and underlying LD patterns requiring
more SNP density than available here for interpretable results [23].
For the multiethnic fine-mapping analysis, we defined the alpha level of significance for
SNP associations based on permutation testing using the AA data as it provides the most
conservative estimates. We found statistically significant associations in all regions except
3q26.2 (Table 3.1, Supplementary Figure 2) for Group A SNPs; however, there were no
significant associations for Group B SNPs in any of the regions. Below we describe the top
associations from multiethnic fine-mapping in each region.
2p23.3. rs732075 (OR=1.16, p=3.0x10^-4) was the most significantly associated SNP in this region. Variant rs732075 was
significant at p<0.05 in both EAs (OR=1.22, p=2.0x10^-3) and AAs (OR=1.18, p=0.02) and was
common in both populations (RAF=0.59 and 0.62 in EAs and AAs, respectively, Table 3.1). It is
weakly correlated with the index SNP in EAs (r2=0.28) and AAs (r2=0.09, Table 3.1). Based on
genomic annotation, rs732075 and five correlated SNPs are located within likely enhancers and
have the potential to affect transcription factor binding (Supplementary Table 4). One of the
correlated SNPs, rs10180663, is situated within a DNase hypersensitive region of an H3K27-
acetylated enhancer intronic to the DTNB gene. This variant is also bioinformatically rich,
significantly altering or disrupting 11 potential high-quality transcription factor motifs.
3p22.1. In this region, rs73069394 was the most significantly associated SNP (OR=1.20,
p=1.32x10^-5). Variant rs73069394 is common and is correlated with the index SNP in both EA
and AA populations, respectively (RAF=0.19, r2=0.96; RAF=0.62, r2=0.77, Table 3.1). There
was no evidence found of potential function for this SNP.
3q26.2. Weak signals were observed in this region in both the ethnic-specific and
multiethnic fine-mapping. After adjusting for multiple comparisons, no Group A or B SNP in
this region reached statistical significance. The strongest association was observed for
rs12637184 (OR=1.15, p=0.01). In both EAs and AAs, rs12637184 is highly correlated with the
reported index SNP rs10936599 (r2=0.94 and 1.0, respectively). This SNP maps to the 5’ UTR
region of ACTRT3, a gene encoding an actin-related protein, and although this SNP did not
reach statistical significance, it does overlap with five genomic features suggesting potential
regulatory activity. Of 21 correlated SNPs, four SNPs were located in coding regions and two of
these were synonymous; the remainder were located in enhancers; all had putative effects on
transcription factor binding based on in silico prediction.
7p15.3. The index SNP, rs4487645 (OR=1.30, p=8.7x10^-8), was the most significantly
associated SNP in this region in the multiethnic fine-mapping analysis. rs4487645 was strongly
correlated with the best markers in this region in EAs (rs112941857, r2=0.90, Supplementary
Table 2) and AAs (rs56333627, r2=0.89, Supplementary Table 1). This SNP is located in
intron 80 of DNAH11, a gene encoding a ciliary outer dynein arm protein, and is 47 kb
upstream of CDCA7L, a cell-cycle gene that has recently been shown to have increased
expression in malignant plasma cells [24]. This variant overlaps seven putative regulatory
features and alters the binding motif for GATA transcription factors (GATA1-6, Figure 3.1,
Supplementary Tables 4 and 5). Eight SNPs correlated with rs4487645 are in enhancers that
likely alter binding with other transcription factors.
17p11.2. rs34562254 (OR=1.31, p=2.5x10^-6) was the most significantly associated SNP
in this region in the multiethnic analysis as well as in the ethnic-specific analyses (Table 3.1,
Supplementary Tables 1 and 2). This variant is highly correlated with the reported index SNP
in Europeans (r2=0.90) and weakly correlated in Africans (r2=0.33, Table 3.1). rs34562254 is a
missense variant (Pro251Leu) located in TNFRSF13B, which encodes a lymphocyte-specific member of
the tumor necrosis factor receptor superfamily that interacts with the NF-κB pathway. Five correlated
SNPs are located in enhancers and improve or disrupt binding with various transcription factors,
including rs57382045, which potentially improves binding with TBET and EOMES, two
transcription factors necessary for the development of CD8+ T-cells that fight viral infections
[25, 26].
22q13.1. rs139400 (OR=1.19, p=1.2x10^-6) was the most significantly associated SNP in
this region and is correlated with the reported index SNP in both EAs (r2=0.96) and AAs
(r2=0.63). The SNP is located in intron 11 of CBX7, a tumor suppressor gene which is down-regulated
in multiple cancers [27, 28]. Eight correlated SNPs in enhancers also potentially affect
transcription factor binding, including notably rs79503, which disrupts a putative binding site for FOXP3, a
master regulator of lymphoid T cell development [29, 30].
Five of the seven index SNPs and all six of the better markers identified by multiethnic
meta-analysis were at least marginally more common among African Americans, with
rs1052501/rs73069394 showing the largest difference (MAF in EAs 22%, in AAs 63%)
(Supplementary Figure 3). In total, through functional annotation, we identified 103 SNPs in
the six regions analyzed in the multiethnic analysis, 57 of which overlapped with biologically
relevant genomic features consistent with regulatory sites based on CD20+ (B lymphocyte) cells
(see Supplementary Methods, Supplementary Table). Forty-seven SNPs were located in
putative enhancers, four in untranslated regions, and three each in coding and promoter regions.
3.5 Discussion
This is the first study to comprehensively examine the known risk regions in MM in
individuals of African ancestry. We replicated the direction of the effect estimates for each of
the seven previously reported variants in both EA and AA populations which suggests that
many of the risk loci for MM found in individuals of European ancestry are also risk loci in
individuals of African ancestry. In multiethnic fine-mapping, we also identified SNPs in five
GWAS risk regions that are more significant than the index SNP, which suggests that they are
the functional alleles or better proxies of the functional alleles in these populations. The
genomic annotation of these variants highlights potential functional impact within enhancer,
promoter or protein coding sequence for some of the variants.
From the most significant associations in each region, we observed four variants from
the multiethnic fine-mapping analysis that ranked high for potential regulatory activity; three
SNPs overlapped with ≥3 putative regulatory genomic annotations and the fourth was a
missense variant located in TNFRSF13B. In 7p15.3, the index SNP (rs4487645) was
significantly associated with MM risk in both EAs and AAs, overlapped seven different
markers of potential functionality and was predicted to affect eight different transcription
factor binding motifs including GATA1-GATA5 transcription factors affecting T-cell and
hematopoietic stem cell differentiation (Supplementary Table 4; see Supplementary
Methods). The intronic variant rs4487645 is located in DNAH11, a gene involved in the
movement of respiratory cilia [31], which has not been previously associated with MM risk. The
variant is also located in the 3’ region of CDCA7L, a cell division cycle gene, which is
expressed in both normal B-cells and MM cell lines [32, 33]. CDCA7L has recently been
identified as a target gene for MYC, a known oncogene that has been associated with MM
progression [34]. In a recent publication, Weinhold et al. generated expression quantitative
trait loci (eQTL) data on malignant plasma cells in 848 MM patients. They found the strongest
association for rs4487645, which is associated with cis-regulation of CDCA7L [24].
Both the ethnic-specific and multiethnic analyses identified the missense variant
rs34562254 (Pro251Leu) located in TNFRSF13B, as the most significant SNP in 17p11.2. This
gene encodes a protein that is a lymphocyte-specific member of the tumor necrosis factor
(TNF) receptor superfamily that interacts with the NF-κB pathway, which plays a key role in
B-cell activation. Activation of this pathway occurs during MM progression and the standard
treatment of MM involves immunomodulatory therapies that inhibit activation of this pathway
[35, 36].
This study includes the largest existing collection of AA MM cases and controls and is
the first to examine these reported risk regions in this high-risk group. One limitation is that
AA cases and controls were genotyped on different arrays with only a small number of
overlapping SNPs (n=188,835 SNPs genome-wide) which limited our ability to identify novel
variants (Group B SNPs) and to examine the overlap in the HLA region. However, we
performed rigorous QC on genotyped SNPs, which included dropping discordant SNPs
identified in a subset of 100 controls that were genotyped on both arrays (see Methods). These
QC procedures allowed us to impute cases and controls together, thereby providing more
accurate imputed genotypes for the same SNPs in cases and controls. However, there were not
a large number of genotyped SNPs across each of the regions which made imputation
challenging, i.e., in the 17p11.2 and 22q13.1 regions, where over half of the imputed SNPs
with a MAF > 1% were excluded due to poor quality scores (single variant info score < 0.8 in
IMPUTE2, Supplementary Figures 1 and 2, Supplementary Table 6).
Another limitation of this study is the small sample size of each ethnic group; however,
power was greatly enhanced by combining the data across ethnicities in a meta-analysis. In the
EA analysis, we had 24% power to detect an OR of 1.25 for an allele frequency of 10% while in
the multiethnic analysis we had 83% power to detect this same effect size. Because MM is a rare
disease with a relatively poor 5-year survival rate (~40%) and poor tolerance of therapeutic
regimens, patients are often too ill to participate in studies. Therefore, unlike similar studies of
common solid tumor malignancies, it is often difficult to accrue enough patients to achieve
adequate statistical power. However, we were able to supplement AA cases with a large number
of controls from pre-existing GWAS. Power was further enhanced by combining the EA and AA
summary statistics in a multiethnic meta-analysis which leveraged the differential LD in these
two populations in an attempt to more accurately approximate the true signal in the regions.
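The gain in power from pooling populations can be illustrated with a normal approximation for a 1-df additive test. This is a sketch under stated assumptions and not the power calculation used for the figures above. It takes the information on the log odds ratio to be roughly N·φ(1−φ)·2p(1−p) (N subjects, case fraction φ, risk allele frequency p, Hardy-Weinberg proportions) and sums it across populations. The significance level of 1.5x10^-3 and the use of the same 10% allele frequency in both populations are assumptions, and the function names are hypothetical. The sample sizes are those given in Sections 3.3.2 and 3.3.3.

```python
import math
from statistics import NormalDist


def information(raf, n_cases, n_controls):
    """Approximate information on the log-OR: N * phi * (1 - phi) * 2p(1 - p)."""
    n = n_cases + n_controls
    case_fraction = n_cases / n
    return n * case_fraction * (1.0 - case_fraction) * 2.0 * raf * (1.0 - raf)


def power_additive(odds_ratio, strata, alpha):
    """Two-sided power of a 1-df test pooling strata given as (raf, n_cases, n_controls)."""
    total_info = sum(information(*s) for s in strata)
    shift = abs(math.log(odds_ratio)) * math.sqrt(total_info)
    normal = NormalDist()
    z_crit = normal.inv_cdf(1.0 - alpha / 2.0)
    return normal.cdf(shift - z_crit) + normal.cdf(-shift - z_crit)


ALPHA = 1.5e-3            # assumed significance level, not stated for the power figures
ea = (0.10, 1264, 1479)   # (risk allele frequency, cases, controls)
aa = (0.10, 1049, 7080)   # same allele frequency assumed for illustration
power_ea = power_additive(1.25, [ea], ALPHA)
power_combined = power_additive(1.25, [ea, aa], ALPHA)
```

Under this approximation the noncentrality grows with the summed information, so adding the AA sample always raises power for a shared effect, even though its case fraction is much smaller.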
Although we did not conduct a multiethnic meta-analysis of the HLA region due to the
extreme sensitivity of the region to population stratification and long-range LD, we did
observe signals in this region for both AAs and EAs that differed by ethnicity, as expected. A
possible independent signal (rs116004192, p=1.9x10^-6, [Group B]) was observed in AAs that
will need confirmation in a larger sample (genotyping underway).
In this study, we have replicated the direction of effect observed for the index variant
in each of the seven known risk regions in EAs and AAs, and have identified better markers for
five of these regions in AAs. Replicating the directions of effect in both EAs and AAs suggests
common shared functional variants across populations. AAs had higher frequencies of all
multiethnic better markers and five of seven index SNPs, possibly contributing to, but unlikely
to explain all of, the excess risk. The AAMMS GWAS and other large-scale discovery efforts
in AA populations will be required to better understand the degree to which there is a genetic
basis underlying the greater risk of disease.
Table 3.1 Replication of reported index SNPs and most significant associations for each region from the multiethnic meta-analysis.

| Index SNP^a / most significantly associated SNP^b | CHR:BP | Risk | Reported Freq | Reported OR | Reported P-value | EA Freq | EA OR | EA P-value | EA Pow | AA Freq | AA OR | AA P-value | AA Pow | Meta OR | Meta P-value | Phet | r2 w/index^c |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rs6746082^a | 2:25659244 | A/C | 0.76 | 1.29 | 1.22E-07 | 0.79 | 1.15 | 0.0517 | 0.54 | 0.56 | 1.04 | 0.117 | 0.84 | 1.07 | 0.083 | 0.24 | |
| rs732075^b | 2:25596431 | G/T | | | | 0.59 | 1.22 | 0.002 | | 0.62 | 1.12 | 0.0196 | | 1.16 | 3.0E-04 | 0.28 | 0.09 / 0.28 |
| rs1052501^a | 3:41925398 | G/A | 0.20 | 1.32 | 7.47E-09 | 0.22 | 1.23 | 0.0044 | 0.90 | 0.63 | 1.12 | 0.148 | 0.99 | 1.16 | 4.5E-04 | 0.31 | |
| rs73069394^b | 3:41787233 | A/G | | | | 0.19 | 1.24 | 0.003 | | 0.62 | 1.18 | 0.0153 | | 1.20 | 1.3E-05 | 0.55 | 0.77 / 0.96 |
| rs10936599^a | 3:16949210 | G/A | 0.75 | 1.26 | 1.74E-13 | 0.79 | 1.12 | 0.0841 | 0.39 | 0.93 | 1.17 | 0.459 | 0.22 | 1.13 | 0.020 | 0.73 | |
| rs12637184^b | 3:16948743 | G/A | | | | 0.76 | 1.13 | 0.06 | | 0.92 | 1.19 | 0.2654 | | 1.15 | 0.010 | 0.64 | 0.94 / 1.00 |
| rs2285803^a | 6:31107258 | A/G | 0.28 | 1.19 | 1.18E-10 | 0.29 | 1.11 | 0.1273 | 0.43 | 0.26 | 1.06 | 0.183 | 0.51 | -^d | | | -^e |
| rs4487645^a | 7:21938240 | C/A | 0.65 | 1.38 | 3.33E-15 | 0.70 | 1.23 | 0.0007 | 0.95 | 0.89 | 1.48 | 5.5E-05 | 0.75 | | | | |
| rs4487645^b | 7:21938240 | C/A | | | | 0.70 | 1.23 | 0.0007 | | 0.89 | 1.48 | 5.5E-05 | | 1.30 | 8.7E-08 | 0.07 | -^f |
| rs4273077^a | 17:1684913 | G/A | 0.11 | 1.26 | 1.41E-07 | 0.12 | 1.37 | 0.0002 | 0.98 | 0.14 | 1.12 | 0.13 | 0.99 | 1.21 | 4.3E-04 | 0.06 | |
| rs34562254^b | 17:1684299 | A/G | | | | 0.12 | 1.45 | 2.39E-05 | | 0.13 | 1.21 | 0.0022 | | 1.31 | 2.5E-06 | 0.12 | 0.33 / 0.90 |
| rs877529^a | 22:3954229 | A/G | 0.44 | 1.23 | 2.29E-16 | 0.45 | 1.21 | 0.0004 | 0.94 | 0.47 | 1.13 | 0.026 | 0.98 | | 2.7E-05 | 0.30 | |
| rs139400^b | 22:3954539 | T/C | | | | 0.49 | 1.22 | 0.0004 | | 0.53 | 1.17 | 0.0021 | | 1.19 | 1.2E-06 | 0.63 | 0.63 / 0.96 |

EA, association in European ancestry; AA, association in African ancestry; Meta, combined meta-analysis.
a Index SNP in each region.
b Most significant SNP in each region from the multiethnic meta-analyses
c r2 from 1KGP (AFR/EUR reference)
d The HLA region was not analyzed in the multiethnic meta-analysis
e Multiethnic analyses were not performed for the HLA region
f Index SNP
Figure 3.1 Genomic annotation of the region around rs4487645 on chromosome 7.
3.6 Chapter 3 References
1. Lohr, J.G., et al., Widespread genetic heterogeneity in multiple myeloma: implications for targeted therapy.
Cancer Cell, 2014. 25(1): p. 91-101.
2. Gebregziabher, M., et al., Risk patterns of multiple myeloma in Los Angeles County, 1972-1999 (United
States). Cancer Causes Control, 2006. 17: p. 931-938.
3. Landgren, O. and B.M. Weiss, Patterns of monoclonal gammopathy of undetermined significance and
multiple myeloma in various ethnic/racial groups: support for genetic factors in pathogenesis. Leukemia,
2009. 23: p. 1691-1697.
4. Broderick, P., et al., Common variation at 3p22.1 and 7p15.3 influences multiple myeloma risk. Nat Genet,
2011. 44: p. 58-61.
5. Chubb, D., et al., Common variation at 3q26.2, 6p21.33, 17p11.2 and 22q13.1 influences multiple myeloma
risk. Nat Genet, 2013. 45: p. 1221-5.
6. Cozen, W., et al., Interleukin-6-related genotypes, body mass index, and risk of multiple myeloma and
plasmacytoma. Cancer Epidemiol Biomarkers Prev, 2006. 15: p. 2285-2291.
7. De Roos, A.J., et al., Metabolic gene variants and risk of non-Hodgkin's lymphoma. Cancer Epidemiol
Biomarkers Prev, 2006. 15(9): p. 1647-53.
8. Kolonel, L.N., et al., A multiethnic cohort in Hawaii and Los Angeles: baseline characteristics. Am J
Epidemiol, 2000. 151(4): p. 346-57.
9. Giles, G.G. and D.R. English, The Melbourne Collaborative Cohort Study. IARC Sci Publ, 2002. 156: p.
69-70.
10. Greenberg, A.J., et al., Single-nucleotide polymorphism rs1052501 associated with monoclonal
gammopathy of undetermined significance and multiple myeloma. Leukemia, 2012.
11. Holly, E.A., C.A. Eberle, and P.M. Bracci, Prior history of allergies and pancreatic cancer in the San
Francisco Bay area. Am J Epidemiol, 2003. 158: p. 432-441.
12. Howie, B.N., P. Donnelly, and J. Marchini, A flexible and accurate genotype imputation method for the
next generation of genome-wide association studies. PLoS Genet, 2009. 5(6): p. e1000529.
13. Browning, S.R. and B.L. Browning, Rapid and accurate haplotype phasing and missing-data inference for
whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet, 2007. 81(5):
p. 1084-97.
14. Price, A.L., et al., Principal components analysis corrects for stratification in genome-wide association
studies. Nat Genet, 2006. 38: p. 904-909.
15. Willer, C.J., Y. Li, and G.R. Abecasis, METAL: fast and efficient meta-analysis of genomewide association
scans. Bioinformatics, 2010. 26(17): p. 2190-1.
16. Han, Y., et al., Generalizability of established prostate cancer risk variants in men of African ancestry. Int
J Cancer, 2015. 136(5): p. 1210-7.
17. Feng, Y., et al., A comprehensive examination of breast cancer risk loci in African American women. Hum
Mol Genet, 2014. 23(20): p. 5518-26.
18. Marchini, J., et al., A new multipoint method for genome-wide association studies by imputation of
genotypes. Nat Genet, 2007. 39(7): p. 906-13.
19. Coetzee, S.G., et al., FunciSNP: an R/bioconductor tool integrating functional non-coding data sets with
genetic association studies to identify candidate regulatory SNPs. Nucleic Acids Res, 2012. 40(18): p.
e139.
20. Karolchik, D., et al., The UCSC Genome Browser database: 2014 update. Nucleic Acids Res, 2014.
42(Database issue): p. D764-70.
21. Jeggari, A., D.S. Marks, and E. Larsson, miRcode: a map of putative microRNA target sites in the long
non-coding transcriptome. Bioinformatics, 2012. 28(15): p. 2062-3.
22. Betel, D., et al., Comprehensive modeling of microRNA targets predicts functional non-conserved and non-
canonical sites. Genome Biol, 2010. 11(8): p. R90.
23. Gourraud, P.A., et al., HLA diversity in the 1000 genomes dataset. PLoS One, 2014. 9(7): p. e97282.
24. Weinhold, N., et al., The 7p15.3 (rs4487645) association for multiple myeloma shows strong allele-specific
regulation of the MYC-interacting gene CDCA7L in malignant plasma cells. Haematologica, 2014.
25. McLane, L.M., et al., Differential localization of T-bet and Eomes in CD8 T cell memory populations. J
Immunol, 2013. 190(7): p. 3207-15.
26. Popescu, I., et al., T-bet:Eomes balance, effector function, and proliferation of cytomegalovirus-specific
CD8+ T cells during primary infection differentiates the capacity for durable immune control. J Immunol,
2014. 193(11): p. 5709-22.
27. Forzati, F., et al., CBX7 is a tumor suppressor in mice and humans. J Clin Invest, 2012. 122(2): p. 612-23.
28. Guan, Z.P., et al., [Downregulation of chromobox protein homolog 7 expression in multiple human cancer
tissues]. Zhonghua Yu Fang Yi Xue Za Zhi, 2011. 45(7): p. 597-600.
29. Hori, S., T. Nomura, and S. Sakaguchi, Control of regulatory T cell development by the transcription factor
Foxp3. Science, 2003. 299(5609): p. 1057-61.
30. Fontenot, J.D., et al., Regulatory T cell lineage specification by the forkhead transcription factor foxp3.
Immunity, 2005. 22(3): p. 329-41.
31. Lucas, J.S., et al., Static respiratory cilia associated with mutations in Dnahc11/DNAH11: a mouse model
of PCD. Hum Mutat, 2012. 33(3): p. 495-503.
32. The Genotype-Tissue Expression (GTEx) project. Nat Genet, 2013. 45(6): p. 580-5.
33. Barretina, J., et al., The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug
sensitivity. Nature, 2012. 483(7391): p. 603-7.
34. Tian, Y., et al., CDCA7L promotes hepatocellular carcinoma progression by regulating the cell cycle. Int J
Oncol, 2013. 43(6): p. 2082-90.
35. Demchenko, Y.N. and W.M. Kuehl, A critical role for the NFkB pathway in multiple myeloma. Oncotarget,
2010. 1(1): p. 59-68.
36. Hideshima, T., et al., NF-kappa B as a therapeutic target in multiple myeloma. J Biol Chem, 2002.
277(19): p. 16639-47.
4 Expanding GWAS through the Reuse of Existing Genotype Data
4.1 Combining External Data Helps Enhance the Study Power
As seen in the example of fine-mapping risk loci for multiple myeloma (MM) among
African Americans in Chapter 3, existing genotype data from 7,080 MM-free controls were
merged with those of 1,080 MM cases to effectively increase the total sample size. This strategy
is attractive because large data sets can be acquired rapidly and at relatively low cost, which is
especially beneficial to studies of ethnic minorities, for which DNA samples are scarce and
response rates are lower [1]. In a case-control study, a case-control ratio of 1:m for n cases has
power equivalent to a 1:1 case-control ratio with $\frac{2m}{m+1}n$ cases. For example, accruing 7,080
MM-free controls from studies of breast and prostate cancers (AABC and AAPC) in African
Americans ($m \approx 7$) would achieve power equivalent to that of about 1,890 case-control pairs (a
75% increase over the 1,080 cases). As m grows arbitrarily large, the effective sample size approaches
twice the number of cases.
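The equivalence stated above is easy to evaluate numerically. The following Python sketch computes the number of 1:1 case-control pairs that would give the same power as n cases with m controls per case; the function name `equivalent_pairs` is purely illustrative and not part of the original analysis.

```python
def equivalent_pairs(n_cases: float, m: float) -> float:
    """Number of 1:1 case-control pairs with power equivalent to a design
    with n_cases cases and m controls per case: 2 * m * n / (m + 1)."""
    if n_cases <= 0 or m <= 0:
        raise ValueError("n_cases and m must be positive")
    return 2.0 * m * n_cases / (m + 1.0)


if __name__ == "__main__":
    n = 1080  # number of MM cases quoted in the text
    for m in (1, 2, 7, 100):
        pairs = equivalent_pairs(n, m)
        gain = pairs / n - 1.0
        print(f"m = {m:>3}: {pairs:7.1f} equivalent pairs ({gain:+.1%} relative to 1:1)")
```

The gain flattens quickly: most of the benefit is reached within the first handful of controls per case, and the ratio can never exceed 2.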
When existing genotypes or publicly available data can be acquired at little extra cost,
this presents an opportunity to boost study power or reduce costs. Large GWAS databases
are now available, e.g., the database of Genotypes and Phenotypes (dbGaP,
www.ncbi.nlm.nih.gov/gap) and the European Genome-Phenome Archive (EGA,
www.ebi.ac.uk/ega). Researchers can search those databases for data compatible with
their own, or collaborate through large GWAS consortia, so that study power can be
substantially improved.
4.2 Caveats in Combining External Genotype Data
Despite access to an abundance of virtually free data, one should
proceed with extreme caution, as there are potential pitfalls in consolidating data from a variety
of sources. As emphasized in Chapter 1, cases and controls should be sampled from the same
underlying population and/or their population structure differences need to be properly
accounted for, so that the observed allele frequency differences can be attributed to disease
susceptibility loci rather than to subtle population structure. This is a special challenge
for studies of admixed populations, such as African Americans, as cases and controls may exhibit
considerable differences in ancestry. Such differences are even more pronounced in large-scale
multicenter GWAS in which samples are collected with great geographic and demographic
diversity. Failing to recognize this will give rise to spurious associations and negate the power
improvement stemming from increased sample sizes. How to maximize the use of available data
remains an open question. In the following sections, I use the multiple myeloma study in
African Americans (AAMM) as an example to investigate issues of combining case and control
data typed on different chips, and explore possible strategies to ameliorate issues arising from
case-control studies in admixed populations.
4.3 Complications in the Example of the AAMM Study
Multiple myeloma (MM) is a rare disease with an annual incidence rate of 6.1 cases per
100,000 persons and an annual death rate of 3.4 per 100,000 persons (2007-2011); five-year
survival rate after diagnosis was as low as 45% (2004-2010) (www.seer.cancer.gov). There are
notable differences in incidence and death rates across racial/ethnic groups: African
Americans have a 2- to 3-fold higher risk of developing MM than whites and Asian
Americans in the US [2]. GWAS conducted in European populations have identified a few risk
loci that are implicated in excess risk for MM [3, 4]. Whether the risk disparities observed
between African Americans and whites are attributable to genetic risk factors is therefore of
great importance.
Because of the very low incidence rates, coupled with a relatively poor survival rate,
accrual of a large number of MM cases to achieve adequate statistical power is difficult from the
beginning of study design. The paucity of African American MM cases presents a major
impediment to GWAS of MM in African Americans. While supplementing cases with a great
number of controls improves power to some extent, increasingly large sample sizes may
require more complicated adjustment for admixture and cryptic relatedness at the expense of
statistical power. In AAMM, cases were obtained from 11 clinical centers, and controls in AABC
and AAPC were contributed by 9 and 13 centers, respectively, widely distributed across the country.
African Americans are an admixed population, largely of African and European
ancestry, with varying degrees of admixture with Native Americans and Hispanics, who are
themselves admixed populations with ancestry traced to Asia [5]. To quantify the
admixture patterns, we ran the program STRUCTURE [6] to estimate ancestry for each individual within
each center of the AAMM, AABC, and AAPC studies. The YRI, CEU, and JPT samples from
HapMap Phase 3 were merged with our genotype data and we requested k=3 clusters
so that the estimated proportions of each ancestry would have meaningful interpretations,
i.e., the three ancestral proportions align with African, European, and Asian ancestry. Figure
4.1 shows distributions of the estimated ancestral proportions by study center. MM cases tended
to have slightly higher African ancestry (~3 to 4% more) and less European ancestry than AABC
or AAPC controls, although the within-center variation was much larger than the between-center
or case-control variation (Table 4.1). Additional admixture with Asians was
minimal across all centers (< 2%).
Given the complications in admixture patterns, the feasibility of reusing controls from
AABC and AAPC in hopes of achieving better power has been a topic of interest. Chen et al.,
using the AABC data, discussed the statistical power of association tests with adjustment for
ancestral differences in the context of reusing controls with different admixture fractions from
cases [7]. Their results suggested that the loss of power due to adjustment for admixture differences
is small if the within-group heterogeneity of the ancestry fractions is much greater than the
between-group heterogeneity, a situation similar to what is seen in the above AAMM example;
conversely, power will be limited by an upper bound if the between-group differences in
admixture are so substantial that they dwarf the within-group differences, even if the total number of
subjects grows arbitrarily large. This constitutes a less desirable situation where introducing
external data does not directly translate to greater power.
Table 4.1 Proportions of estimated African ancestry for individuals in AAMM, AABC, and AAPC by
study site.
AAMM
Site              n  mean  s.d.
California       37  0.77  0.12
Louisiana        44  0.77  0.15
New Jersey       47  0.78  0.16
Johns Hopkins   110  0.78  0.14
Wayne State      59  0.79  0.10
Detroit           8  0.80  0.09
Wash U           66  0.80  0.09
Northwestern     65  0.80  0.08
Providence       18  0.81  0.12
MD Anderson     225  0.81  0.10
U Chicago        52  0.82  0.08
Henry Ford       39  0.83  0.09
Emory           178  0.83  0.08
Grady            58  0.84  0.07
Average              0.81  0.11
AABC
Site              n  mean  s.d.
MEC             907  0.74  0.16
CARE            212  0.76  0.14
BCFR             50  0.77  0.14
PLCO            111  0.77  0.14
SFBC            219  0.78  0.14
NBHS            179  0.78  0.13
WFBC            142  0.79  0.14
WCHS            232  0.80  0.16
CBCS            580  0.81  0.12
Average              0.77  0.14
AAPC
Site              n  mean  s.d.
CPS-II          110  0.72  0.15
MEC            1645  0.75  0.15
SELECT          208  0.75  0.18
LAAPC           285  0.76  0.15
KCPCS            75  0.76  0.13
PLCO            212  0.77  0.14
MDA             436  0.80  0.14
GECAP            88  0.81  0.10
DCPC            335  0.81  0.15
NCPCS           236  0.81  0.11
CaPGene          85  0.81  0.12
SCCS            509  0.85  0.09
PCBP            224  0.88  0.09
Average              0.78  0.15
Abbreviations: AABC: the Multiethnic Cohort Study (MEC); the Los Angeles component of the Women's
Contraceptive and Reproductive Experiences (CARE) Study; the Women's Circle of Health Study (WCHS); the San
Francisco Bay Area Breast Cancer Study (SFBCS); the Northern California Breast Cancer Family Registry
(NC-BCFR); the Carolina Breast Cancer Study (CBCS); the Prostate, Lung, Colorectal, and Ovarian Cancer
Screening Trial (PLCO) Cohort; the Nashville Breast Health Study (NBHS); Wake Forest University Breast
Cancer Study (WFBC).
AAPC: the Multiethnic Cohort (MEC); the Southern Community Cohort Study (SCCS); Cancer Prevention Study II
(CPSII); Prostate Cancer Case-Control Studies at MD Anderson (MDA); the Los Angeles Study of Aggressive
Prostate Cancer (LAAPC); Prostate Cancer Genetics Study (CaP Genes); Case-Control Study of Prostate Cancer
among African Americans in Washington, DC (DCPC); King County (Washington) Prostate Cancer Studies
(KCPCS); the Gene-Environment Interaction in Prostate Cancer Study (GECAP); North Carolina Prostate Cancer
Study (NCPCS); Selenium and Vitamin E Cancer Prevention Trial (SELECT); Prostate Cancer in a Black
Population (PCBP); Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO).
Figure 4.1 Distributions of the estimated ancestral proportions by study center. From top to bottom: AAMM, AABC, and AAPC. Red,
African (YRI); green, European (CEU); yellow, Asian (JPT). Horizontal dotted lines mark the 80% and 20% of ancestry fraction,
respectively.
4.4 Statistical Power of Association Tests While Controlling for Population Structure
4.4.1 The Bourgain Test: a Retrospective Approach
In the theoretical derivation of the non-centrality parameter (ncp) that quantifies power in
an association test, Chen et al. adopted a “retrospective” model proposed by Bourgain et al. [8].
In a standard test for genotype-disease associations, the disease outcome is modeled as the
outcome variable and genotypes as well as other covariates are designated as the explanatory
variables, termed as the “prospective” approach. The “retrospective” Bourgain test reverses the
order and models the marker data as a function of case-control status.
Let $\mathbf{S}_j = (S_{1j}, S_{2j}, \ldots, S_{Nj})^T$ be the vector of allele counts for a given SNP j for a total of N
subjects, each element representing the number of copies of the test allele, assuming an additive model.
Consider an $N \times 2$ design matrix $\mathbf{C}$ of the form of (4.1),
$$\mathbf{C} = \begin{pmatrix} 1 & d_1 \\ 1 & d_2 \\ \vdots & \vdots \\ 1 & d_N \end{pmatrix} \qquad (4.1)$$
where the first column is all equal to 1 and the second column, $d_i$, denotes the case-control status
(1/0) of subject i. The retrospective model relating the mean of $\mathbf{S}_j$, denoted as $\mathbf{u}_j$, to the
disease status is written as equation 4.2 with parameter coefficients (4.3),
$$\mathbf{u}_j = \mathbf{C}\boldsymbol{\beta}_j \qquad (4.2)$$
$$\boldsymbol{\beta}_j = (\beta_{1j}, \beta_{2j})^T \qquad (4.3)$$
Genetic relatedness between subjects is characterized by the covariance matrix,
$$\operatorname{cov}(\mathbf{S}_j) = \sigma_j^2 \mathbf{K} \qquad (4.4)$$
where $\sigma_j^2$ varies by SNP but the $\mathbf{K}$ matrix is the same for all SNPs [9]. In GWAS of
nominally unrelated samples, a large number of randomly selected SNPs are used to compute
the $\mathbf{K}$ matrix as
$$\hat{\mathbf{K}} = \frac{1}{M} \sum_{j=1}^{M} \frac{(\mathbf{S}_j - 2p_j)(\mathbf{S}_j - 2p_j)^T}{2p_j(1 - p_j)} \qquad (4.5)$$
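where $p_j$ is the frequency for SNP j in a total of M SNPs. The variance estimate for each SNP j is
$$\hat{\sigma}_j^2 = \frac{1}{N-2}\, \mathbf{S}_j^T \left[ \mathbf{K}^{-1} - \mathbf{K}^{-1}\mathbf{C}(\mathbf{C}^T\mathbf{K}^{-1}\mathbf{C})^{-1}\mathbf{C}^T\mathbf{K}^{-1} \right] \mathbf{S}_j \qquad (4.6)$$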
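With $\hat{\mathbf{K}}$ and $\hat{\sigma}_j^2$, the Wald test can be readily constructed as follows,
$$\hat{\boldsymbol{\beta}}_j = (\mathbf{C}^T\mathbf{K}^{-1}\mathbf{C})^{-1}\mathbf{C}^T\mathbf{K}^{-1}\mathbf{S}_j \qquad (4.7)$$
$$\operatorname{var}(\hat{\boldsymbol{\beta}}_j) = \hat{\sigma}_j^2\, (\mathbf{C}^T\mathbf{K}^{-1}\mathbf{C})^{-1} \qquad (4.8)$$
$$T = \hat{\beta}_{2j}^2 \,/\, \operatorname{var}(\hat{\boldsymbol{\beta}}_j)_{[2,2]} \qquad (4.9)$$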
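Equations 4.5 to 4.9 translate fairly directly into matrix code. The sketch below, in Python with NumPy, is one possible implementation and not the software used in this work. The function names, the argument `p` (allele frequencies assumed to be supplied by the user), and the use of `numpy.linalg.solve` in place of explicit inverses of K are all assumptions made for illustration.

```python
import numpy as np


def kinship_matrix(G, p):
    """Equation 4.5. G is an N x M array of allele counts (0/1/2) and p holds
    the M allele frequencies. Note: if p is replaced by the in-sample
    frequencies, every standardized column sums to zero and the resulting
    matrix is singular, so it could not be inverted as is."""
    G = np.asarray(G, dtype=float)
    p = np.asarray(p, dtype=float)
    if np.any((p <= 0.0) | (p >= 1.0)):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    Z = (G - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))  # standardized genotypes
    return Z @ Z.T / G.shape[1]


def bourgain_wald(S, status, K):
    """Equations 4.6 to 4.9 for one SNP. S: length-N allele counts,
    status: length-N 1/0 case-control indicator, K: N x N matrix.
    Returns (beta_hat, T)."""
    S = np.asarray(S, dtype=float)
    N = S.shape[0]
    C = np.column_stack([np.ones(N), np.asarray(status, dtype=float)])  # (4.1)
    Kinv_C = np.linalg.solve(K, C)                 # K^{-1} C
    Kinv_S = np.linalg.solve(K, S)                 # K^{-1} S_j
    A = np.linalg.inv(C.T @ Kinv_C)                # (C^T K^{-1} C)^{-1}
    beta = A @ (C.T @ Kinv_S)                      # (4.7)
    # S^T [K^{-1} - K^{-1} C A C^T K^{-1}] S, written without forming K^{-1}
    quad = S @ Kinv_S - (C.T @ Kinv_S) @ beta
    sigma2 = quad / (N - 2)                        # (4.6)
    var_beta = sigma2 * A                          # (4.8)
    T = beta[1] ** 2 / var_beta[1, 1]              # (4.9)
    return beta, T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, M = 60, 3000
    p = rng.uniform(0.1, 0.9, size=M)              # hypothetical frequencies
    G = rng.binomial(2, p, size=(N, M))
    status = np.r_[np.ones(N // 2), np.zeros(N // 2)]
    K_hat = kinship_matrix(G, p)
    beta_hat, T = bourgain_wald(G[:, 0], status, K_hat)
    print("beta_hat:", beta_hat, "Wald statistic:", T)
```

With K set to the identity matrix the estimate of the second coefficient reduces to the difference in mean allele count between cases and controls, which is a convenient sanity check.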
The Wald test follows a 1 df $\chi^2$ distribution assuming the SNP under consideration is
not too rare. The ncp can be directly computed from the above Wald test given an assumed form
of the $\mathbf{K}$ matrix. The Balding-Nichols model [10] is widely used to characterize allele frequency
differences between populations. The allele frequencies in modern-day populations, $p_1$,
follow a beta distribution $B\left(\frac{1-F}{F}p_0,\ \frac{1-F}{F}(1-p_0)\right)$ with $p_0$ being the allele frequency in the
ancestral population and $\operatorname{var}(p_1) = Fp_0(1-p_0)$, where F is the separation parameter
characterizing how far the modern population deviates from the ancestral population, i.e., the
greater F, the more the allele frequencies in modern-day populations differ from those in the
ancestral population. The statistical properties of the ncp for the Bourgain test (equation 4.9) are
discussed below for two types of admixed populations, i.e., complete admixture and incomplete admixture.
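The beta parameterization of the Balding-Nichols model can be checked by simulation. The sketch below draws modern-day allele frequencies and compares their empirical mean and variance with $p_0$ and $Fp_0(1-p_0)$. The helper name and the values of $p_0$ and F are hypothetical choices for illustration only.

```python
import numpy as np


def balding_nichols_freqs(p0, F, size, rng):
    """Draw modern-day allele frequencies p1 ~ Beta((1-F)/F * p0, (1-F)/F * (1-p0))."""
    if not (0.0 < p0 < 1.0 and 0.0 < F < 1.0):
        raise ValueError("p0 and F must lie strictly between 0 and 1")
    scale = (1.0 - F) / F
    return rng.beta(scale * p0, scale * (1.0 - p0), size=size)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    p0, F = 0.3, 0.1                     # hypothetical values
    p1 = balding_nichols_freqs(p0, F, 200_000, rng)
    print("mean:", p1.mean(), "expected:", p0)
    print("variance:", p1.var(), "expected:", F * p0 * (1.0 - p0))
```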
4.4.2 The Power of the Bourgain Test for Studies of Completely Admixed Populations
Consider cases and controls sampled separately from two modern-day completely
admixed populations, both derived from the same two ancestral populations. The first
modern-day population (supplying cases) has fractions $\alpha_1$ and $1-\alpha_1$ from the two ancestral
populations, while the second modern-day population (supplying controls) has proportions
$\alpha_2$ and $1-\alpha_2$. Complete admixture implies there is no within-group admixture heterogeneity
among either cases or controls. Under the Balding-Nichols model, the $\mathbf{K}$ matrix has diagonal
elements equal to $1 + F(1 + 2\alpha_i^2 - 2\alpha_i)$ (i=1 for cases and 2 for controls) and off-diagonal elements
equal to $2F(1 + 2\alpha_i^2 - 2\alpha_i)$ if both subjects are from modern-day population 1 (i=1) or both from
population 2 (i=2), otherwise equal to $2F(1 + 2\alpha_1\alpha_2 - \alpha_1 - \alpha_2)$. Assuming this form of
the $\mathbf{K}$ matrix, the denominator of the Wald test, $\operatorname{var}(\hat{\boldsymbol{\beta}}_j)_{[2,2]}$, is shown by Chen et al. to be
bounded from below by $8p_0(1-p_0)F(\alpha_1-\alpha_2)^2$ as $N \to \infty$, unless $F = 0$ or $\alpha_1 = \alpha_2$, although
neither case would be interesting since there would be no admixture differences between cases and
controls. Otherwise, the ncp is bounded from above. For instance, let $\alpha_1 = 0.8$ and $\alpha_2 = 0.75$,
mimicking 80% and 75% of African ancestry in the AAMM cases and controls, with a separation F
similar to the distance between the HapMap YRI and CEU populations, equal numbers of cases
and controls (N=5000 in each), and fixed values of the ancestral allele frequency $p_0$, the type I error
rate, and the true effect size. The ncp for this complete admixture example is 72.78, a considerable loss of
power compared to ncp=208.33 if there is no admixture while all the other parameters
are the same. This exemplifies a situation where the power improvement through increasing
sample sizes is counterbalanced by the need to adjust for distinct admixture fractions between
cases and external controls. There exists a bound for the ncp that may limit the feasibility of
reusing existing genotype data in admixed populations.
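The behavior of the variance under complete admixture can be illustrated numerically. The sketch below builds the K matrix described in this section, evaluates the [2,2] element of $\sigma_j^2(\mathbf{C}^T\mathbf{K}^{-1}\mathbf{C})^{-1}$, and compares it with the lower bound $8p_0(1-p_0)F(\alpha_1-\alpha_2)^2$. Several things here are assumptions and not taken from the text: the values of F and $p_0$ are hypothetical, $\sigma_j^2$ is set to $2p_0(1-p_0)$ (the value implied by the form of the bound), and the function names are illustrative. It makes no attempt to reproduce the ncp values quoted above.

```python
import numpy as np


def complete_admixture_K(n_cases, n_controls, alpha1, alpha2, F):
    """K matrix for completely admixed cases (fraction alpha1) and controls (alpha2)."""
    alpha = np.r_[np.full(n_cases, alpha1), np.full(n_controls, alpha2)]
    # Off-diagonal: 2F(1 + 2*a_i*a_k - a_i - a_k); this equals
    # 2F(1 + 2*a_i^2 - 2*a_i) when both subjects share the same fraction.
    K = 2.0 * F * (1.0 + 2.0 * np.outer(alpha, alpha) - alpha[:, None] - alpha[None, :])
    # Diagonal: 1 + F(1 + 2*a_i^2 - 2*a_i)
    np.fill_diagonal(K, 1.0 + F * (1.0 + 2.0 * alpha**2 - 2.0 * alpha))
    return K


def var_beta2(K, status, sigma2):
    """[2,2] element of sigma2 * (C^T K^{-1} C)^{-1}."""
    C = np.column_stack([np.ones(len(status)), status])
    A = np.linalg.inv(C.T @ np.linalg.solve(K, C))
    return sigma2 * A[1, 1]


def lower_bound(p0, F, alpha1, alpha2):
    """Limiting value 8 p0 (1 - p0) F (alpha1 - alpha2)^2 as N grows."""
    return 8.0 * p0 * (1.0 - p0) * F * (alpha1 - alpha2) ** 2


if __name__ == "__main__":
    p0, F = 0.3, 0.1                   # hypothetical values
    alpha1, alpha2 = 0.80, 0.75        # ancestry fractions used in the text
    sigma2 = 2.0 * p0 * (1.0 - p0)     # assumed per-SNP variance
    print("bound:", lower_bound(p0, F, alpha1, alpha2))
    for n in (100, 400, 1600):
        K = complete_admixture_K(n, n, alpha1, alpha2, F)
        status = np.r_[np.ones(n), np.zeros(n)]
        print(f"n per group = {n:>4}: var = {var_beta2(K, status, sigma2):.6f}")
```

Because the variance cannot fall below the bound, an ncp of the form $\beta_{2j}^2/\operatorname{var}(\hat{\boldsymbol{\beta}}_j)_{[2,2]}$ stops growing once the sampling term becomes small relative to the bound.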
4.4.3 The Power of the Bourgain Test for Studies of Incompletely Admixed Populations
Requiring all individuals in a large admixed population to have homogeneous ancestral
fractions may not be a realistic assumption. Unlike complete admixture,
incomplete admixture allows each individual in a modern-day population to have a different
percentage of a given ancestry. The individual admixture fractions can be modeled by beta
distributions, $\alpha_i \sim B\left(\frac{\alpha}{h},\ \frac{1-\alpha}{h}\right)$, where $\alpha$ is the average fraction from the same ancestral
population and h characterizes the heterogeneity of ancestry fractions within the same
modern-day population [7]. The beta distributions are flexible enough to accommodate a variety of
ancestry fractions arising in incomplete admixture (Figure 4.2). Consider a range of values of h,
where h close to 0 resembles complete admixture in both populations described in section 4.4.2
and h close to 1 resembles no admixture differences between the two. As h increases,
within-population heterogeneity becomes larger (“fatter” density curves) and there is a growing overlap
in terms of ancestral proportions between the two admixed populations (i.e., cases and controls).
A complete overlap would suggest no adjustment for ancestry admixture is necessary. Since each
individual now has a distinct $\alpha_i$, the $\mathbf{K}$ matrix has diagonal items equal to $1 + F(1 + 2\alpha_i^2 - 2\alpha_i)$ for
subject i, whereas off-diagonal items are $2F(1 + 2\alpha_i\alpha_k - \alpha_i - \alpha_k)$ for subjects i and k with
fractions of $\alpha_i$ and $\alpha_k$ from the first ancestral population. It is not immediately clear from
the derivation whether or not there exists a bound for these ncp’s. Instead, simulation results for
various values of h are shown in Figure 4.3. The behavior of the ncp is heavily governed by the
heterogeneity parameter h. The ncp’s are 94.84 (h=0.001), 169.70 (h=0.01), 203.87 (h=0.1), and
207.39 (h=0.2), falling between ncp=72.78 (complete admixture) and 208.33 (no admixture).
Studies in admixed populations with large within-population heterogeneity are better powered
than those with small within-population heterogeneity under the same settings of the Bourgain
test; power improvement is less adversely impacted by the theoretical bounds for the ncp in
incompletely admixed populations with more heterogeneous admixture fractions.
Recall the box plots of admixture percentages for the three ancestral populations (Figure 4.1).
The average African or European ancestry percentages were very close between cases and
controls, whereas the heterogeneity within study center was more pronounced (h>0.1). Therefore
the ncp curve should lie close to the “no admixture” curve in Figure 4.3, suggesting very little
loss of power after correcting for population admixture. According to the Bourgain test, the reuse
of a large number of African American controls does not appear to compromise the power to
detect true associations.
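The dependence of the ncp on the heterogeneity parameter h can be explored with a small simulation. The sketch below draws individual ancestry fractions from $B(\alpha/h, (1-\alpha)/h)$, builds the corresponding K matrix, and evaluates an ncp of the form $\beta_{2j}^2/\operatorname{var}(\hat{\boldsymbol{\beta}}_j)_{[2,2]}$. The values of F, $p_0$, the effect size, and the sample size are hypothetical, $\sigma_j^2 = 2p_0(1-p_0)$ is an assumption, and the function names are illustrative, so the output is not expected to match the ncp values reported above.

```python
import numpy as np


def admixture_K(alpha, F):
    """K matrix for subjects with individual ancestry fractions alpha."""
    alpha = np.asarray(alpha, dtype=float)
    K = 2.0 * F * (1.0 + 2.0 * np.outer(alpha, alpha) - alpha[:, None] - alpha[None, :])
    np.fill_diagonal(K, 1.0 + F * (1.0 + 2.0 * alpha**2 - 2.0 * alpha))
    return K


def simulate_ncp(h, n_per_group, alpha1, alpha2, F, p0, beta2, rng):
    """ncp = beta2^2 / var(beta_hat)[2,2] for one random draw of ancestry fractions."""
    a_cases = rng.beta(alpha1 / h, (1.0 - alpha1) / h, size=n_per_group)
    a_controls = rng.beta(alpha2 / h, (1.0 - alpha2) / h, size=n_per_group)
    alpha = np.r_[a_cases, a_controls]
    status = np.r_[np.ones(n_per_group), np.zeros(n_per_group)]
    C = np.column_stack([np.ones(2 * n_per_group), status])
    K = admixture_K(alpha, F)
    A = np.linalg.inv(C.T @ np.linalg.solve(K, C))
    var_b2 = 2.0 * p0 * (1.0 - p0) * A[1, 1]   # assumed sigma^2 = 2 p0 (1 - p0)
    return beta2**2 / var_b2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for h in (0.001, 0.01, 0.1, 0.2):
        ncp = simulate_ncp(h, 500, 0.80, 0.75, F=0.1, p0=0.3, beta2=0.1, rng=rng)
        print(f"h = {h:<5}: ncp = {ncp:.2f}")
```

Larger h spreads the ancestry fractions within each group, which lets the shared ancestry effect be separated from case-control status and therefore moves the ncp toward its no-admixture value.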
Figure 4.2 (Modified from Chen et al.) Probability density plots of beta distributions with
different levels of heterogeneity. Solid line: $\alpha_1 = 0.8$ and dashed line: $\alpha_2 = 0.75$ for fractions of
African ancestry in two incompletely admixed populations.
Figure 4.3 (Modified from Chen et al.) Non-centrality parameters of the Bourgain test in a case-
control study where cases and controls are from two populations with varying degrees of
admixture.
4.4.4 Comparing the Bourgain test with PCA and Genomic Control
The Bourgain test has reasonable power compared to other commonly used adjustment
methods. In the situation of complete admixture, although the Bourgain test is likely to lose some
power due to adjustment for admixture and the ncp is anticipated to be bounded, the PCA
method after adjusting for the leading PC will leave little power because the first PC simply
captures the difference in admixture fractions between the two modern-day populations, which is
equivalent to the case-control status. In contrast, genomic control (GC) still has some power to
detect true associations after adjusting for the complete admixture, depending on the magnitude
of true effect sizes.
In less extreme examples, PCA would regain power because PCs are sensitive to
admixture differences. PCA uses the same $\hat{\mathbf{K}}$ matrix to compute PCs as the Bourgain test
uses to model the covariance structure. PCs are treated as fixed effects when included as
covariates in regression models, whereas in the Bourgain test the genetic relatedness is treated as a
random effect modeled in the variance. In a small simulation of 2200 SNPs including 100
causal SNPs, with 200 cases and 200 controls, in which both population stratification (F=.1 between
two modern-day populations) and pedigree structure (50 families with 8 members in each) were
present, both PCA and the retrospective model protected the type I error rate well
under the null hypothesis (over-dispersion $\lambda = 1.06$ for PCA and $\lambda = 1$ for
PCA+GC vs. $\lambda = .999$ for the retrospective model; Figure 4.4a), and both appeared to have similar
power under the alternative hypothesis (Figure 4.4b) (Stram 2014). Nonetheless, more
simulations are warranted to fully evaluate the performance of these two correction methods
under different parameter settings, e.g., larger numbers of SNPs and samples to mirror those
in GWAS, smaller F separation levels, more complex stratification and admixture structure, etc.
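The over-dispersion factor λ quoted above is not defined explicitly in this section. The sketch below assumes the usual genomic-control definition, namely the median of the observed 1 df chi-square statistics divided by the median of the $\chi^2_1$ distribution (about 0.4549), and shows the corresponding GC rescaling. The function names are illustrative, and flooring λ at 1 before rescaling is a common convention that is assumed here.

```python
import numpy as np

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Over-dispersion lambda: median observed statistic / median of chi-square(1)."""
    stats = np.asarray(chi2_stats, dtype=float)
    return float(np.median(stats) / CHI2_1DF_MEDIAN)


def genomic_control(chi2_stats):
    """Rescale statistics by lambda (floored at 1, an assumed convention)."""
    stats = np.asarray(chi2_stats, dtype=float)
    lam = max(inflation_factor(stats), 1.0)
    return stats / lam


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    null_stats = rng.chisquare(1, size=100_000)
    inflated = 1.5 * null_stats          # hypothetical inflated test statistics
    print("lambda (null):     ", round(inflation_factor(null_stats), 3))
    print("lambda (inflated): ", round(inflation_factor(inflated), 3))
    print("lambda (after GC): ", round(inflation_factor(genomic_control(inflated)), 3))
```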
Figure 4.4 (Adapted from Stram 2014) a, QQ plot of p-values for 2100 non-causal SNPs. Over-dispersion
λ’s are equal to 1.93 (Armitage trend test without correction, black solid line), 1.06 (adjusting for the
1st PC, blue dashed line), 1 (first PC+GC, green solid line), and .999 (retrospective model, red solid line).
b, QQ plot of p-values for 100 causal SNPs in simulations (Armitage trend test without correction, black
solid line; 1st PC+GC, green solid line; retrospective model, red solid line).
4.5 Additional Issues in the Reuse of Existing Data
In addition to adjusting for differences in population structure between cases and controls
when existing data are reused, researchers should also verify that the external data were
processed in the same fashion as their own data, e.g., DNA sample preparation, genotyping
platforms, quality control (QC) criteria, imputation, etc. A glaring mistake was seen in a GWAS
of exceptional longevity in centenarians, in which 90% of “cases”, the centenarians, were
genotyped using the Illumina 370 CNV chip and the remaining ~10% of cases were genotyped
on the Illumina 610 Quad chip, while controls were extracted from the Illumina Control
Database (iControlDB), presumably genotyped on assorted Illumina arrays [11]. The study, first
appearing in Science but later retracted, concluded that 70 genome-wide statistically significant
SNPs had been identified and that a total of 150 SNPs could predict one’s likelihood of living to 100
years with 77% accuracy. The flaws stemmed from known glitches of the Illumina 610 Quad
chip, which failed to distinguish the major and minor alleles at some SNPs, resulting in artifactual
allele frequency differences between cases and controls. The Manhattan plot and the QQ plot of
single-SNP associations (Figure 4.5 a, b) implied that most of these ostensibly significant results could
be nothing but false positives. This example is not meant to curb enthusiasm for utilizing
external data; rather, it underscores the extreme caution required
in handling large and complex GWAS data. As genotyping technologies advance, chips widely
used today will become outdated in the near future. Comparing and possibly
combining GWAS data genotyped at different times and/or on different chips will be inevitable.
Figure 4.5 (From Sebastiani et al.) a (Top) the Manhattan plot shows a large number of
significantly associated SNPs across the genome; b (Bottom) the QQ plot reveals an inflation of
association tests. Neither is quite plausible in a calibrated GWAS that has eliminated systematic
allele frequency differences.
In our AAMM study, cases were genotyped using the Illumina HumanCore BeadChip
(~300,000 SNPs) while controls were assayed on the Illumina Human1M-Duo chips (~1 million
SNPs). We implemented stringent quality control procedures to eliminate any differential
genotyping errors arising from cases and controls being typed on different chips. To assess the
reproducibility of genotype calls, DNA samples for 100 subjects were assayed on both chips and
SNPs with any discordant genotype for the same person were dropped from analyses
(concordance rate = 1). For the 188,376 SNPs that were successfully merged and passed QC,
there was no identifiable evidence of false-positive associations of the kind seen in the
“centenarians” study (Figure 4.6).
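The concordance filter described above, which drops any SNP with at least one discordant call among the subjects assayed on both chips, could be sketched as follows. The genotype coding (0/1/2 allele counts, with -1 for a missing call), the decision to ignore pairs in which either call is missing, and the function name are assumptions made for illustration; the text does not say how missing calls were handled.

```python
import numpy as np


def concordant_snps(calls_a, calls_b, missing=-1):
    """Return a boolean mask over SNPs (columns) that have no discordant
    genotype between the two chips for any duplicated subject (rows).

    calls_a, calls_b: subjects x SNPs arrays of allele counts for the same
    subjects and the same SNPs, one array per chip."""
    a = np.asarray(calls_a)
    b = np.asarray(calls_b)
    if a.shape != b.shape:
        raise ValueError("both arrays must have the same shape")
    both_called = (a != missing) & (b != missing)   # assumed handling of missing calls
    discordant = both_called & (a != b)
    return ~discordant.any(axis=0)


if __name__ == "__main__":
    chip_1 = np.array([[0, 1, 2], [1, 1, -1]])      # toy data, 2 subjects x 3 SNPs
    chip_2 = np.array([[0, 2, 2], [1, 1, 0]])
    keep = concordant_snps(chip_1, chip_2)
    print("SNPs kept:", np.flatnonzero(keep))
```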
Figure 4.6 QQ plot of the Armitage trend test p-values for 188,376 overlapping SNPs genotyped on
both chips ($\lambda = 1.02$)
While the possibility of differential genotyping errors has been ruled out, some pressing
issues remain to be addressed. The overlapping 188,376 SNPs only account for 2/3 of the total
SNPs genotyped for cases, a sizeable loss of data that will reduce the
power of association tests. These typed SNPs amount to only a small fraction of
total genetic variation, and the consequence is a very porous coverage of the genome. The
ability to tag unmeasured yet potentially causal SNPs using these SNPs is further limited by the
shorter LD blocks and greater number of variants in the genomes of African populations.
Even though we can increase the total number of tested variants through imputation, the
reliability of SNPs imputed from so few genotyped SNPs is still questionable. If we try to expand the
imputation basis from the 188,000 SNPs above to any larger number, then imputation
must be performed separately for the cases and the controls, i.e., the most powerful imputation
approach would seem to be to use the 260,000 SNPs on the Illumina HumanCore BeadChip to
impute to the 1000 Genomes for the cases, and to use the (already existing) imputation data
available for the AABC and AAPC controls.
If imputed SNPs are inherently less reliable, performing imputation separately and using
different SNPs for the MM study implies that imputation error may be differential between cases
and controls. This could present an enormous challenge to any statistical model in terms of
controlling the type I error at the desired level and attaining adequate power to identify true
associations. Some thoughts on approaching these problems are outlined in Chapter 5.
4.6 Chapter 4 references
1. Kolonel, L.N., et al., A multiethnic cohort in Hawaii and Los Angeles: baseline characteristics. Am J
Epidemiol, 2000. 151(4): p. 346-57.
2. Landgren, O. and B.M. Weiss, Patterns of monoclonal gammopathy of undetermined significance and
multiple myeloma in various ethnic/racial groups: support for genetic factors in pathogenesis. Leukemia,
2009. 23: p. 1691-1697.
3. Broderick, P., et al., Common variation at 3p22.1 and 7p15.3 influences multiple myeloma risk. Nat Genet,
2011. 44: p. 58-61.
4. Chubb, D., et al., Common variation at 3q26.2, 6p21.33, 17p11.2 and 22q13.1 influences multiple myeloma
risk. Nat Genet, 2013. 45: p. 1221-5.
5. Campbell, M.C. and S.A. Tishkoff, African genetic diversity: implications for human demographic history,
modern human origins, and complex disease mapping. Annu Rev Genomics Hum Genet, 2008. 9: p. 403-
33.
6. Pritchard, J.K., M. Stephens, and P. Donnelly, Inference of population structure using multilocus genotype
data. Genetics, 2000. 155: p. 945-959.
7. Chen, G.K., et al., The Potential for Enhancing the Power of Genetic Association Studies in African
Americans through the Reuse of Existing Genotype Data. PLoS Genet, 2010. 6(9): p. 13-13.
8. Bourgain, C., et al., Novel case-control test in a founder population identifies P-selectin as an
atopy-susceptibility locus. Am J Hum Genet, 2003. 73(3): p. 612-626.
9. Rakovski, C.S. and D.O. Stram, A kinship-based modification of the armitage trend test to address hidden
population structure and small differential genotyping errors. PLoS One, 2009. 4(6): p. e5825.
10. Balding, D.J. and R.A. Nichols, A method for quantifying differentiation between populations at multi-
allelic loci and its implications for investigating identity and paternity. Genetica, 1995. 96(1-2): p. 3-12.
11. Sebastiani, P., N. Solovieff, and A.T. DeWan, RETRACTED - Genetic signatures of exceptional longevity
in humans. Sciencexpress, 2010(July).
5 Evaluating Inflation in Test Statistics Arising from Different Imputation Strategies
5.1 Motivation
Beyond replication and fine-mapping in African Americans of the known multiple myeloma (MM)
loci identified in Europeans, our primary interest has become a
comprehensive genome-wide scan for novel MM susceptibility loci among African Americans.
In this AAMM GWAS, a new batch of 282 MM cases was genotyped on the same HumanCore
chip as the existing 1049 MM cases in the previous MM fine-mapping study (Chapter 3); the
same data for the 7080 AABC+AAPC controls were reused, consisting of ~1 million SNPs
typed on the Illumina Human1M-Duo chip plus imputed variants from the 1000 Genomes Project
(KGP). Approximately the same ~188,000 genotyped SNPs as described in Chapter 4 remained
after combining all cases and controls and implementing the same stringent QC procedures;
therefore the problem of losing a considerable share of data persisted. This chapter compares a
variety of imputation strategies aimed at effectively expanding the data for association testing. The
results will serve as background material to justify the analysis plans for the new AAMM
GWAS paper.
5.2 Genomic Coverage Assessments for the 188k Genotyped SNPs
A fundamental question that is worth asking first is how good (or bad) the coverage of
the whole genome is for these 188k SNPs after imputation, through leveraging LD between the
tag SNPs designed on the chip and all genetic variants in the genome [1]. The HumanCore
BeadChips used to genotype MM cases were reported to capture 26% of all KGP variants with
a minor allele frequency (MAF) > 0.05 in the YRI population, and 15% of variants with MAF >
0.01, at r2 > 0.8 (Illumina Product Data Sheet, 2013), only capturing a rather limited proportion
of total variants in either MAF category. We performed similar calculations by binning all KGP
variants into three MAF brackets (i.e., min-1%, 1-5%, and >5% measured in KGP AFR samples)
and by minimum imputation quality score (i.e., > 0.8, > 0.9, > 0.95, and > 0.99), and then
computed the proportion of KGP variants imputed on the basis of the 188k SNPs for each
combination of the MAF brackets and minimum info scores (Table 5.1). For those 5.4 million
very rare variants (MAF<1% not counting monomorphic and singleton variants), only 16.2%
were imputed with info scores > 0.8; the coverage rates plummeted steeply as more stringent
imputation quality was required, e.g., only 1.1% and 0.1% of genomic coverage for the minimum
info score equal to 0.95 and 0.99, respectively. For common (MAF≥5%) and less common
variants (MAF: 1%-5%), the genomic coverage at info>0.8 improved substantially, especially
for the common ones: 72.7% of KGP variants with MAF>5% were captured at info>0.8. The
coverage rates decreased less steeply compared to those with MAF<1%, an indication of overall
greater imputation certainty for more common genetic variants. Combining all KGP variants
with MAF>1%, close to 60% (58.8%) were well imputed with info scores above 0.8 and about
one third (32.6%) with info>0.9. These estimates delineated the boundaries of attainable
genomic coverage we could leverage by using this HumanCore chip.
Table 5.1 Genomic coverage of the 188k SNPs by minor allele frequency and info score.
MAF           N Variants+    % of SNPs with info* greater than
                             0.80     0.90     0.95     0.99
(min-0.01)     5,412,837     16.2%     4.4%     1.1%     0.1%
[0.01-0.05)    7,266,569     41.3%    15.2%     4.8%     0.4%
[0.05-0.50]    9,182,167     72.7%    46.4%    26.8%     6.9%
[0.01-0.50]   16,448,736     58.8%    32.6%    17.1%     4.0%
+ number of variants with MAFs falling into each bracket measured in KGP AFR populations
* info scores were estimated in IMPUTE2.
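The tabulation behind Table 5.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the thesis: the function names `maf_bracket` and `coverage_table` and the input layout (one `(maf, info)` pair per reference variant) are hypothetical, and the sketch excludes monomorphic variants but does not reproduce the exclusion of singletons.

```python
# Hypothetical sketch of the Table 5.1 tabulation; names and inputs are illustrative.
INFO_THRESHOLDS = (0.80, 0.90, 0.95, 0.99)


def maf_bracket(maf):
    """Map a minor allele frequency to one of the three brackets used in Table 5.1."""
    if 0.0 < maf < 0.01:
        return "(min-0.01)"
    if 0.01 <= maf < 0.05:
        return "[0.01-0.05)"
    if 0.05 <= maf <= 0.50:
        return "[0.05-0.50]"
    return None  # monomorphic or out-of-range values are not counted


def coverage_table(variants):
    """variants: iterable of (maf, info) pairs, one per reference-panel variant.

    Returns {bracket: (n_variants, [share with info > t for each threshold])}.
    """
    infos_by_bracket = {}
    for maf, info in variants:
        label = maf_bracket(maf)
        if label is not None:
            infos_by_bracket.setdefault(label, []).append(info)
    table = {}
    for label, infos in infos_by_bracket.items():
        shares = [sum(x > t for x in infos) / len(infos) for t in INFO_THRESHOLDS]
        table[label] = (len(infos), shares)
    return table
```

The denominator in each bracket is every reference variant in that bracket, whether or not it was imputed well, which is what makes the shares interpretable as genomic coverage.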
Nelson et al. compared the genomic coverage rates for eight common genotyping arrays,
including the HumanCore, OmniExpress, Omni2.5M, and Omni5M, etc., using KGP data [2].
They randomly selected 90% of the samples in KGP as reference and masked KGP variants not
present on a certain array for the remaining 10% samples. Knowing both the observed genotypes
and imputed dosages for those masked variants, they were able to compute the r2 metric (squared
correlation between actual genotype and imputed dosage) to assess genomic coverage across
different genotyping arrays (Table 5.2). It is evident that the other arrays all had better coverage
than the HumanCore across the MAF spectrum, mostly due to their larger backbones, e.g.,
~720,000 variants for the OmniExpress and ~2.5 million for the Omni2.5M. Comparing the
OmniExpress to the HumanCore, the coverage nearly tripled (35.0% vs. 12.4%) for less common
variants (MAF: 1%-5%) and nearly doubled for those with MAF>1% (56.1% vs. 31.0%). Investing in
more comprehensive genotyping arrays will help improve the overall genomic coverage.
Nonetheless, given the fact that our controls were typed using the Human1M, opting for arrays
larger than 1 million variants would not be cost-efficient either, since additional variants would
be dropped when merging genotypes between cases and controls.
Table 5.2 Genomic coverage estimates between HumanCore and other arrays.
              Based on 188k    Based on KGP by Nelson et al. [2]
              (info > 0.9)+    (r2 > 0.8)*
MAF           HumanCore        HumanCore   OmniExp   Omni2.5M   Omni5M
(min-0.01)     4.4%             8.1%       18.4%     27.6%      33.3%
[0.01-0.05)   15.2%            12.4%       35.0%     57.6%      63.0%
[0.05-0.50]   46.4%            45.8%       72.8%     87.7%      89.4%
[0.01-0.50]   32.6%            31.0%       56.1%     74.4%      77.8%
+ info scores were estimated in IMPUTE2, same as Table 5.1 under column “0.9”.
* Pearson r2 between the imputation posterior probabilities and the observed genotypes
5.3 Quantifying Imputation Quality Using the Info Score and R2
Another interesting observation from Table 5.2 is that the coverage estimates for the
HumanCore by Nelson et al. were very similar to ours evaluated at info score >0.9 in the
imputation based on the 188k SNPs. This resemblance prompted us to ask whether the info score
provided in IMPUTE2 output and the correlation r2 were related and, if so, whether the relationship
varied by MAF or imputation basis. Note that when restricting the imputation basis to the
overlapping 188k SNPs, we dropped ~72k SNPs Typed in cAses (referred to as TA SNPs) and
~800k Typed in cOntrols (TO SNPs). For either TA or TO SNPs, in addition to info
scores, we could compute the correlation between the imputation dosage and the actual
genotypes not used for imputation (not in the 188k). This provided an alternative means to assess
the reliability of imputation.
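As a rough illustration of the r2 metric for a single masked SNP, the following sketch computes the squared Pearson correlation between observed genotypes (coded 0/1/2) and imputed dosages (between 0 and 2). The function name `squared_correlation` and the list-based inputs are assumptions made for illustration; the thesis does not show its own implementation.

```python
# Hypothetical sketch: squared Pearson correlation between true genotypes
# and imputed dosages for one SNP, using only the standard library.
def squared_correlation(genotypes, dosages):
    """Return r2 between genotypes (0/1/2) and dosages (0..2); NaN if undefined."""
    n = len(genotypes)
    if n != len(dosages) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_g = sum(genotypes) / n
    mean_d = sum(dosages) / n
    sxy = sum((g - mean_g) * (d - mean_d) for g, d in zip(genotypes, dosages))
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((d - mean_d) ** 2 for d in dosages)
    if sxx == 0 or syy == 0:
        return float("nan")  # no variation in genotypes or dosages: r2 is undefined
    return (sxy * sxy) / (sxx * syy)
```

Unlike the info score, this quantity requires the true genotypes, which is why it can only be evaluated for SNPs that were typed but withheld from the imputation basis.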
Figure 5.1A shows that the info scores for the TA/TO SNPs were comparable to those of
the KGP variants in the same MAF bracket in terms of the median, interquartile range, and
maximum info score. The minimum info scores for the TA and TO SNPs were greater than 0.4
and 0.2, respectively, whereas the minimum info scores for the KGP variants were as low as zero even for the
more common variants (MAF>5%), in large part owing to the limited coverage of the 188k
SNPs discussed before. No TA or TO SNPs had a MAF below 1%. Figure 5.1B further
compares the info scores (white boxes) with the r2 metrics (gray boxes) in pairs for the same
subsets of TA/TO SNPs. The percentiles for the info score were uniformly greater than for the r2
metric regardless of types of SNPs or MAF brackets; more common SNPs had both greater info
scores and r2 metrics than less common SNPs; for either info scores or r2, the case-control
differences were trivial.
Figure 5.1 Box plots comparing the distributions of the info score and the r2. A. Distributions of info scores (white boxes) for each
subset of imputed variants. TO, SNPs typed in controls only; TA, SNPs typed in cases only; AFR, all KGP variants with
corresponding MAF estimated in AFR samples; [0.05-0.50], [0.01-0.05), and (min-0.01) are ranges of MAFs. B. r2 (gray boxes) is the
squared Pearson correlation between posterior probability and actual genotypes for each variant; info scores were obtained from
IMPUTE2.
Figure 5.2 Linear correlation between the info scores and r2. A. Subplots by SNP type and MAF bracket: TO, SNPs typed in controls
only; TA, SNPs typed in cases only; [0.05-0.50] and [0.01-0.05) are ranges of MAFs. B. Overlay of the left four subplots.
To better characterize the relationship between these two imputation quality metrics, we
regressed r2 against info score, with stratification by MAF and TA/TO SNP type (Figure 5.2 A,
B). Once again, the results confirmed that info scores and r2 were well correlated for all four
subsets, except for a few poorly imputed variants (info<0.4) that were not of primary interest
after all. The slopes of the regression lines were almost identical, revealing a consistently linear
relation approximated as
$$r^2 = 2 \times \mathrm{Info} - 1, \quad \text{for info} > 0.5$$
Info scores equal to 0.6, 0.8, 0.9 and 0.95 can be approximately translated to r2 equal to
0.2, 0.6, 0.8 and 0.9. This linear relationship explains the similarity in the
calculations of genomic coverage between our info > 0.9 and Nelson et al.'s r2 > 0.8 in Table
5.2. It also assures us that filtering imputed variants at info > 0.8 is a sensible choice [3], as it still
covers ~60% of all KGP variants with MAF>1% (Table 5.1) while maintaining a reasonably
good quality of imputation (r2>0.6). The KGP variants with MAF>1% and imputed with info
score >0.8 based on the 188k typed SNPs (~9.6 million variants in total) were therefore of the
primary interest henceforth.
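The approximate translation between the two metrics can be written as a one-line helper. This is a sketch of the empirical approximation reported above for these data, not a general identity; the function name `approx_r2_from_info` is hypothetical.

```python
# Hypothetical helper for the empirical approximation r2 = 2 * info - 1,
# which the text reports only for info > 0.5 in this data set.
def approx_r2_from_info(info):
    if not 0.5 < info <= 1.0:
        raise ValueError("approximation was only described for 0.5 < info <= 1")
    return 2.0 * info - 1.0


# Info cutoffs quoted in the text and their approximate r2 equivalents.
for cutoff in (0.6, 0.8, 0.9, 0.95):
    print(cutoff, round(approx_r2_from_info(cutoff), 2))
# prints 0.2, 0.6, 0.8 and 0.9 as the second column
```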
5.4 Imputations and the Inflation Factors in Test Statistics for Case-Control Studies
While the imputation quality filtered at info=0.8 is acceptable in general, because in our
study a different platform was used in cases than in controls to genotype the overlapping SNPs,
differential imputation errors could still occur and give rise to false positives in association tests.
To evaluate the extent to which imputation could affect the downstream association analyses, we
adopted several alternative imputation strategies:
1) Imputation based solely on the 188k overlapping SNPs where cases and controls were
imputed together with the same basis;
2) Imputation utilizing all available genotypes in cases and in controls, i.e., imputing to the
KGP for cases based on the ~260k SNPs typed on the HumanCore whereas imputing to
the KGP for controls based on the ~1M SNPs typed on the Human1M Duo;
3) Imputation based on the same 188k SNPs but cases and controls were imputed separately,
somewhat a middle ground between 1) and 2);
4) Examining whether differential imputation errors also existed between the two sources of
controls (AABC and AAPC) that had been imputed independently [4];
5) For the “TA” SNPs (typed in cases only), supplementing the actual genotypes for cases
with the imputed dosages for controls, and conducting a case-control analysis.
5.4.1 Imputation to KGP Based on the 188k Overlapping SNPs
As seen in Section 5.2, leveraging LD between the 188k SNPs typed on both chips and
untyped variants in KGP could capture 58.8% of KGP variants with MAF>1% at info score
>0.8. For these 9.6 million captured variants in the AAMM GWAS, no serious inflation in test
statistics was observed (λ = 1.039). This was extremely close to the inflation observed for the
typed SNPs (λ = 1.02 calculated from the genotypes or λ = 1.042 calculated from the imputed
dosages) (Table 5.3, Figure 5.3). Such mild differences in lambda could be attributed to the
intrinsic randomness of the imputation process rather than to genuine differentiation. Further
filtering out variants with lower MAFs and/or lower info scores did not yield meaningful
changes (λ: 1.036-1.054), roughly within ±0.01 of the inflation factor for the typed SNPs.
Table 5.3 Inflation in test statistics by info score and MAF cutoffs when cases and controls were
imputed together based on the 188k.
minINFO minMAF NSNPs Lambda N(P<5e-08) N(PgcAdj<5e-08)
0.8 0.01 9,652,528 1.039 0 0
0.8 0.02 8,571,586 1.039 0 0
0.8 0.05 6,671,243 1.041 0 0
0.8 0.07 5,892,937 1.041 0 0
0.8 0.1 5,018,797 1.040 0 0
0.9 0.01 5,351,303 1.041 0 0
0.9 0.02 5,025,337 1.041 0 0
0.9 0.05 4,304,567 1.039 0 0
0.9 0.07 3,951,974 1.037 0 0
0.9 0.1 3,507,923 1.036 0 0
0.95 0.01 2,779,485 1.037 0 0
0.95 0.02 2,696,849 1.038 0 0
0.95 0.05 2,477,025 1.038 0 0
0.95 0.07 2,346,754 1.036 0 0
0.95 0.1 2,163,325 1.038 0 0
0.98 0.01 1,153,420 1.048 0 0
0.98 0.02 1,141,584 1.049 0 0
0.98 0.05 1,098,676 1.048 0 0
0.98 0.07 1,066,173 1.045 0 0
0.98 0.1 1,014,225 1.046 0 0
0.99 0.01 616,596 1.052 0 0
0.99 0.02 613,917 1.054 0 0
0.99 0.05 599,359 1.053 0 0
0.99 0.07 586,635 1.051 0 0
0.99 0.1 564,355 1.052 0 0
1 (188k dosage) 0.01 182,906 1.042 0 0
1 (188k dosage) 0.02 182,622 1.042 0 0
1 (188k dosage) 0.05 179,105 1.042 0 0
1 (188k dosage) 0.07 175,383 1.042 0 0
1 (188k dosage) 0.1 168,277 1.043 0 0
Abbreviations: minINFO, minimum info score; minMAF, minimum minor allele frequency; NSNPs, number of
imputed variants with info score > minINFO and MAF > minMAF; Lambda, the median of the 1 df chi-square
statistics for Wald test divided by 0.455; N(P<5e-08), number of false positives with p-value < 5x10^-8;
N(PgcAdj<5e-08), number of false positives with p-value < 5x10^-8 after genomic control by dividing the chi-square
statistic by lambda; 188k dosage, the calculations of inflation were based on the imputation dosages for these 188k
SNPs, compared to the inflation calculated using the actual genotypes, which was λ=1.02.
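The Lambda and N(P...) columns defined in the table notes can be sketched as follows. This is a minimal illustration assuming the 1 df chi-square statistics from the Wald tests are already available as a list; the names `chi2_1df_pvalue` and `inflation_summary` are hypothetical, and the thesis does not show its own code.

```python
import math
import statistics

CHI2_1DF_MEDIAN = 0.455   # median of a 1 df chi-square, as quoted in the table notes
GENOME_WIDE_P = 5e-8      # genome-wide significance threshold used in the tables


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a 1 df chi-square statistic: erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(stat / 2.0))


def inflation_summary(chi2_stats):
    """Return (lambda, N(P<5e-08), N(PgcAdj<5e-08)) for a list of 1 df statistics."""
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    if lam <= 0:
        raise ValueError("lambda must be positive to apply genomic control")
    n_raw = sum(1 for s in chi2_stats if chi2_1df_pvalue(s) < GENOME_WIDE_P)
    # Genomic control: divide each statistic by lambda before computing p-values.
    n_gc = sum(1 for s in chi2_stats if chi2_1df_pvalue(s / lam) < GENOME_WIDE_P)
    return lam, n_raw, n_gc
```

In practice the list passed to `inflation_summary` would first be restricted to variants passing a given pair of minINFO and minMAF cutoffs, one call per table row.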
The small increase in inflation for minINFO =0.98 and 0.99 could be due to numerical
fluctuations given the drastic decrease in the numbers of imputed variants when filtered at such
stringent levels. More importantly, no false positives were observed starting from minINFO> 0.8
and minMAF> 0.01, indicating differential imputation errors were satisfactorily contained when
cases and controls were imputed together. This imputation strategy is conservative in that it only
used the overlapping genotypes with removal of about 1/3 of the SNPs typed exclusively in
cases, resulting in loss of genomic coverage and hence power.
Figure 5.3 Inflation in test statistics by info score and MAF cutoffs when cases and controls were
imputed together based on the 188k. The colors are for different minimum info score cutoffs:
blue (>0.8), magenta (>0.9), dark green (>0.95), red (>0.98), orange (>0.99) and light green
(typed SNPs); x axis denotes MAF cutoffs: 1%, 2%, 5%, 7%, and 10%.
5.4.2 Imputation Based on All Genotyped SNPs for Cases and Controls Separately
Contrary to the previous approach, imputation based on all genotypes did account for the
known genotype data available only for cases or only for controls and therefore could have had the power
to identify what the previous approach failed to do. However, in light of the remarkably different
numbers of SNPs between the imputation bases (260k for cases vs. 1 million for controls),
differential errors occurring in the separate imputation runs seemed to distort the distribution of
test statistics (λ = 1.484), yielding thousands of unlikely genome-wide significant associations
(Table 5.4). Filtering out less well-imputed SNPs helped restore the distribution gradually (Table
5.4, Figure 5.4). The overall inflation in test statistics decreased from λ > 1.4 when filtering
imputed variants at info > 0.8 to λ < 1.1 for much better-imputed variants with info > 0.98; for
variants with MAF>1% and info score >0.99, the inflation came close to its
counterpart filtered at the same criteria when cases and controls were imputed together (λ = 1.068
vs. λ = 1.052). Filtering out less common SNPs also lessened the inflation and reduced the
number of false positives, although to a much lesser degree. While more stringent info score
and/or MAF criteria indeed corrected the inflation to varying degrees, neither could
eliminate false positives altogether even at unrealistically stringent levels such as info > 0.99 and
MAF > 10%. For example, 54 variants still achieved genome-wide significance after genomic
control (PgcAdj < 5x10^-8) even though only 857,644 variants (less than 10% of the 9.6 million to start
with) remained under consideration. This strategy, the least conservative among all strategies
outlined at the beginning of Section 5.4, highlighted that the risks of performing imputations for
cases and controls separately outweighed the benefits of incorporating more genotypes.
Table 5.4 Inflation in test statistics by info score and MAF cutoffs when cases and controls were
imputed separately based on all genotypes.
minINFO minMAF NSNPs Lambda N(P<5e-08) N(PgcAdj<5e-08)
0.8 0.01 9,207,258 1.484 5207 1475
0.8 0.02 8,492,323 1.500 4973 1385
0.8 0.05 6,635,090 1.458 3737 1206
0.8 0.07 5,856,495 1.436 3270 1141
0.8 0.1 4,982,660 1.409 2749 1016
0.9 0.01 6,612,916 1.317 1305 600
0.9 0.02 6,248,254 1.328 1264 585
0.9 0.05 5,204,846 1.311 1026 548
0.9 0.07 4,711,992 1.300 942 531
0.9 0.1 4,117,546 1.288 828 475
0.95 0.01 3,941,016 1.180 385 267
0.95 0.02 3,814,975 1.185 384 263
0.95 0.05 3,412,729 1.180 365 259
0.95 0.07 3,190,272 1.176 351 255
0.95 0.1 2,896,554 1.176 320 230
0.98 0.01 1,791,369 1.092 117 99
0.98 0.02 1,765,015 1.094 117 99
0.98 0.05 1,669,844 1.091 115 97
0.98 0.07 1,606,333 1.089 114 96
0.98 0.1 1,511,718 1.094 108 93
0.99 0.01 964,670 1.068 70 59
0.99 0.02 956,488 1.069 70 59
0.99 0.05 923,504 1.067 69 58
0.99 0.07 897,502 1.066 68 57
0.99 0.1 857,644 1.072 64 54
Abbreviations: minINFO, minimum info score; minMAF, minimum minor allele frequency; NSNPs, number of
imputed variants with info score > minINFO and MAF > minMAF; Lambda, the median of the 1 df chi-square
statistics for Wald test divided by 0.455; N(P<5e-08), number of false positives with p-value < 5x10^-8;
N(PgcAdj<5e-08), number of false positives with p-value < 5x10^-8 after genomic control by dividing the chi-square
by lambda.
Figure 5.4 Inflation in test statistics by info score and MAF cutoffs when cases and controls were
imputed separately based on all available genotypes. The colors are for different minimum info
score cutoffs: blue (>0.8), magenta (>0.9), dark green (>0.95), red (>0.98), and orange (>0.99); x
axis denotes MAF cutoffs: 1%, 2%, 5%, 7%, and 10%.
5.4.3 Imputation for Cases and Controls Separately While Based on the Same 188k
The previous two approaches exemplify the most and least conservative strategies
for imputation given that cases and controls were typed with different chips and with different
numbers of SNPs. The observed differences in the inflation of test statistics between these two
extremes could be attributable either to the different imputation bases or to differential errors
arising from stochastic processes in the separate batches of imputation. To dissect this
difference, a third imputation was performed based on the same 188k SNPs while phasing and
imputing cases and controls separately. Using the chromosome 1 data as an example, we
compared the inflation factors between these three different yet related imputation strategies,
namely, based on the same 188k but imputed separately (“188k.sep”), based on the 188k and
imputed together (“188k.together”), and based on all available genotypes (“260k+1M”). As
expected, the inflation for the “188k.sep” imputation (λ = 1.076 for info>0.8 and MAF>1%) fell
between the other two (λ = 1.04 and λ = 1.50) and gravitated toward the situation where cases
and controls were imputed together, suggesting that both the different imputation bases and
imputation batches led to differential errors and that the former may play a larger role (Table 5.5). As
info score increased, the lambda values for the two types of imputation both based on the 188k
rapidly converged (Figure 5.5), since differential errors are smaller for variants imputed with
more certainty. With the same basis of the 188k SNPs, neither the imputation performed
together nor that performed separately generated any false positives. In contrast, imputation with different
bases (260k+1M), although its inflation improved when only better-imputed variants were
included (λ > 1.4 for info>0.8 decreasing to λ ≈ 1.1 for info>0.99), did not eliminate false
positives even at the most stringent info score. The above discussion is limited to chromosome 1;
nevertheless, similar results are expected for the other chromosomes.
Table 5.5 Comparing the inflation in test statistics derived from three different imputation strategies (chromosome 1 data only).
Each strategy block lists four columns: NSNPs, Lambda, N(P<5e-08), N(PgcAdj<5e-08).
MinINFO MinMAF | 188k.sep | 188k.together | 260k+1M
0.8 0.01 | 735,177 1.076 0 0 | 753,962 1.040 0 0 | 720,229 1.500 403 91
0.8 0.02 | 655,089 1.073 0 0 | 669,519 1.042 0 0 | 664,172 1.514 393 88
0.8 0.05 | 509,274 1.078 0 0 | 518,217 1.051 0 0 | 515,768 1.484 301 70
0.8 0.07 | 449,279 1.082 0 0 | 456,529 1.057 0 0 | 453,964 1.456 256 67
0.8 0.1 | 383,440 1.076 0 0 | 388,747 1.057 0 0 | 386,386 1.430 218 60
0.9 0.01 | 411,331 1.050 0 0 | 423,981 1.042 0 0 | 517,328 1.331 97 35
0.9 0.02 | 386,602 1.055 0 0 | 398,489 1.045 0 0 | 489,339 1.343 95 34
0.9 0.05 | 330,011 1.059 0 0 | 339,520 1.048 0 0 | 406,030 1.340 73 34
0.9 0.07 | 302,250 1.067 0 0 | 310,571 1.056 0 0 | 366,798 1.329 63 32
0.9 0.1 | 269,946 1.061 0 0 | 276,772 1.051 0 0 | 321,531 1.320 59 30
0.95 0.01 | 216,876 1.039 0 0 | 224,126 1.051 0 0 | 312,335 1.213 30 23
0.95 0.02 | 210,198 1.040 0 0 | 217,527 1.052 0 0 | 302,110 1.219 30 23
0.95 0.05 | 192,133 1.040 0 0 | 198,710 1.047 0 0 | 269,693 1.226 30 23
0.95 0.07 | 182,080 1.048 0 0 | 187,891 1.052 0 0 | 251,919 1.218 28 22
0.95 0.1 | 168,904 1.046 0 0 | 173,933 1.051 0 0 | 229,964 1.222 27 21
0.98 0.01 | 94,159 1.042 0 0 | 96,963 1.060 0 0 | 145,438 1.130 10 10
0.98 0.02 | 92,906 1.048 0 0 | 95,860 1.063 0 0 | 143,124 1.137 10 9
0.98 0.05 | 89,171 1.040 0 0 | 92,152 1.052 0 0 | 135,484 1.142 10 9
0.98 0.07 | 86,474 1.040 0 0 | 89,331 1.054 0 0 | 130,265 1.140 10 9
0.98 0.1 | 82,422 1.030 0 0 | 85,196 1.045 0 0 | 123,287 1.145 9 8
0.99 0.01 | 51,073 1.016 0 0 | 52,157 1.028 0 0 | 80,444 1.097 7 7
0.99 0.02 | 50,662 1.023 0 0 | 51,897 1.032 0 0 | 79,611 1.105 7 7
0.99 0.05 | 49,254 1.020 0 0 | 50,578 1.023 0 0 | 76,852 1.111 7 7
0.99 0.07 | 48,125 1.017 0 0 | 49,411 1.020 0 0 | 74,640 1.104 7 7
0.99 0.1 | 46,392 1.017 0 0 | 47,547 1.014 0 0 | 71,629 1.106 6 6
Abbreviations: 188k.sep, cases and controls were imputed separately while based on the same 188k SNPs; 188k.together, cases and controls were imputed in the
same imputation batch based on the 188k; 260k+1M, cases were imputed based on the 260k typed on the HumanCore and controls were imputed based on the 1
million SNPs typed on the Human1M.
Figure 5.5 Comparing the inflation factors in test statistics derived from the three different imputation strategies (showing
chromosome 1 data only). Imputation bases were 260k+1M SNPs (dark green plus signs), 188k SNPs imputed separately (blue
circles) and 188k SNPs imputed together (magenta triangles); x axis denotes MAF cutoffs: 1%, 2%, 5%, 7%, and 10%.
5.4.4 Potential Differential Imputation Errors Within Controls
Another caveat in the case-control analysis using preexisting imputed data is that not
only were the cases and controls imputed separately, but so were the two sources of controls,
namely, the AABC and AAPC controls. If these two groups of controls were internally
differentiated, it would pose a potential threat to the integrity of the control data. A quick
way to verify this was to construct a hypothetical “case-control” sample designating the AABC
controls as the “cases” and the AAPC controls as the controls [4]. We show the results for this
comparison using the chromosome 1 data (Table 5.6, Figure 5.6). By and large, the inflation was mild
as expected (λ ≈ 1.04 for info scores > 0.8), comparable to the inflation in the analysis between
real cases and controls. It also improved gradually with both higher info scores and MAFs
(λ ≈ 1.01 for info scores > 0.99). However, false positives remained a threat even at extremely
stringent levels, e.g., there were 13 variants with p < 5x10^-8 (info > 0.99 and MAF > 10%) for
chromosome 1. Genomic control was of little help in this situation given that the lambda was
extremely close to 1. Little is known about how these false positives between the two sources of
controls would bias the real case-control comparisons, i.e., toward the null or away from the
null. Therefore, the controls should be re-imputed as a whole (possibly along with cases) to
eliminate such errors prior to being tested for association.
Table 5.6 Inflation in a hypothetical case-control analysis, AABC vs. AAPC (chromosome 1 data
only)
minINFO minMAF NSNPs Lambda N(P<5e-08) N(PgcAdj<5e-08)
0.8 0.01 720,572 1.042 87 84
0.8 0.02 664,216 1.043 86 83
0.8 0.05 515,402 1.043 76 74
0.8 0.07 453,995 1.044 70 68
0.8 0.1 386,582 1.040 60 58
0.9 0.01 669,539 1.038 69 66
0.9 0.02 624,492 1.040 69 66
0.9 0.05 493,633 1.038 66 64
0.9 0.07 436,530 1.040 62 60
0.9 0.1 372,746 1.034 55 53
0.95 0.01 549,446 1.031 42 41
0.95 0.02 521,541 1.031 42 41
0.95 0.05 429,150 1.032 41 40
0.95 0.07 384,496 1.035 41 40
0.95 0.1 332,256 1.027 38 38
0.98 0.01 371,028 1.016 19 19
0.98 0.02 358,816 1.017 19 19
0.98 0.05 311,389 1.024 18 18
0.98 0.07 284,842 1.025 18 18
0.98 0.1 251,390 1.015 17 17
0.99 0.01 258,557 1.009 15 15
0.99 0.02 252,311 1.010 15 15
0.99 0.05 225,695 1.014 14 14
0.99 0.07 209,035 1.014 14 14
0.99 0.1 187,313 1.005 13 13
Abbreviations: minINFO, minimum info score; minMAF, minimum minor allele frequency; NSNPs, number of
imputed variants with info score > minINFO and MAF > minMAF; Lambda, the median of the 1 df chi-square
statistics for Wald test divided by 0.455; N(P<5e-08), number of false positives with p-value < 5x10^-8;
N(PgcAdj<5e-08), number of false positives with p-value < 5x10^-8 after genomic control by dividing the chi-square
by lambda.
Figure 5.6 Inflation in a hypothetical case-control analysis, AABC vs. AAPC (chromosome 1
data only); x axis denotes MAF cutoffs: 1%, 2%, 5%, 7%, and 10%.
5.4.5 Imputation to the Full 260k SNPs Genotyped in Cases
Imputations based on the overlapping 188k SNPs (Strategy #1) discarded the actual
genotypes for those “TA” SNPs (typed in cases but not in controls); on the other hand, separate
imputations based on all genotypes (Strategy #2), while accounting for those “TA” SNPs,
introduced an overwhelmingly large number of false positives. The last strategy strove to
use all of the actual genotypes for cases (~260k SNPs typed on the HumanCore), supplemented
with the imputed dosages for controls (~188k typed on the Human1M plus the difference of
~72k “TA” SNPs imputed with the KGP). Since the 188k typed SNPs neither showed inflation in
test statistics nor introduced any false positive associations, we focus the following discussion on
the ~72k TA SNPs. Among these ~72k SNPs, 62,933 SNPs were successfully imputed with the
KGP reference and 47,473 SNPs with MAF>1% were imputed with info score > 0.8. The
inflation in test statistics was fairly high, λ = 1.272 for SNPs filtered at info>0.8 and MAF>1%
(Table 5.7). Even at the most stringent criteria, i.e., info>0.99 and MAF>10%, the inflation was
still noticeable (λ = 1.118) even though fewer than half of the SNPs remained (20,947 out of 47,473),
and the false positives were not eliminated either. Filtering by MAF contributed little to
controlling the inflation since the majority of these SNPs were fairly common (MAF>5%).
This approach exemplifies that differential errors remain a problem even when the
imputation was only performed for controls despite having 1 million SNPs in the basis. The
inflation became even bigger when we tried to convert the dosage data for the controls to “hard
calls” (best guesses of the genotypes with a threshold = 0.9) and combine them with the actual
genotypes of cases (λ ≈ 1.9, data not shown). All evidence reiterates the importance of
performing imputations for cases and controls based on the same basis in order to control the
differential errors.
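To make the distinction between dosages and hard calls concrete, the sketch below derives both from a triple of imputation posterior probabilities. It is an illustration only: the function names `dosage` and `hard_call` are hypothetical, and treating a call as missing when the largest posterior probability falls below the 0.9 threshold (with `>=` as the comparison) is an assumption about how such a threshold is typically applied, not a description of the thesis pipeline.

```python
# Hypothetical sketch: expected dosage versus thresholded best-guess genotype.
def dosage(posteriors):
    """Expected allele count from (P(g=0), P(g=1), P(g=2))."""
    return posteriors[1] + 2.0 * posteriors[2]


def hard_call(posteriors, threshold=0.9):
    """Best-guess genotype (0, 1 or 2), or None if no genotype reaches the threshold."""
    best = max(range(3), key=lambda g: posteriors[g])
    # Assumption: calls below the threshold are set to missing rather than forced.
    return best if posteriors[best] >= threshold else None
```

A hard call discards the uncertainty that the dosage retains, which may be one reason the inflation grew when dosages for controls were converted to best-guess genotypes.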
Table 5.7 Inflation for the “TA” SNPs when the genotypes of cases were combined with the
imputed dosages for controls
minINFO minMAF NSNPs Lambda N(P<5e-08) N(PgcAdj<5e-08)
0.8 0.01 47,473 1.272 27 18
0.8 0.02 47,364 1.273 27 18
0.8 0.05 46,604 1.271 27 18
0.8 0.07 45,596 1.264 27 18
0.8 0.1 43,815 1.256 26 18
0.9 0.01 47,044 1.261 23 16
0.9 0.02 46,940 1.263 23 16
0.9 0.05 46,207 1.261 23 16
0.9 0.07 45,218 1.256 23 16
0.9 0.1 43,459 1.248 22 15
0.95 0.01 44,132 1.231 22 16
0.95 0.02 44,051 1.233 22 16
0.95 0.05 43,424 1.233 22 16
0.95 0.07 42,541 1.229 22 16
0.95 0.1 40,958 1.225 21 15
0.98 0.01 33,228 1.146 11 9
0.98 0.02 33,178 1.146 11 9
0.98 0.05 32,779 1.146 11 9
0.98 0.07 32,223 1.146 11 9
0.98 0.1 31,214 1.145 11 9
0.99 0.01 22,166 1.123 7 6
0.99 0.02 22,128 1.123 7 6
0.99 0.05 21,889 1.123 7 6
0.99 0.07 21,551 1.121 7 6
0.99 0.1 20,947 1.118 7 6
Abbreviations: minINFO, minimum info score; minMAF, minimum minor allele frequency; NSNPs, number of
imputed variants with info score > minINFO and MAF > minMAF; Lambda, the median of the 1 df chi-square
statistics for Wald test divided by 0.455; N(P<5e-08), number of false positives with p-value < 5x10^-8;
N(PgcAdj<5e-08), number of false positives with p-value < 5x10^-8 after genomic control by dividing the chi-square
by lambda.
Figure 5.7 Inflation for the “TA” SNPs when the genotypes of cases were combined with the
imputed dosages for controls; x axis denotes MAF cutoffs: 1%, 2%, 5%, 7%, and 10%.
5.5 Summary
This chapter has investigated a variety of imputation strategies aimed at optimizing the
use of existing genotype data while ensuring the validity of association testing. Restricting to the
small number of overlapping SNPs between different platforms, though the most conservative,
seems to control the inflation and false positives in association tests fairly well. The degree to
which the genomic coverage will be compromised depends on the similarity between the chips.
It has been advised that a small number of samples (a few hundred) be genotyped on both chips
[4]. The concordance of genotypes for the same SNPs serves as a good indicator of similarity.
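One simple way to quantify such concordance for samples genotyped on both chips is the share of matching calls among non-missing pairs. The sketch below is illustrative: the function name `genotype_concordance`, the 0/1/2 coding, and the use of `None` for missing calls are assumptions rather than anything specified in the text.

```python
# Hypothetical sketch: concordance of genotype calls for the same SNPs and
# samples typed on two platforms (0/1/2 coding, None for a missing call).
def genotype_concordance(calls_a, calls_b):
    """Return (concordance rate, number of comparable pairs)."""
    if len(calls_a) != len(calls_b):
        raise ValueError("call vectors must have the same length")
    pairs = [(a, b) for a, b in zip(calls_a, calls_b)
             if a is not None and b is not None]
    if not pairs:
        return float("nan"), 0
    matches = sum(1 for a, b in pairs if a == b)
    return matches / len(pairs), len(pairs)
```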
When samples are genotyped using the same chip but imputed in different batches,
unexpected false positives may also arise even if the test statistics are not necessarily inflated.
Such a situation would entail sifting through many seemingly significant associations; more
stringent criteria for filtering imputed variants would be advisable.
5.6 Chapter 5 References
1. Barrett, J.C. and L.R. Cardon, Evaluating coverage of genome-wide association studies. Nat Genet, 2006.
38(6): p. 659-62.
2. Nelson, S.C., et al., Imputation-based genomic coverage assessments of current human genotyping arrays.
G3 (Bethesda), 2013. 3(10): p. 1795-807.
3. Hoffmann, T.J., et al., Design and coverage of high throughput genotyping arrays optimized for individuals
of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP
selection algorithm. Genomics, 2011. 98(6): p. 422-30.
4. Sinnott, J.A. and P. Kraft, Artifact due to differential error when cases and controls are imputed from
different platforms. Hum Genet, 2012. 131(1): p. 111-9.
6. Final Remarks
Across different racial/ethnic groups, African Americans (AA) have higher incidence
rates and worse survival for a wide range of diseases, e.g., breast cancer, prostate cancer, and
multiple myeloma [1-3]. It is also becoming increasingly clear that such disparities could be
attributed to genetic risk factors after adjusting for a wide range of environmental factors [4].
However, genetic studies in ethnic minorities, including AAs, are often underpowered, due to
insufficient research resources (limited sample sizes, lack of advanced research infrastructure,
etc.) compared to studies in European-derived populations. Therefore,
expanding research data and advancing statistical methods in AAs are of critical importance.
One attractive approach is to incorporate the existing data in the study design to enlarge
the sample size cost-effectively and therefore increase power. A caveat to keep in mind is that
AAs are a largely admixed population with variation between individuals in ancestry. In addition
to adjusting for the ancestral differences in association tests to account for the confounding
effect, one should also evaluate whether the additional samples could directly translate to better
power. When the variation in individual admixture proportion is much larger than the difference
in average admixture between different sources of samples, the increase in power will be less
adversely affected.
African Americans can also be analyzed together with samples of the European ancestry
in a trans-ethnicity fine-mapping study setting to take advantage of the relatively shorter LDs in
AAs. Diverse LD information, coupled with larger sample sizes, may facilitate the localization of
the potential causal variant in a much finer resolution. Shorter LD, however, makes the
imputation in AAs more challenging. When resorting to imputation to include unmeasured
variants in the genome, we expand the latitude of association testing to a much larger scale by
leveraging the haplotype information available in the KGP reference panel. The difficulties in
imputing variants for AAs imply a less optimal coverage of the genome, and therefore loss of
power to identify risk variants.
Such loss of power could be further exacerbated when cases and controls are typed on
different platforms that differ considerably in the number of tag SNPs. The overlap between
two different chips may be only a small fraction of the genotyped SNPs. Imputation based on
such a small number of SNPs may be less reliable, and restricting the analysis to the small
fraction of reliably imputed variants still incurs a loss of power. Attempts to salvage more
genotypes outside the overlap should be made with extreme caution, as differential errors could
occur and result in an overwhelming number of false positives.
Genome-wide association studies in AAs face many additional challenges. This dissertation
set out to investigate the possibility of extending GWAS methods in African American data:
existing controls may be effectively reused to enhance power if the admixture differences are
properly accounted for; imputations are helpful in assembling genetic data typed on different
chips and expanding the test of association to a much larger number of variants; haplotypes may
supplement single variant association tests with additional hypothesis testing. Conceived from a
variety of perspectives, these approaches could work in concert to better power studies in AAs
and to assist in elucidating the risk profile that disproportionately burdens AAs.
Chapter 6 References
1. Haiman, C.A., et al., Characterizing genetic risk at known prostate cancer susceptibility loci in African
Americans. PLoS Genetics, 2011. 7(5).
2. Haiman, C.A., et al., Genome-Wide Testing of Putative Functional Exonic Variants in Relationship with
Breast and Prostate Cancer Risk in a Multiethnic Population. PLoS Genetics, 2013. 9(3).
3. Gebregziabher, M., et al., Risk patterns of multiple myeloma in Los Angeles County, 1972-1999 (United
States). Cancer Causes Control, 2006. 17: p. 931-938.
4. Henderson, B.E., et al., The influence of race and ethnicity on the biology of cancer. 2012. p. 648-653.
Comprehensive References
Aad, G., Abbott, B., Abdallah, J., et al. (2011). Search for diphoton events with large missing transverse energy in 7
TeV proton-proton collisions with the ATLAS detector. Phys Rev Lett 106, 121803.
Ambrosone, C. B., Ciupak, G. L., Bandera, E. V., et al. (2009). Conducting Molecular Epidemiological Research in
the Age of HIPAA: A Multi-Institutional Case-Control Study of Breast Cancer in African-American and European-
American Women. J Oncol 2009, 871250.
Amler, L. C., Bauer, A., Corvi, R., et al. (2000). Identification and characterization of novel genes located at the
t(1;15)(p36.2;q24) translocation breakpoint in the neuroblastoma cell line NGP. Genomics 64, 195-202.
Balding, D. J., and Nichols, R. A. (1995). A method for quantifying differentiation between populations at multi-
allelic loci and its implications for investigating identity and paternity. Genetica 96, 3-12.
Barretina, J., Caponigro, G., Stransky, N., et al. (2012). The Cancer Cell Line Encyclopedia enables predictive
modelling of anticancer drug sensitivity. Nature 483, 603-607.
Barrett, J. C., and Cardon, L. R. (2006). Evaluating coverage of genome-wide association studies. Nat Genet 38,
659-662.
Barrett, J. C., Fry, B., Maller, J., and Daly, M. J. (2005). Haploview: analysis and visualization of LD and haplotype
maps. Bioinformatics 21, 263-265.
Betel, D., Koppal, A., Agius, P., Sander, C., and Leslie, C. (2010). Comprehensive modeling of microRNA targets
predicts functional non-conserved and non-canonical sites. Genome Biol 11, R90.
Bonardi, F., Fusetti, F., Deelen, P., van Gosliga, D., Vellenga, E., and Schuringa, J. J. (2013). A proteomics and
transcriptomics approach to identify leukemic stem cell (LSC) markers. Mol Cell Proteomics 12, 626-637.
Bourgain, C., Hoffjan, S., Nicolae, R., et al. (2003). Novel case-control test in a founder population identifies P-
selectin as an atopy-susceptibility locus. American journal of human genetics 73, 612-626.
Broderick, P., Chubb, D., Johnson, D. C., et al. (2012). Common variation at 3p22.1 and 7p15.3 influences multiple
myeloma risk. Nat Genet 44, 58-61.
Browning, S. R., and Browning, B. L. (2007). Rapid and accurate haplotype phasing and missing-data inference for
whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet 81, 1084-1097.
Burneo, J. G., Sirven, J. I., Kiesel, L. W., et al. (2013). Managing common complex symptomatic epilepsies: tumors
and trauma: American Epilepsy Society - 2012 annual course summary. Epilepsy Curr 13, 232-235.
Cai, Q. Y., Long, J. R., Lu, W., et al. (2011). Genome-wide association study identifies breast cancer risk variant at
10q21.2: results from the Asia Breast Cancer Consortium. Human molecular genetics 20, 4991-4999.
Campbell, M. C., and Tishkoff, S. A. (2008). African genetic diversity: implications for human demographic
history, modern human origins, and complex disease mapping. Annu Rev Genomics Hum Genet 9, 403-433.
Cardon, L. R., and Abecasis, G. R. (2003). Using haplotype blocks to map human complex trait loci. Trends Genet
19, 135-140.
Chakraborty, R., and Weiss, K. M. (1988). Admixture as a Tool for Finding Linked Genes and Detecting That
Difference from Allelic Association between Loci. Proceedings of the National Academy of Sciences of the United
States of America 85, 9119-9123.
Chen, F., Chen, G. K., Millikan, R. C., et al. (2011a). Fine-mapping of breast cancer susceptibility loci characterizes
genetic risk in African Americans. Hum Mol Genet 20, 4491-4503.
Chen, F., Chen, G. K., Stram, D. O., et al. (2013). A genome-wide association study of breast cancer in women of
African ancestry. Hum Genet 132, 39-48.
Chen, F., Stram, D. O., Le Marchand, L., et al. (2011b). Caution in generalizing known genetic risk markers for
breast cancer across all ethnic/racial populations. Eur J Hum Genet 19, 243-245.
Chen, G. K., Millikan, R. C., John, E. M., et al. (2010). The Potential for Enhancing the Power of Genetic
Association Studies in African Americans through the Reuse of Existing Genotype Data. PLoS Genetics 6, 13-13.
Chen, G. K., Wang, K., Stram, A. H., Sobel, E. M., and Lange, K. (2012). Mendel-GPU: haplotyping and genotype
imputation on graphics processing units. Bioinformatics 28, 2979-2980.
Cheng, I., Chen, G. K., Nakagawa, H., et al. (2012). Evaluating Genetic Risk for Prostate Cancer among Japanese
and Latinos. Cancer Epidemiology Biomarkers & Prevention 21, 2048-2058.
Cheong, H. S., Park, B. L., Kim, E. M., et al. (2011). Association of RANBP1 haplotype with smooth pursuit eye
movement abnormality. Am J Med Genet B Neuropsychiatr Genet 156B, 67-71.
Cheverud, J. M. (2001). A simple correction for multiple comparisons in interval mapping genome scans. Heredity
(Edinb) 87, 52-58.
Chubb, D., Weinhold, N., Broderick, P., et al. (2013). Common variation at 3q26.2, 6p21.33, 17p11.2 and 22q13.1
influences multiple myeloma risk. Nat Genet 45, 1221-1225.
Coetzee, S. G., Rhie, S. K., Berman, B. P., Coetzee, G. A., and Noushmehr, H. (2012). FunciSNP: an
R/bioconductor tool integrating functional non-coding data sets with genetic association studies to identify candidate
regulatory SNPs. Nucleic Acids Res 40, e139.
Collaboration, A., Aad, G., Abbott, B., et al. (2014). Measurement of the muon reconstruction performance of the
ATLAS detector using 2011 and 2012 LHC proton-proton collision data. Eur Phys J C Part Fields 74, 3130.
Consortium, G. T. (2013). The Genotype-Tissue Expression (GTEx) project. Nat Genet 45, 580-585.
Cordell, H. J. (2002). Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans.
Human molecular genetics 11, 2463-2468.
Costas, J., Salas, A., Phillips, C., and Carracedo, A. (2005). Human genome-wide screen of haplotype-like blocks of
reduced diversity. Gene 349, 219-225.
Cozen, W., Gebregziabher, M., Conti, D. V., et al. (2006). Interleukin-6-related genotypes, body mass index, and
risk of multiple myeloma and plasmacytoma. Cancer Epidemiol Biomarkers Prev 15, 2285-2291.
Daly, M. J., Rioux, J. D., Schaffner, S. F., Hudson, T. J., and Lander, E. S. (2001). High-resolution haplotype
structure in the human genome. Nature genetics 29, 229-232.
De Roos, A. J., Gold, L. S., Wang, S., et al. (2006). Metabolic gene variants and risk of non-Hodgkin's lymphoma.
Cancer Epidemiol Biomarkers Prev 15, 1647-1653.
Demchenko, Y. N., and Kuehl, W. M. (2010). A critical role for the NFkB pathway in multiple myeloma.
Oncotarget 1, 59-68.
Devlin, B., and Roeder, K. (1999). Genomic control for association studies. Biometrics 55, 997-1004.
Dudbridge, F., and Gusnanto, A. (2008). Estimation of significance thresholds for genomewide association scans.
Genetic epidemiology 32, 227-234.
Dudbridge, F., and Koeleman, B. P. C. (2003). Rank truncated product of P-values, with application to genomewide
association scans. Genetic epidemiology 25, 360-366.
Dudbridge, F., and Koeleman, B. P. C. (2004). Efficient computation of significance levels for multiple associations
in large studies of correlated data, including genomewide association studies. American journal of human genetics
75, 424-435.
Durbin, R. M., Altshuler, D. L., Abecasis, G. R., et al. (2010). A map of human genome variation from population-
scale sequencing. Nature 467, 1061-1073.
Durrant, C., Zondervan, K. T., Cardon, L. R., Hunt, S., Deloukas, P., and Morris, A. P. (2004). Linkage
disequilibrium mapping via cladistic analysis of single-nucleotide polymorphism haplotypes. The American Journal
of Human Genetics 75, 35-43.
Eilstein, D., Uhry, Z., Lim, T. A., and Bloch, J. (2008). Lung cancer mortality in France. Trend analysis and
projection between 1975 and 2012, using a Bayesian age-period-cohort model. Lung Cancer 59, 282-290.
Excoffier, L., and Slatkin, M. (1995). Maximum-likelihood estimation of molecular haplotype frequencies in a
diploid population. Molecular Biology and Evolution 12, 921-927.
Feng, Y., Stram, D. O., Rhie, S. K., et al. (2014). A comprehensive examination of breast cancer risk loci in African
American women. Human molecular genetics, 1-9.
Fontenot, J. D., Rasmussen, J. P., Williams, L. M., Dooley, J. L., Farr, A. G., and Rudensky, A. Y. (2005).
Regulatory T cell lineage specification by the forkhead transcription factor foxp3. Immunity 22, 329-341.
Forzati, F., Federico, A., Pallante, P., et al. (2012). CBX7 is a tumor suppressor in mice and humans. J Clin Invest
122, 612-623.
Franke, A., McGovern, D. P. B., Barrett, J. C., et al. (2010). Genome-wide meta-analysis increases to 71 the number
of confirmed Crohn's disease susceptibility loci. Nature genetics 42, 1118-1125.
Freedman, M. L., Haiman, C. A., Patterson, N., et al. (2006). Admixture mapping identifies 8q24 as a prostate
cancer risk locus in African-American men. Proceedings of the National Academy of Sciences of the United States
of America 103, 14068-14073.
Freedman, M. L., Monteiro, A. N. A., Gayther, S. A., et al. (2011). Principles for the post-GWAS functional
characterization of cancer risk loci. Nature genetics 43, 513-518.
Freedman, M. L., Reich, D., Penney, K. L., et al. (2004). Assessing the impact of population stratification on genetic
association studies. Nature genetics 36, 388-393.
Fu, J., Wolfs, M. G., Deelen, P., et al. (2012). Unraveling the regulatory mechanisms underlying tissue-dependent
genetic variation of gene expression. PLoS Genet 8, e1002431.
Gabriel, S. B., Schaffner, S. F., Nguyen, H., et al. (2002). The structure of haplotype blocks in the human genome.
Science (New York, N.Y.) 296, 2225-2229.
Gao, X., Becker, L. C., Becker, D. M., Starmer, J. D., and Province, M. A. (2010). Avoiding the high Bonferroni
penalty in genome-wide association studies. Genetic epidemiology 34, 100-105.
Gebregziabher, M., Bernstein, L., Wang, Y., and Cozen, W. (2006). Risk patterns of multiple myeloma in Los
Angeles County, 1972-1999 (United States). Cancer Causes Control 17, 931-938.
Ghoussaini, M., Song, H., Koessler, T., et al. (2008). Multiple loci with different cancer specificities within the 8q24
gene desert. J Natl Cancer Inst 100, 962-966.
Gibson, G. (2010). Hints of hidden heritability in GWAS. 558-560.
Giles, G. G., and English, D. R. (2002). The Melbourne Collaborative Cohort Study. IARC Sci Publ 156, 69-70.
Gourraud, P. A., Khankhanian, P., Cereb, N., et al. (2014). HLA diversity in the 1000 genomes dataset. PLoS One 9,
e97282.
Greenberg, A. J., Lee, A. M., Serie, D. J., et al. (2012). Single-nucleotide polymorphism rs1052501 associated with
monoclonal gammopathy of undetermined significance and multiple myeloma. Leukemia.
Guan, Z. P., Gu, L. K., Xing, B. C., Ji, J. F., Gu, J., and Deng, D. J. (2011). [Downregulation of chromobox protein
homolog 7 expression in multiple human cancer tissues]. Zhonghua Yu Fang Yi Xue Za Zhi 45, 597-600.
Guo, Y., Li, J., Bonham, A. J., Wang, Y., and Deng, H. (2009). Gains in power for exhaustive analyses of
haplotypes using variable-sized sliding window strategy: a comparison of association-mapping strategies. European
journal of human genetics EJHG 17, 785-792.
Haiman, C. A., Chen, G. K., Blot, W. J., et al. (2011). Characterizing genetic risk at known prostate cancer
susceptibility loci in African Americans. PLoS Genetics 7.
Haiman, C. A., Han, Y., Feng, Y., et al. (2013). Genome-Wide Testing of Putative Functional Exonic Variants in
Relationship with Breast and Prostate Cancer Risk in a Multiethnic Population. PLoS Genetics 9.
Haiman, C. A., Patterson, N., Freedman, M. L., et al. (2007). Multiple regions within 8q24 independently affect risk
for prostate cancer. Nature genetics 39, 638-644.
Haiman, C. A., and Stram, D. O. (2010). Exploring genetic susceptibility to cancer in diverse populations. Curr
Opin Genet Dev 20, 330-335.
Han, Y., Signorello, L. B., Strom, S. S., et al. (2015). Generalizability of established prostate cancer risk variants in
men of African ancestry. Int J Cancer 136, 1210-1217.
Henderson, B. E., Lee, N. H., Seewaldt, V., and Shen, H. (2012). The influence of race and ethnicity on the biology
of cancer. 648-653.
Hideshima, T., Chauhan, D., Richardson, P., et al. (2002). NF-kappa B as a therapeutic target in multiple myeloma.
J Biol Chem 277, 16639-16647.
Hindorff, L. A., Sethupathy, P., Junkins, H. A., et al. (2009). Potential etiologic and functional implications of
genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106, 9362-9367.
Hoffmann, T. J., Kvale, M. N., Hesselson, S. E., et al. (2011a). Next generation genome-wide association tool:
design and coverage of a high-throughput European-optimized SNP array. Genomics 98, 79-89.
Hoffmann, T. J., Zhan, Y., Kvale, M. N., et al. (2011b). Design and coverage of high throughput genotyping arrays
optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel
hybrid SNP selection algorithm. Genomics 98, 422-430.
Hoggart, C. J., Clark, T. G., De Iorio, M., Whittaker, J. C., and Balding, D. J. (2008). Genome-wide significance for
dense SNP and resequencing data. Genetic epidemiology 32, 179-185.
Holly, E. A., Eberle, C. A., and Bracci, P. M. (2003). Prior history of allergies and pancreatic cancer in the San
Francisco Bay area. Am J Epidemiol 158, 432-441.
Hori, S., Nomura, T., and Sakaguchi, S. (2003). Control of regulatory T cell development by the transcription factor
Foxp3. Science 299, 1057-1061.
Howie, B. N., Donnelly, P., and Marchini, J. (2009). A flexible and accurate genotype imputation method for the
next generation of genome-wide association studies. PLoS Genet 5, e1000529.
Hu, Y. J., and Lin, D. Y. (2010). Analysis of untyped SNPs: maximum likelihood and imputation methods. Genetic
epidemiology 34, 803-815.
Huang, L., Verstrepen, L., Heyninck, K., et al. (2008). ABINs inhibit EGF receptor-mediated NF-kappaB activation
and growth of EGF receptor-overexpressing tumour cells. Oncogene 27, 6131-6140.
Huffman, D. M., Deelen, J., Ye, K., et al. (2012). Distinguishing between longevity and buffered-deleterious
genotypes for exceptional human longevity: the case of the MTP gene. J Gerontol A Biol Sci Med Sci 67, 1153-
1160.
Jeggari, A., Marks, D. S., and Larsson, E. (2012). miRcode: a map of putative microRNA target sites in the long
non-coding transcriptome. Bioinformatics 28, 2062-2063.
Jia, L., Landan, G., Pomerantz, M., et al. (2009). Functional enhancers at the gene-poor 8q24 cancer-linked locus.
PLoS Genet 5, e1000597.
John, E. M., Hopper, J. L., Beck, J. C., et al. (2004). The Breast Cancer Family Registry: an infrastructure for
cooperative multinational, interdisciplinary and translational studies of the genetic epidemiology of breast cancer.
Breast Cancer Research 6, R375-R389.
John, E. M., Schwartz, G. G., Koo, J., Wang, W., and Ingles, S. A. (2007). Sun exposure, vitamin D receptor gene
polymorphisms, and breast cancer risk in a multiethnic population. Am J Epidemiol 166, 1409-1419.
Johnson, E. O., Hancock, D. B., Levy, J. L., et al. (2013). Imputation across genotyping arrays for genome-wide
association studies: assessment of bias and a correction strategy. Hum Genet 132, 509-522.
Kang, H. M., Sul, J. H., Service, S. K., et al. (2010). Variance component model to account for sample structure in
genome-wide association studies. Nature genetics 42, 348-354.
Karolchik, D., Barber, G. P., Casper, J., et al. (2014). The UCSC Genome Browser database: 2014 update. Nucleic
Acids Res 42, D764-770.
Kolonel, L. N., Henderson, B. E., Hankin, J. H., et al. (2000). A multiethnic cohort in Hawaii and Los Angeles:
baseline characteristics. Am J Epidemiol 151, 346-357.
Kraft, P., Cox, D. G., Paynter, R. A., Hunter, D., and De Vivo, I. (2005). Accounting for haplotype uncertainty in
matched association studies: a comparison of simple and flexible techniques. Genetic epidemiology 28, 261-272.
Kraft, P., and Stram, D. O. (2007a). Re: The Use of Inferred Haplotypes. Journal of Human Genetics 81, 863-868.
Kraft, P., and Stram, D. O. (2007b). Re: the use of inferred haplotypes in downstream analysis. American journal of
human genetics 81, 863-865; author reply 865-866.
Lambert, J. C., Grenier-Boley, B., Harold, D., et al. (2012). Genome-wide haplotype association study identifies the
FRMD4A gene as a risk locus for Alzheimer's disease. Molecular psychiatry, 1-10.
Landgren, O., and Weiss, B. M. (2009). Patterns of monoclonal gammopathy of undetermined significance and
multiple myeloma in various ethnic/racial groups: support for genetic factors in pathogenesis. Leukemia 23, 1691-
1697.
Lee, Y., Park, Y. J., Kim, M. N., Uh, Y., Kim, M. S., and Lee, K. (2015). Multicenter Study of Antimicrobial
Susceptibility of Anaerobic Bacteria in Korea in 2012. Ann Lab Med 35, 479-486.
Li, J., and Ji, L. (2005). Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation
matrix. Heredity 95, 221-227.
Li, Y., Willer, C. J., Ding, J., Scheet, P., and Abecasis, G. R. (2010). MaCH: Using sequence and genotype data to
estimate haplotypes and unobserved genotypes. Genetic epidemiology 34, 816-834.
Lin, D. Y., and Huang, B. E. (2007). The Use of Inferred Haplotypes. 80, 2006-2008.
Liu, N., Zhang, K., and Zhao, H. (2008). Haplotype-association analysis. Advances in Genetics 60, 335-405.
Lohr, J. G., Stojanov, P., Carter, S. L., et al. (2014). Widespread genetic heterogeneity in multiple myeloma:
implications for targeted therapy. Cancer Cell 25, 91-101.
Lorenz, P., Dietmann, S., Wilhelm, T., et al. (2010). The ancient mammalian KRAB zinc finger gene cluster on
human chromosome 8q24.3 illustrates principles of C2H2 zinc finger evolution associated with unique expression
profiles in human tissues. BMC Genomics 11, 206.
Lucas, J. S., Adam, E. C., Goggin, P. M., et al. (2012). Static respiratory cilia associated with mutations in
Dnahc11/DNAH11: a mouse model of PCD. Hum Mutat 33, 495-503.
Marchbanks, P. A., McDonald, J. A., Wilson, H. G., et al. (2002). The NICHD Women's Contraceptive and
Reproductive Experiences Study: methods and operational results. Annals of Epidemiology 12, 213-221.
Marchini, J., Howie, B., Myers, S., McVean, G., and Donnelly, P. (2007). A new multipoint method for genome-
wide association studies by imputation of genotypes. Nat Genet 39, 906-913.
Marjoram, P., Zubair, A., and Nuzhdin, S. V. (2014). Post-GWAS: where next? More samples, more SNPs or more
biology? Heredity 112, 79-88.
Mathias, R. A., Gao, P., Goldstein, J. L., et al. (2006). A graphical assessment of p-values from sliding window
haplotype tests of association to identify asthma susceptibility loci on chromosome 11q. BMC genetics 7, 38-38.
McCarthy, M. I., Abecasis, G. R., Cardon, L. R., et al. (2008). Genome-wide association studies for complex traits:
consensus, uncertainty and challenges. Nature reviews. Genetics 9, 356-369.
McClellan, J., and King, M.-C. (2010). Genetic heterogeneity in human disease. Cell 141, 210-217.
McLane, L. M., Banerjee, P. P., Cosma, G. L., et al. (2013). Differential localization of T-bet and Eomes in CD8 T
cell memory populations. J Immunol 190, 3207-3215.
Meng, Z., Zaykin, D. V., Xu, C.-F., Wagner, M., and Ehm, M. G. (2003). Selection of genetic markers for
association analyses, using linkage disequilibrium and haplotypes. American journal of human genetics 73, 115-130.
Stram, D. O., and Seshan, V. E. (2012). Multi-SNP haplotype analysis methods for association analysis. In Statistical
Human Genetics, R. C. Elston, J. M. Satagopan, and S. Sun (eds), 423-452. Totowa, NJ: Humana Press.
Morris, R. W., and Kaplan, N. L. (2002). On the advantage of haplotype analysis in the presence of multiple disease
susceptibility alleles. Genetic epidemiology 23, 221-233.
Moskvina, V., and Schmidt, K. M. (2008). On multiple-testing correction in genome-wide association studies.
Genetic epidemiology 32, 567-573.
Mouaffak, F., Kebir, O., Bellon, A., et al. (2011). Association of an UCP4 (SLC25A27) haplotype with ultra-
resistant schizophrenia. Pharmacogenomics 12, 185-193.
Nackley, A. G., Shabalina, S. A., Tchivileva, I. E., et al. (2006). Human catechol-O-methyltransferase haplotypes
modulate protein expression by altering mRNA secondary structure. Science (New York, N.Y.) 314, 1930-1933.
Nelson, S. C., Doheny, K. F., Pugh, E. W., et al. (2013). Imputation-based genomic coverage assessments of current
human genotyping arrays. G3 (Bethesda) 3, 1795-1807.
Newman, B., Moorman, P. G., Millikan, R., et al. (1995). The Carolina Breast Cancer Study: integrating population-
based epidemiology and molecular biology. Breast Cancer Research and Treatment 35, 51-60.
Parra, E. J., Marcini, A., Akey, J., et al. (1998). Estimating African American admixture proportions by use of
population-specific alleles. The American Journal of Human Genetics 63, 1839-1851.
Passtoors, W. M., Beekman, M., Deelen, J., et al. (2013). Gene expression analysis of mTOR pathway: association
with human longevity. Aging Cell 12, 24-31.
Passtoors, W. M., Boer, J. M., Goeman, J. J., et al. (2012). Transcriptional profiling of human familial longevity
indicates a role for ASF1A and IL7R. PLoS One 7, e27759.
Patil, N., Berno, A. J., Hinds, D. A., et al. (2001). Blocks of limited haplotype diversity revealed by high-resolution
scanning of human chromosome 21. Science 294, 1719-1723.
Pe'er, I., Yelensky, R., Altshuler, D., and Daly, M. J. (2008). Estimation of the multiple testing burden for
genomewide association studies of nearly all common variants. Genetic epidemiology 32, 381-385.
Penney, K. L., Pyne, S., Schumacher, F. R., et al. (2010). Genome-wide association study of prostate cancer
mortality. Cancer Epidemiol Biomarkers Prev 19, 2869-2876.
Penney, K. L., Sinnott, J. A., Fall, K., et al. (2011). mRNA expression signature of Gleason grade predicts lethal
prostate cancer. J Clin Oncol 29, 2391-2396.
Penney, K. L., Sinnott, J. A., Tyekucheva, S., et al. (2015). Association of prostate cancer risk variants with gene
expression in normal and tumor tissue. Cancer Epidemiol Biomarkers Prev 24, 255-260.
Poduslo, S. E., Huang, R., and Spiro, A. (2010). A genome screen of successful aging without cognitive decline
identifies LRP1B by haplotype analysis. American journal of medical genetics. Part B, Neuropsychiatric genetics :
the official publication of the International Society of Psychiatric Genetics 153B, 114-119.
Popescu, I., Pipeling, M. R., Shah, P. D., Orens, J. B., and McDyer, J. F. (2014). T-bet:Eomes balance, effector
function, and proliferation of cytomegalovirus-specific CD8+ T cells during primary infection differentiates the
capacity for durable immune control. J Immunol 193, 5709-5722.
Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. (2006). Principal
components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904-909.
Price, A. L., Tandon, A., Patterson, N., et al. (2009). Sensitive detection of chromosomal segments of distinct
ancestry in admixed populations. PLoS Genetics 5, e1000519.
Price, A. L., Zaitlen, N. A., Reich, D., and Patterson, N. (2010). New approaches to population stratification in
genome-wide association studies. Nature reviews. Genetics 11, 459-463.
Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype
data. Genetics 155, 945-959.
Prorok, P. C., Andriole, G. L., Bresalier, R. S., et al. (2000). Design of the Prostate, Lung, Colorectal and Ovarian
(PLCO) Cancer Screening Trial. 273S-309S.
Pulit, S. L., Leusink, M., Menelaou, A., and de Bakker, P. I. W. (2014). Association claims in the sequencing era.
Genes 5, 196-213.
Purcell, S., Neale, B., Todd-Brown, K., et al. (2007). PLINK: A Tool Set for Whole-Genome Association and
Population-Based Linkage Analyses. The American Journal of Human Genetics 81, 559-575.
Qin, Z. S., Niu, T., and Liu, J. S. (2002). Partition-ligation--expectation-maximization algorithm for haplotype
inference with single-nucleotide polymorphisms. The American Journal of Human Genetics 71, 1242-1242.
Rakovski, C. S., and Stram, D. O. (2009). A kinship-based modification of the armitage trend test to address hidden
population structure and small differential genotyping errors. PLoS One 4, e5825.
Reich, D. E., Cargill, M., Bolk, S., et al. (2001). Linkage disequilibrium in the human genome. Nature 411, 199-
204.
Ripke, S., Neale, B. M., Corvin, A., et al. (2014). Biological insights from 108 schizophrenia-associated genetic
loci. Nature 511, 421-427.
Schaid, D. J. (2004). Evaluating associations of haplotypes with traits. Genetic epidemiology 27, 348-364.
Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002). Score tests for association
between traits and haplotypes when linkage phase is ambiguous. American journal of human genetics 70, 425-434.
Sebastiani, P., Solovieff, N., and DeWan, A. T. (2010). RETRACTED - Genetic signatures of exceptional longevity
in humans. Sciencexpress.
Shim, H., Chun, H., Engelman, C. D., and Payseur, B. A. (2009). Genome-wide association studies using single-
nucleotide polymorphisms versus haplotypes: an empirical comparison with data from the North American
Rheumatoid Arthritis Consortium. BMC Proceedings 3, S35-S35.
Šidák, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the
American Statistical Association 62, 626-633.
Siddiq, A., Couch, F. J., Chen, G. K., et al. (2012). A meta-analysis of genome-wide association studies of breast
cancer identifies two novel susceptibility loci at 6q14 and 20q11. Human molecular genetics 21, 5373-5384.
Sinnott, J. A., and Kraft, P. (2012). Artifact due to differential error when cases and controls are imputed from
different platforms. Hum Genet 131, 111-119.
Smith, T. R., Levine, E. A., Freimanis, R. I., et al. (2008). Polygenic model of DNA repair genetic polymorphisms
in human breast cancer risk. Carcinogenesis 29, 2132-2138.
Stram, D. O., Leigh Pearce, C., Bretsky, P., et al. (2003). Modeling and E-M estimation of haplotype-specific
relative risks from genotype data for a case-control study of unrelated individuals. Human heredity 55, 179-190.
Sung, Y. J., Gu, C. C., Tiwari, H. K., Arnett, D. K., Broeckel, U., and Rao, D. C. (2012). Genotype imputation for
African Americans using data from HapMap phase II versus 1000 genomes projects. Genetic epidemiology 36, 508-
516.
Tang, R., Feng, T., Sha, Q., and Zhang, S. (2009). A variable-sized sliding-window approach for genetic association
studies via principal component analysis. Annals of Human Genetics 73, 631-637.
Tian, Y., Huang, C., Zhang, H., et al. (2013). CDCA7L promotes hepatocellular carcinoma progression by
regulating the cell cycle. Int J Oncol 43, 2082-2090.
Trégouët, D.-A., König, I. R., Erdmann, J., et al. (2009). Genome-wide haplotype association study identifies the
SLC22A3-LPAL2-LPA gene cluster as a risk locus for coronary artery disease. Nature genetics 41, 2008-2010.
Udler, M. S., Meyer, K. B., Pooley, K. A., et al. (2009). FGFR2 variants and breast cancer risk: fine-scale mapping
using African American studies and analysis of chromatin conformation. Human molecular genetics 18, 1692-1703.
Uh, H. W., Deelen, J., Beekman, M., et al. (2012). How to deal with the early GWAS data when imputing and
combining different arrays is necessary. Eur J Hum Genet 20, 572-576.
Uhlmann, A. (2012). [Symposium of the Freiburg/bad Sackingen Rehabilitation Research Network on theme "Reha
2020--Lifestyle And Health Risks", February 10-11, 2012 in Freiburg]. Rehabilitation (Stuttg) 51, 269-270.
van Beek, M. H., Voshaar, R. C., van Deelen, F. M., van Balkom, A. J., Pop, G., and Speckens, A. E. (2012). The
cardiac anxiety questionnaire: cross-validation among cardiac inpatients. Int J Psychiatry Med 43, 349-364.
van den Ouweland, A. M., van Minkelen, R., Bolman, G. M., et al. (2012). Complete FXN deletion in a patient with
Friedreich's ataxia. Genet Test Mol Biomarkers 16, 1015-1018.
Visscher, P. M., Brown, M. A., McCarthy, M. I., and Yang, J. (2012). Five Years of GWAS Discovery.
The American Journal of Human Genetics 90, 7-24.
Wang, X., Zhu, X., Qin, H., et al. (2010). Adjustment for local ancestry in genetic association analysis of admixed
populations. 1-9.
Wang, Y., Fong, P. Y., Leung, F. C. C., Mak, W., and Sham, P. C. (2007). Increased gene coverage and Alu
frequency in large linkage disequilibrium blocks of the human genome. Genetics and molecular research : GMR 6,
1131-1141.
Weinhold, N., Meissner, T., Johnson, D. C., et al. (2014). The 7p15.3 (rs4487645) association for multiple myeloma
shows strong allele-specific regulation of the MYC-interacting gene CDCA7L in malignant plasma cells.
Haematologica.
Wellcome Trust Case Control Consortium (2007). Genome-wide association study of 14,000 cases of seven common
diseases and 3,000 shared controls. Nature 447, 661-678.
Welter, D., MacArthur, J., Morales, J., et al. (2014). The NHGRI GWAS Catalog, a curated resource of SNP-trait
associations. Nucleic Acids Research 42.
Willer, C. J., Li, Y., and Abecasis, G. R. (2010). METAL: fast and efficient meta-analysis of genomewide
association scans. Bioinformatics 26, 2190-2191.
Witte, J. S., Visscher, P. M., and Wray, N. R. (2014). The contribution of genetic variants to disease depends on the
ruler. Nature Reviews Genetics 15, 765-776.
Xie, R., and Stram, D. O. (2005). Asymptotic equivalence between two score tests for haplotype-specific risk in
general linear models. Genetic epidemiology 29, 166-170.
Yang, J., Lee, S. H., Goddard, M. E., and Visscher, P. M. (2011). GCTA: a tool for genome-wide complex trait
analysis. American journal of human genetics 88, 76-82.
Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M., and Price, A. L. (2014). Advantages and pitfalls in the
application of mixed-model association methods. Nature genetics 46, 100-106.
Zaboli, G., Jönsson, E. G., Gizatullin, R., De Franciscis, A., Asberg, M., and Leopardi, R. (2008). Haplotype
analysis confirms association of the serotonin transporter (5-HTT) gene with schizophrenia but not with major
depression. American journal of medical genetics Part B: Neuropsychiatric genetics 147, 301-307.
Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002a). Testing
association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals.
Human heredity 53, 79-91.
Zaykin, D. V., Zhivotovsky, L. A., Westfall, P. H., and Weir, B. S. (2002b). Truncated product method for
combining P-values. Genetic epidemiology 22, 170-185.
Zhang, C., Bailey, D. K., Awad, T., et al. (2006). A whole genome long-range haplotype (WGLRH) test for
detecting imprints of positive selection in human populations. Bioinformatics 22, 2122-2128.
Zhao, H., Pfeiffer, R., and Gail, M. H. (2003). Haplotype analysis in population genetics and association
studies. Pharmacogenomics, 171-178.
Zheng, W., Cai, Q., Signorello, L. B., et al. (2009). Evaluation of 11 breast cancer susceptibility loci in African-American
women. Cancer epidemiology, biomarkers & prevention 18, 2761-2764.
Zuk, O., Hechter, E., Sunyaev, S. R., and Lander, E. S. (2012). The mystery of missing heritability: Genetic
interactions create phantom heritability. Proceedings of the National Academy of Sciences 109, 1193-1198.
Abstract
African Americans are known to have higher risks for a wide range of complex diseases, such as breast and prostate cancer, multiple myeloma, and many others. Differences in genetic architecture across diverse racial/ethnic groups may explain the elevated risks. However, research resources for African Americans have been limited compared to those for European-derived populations, a major impediment to elucidating the genetic risk factors that disproportionately affect African Americans. Improving study power is therefore of primary scientific interest. This dissertation discusses the following approaches to the problem: 1) integrating haplotypes with single SNPs in testing for breast cancer risk in African Americans
Asset Metadata
Creator
Song, Chi (author)
Core Title
Extending genome-wide association study methods in African American data
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
09/16/2015
Defense Date
09/08/2015
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
admixture,African American,fine-mapping,genome-wide association study,haplotype analysis,imputation,meta-analysis,OAI-PMH Harvest
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Stram, Daniel O. (committee chair), Coetzee, Gerhard A. (committee member), Cozen, Wendy (committee member), Haiman, Christopher A. (committee member), Lewinger, Juan Pablo (committee member)
Creator Email
chis@usc.edu,CSong.2013@marshall.usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-182906
Unique identifier
UC11275527
Identifier
etd-SongChi-3916.pdf (filename),usctheses-c40-182906 (legacy record id)
Legacy Identifier
etd-SongChi-3916.pdf
Dmrecord
182906
Document Type
Dissertation
Rights
Song, Chi
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA