POLYGENES AND ESTIMATED HERITABILITY OF PROSTATE CANCER IN AN
AFRICAN AMERICAN SAMPLE USING GENOME-WIDE ASSOCIATION STUDY
DATA
By
Jing He
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)
Aug 2013
Copyright 2013 Jing He
Acknowledgements
I would like to gratefully acknowledge Dr. Daniel Stram and Dr. Christopher Haiman
who guided and supported me through completion of my PhD with their inspiration and
endless patience. I would also like to acknowledge the support and inspiration from my
wonderful committee, Dr. James Gauderman, Dr. Wendy Mack and Dr. Gerhard Coetzee.
Finally I would like to thank my family for their love, support and encouragement.
Table of Contents
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1 Genetic Studies of Polygenic Trait
1.1 Polygenic trait
1.2 Genome-wide association study (GWAS)
1.3 GWAS in multiethnic groups
1.3.1 Motivation
1.3.2 Differential linkage disequilibrium (LD)
1.3.3 Population structure
1.4 Accounting for population stratification
1.4.1 Genomic control
1.4.2 Structured association
1.4.3 Principal component analysis (PCA)
1.5 Shared risk between complex diseases
Chapter 2 Completed Work
2.1 Introduction
2.2 The association of diabetes with colorectal cancer risk: the multiethnic cohort
2.2.1 Abstract
2.2.2 Introduction
2.2.3 Study population
2.2.4 Statistical analysis
2.2.5 Results
2.2.6 Discussion
2.2.7 Acknowledgements
2.2.8 References
2.3 Generalizability and epidemiologic characterization of eleven colorectal cancer GWAS hits in Multiple populations
2.3.1 Abstract
2.3.2 Introduction
2.3.3 Study Populations
2.3.4 Genotyping
2.3.5 Statistical Analysis
2.3.6 Results
2.3.7 Discussion
2.3.9 References
Chapter 3 Heritability Estimate Using GWAS
3.1 Background
3.1.1 Heritability
3.1.2 Heritability of binary trait
3.2 Heritability estimated from GWAS
3.2.1 Motivation
3.2.2 Problem of missing heritability
Chapter 4 Finding the Missing Heritability
4.1 More comprehensive evaluation of genetic variation
4.1.1 Introduction
4.1.2 Score analysis
4.1.3 Linear mixed model (LMM)
4.2 Application to African American Prostate Cancer (AAPC) GWAS data
4.2.1 AAPC GWAS
4.2.2 Score analysis of Prostate Cancer
4.2.3 Mixed linear model of Prostate Cancer
Chapter 5 Interpretation of Heritability Estimates
5.1 Concerns about score analysis and LMM approach
5.2 Simulation studies
5.2.1 Simulated phenotypes based on genotypes in AAPC GWAS
5.2.2 Simulated genotypes and phenotypes with low pairwise correlation
5.3 Discussion
Bibliography
Appendix A
Appendix B
List of Tables
Table 2-1 Baseline characteristics of the participants, by ethnicity and by diabetes status
Table 2-2 Age-adjusted^a and Multivariate^b Relative Risks of Colorectal Cancer
Associated with Diabetes Status According to Ethnic or Racial Group, Site and Stage of
disease in the Multiethnic Cohort, Los Angeles, CA and Hawaii
Table 2-3 Multivariate Relative Risks^a of Colorectal Cancer Associated with Diabetes
Status According to Risk Factors in the Multiethnic Cohort, Los Angeles, CA and Hawaii
Table 2-4 Descriptive characteristics of each study*
Table 2-5 The association of known CRC variants with CRC/adenoma risk by
race/ethnicity*
Table 2-6 The risk of CRC/Adenoma associated with an aggregate risk score* by
race/ethnicity
Table 4-1 Results of the British Isles / non-British Isles score analyses [31].
Table 4-2 Results of score analyses on Prostate Cancer. Score alleles were selected based
on logistic model adjusted for first 10 principal components
Table 5-1 Results of score analyses on simulated phenotype. Models were adjusted for
first 10 principal components
Table 5-2 Results of linear mixed model analyses of simulated phenotypes and genotypes
with low pairwise correlation (averaged over 20 independent simulations).
List of Figures
Figure 2-1 Risk allele frequencies in controls across racial/ethnic groups, in descending
order based on frequency in European Americans.
Figure 3-1 (From Yang et al) The Liability Threshold Model for a Disease Prevalence of
K. An underlying continuous random variable determines disease status. If liability
exceeds the threshold t, then individuals are affected.
Figure 4-1 (From Purcell et al) Multidimensional scaling (MDS) plot for the individuals
in the final post-QC dataset (both cases and controls). Known study samples are indicated
by color; the distinct clusters are labeled with the exception of the four British Isles
samples (from Scotland, Ireland and England) that show near complete overlap on the
first two dimensions.
Figure 4-2 (From Purcell et al) Observed and simulated profiles of target sample variance
explained according to P-value threshold P_T. a, The observed variance explained is
shown (R^2, black line). b, A subset of models that produced results consistent with the
observed data is shown. All yielded similar estimates of the total variance explained by
the SNPs that tag the causal variants, V_M, with a mean value of 34%. c, Four inconsistent
models with fewer variants of larger effect are shown.
Figure 4-3 (From Yang et al) Estimates of variance explained by genome-wide SNPs
from adjusted estimates of genetic relationships are unbiased. Results are shown as
estimates of variance explained by different proportions of SNPs randomly selected from
all the SNPs in the combined set. For each group of SNPs, the variance explained by
genome-wide SNPs is estimated using both raw estimates of genetic relationships and
adjusted estimates of genetic relationships correcting for prediction error (assuming c=0).
Error bars denote s.e. of the estimate of variance explained by genome-wide SNPs. The
log-likelihood ratio test (LRT) statistic is calculated as twice the difference in
log-likelihood between the full (h^2≠0) and reduced (h^2=0) models.
Figure 4-4 (From Yang et al) Variance explained by chromosomes. Shown are the
estimates of the variance explained by each chromosome for height (combined) by joint
analysis using 11,586 unrelated individuals against chromosome length. The numbers in
the circles are the chromosome numbers. The regression slopes and R^2 were 1.6×10^-4
(P=1.4×10^-6) and 0.695 for height.
Figure 4-5 Heritability explained by common SNPs for Prostate Cancer by number of
SNPs used. Models were adjusted for first 10 eigenvectors.
Figure 4-6 Variance explained by chromosomes for prostate cancer against chromosome
length. Models were adjusted for first 10 eigenvectors. The regression slope and R^2 were
9.17×10^-5 (P=0.0508) and 0.421.
Figure 4-7 Estimates of the variance explained by genic and intergenic regions on each
chromosome for prostate cancer by the joint analysis. The genic region is defined as (a)
±0 kb, (b) ±20 kb and (c) ±50 kb of the 3′ and 5′ UTRs.
Figure 5-1 Heritability explained by common SNPs for simulated phenotype over
observed AAPC genotypes by number of SNPs used. Models were adjusted for first 10
eigenvectors. a, When the models include all of the causal SNPs. b, When the models do
not include any causal SNPs.
Abstract
Background: Validated SNPs in genome-wide association studies (GWAS) account for
only a small fraction of the variation in human complex traits. Methods: We applied
score analysis (Purcell et al) to 6957 not closely related individuals in African American
prostate cancer (AAPC) GWAS to see whether common variants have an important role
en masse; we estimated the narrow sense heritability of prostate cancer explained by
316,735 genotyped GWAS SNPs using a linear mixed model (Yang et al). We simulated
genotypes and phenotypes with low degree of correlation to evaluate how the estimates
from score analysis and linear mixed model are influenced by distant relatedness among
individuals. Results: Score analyses provided evidence for a substantial polygenic
component to prostate cancer risk; the narrow sense heritability of prostate cancer in
African Americans was estimated to be 30% using 316,735 GWAS SNPs in a linear
mixed model. In the simulation studies, the narrow sense heritability estimates were
biased upward by the inclusion of non-causal SNPs that are associated with distant relatedness.
Conclusion: The narrow sense heritability estimate from the linear mixed model may be an
overestimate of the genetic contribution of the specific set of measured GWAS SNPs and
is closer to the overall narrow sense heritability. Impact: It is important to clarify what the
estimates are telling us because where the phenotypic variation lies will give hints about the
future direction of genetic association studies.
Chapter 1 Genetic Studies of Polygenic Trait
1.1 Polygenic trait
The term monogenic trait refers to those disorders induced by highly penetrant single
genes that are both necessary and sufficient to induce the phenotype. These diseases,
however, collectively affect only a small fraction of the world's population. For example, NF1
(neurofibromatosis-1) occurs in about 1 in 3,500 live births, about half of which arise
from spontaneous mutation. Polygenic diseases, which account for most of the deaths in
the world, are due to genetic variants that occur much more commonly. For a polygenic
trait, genetic predisposition is due to multiple genes rather than a single gene. Each
variant at each gene influencing a complex trait typically has a small additive or
multiplicative effect on the trait. The vast majority of sequence differences between
people are due to single nucleotide polymorphisms (SNPs).
1.2 Genome-wide association study (GWAS)
A genome-wide association study is a powerful tool for identifying common genetic factors
that influence polygenic diseases. In a GWAS, a large number of SNPs are genotyped;
the density of genetic markers (SNPs) and the extent of linkage disequilibrium are
sufficient to capture a large proportion of the common variation in the human genome (in
the population under study). The SNP genotypes are then compared between
groups of people with and without the disease. If one variant (one allele) is more
frequent in people with the disease, the SNP is said to be “associated” with the disease.
The associated SNPs are then considered to mark a region of the human genome that
influences the risk of disease. The first GWA study was published in 2005 [1]. As of 2011,
over 1,200 human GWA studies, each testing hundreds or thousands of individuals, had
examined over 200 diseases and traits, and almost 4,000 SNP associations had been
found [2].
1.3 GWAS in multiethnic groups
1.3.1 Motivation
For many diseases, significant disparities exist with regard to incidence and mortality
among racial and ethnic groups. For example, African American men have the highest
risk of prostate cancer in the United States, as well as the highest risk of developing
aggressive prostate cancer and the highest prostate cancer mortality rates; estrogen
receptor (ER)-negative breast tumors, which are more aggressive than ER-positive
tumors, occur more frequently in African American women than in Caucasian women.
African-American men and women have a higher risk of developing colorectal cancer
and a lower survival rate. We want to understand the genetic structure underlying these
disparities. However, so far most studies of complex diseases have been performed in
people of European descent.
Population genetics shows that disease variants, especially for late-onset diseases that do
not affect reproductive success, differ in frequency from population to population.
A disease-causing allele might not be found in a study if the allele is absent or less
common in the population from which the case subjects are chosen; thus it is both possible and
necessary to study variants in non-European populations that cannot be studied in whites.
Also, linkage disequilibrium (LD) may extend over shorter distances in certain populations,
such as those of African ancestry, so we can narrow down the functional region by
performing association studies in ethnic groups with shorter LD blocks. In addition,
successful replication of an association in multiple ethnic groups can increase our
confidence that the allele is a functional allele. Replication studies are therefore an
important part of understanding the burden of disease in all groups and are needed
to see whether the results found in European populations translate to risk in other
ethnic groups. My paper on replication of Colorectal Cancer (CRC) Genome Wide
Association Study (GWAS) hits is an example of such a study.
1.3.2 Differential linkage disequilibrium (LD)
One of the complications of genetic studies in a multi-ethnic setting is that differential LD
patterns can distort associations. Linkage disequilibrium is the non-random association of
alleles at two or more loci. The significant variants identified in GWAS are not
necessarily causal variants, but are likely variants that are correlated with the true causal
variants. However, these correlations differ among ethnic groups because of different LD
patterns, which might be the reason that significant variants found in European
Americans did not replicate well in one ethnic group [3]. To account for the different LD
patterns in different ethnic groups, we can do further genotyping to localize associations
in regions that do not replicate. My current work on fine mapping of the region that
contains the Hepatocyte-Nuclear Factor 1 Beta (HNF1B) gene on chromosome 17q12,
where two independent associations with Prostate Cancer have been detected, is an example
of such a study.
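As an illustration of how the correlation between a tag SNP and a causal variant can differ between populations, the following minimal Python sketch computes the LD coefficient D and the squared correlation r^2 from two-locus haplotype counts. The haplotype counts and the function name `ld_stats` are hypothetical and chosen only for illustration; they are not taken from any study data.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2) from counts of the four two-locus haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total                 # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total         # allele A frequency at locus 1
    p_B = (n_AB + n_aB) / total         # allele B frequency at locus 2
    d = p_AB - p_A * p_B                # departure from independence
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, r2

# Hypothetical haplotype counts for the same SNP pair in two populations.
d_pop1, r2_pop1 = ld_stats(45, 5, 5, 45)     # long LD block: strong correlation
d_pop2, r2_pop2 = ld_stats(30, 20, 20, 30)   # shorter LD block: weak correlation

print(f"population 1: D={d_pop1:.3f}, r2={r2_pop1:.3f}")
print(f"population 2: D={d_pop2:.3f}, r2={r2_pop2:.3f}")
```

With allele frequencies held equal, the same tag SNP captures far less of the causal variant in the second population, which is one way a signal could fail to replicate.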
1.3.3 Population structure
Another complication in genetic association studies in multi-ethnic groups is that
population structure can be a confounder, and the association found could be due to the
underlying structure of the population and not a disease associated locus.
Population structure refers to the presence of a systematic difference in allele frequencies
between subpopulations in a population possibly due to different ancestry. The basic
cause of population structure is nonrandom mating between groups, often due to their
physical separation (e.g., for populations of African and European descent) or cultural
preferences, followed by genetic drift of allele frequencies in each group. Genetic
admixture can occur when individuals from two or more previously separated
populations begin interbreeding. Admixture results in the introduction of new genetic
lineages into a population. In some contemporary populations like African Americans,
there has been recent admixture between individuals from different populations, leading
to populations in which ancestry is variable. Distant (cryptic) relatedness is a more complex form
of population structure. It refers to the situation where apparently unrelated individuals
actually have some unexpected relatedness. Whereas population structure generally
describes remote common ancestry of large groups of individuals, cryptic relatedness
refers to recent common ancestry among smaller groups (often just pairs) of individuals.
Phenotypes of study subjects might also vary systematically with ancestry for the
following reasons [4]: (1) Disease prevalence varies across subpopulations in accordance
with the frequencies of causal alleles, and there may be unequal contributions to the case
and control study samples from subpopulations, reflecting the differing subpopulation
prevalence. (2) Subpopulation prevalence may vary because of differing environmental
risks. (3) Ascertainment bias can arise if there are differences in the sampling strategies
between cases and controls that are correlated with ancestry. This may happen, for
example, because cases, but not controls, are sampled from clinics that overrepresent
particular groups.
The assumption of population homogeneity in association studies, especially case-control
studies, can easily be violated. If an allele is more frequent in one
subpopulation, it will appear to be associated with any trait that is more prevalent in that
subpopulation [5]. The evidence for this kind of spurious association grows with the sample
size, so the problem is of special concern in large-scale association studies in which
loci have only relatively small effects on the trait. It is therefore important for the models
used in a GWAS to account for population structure.
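The confounding described above can be made concrete with a small worked example. The Python sketch below uses expected allele counts in two hypothetical subpopulations in which the allele has no effect on disease (the odds ratio is 1 within each subpopulation), yet the pooled analysis shows a strong association. All sample sizes, allele frequencies and function names are assumptions made for illustration only.

```python
def allele_counts(n_cases, n_controls, freq_A):
    """Expected (case A, case a, control A, control a) allele counts
    when allele A has the same frequency in cases and controls."""
    return (2 * n_cases * freq_A, 2 * n_cases * (1 - freq_A),
            2 * n_controls * freq_A, 2 * n_controls * (1 - freq_A))

def odds_ratio(case_A, case_a, ctrl_A, ctrl_a):
    """Allelic odds ratio from a 2x2 table of allele counts."""
    return (case_A * ctrl_a) / (case_a * ctrl_A)

# Hypothetical design: subpopulation 1 has a high allele frequency and
# contributes most of the cases; subpopulation 2 is the reverse.
sub1 = allele_counts(n_cases=800, n_controls=200, freq_A=0.8)
sub2 = allele_counts(n_cases=200, n_controls=800, freq_A=0.2)
pooled = tuple(x + y for x, y in zip(sub1, sub2))

or_sub1 = odds_ratio(*sub1)      # about 1: no association within subpopulation 1
or_sub2 = odds_ratio(*sub2)      # about 1: no association within subpopulation 2
or_pooled = odds_ratio(*pooled)  # well above 1: spurious association

print(f"OR subpopulation 1: {or_sub1:.2f}")
print(f"OR subpopulation 2: {or_sub2:.2f}")
print(f"OR pooled:          {or_pooled:.2f}")
```

The pooled odds ratio departs from 1 only because both allele frequency and case fraction differ between the subpopulations, which is the pattern the adjustment methods in the next section are designed to remove.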
1.4 Accounting for population stratification
There are a number of possible ways to ameliorate the effects of population structure in
association studies and thus compensate for any population bias [6], including
genomic control, which is a relatively nonparametric method for controlling the inflation
of test statistics [7]; structured association methods [8], which use genetic information to
estimate and control for population structure; and efficient mixed-model association (EMMA),
which uses a variance component approach [9]. Currently, the most widely used structured
association method is the principal component method as implemented in the program
Eigenstrat [10].
1.4.1 Genomic control
A method that in some cases can compensate for the problems described above has been
developed by Devlin and Roeder [7]. It works by using markers that are not linked with
the trait in question to correct for any inflation of the test statistic caused by population
stratification. The $\chi^2$ test for allelic frequencies is

$$Y^2 = \frac{2N\,\bigl(2N(r_1 + 2r_2) - 2R(n_1 + 2n_2)\bigr)^2}{4R(N - R)\,\bigl(2N(n_1 + 2n_2) - (n_1 + 2n_2)^2\bigr)}$$

where the genotype counts are defined as follows:

Alleles   aa   Aa   AA   Total
Case      r0   r1   r2   R
Control   s0   s1   s2   S
Total     n0   n1   n2   N

Under the null hypothesis of no association, and in the absence of population stratification,
this test statistic has an asymptotic $\chi^2$ distribution with one degree of freedom. The idea
underlying genomic control is that the statistic is inflated by a factor $\lambda$ so that
$Y^2 \sim \lambda \chi^2_1$, where $\lambda$ depends on the effect of
stratification. $\lambda$ can be estimated from the unlinked markers as

$$\hat{\lambda} = \operatorname{median}(Y_1^2, Y_2^2, \ldots, Y_L^2) / 0.456$$

where L is the number of unlinked markers. One problem with this approach is
that there can be an ultimate limit to the power of the test that cannot be surpassed no
matter how large N becomes [11]. Also, this approach is not straightforward to extend to
multi-SNP analysis.
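The genomic control calculation can be sketched in a few lines of Python. The sketch below computes the allelic test statistic from the genotype counts in the table above (r0, r1, r2 for cases; s0, s1, s2 for controls), estimates λ as the median of the statistics at unlinked markers divided by 0.456, and divides the statistic at a candidate marker by that estimate. The genotype counts and the list of null statistics are hypothetical values chosen for illustration, and the function names are not from any published software.

```python
from statistics import median

def allelic_chi2(r0, r1, r2, s0, s1, s2):
    """Allelic chi-square statistic Y^2 from case (r) and control (s)
    genotype counts for genotypes aa, Aa and AA."""
    R = r0 + r1 + r2                      # number of cases
    N = R + s0 + s1 + s2                  # total sample size
    case_A = r1 + 2 * r2                  # A alleles among cases
    all_A = case_A + s1 + 2 * s2          # n1 + 2*n2, A alleles overall
    numerator = 2 * N * (2 * N * case_A - 2 * R * all_A) ** 2
    denominator = 4 * R * (N - R) * (2 * N * all_A - all_A ** 2)
    return numerator / denominator

def genomic_control(null_stats, candidate_stat):
    """Estimate lambda from unlinked markers and deflate one statistic."""
    lam = median(null_stats) / 0.456
    return lam, candidate_stat / lam

# Hypothetical candidate marker and hypothetical statistics at unlinked markers.
y2 = allelic_chi2(r0=10, r1=40, r2=50, s0=25, s1=50, s2=25)
lam, y2_adjusted = genomic_control([0.1, 0.5, 0.684, 1.2, 3.0], y2)

print(f"Y^2 = {y2:.2f}, lambda = {lam:.2f}, adjusted Y^2 = {y2_adjusted:.2f}")
```

In practice λ would be estimated from a large number of markers; five values are used here only to keep the example readable.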
1.4.2 Structured association
In this method, sampled individuals are modeled as having inherited their genes from a
pool of K “unstructured” subpopulations (where K may be unknown). The subpopulations
are “unstructured” in that within each subpopulation all loci are in Hardy-Weinberg
equilibrium. Each individual’s genetic background is represented by a vector
$q = (q_1, \ldots, q_K)$, where $q_k$ is the proportion of the individual’s genome that originated in
subpopulation k. A Markov Chain Monte Carlo method is then used to estimate the
number of subpopulations, the allele frequencies in each subpopulation, and the value of
q for each sampled individual. This approach relies on the K-island model and can be
computationally slow.
1.4.3 Principal component analysis (PCA)
This is a three-step method. First, principal component analysis is applied to genotype
data to infer continuous axes of genetic variation. Intuitively, the axes of variation reduce
the data to a small number of dimensions, describing as much variability as possible; they
are defined as the top eigenvectors of a covariance matrix between samples. Second,
genotypes and phenotypes are continuously adjusted by amounts attributable to ancestry along
each axis, by computing residuals of linear regressions. Third, association
statistics are computed using the ancestry-adjusted genotypes and phenotypes. Note that in linear models
this procedure is equivalent to performing a regression analysis of phenotype on genotype
while including the PCs as adjustment variables. This approach does not capture the effects
of distant relatedness or of more complex forms of population structure.
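The three steps can be sketched on simulated data. The Python example below is a minimal illustration, not the Eigenstrat implementation: it uses plain Python and power iteration to find a single top axis of variation, whereas real analyses use many more markers, several axes and optimized linear algebra. The two-population simulation, its allele frequencies, sample sizes and all function names are assumptions made for this sketch.

```python
import math
import random

def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def top_axis(genotypes, iterations=100):
    """Step 1: leading eigenvector of the sample-by-sample covariance
    matrix of column-centered genotypes, found by power iteration."""
    n, n_snps = len(genotypes), len(genotypes[0])
    cols = [center([row[j] for row in genotypes]) for j in range(n_snps)]
    cov = [[sum(c[i] * c[k] for c in cols) / n_snps for k in range(n)]
           for i in range(n)]
    v = [float(i + 1) for i in range(n)]   # arbitrary non-constant start
    for _ in range(iterations):
        w = [sum(cov[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def residualize(values, axis):
    """Step 2: residuals from a linear regression of values on one axis."""
    x, y = center(axis), center(values)
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    return [b - slope * a for a, b in zip(x, y)]

def correlation(u, v):
    """Step 3: a simple association statistic (Pearson correlation)."""
    u, v = center(u), center(v)
    return sum(a * b for a, b in zip(u, v)) / math.sqrt(
        sum(a * a for a in u) * sum(b * b for b in v))

# Hypothetical data: two subpopulations of 30 people, 40 SNPs whose allele
# frequency is 0.8 in one group and 0.2 in the other, and a phenotype that
# depends on subpopulation only (no SNP is causal).
rng = random.Random(0)
pop = [0] * 30 + [1] * 30
genotypes = [[sum(rng.random() < (0.8 if p == 0 else 0.2) for _ in range(2))
              for _ in range(40)] for p in pop]
phenotype = [p + rng.gauss(0.0, 0.5) for p in pop]

axis = top_axis(genotypes)
snp = [row[0] for row in genotypes]
adj_snp = residualize(snp, axis)
adj_pheno = residualize(phenotype, axis)
r_raw = correlation(snp, phenotype)
r_adj = correlation(adj_snp, adj_pheno)
print(f"raw r = {r_raw:.3f}, ancestry-adjusted r = {r_adj:.3f}")
```

Because the phenotype and the SNP are related only through ancestry in this simulation, one would expect the adjusted correlation to be much closer to zero than the raw one, although the exact values depend on the random draw.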
1.5 Shared risk between complex diseases
Many diseases share risk factors; for example, diabetics have been
found to have a greater risk of colorectal cancer than non-diabetics; positive relationships
have been reported between risks of ovarian and breast cancer; and high cholesterol
levels have been significantly related to Alzheimer's disease.
The shared risk between complex diseases could be due to common environmental risk
factors; for instance, excess weight is a risk factor for both type 2 diabetes and colorectal
cancer. It could also be due to causal effects of one disease on the risk of another. For
example, the “hyperinsulinemia hypothesis” has been proposed to explain the positive
relationship between CRC risk and Type 2 Diabetes [12, 13], whereby insulin resistance
in Type 2 diabetics leads to higher insulin levels, as well as an increased level of
bioavailable IGF-I. The proliferative effects of insulin and IGF-I on the colon epithelium
are believed to increase the potential for spontaneous mutations and play an important
role in both the initiation and progression phases of carcinogenesis in colorectal cancer.
It is important to understand whether the known risk factors above can explain the
co-occurrence of disease; if not, the shared risk may be due to common genetic risk
variants. My paper on Colorectal Cancer risk and Type 2 Diabetes is an example of this
type of study.
Chapter 2 Completed Work
2.1 Introduction
In the replication study of CRC GWAS hits, we studied whether the risk
variants identified in genome-wide association studies (GWAS) in populations of
European ancestry generalize to other ethnic groups. We examined the risk variants on 8q23
(rs16892766), 8q24 (rs6983267), 9p24 (rs719725), 10p14 (rs10795668), 11q23
(rs3802842), 14q22 (rs4444235), 15q13 (rs4779584), 16q22 (rs9929218), 18q21
(rs4939827), 19q13 (rs10411210) and 20p12 (rs961253) in a sample of 2,472 CRC cases,
839 adenoma cases and 4,466 controls comprised of European American, African
American, Native Hawaiian, Japanese American and Latino men and women. We
confirmed the associations with an increased risk of CRC/adenoma for the 8q24, 11q23
and 15q13 loci in European Americans, and observed significant associations of
the 8q24 and 20p12 loci with CRC/adenoma risk in African Americans. This study has
been published and the paper is included in section 2.3.
In the second study we aimed to better characterize, in a multiethnic population, the
association between Prostate Cancer risk and the region that contains the
Hepatocyte-Nuclear Factor 1 Beta (HNF1B) gene on chromosome 17q12, where two independent
associations have been detected. To accomplish this, we extensively sequenced the Linkage
Disequilibrium (LD) blocks of the two known index risk alleles in 48 prostate cancer
cases from 3 different racial/ethnic groups. We subsequently selected a subset of these
SNPs in order to cover this region in 4 racial/ethnic groups. We then used a nested
case-control study of the 5 racial/ethnic groups of the Multiethnic Cohort (MEC) to estimate the
association of these SNPs with prostate cancer. In this study, the previously reported
associations did not replicate well in Japanese Americans. However, we found that
rs12453443, which is independent of previously identified signals, was strongly
associated with elevated Prostate Cancer risk in Japanese Americans. This study is still in
progress.
Another study on CRC and Type 2 Diabetes is a prospective study in which Cox
regression was used to estimate hazard ratios (reported as relative risks) for the effect of
diabetes on colorectal cancer incidence. There are 3,549 incident colorectal cancer cases
identified over a 13-year period (1993-2006) among 199,143 European American,
African American, Native Hawaiian, Japanese American and Latino men and women in
the Multiethnic Cohort. We found that except for Native Hawaiians, diabetics overall had
a significantly greater risk of colorectal cancer than did non-diabetics in all racial groups
(relative risk (RR) =1.19, 95% confidence interval (CI): 1.09-1.29, P value (P) <0.001),
which is consistent with previous reports. Positive associations were observed for colon
cancer, cancers of both the right and left colon, and cancers diagnosed at a localized and
regional/distant stage. The positive relationship remained significant after we adjusted for
a number of potential environmental risk factors including body mass index (BMI),
smoking status, educational level, alcohol intake, non-steroidal anti-inflammatory drugs
(NSAIDs) use, saturated and non-saturated fat intake, dietary fiber intake, total calories
intake, vigorous activity, and menopausal status and hormone replacement therapy (HRT)
use (among women). The biological basis for the positive association between Type 2
diabetes and colorectal cancer remains unclear. This study has been published and the
paper is included in section 2.2.
2.2 The association of diabetes with colorectal cancer risk: the multiethnic cohort
2.2.1 Abstract
BACKGROUND: Diabetics have been found to have a greater risk of colorectal cancer
than non-diabetics. METHODS: We examined whether this relationship differed by
ethnic group, cancer site or tumor stage in a population-based prospective cohort,
including 3,549 incident colorectal cancer cases identified over a 13-year period (1993-
2006) among 199,143 European American, African American, Native Hawaiian,
Japanese American and Latino men and women in the Multiethnic Cohort. RESULTS:
Diabetics overall had a significantly greater risk of colorectal cancer than did non-
diabetics (relative risk (RR) =1.19, 95% confidence interval (CI): 1.09-1.29, P value (P)
<0.001). Positive associations were observed for colon cancer, cancers of both the right
and left colon, and cancers diagnosed at a localized and regional/distant stage. The
association with colorectal cancer risk was significantly modified by smoking status
(P_Interaction=0.0044) with the relative risk being higher in never smokers (RR=1.32, 95%CI:
1.15-1.53, P<0.001) than past (RR=1.19, 95%CI: 1.05-1.34, P=0.007) and current
smokers (RR=0.90, 95%CI: 0.70-1.15, P=0.40). CONCLUSION: These findings provide
strong support for the hypothesis that diabetes is a risk factor for colorectal cancer.
2.2.2 Introduction
Colorectal cancer and Type 2 diabetes share several risk factors including western diet,
obesity and physical inactivity, and epidemiological studies have provided evidence to
support a positive relationship between these two common diseases (Levi et al, 2002;
Nilsen & Vatten, 2001; Will et al, 1998; Yang et al, 2005). One hypothesis purported to
explain the association is the “hyperinsulinemia-hypothesis” (Giovannucci, 1995;
McKeown-Eyssen, 1994), whereby insulin resistance in Type 2 diabetics leads to higher
insulin levels, as well as an increased level of bioavailable IGF-I. The proliferative
effects of insulin and IGF-I on the colon epithelium are believed to increase the potential
for spontaneous mutations and play an important role in both the initiation and
progression phases of carcinogenesis in colorectal cancer. This hypothesis is further
supported by the findings that elevated circulating levels of C-peptide and IGF-I are
associated with an increased risk of colorectal cancer (Jenab et al, 2007; Kaaks et al,
2000; Ma et al, 2004; Otani et al, 2007; Renehan et al, 2004; Sandhu et al, 2002; Schoen
et al, 2005). Studies of the relationship between Type 2 diabetes and colorectal cancer
risk in racial/ethnic populations other than Whites have been limited; however, in general,
they also support the positive association between these conditions (Jee et al, 2005; Seow
et al, 2006; Vinikoor et al, 2009). We conducted a prospective analysis of the
relationship between diabetes and colorectal cancer risk in the Multiethnic Cohort (MEC).
This large prospective cohort includes five racial/ethnic populations (European
Americans, African Americans, Native Hawaiians, Japanese Americans and Latinos) that
represent a wide variation in the incidence of these diseases, as well as in the prevalence
of known risk factors. Here we report on the association between these endpoints by race,
sex, cancer site (colon vs. rectum and left colon vs. right colon), stage (localized vs.
regional/distant), and in strata of established risk factors for colorectal cancer.
2.2.3 Study population
The Multiethnic Cohort includes 215,251 men and women in Hawaii and California
(largely from Los Angeles County). The participants are primarily individuals of Native
Hawaiian, Japanese, White, African American and Latino race/ethnicity, who entered the
cohort between 1993 and 1996 by completing a detailed, self-administered questionnaire
that obtained information on basic demographic variables, as well as several lifestyle
factors (e.g. diet) and medical conditions (e.g. diabetes). Potential cohort members were
identified primarily through Department of Motor Vehicles drivers’ license files and,
additionally for African Americans, Health Care Financing Administration (Medicare)
data files. Participants were between the ages of 45 and 75 years at the time of
recruitment. In the cohort, incident colorectal cancer cases are identified through cohort
linkage to population-based cancer Surveillance, Epidemiology and End Results (SEER)
registries which cover Hawaii and Los Angeles County, as well as the rest of the state of
California. For this analysis, linkage with these registries was complete through
December 31, 2004, in Hawaii and December 31, 2006 in California. Over this period,
1,921 male and 1,628 female cases of colorectal cancer were identified. Deaths within the
cohort are determined from linkages to the death certificate files in Hawaii and California,
supplemented with linkages to the National Death Index. Diabetes status is defined based
on self-report to a question on the baseline questionnaire asking whether a doctor had
ever told the respondent that he/she had diabetes (high blood sugar). The question did not
differentiate between Type 1 and Type 2 diabetes mellitus and thus we expect a small
fraction (<10%) of the self-reported diabetes cases to have Type 1 diabetes and be
potentially misclassified. Year of diabetes diagnosis (before 1994, 1994, 1995, 1996,
1997 or 1998) was defined using a second questionnaire in 2001, and 90.3% (79,178 men
and 100,663 women) of the subjects who returned the first questionnaire also returned the
second questionnaire.
In the analysis, we excluded one male subject missing information for diabetes status on
the baseline questionnaire. The prospective analysis of the association between diabetes
status and colorectal cancer incidence in this study includes 89,478 men and 109,664
women in the Multiethnic Cohort from the five main ethnic groups.
The study protocol was approved by the institutional review boards at the University of
Southern California and the University of Hawaii.
2.2.4 Statistical analysis
Cox regression was used to estimate hazard ratios (reported as relative risks) for the
effect of diabetes on colorectal cancer incidence. Models were stratified by age at entry
into the cohort, and were minimally adjusted for age, ethnicity and sex, or further adjusted
for the following risk factors as potential confounders: body mass index (BMI) (<23,
≥23-25, ≥25-30, ≥30-35, ≥35 kg/m² and missing), smoking status (never, past, current
and missing), educational level (≤12 years, some college or vocational, college graduate
and missing), alcohol intake (never, <12, 12-24 and ≥24 g/day and missing), non-
steroidal anti-inflammatory drugs (NSAIDs) use (yes, no and missing), saturated fat
intake (categorized by quartiles of the distribution of percent of calories from saturated
fat; ≤7.0%, >7.0-8.8%, >8.8-10.6%, >10.6% and missing), non-saturated fat intake
(categorized by quartiles of the distribution of percent of calories from non-saturated fat;
≤17.8%, >17.8-21.2%, >21.2-24.5%, >24.5% and missing), dietary fiber intake
(categorized by quartiles of the distribution of dietary fiber intake; ≤8.7, >8.7-
11.3, >11.3-14.5, >14.5 g/Kcal, and missing), total calories (categorized by quartiles of
the distribution of calories intake per day; ≤1417.8, >1417.8-1918.9, >1918.9-
2608.9, >2608.9 Kcal/day and missing), vigorous activity (never, ≤0.21, >0.21-
0.71, >0.71 hrs/day and missing), family history of colorectal cancer (yes and no),
menopausal status and hormone replacement therapy (HRT) use (pre-menopausal, post-
menopausal never HRT users, post-menopausal past HRT users, post-menopausal
current HRT users, and those missing information on menopause status or HRT use)
among women only. For each covariate, an indicator variable was used for those missing
data.
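The categorical coding with a separate level for missing data can be sketched as follows. This is a minimal Python illustration, not the code used for the analysis: the function names, the level labels, the reading of "≥23-25" as 23 to under 25, and the choice of `<23` as the reference category are all assumptions made for the example. Only the BMI cut points come from the text above.

```python
from typing import Dict, List, Optional

# Upper bounds (kg/m^2) and labels for the BMI categories listed in the text.
BMI_CUTS = [(23.0, "<23"), (25.0, "23-<25"), (30.0, "25-<30"), (35.0, "30-<35")]
LEVELS: List[str] = ["<23", "23-<25", "25-<30", "30-<35", ">=35", "missing"]


def bmi_category(bmi: Optional[float]) -> str:
    """Return the BMI category label, or 'missing' when BMI is unknown."""
    if bmi is None:
        return "missing"
    for upper, label in BMI_CUTS:
        if bmi < upper:
            return label
    return ">=35"


def indicator_columns(category: str, reference: str = "<23") -> Dict[str, int]:
    """Build one 0/1 indicator per non-reference level, including 'missing'.

    The reference level is a hypothetical choice for this sketch.
    """
    return {f"bmi_{lvl}": int(category == lvl) for lvl in LEVELS if lvl != reference}
```

Keeping `missing` as its own level retains subjects with incomplete covariate data in the model instead of dropping them.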
Relative risks were estimated separately for colon and rectal cancer, by location in the
colon (left vs. right; 31 men and 42 women were missing information for this variable
and were excluded from this analysis) and by stage (localized vs. regional/distant; 190
men and 213 women were missing information for this variable and were excluded from
this analysis). We also examined whether the association may be modified by known risk
factors (at baseline): age (<60, 60-69 and ≥70), BMI (<25 kg/m² and ≥25 kg/m²; 881 men
and 2,704 women were missing information for BMI and were further excluded from this
analysis), smoking status (never, past and current; 1,034 men and 2,145 women were
missing information for smoking status and were further excluded from this analysis),
and NSAIDs use (yes and no; 2,984 men and 5,701 women were missing information for
NSAIDs use and were further excluded from this analysis).
The risk of colorectal cancer in diabetics has been reported to be highest 10-15 years after
diabetes diagnosis, with risk declining in later years perhaps as a result of
hypoinsulinemia (Hu et al, 1999; La Vecchia et al, 1997; Le Marchand et al, 1997). In an
attempt to assess the association between time since diabetes diagnosis and colorectal
cancer risk, we performed an analysis comparing colorectal cancer risk in long-term and
short-term diabetics. Subjects who reported a diabetes diagnosis before 1994 were
considered to have a long duration of diabetes, and those who reported a diabetes
diagnosis in or after 1994 were considered to have a short duration of diabetes. In this
analysis, we began the follow-up at the time the second questionnaire was returned
(2001). A total of 5,344 diabetic men and 5,842 diabetic women were missing information for
diagnosis year of diabetes or did not return the second questionnaire, or had colorectal
cancer before the return date of the second questionnaire and thus were excluded from
this analysis. After exclusion, 9,716 diabetic men and 10,429 diabetic women were
included in this analysis.
2.2.5 Results
The mean age of the men (n=89,478) at baseline of the cohort was 60.2 years (standard
deviation, 8.9), which ranged from 56.8 years for Native Hawaiians to 62.0 years for
African Americans. The prevalence of diabetes varied widely across populations, from
9.5% in European Americans to 21.6% in Latinos. Diabetic men were slightly older than
non-diabetic men in each racial/ethnic group, with the difference ranging from 3.5 years in European
Americans (62.7 vs. 59.2 years in non-diabetics) to 1.6 years in Latinos (61.9 vs. 60.3 years in non-diabetics).
Among men, the age-standardized incidence rate of colorectal cancer was higher in
diabetics than non-diabetics among European Americans, African Americans, Japanese
Americans and Latinos, but lower in Native Hawaiians. In each population, diabetic men
were more likely to be overweight and less physically active than men without diabetes
(Table 2-1).
The mean age of the women (n=109,664) was 59.7 years (standard deviation, 8.9), with
Native Hawaiians being the youngest (mean, 56.2 years) and Japanese and African
Americans being the oldest (mean, 60.9 years; Table 2-1). Among women, the prevalence
rate of diabetes was lowest in European Americans (7.8%) and highest in African
Americans (20.0%). Likewise, the mean age of diabetic women at baseline was greater
than non-diabetic women. Age-standardized rates of colorectal cancer in women were
greater in diabetics than non-diabetics for all populations except Native Hawaiians.
Among women, obesity and physical inactivity were consistently associated with
diabetes across populations (Table 2-1).
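The age-standardized incidence rates compared above are directly standardized rates: each age-specific rate is weighted by the share of the standard population in that age group. The sketch below shows the calculation in Python. The function name and the two-band example are hypothetical and are not study data; the actual analysis used 5-year age groups and the 2000 US standard population.

```python
def age_standardized_rate(cases, person_years, std_weights, per=100_000):
    """Directly standardized rate: weighted mean of age-specific rates.

    cases, person_years and std_weights each hold one entry per age group.
    """
    if not (len(cases) == len(person_years) == len(std_weights)):
        raise ValueError("all inputs must have one entry per age group")
    total_weight = sum(std_weights)
    rate = sum(
        (w / total_weight) * (d / py)
        for d, py, w in zip(cases, person_years, std_weights)
    )
    return rate * per


# Hypothetical two-band example (made-up numbers, not study data).
example_rate = age_standardized_rate(
    cases=[2, 6], person_years=[10_000, 10_000], std_weights=[0.75, 0.25]
)
```

Because the same weights are applied to every group, differences between standardized rates cannot be explained by differences in age structure.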
Table 2-1 Baseline characteristics of the participants, by ethnicity and by diabetes status
Columns, left to right (Diab / Non-diab within each group): European Americans, African Americans, Native Hawaiians, Japanese Americans, Latinos
Characteristics Diab Non-diab Diab Non-diab Diab Non-diab Diab Non-diab Diab Non-diab
Men
No. of subjects 2157 20471 2573 9944 1341 4897 4297 22059 4692 17047
(%) (9.5) (90.5) (20.6) (79.4) (21.5) (78.5) (16.3) (83.7) (21.6) (78.4)
Mean age 62.7 59.2 64 62.1 58.8 56.8 63.4 61.2 61.9 60.3
(SD) (8.4) (9.1) (8.1) (9) (8.2) (8.7) (8.5) (9.3) (7.2) (7.9)
No. of CRC cases 41 318 67 256 24 93 134 578 105 305
CRC incidence rates^a 126.1 106.9 160.2 139.6 137.7 155.7 164.4 164.3 125.1 99.1
Family history of CRC, %^b 7.6 8 7.4 6.5 6.1 6.5 9.9 9.8 3.8 3.7
BMI (kg/m²), %^c
≥23-25 11.1 21.8 10.6 17.4 7.6 14.2 20.3 26.8 12.3 17
<23 7.4 18.2 7.9 14.7 4.9 11.4 15.9 25.9 7.4 11.1
≥25-30 43.3 45.6 43.3 46.4 38.3 45.9 46.2 40.8 47.8 53.9
≥30-35 24.5 11.4 24.6 14.5 27.6 19.5 13.5 5.5 24 14
≥35 13.1 2.7 9.9 3.9 19.9 8 3.9 0.8 7.7 3.2
missing 0.5 0.4 3.3 3.1 1.7 1 0.2 0.2 0.8 0.9
Educational level, %^c
≤12 years 32.6 23 41.1 40.4 60.5 53.4 36.7 35 64.9 63.4
Some college or vocational 30.6 28.5 36.7 35.8 25.8 28 33.2 30.1 23 22.8
College graduate 36.3 47.7 20.7 22.2 12.8 17.4 29.5 34.3 10.4 12
missing 0.5 0.8 1.1 1.6 0.9 1.2 0.7 0.6 1.7 1.8
Physical activity (hours/day), %^c
never 38.9 27.9 41.9 34.5 29.2 22.3 39.2 33 36.3 29
≤0.21 16.3 14.4 18.1 15.1 14.3 13.2 18.9 18.1 14.4 14.2
>0.21-0.71 20.9 23.4 17.3 21.3 24 25.6 21.8 23.4 18.6 20.8
>0.71 20.3 31.5 17.4 24.2 28.3 35.6 17.4 23 25.2 30.8
missing 3.6 2.8 5.2 4.9 3.7 3.4 2.7 2.5 5.5 5.1
NSAIDs use, %^c
no 42.3 49.8 46.8 51.2 55.2 61.1 58.4 67.9 53.3 55.7
yes 55.1 48 48.6 44.2 41.6 35.4 39.4 30.2 41.3 38.8
missing 2.7 2.1 4.6 4.6 3.2 3.5 2.2 1.8 5.4 5.5
Women
No. of subjects 2065 24355 4372 17481 1548 6587 3629 26024 4657 18946
(%) (7.8) (92.2) (20) (80) (19) (81) (12.2) (87.8) (19.7) (80.3)
Mean age 62.1 59.2 63 61 58.1 56.4 63.2 61.2 61.4 59.3
(SD) (8.4) (9) (8.5) (9.2) (8.5) (8.8) (8.6) (9) (7.3) (7.8)
No. of CRC cases 38 267 111 357 16 72 90 395 68 214
CRC incidence rates^a 107.5 70.1 138.6 110.9 69.3 80 179.2 95.8 81.6 59.7
Family history of CRC, %^b 8.8 9.2 8.7 8.4 7.3 7.2 11.1 11.1 4.5 5.1
BMI (kg/m²), %^c
≥23-25 9.6 19.5 6.1 14.3 8.1 16.6 16.8 20.7 10.2 18.2
<23 11.1 36.4 5.3 13.8 7.8 20.4 25.8 50.5 6.9 17.2
≥25-30 29.5 28 28.4 36.1 30.1 32.7 36.4 21.8 35 39.5
≥30-35 24.4 10.2 26.7 20.1 24.9 17.3 14.9 4.3 28.1 16.5
≥35 22.9 4.9 27 11.7 24.1 10.3 4.4 1 17 6.8
missing 2.5 1 5.8 4.1 5 2.8 1.7 1.8 2.7 1.8
Educational level, %^c
≤12 years 43.6 30.2 46.4 38.5 65.2 58.7 41.8 39.1 75.4 70.7
Some college or vocational 30.1 33.2 33.8 37.2 22.6 26.1 31.2 29.6 16.6 18.9
College graduate 25.4 35.7 17.6 22.8 11.3 14.1 26.3 30.5 6.1 8.2
missing 0.9 0.9 1.5 1.5 0.9 1.1 0.7 0.8 1.9 2.1
Physical activity (hours/day), %^c
never 57.9 48.5 60.3 56.1 47.7 43.7 61.7 59.5 58.1 53.2
≤0.21 15.6 15.5 15.3 17 17.9 16.8 16.2 15.8 13.5 15.3
>0.21-0.71 12 16.9 9.7 11.5 15.8 19.4 12.2 13.5 10.4 12.3
>0.71 9.3 14.6 5.3 7.3 11.5 15.1 6.3 7.5 7.1 9.6
missing 5.1 4.5 9.3 8.1 6.4 5 3.6 3.7 10.8 9.6
NSAIDs use, %^c
no 49.6 54.9 45.1 50.2 58.6 63.2 68.9 75 49.4 53.7
yes 46.7 41.9 46.9 43 34.5 31.5 27.7 22.1 41.5 38.5
missing 3.7 3.2 8 6.8 6.9 5.4 3.4 2.9 9.1 7.8
HRT use
Never 48.8 42.3 59.4 57.7 63.6 55.9 51.4 49.2 57.9 54.8
Past 20.9 17.6 20.1 18.7 15.1 16.1 14.7 12.5 17.7 16.6
Current 26.2 37.1 13.2 17.7 15.4 23.5 30.3 34.8 15.8 19.9
Missing 4.1 3 7.3 6 5.9 4.5 3.6 3.4 8.6 8.8
^a Age standardized (5-year age groups) to the 2000 US standard population aged 40 years or older, per 100,000 person-years.
^b Age standardized (5-year age groups) to the total population included in the study; subjects who were missing
information for family history of colorectal cancer were counted as having no family history.
^c Age standardized (5-year age groups) to the total population included in the study.
^d Age standardized (5-year age groups) to the total women included in the study.
The associations noted by race and sex were not materially modified after adjustment for
potential risk factors for colorectal cancer; none of the risk factors served as strong
independent confounders of the association between colorectal cancer risk and diabetes
(Table 2-2). In multivariate analyses, the relative risk of colorectal cancer in diabetics
was 19% higher compared to non-diabetics (RR=1.19, 95% CI: 1.09-1.29, P<0.001;
Table 2-2). The relative risk was greater and more significant in women (RR=1.28, 95% CI:
1.12-1.46, P<0.001) than in men (RR=1.12, 95% CI: 0.99-1.26, P=0.063;
P_Interaction=0.071). A positive association between diabetes and colorectal cancer risk was
noted in all ethnic groups except among Native Hawaiians, with the relative risk
estimates ranging from 1.16 in European Americans (95% CI: 0.91-1.48, P=0.24) and
African Americans (95% CI: 0.97-1.38, P=0.097) to 1.27 in Japanese Americans (95% CI:
1.09-1.47, P=0.002). Risk was non-significantly lower in Native Hawaiians (RR=0.89, 95%
CI: 0.62-1.27, P=0.52); however, a test of heterogeneity suggested no significant
difference in the risk estimates across ethnic groups (P_Interaction=0.43) (Table 2-2).
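The relationship between a reported relative risk, its 95% confidence interval and its P value can be checked with a Wald calculation on the log scale. The sketch below is illustrative only and is not the authors' code; the function name is hypothetical, and a normal approximation for the log relative risk is assumed. The inputs are the multivariate estimates quoted above for all subjects and for men.

```python
import math

Z_95 = 1.959964  # normal quantile for a two-sided 95% interval


def wald_from_ci(rr, lower, upper):
    """Return (SE of log RR, z statistic, two-sided P) from an RR and its 95% CI."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    z = math.log(rr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return se, z, p


se_all, z_all, p_all = wald_from_ci(1.19, 1.09, 1.29)  # all subjects
se_men, z_men, p_men = wald_from_ci(1.12, 0.99, 1.26)  # men only
```

Because the published estimates are rounded to two decimals, P values recovered this way only approximate the reported ones.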
In analyses stratified by cancer site (colon vs. rectum) (Table 2-2), risk of colon cancer
was significantly elevated in diabetics (RR=1.20, 95% CI: 1.09-1.32, P<0.0001); rectal cancer
risk was also elevated, although not significantly (RR=1.15, 95% CI: 0.96-1.36, P=0.12). Tests of
heterogeneity by ethnicity were not significant for colon cancer (P_Interaction=0.30) or for
rectal cancer (P_Interaction=0.18).
However, for rectal cancer, a particularly strong association between cancer risk and
diabetes was observed only in Latinos (RR=1.55, 95% CI: 1.13-2.14, P=0.007) (Table 2-
2) with the association noted in Latino men (RR=1.51, 95% CI: 1.00-2.27, P=0.050) and
women (RR=1.61, 95% CI: 0.97-2.69, P=0.067). The risk of cancer associated with
diabetes was elevated and similar in magnitude for cancer in the right colon (RR=1.23, 95%
CI: 1.08-1.41, P=0.002) and in the left colon (RR=1.18, 95% CI: 1.01-1.38, P=0.040).
For both locations, tests of heterogeneity suggested no difference across ethnic groups
(Table 2-2).
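A rough check of heterogeneity across ethnic groups can be made from the published estimates alone using Cochran's Q on the log relative risks. This is a swapped-in approximation, not the interaction test used in the analysis (which was based on the fitted Cox models), so its P value will not match the reported one exactly. The sketch uses the multivariate estimates for all colorectal cancer from Table 2-2; the function names are hypothetical.

```python
import math

Z_95 = 1.959964  # normal quantile for a two-sided 95% interval

# Multivariate RR and 95% CI for all colorectal cancer, by ethnic group (Table 2-2).
ESTIMATES = {
    "European Americans": (1.16, 0.91, 1.48),
    "African Americans": (1.16, 0.97, 1.38),
    "Native Hawaiians": (0.89, 0.62, 1.27),
    "Japanese Americans": (1.27, 1.09, 1.47),
    "Latinos": (1.21, 1.02, 1.46),
}


def cochran_q(estimates):
    """Return (inverse-variance pooled log RR, Q statistic, degrees of freedom)."""
    betas, weights = [], []
    for rr, lower, upper in estimates.values():
        se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
        betas.append(math.log(rr))
        weights.append(1.0 / se ** 2)
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    return pooled, q, len(betas) - 1


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


pooled_log_rr, q_stat, df = cochran_q(ESTIMATES)
p_het = chi2_sf_even_df(q_stat, df)
```

With five groups the statistic has four degrees of freedom, which is why the even-df closed form is sufficient here and no statistics library is needed.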
In analyses by stage (Table 2-2), there were significant increases in risk for both
regional/distant cancer and localized cancer (RR=1.21, 95% CI: 1.06-1.39, P=0.005 for
localized cancer; RR=1.14, 95% CI: 1.01-1.30, P=0.036 for regional/distant cancer). No
significant heterogeneity across ethnic groups was observed by stage. For regional/distant
colorectal cancer, the relative risk was similar in men (RR=1.15, 95% CI: 0.97-1.36,
P=0.10) and women (RR=1.14, 95% CI: 0.94-1.37, P=0.19); while for localized cancer,
there was no significant increase in risk observed in diabetic men (RR=1.06, 95% CI:
0.88-1.27, P=0.55), but the risk was significantly increased in diabetic women (RR=1.44,
95% CI: 1.18-1.76, P<0.001; P_Interaction=0.0098) (Table 2-2).
Table 2-2 Age-adjusted^a and Multivariate^b Relative Risks of Colorectal Cancer Associated with Diabetes Status According to Ethnic
or Racial Group, Site and Stage of disease in the Multiethnic Cohort, Los Angeles, CA and Hawaii
Columns, left to right: European Americans, African Americans, Native Hawaiians, Japanese Americans, Latinos, All, P_Int^1
Each estimate row gives RR (95% CI); the row beneath it gives the corresponding P values.
All
No. Cases 664 791 205 1197 692 3549
Age-adj 1.26(0.99-1.59) 1.16(0.98-1.37) 0.91(0.64-1.29) 1.29(1.12-1.50) 1.24(1.05-1.48) 1.41(1.27-1.57) 0.43
0.058 0.084 0.59 0.001 0.013 <0.001
Multi 1.16(0.91-1.48) 1.16(0.97-1.38) 0.89(0.62-1.27) 1.27(1.09-1.47) 1.21(1.02-1.46) 1.19(1.09-1.29) 0.43
0.24 0.097 0.52 0.002 0.03 <0.001
Men
No. Cases 359 323 117 712 410 1921
Age-adj 1.07(0.77-1.48) 1.01(0.77-1.33) 0.95(0.60-1.49) 1.15(0.95-1.38) 1.25(1.00-1.56) 1.12(1.00-1.26) 0.66
0.69 0.93 0.82 0.15 0.048 0.048
Multi 0.96(0.69-1.35) 1.02(0.77-1.34) 0.97(0.60-1.55) 1.15(0.95-1.39) 1.31(1.04-1.64) 1.12(0.99-1.26) 0.65
0.83 0.91 0.9 0.16 0.023 0.063
Women
No. Cases 305 468 88 485 282 1628
Age-adj 1.52(1.08-2.13) 1.27(1.02-1.57) 0.89(0.52-1.54) 1.56(1.24-1.97) 1.25(0.95-1.64) 1.34(1.19-1.52) 0.32
0.017 0.03 0.69 <0.001 0.11 <0.001
Multi 1.48(1.03-2.12) 1.27(1.02-1.58) 0.83(0.47-1.46) 1.5(1.18-1.91) 1.09(0.82-1.45) 1.28(1.12-1.46) 0.32
0.033 0.036 0.51 0.001 0.54 <0.001
Colon
No. Cases 501 642 144 847 494 2628
Age-adj 1.33(1.02-1.73) 1.25(1.04-1.50) 0.87(0.57-1.32) 1.35(1.14-1.60) 1.13(0.91-1.39) 1.24(1.12-1.36) 0.3
0.035 0.018 0.5 0.001 0.26 <0.001
Multi 1.2(0.92-1.59) 1.24(1.02-1.49) 0.84(0.54-1.30) 1.34(1.13-1.60) 1.1(0.88-1.36) 1.2(1.09-1.32) 0.3
0.18 0.028 0.43 0.001 0.4 <0.001
Right colon
No. Cases 307 397 68 423 269 1464
Age-adj 1.4(1.00-1.95) 1.18(0.93-1.49) 1.06(0.60-1.88) 1.47(1.16-1.86) 1.08(0.81-1.43) 1.26(1.11-1.43) 0.45
0.05 0.17 0.84 0.001 0.61 <0.001
Multi 1.32(0.93-1.87) 1.17(0.92-1.49) 1.17(0.64-2.14) 1.46(1.14-1.85) 1.07(0.79-1.43) 1.23(1.08-1.41) 0.47
0.12 0.21 0.6 0.002 0.67 0.002
Left colon
No. Cases 182 223 71 406 209 1091
Age-adj 1.34(0.86-2.07) 1.39(1.03-1.88) 0.68(0.36-1.31) 1.26(0.98-1.63) 1.15(0.84-1.59) 1.23(1.06-1.43) 0.36
0.2 0.032 0.25 0.069 0.39 0.008
Multi 1.14(0.72-1.80) 1.38(1.01-1.89) 0.61(0.31-1.18) 1.27(0.98-1.64) 1.1(0.79-1.53) 1.18(1.01-1.38) 0.34
0.58 0.044 0.14 0.077 0.59 0.04
Rectum
No. Cases 163 149 61 350 198 921
Age-adj 1.03(0.61-1.73) 0.81(0.53-1.25) 1.02(0.55-1.88) 1.16(0.88-1.54) 1.57(1.15-2.13) 1.17(0.99-1.38) 0.15
0.91 0.34 0.96 0.29 0.004 0.071
Multi 1(0.59-1.71) 0.85(0.54-1.32) 1.01(0.53-1.91) 1.1(0.83-1.47) 1.55(1.13-2.14) 1.15(0.96-1.36) 0.18
1 0.47 0.99 0.51 0.007 0.12
Localized
No. Cases 277 278 102 551 252 1460
Age-adj 1.39(0.97-1.97) 1.31(1.00-1.72) 0.97(0.60-1.57) 1.33(1.07-1.65) 1.18(0.88-1.57) 1.27(1.11-1.44) 0.73
0.07 0.053 0.89 0.009 0.27 <0.001
Multi 1.22(0.84-1.75) 1.32(1.00-1.76) 0.95(0.57-1.57) 1.26(1.01-1.56) 1.19(0.88-1.61) 1.21(1.06-1.39) 0.7
0.3 0.052 0.83 0.04 0.25 0.005
Regional/distant
No. Cases 317 377 97 579 316 1686
Age-adj 1.15(0.81-1.63) 0.95(0.74-1.23) 0.92(0.55-1.51) 1.25(1.01-1.54) 1.39(1.09-1.78) 1.17(1.03-1.32) 0.19
0.45 0.71 0.73 0.042 0.009 0.013
Multi 1.07(0.74-1.54) 0.95(0.73-1.24) 0.87(0.52-1.47) 1.24(1.00-1.54) 1.33(1.03-1.72) 1.14(1.01-1.30) 0.21
0.72 0.73 0.61 0.052 0.029 0.036
^a Adjusted for sex (except in models for men/women), age, and race (except in models for each ethnic group), stratified on age at
baseline questionnaire.
^b Further adjusted for BMI, smoking, NSAIDs use, education, alcohol intake, saturated fat intake, unsaturated fat intake, dietary fiber
intake, physical activity, family history of colorectal cancer, stratified on age at baseline questionnaire.
^1 P value for interaction between ethnic group and diabetes.
We observed no significant difference in the association between diabetes status and
colorectal cancer risk by body mass index (≥25 kg/m²: RR=1.21, 95% CI: 1.09-1.34;
<25 kg/m²: RR=1.23, 95% CI: 1.05-1.45; P_Interaction=0.14; Table 2-3). Nor were
significant differences noted in the association when stratified by age, NSAIDs use, or
HRT use (among women). However, we found that the association between diabetes
status and colorectal cancer risk differed significantly by smoking status
(P_Interaction=0.0044). The association was strongest in never smokers, with a relative risk of
1.32 (95% CI: 1.15-1.53, P<0.001); in past smokers, the relative risk was 1.19 (95% CI:
1.05-1.34, P=0.007); no positive association was observed in current smokers (RR=0.90,
95% CI: 0.70-1.15, P=0.40). This pattern was generally consistent in both sexes and in all
colorectal cancer subgroups, and a significant interaction between tobacco use and
diabetes status was observed in men, and for regional/distant colorectal cancer and rectal
cancer. We observed no heterogeneity by ethnicity within a defined stratum for any risk
factor (e.g. current smokers; Table 2-3).
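The contrast between smoking strata can be illustrated with a simple two-sample z-test on the log relative risks, using the stratum-specific estimates reported above for never and current smokers. This is a hedged sketch rather than the test actually used: the reported interaction P value comes from the model across all three smoking categories, whereas this pairwise comparison assumes independent strata and a normal approximation. The function names are hypothetical.

```python
import math

Z_95 = 1.959964  # normal quantile for a two-sided 95% interval


def log_rr_and_se(rr, lower, upper):
    """Log relative risk and its standard error recovered from a 95% CI."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    return math.log(rr), se


def compare_strata(stratum_a, stratum_b):
    """z statistic and two-sided P for the difference between two log RRs."""
    beta_a, se_a = log_rr_and_se(*stratum_a)
    beta_b, se_b = log_rr_and_se(*stratum_b)
    z = (beta_a - beta_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Never smokers vs. current smokers: (RR, lower, upper) from the text.
z_smoke, p_smoke = compare_strata((1.32, 1.15, 1.53), (0.90, 0.70, 1.15))
```

The pairwise result points the same way as the reported interaction test: the association in never smokers differs from that in current smokers by more than chance alone would readily explain.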
We also investigated the relationship between time since diagnosis with diabetes and
colorectal cancer risk. Compared to diabetics diagnosed in or after 1994 (short duration), those
diagnosed before 1994 (long duration) were at non-significantly lower risk of colorectal cancer
(RR_long-duration vs. short-duration = 0.89, 95% CI: 0.69-1.14, P=0.35).
Table 2-3 Multivariate Relative Risks^a of Colorectal Cancer Associated with Diabetes Status According to Risk Factors in the
Multiethnic Cohort, Los Angeles, CA and Hawaii
Columns, left to right: European Americans, African Americans, Native Hawaiians, Japanese Americans, Latinos, All, P_Int^1
BMI (kg/m²)
<25
No. cases 266 205 46 695 184 1396
OR(95% CI) 0.86(0.46-1.58) 1.63(1.11-2.38) 1.67(0.68-4.13) 1.16(0.92-1.45) 1.36(0.92-2.01) 1.23(1.05-1.45) 0.48
P value 0.62 0.012 0.27 0.21 0.12 0.012
≥ 25
No. cases 389 557 156 493 493 2088
OR(95% CI) 1.31(1.00-1.71) 1.08(0.88-1.31) 0.86(0.58-1.27) 1.43(1.17-1.76) 1.19(0.97-1.46) 1.21(1.09-1.34) 0.13
P value 0.047 0.46 0.45 <0.001 0.091 <0.001
P_Int^2 0.14
Age
<60
No. cases 167 227 89 290 199 972
OR(95% CI) 1.22(0.70-2.14) 0.93(0.65-1.33) 1.08(0.65-1.82) 1.23(0.89-1.72) 1.00(0.69-1.45) 1.08(0.91-1.29) 0.63
P value 0.48 0.69 0.76 0.21 1 0.39
60-69
No. cases 281 299 89 507 363 1539
OR(95% CI) 1.37(0.98-1.92) 1.31(1.00-1.71) 0.82(0.47-1.42) 1.43(1.14-1.78) 1.37(1.08-1.74) 1.33(1.17-1.50) 0.19
P value 0.066 0.053 0.48 0.002 0.009 <0.001
≥ 70
No. cases 216 265 27 400 130 1038
OR(95% CI) 0.82(0.51-1.32) 1.17(0.87-1.57) 0.61(1.17-2.19) 1.06(0.81-1.38) 1.15(0.76-1.74) 1.05(0.89-1.23) 0.38
P value 0.42 0.3 0.44 0.67 0.5 0.58
P_Int^2 0.059
Smoking
Never
No. cases 211 287 60 484 259 1301
OR(95% CI) 1.03(0.65-1.63) 1.35(1.02-1.79) 1.07(0.56-2.05) 1.54(1.22-1.95) 1.25(0.93-1.68) 1.32(1.15-1.53) 0.55
P value 0.9 0.038 0.83 <0.001 0.14 <0.001
Past
No. cases 325 347 99 528 327 1626
OR(95% CI) 1.38(1.00-1.90) 1.04(0.81-1.35) 0.75(0.45-1.26) 1.30(1.05-1.61) 1.21(0.94-1.56) 1.19(1.05-1.34) 0.18
P value 0.047 0.74 0.28 0.016 0.14 0.007
Current
No. cases 121 151 43 175 93 583
OR(95% CI) 0.68(0.31-1.49) 1.14(0.74-1.76) 1.18(0.49-2.84) 0.54(0.31-0.94) 1.18(0.71-1.98) 0.90(0.70-1.15) 0.25
P value 0.33 0.54 0.71 0.029 0.52 0.4
P_Int^2 0.0044
NSAIDs
No
No. cases 349 387 123 837 368 2064
OR(95% CI) 1.05(0.73-1.50) 1.06(0.81-1.37) 0.83(0.51-1.35) 1.35(1.13-1.63) 1.19(0.92-1.52) 1.17(1.04-1.32) 0.24
P value 0.81 0.69 0.45 0.001 0.18 0.008
Yes
No. cases 291 352 75 324 283 1325
OR(95% CI) 1.25(0.88-1.78) 1.27(0.99-1.63) 0.96(0.55-1.68) 1.09(0.83-1.43) 1.27(0.97-1.68) 1.20(1.04-1.37) 0.94
P value 0.21 0.058 0.88 0.54 0.085 0.1
P_Int^2 0.76
Menopausal status and HRT use
b
Pre
No. cases 12 21 13 29 13 88
OR(95% CI) 8.74(1.30-58.70) 0.47(0.088-2.55) 0.79(0.15-4.26) 4.83(1.95-11.95) 2.83(0.40-19.91) 2.04(1.17-3.57) 0.04
P value 0.026 0.38 0.79 0.001 0.3 0.012
Post never
No. cases 124 209 40 212 136 721
OR(95% CI) 1.34(0.79-2.27) 1.60(1.17-2.19) 0.57(0.24-1.37) 1.42(0.99-2.03) 1.01(0.67-1.52) 1.30(1.08-1.57) 0.26
P value 0.28 0.003 0.21 0.059 0.96 0.005
Post past
No. cases 74 116 16 65 52 323
OR(95% CI) 1.42(0.67-2.97) 1.00(0.62-1.62) 1.65(0.28-9.76) 0.91(0.45-1.85) 1.12(0.57-2.17) 1.08(0.80-1.45) 0.99
P value 0.36 0.99 0.58 0.8 0.75 0.6
Post current
No. cases 76 64 13 147 46 346
OR(95% CI) 0.85(0.30-2.43) 1.44(0.75-2.78) 1.09(0.15-7.91) 1.45(0.91-2.32) 1.46(0.70-3.04) 1.35(0.99-1.83) 0.76
P value 0.77 0.28 0.93 0.12 0.32 0.057
P_Int^2 0.08
^a Adjusted for sex (except in models for menopausal status and HRT use), age, race (except in models for each ethnic group), BMI
(except in models for BMI), smoking (except in models for smoking), NSAIDs use (except in models for NSAIDs use), education,
alcohol intake, saturated fat intake, unsaturated fat intake, dietary fiber intake, physical activity and family history of colorectal
cancer, stratified on age at baseline questionnaire.
^b Pre: premenopausal; Post never: postmenopausal HRT never users; Post past: postmenopausal HRT past users; Post current:
postmenopausal HRT current users. Only includes women.
^1 P value for interaction between ethnic group and diabetes.
^2 P value for interaction between BMI, age, smoking, NSAIDs use, menopausal status & HRT use and diabetes.
2.2.6 Discussion
In the prospective analysis of five racial/ethnic populations in the Multiethnic Cohort, we
found a highly statistically significant association between diabetes status and colorectal
cancer incidence, with diabetics having 19% greater risk of developing colorectal cancer
than non-diabetics after adjusting for known risk factors. The risk associated with
diabetes was slightly greater in women (RR=1.28) than in men (RR=1.12)
(P_Interaction=0.071), and the positive association was observed in all populations except
Native Hawaiians, which was the smallest group.
Only a small number of studies have investigated the relationship between diabetes and
colorectal cancer risk in non-European populations and even fewer have closely
examined the association by colorectal cancer site, stage and other known risk factors for
colorectal cancer across multiple ethnic groups. Our findings in Japanese Americans
suggested that risk of colorectal cancer was elevated in diabetics in both men and women
(RR=1.15, 95% CI=0.95-1.39 in men and RR=1.49, 95% CI=1.18-1.91 in women),
which is consistent with previous reports in Koreans and Singapore Chinese (Jee et al,
2005; Seow et al, 2006). In contrast to a previous report of no association with colon or
rectal cancer among African Americans (Vinikoor et al, 2009), we found that diabetes
was significantly associated with colon cancer risk in African Americans (RR=1.24, 95%
CI=1.02-1.49), but not with rectal cancer risk (RR=0.85, 95% CI=0.54-1.32). We also
found no evidence of an association of diabetes and colorectal cancer risk in Native
Hawaiians, which has not been reported before.
In analyses by cancer site, risk was significantly increased for colon cancer, cancer of the
right colon and left colon. Non-significant overall positive associations were observed for
rectal cancer. For cancer of the rectum, although the test for interaction suggested no
heterogeneity in the association across ethnic groups, we did notice that the risk of
rectal cancer associated with diabetes was significantly elevated in Latinos (RR=1.55; RR=1.51
in men and RR=1.61 in women), but not in any of the other ethnic groups. In analyses by
stage, significant increases in risks were noted for both regional/distant and localized
disease. Risks within tumor subgroups were generally consistent across ethnic groups.
The relationship between diabetes status and colorectal cancer was significantly modified
by smoking status. In contrast to previous reports of greater colorectal cancer risk in
current and former smokers than never smokers among diabetics (Gibbs et al, 2007;
Limburg et al, 2006), the association in our study was stronger in never smokers
(RR=1.32) than in past smokers (RR=1.19), while no positive association was observed
in current smokers (RR=0.90). With smoking being a risk factor for colorectal cancer, it
is possible that the association between diabetes and colorectal cancer risk may be more
apparent in those at lower risk. Additional studies will be needed to better understand this
observation.
In this study, diabetes status was based on self-report, which may have led to some
misclassification of diabetics as non-diabetics. This underreporting may have resulted in
an underestimate of the magnitude of the association between diabetes and colorectal
cancer risk. However, previous studies have shown that self-reported responses for many
common chronic diseases such as diabetes are reliable when compared with medical
records (Midthjell et al, 1992; Okura et al, 2004; Walitt et al, 2008). We could not
differentiate between cases with Type 1 and Type 2 diabetes; however, we expect this
misclassification to be minimal as Type 1 diabetes is uncommon. Another limitation of
our study is that the analysis did not account for incident cases of diabetes over the
follow-up period. Incident cases of diabetes in the non-diabetic group during the follow-
up would make the two groups more similar and therefore attenuate the true association
between diabetes and colorectal cancer risk.
With ≤ 13 years of follow-up, we were unable to address the effects of long-standing
diabetes on colorectal cancer risk. However, we observed that diabetics diagnosed before
1994 were at lower risk of having colorectal cancer compared to those who were diagnosed
with diabetes in 1994 or later (RR=0.89). These findings support results from previous
studies (Hu et al, 1999; La Vecchia et al, 1997; Le Marchand et al, 1997) and the
hyperinsulinemia hypothesis, according to which risk of colorectal cancer should be
lower in later years of diabetes as a result of hypoinsulinemia.
In this large, multiethnic prospective study, we observed consistent positive associations
between diabetes and colorectal cancer risk in African Americans, Latinos, Japanese and
European Americans. The lack of positive association between diabetes and colorectal
cancer risk in Native Hawaiians is consistent with the relatively low rate of colorectal
cancer in this group despite a high rate of diabetes (Giovannucci, 1995; Grandinetti et al,
2007). Additional follow-up and larger studies will be needed to confirm the apparent
inverse association in Native Hawaiians. These findings provide strong support for the
hypothesis that diabetes is a risk factor for colorectal cancer.
2.2.7 Acknowledgements
This work was supported by the National Cancer Institute [grant number CA54281]. We
are indebted to the participants of the Multiethnic Cohort for their ongoing commitment
to the study.
2.2.8 References
Gibbs P, Steel S, McLaughlin S, Jones I, Faragher I, Skinner I, Croxford M, Johns J,
Chapman M, Lipton L (2007) Type 2 diabetes mellitus, smoking, and colorectal cancer.
Am J Gastroenterol 102: 909-10
Giovannucci E (1995) Insulin and colon cancer. Cancer Causes Control 6: 164-79
Grandinetti A, Kaholokula JK, Theriault AG, Mor JM, Chang HK, Waslien C (2007)
Prevalence of diabetes and glucose intolerance in an ethnically diverse rural community
of Hawaii. Ethn Dis 17: 250-5
Hu FB, Manson JE, Liu S, Hunter D, Colditz GA, Michels KB, Speizer FE, Giovannucci
E (1999) Prospective study of adult onset diabetes mellitus (type 2) and risk of colorectal
cancer in women. J Natl Cancer Inst 91: 542-7
Jee SH, Ohrr H, Sull JW, Yun JE, Ji M, Samet JM (2005) Fasting serum glucose level
and cancer risk in Korean men and women. JAMA 293: 194-202
Jenab M, Riboli E, Cleveland RJ, Norat T, Rinaldi S, Nieters A, Biessy C, Tjonneland A,
Olsen A, Overvad K, Gronbaek H, Clavel-Chapelon F, Boutron-Ruault MC, Linseisen J,
Boeing H, Pischon T, Trichopoulos D, Oikonomou E, Trichopoulou A, Panico S, Vineis
P, Berrino F, Tumino R, Masala G, Peters PH, van Gils CH, Bueno-de-Mesquita HB,
Ocke MC, Lund E, Mendez MA, Tormo MJ, Barricarte A, Martinez-Garcia C,
Dorronsoro M, Quiros JR, Hallmans G, Palmqvist R, Berglund G, Manjer J, Key T, Allen
NE, Bingham S, Khaw KT, Cust A, Kaaks R (2007) Serum C-peptide, IGFBP-1 and
IGFBP-2 and risk of colon and rectal cancers in the European Prospective Investigation
into Cancer and Nutrition. Int J Cancer 121: 368-76
Kaaks R, Toniolo P, Akhmedkhanov A, Lukanova A, Biessy C, Dechaud H, Rinaldi S,
Zeleniuch-Jacquotte A, Shore RE, Riboli E (2000) Serum C-peptide, insulin-like growth
factor (IGF)-I, IGF-binding proteins, and colorectal cancer risk in women. J Natl Cancer
Inst 92: 1592-600
La Vecchia C, Negri E, Decarli A, Franceschi S (1997) Diabetes mellitus and colorectal
cancer risk. Cancer Epidemiol Biomarkers Prev 6: 1007-10
Le Marchand L, Wilkens LR, Kolonel LN, Hankin JH, Lyu LC (1997) Associations of
sedentary lifestyle, obesity, smoking, alcohol use, and diabetes with the risk of colorectal
cancer. Cancer Res 57: 4787-94
Levi F, Pasche C, Lucchini F, La Vecchia C (2002) Diabetes mellitus, family history, and
colorectal cancer. J Epidemiol Community Health 56: 479-80; author reply 480
Limburg PJ, Vierkant RA, Fredericksen ZS, Leibson CL, Rizza RA, Gupta AK, Ahlquist
DA, Melton LJ, 3rd, Sellers TA, Cerhan JR (2006) Clinically confirmed type 2 diabetes
mellitus and colorectal cancer risk: a population-based, retrospective cohort study. Am J
Gastroenterol 101: 1872-9
Ma J, Giovannucci E, Pollak M, Leavitt A, Tao Y, Gaziano JM, Stampfer MJ (2004) A
prospective study of plasma C-peptide and colorectal cancer risk in men. J Natl Cancer
Inst 96: 546-53
McKeown-Eyssen G (1994) Epidemiology of colorectal cancer revisited: are serum
triglycerides and/or plasma glucose associated with risk? Cancer Epidemiol Biomarkers
Prev 3: 687-95
Midthjell K, Holmen J, Bjorndal A, Lund-Larsen G (1992) Is questionnaire information
valid in the study of a chronic disease such as diabetes? The Nord-Trondelag diabetes
study. J Epidemiol Community Health 46: 537-42
Nilsen TI, Vatten LJ (2001) Prospective study of colorectal cancer risk and physical
activity, diabetes, blood glucose and BMI: exploring the hyperinsulinaemia hypothesis.
Br J Cancer 84: 417-22
Okura Y, Urban LH, Mahoney DW, Jacobsen SJ, Rodeheffer RJ (2004) Agreement
between self-report questionnaires and medical record data was substantial for diabetes,
hypertension, myocardial infarction and stroke but not for heart failure. J Clin Epidemiol
57: 1096-103
Otani T, Iwasaki M, Sasazuki S, Inoue M, Tsugane S (2007) Plasma C-peptide, insulin-
like growth factor-I, insulin-like growth factor binding proteins and risk of colorectal
cancer in a nested case-control study: the Japan public health center-based prospective
study. Int J Cancer 120: 2007-12
Renehan AG, Zwahlen M, Minder C, O'Dwyer ST, Shalet SM, Egger M (2004) Insulin-
like growth factor (IGF)-I, IGF binding protein-3, and cancer risk: systematic review and
meta-regression analysis. Lancet 363: 1346-53
Sandhu MS, Dunger DB, Giovannucci EL (2002) Insulin, insulin-like growth factor-I
(IGF-I), IGF binding proteins, their biologic interactions, and colorectal cancer. J Natl
Cancer Inst 94: 972-80
Schoen RE, Weissfeld JL, Kuller LH, Thaete FL, Evans RW, Hayes RB, Rosen CJ (2005)
Insulin-like growth factor-I and insulin are associated with the presence and advancement
of adenomatous polyps. Gastroenterology 129: 464-75
Seow A, Yuan JM, Koh WP, Lee HP, Yu MC (2006) Diabetes mellitus and risk of
colorectal cancer in the Singapore Chinese Health Study. J Natl Cancer Inst 98: 135-8
Vinikoor LC, Long MD, Keku TO, Martin CF, Galanko JA, Sandler RS (2009) The
association between diabetes, insulin use, and colorectal cancer among Whites and
African Americans. Cancer Epidemiol Biomarkers Prev 18: 1239-42
Walitt BT, Constantinescu F, Katz JD, Weinstein A, Wang H, Hernandez RK, Hsia J,
Howard BV (2008) Validation of self-report of rheumatoid arthritis and systemic lupus
erythematosus: The Women's Health Initiative. J Rheumatol 35: 811-8
Will JC, Galuska DA, Vinicor F, Calle EE (1998) Colorectal cancer: another
complication of diabetes mellitus? Am J Epidemiol 147: 816-25
Yang YX, Hennessy S, Lewis JD (2005) Type 2 diabetes mellitus and the risk of
colorectal cancer. Clin Gastroenterol Hepatol 3: 587-94
2.3 Generalizability and epidemiologic characterization of eleven colorectal cancer
GWAS hits in multiple populations
2.3.1 Abstract
Background: Genome-wide association studies (GWAS) in populations of European
ancestry have identified several loci that confer an increased risk of colorectal cancer
(CRC). Methods: We studied the generalizability of the associations with 11 risk variants
for CRC on 8q23 (rs16892766), 8q24 (rs6983267), 9p24 (rs719725), 10p14 (rs10795668),
11q23 (rs3802842), 14q22 (rs4444235), 15q13 (rs4779584), 16q22 (rs9929218), 18q21
(rs4939827), 19q13 (rs10411210) and 20p12 (rs961253) in a multiethnic sample of 2,472
CRC cases, 839 adenoma cases and 4,466 controls comprised of European American,
African American, Native Hawaiian, Japanese American and Latino men and women.
Because findings for CRC and adenoma were similar, we combined both groups in the
analyses. Results: We confirmed the associations with an increased risk of CRC/adenoma
for the 8q24, 11q23 and 15q13 loci in European Americans, and observed significant
associations between the 8q24 and 20p12 loci with CRC/adenoma risk in African
Americans. Moreover, we found statistically significant cumulative effects of risk alleles
on CRC/adenoma risk in all populations (odds ratio (OR) per allele = 1.07-1.09, p≤0.039)
except in Japanese Americans (OR=1.01, p=0.52). We found heterogeneity in the
associations by tumor subsite, age of CRC/adenoma onset, sex, body mass index (BMI)
and smoking status for some of the variants. Conclusions: These results provide evidence
that the known variants are in aggregate significantly associated with CRC/adenoma risk
in multiple populations except Japanese Americans, and the influences may differ across
groups defined by clinicopathological characteristics for some variants. Impact: These
results underline the importance of studying the epidemiologic architecture of these
genetic effects in large and diverse populations.
2.3.2 Introduction
Genome-wide association studies (GWAS) have identified several common single-
nucleotide polymorphisms (SNPs) that confer risk for colorectal cancer (CRC) in
populations of European ancestry. The variant rs6983267 within the chromosome 8q24
region was the first to be identified in multiple studies (1-3). Additional studies
uncovered and validated other common risk variants, on 8q23 (rs16892766) (4), 9p24
(rs719725) (2, 5), 10p14 (rs10795668) (4), 11q23 (rs3802842) (6), 15q13 (rs4779584) (7)
and 18q21 (rs4939827) (6, 8). A recent meta-analysis of two GWAS conducted in the UK
revealed four new loci associated with CRC, on 14q22 (rs4444235), 16q22 (rs9929218),
19q13 (rs10411210) and 20p12 (rs961253) (9). Of the 11 variants, 5 are located in
proximity of genes involved in TGF-β signaling (14q22, 15q13, 18q21, 19q13, 20p12)
which highlights a previously implicated pathway important in the pathogenesis of CRC
(6-9). Each of the identified CRC risk variants is common in European populations, with
allele frequencies ranging from 7% to 49% (CEU HapMap), and has a modest effect size, with ORs of
1.10 to 1.26 per allele. Together, these 11 loci explain <10% of the heritability of CRC in
populations of European ancestry (9), which suggests that the vast majority of the
inherited variations underlying risk of CRC have yet to be found.
A number of studies have characterized associations with subsets of these loci in
populations of European ancestry (5, 10-19); however, replication efforts in other
racial/ethnic populations have been limited (3, 20-22). The risk variant at 8q24
(rs6983267) is suggested to be a biologically functional allele (23, 24) and we, and others,
have reported rs6983267 to be a pan-ethnic marker of risk in multiple populations (3, 20,
22). Testing of all risk loci for CRC in non-European populations will be required to
determine whether they may serve as markers of risk more broadly in the U.S. population
and globally. While much emphasis has been placed on the low predictive value of
common variants revealed through GWAS, modeling of their aggregate effects in
multiple populations remains a crucial next step to characterize the risk conferred by
these markers across racial/ethnic populations and identify subgroups of the population at
greater risk.
In the present study, we evaluated the independent and aggregate effects of the 11
validated risk variants for CRC in four multiethnic studies which in total include 3,311
cases of CRC or adenoma and 4,466 controls comprised of European American, African
American, Native Hawaiian, Japanese American and Latino men and women. Here we
report on the associations of these risk variants with CRC risk by racial/ethnic group,
disease subgroup (stage and anatomical subsite), and main CRC risk factors (age of onset,
sex, BMI, smoking status, and first degree family history of colorectal cancer).
2.3.3 Study Populations
The Multiethnic Cohort (MEC): The MEC includes 215,251 men and women in
Hawaii and Los Angeles (with additional African-Americans from elsewhere in
California) (25). The cohort is comprised predominantly of African Americans, Japanese
Americans, Native Hawaiians, Latinos and European Americans who entered the study
between 1993 and 1996 by completing a 26-page self-administered questionnaire that
requested detailed information about dietary habits, demographic factors, personal
behaviors, history of medical conditions, family history of cancer, and for women,
reproductive history and exogenous hormone use. The participants were between the ages
45 and 75 at recruitment. Incident cancers in the MEC are identified by linkage to
population-based Surveillance, Epidemiology and End Results (SEER) cancer registries
covering Hawaii and Los Angeles County, and to the California State Cancer Registry
covering all of California. From the registries, information about stage of disease and site
of tumor (colon versus rectum) is available. Beginning in 1994, blood samples were
collected from incident colorectal cancer cases and a random sample of MEC participants
to serve as a control pool for genetic analyses in the cohort. Starting in 2001, blood
samples were collected prospectively from ~70,000 cohort participants. Eligible cases in
this CRC case-control study consisted of men and women with incident CRC (n=1,549)
diagnosed after enrollment in the MEC through December 2005. Controls (n=2,173) were
participants without colorectal, breast or prostate cancer prior to entry into the cohort and
without a diagnosis of CRC up to December 2005. This study was approved by the
Institutional Review Boards at the University of Hawaii and the University of Southern
California.
The Los Angeles County CRC Case-Control Study (LA CRC): The LA CRC is a
population-based case-control study of CRC. Eligible cases included English-speaking
women with a histologically confirmed incident colorectal cancer, diagnosed at ages 55
to 74 years, from January 1998 through December 2002 and who were residents of Los
Angeles County at the time of diagnosis. Cases were identified from the Los Angeles
County Cancer Surveillance Program. Controls were selected from the neighborhoods
where cancer cases resided at the time of diagnosis. A personal interview and a blood
sample were obtained from each subject. A reference date was defined as 1 year before
the date of diagnosis of the case. This same reference date was used for each case’s
matched control subject. This study includes 497 cases and 507 controls that were
available for genotyping (26). This study was approved by the Institutional Review Board
at the University of Southern California.
The Hawaii Multiethnic CRC Case-Control Study (HI CRC): The HI CRC is a
population-based case-control study in Hawaii and has been described in detail
previously (27). Cases were identified through the Hawaii SEER registry and consisted of
Japanese American, European American and Native Hawaiian residents of Oahu, Hawaii,
who were newly diagnosed with colon or rectal cancer between January 1994 and August
1998. Controls were selected from participants in an ongoing population-based health
survey conducted by the Hawaii State Department of Health and from Health Care
Financing Administration participants. In-person interviews were conducted at the
subjects’ homes by trained interviewers. The median interval between diagnosis and
interview for cases was 4.5 months (25th–75th percentiles, 3.3–8.4 months). The
questionnaire included detailed information on demographics, including the race of each
grandparent; a quantitative food frequency questionnaire; a lifetime history of tobacco,
alcohol, and aspirin use; a history of recreational sports activities since age 18; a personal
history of various relevant medical conditions; a family history of CRC in parents and
siblings; information on height and weight at different ages; and for women, a history of
reproductive events and hormone use. This study includes 426 cases and 616 controls
with DNA for genetic analysis. This study was approved by the Institutional Review
Board at the University of Hawaii.
The Adenoma Study: For the adenoma study (28), two flexible-sigmoidoscopy
screening clinics were first used to recruit participants on Oahu, Hawaii. Adenoma cases
were identified either from the baseline examination at the Hawaii site of the Prostate
Lung Colorectal and Ovarian (PLCO) cancer screening trial (1996-2000) or at the Kaiser
Permanente Hawaii (KPH)’s Gastroenterology Screening Clinic (1995-2007). In addition,
from 2002 to 2007, we recruited all eligible colonoscopy patients in the KPH
Gastroenterology Department. Cases were patients with histologically confirmed first-
time colorectal adenoma(s), of Japanese, Caucasian or Hawaiian race/ethnicity. Controls
were selected among patients with a normal colorectum, and were matched to the cases
on age at exam, sex, race/ethnicity, screening date (±3 months), clinic and type of
examination (colonoscopy or flexible sigmoidoscopy). Exposure information was
collected via interview-administered questionnaires designed to obtain demographic and
lifestyle information, including lifetime histories of physical activity, tobacco smoking,
and alcohol drinking; medical history; family cancer history; and, for females,
reproductive and hormone use history. The interview also included a validated food
frequency questionnaire with 268 food items, and a detailed history of vitamin and
mineral supplement use was also taken. The median interval between endoscopy and
interview was 11.3 months (25th–75th percentiles, 5.2–21.1 months). This study includes
839 cases and 1170 controls available for genetic analysis. This study was approved by
the Institutional Review Board at the University of Hawaii.
Altogether, the present analysis included 3,311 CRC/adenoma cases and 4,466 controls
(European American (1,171/1,534), African American (382/510), Native Hawaiian
(323/472), Japanese American (1,042/1,426) and Latino (393/524)).
2.3.4 Genotyping
Genotyping was conducted by the TaqMan allelic discrimination assay (Applied
Biosystems, Foster City, CA) (29). For all SNPs, genotype call rates were >95% among
case and control groups in each population. Individuals who were missing information on
4 or more SNPs were excluded from the analysis (197 individuals were excluded).
Blinded replicates (5-10%) were included within and across all 96-well DNA plates. For
all the SNPs, the concordance rates for the duplicate QC samples were ≥99.4% in MEC,
≥99.7% in HI CRC and Adenoma study, and were 100% in the LA CRC study.
2.3.5 Statistical Analysis
Genotype frequencies in the controls were tested for departure from Hardy-Weinberg
Equilibrium (HWE) in each racial/ethnic group in each study. Odds ratios (OR) and 95%
confidence intervals (95% CI) were estimated for each variant using unconditional
logistic regression. The “low-risk” allele based on previous GWAS reports was used as
the reference allele. The effect of each SNP on disease risk was evaluated under a log-
additive mode of inheritance, as well as using separate indicator variables to allow
unrestricted effects of heterozygotes and homozygotes of the risk allele. For each SNP,
we performed a 1-df likelihood ratio test (LRT) to compare the additive to the
unrestricted model. For all SNPs, the unrestricted model which allows dominant and
recessive effects was not significantly better than the additive model. The statistical
power to detect the ORs reported in previous GWAS was calculated using Quanto
(version 1.2.4).
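As an illustration of the HWE check described above, the following is a minimal sketch of a 1-df Pearson chi-square test on genotype counts in controls. The text does not state which HWE test was used, so the chi-square form, the function name hwe_chi_square and the example counts are all assumptions made for illustration.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test (1 df) for departure from Hardy-Weinberg equilibrium.

    Inputs are genotype counts among controls (hypothetical values).
    Returns (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        # Monomorphic SNP: there is nothing to test.
        return 0.0, 1.0
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts for one SNP in one control group.
chi2, p_value = hwe_chi_square(260, 500, 240)
print(f"chi2={chi2:.3f}, p={p_value:.3f}")
```

With a threshold such as the P>0.01 criterion mentioned in the results, a SNP would be flagged only when the returned p-value falls below 0.01.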
We also performed analyses to model the cumulative effect of all 11 variants on disease
risk. In these analyses, each individual was assigned a risk score, which was the total
number of risk alleles carried for the 11 variants. For individuals missing data for <4
SNPs (n=970), the average number of risk alleles for a SNP in each population was used
in the calculation of the risk score. ORs were estimated to evaluate the per allele effect of
the risk score on CRC risk. We also categorized the risk score in quintiles based on the
risk score distribution in controls in each racial/ethnic group. Results for the risk score
analyses were not materially changed when we restricted it to individuals with complete
genotype data for all 11 SNPs (2909 of 3311 cases and 3898 of 4466 controls).
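The construction of the unweighted risk score can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function names, the example genotypes and the per-SNP population means are hypothetical, and the quintile cut points use the default method of Python's statistics.quantiles, which may differ from the software used in the study.

```python
import statistics


def risk_score(dosages, mean_dosages, max_missing=3):
    """Total number of risk alleles across SNPs for one individual.

    dosages: risk-allele counts (0, 1 or 2) per SNP, or None when missing.
    mean_dosages: average number of risk alleles per SNP in the individual's
        population, substituted for missing genotypes.
    Returns None when more than max_missing genotypes are missing, mirroring
    the exclusion of individuals missing 4 or more SNPs.
    """
    n_missing = sum(d is None for d in dosages)
    if n_missing > max_missing:
        return None
    return sum(m if d is None else d for d, m in zip(dosages, mean_dosages))


def quintile_cutpoints(control_scores):
    """Four cut points that split the control risk-score distribution into fifths."""
    return statistics.quantiles(control_scores, n=5)


# Hypothetical individual with one missing genotype out of 11 SNPs.
dosages = [1, 2, None, 0, 1, 1, 2, 0, 1, 1, 0]
mean_dosages = [1.0, 1.0, 0.6, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(risk_score(dosages, mean_dosages))
```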
We tested whether dominant or recessive effects for each SNP contributed additional
information on top of the risk score by including indicator variables for heterozygotes
and homozygotes in the risk score model for each SNP one by one. This model was
compared to the model with just the risk score using a 2-df LRT. As with the additive
model, allowing dominant or recessive genetic effects for any SNP did not improve the
model fit.
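For reference, the 1-df and 2-df likelihood ratio tests used above compare twice the difference in maximized log-likelihoods to a chi-square distribution. The sketch below computes the p-value with closed-form tail probabilities; the function name and the log-likelihood values are hypothetical, and fitting the logistic models themselves is outside its scope.

```python
import math


def lrt_p_value(loglik_null, loglik_alt, df):
    """Likelihood ratio test for nested models with df = 1 or 2.

    loglik_null: maximized log-likelihood of the smaller model
        (for example, the additive model or the risk-score-only model).
    loglik_alt: maximized log-likelihood of the larger model.
    Returns (test statistic, p-value).
    """
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    if df == 1:
        # Chi-square upper tail with 1 df.
        return stat, math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # Chi-square upper tail with 2 df.
        return stat, math.exp(-stat / 2.0)
    raise ValueError("this sketch only covers df = 1 or df = 2")


# Hypothetical log-likelihoods from two nested logistic models.
stat, p_value = lrt_p_value(-5120.4, -5119.1, df=2)
print(f"LRT statistic={stat:.2f}, p={p_value:.3f}")
```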
Also, to incorporate known genetic effect size into the assessment of cumulative effects,
we created a weighted risk score, in which SNPs were weighted by their log (OR), based
on ORs reported in the GWAS in European Americans. The two scores were highly
correlated (Pearson correlation coefficient was 0.95, P<0.0001) and the unweighted risk
score was used to estimate cumulative genetic effects in all analyses.
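A weighted score of this kind can be sketched as below, with each risk-allele count multiplied by the natural log of a per-allele OR. The ORs and genotypes shown are placeholders rather than the published GWAS estimates, and the helper names are hypothetical.

```python
import math


def weighted_risk_score(dosages, odds_ratios):
    """Sum of risk-allele counts weighted by log(OR) for each SNP."""
    return sum(d * math.log(o) for d, o in zip(dosages, odds_ratios))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Placeholder per-allele ORs and genotypes for three SNPs and four individuals.
ors = [1.10, 1.20, 1.26]
people = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [2, 2, 2]]
unweighted = [sum(g) for g in people]
weighted = [weighted_risk_score(g, ors) for g in people]
print(pearson_r(unweighted, weighted))
```

When the per-allele ORs fall in a narrow range, as they do for these variants, the weights are similar across SNPs and the two scores are expected to be highly correlated, which is in line with the correlation of 0.95 reported above.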
Case-only analyses were performed to test whether individual and aggregate effects of
SNPs were different on CRC versus adenoma risk. For invasive CRC, we conducted
case-only analyses for each SNP and the risk score by tumor stage at diagnosis (localized
vs. regional/distant), for colon and rectal cancer, and by location in the colon (left vs.
right). In addition to race/ethnicity, we examined whether the associations were modified
by sex, age of onset (based on the median in the controls: <66 and ≥66 years), first-degree
family history of colorectal cancer, BMI (<23, ≥23–25, ≥25–30, ≥30 kg/m²) and
smoking status (never and ever smokers) using a LRT of heterogeneity by adding an
interaction term to the model.
Genetic effects were adjusted for age (categorized by quartiles of the age distribution in
controls in the pooled sample: <59, ≥59–66, ≥66–72, ≥72 and unknown), study, and
race/ethnicity in pooled analyses. For each covariate, an indicator variable was used for
those individuals with missing data. To assess the potential confounding effects of
population stratification, we conducted a principal component analysis (PCA) using
genotype data for 1389 SNPs from candidate gene studies among 1098 CRC cases and
1489 controls from the MEC (30-32). In testing each CRC variant, we included the first
10 principal components (PCs) along with self-reported ethnicity in the logistic
regression models to assess the impact of population stratification.
2.3.6 Results
Descriptive characteristics of participants are presented in Table 2-4 by study. The ages
of individuals ranged from 21 to 86 years, and in general, mean ages were similar
between cases and controls within each study. The Native Hawaiians were the youngest
group (mean age, 61 years) and African Americans the oldest (mean age, 67 years).
Table 2-4 Descriptive characteristics of each study*
| Characteristic | Adenoma Cases | Adenoma Controls | LA CRC Cases | LA CRC Controls | HI CRC Cases | HI CRC Controls | MEC Cases | MEC Controls | All Cases | All Controls |
|---|---|---|---|---|---|---|---|---|---|---|
| Total subjects | 839 | 1170 | 497 | 507 | 426 | 616 | 1549 | 2173 | 3311 | 4466 |
| Sex(%) | | | | | | | | | | |
| Men | 515 (0.61) | 733 (0.63) | 0 | 0 | 255 (0.60) | 355 (0.58) | 853 (0.55) | 1132 (0.52) | 1623 (0.49) | 2220 (0.50) |
| Women | 324 (0.39) | 437 (0.37) | 497 (1.00) | 507 (1.00) | 171 (0.40) | 261 (0.42) | 696 (0.45) | 1041 (0.48) | 1688 (0.51) | 2246 (0.50) |
| Race(%) | | | | | | | | | | |
| African Americans | 0 | 0 | 67 (0.13) | 52 (0.10) | 0 | 0 | 315 (0.20) | 458 (0.21) | 382 (0.12) | 510 (0.11) |
| Native Hawaiians | 181 (0.21) | 233 (0.20) | 0 | 0 | 65 (0.15) | 83 (0.13) | 77 (0.05) | 156 (0.07) | 323 (0.10) | 472 (0.11) |
| Japanese Americans | 267 (0.32) | 380 (0.32) | 0 | 0 | 233 (0.55) | 373 (0.61) | 542 (0.35) | 673 (0.31) | 1042 (0.31) | 1426 (0.32) |
| Latinos | 0 | 0 | 59 (0.12) | 43 (0.09) | 0 | 0 | 334 (0.22) | 481 (0.22) | 393 (0.12) | 524 (0.12) |
| European Americans | 391 (0.47) | 557 (0.48) | 371 (0.75) | 412 (0.81) | 128 (0.30) | 160 (0.26) | 281 (0.18) | 405 (0.19) | 1171 (0.35) | 1534 (0.34) |
| Age | | | | | | | | | | |
| Mean(sd) | 59.9 (8.6) | 61.9 (8.4) | 65.6 (5.7) | 64.6 (6.8) | 64.3 (11.8) | 65.0 (11.6) | 68.4 (8.2) | 66.8 (8.1) | 65.3 (9.2) | 65.1 (8.9) |
| No. unknown | 0 | 0 | 0 | 0 | 0 | 0 | 28 | 0 | 28 | 0 |
| Site | | | | | | | | | | |
| Unknown | -- | -- | 0 | -- | 2 | -- | 28 | -- | 30 | -- |
| Rectum | -- | -- | 59 | -- | 120 | -- | 413 | -- | 592 | -- |
| Colon | -- | -- | 438 | -- | 304 | -- | 1108 | -- | 1850 | -- |
| Left colon | -- | -- | 187 | -- | 169 | -- | 509 | -- | 865 | -- |
| Right colon | -- | -- | 248 | -- | 134 | -- | 587 | -- | 969 | -- |
| Other colon | -- | -- | 3 | -- | 1 | -- | 12 | -- | 16 | -- |
| Stage | | | | | | | | | | |
| Unknown | -- | -- | 1 | -- | 2 | -- | 28 | -- | 31 | -- |
| In situ | -- | -- | 30 | -- | 42 | -- | 130 | -- | 202 | -- |
| Invasive | -- | -- | 466 | -- | 382 | -- | 1391 | -- | 2239 | -- |
| Localized | -- | -- | 201 | -- | 194 | -- | 694 | -- | 1089 | -- |
| Distant/regional | -- | -- | 264 | -- | 182 | -- | 594 | -- | 1040 | -- |
| Unknown | -- | -- | 1 | -- | 6 | -- | 103 | -- | 110 | -- |
| First-degree family history of colorectal cancer(%) | | | | | | | | | | |
| Yes | 151 (0.18) | 160 (0.14) | 65 (0.13) | 57 (0.11) | 74 (0.17) | 62 (0.10) | 182 (0.12) | 190 (0.09) | 472 (0.14) | 469 (0.11) |
| No | 688 (0.82) | 1010 (0.86) | 432 (0.87) | 450 (0.89) | 352 (0.83) | 554 (0.90) | 1367 (0.88) | 1983 (0.91) | 2839 (0.86) | 3997 (0.89) |
| Smoking status(%) | | | | | | | | | | |
| Never | 364 (0.43) | 620 (0.53) | 221 (0.45) | 255 (0.5) | 174 (0.41) | 289 (0.47) | 532 (0.34) | 933 (0.43) | 1291 (0.39) | 2097 (0.47) |
| Past | 370 (0.44) | 473 (0.4) | 215 (0.43) | 197 (0.39) | 182 (0.43) | 260 (0.42) | 683 (0.44) | 912 (0.42) | 1450 (0.44) | 1842 (0.41) |
| Current | 105 (0.13) | 77 (0.07) | 61 (0.12) | 55 (0.11) | 67 (0.16) | 62 (0.1) | 234 (0.15) | 299 (0.14) | 467 (0.14) | 493 (0.11) |
| Unknown | 0 | 0 | 0 | 0 | 3 (0.01) | 5 (0.01) | 100 (0.06) | 29 (0.01) | 103 (0.03) | 34 (0.01) |
| BMI(kg/m²)(%) | | | | | | | | | | |
| <23 | 125 (0.15) | 256 (0.22) | 108 (0.22) | 122 (0.24) | 149 (0.35) | 195 (0.32) | 327 (0.21) | 536 (0.25) | 709 (0.21) | 1109 (0.25) |
| ≥23-25 | 150 (0.18) | 213 (0.18) | 86 (0.17) | 95 (0.19) | 86 (0.20) | 120 (0.19) | 268 (0.17) | 445 (0.20) | 590 (0.18) | 873 (0.20) |
| ≥25-30 | 324 (0.39) | 447 (0.38) | 160 (0.32) | 173 (0.34) | 128 (0.30) | 219 (0.36) | 577 (0.37) | 816 (0.38) | 1189 (0.36) | 1655 (0.37) |
| ≥30 | 240 (0.29) | 254 (0.22) | 143 (0.29) | 117 (0.23) | 63 (0.15) | 82 (0.13) | 277 (0.18) | 356 (0.16) | 723 (0.22) | 809 (0.18) |
| Unknown | 0 | 0 | 0 | 0 | 0 | 0 | 100 (0.06) | 20 (0.01) | 100 (0.03) | 20 (0.004) |
* Adenoma: The Adenoma Study; LA CRC: The Los Angeles County CRC Case-Control Study; HI CRC: The Hawaii Multiethnic
CRC Case-Control Study; MEC: The Multiethnic Cohort.
Ten of the 11 SNPs were polymorphic among controls in each racial/ethnic group
(Figure 2-1). SNP rs16892766 was monomorphic in Japanese Americans and had a
minor allele frequency (MAF) <10% in European Americans, Native Hawaiians and
Latinos, while SNP rs10795668 had a MAF of 7% in African Americans. All other SNPs
had a MAF >10% in each racial/ethnic group. Allele frequencies differed substantially
across ethnic groups, with between-group differences ranging from 12% for rs16892766 to 63% for
rs4779584 (Table 2-5 and Figure 2-1). None of the SNPs displayed departure from
HWE within study/ethnic group (P>0.01).
Figure 2-1 Risk allele frequencies in controls across racial/ethnic groups, in descending
order of frequency in European Americans.
In ethnic-pooled case-only analysis comparing CRC (n=2,472) and adenoma cases
(n=839) we observed no significant association between any SNP and case group. For
each SNP the OR estimates were not materially different when combining in situ CRC
(n=202 cases) with invasive CRC cases compared to invasive cases alone (results not
presented). Based on these observations, we estimated the individual and aggregate
genetic effects of the CRC variants among all cases combined. Within an ethnic group,
the allele frequencies of the 11 SNPs were consistent across studies. We also observed no
strong evidence of significant within-ethnic-group heterogeneity in the association of
each SNP with CRC/adenoma risk between studies (only one SNP, rs3802842, had
P_het<0.01 among European Americans).
In European Americans, the directions of association with the 11 variants were generally
consistent with previous GWAS reports: positive associations were observed with the risk
alleles of 9 of the 11 SNPs (8 with OR≥1.05) and the other 2 SNPs had ORs very close to
one (Table 2-5). Nominally significant associations were observed with rs6983267
(OR=1.12, 95% CI: 1.01-1.25; p=0.038), rs3802842 (OR=1.28, 95% CI: 1.14-1.44;
p=4×10⁻⁵) and rs4779584 (OR=1.20, 95% CI: 1.04-1.37; p=0.010), while rs16892766
also contributed to elevated disease risk, but the association was only borderline
significant (OR=1.18, 95% CI: 0.97-1.43, p=0.095). The effect sizes for the other 5 SNPs
were relatively small (OR<1.10) and did not reach statistical significance.
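As a rough consistency check on results of this form, a Wald z statistic and two-sided p-value can be recovered from a reported OR and its 95% confidence interval. The sketch below assumes the interval is symmetric on the log scale; the text does not say whether the reported p-values came from Wald or likelihood ratio tests, and rounding of the published limits makes the result approximate. The function name is hypothetical.

```python
import math


def wald_from_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Approximate the SE of log(OR), Wald z and two-sided p from an OR and 95% CI."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided normal tail probability: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p_value


# rs6983267 in European Americans as reported above: OR=1.12, 95% CI 1.01-1.25.
se, z, p_value = wald_from_ci(1.12, 1.01, 1.25)
print(f"se={se:.4f}, z={z:.2f}, p={p_value:.3f}")
```

For this SNP the recovered p-value is close to the reported p=0.038, with the small discrepancy attributable to rounding of the OR and interval limits.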
Positive associations (OR>1) with CRC/adenoma were also noted with 9 SNPs in Latinos
(8 with OR≥1.05) and with 8 SNPs each in African Americans (5 with OR≥1.05) and
Native Hawaiians (7 with OR ≥1.05; Table 2-5). However, the associations were weaker
in the four non-European populations, with only SNPs rs6983267 at 8q24 (OR=1.52, 95%
CI: 1.11-2.07, p=0.0090 (3)) and rs961253 at 20p12 (OR=1.24, 95% CI: 1.01-1.53,
p=0.042), found to be significantly associated with risk in African Americans (Table 2-5).
Interestingly, in Japanese Americans (1,042 cases, 1,426 controls), the second largest
group after European Americans, only 2 SNPs had effect estimates >1.05 (rs6983267 at
8q24 and rs4444235 at 14q22) with only the OR for rs6983267 approaching statistical
significance (1.12, 95% CI: 0.99-1.26). Tests for heterogeneity suggested no difference in
genetic effects across racial/ethnic groups for these 11 SNPs (Table 2-5). The association
by racial/ethnic group and genotype are presented in. In pooling all populations, the
pattern of associations was similar to what was observed in European Americans, with
significant positive associations observed for rs16892766, rs6983267 and rs3802842, while the
effect of rs4779584 was weaker than that observed in European Americans (Table 2-5).
After removing European Americans from the pooled analysis, the associations were
positive for 10 of the 11 SNPs, but only rs6983267 remained significantly associated with
increased CRC/adenoma risk (data not shown; OR=1.16, 95% CI: 1.06-1.27, p=0.0011).
Table 2-5 The association of known CRC variants with CRC/adenoma risk by race/ethnicity *
EA AA NH JA LA Pooled
SNP/ Chr./ Location / 1171 cases 382 cases 323 cases 1042 cases 393 cases 3311 cases
Allele Tested
†
Position
†
Nearest Gene 1534
controls
510
controls
472
controls
1426
controls
524
controls
4466
controls
rs16892766 8q23.3 Intergenic
1.18 1.23 1.14
--
1.29 1.2
(0.97-1.43) (0.92-1.63) (0.59-2.21) (0.82-2.05) (1.04-1.39)
RAF for C 1.2E+08 EIF3H 0.077 0.12 0.02 -- 0.04 0.047
P
het
‡
0.99
rs6983267 8q24.21 Intergenic
1.12 1.52 1.18 1.12 1.13 1.15
(1.01-1.25) (1.11-2.07) (0.95-1.46) (0.99-1.26) (0.92-1.39) (1.07-1.23)
RAF for G 1.3E+08 MYC 0.51 0.84 0.33 0.32 0.62 0.47
P
het
0.54
rs719725 9p24 Intergenic
0.99 1.06 1.21 0.98 0.96 1
(0.88-1.11) (0.84-1.33) (0.93-1.57) (0.87-1.11) (0.78-1.19) (0.93-1.08)
RAF for A 6355683 TPD52L3 0.63 0.72 0.8 0.68 0.7 0.68
P
het
0.81
rs10795668 10p14 Intergenic
1.05 0.93 1.2 1.01 1.02 1.05
(0.93-1.18) (0.64-1.33) (0.97-1.47) (0.90-1.13) (0.83-1.26) (0.97-1.12)
RAF for G 8741225 BC031880 0.69 0.93 0.62 0.56 0.7 0.67
P
het
0.55
rs3802842 11q23 Intergenic
1.28 1.11 1.27 1.02 1.07 1.15
(1.14-1.44) (0.90-1.36) (0.98-1.65) (0.91-1.15) (0.85-1.34) (1.07-1.23)
RAF for C 1.1E+08 FLJ45803 0.27 0.34 0.18 0.32 0.23 0.28
57
P
het
0.11
rs4444235 14q22.2 Intergenic
1.05 1.02 0.97 1.08 0.98 1.04
(0.94-1.17) (0.83-1.25) (0.79-1.18) (0.96-1.21) (0.82-1.19) (0.97-1.11)
RAF for C 5.3E+07 BMP4 0.47 0.34 0.57 0.62 0.5 0.52
P
het
0.84
rs4779584 15q13.3 Intergenic
1.2 1.03 1.02 0.94 1.13 1.07
(1.04-1.37) (0.85-1.25) (0.82-1.27) (0.81-1.08) (0.92-1.39) (0.99-1.15)
RAF for T 3.1E+07
GREM1,
SCG5 0.18 0.54 0.64 0.81 0.25 0.48
P
het
0.18
rs9929218 16q22.1 Intron
0.99 1.01 0.92 0.98 1.09 0.99
(0.88-1.11) (0.82-1.26) (0.71-1.19) (0.84-1.13) (0.87-1.36) (0.92-1.07)
RAF for G 6.7E+07 CDH1 0.71 0.7 0.83 0.82 0.77 0.76
P
het
0.91
rs4939827 18q21.1 Intron
1.05 0.98 1.14 0.99 1.13 1.05
(0.94-1.17) (0.79-1.21) (0.92-1.40) (0.86-1.14) (0.92-1.40) (0.98-1.12)
RAF for T 4.5E+07 SMAD7 0.52 0.32 0.42 0.22 0.33 0.37
P
het
0.67
rs10411210 19q13.1 Intron
1.06 0.99 0.92 1 1.15 1.02
(0.89-1.26) (0.81-1.20) (0.72-1.18) (0.86-1.17) (0.87-1.52) (0.93-1.11)
RAF for C 3.8E+07 RHPN2 0.89 0.6 0.81 0.86 0.85 0.83
P
het
0.82
rs961253 20p12.3 Intergenic
1.03 1.24 1.12 0.93 1.15 1.05
(0.92-1.15) (1.01-1.53) (0.83-1.50) (0.78-1.12) (0.93-1.42) (0.97-1.14)
RAF for A 6352281 BMP2 0.36 0.33 0.14 0.12 0.27 0.25
58
P
het
0.35
* Each cell gives odds ratios (and 95% confidence intervals) for allele dosage effects along with the risk allele frequency (RAF) in controls.
Odds ratios for CRC/adenoma are adjusted for age (quartiles), ethnicity (in pooled) and study. EA: European Americans; AA:
African Americans; NH: Native Hawaiians; JA: Japanese Americans; LA: Latinos.
† NCBI build 36.1, forward strand
‡ P value for heterogeneity of allele dosage effects across ethnic groups (4-df test)
The mean number of risk alleles and the distributions of the risk score were nearly
identical across racial/ethnic groups (Table 2-6). In all racial/ethnic groups the risk score
was associated with an OR of 1.06 to 1.09 per allele (P’s≤0.039) except in Japanese
Americans (OR=1.01, 95% CI: 0.97-1.06, p=0.52; Table 2-6). A significant difference in
the effect of the risk score on CRC/adenoma risk was detected between Japanese
Americans and European Americans (1-df test, P_het=0.036), although there was no
evidence of overall heterogeneity of the risk score effect across all racial/ethnic groups
(4-df test, P_het=0.20; Table 2-6). In all populations except Japanese Americans (OR=1.10,
p=0.43), CRC/adenoma risk for individuals in the highest quintile of the risk score was
significantly elevated by a similar magnitude compared to those in the lowest quintile,
with ORs ranging from 1.52 (p=0.036) in Latinos to 1.58 (p=9.38×10^-5) in European
Americans (Table 2-6).
Table 2-6 The risk of CRC/adenoma associated with an aggregate risk score * by race/ethnicity

                              EA                AA                NH                JA                LA                Pooled
Mean risk   cases             10.9 (5-18)       11.8 (7-18)       11.1 (6-17)       10.7 (4-17)       10.8 (5-15)       11.0 (4-18)
score       controls          10.6 (4-18)       11.6 (6-18)       10.7 (5.87-16)    10.7 (4-17)       10.9 (5-17.26)    10.7 (4-18)
(range)
Quintiles †
Q1          n (cases/ctrls)   271/423           92/139            64/122            260/361           97/159            784/1204
            OR (95% CI) ‡     1.00 (Ref)        1.00 (Ref)        1.00 (Ref)        1.00 (Ref)        1.00 (Ref)        1.00 (Ref)
Q2          n (cases/ctrls)   205/272           74/98             54/105            209/295           73/98             615/868
            OR (95% CI)       1.18 (0.93-1.50)  1.29 (0.85-1.95)  1.01 (0.64-1.58)  0.98 (0.77-1.24)  1.31 (0.88-1.96)  1.10 (0.96-1.26)
            P value           0.16              0.23              0.98              0.86              0.19              0.17
Q3          n (cases/ctrls)   222/305           69/100            70/80             195/288           69/93             625/866
            OR (95% CI)       1.15 (0.91-1.45)  1.19 (0.79-1.82)  1.76 (1.12-2.75)  0.94 (0.74-1.20)  1.26 (0.84-1.91)  1.12 (0.98-1.28)
            P value           0.24              0.41              0.014             0.6               0.27              0.11
Q4          n (cases/ctrls)   202/265           70/91             58/71             178/233           66/77             574/737
            OR (95% CI)       1.21 (0.95-1.53)  1.32 (0.86-2.01)  1.56 (0.98-2.49)  1.06 (0.82-1.37)  1.50 (0.98-2.29)  1.22 (1.06-1.41)
            P value           0.12              0.2               0.059             0.65              0.061             0.0059
Q5          n (cases/ctrls)   271/269           77/82             77/94             200/249           88/97             713/791
            OR (95% CI)       1.58 (1.25-1.98)  1.53 (1.00-2.32)  1.57 (1.02-2.42)  1.10 (0.86-1.41)  1.52 (1.03-2.26)  1.39 (1.22-1.60)
            P value           9.38×10^-5        0.048             0.042             0.43              0.036             1.83×10^-6
Per allele  n (cases/ctrls)   1171/1534         382/510           323/472           1042/1426         393/524           3311/4466
            OR (95% CI)       1.08 (1.04-1.12)  1.07 (1.00-1.15)  1.09 (1.01-1.18)  1.01 (0.97-1.06)  1.07 (1.01-1.15)  1.06 (1.03-1.08)
            P value           7.60×10^-5        0.039             0.023             0.52              0.034             7.81×10^-7
P_het vs. EA §                Ref.              0.83              0.72              0.036             0.89              0.20 ||
* Risk score is the total number of risk alleles.
† Cut off values for risk score quintiles were based on the distribution of risk score in controls in each ethnic group, which are 9, 10,
11, 12 in EA, JA and LA; 10, 11, 12 and 13 in AA; and 9, 10, 11, 12.16 in NH.
‡ Odds ratios (and 95% confidence intervals) adjusted for age (quartiles), ethnicity (pooled) and study. EA: European Americans; AA:
African Americans; NH: Native Hawaiians; JA: Japanese Americans; LA: Latinos.
§ P value for heterogeneity of risk score effects between European Americans and each of the other racial/ethnic groups (1-df test).
|| P value for heterogeneity of risk score effects across ethnic groups (4-df test).
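As a rough cross-check on the quintile odds ratios in Table 2-6, an unadjusted odds ratio can be recomputed from the case and control counts alone. The sketch below is illustrative only: the function name `crude_odds_ratio` is an assumption made here, and the Wald interval it returns ignores the adjustment for age, ethnicity and study that the table's estimates include, so small differences from the published values are expected.

```python
import math


def crude_odds_ratio(cases_exposed, controls_exposed, cases_ref, controls_ref):
    """Unadjusted odds ratio and Wald 95% CI from a 2x2 table of counts."""
    odds_ratio = (cases_exposed * controls_ref) / (controls_exposed * cases_ref)
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(
        1 / cases_exposed + 1 / controls_exposed + 1 / cases_ref + 1 / controls_ref
    )
    ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, ci_low, ci_high


# European Americans, highest (Q5: 271 cases / 269 controls) versus
# lowest (Q1: 271 cases / 423 controls) quintile, counts from Table 2-6.
or_q5, ci_low, ci_high = crude_odds_ratio(271, 269, 271, 423)
print(f"crude OR = {or_q5:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

The crude estimate lands close to the adjusted value of 1.58 (1.25-1.98) reported for this cell, which suggests that the covariate adjustment changes little for this comparison.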
In the MEC, results were similar following adjustment for principal components versus
adjusting only for self-reported ethnicity.
In ethnic-pooled case-only analyses, SNPs rs3802842 and rs4444235 were more strongly
associated with rectal cancer than with colon cancer (OR=1.15 for rectal versus colon
cancer, 95% CI: 1.00-1.33, P=0.054 for rs3802842; OR=1.18, 95% CI: 1.03-1.35,
P=0.020 for rs4444235). No significantly different patterns of association were observed
for left versus right colon cancer, or by CRC stage (localized vs. distant/regional), for any
of the variants. Tests for interaction revealed significant effect modification by age, with
the effects of rs3802842, rs4444235 and rs4779584 on CRC/adenoma risk being stronger
at younger ages (before age 66) (P_het=0.041 for rs3802842; P_het=0.046 for rs4444235;
P_het=0.0079 for rs4779584), while the effects of rs6983267 (P_het=0.013) and rs961253
(P_het=0.016) were stronger at older ages (after age 65). The associations for rs10411210
(P_het=0.021) and rs9929218 (P_het=0.035) were stronger in men than in women. The
association for rs4444235 was stronger in nonsmokers than in smokers (P_het=0.033).
A significant interaction with BMI was observed for rs4779584 (P_het=0.015), with stronger
effects in the group with BMI<23. No significantly different patterns of association were
observed by family history of colorectal cancer for any of the variants. Associations with
the risk score were similar by tumor site and stage, and no evidence of heterogeneity in
associations of the risk score was observed with age, sex, family history of colorectal cancer,
BMI or smoking status overall or in any population (results not shown).
2.3.7 Discussion
In our study, we were able to detect statistically significant associations between
rs6983267 (8q24), rs3802842 (11q23), rs4779584 (15q13) and risk of colorectal
neoplasia in European Americans, which were consistent with the reports from GWAS of
CRC (1, 6, 7) and various subsequent replication studies (3, 5, 10, 12, 15-19). The ORs
estimated in European Americans in our study for rs6983267 and rs4779584 were
reduced, and the OR for rs3802842 was increased compared to the ORs described in the
original GWAS; however, the 95% CI overlapped substantially, suggesting that the
differences were probably not significant. None of the three SNPs are located within
known genes. SNP rs6983267 has been implicated in regulating MYC while rs4779584 is
near GREM1 which is a component of the TGF-β super-family signaling pathway.
Further work will be required to better understand the clinical and biological implications
of these associations.
We failed to replicate the associations of rs10795668, rs719725, rs4939827,
rs16892766, rs10411210, rs4444235, rs961253 and rs9929218 with CRC/adenoma risk in
European Americans. Significant associations with variants rs9929218, rs10411210,
rs961253 and rs4444235 were also not replicated in a study of 1786 cases and 1749
controls of Swedish origin (10); however, associations with rs10795668, rs4939827,
rs6983267 and rs719725 have been confirmed in other studies in Europeans (5, 10, 13,
18). Failure to replicate in our study may be explained by a lack of power. For the 8 SNPs
that we failed to replicate in European Americans, the power to detect the ORs
reported in GWAS was 84% for rs4939827 and between 34% and 62% for the other 7
SNPs.
As we had found previously, rs6983267 was associated with elevated CRC/adenoma risk
in multiple populations (3). However, these observations are not independent, as there is
substantial overlap (~55%) between the cases and controls in our previous report and in
the current study. The association with rs6983267 was statistically significant in African
Americans, with the risk estimates in the other non-European ethnic groups of borderline
significance. We also found a previously unreported significant association between
rs961253 and CRC/adenoma risk in African Americans. No other statistically significant
associations were found in non-European populations.
The failure to detect significant associations in non-European populations may be due to
limited power. Except for rs4779584 and rs6983267 in Japanese Americans, the power to
detect a nominally significant association in each non-European population was <80%,
with the average power in each population for the variants tested ranging from 23% in
Native Hawaiians to 47% in Japanese Americans. Lack of statistically significant
associations in non-European populations may also be due to differences in linkage
disequilibrium (LD) patterns across ethnic groups, with the risk variants being linked
with functional variants in Europeans but not in all populations. Of the 11 SNPs, only
rs6983267 has been implicated as being a functional variant (23, 24). However, the
specific biological mechanisms of all SNPs that have been associated with colorectal
cancer risk remain unknown.
Except for Japanese Americans, cases tended to carry more risk alleles than controls, and
moreover, there was a gradual increase in the CRC/adenoma OR with an increased
number of risk alleles in all ethnic groups. Similar cumulative effects of low-risk variants
have been described in other studies (10, 14, 18, 22).
We noted that both the individual and aggregate genetic effects of the CRC risk variants
were weaker or absent in Japanese Americans compared with other racial/ethnic groups.
Despite the relatively large sample size, only two SNPs, rs6983267 and rs4444235, were
found to confer more than a 5% increase in CRC/adenoma risk in Japanese Americans.
Moreover, no enrichment in the number of risk alleles was found in Japanese American
CRC/adenoma cases. Although this could be again due to lack of power and different LD
patterns, the weak genetic effects were consistent with the fact that Japanese were
traditionally a low risk population for colorectal cancer. However, their risk increased
markedly upon migration to the U.S. and during recent decades in Japan, possibly
reflecting a particularly strong susceptibility to the environmental exposures associated
with a western lifestyle (27). Since GWAS hits were identified based on the
reproducibility of their associations with CRC, they are unlikely to be modified by
environmental risk factors, the frequencies of which do vary across populations.
Heterogeneity in the associations of the SNPs with CRC by clinicopathological
characteristics has been described in a number of previous studies, including stronger
relationships between rectal cancer and rs10795668 (4, 22), rs3802842 (6, 14) and
rs4939827 (19); between age of disease onset and rs16892766 (4), rs6983267, rs10795668 and
rs10411210 (10); between sex and rs4939827 (13) and rs9929218 (9); between family
history of CRC and rs6983267 (1) and rs10795668 (10); and between CRC stage and
rs3802842 (22). However, past results have been inconsistent. In our pooled population,
we noted heterogeneity of association by tumor site (rectum and colon), age of disease
onset, sex, BMI and smoking status for some of the SNPs. We were unable to detect any
significant difference in association patterns by tumor stage or by family history of
colorectal cancer. Middeldorp et al (18) reported enrichment in the number of risk alleles
in patients with a family history of colorectal cancer compared with solitary CRC patients,
and in patients with early-onset disease (age≤50) compared with patients with late-onset
CRC (age>50). We did not find any relationship between number of CRC risk alleles
and these risk factors or any other clinicopathological characteristics. Based on the
number of comparisons these observations will need to be confirmed in other studies.
In this large multiethnic study, we confirmed the associations between SNPs on 8q24,
11q23 and 15q13 and CRC/adenoma risk in European Americans, and observed
associations for SNPs on 8q24 and 20p12 in African Americans. Despite many SNPs
failing to reach statistical significance in individual analyses, we found significant
cumulative effects of risk alleles on CRC/adenoma across populations except for
Japanese Americans. The genetic effects differed by tumor site, age of onset, sex, BMI
and smoking status for some of the variants. Since many comparisons were tested, caution
is needed in interpreting these results. Our results indicate that the risk alleles identified
in many European studies are in aggregate associated with CRC/adenoma risk in most
populations with the exception of Japanese Americans. This underscores both the role
that common variants play in CRC/adenoma and the need to extend study to diverse
populations.
2.3.8 Acknowledgements
The authors thank Christian Caberto for data management and Annette Lum-Jones and
Ann Seifried for performing the laboratory assays. This study was supported by National
Cancer Institute (NCI) grants CA63464, CA54281, CA132839, CA60987 and CA72520.
We also thank the Hawaii Tumor Registry (National Cancer Institute contract N01-PC-35137)
for assistance in colorectal cancer case identification.
2.3.9 References
1. Tomlinson I, Webb E, Carvajal-Carmona L, et al. A genome-wide association scan of
tag SNPs identifies a susceptibility variant for colorectal cancer at 8q24.21. Nat Genet
2007; 39: 984-8.
2. Zanke BW, Greenwood CM, Rangrej J, et al. Genome-wide association scan identifies
a colorectal cancer susceptibility locus on chromosome 8q24. Nat Genet 2007; 39: 989-94.
3. Haiman CA, Le Marchand L, Yamamato J, et al. A common genetic risk factor for
colorectal and prostate cancer. Nat Genet 2007; 39: 954-6.
4. Tomlinson IP, Webb E, Carvajal-Carmona L, et al. A genome-wide association study
identifies colorectal cancer susceptibility loci on chromosomes 10p14 and 8q23.3. Nat
Genet 2008; 40: 623-30.
5. Poynter JN, Figueiredo JC, Conti DV, et al. Variants on 9p24 and 8q24 are associated
with risk of colorectal cancer: results from the Colon Cancer Family Registry. Cancer
Res 2007; 67: 11128-32.
6. Tenesa A, Farrington SM, Prendergast JG, et al. Genome-wide association scan
identifies a colorectal cancer susceptibility locus on 11q23 and replicates risk loci at 8q24
and 18q21. Nat Genet 2008; 40: 631-7.
7. Jaeger E, Webb E, Howarth K, et al. Common genetic variants at the CRAC1 (HMPS)
locus on chromosome 15q13.3 influence colorectal cancer risk. Nat Genet 2008; 40: 26-8.
8. Broderick P, Carvajal-Carmona L, Pittman AM, et al. A genome-wide association
study shows that common alleles of SMAD7 influence colorectal cancer risk. Nat Genet
2007; 39: 1315-7.
9. Houlston RS, Webb E, Broderick P, et al. Meta-analysis of genome-wide association
data identifies four new susceptibility loci for colorectal cancer. Nat Genet 2008; 40:
1426-35.
10. von Holst S, Picelli S, Edler D, et al. Association studies on 11 published colorectal
cancer risk loci. Br J Cancer 2010; 103: 575-80.
11. Li L, Plummer SJ, Thompson CL, et al. A common 8q24 variant and the risk of colon
cancer: a population-based case-control study. Cancer Epidemiol Biomarkers Prev 2008;
17: 339-42.
12. Pittman AM, Broderick P, Sullivan K, et al. CASP8 variants D302H and -652 6N
ins/del do not influence the risk of colorectal cancer in the United Kingdom population.
Br J Cancer 2008; 98: 1434-6.
13. Thompson CL, Plummer SJ, Acheson LS, Tucker TC, Casey G, and Li L.
Association of common genetic variants in SMAD7 and risk of colon cancer.
Carcinogenesis 2009; 30: 982-6.
14. Pittman AM, Webb E, Carvajal-Carmona L, et al. Refinement of the basis and impact
of common 11q23.1 variation to the risk of developing colorectal cancer. Hum Mol
Genet 2008; 17: 3720-7.
15. Berndt SI, Potter JD, Hazra A, et al. Pooled analysis of genetic variation at
chromosome 8q24 and colorectal neoplasia risk. Hum Mol Genet 2008; 17: 2665-72.
16. Tuupanen S, Niittymaki I, Nousiainen K, et al. Allelic imbalance at rs6983267
suggests selection of the risk allele in somatic colorectal tumor evolution. Cancer Res
2008; 68: 14-7.
17. Schafmayer C, Buch S, Volzke H, et al. Investigation of the colorectal cancer
susceptibility region on chromosome 8q24.21 in a large German case-control sample. Int
J Cancer 2009; 124: 75-80.
18. Middeldorp A, Jagmohan-Changur S, van Eijk R, et al. Enrichment of low penetrance
susceptibility loci in a Dutch familial colorectal cancer cohort. Cancer Epidemiol
Biomarkers Prev 2009; 18: 3062-7.
19. Curtin K, Lin WY, George R, et al. Cancer Epidemiol Biomarkers Prev 2009; 18:
616-21.
20. Matsuo K, Suzuki T, Ito H, et al. Association between an 8q24 locus and the risk of
colorectal cancer in Japanese. BMC Cancer 2009; 9: 379.
21. Kupfer SS, Torres JB, Hooker S, et al. Novel single nucleotide polymorphism
associations with colorectal cancer on chromosome 8q24 in African and European
Americans. Carcinogenesis 2009; 30: 1353-7.
22. Xiong F, Wu C, Bi X, et al. Risk of genome-wide association study-identified genetic
variants for colorectal cancer in a Chinese population. Cancer Epidemiol Biomarkers Prev
2010; 19: 1855-61.
23. Pomerantz MM, Ahmadiyeh N, Jia L, et al. The 8q24 cancer risk variant rs6983267
shows long-range interaction with MYC in colorectal cancer. Nat Genet 2009; 41: 882-4.
24. Tuupanen S, Turunen M, Lehtonen R, et al. The common colorectal cancer
predisposition SNP rs6983267 at chromosome 8q24 confers potential to enhanced Wnt
signaling. Nat Genet 2009; 41: 885-90.
25. Kolonel LN, Henderson BE, Hankin JH, et al. A multiethnic cohort in Hawaii and
Los Angeles: baseline characteristics. Am J Epidemiol 2000; 151: 346-57.
26. Wu AH, Siegmund KD, Long TI, et al. Hormone therapy, DNA methylation and
colon cancer. Carcinogenesis 2010; 31: 1060-7.
27. Le Marchand L, Hankin JH, Wilkens LR, et al. Combined effects of well-done red
meat, smoking, and rapid N-acetyltransferase 2 and CYP1A2 phenotypes in increasing
colorectal cancer risk. Cancer Epidemiol Biomarkers Prev 2001; 10: 1259-66.
28. Hankin JH, Wilkens LR, Kolonel LN, and Yoshizawa CN. Validation of a
quantitative diet history method in Hawaii. Am J Epidemiol 1991; 133: 616-28.
29. Lee LG, Connell CR, and Bloch W. Allelic discrimination by nick-translation PCR
with fluorogenic probes. Nucleic Acids Res 1993; 21: 3761-6.
30. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, and Reich D.
Principal components analysis corrects for stratification in genome-wide association
studies. Nat Genet 2006; 38: 904-9.
31. Reich D, Price AL, and Patterson N. Principal component analysis of genetic data.
Nat Genet 2008; 40: 491-2.
32. Serre D, Montpetit A, Pare G, et al. Correction of population stratification in large
multi-ethnic association studies. PLoS One 2008; 3: e1382.
Chapter 3 Heritability Estimate Using GWAS
3.1 Background
3.1.1 Heritability
The variance of the observed phenotypes ($\sigma_P^2$) can be expressed as a sum of unobserved
underlying variances ($\sigma_G^2$ and $\sigma_E^2$) representing the contribution of unobserved genotype
(G) and environmental factors (E):
$$\sigma_P^2 = \sigma_G^2 + \sigma_E^2$$
The genetic variance can be partitioned into the variance of additive genetic effects ($\sigma_A^2$),
of dominance (interactions between alleles at the same locus) genetic effects ($\sigma_D^2$), and of
epistatic (interactions between alleles at different loci) genetic effects ($\sigma_I^2$):
$$\sigma_G^2 = \sigma_A^2 + \sigma_D^2 + \sigma_I^2$$
In the quantitative genetics literature, the fraction of the total phenotypic variance due to
genetic differences came to be known as heritability [14]:
$$\text{Heritability (broad sense)} = H^2 = \frac{\sigma_G^2}{\sigma_P^2}$$
The narrow sense heritability is the additive genetic variance alone as a proportion of the
phenotypic variance, but does not include non-additive components such as dominance
and gene-gene interaction:
$$\text{Heritability (narrow sense)} = h^2 = \frac{\sigma_A^2}{\sigma_P^2}$$
Heritability is a general and key population parameter that can help understand the
genetic architecture of complex traits.
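The two definitions differ only in which genetic variance components enter the numerator. The following minimal sketch makes that explicit; the function name and the variance component values are hypothetical and chosen purely for illustration, not taken from any study.

```python
def heritabilities(var_additive, var_dominance, var_epistatic, var_environment):
    """Return (broad-sense H^2, narrow-sense h^2) from variance components."""
    var_genetic = var_additive + var_dominance + var_epistatic
    var_phenotypic = var_genetic + var_environment
    broad_sense = var_genetic / var_phenotypic
    narrow_sense = var_additive / var_phenotypic
    return broad_sense, narrow_sense


# Hypothetical variance components that sum to a phenotypic variance of 1.
H2, h2 = heritabilities(
    var_additive=0.30, var_dominance=0.05, var_epistatic=0.05, var_environment=0.60
)
print(f"H^2 = {H2:.2f}, h^2 = {h2:.2f}")
```

Because the additive variance is one part of the genetic variance, the narrow-sense value can never exceed the broad-sense value.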
3.1.2 Heritability of binary trait
The heritability of an all-or-none trait can be defined in the usual way, by the proportion of
variation on an observed scale (0 or 1) that is due to additive genetic factors, and can be
estimated as for continuous traits. However, variance and heritability calculated on the
observed scale are a function of the incidence of the trait in the population. Binary traits
are often analyzed by means of the threshold liability model [15] (Figure 3-1), first used
by Wright [16]. In the threshold model, it is postulated that there exists an unobserved
continuous liability scale. Liability of disease is assumed to be the sum of environmental
and additive genetic components from independent normal distributions. Whether an
individual is affected or not depends on whether his liability exceeds or falls short of a
fixed threshold, with the proportion of the normal distribution that exceeds the threshold
being equal to the trait incidence (K).
Figure 3-1 (From Yang et al) The Liability Threshold Model for a Disease Prevalence of
K. An underlying continuous random variable determines disease status. If liability
exceeds the threshold t, then individuals are affected.
The advantages of working on the scale of liability are that population parameters such as
variance components and heritability are independent of prevalence and can therefore be
compared across traits or populations. The relationship between heritability estimates on
the observed scale ($h_o^2$) and the narrow-sense heritability on the underlying continuous
liability scale ($h_l^2$) was established by Dempster et al [17]:
$$h_l^2 = h_o^2 \, \frac{K(1-K)}{z^2}$$
where $z$ is the height of the standard normal density at the threshold t.
In case-control studies, cases and controls are not a random sample from the population
and the proportion of cases (P) is usually larger than the prevalence in the population. To
account for this ascertainment bias, Lee et al transformed the heritability estimates on
the observed scale to the liability scale by adjusting both for scale and for ascertainment of the
case samples [18]:
$$h_l^2 = h_o^2 \, \frac{K(1-K)}{z^2} \, \frac{K(1-K)}{P(1-P)}$$
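The two transformations above can be sketched in a few lines using only the Python standard library. This is an illustration under stated assumptions rather than the procedure used in the cited papers: the function name `liability_scale_h2` and the input values are hypothetical, and z is taken to be the standard normal density evaluated at the threshold that cuts off an upper tail of area K.

```python
from statistics import NormalDist


def liability_scale_h2(h2_observed, prevalence, case_fraction=None):
    """Convert an observed-scale heritability to the liability scale.

    prevalence    -- population prevalence K of the trait
    case_fraction -- proportion of cases P in the sample; None means a
                     random population sample (no ascertainment factor)
    """
    standard_normal = NormalDist()
    threshold = standard_normal.inv_cdf(1.0 - prevalence)  # t with upper tail K
    z = standard_normal.pdf(threshold)                     # density height at t
    h2_liability = h2_observed * prevalence * (1.0 - prevalence) / z ** 2
    if case_fraction is not None:
        # Additional factor for case-control ascertainment.
        h2_liability *= prevalence * (1.0 - prevalence) / (
            case_fraction * (1.0 - case_fraction)
        )
    return h2_liability


# Hypothetical inputs, chosen only to exercise both formulas.
h2_random_sample = liability_scale_h2(0.05, prevalence=0.10)
h2_case_control = liability_scale_h2(0.05, prevalence=0.10, case_fraction=0.50)
print(h2_random_sample, h2_case_control)
```

When the sample case fraction equals the prevalence (P = K), the ascertainment factor is 1 and the second formula reduces to the first. When cases are oversampled relative to a rare disease, the factor is below 1, so the same observed-scale estimate maps to a smaller liability-scale value.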
3.1.3 The additive Model
Let C be the set of $N_C$ causal SNPs, which along with environmental factors determine
the phenotype of each individual. In an additive model, the phenotype of each individual
is defined by a sum of linear effects
$$Y_j = \mu + \sum_{i \in C} z_{ij} \alpha_i + \varepsilon_j$$
where $z_{ij} = \frac{g_{ij} - 2p_i}{\sqrt{2p_i(1-p_i)}}$ are the normalized genotypes ($g_{ij}$ being the allele count of individual j at SNP i and $p_i$ the corresponding allele frequency), $\alpha_i$ is the effect size of SNP i, $\varepsilon_j$ is
the environmental contribution, and Y is normalized to have variance 1. The
environmental contribution is assumed to be normally distributed as $\varepsilon_j \sim N(0, \sigma_\varepsilon^2)$.
Consider the correlation between the phenotypes of two individuals in the additive model
above:
$$\mathrm{corr}(Y_j, Y_k) = \mathrm{cov}(Y_j, Y_k) = \mathrm{cov}\Big(\sum_{i \in C} z_{ij}\alpha_i, \sum_{i \in C} z_{ik}\alpha_i\Big) = \frac{\sigma_g^2}{N_C} \sum_{i \in C} \mathrm{cov}(z_{ij}, z_{ik}) = \sigma_g^2 K_{Causal,jk}$$
where $\sigma_g^2 = \sum_{i \in C} \alpha_i^2$ and $K_{Causal,jk} = \frac{1}{N_C} \sum_{i \in C} z_{ij} z_{ik}$.
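To make the definition of the relationship matrix concrete, the sketch below normalizes a small genotype matrix and averages the products of normalized genotypes over SNPs. The toy genotypes, the allele frequencies (treated here as known rather than estimated) and the function names are assumptions for illustration only.

```python
import math


def normalized_genotypes(genotypes, allele_freqs):
    """Normalize allele counts: z = (g - 2p) / sqrt(2p(1 - p)).

    genotypes[j][i] is the allele count (0, 1 or 2) of individual j at SNP i.
    """
    return [
        [(g - 2 * p) / math.sqrt(2 * p * (1 - p)) for g, p in zip(row, allele_freqs)]
        for row in genotypes
    ]


def relationship_matrix(z):
    """K[j][k] = (1 / N) * sum over SNPs of z[j][i] * z[k][i]."""
    n_snps = len(z[0])
    return [
        [sum(a * b for a, b in zip(z_j, z_k)) / n_snps for z_k in z]
        for z_j in z
    ]


# Toy data: individuals 0 and 1 carry identical genotypes, individual 2
# carries the opposite homozygote wherever they are homozygous.
genotypes = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],
    [2, 1, 0, 1],
]
allele_freqs = [0.5, 0.5, 0.5, 0.5]  # assumed known for this sketch

K = relationship_matrix(normalized_genotypes(genotypes, allele_freqs))
for row in K:
    print([round(value, 2) for value in row])
```

Individuals with identical genotypes receive a large positive entry and individuals with opposite genotypes a negative one, which is the sense in which the matrix measures genetic similarity at the SNPs used to build it.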
3.1.4 Classical heritability estimation
The classical methods of heritability estimation are based on an intuitive concept.
Phenotypes that are highly correlated among relatives in patterns consistent with
Mendelian inheritance are more heritable than those that are weakly correlated among
relatives.
The classical and still widely used approach is to collect sets of related individuals from
known pedigrees. The jk-th element of $K_{Causal}$ is twice the kinship coefficient, or $2\Phi_{jk}$.
Here $\Phi_{jk}$ is the probability that an allele drawn at random from j is identical by descent to
a randomly drawn allele from k, and can be calculated from the known pedigree structure
[19]. Many of the familiar values for $\Phi_{jk}$, such as $\Phi_{jk}$ = 1/4 for full siblings, assume that
founders share no alleles identical by descent, which may not be true in the presence of
inbreeding or population substructure [19, 20]. We call the matrix built from these
pedigree-based estimates $K_{Ped}$, and it serves as an estimate of $K_{Causal}$. Given this matrix,
the problem of heritability estimation is reduced to estimating $\sigma_g^2$ from the observed
covariance of the phenotypes of the related individuals.
Most traditional estimates of heritability using the correlations among related individuals
are presumed to estimate $h^2$, although these estimates can be biased. For example, the
classical estimate involving the regression of offspring trait values on the mean parental
values does not include the dominance component of variance, but the epistatic
component does contribute to the estimate. The epistatic component is typically (and
perhaps incorrectly) assumed to be 0 for identifiability purposes [21, 22].
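The offspring-on-midparent regression mentioned above amounts to an ordinary least-squares slope, which is read as an estimate of $h^2$ under the usual assumptions. A minimal sketch follows; the trait values are made up for illustration and the function name is hypothetical.

```python
def midparent_offspring_slope(midparent_values, offspring_values):
    """Least-squares slope of offspring trait values on mid-parent values."""
    n = len(midparent_values)
    mean_x = sum(midparent_values) / n
    mean_y = sum(offspring_values) / n
    cross_products = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(midparent_values, offspring_values)
    )
    sum_squares_x = sum((x - mean_x) ** 2 for x in midparent_values)
    return cross_products / sum_squares_x


# Hypothetical trait values for five families (not real data).
midparent = [1.0, 2.0, 3.0, 4.0, 5.0]
offspring = [2.0, 2.5, 3.0, 3.5, 4.0]

h2_estimate = midparent_offspring_slope(midparent, offspring)
print(h2_estimate)
```

In this toy data set the offspring deviate from the mean by exactly half as much as their mid-parent values do, so the slope is one half.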
Heritability of prostate cancer was estimated from a twin study using combined data on
44,788 pairs of twins in the Swedish, Danish, and Finnish twin registries [23]. Phenotypic variance was
divided into a component due to inherited genetic factors (heritability), a component due
to environmental factors common to both members of the pair of twins (the shared
environmental component), and a component due to environmental factors unique to each
twin (the nonshared environmental component). Structural-equation modeling provided
estimates of the unobserved variables; it was estimated that 42% of prostate cancer risk
may be explained by heritable factors (95% confidence interval (CI): 29% to 50%).
3.2 Heritability estimated from GWAS
3.2.1 Motivation
Genome-wide association studies (GWAS), in which several hundred thousand to more
than a million single nucleotide polymorphisms (SNPs) are assayed in thousands of
individuals, represent a powerful new tool for investigating the genetic architecture of
complex diseases. In the past few years, these studies have identified hundreds of genetic
variants associated with such conditions and have provided valuable insights into the
complexities of their genetic architecture. At this time, prostate cancer has 71 hits and
colorectal cancer has 19 hits.
Given a GWAS, one can compute an estimate of the genetic variance using the effect size
estimates from the markers that reach a pre-specified genome-wide significance level, $\hat{\sigma}_g^2 = \sum_i \alpha_i^2$.
This can be used to compute an estimate of the heritability, $h_{GWAS}^2 = \hat{\sigma}_g^2 / \sigma_Y^2$.
Advantages of using GWAS to estimate heritability are: (1) they do not require
obtaining data on relatives of cases; (2) there is little confounding by shared environmental
factors; (3) GWAS estimates of effect sizes can generally be measured marginally,
ignoring dominance and interaction effects, thus the heritability estimates from GWAS
are considered to be narrow-sense estimates; and the interaction effects weaken as the
relationship distance increases.
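The estimate of $h_{GWAS}^2$ described above can be sketched as a sum of squared effect sizes over the markers passing the significance threshold. In this illustration the effect sizes are assumed to be per standardized genotype, the threshold of 5×10^-8 is an assumed conventional choice, and the hits listed are invented.

```python
def gwas_heritability(hits, significance_level=5e-8, phenotypic_variance=1.0):
    """Sum of squared effect sizes of significant hits over phenotypic variance.

    hits -- iterable of (effect_size, p_value) pairs, with effect sizes
            expressed per standardized genotype
    """
    significant_effects = [
        effect for effect, p_value in hits if p_value < significance_level
    ]
    genetic_variance = sum(effect ** 2 for effect in significant_effects)
    return genetic_variance / phenotypic_variance


# Hypothetical (effect size, p-value) pairs; only the first two pass the threshold.
example_hits = [(0.10, 1e-10), (0.05, 1e-9), (0.20, 1e-3)]

h2_gwas = gwas_heritability(example_hits)
print(h2_gwas)
```

The third marker has the largest effect in this toy list but is excluded because it does not reach the threshold, which illustrates how a stringent cut-off can leave real genetic variance out of the estimate.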
3.2.2 Problem of missing heritability
Despite the success of GWAS in identifying SNPs associated with hundreds of
phenotypes, the total fraction of the phenotypic variation explained ($h_{GWAS}^2$) for most
phenotypes remains small relative to the published heritability estimates ($h^2$), which are
estimated using the trait covariance among relatives [24-26]. The gap between $h_{GWAS}^2$ and
$h^2$ has been termed the “missing heritability problem” [25]. For example, it is estimated
that 42% of the Prostate Cancer risk may be explained by heritable factors, but GWAS
hits account for approximately 30% of the familial risk [27]; the heritability of Colorectal
Cancer is estimated to be 35% [23] but only 1.6% is explained by GWAS hits [28].
Possible explanations for this missing heritability have been suggested. First of all, to
account for the large number of significance tests carried out, typically very stringent
statistical thresholds are set to control false positive rates in GWAS. This approach is at
the expense of the false negative rate, that is, failure to detect loci that are associated with
the trait but whose effect sizes are too small to reach genome-wide statistical significance.
Secondly, GWAS typically use common SNP markers. If causal variants have a lower
allele frequency than the SNPs in the GWAS, they will not be genotyped, and will be in
low linkage disequilibrium with common SNPs, thus the effect estimated at the
genotyped SNPs will be proportionally attenuated. For these reasons, the cumulative
genetic variation accounted for by SNPs that reach genome-wide statistical significance
is certain to be smaller than the total genetic variance. Recently, Zuk et al [29] showed
that certain types of epistatic interactions can inflate estimates of narrow-sense
heritability from population data.
Chapter 4 Finding the Missing Heritability
4.1 More comprehensive evaluation of genetic variation
4.1.1 Introduction
When phenotypes are collected on a sample of individuals whose relatedness is partially
or wholly unknown, genetic markers can be used to infer relatedness between pairs of
individuals, because related individuals tend to share more marker alleles than unrelated
individuals. The inferred relatedness can then be correlated with phenotypic similarity,
and quantitative genetic parameters, including heritability, can be estimated. This method
has been applied in evolutionary studies to estimate heritability for quantitative traits
when phenotypes and DNA samples are available but pedigree information is not, for
example in fish, plants and mammals. A disadvantage of this method is that many
polymorphic markers, typically hundreds, are needed to estimate relatedness accurately,
for distant relatives in particular [30].
Instead of focusing only on a small set of significant SNPs in GWAS, a more
comprehensive evaluation of genetic variation is required to try to find the “missing”
heritability in GWAS.
4.1.2 Score analysis
In a recent study on schizophrenia by Purcell et al [31], the authors evaluated whether
common variants have an important role en masse, directly testing the classic theory of
polygenic inheritance, previously hypothesized to apply to schizophrenia. The idea
behind this approach is that although GWAS did not identify a large number of strongly
associated loci, there could still be potentially thousands of very small individual effects
that collectively account for a substantial proportion of variation in risk. They
summarized variation across nominally associated loci into quantitative scores that
manifest the combined effect of many SNPs, and related the scores to disease state in
independent samples. Increasing proportions of the true effects will be detected at increasingly liberal
significance thresholds ($P_T$). Using such thresholds, they defined large sets of ‘score
alleles’ in a discovery sample, to generate aggregate risk scores for individuals in
independent target samples.
After filtering on minor allele frequency, genotyping rate and linkage disequilibrium,
Purcell et al obtained a subset of 74,062 autosomal SNPs in approximate linkage
equilibrium. In each discovery sample, they selected sets of score alleles at different
association test $P_T$ thresholds. For each individual in the target sample, they calculated
the number of score alleles they possessed, each weighted by the log odds ratio from the
discovery sample. To assess whether the aggregate scores reflect schizophrenia risk, they
tested for a higher mean score in target cases compared to controls. They selected males
(2,176 cases, 1,642 controls) and females (1,146 cases, 1,945 controls) to form arbitrary
discovery and target samples. Score alleles designated in the discovery sample were
significantly enriched among target cases, and the effect was larger for increasingly
liberal $P_T$ thresholds. The score on the basis of all SNPs with male discovery $P_T$ < 0.5 (n
= 37,655 SNPs) was highly correlated with schizophrenia in target females (P = 9×10^-19),
explaining ~3% of the variance (Nagelkerke’s pseudo $R^2$ from logistic regression [32]),
with higher scores in cases.
Purcell et al tried to eliminate several possible confounders, with emphasis on subtle
population structure. Defining score alleles in British Isles samples and testing in target
samples from Sweden, Portugal and Bulgaria (Figure 4-1), and vice versa, they observed
a similar pattern of results (Table 4-1). They argued that it is unlikely that the same
substructure is overrepresented in the corresponding phenotype class when discovery and
target samples are from distinct populations.
Figure 4-1 (From Purcell et al) Multidimensional scaling (MDS) plot for the individuals
in the final post-QC dataset (both cases and controls). Known study samples are indicated
by color; the distinct clusters are labeled with the exception of the four British Isles
samples (from Scotland, Ireland and England) that show near complete overlap on the
first two dimensions.
Table 4-1 (From Purcell et al) Results of the British Isles / non-British Isles score
analyses.
They next investigated the genetic models consistent with their data. The total additive
genetic variance ($V_A$) reflects the number of causal alleles, as well as their frequency
and effect size distributions. However, the variance explained by the markers that tag
these causal alleles ($V_M$) will be attenuated, reflecting the average extent of linkage
disequilibrium between marker and causal allele. In the target samples, the variance
explained by the observed score alleles ($V_S$) will be further attenuated by sampling
variation and $P_T$ threshold, such that $V_S \le V_M \le V_A$. They used simulation to
estimate possible values for $V_M$ and $V_A$, by identifying models that produced profiles
of $V_S$ across $P_T$ thresholds that were similar to those observed in the ISC data, as
indexed by the target sample $R^2$. Under a variety of genetic models, they simulated
discovery and target data sets of comparable sample size to the ISC. On the basis of the
empirical allele frequency distribution, they simulated marker SNPs, varying the
proportion that were in linkage disequilibrium with causal variants, for which they varied
allele frequency (uniform, U-shaped) and effect size distributions (fixed GRR values,
exponential GRR values, or fixed variance explained) as well as the extent of linkage
disequilibrium.
From a broad range of models, a subset produced results consistent with the ISC data
(Fig 4-2). Among these, all led to similar estimates of $V_M$ (mean 34%, range 32% to
36%). In models in which the causal alleles were imperfectly tagged ($r^2 < 1$), estimates
of $V_A$ can be considerably larger. Therefore, their estimate that common polygenic
variation accounts for one-third of the total variation in schizophrenia risk is a lower
bound for the true value, which could be much higher. Figure 4-2b shows seven examples
from the range of consistent models.
Figure 4-2 (From Purcell et al) Observed and simulated profiles of target sample
variance explained according to P-value threshold $P_T$. a, The observed variance explained
is shown ($R^2$, black line). b, A subset of models that produced results consistent with
the observed data is shown. All yielded similar estimates of the total variance explained
by the SNPs that tag the causal variants, $V_M$, with a mean value of 34%. c, Four
inconsistent models with fewer variants of larger effect are shown.
4.1.3 Linear mixed model (LMM)
Yang et al [33] estimated the proportion of variance for human height explained by
294,831 SNPs genotyped on 3,925 unrelated individuals using a linear model analysis. In
contrast to single-SNP association analysis, this approach does not attempt to test the
significance of individual SNPs; instead it fits the effects of all the SNPs as random
effects in a mixed linear model (MLM), which provides an unbiased estimate of the variance
explained by the SNPs in total.
$$y = X\beta + Wu + \varepsilon$$
where y is an n × 1 vector of phenotypes with n being the sample size; X is a matrix of
fixed effects such as sex, age, and/or one or more principal components, with effect sizes
β; ε is a vector of residual effects with $\varepsilon \sim N(0, I\sigma_\varepsilon^2)$; W
is a standardized genotype matrix with the $ij$th element
$w_{ij} = (x_{ij} - 2p_i)/\sqrt{2p_i(1-p_i)}$, where $x_{ij}$ is the number of copies of the
reference allele for the $i$th SNP of the $j$th individual and $p_i$ is the frequency of
the reference allele.
If we regard the effects of the SNPs to be random so that the effect sizes u are sampled
from a normal distribution with mean zero and variance $\sigma_u^2$, i.e.,
$u \sim N(0, I\sigma_u^2)$, the unconditional variance of y is
$$\operatorname{var}(y) = WW'\sigma_u^2 + I\sigma_\varepsilon^2$$
If we define $A = WW'/N$ and $\sigma_g^2$ as the variance explained by all the SNPs, i.e.,
$\sigma_g^2 = N\sigma_u^2$, where N is the number of columns (SNPs) of W, then we can write
the model as
$$E(y) = X\beta$$
$$\operatorname{var}(y) = A\sigma_g^2 + I\sigma_\varepsilon^2$$
A is interpreted as the genetic relationship matrix (GRM) between individuals. The
genetic relationship between individuals j and k can be estimated by the following
equation:
$$A_{jk} = \frac{1}{N}\sum_{i=1}^{N}\frac{(x_{ij} - 2p_i)(x_{ik} - 2p_i)}{2p_i(1-p_i)}$$
Note that $A_{jk}$ is identical to the estimate of the genetic relatedness by Astle et al [4]
for related individuals. Note the similarities and differences: here the $x_{ij}$ are fixed
and the effect sizes are random, while in the earlier description the effect sizes are
fixed and the $z_{ij}$ are random.
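The GRM formula above can be sketched in a few lines of Python. This is only an illustrative sketch, not the GCTA implementation: the function name `genetic_relationship_matrix` is hypothetical, allele frequencies are taken from the sample itself, and monomorphic SNPs are skipped (both are assumptions made here so that the standardization is defined).

```python
import math

def genetic_relationship_matrix(x):
    """Estimate the GRM A from allele counts.

    x[i][j] is the number of reference alleles (0, 1 or 2) for SNP i in
    individual j, matching the indexing used in the equation for A_jk.
    """
    n_ind = len(x[0])
    standardized = []
    for counts in x:
        p = sum(counts) / (2 * n_ind)        # sample allele frequency p_i (assumption)
        if p <= 0.0 or p >= 1.0:
            continue                          # assumption: skip monomorphic SNPs
        sd = math.sqrt(2 * p * (1 - p))
        standardized.append([(g - 2 * p) / sd for g in counts])
    n_snps = len(standardized)               # N: number of SNPs actually used
    # A_jk = (1/N) * sum_i w_ij * w_ik
    return [[sum(w[j] * w[k] for w in standardized) / n_snps
             for k in range(n_ind)]
            for j in range(n_ind)]
```

Because sample frequencies are used, each standardized SNP sums to zero across individuals, so every row of the resulting matrix also sums to zero; with externally supplied frequencies this would not hold.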
GCTA [18] implements the REML method via the average information (AI) algorithm
[34] to estimate $\sigma_g^2$, and the heritability is estimated to be
$\hat{h}^2 = \sigma_g^2/\sigma_P^2$, with $\sigma_P^2$ being the phenotypic variance. In a
separate paper Yang et al extended this method to partition the genetic variance onto each
of the chromosomes [18]. They estimated the GRM from the SNPs on each chromosome ($A_C$)
and estimated the variance attributable to each chromosome by fitting the GRMs of all the
chromosomes simultaneously in the model
$$y = X\beta + \sum_{C=1}^{22} g_C + \varepsilon \quad\text{and}\quad \operatorname{var}(g_C) = A_C\sigma_C^2$$
For binary traits like prostate cancer, the heritability estimates on the observed scale are
transformed to a liability scale by adjusting both for scale and for ascertainment of the
case samples [18]. Note that Yang et al interpret the heritability estimate as the fraction of
variance explained by the measured SNPs themselves, which is not influenced by residual
relatedness; i.e. they assume that the Ys are strictly independent given Wu.
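The text does not print the transformation itself. As a hedged illustration, the sketch below uses the form usually given for ascertained case-control samples under the liability threshold model, $h^2_l = h^2_o\,[K(1-K)]^2 / [z^2 P(1-P)]$, where K is the population prevalence, P is the proportion of cases in the sample and z is the standard normal density at the liability threshold. Whether this is exactly what was applied here is an assumption, and the function name is hypothetical.

```python
from statistics import NormalDist

def observed_to_liability(h2_observed, prevalence, case_fraction):
    """Convert an observed-scale heritability to the liability scale.

    prevalence    -- assumed population prevalence K of the disease
    case_fraction -- proportion P of cases in the analysed sample
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - prevalence)   # liability threshold
    z = std_normal.pdf(threshold)                      # normal density at the threshold
    k, p = prevalence, case_fraction
    return h2_observed * (k * (1 - k)) ** 2 / (z ** 2 * p * (1 - p))
```

When K = P = 0.5 the threshold is 0 and the conversion factor reduces to π/2, which is a convenient sanity check on the formula.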
They found that 45% (Figure 4-3) of variance of height can be explained by considering
all SNPs simultaneously, and the variance explained by each chromosome is proportional
to its length (Figure 4-4). Their conclusion was that most of the heritability is not
missing but has not previously been detected because the individual effects are too small
to pass stringent significance tests. The implication is that use of the same SNPs in
larger and larger samples will ultimately pinpoint, through linkage disequilibrium, the
causal SNPs that explain the narrow sense heritability.
Figure 4-3 (From Yang et al) Estimates of variance explained by genome-wide SNPs
from adjusted estimates of genetic relationships are unbiased. Results are shown as
estimates of variance explained by different proportions of SNPs randomly selected from
all the SNPs in the combined set. For each group of SNPs, the variance explained by
genome-wide SNPs is estimated using both raw estimates of genetic relationships and
adjusted estimates of genetic relationships correcting for prediction error (assuming c=0).
Error bars denote s.e. of the estimate of variance explained by genome-wide SNPs. The
log-likelihood ratio test (LRT) statistic is calculated as twice the difference in
log-likelihood between the full ($h^2 \neq 0$) and reduced ($h^2 = 0$) models.
Figure 4-4 (From Yang et al) Variance explained by chromosomes. Shown are the
estimates of the variance explained by each chromosome for height (combined) by joint
analysis using 11,586 unrelated individuals against chromosome length. The numbers in
the circles are the chromosome numbers. The regression slope and $R^2$ were
$1.6\times10^{-4}$ ($P = 1.4\times10^{-6}$) and 0.695 for height.
4.2 Application to African American Prostate Cancer (AAPC) GWAS data
4.2.1 AAPC GWAS
Nine studies were genotyped as part of the GWAS of prostate cancer in African
American men [35]. A brief description of each study can be found in Appendix A.
Genotyping was conducted using the Illumina Infinium 1 M-Duo bead array at the
University of Southern California and the NCI Genotyping Core Facility (PLCO study).
Following genotyping, samples were removed based on the following exclusion criteria: 1)
unknown replicates across studies (n=24, none within studies); 2) call rates <95%
(n=126); 3) samples with >10% mean heterozygosity on the X chromosome and/or <10%
mean intensity on the Y chromosome (we inferred 3 samples to be XX and 6 to be XXY);
4) ancestry outliers (n=108, discussed below); and 5) samples that were related (n=141,
discussed below). To assess genotyping reproducibility we included 158 replicate
samples; the average concordance rate was 99.99% (≥99.3% for all pairs). Starting with
1,153,397 SNPs, we removed SNPs with <95% call rate, MAFs<1%, or >1 QC mismatch
based on sample replicates (n=105,411).
Our study included 1,035,043 autosomal SNPs among 4,905 prostate cancer cases and
4,732 controls. A reduced set of 316,735 SNPs in approximate linkage equilibrium
($r^2 \le 0.25$ within a 200-SNP window, implemented in Plink) was obtained. The
genetic relationship matrix (GRM) of all the individuals was estimated using GCTA, and
we excluded one of each pair of individuals with an estimated genetic relationship >0.025
(that is, more related than third or fourth cousins) and retained a subset of 6,957
individuals (3,573 cases and 3,381 controls) who were not closely related.
4.2.2 Score analysis of Prostate Cancer
Following Purcell et al, we randomly divided the study population into a discovery sample
(1,730 cases and 1,749 controls) and a target sample (1,774 cases and 1,704 controls).
We selected sets of score SNPs based on the P value from logistic regression of prostate
cancer status on each SNP in the discovery sample, using different threshold P values
(0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5); models were adjusted for the first 10
eigenvectors. The score for each individual in the target sample is computed as the
number of score alleles weighted by the log of the odds ratio from the discovery sample
($\text{score} = \sum_i x_i \times \log \mathrm{OR}_i$). If an individual in the target
sample is missing a genotype, the mean genotype is imputed, based on the target sample
allele frequency. The association between score and prostate cancer in the target sample
is tested using logistic regression adjusted for the first 10 eigenvectors.
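The score construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: the function name `polygenic_scores` and the data layout (one list of allele counts per target individual, `None` for a missing genotype) are hypothetical, and the discovery odds ratios and P values are taken as given.

```python
import math

def polygenic_scores(genotypes, odds_ratios, p_values, p_threshold):
    """Return one weighted score per target individual.

    genotypes   -- per-individual lists of score-allele counts (0, 1, 2 or None)
    odds_ratios -- per-SNP odds ratios estimated in the discovery sample
    p_values    -- per-SNP discovery P values
    p_threshold -- the P_T cutoff; SNPs with P < P_T enter the score
    """
    selected = [i for i, p in enumerate(p_values) if p < p_threshold]
    # Mean genotype per selected SNP among non-missing target individuals,
    # used to impute missing genotypes.
    means = {}
    for i in selected:
        observed = [row[i] for row in genotypes if row[i] is not None]
        means[i] = sum(observed) / len(observed) if observed else 0.0
    scores = []
    for row in genotypes:
        total = 0.0
        for i in selected:
            x = row[i] if row[i] is not None else means[i]
            total += x * math.log(odds_ratios[i])   # x_i * log(OR_i)
        scores.append(total)
    return scores
```

The resulting scores would then be entered as a predictor in a logistic regression of case status in the target sample, alongside the eigenvectors.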
We observed results similar to those of Purcell et al, with the polygenic score highly
predictive in the target sample. Score alleles designated in the discovery sample were
significantly enriched among target cases. The association was stronger for increasingly
liberal $P_T$ thresholds up to $P_T = 0.1$ ($P = 8.75\times10^{-7}$), explaining ~0.93% of
the variance (Nagelkerke’s pseudo $R^2$ from logistic regression [32]); beyond that the
association became weaker, possibly due to inclusion of too many null alleles in the score.
The score on the basis of all SNPs with discovery $P_T < 0.5$ (n = 158,671 SNPs) was highly
correlated with prostate cancer in the target sample ($P = 4.39\times10^{-6}$, Nagelkerke’s
pseudo $R^2$ = 0.81%), with higher scores in cases (Table 4-2).
Table 4-2 Results of score analyses on prostate cancer. Score alleles were selected based
on a logistic model adjusted for the first 10 principal components.

| $P_{cut}$ | N | P | $R^2$ |
|---|---|---|---|
| 0.0001 | 29 | 9.09E-06 | 0.76% |
| 0.001 | 325 | 1.98E-02 | 0.21% |
| 0.01 | 3308 | 1.81E-05 | 0.70% |
| 0.05 | 16185 | 1.19E-06 | 0.91% |
| 0.1 | 32214 | 8.75E-07 | 0.93% |
| 0.2 | 64033 | 2.40E-06 | 0.85% |
| 0.3 | 95548 | 2.82E-06 | 0.84% |
| 0.4 | 127061 | 5.46E-06 | 0.79% |
| 0.5 | 158671 | 4.39E-06 | 0.81% |

N: number of SNPs in the score; P: P values from the model with the first 10 eigenvectors;
$R^2$: $R^2$ from the model with score and first 10 eigenvectors minus $R^2$ from the model
with only eigenvectors.
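The $R^2$ column is a difference of two Nagelkerke pseudo $R^2$ values. Assuming the log-likelihoods of the fitted logistic models are available, the calculation can be sketched as below; the function names are hypothetical and the standard Nagelkerke definition (Cox and Snell $R^2$ rescaled by its maximum) is assumed.

```python
import math

def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke's pseudo R^2 from log-likelihoods of an intercept-only
    (null) model and a fitted model, both on the same n individuals."""
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    max_r2 = 1.0 - math.exp(2.0 * loglik_null / n)   # upper bound of Cox-Snell R^2
    return cox_snell / max_r2

def score_r2(loglik_null, loglik_covariates, loglik_full, n):
    """R^2 attributable to the score: R^2 of the model with score and
    eigenvectors minus R^2 of the model with eigenvectors only."""
    return (nagelkerke_r2(loglik_null, loglik_full, n)
            - nagelkerke_r2(loglik_null, loglik_covariates, n))
```

Taking the difference removes the part of the variance already explained by the eigenvectors, so that the reported value reflects the score alone.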
4.2.3 Mixed linear model of Prostate Cancer
Using all 1 million SNPs before LD pruning, the LMM gave an estimate of narrow sense
heritability of ~32%. The phenotypic variance (of the liability) of prostate cancer in
African Americans explained by the relatively independent 316,735 SNPs obtained through
pruning in Plink was estimated to be 30.40% (Figure 4-5). Since using 1 million SNPs
did not increase the estimates of heritability to any important degree, all the analyses
were performed on the reduced set of SNPs and individuals unless otherwise stated. We
looked at the heritability estimates using 64K, 127K, 190K, 254K and all of the ~317K
SNPs. The proportion of phenotypic variance explained increased from 18.71% using
~64K SNPs to 30.08% when ~190K SNPs were used, and then the estimate barely
increased when the remaining SNPs were added to the model (Figure 4-5). Note that a
similar result seems evident in Yang’s paper (Figure 4-3), where it only took 30K SNPs
to obtain ~50% of the phenotypic variance captured by all 300K.
Next, we estimated the GRMs from the SNPs on each autosome and estimated the
genetic variances explained by individual chromosomes. We observed a marginally
significant linear relationship between the estimate of variance explained by each
chromosome and chromosome length ($R^2 = 0.421$ and P = 0.058, Figure 4-6). We also
partitioned the variance explained by all the SNPs onto genic and intergenic regions of
the whole genome. We mapped 316,735 SNPs to 19,870 genes according to positions on
the UCSC Genome Browser hg18 assembly, 19,460 of which had at least one SNP within
±50 kb of the 5’ and 3’ untranslated regions (UTRs). We defined the gene boundaries as
±0 kb, ±20 kb and ±50 kb of the 3’ and 5’ UTRs. A total of 147,958, 194,579 and
226,317 SNPs were located within the boundaries of 14,521, 18,863 and 19,460
protein-coding genes, for the three definitions respectively. The estimates of variance
explained by the genic and intergenic SNPs across the genome for the three definitions of
genic regions (±0 kb, ±20 kb and ±50 kb) are, respectively, (a) 17.94% (s.e. 0.057) and
11.38% (s.e. 0.063); (b) 29.78% (s.e. 0.064) and <0.01% (s.e. 0.053); and (c) 31.33%
(s.e. 0.067) and <0.01% (s.e. 0.046) (Figure 4-7). When intergenic SNPs alone were
included in the models, the estimates of variance explained were (a) 18.87% (s.e. 0.057);
(b) 8.65% (s.e. 0.048) and (c) 4.82% (s.e. 0.041) respectively.
Figure 4-5 Heritability explained by common SNPs for Prostate Cancer by number of
SNPs used. Models were adjusted for first 10 eigenvectors.
Figure 4-6 Variance explained by chromosomes for prostate cancer against chromosome
length. Models were adjusted for the first 10 eigenvectors. The regression slope and $R^2$
were $9.17\times10^{-5}$ ($P = 0.0508$) and 0.421.
Figure 4-7 Estimates of the variance explained by genic and intergenic regions on each
chromosome for prostate cancer by the joint analysis. The genic region is defined as (a)
±0 kb, (b) ±20 kb and (c) ±50 kb of the 3′ and 5′ UTRs.
Chapter 5 Interpretation of Heritability Estimates
5.1 Concerns about score analysis and LMM approach
The reason for our concern is as follows. In the LMM, let $Z_1$ and $Z_2$ be the
normalized genotype matrices of two sets of causal SNPs, contributing genetic variances
$\sigma_1^2$ and $\sigma_2^2$ respectively:
$$\operatorname{var}(Y) = \frac{Z_1 Z_1'}{N_1}\sigma_1^2 + \frac{Z_2 Z_2'}{N_2}\sigma_2^2 + I\sigma_\varepsilon^2$$
When the measured SNPs W include only part of the causal SNPs ($Z_1$) and some non-causal
SNPs ($Z_3$), the genetic variance contributed by W should be $\sigma_1^2$, since the SNPs
in $Z_3$ do not affect the phenotype:
$$\operatorname{var}(Y) = \frac{Z_1 Z_1'}{N_1}\sigma_1^2 + \frac{Z_3 Z_3'}{N_3}\sigma_3^2\ (\text{zero}) + I\sigma_\varepsilon^2$$
However, under models for population stratification or hidden relatedness all variants
(not subject to natural selection) tend to have the same expected correlation matrix, so
that the genetic covariances defined at $Z_3$ and $Z_2$ will be similar to each other and
to K, i.e. $\frac{Z_1 Z_1'}{N_1}, \frac{Z_2 Z_2'}{N_2}, \frac{Z_3 Z_3'}{N_3} \sim K$ [4].
Thus the estimate of $\sigma_3^2$ will contain additional information about K, and the
total genetic contribution of $Z_1$ and $Z_3$ will be overestimated.
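The step from similar relationship matrices to an overestimate can be made explicit. The following is a sketch under the idealized assumption, made here only for illustration, that the three matrices are exactly equal to K rather than merely similar in expectation:

```latex
\begin{align*}
\text{true model:}\quad
  \operatorname{var}(Y) &= K\sigma_1^2 + K\sigma_2^2 + I\sigma_\varepsilon^2
                         = K(\sigma_1^2 + \sigma_2^2) + I\sigma_\varepsilon^2 \\
\text{fitted model } (Z_1, Z_3)\text{:}\quad
  \operatorname{var}(Y) &= K\sigma_1^2 + K\sigma_3^2 + I\sigma_\varepsilon^2
                         = K(\sigma_1^2 + \sigma_3^2) + I\sigma_\varepsilon^2
\end{align*}
```

Matching the two expressions gives $\sigma_1^2 + \sigma_3^2 = \sigma_1^2 + \sigma_2^2$: the fitted model attributes the full genetic variance to the measured SNPs, exceeding the correct value $\sigma_1^2$ by $\sigma_2^2$. In this limiting case only the sum of the genetic components is identifiable, not each component separately.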
The estimate from LMM is interpreted by Yang et al [33] as the narrow-sense heritability
due exclusively to the SNPs in high LD with those on the genotyping platform. Several
authors [36, 37] have noted that this can be a biased estimate if there is hidden population
stratification in the population under study, but the influence of low levels of
relatedness in the LMM approach has not been closely examined. We investigated through
simulation studies the sensitivity of the LMM, when estimating the heritability due to
specific sets of SNPs, to low levels of hidden relatedness.
The question of the total amount of heritability explained by a given set of SNPs is
important because where the majority of phenotypic variation lies has significant
implications for understanding the genetic architecture of complex phenotypes and the
future success of association studies (e.g. will larger studies that use similar sets of SNPs
continue to find additional hits, or must the SNP set to be interrogated be expanded). It
may also help to put bounds on how clinically useful genetic risk prediction will be once
we learn much more about which are the causal SNPs.
5.2 Simulation studies
5.2.1 Simulated phenotypes based on genotypes in AAPC GWAS
Binary phenotypes were simulated based on observed genotypes from the pruned AAPC
GWAS data using a liability model [38] as follows: 1) randomly select 1000 “genic”
SNPs from regions between the 5’ and 3’ untranslated regions (UTRs) on even numbered
chromosomes (chr2, chr4 … chr22) as causal SNPs; 2) sample the genetic effect $b_j$ of
the $j$th causal SNP from the standard normal distribution; 3) calculate the genetic value
for each individual as $g = \sum_j x_j b_j$, where $x_j$ is the number of minor alleles of
the $j$th causal SNP; 4) set $h^2 = 0.5$ and sample residual effects ($e$) from a normal
distribution with mean 0 and variance equal to $\operatorname{var}(g)\times\frac{1-h^2}{h^2}$;
5) compute the liability as $l = g + e$; 6) rank the individuals by $l$, and assign the
top half of the individuals as cases and the remaining as controls [18].
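The six steps above can be sketched in Python as follows. This is an illustrative sketch only: the function name and arguments are hypothetical, and the restriction to genic SNPs on even numbered chromosomes is represented by a caller-supplied list of candidate SNP indices rather than by real annotation.

```python
import random
import statistics

def simulate_liability_phenotype(genotypes, candidate_snps, n_causal, h2=0.5, seed=0):
    """Simulate a binary phenotype from genotypes under a liability model.

    genotypes      -- per-individual lists of minor-allele counts (0, 1, 2)
    candidate_snps -- list of SNP indices eligible to be causal (hypothetical
                      stand-in for "genic SNPs on even numbered chromosomes")
    """
    rng = random.Random(seed)
    causal = rng.sample(candidate_snps, n_causal)              # step 1
    effects = {j: rng.gauss(0.0, 1.0) for j in causal}         # step 2: b_j ~ N(0, 1)
    genetic = [sum(row[j] * effects[j] for j in causal)        # step 3: g = sum x_j b_j
               for row in genotypes]
    # step 4: residual variance = var(g) * (1 - h2) / h2
    residual_sd = (statistics.pvariance(genetic) * (1 - h2) / h2) ** 0.5
    liability = [g + rng.gauss(0.0, residual_sd) for g in genetic]   # step 5
    # step 6: top half by liability are cases (1), the rest controls (0)
    ranked = sorted(range(len(liability)), key=lambda i: liability[i], reverse=True)
    cases = set(ranked[:len(ranked) // 2])
    status = [1 if i in cases else 0 for i in range(len(liability))]
    return status, causal
```

Scaling the residual variance by var(g) fixes the heritability of the liability at the chosen value without having to rescale the sampled effects.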
Then we applied both approaches to the simulated phenotypes. For Purcell’s method,
scores for target individuals were computed using the same formula stated above
($\text{score} = \sum_i x_i \times \log \mathrm{OR}_i$). The association between score and
simulated phenotype increased when more causal SNPs were included in the score by
increasing the cutoff P values up to 0.05 ($P = 2.01\times10^{-6}$, explaining ~0.87% of
the variance, Table 5-1); the strength of association was attenuated by further increasing
the number of null SNPs in the score. Next we excluded the causal SNPs from each score.
Even without any causal SNP in the score, highly significant associations with the
simulated phenotype were still observed ($P = 8.20\times10^{-4}$, explaining ~0.43% of the
variance, Table 5-1). In a more extreme scenario where we excluded all SNPs on the even
numbered chromosomes from each score, that is, none of the score alleles is a causal allele
or on the same chromosome as any causal allele, we did not observe significant
associations between scores and the phenotype (Table 5-1).
According to the LMM, the 317K SNPs together explained 33.99% of the variance of the
simulated phenotype on the liability scale, compared to the true 50% (Figure 5-1a). As
observed with prostate cancer, the proportion of variance of the simulated phenotype
explained increased as more SNPs were included in the model, but the trend was not linear
in the number of SNPs used: 10.06% of phenotypic variance can be explained by only ~64K
SNPs, and 31.81% can already be explained using only 60% (~190K) of the GWAS SNPs
(Figure 5-1a). By using SNPs located closer to the regions where causal SNPs are
distributed, fewer SNPs are required to obtain a valid estimate of narrow sense
heritability: 34.9% of the variation in simulated phenotypes can be explained by only
~72K SNPs from regions between the 3’ and 5’ UTRs on even numbered chromosomes
(Figure 5-1a). A substantial amount of variance was explained by using all non-causal
SNPs (21.60%); the amount of variance explained by non-causal SNPs located within
(genic) and outside (intergenic) of the regions between the 3’ and 5’ UTRs was 18.83% and
12.19% respectively (Figure 5-1b); SNPs on odd numbered chromosomes barely explained
any variation in the simulated phenotype (2.04%, Figure 5-1b).
Figure 5-1 Heritability explained by common SNPs for simulated phenotype over
observed AAPC genotypes by number of SNPs used. Models were adjusted for first 10
eigenvectors. a, When the models include all of the causal SNPs. b, When the models do
not include any causal SNPs.
Table 5-1 Results of score analyses on simulated phenotype. Models were adjusted for the first 10 principal components.
Column groups: All SNPs (all); Non-causal SNPs (nc); SNPs on odd number chromosomes (odd).

| $P_{cut}$ | N (all) | NC (all) | P (all) | $R^2$ (all) | N (nc) | NC (nc) | P (nc) | $R^2$ (nc) | N (odd) | NC (odd) | P (odd) | $R^2$ (odd) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0001 | 53 | 8 | 1.11E-04 | 0.57% | 45 | 0 | 9.07E-02 | 0.11% | 18 | 0 | 9.23E-01 | 0.00% |
| 0.001 | 365 | 25 | 1.30E-03 | 0.40% | 340 | 0 | 5.69E-01 | 0.01% | 157 | 0 | 6.78E-01 | 0.01% |
| 0.01 | 3423 | 89 | 1.40E-04 | 0.56% | 3334 | 0 | 1.29E-01 | 0.09% | 1491 | 0 | 9.51E-01 | 0.00% |
| 0.05 | 16749 | 195 | 2.01E-06 | 0.87% | 16554 | 0 | 8.20E-04 | 0.43% | 7847 | 0 | 6.65E-01 | 0.01% |
| 0.1 | 33136 | 265 | 6.81E-05 | 0.61% | 32871 | 0 | 5.67E-03 | 0.29% | 15877 | 0 | 8.08E-01 | 0.00% |
| 0.2 | 64930 | 362 | 6.21E-04 | 0.45% | 64568 | 0 | 1.74E-02 | 0.22% | 31703 | 0 | 8.37E-01 | 0.00% |
| 0.3 | 96305 | 457 | 3.05E-04 | 0.50% | 95848 | 0 | 8.13E-03 | 0.27% | 47474 | 0 | 9.32E-01 | 0.00% |
| 0.4 | 128077 | 554 | 2.66E-04 | 0.51% | 127523 | 0 | 6.62E-03 | 0.28% | 63377 | 0 | 9.39E-01 | 0.00% |
| 0.5 | 159676 | 647 | 1.98E-04 | 0.53% | 159029 | 0 | 4.82E-03 | 0.30% | 79276 | 0 | 8.90E-01 | 0.00% |

N: number of SNPs in the score; NC: number of causal SNPs in the score; P: P values from the model with the first 10 eigenvectors;
$R^2$: $R^2$ from the model with score and first 10 eigenvectors minus $R^2$ from the model with only eigenvectors.
5.2.2 Simulated genotypes and phenotypes with low pairwise correlation
To show more directly how the narrow sense heritability estimates from the LMM are
influenced by using genotypes that are non-causal but have the same covariance structure
as causal SNPs do, we simulated both genotypes and phenotypes that have low pairwise
correlation among individuals.
Correlated genotypes were generated using the R package “mvtBinaryEP”
(http://cran.r-project.org/web/packages/mvtBinaryEP/index.html) based on the algorithm of
Emrich and Piedmonte [39]. The method relies on simulating multivariate normal vectors and
then dichotomizing each coordinate. The cut points are determined by the vector of
means of the binary variables. The correlation matrix S of the multivariate normal vectors
(whose entries are the tetrachoric correlations) is computed in such a way that the
resulting binary vectors have correlation matrix R. The package was slightly modified so
that when S was not positive definite, the nearest positive definite matrix was used
instead, computed with the R function “nearPD”
(http://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/nearPD.html).
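The dichotomization idea can be illustrated with a simplified Python sketch. It is not a port of mvtBinaryEP: it takes the latent normal correlation matrix S as given (assumed positive definite) and omits the step that solves for S from the target binary correlation matrix R, as well as the nearest-positive-definite repair. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def cholesky(s):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(s)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low

def correlated_binary(latent_corr, p, n_draws, seed=0):
    """Draw binary vectors by dichotomizing correlated standard normals.

    latent_corr -- correlation matrix S of the latent normal vector
    p           -- marginal probability of a 1 in every coordinate
    """
    rng = random.Random(seed)
    chol = cholesky(latent_corr)
    cut = NormalDist().inv_cdf(p)          # P(Z < cut) = p sets the marginal mean
    n = len(latent_corr)
    draws = []
    for _ in range(n_draws):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = [sum(chol[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
        draws.append([1 if zi < cut else 0 for zi in z])
    return draws
```

In the setting described here each coordinate would correspond to an individual and each draw to one allele of one SNP, so a genotype would be the sum of two independent draws.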
We simulated a genotype matrix G for n=2000 individuals; each SNP (column) in G had
correlation matrix K across individuals, mimicking a low level of relatedness. We
sampled the off-diagonal elements of K from the uniform distribution U(0, 0.025), thus the
maximum pairwise correlation between any two individuals is 0.025. We set the minor allele
frequencies (MAF) of SNPs in G equal to 4.5%, 12.6%, 21.5%, 32.2% and 44% (equal
number of SNPs in each MAF category), which is representative of the MAF distribution
in the AAPC GWAS. For each SNP in G, its two alleles were independently sampled from a
Bernoulli distribution with marginal probability equal to its MAF and correlation matrix
equal to K. Defining Z as the standardized form of the genotype matrix G, we randomly
sampled matrices $Z_1$ (2000×1000), $Z_2$ (2000×1000) and $Z_3$ (2000×20000) from Z, thus
each column vector of $Z_1$, $Z_2$ and $Z_3$ has variance-covariance matrix K. We also
simulated $Z_4$ (2000×20000) as the standardized form of a genotype matrix whose column
vectors arose from independent Binomial distributions with marginal probabilities equal to
the five different MAFs stated previously.
Using the SNPs in $Z_1$ and $Z_2$ as causal SNPs, the relationship between phenotypes (on
the liability scale) and genotypes follows the model
$$L = Z_1 u_1 + Z_2 u_2 + \varepsilon$$
where $u_1$ and $u_2$ are the effect sizes of the causal SNPs. Given $u_1$ and $u_2$, the
expected value and variance of L are $E(L) = Z_1 u_1 + Z_2 u_2$ and
$\operatorname{Var}(L) = \sigma^2 I$. If we treat $u_1$ and $u_2$ as random effects as in
the LMM and let $\gamma^2$ and $\tau^2$ be the genetic variances contributed by $Z_1$ and
$Z_2$ respectively, and let $\sigma^2$ be the variance contributed by environmental
factors, the true narrow sense heritability is
$$h^2 = \frac{\gamma^2 + \tau^2}{\gamma^2 + \tau^2 + \sigma^2}.$$
In our simulation, we set $\gamma^2 = \tau^2 = 0.0625$ and $\sigma^2 = 0.125$, which makes
the true narrow sense heritability $h^2$ equal to 50%. We sampled $u_1$ and $u_2$ from the
normal distributions $N(0, \gamma^2 I)$ and $N(0, \tau^2 I)$ respectively to mimic the
genetic contribution of the causal SNPs $Z_1$ and $Z_2$; we sampled L from the multivariate
normal distribution $MN(Z_1 u_1 + Z_2 u_2, \sigma^2 I)$, and then cut L into two categories
at its median, with the half with higher liabilities as cases and the rest as controls.
Table 5-2 shows the results averaged over 20 independent simulations. The 2K causal
SNPs were estimated to explain 52.36% of the phenotypic variance on the liability scale.
The 20K non-causal SNPs in $Z_3$, which have a correlation matrix similar to that of the
causal SNPs, alone explained 25.49% of the liability variance. The proportion of additive
genetic variance contributed by half the causal SNPs ($Z_1$ or $Z_2$) was estimated to be
~27%; the estimate increased considerably, to ~40%, when $Z_3$ was also included in the
model. However, $Z_3$ did not obviously influence the estimates when all the causal SNPs
were included in the model (52.36% compared to 53.25%). Non-causal SNPs in $Z_4$, whose
distribution does not depend on correlation among individuals, explained little of the
phenotypic variance and did not obviously affect the estimates.
Table 5-2 Results of linear mixed model analyses of simulated phenotypes and genotypes
with low pairwise correlation (averaged over 20 independent simulations).

| SNP set | N | $\sigma_g^2$ Estimate (SE) | $\sigma_e^2$ Estimate (SE) | $h^2$ Estimate (SE) |
|---|---|---|---|---|
| $Z_3$ | 20K | 0.041 (0.019) | 0.210 (0.020) | 25.49% (0.120) |
| $Z_4$ | 20K | 0.004 (0.025) | 0.246 (0.026) | 2.61% (0.159) |
| $Z_1$, $Z_2$ | 2K | 0.084 (0.010) | 0.167 (0.009) | 52.36% (0.054) |
| $Z_1$, $Z_2$, $Z_3$ | 22K | 0.085 (0.020) | 0.166 (0.019) | 53.25% (0.122) |
| $Z_1$, $Z_2$, $Z_4$ | 22K | 0.081 (0.027) | 0.169 (0.026) | 50.77% (0.166) |
| $Z_1$ | 1K | 0.042 (0.007) | 0.208 (0.008) | 26.61% (0.043) |
| $Z_1$, $Z_3$ | 21K | 0.063 (0.020) | 0.187 (0.020) | 39.66% (0.121) |
| $Z_1$, $Z_4$ | 21K | 0.036 (0.026) | 0.214 (0.027) | 22.70% (0.163) |
| $Z_2$ | 1K | 0.044 (0.007) | 0.206 (0.008) | 27.67% (0.043) |
| $Z_2$, $Z_3$ | 21K | 0.064 (0.020) | 0.187 (0.020) | 39.98% (0.121) |
| $Z_2$, $Z_4$ | 21K | 0.038 (0.026) | 0.212 (0.027) | 23.72% (0.163) |

$Z_1$ and $Z_2$ are causal variants with covariance matrix K; $Z_3$ are non-causal variants
with the same covariance matrix as $Z_1$ and $Z_2$; $Z_4$ are independent non-causal variants.
5.3 Discussion
The heritability estimates for prostate cancer from both score analysis and LMM
increased as an increasing number of SNPs were used; this provides further evidence for
the highly polygenic nature of prostate cancer. The narrow sense heritability of prostate
cancer was estimated to be approximately 42% according to family studies, and we
estimated from the linear mixed model that more than half of this variation (30/42 = 0.71)
appeared to be tagged by common SNPs. We observed evidence for prostate cancer that
genic regions explain more variation than intergenic regions, implying that causal
variants are more likely to occur in the vicinity of genes than in intergenic regions.
However, we noticed that the number of SNPs classified as genic depends on the definition
of gene boundaries, with wider boundaries classifying more SNPs as genic. We also observed
that SNPs located outside of UTRs together still explained a certain amount of phenotypic
variance. This is consistent with the observation that a substantial proportion of
genome-wide significant SNPs for complex traits is found in intergenic regions: of the 82
prostate cancer GWAS hits with $P < 10^{-5}$ listed in the Catalog of Published Genome
Wide Association Studies as of Jan 2013, 55 are located in intergenic regions.
We noticed that in the analyses of prostate cancer, the narrow sense heritability estimate
from the LMM using the full set of 1 million SNPs (32%) is not obviously increased
compared to using the pruned set of 317K SNPs (30.40%), and furthermore, the estimate
was almost the same when using only 190K of the relatively independent SNPs (30.08%).
An obvious question about this result is whether this implies that the ~190K SNPs
on the array (and those in high LD with them) have a genetic contribution to prostate
cancer risk that is equal to that of the full 1 million SNPs. This question motivated our
simulation studies.
In the GWAS simulation where the causal variants were all genic (UTR+0kb) SNPs on
even chromosomes, Figure 5-1a shows good behavior of heritability estimates using any
set of SNPs that included all the causal variants defined in the simulation; and by using
SNPs located closer to the regions where causal SNPs are distributed, fewer SNPs are
required to obtain a valid estimate of narrow sense heritability.
In the GWAS simulation we restricted SNPs to have limited LD between each other
($R^2 < 0.25$) by careful pruning with a 200-SNP window in Plink, therefore we expect that
the non-causal SNPs will explain at most 25% of the overall narrow sense heritability
(see Appendix B). However, as shown in Figure 5-1b, when all non-causal SNPs were
considered, the misattribution of heritability was fairly large, with the narrow sense
heritability estimate at ~21% (or 70% of the total estimate of 30%). This supports
our concern that the estimate from the LMM is sensitive to the influence of distant
relatedness and might overestimate the genetic contribution of the specific set of
measured SNPs. We found that intergenic versus genic analyses such as those performed by
Yang et al [40] may be sensitive to subtle population stratification too: here non-genic
SNPs were estimated to convey heritability of 12.19% compared to the true value of 0.
When we considered non-causal SNPs that were on the wrong chromosomes, only a very
small fraction of heritability was explained (i.e. very little misattribution of effect).
This may be a reflection of different “histories” of the different chromosomes, so that
people who are similar on one chromosome are not necessarily similar on other chromosomes.
In the second simulation (in which both genotypes and phenotypes were simulated),
we observed that when all the causal variants were included in the model, the LMM gave
an estimate that is very close to the true narrow sense heritability, and this estimate is
not obviously affected by adding non-causal SNPs to the model. However, when the LMM
did not consider all the causal SNPs, the results from this simulation show more
directly that even under mild degrees of population structure ($\rho \le 0.025$), it is
difficult to separate the effects of a given set of causal SNPs ($Z_1$), especially when
the set of SNPs being evaluated ($Z_1$, $Z_3$) includes a large number of SNPs ($Z_3$) that
are not causal but have the same weak correlation matrix as do the causal SNPs.
Our finding is that some of the analyses that are being widely used, such as looking at
intergenic versus genic regions, or looking at contributions of different chromosomes, are
likely less biased by weak levels of population structure than are others (such as
calculating the total fraction of variance that a specific set of SNPs could explain by
themselves when used as predictor variables).
In summary, these observations were consistent with our suspicion that heritability
estimates from the LMM approach tend to reflect total additive heritability rather than the
heritability directly attributable to the action of measured SNPs or variants in high LD
(R²>>0.25) with measured SNPs. Similar concerns may apply to the score analyses as in
Purcell et al. Thus caution is advised in the interpretation of heritability results using
either method.
Bibliography
1. Klein, R.J., et al., Complement factor H polymorphism in age-related macular
degeneration. Science, 2005. 308(5720): p. 385-9.
2. Johnson, A.D. and C.J. O'Donnell, An open access database of genome-wide
association results. BMC Med Genet, 2009. 10: p. 6.
3. He, J., et al., Generalizability and epidemiologic characterization of eleven
colorectal cancer GWAS hits in multiple populations. Cancer Epidemiol
Biomarkers Prev, 2011. 20(1): p. 70-81.
4. Astle, W. and D.J. Balding, Population Structure and Cryptic Relatedness in
Genetic Association Studies. Statistical Science, 2009. 24(4): p. 451-471.
5. Lander, E.S. and N.J. Schork, Genetic dissection of complex traits. Science, 1994.
265(5181): p. 2037-48.
6. Pritchard, J.K. and N.A. Rosenberg, Use of unlinked genetic markers to detect
population stratification in association studies. Am J Hum Genet, 1999. 65(1): p.
220-8.
7. Devlin, B. and K. Roeder, Genomic control for association studies. Biometrics,
1999. 55(4): p. 997-1004.
8. Pritchard, J.K., et al., Association mapping in structured populations. Am J Hum
Genet, 2000. 67(1): p. 170-81.
9. Kang, H.M., et al., Variance component model to account for sample structure in
genome-wide association studies. Nat Genet, 2010. 42(4): p. 348-54.
10. Price, A.L., et al., Principal components analysis corrects for stratification in
genome-wide association studies. Nat Genet, 2006. 38(8): p. 904-9.
11. Chen, G.K., et al., The potential for enhancing the power of genetic association
studies in African Americans through the reuse of existing genotype data. PLoS
Genet, 2010. 6(9).
12. McKeown-Eyssen, G., Epidemiology of colorectal cancer revisited: are serum
triglycerides and/or plasma glucose associated with risk? Cancer Epidemiol
Biomarkers Prev, 1994. 3(8): p. 687-95.
13. Giovannucci, E., Insulin and colon cancer. Cancer Causes Control, 1995. 6(2): p.
164-79.
14. Lush, J.L., Animal breeding plans. 1945, Ames, Iowa: The Collegiate Press, Inc.
15. Falconer, D.S., The inheritance of liability to diseases with variable age of onset,
with particular reference to diabetes mellitus. Ann Hum Genet, 1967. 31(1): p. 1-20.
16. Wright, S., An Analysis of Variability in Number of Digits in an Inbred Strain of
Guinea Pigs. Genetics, 1934. 19(6): p. 506-36.
17. Dempster, E.R. and I.M. Lerner, Heritability of Threshold Characters. Genetics,
1950. 35(2): p. 212-36.
18. Yang, J., et al., GCTA: a tool for genome-wide complex trait analysis. American
Journal of Human Genetics, 2011. 88(1): p. 76-82.
19. Lange, K., Mathematical and statistical methods for genetic analysis. 2nd ed.
Statistics for biology and health. 2002, New York: Springer. xvii, 361 p.
20. Powell, J.E., P.M. Visscher, and M.E. Goddard, Reconciling the analysis of IBD
and IBS in complex trait studies. Nat Rev Genet, 2010. 11(11): p. 800-5.
21. Falconer, D.S., Introduction to quantitative genetics. 3rd ed. 1989, Burnt Mill,
Harlow, Essex, England New York: Longman Wiley. xii, 438 p.
22. Zuk, O., et al., The mystery of missing heritability: Genetic interactions create
phantom heritability. Proc Natl Acad Sci U S A, 2012.
23. Lichtenstein, P., et al., Environmental and heritable factors in the causation of
cancer--analyses of cohorts of twins from Sweden, Denmark, and Finland. N Engl
J Med, 2000. 343(2): p. 78-85.
24. Eichler, E.E., et al., Missing heritability and strategies for finding the underlying
causes of complex disease. Nat Rev Genet, 2010. 11(6): p. 446-50.
25. Maher, B., Personal genomes: The case of the missing heritability. Nature, 2008.
456(7218): p. 18-21.
26. Manolio, T.A., et al., Finding the missing heritability of complex diseases. Nature,
2009. 461(7265): p. 747-53.
27. Eeles, R.A., et al., Identification of 23 new prostate cancer susceptibility loci
using the iCOGS custom genotyping array. Nat Genet, 2013. 45(4): p. 385-91,
391e1-2.
28. Peters, U., et al., Identification of Genetic Susceptibility Loci for Colorectal
Tumors in a Genome-Wide Meta-analysis. Gastroenterology, 2013. 144(4): p.
799-807 e24.
29. Zuk, O., et al., The mystery of missing heritability: Genetic interactions create
phantom heritability. Proc Natl Acad Sci U S A, 2012. 109(4): p. 1193-8.
30. Visscher, P.M., W.G. Hill, and N.R. Wray, Heritability in the genomics era--
concepts and misconceptions. Nat Rev Genet, 2008. 9(4): p. 255-66.
31. Purcell, S.M., et al., Common polygenic variation contributes to risk of
schizophrenia and bipolar disorder. Nature, 2009. 460(7256): p. 748-52.
32. Nagelkerke, N.J.D., A Note on a General Definition of the Coefficient of
Determination. Biometrika, 1991. 78(3): p. 691-692.
33. Yang, J., et al., Common SNPs explain a large proportion of the heritability for
human height. Nat Genet, 2010. 42(7): p. 565-9.
34. Gilmour, A.R., R. Thompson, and B.R. Cullis, Average information REML: An
efficient algorithm for variance parameter estimation in linear mixed models.
Biometrics, 1995. 51(4): p. 1440-1450.
35. Haiman, C.A., et al., Characterizing genetic risk at known prostate cancer
susceptibility loci in African Americans. PLoS Genet, 2011. 7(5): p. e1001387.
36. Browning, S.R. and B.L. Browning, Population Structure Can Inflate SNP-Based
Heritability Estimates. American Journal of Human Genetics, 2011. 89(1): p. 191-193.
37. Zaitlen, N. and P. Kraft, Heritability in the genome-wide association era. Hum
Genet, 2012. 131(10): p. 1655-64.
38. Falconer, D.S., Inheritance of Liability to Certain Diseases Estimated from
Incidence among Relatives. Annals of Human Genetics, 1965. 29: p. 51-&.
39. Emrich, L.J. and M.R. Piedmonte, A Method for Generating High-Dimensional
Multivariate Binary Variates. American Statistician, 1991. 45(4): p. 302-304.
40. Yang, J., et al., Genome partitioning of genetic variation for complex traits using
common SNPs. Nat Genet, 2011. 43(6): p. 519-25.
Appendix A
The Multiethnic Cohort (MEC). The MEC includes 215,251 men and women aged 45–
75 years at recruitment from Hawaii and California (ref). The cohort was assembled in
1993–1996 by mailing a self-administered, 26-page questionnaire to persons identified
primarily through the driver’s license files. Identification of incident cancer cases is by
regular linkage with the Hawaii Tumor Registry and the Los Angeles County Cancer
Surveillance Program, both NCI-funded Surveillance, Epidemiology, and End Results
registries. From the cancer registries, information is obtained about stage and grade.
Collection of biospecimens from incident prostate cases began in California in 1995 and
in Hawaii in 1997 and a biorepository was established between 2001 and 2006 from
67,000 MEC participants. The participation rates for providing a blood sample have been
greater than 60%. Through January 1, 2008 the African American case-control study in
the MEC included 1,094 cases and 1,096 controls.
The Southern Community Cohort Study (SCCS). The SCCS is a prospective cohort of
African and non-African Americans which during 2002–2009 enrolled approximately
86,000 residents aged 40–79 years across 12 southern states (ref). Recruitment occurred
mainly at community health centers, institutions providing basic health services primarily
to the medically uninsured, so that the cohort includes many adults of lower income and
educational status. Each study participant completed a detailed baseline questionnaire,
and nearly 90% provided a biologic specimen (approximately 45% a blood sample and
45% buccal cells). Follow-up of the cohort is conducted by linkage to national mortality
registers and to state cancer registries. Included in this study are 212 incident African
American prostate cancer cases and a matched stratified random sample of 419 African
American male cohort members without prostate cancer at the index date selected by
incidence density sampling.
The Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO). The
Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial [34] is a randomized,
two-arm trial among men and women aged 55–74 years to determine if screening reduced
the mortality from these cancers. Male participants randomized to the intervention arm
underwent prostate specific antigen (PSA) screening at baseline and annually for 5 years
and digital rectal examination at baseline and annually for 3 years. Sequential blood
samples were collected from participants assigned to the screening arm; participation was
93% at the baseline blood draw (1993–2001). Buccal cell samples were collected from
participants in the control arm of the trial; participation was about 85% for this
component. Included in this study are 286 African American prostate cancer cases and
269 controls without a history of prostate cancer, matched on age at randomization and
study year of the trial.
The Cancer Prevention Study II Nutrition Cohort (CPS-II). The CPS-II Nutrition
Cohort includes over 86,000 men and 97,000 women from 21 US states who completed a
mailed questionnaire in 1992 (aged 40–92 years at baseline) [35]. Starting in 1997,
follow-up questionnaires were sent to surviving cohort members every other year to
update exposure information and to ascertain occurrence of new cases of cancer; a >90%
response rate has been achieved for each follow-up questionnaire. From 1998–2001,
blood samples were collected in a subgroup of 39,376 cohort members. To further
supplement the DNA resources, during 2000–2001, buccal cell samples were collected by
mail from an additional 70,000 cohort members. Incident cancers are verified through
medical records, or through state cancer registries or death certificates when the medical
record cannot be obtained. Genomic DNA from 76 African American prostate cancer
cases and 152 age-matched controls was included in stage 1 of the scan.
Prostate Cancer Case-Control Studies at MD Anderson (MDA). Participants in this
study were identified from epidemiological prostate cancer studies conducted at the
University of Texas M.D. Anderson Cancer Center in the Houston Metropolitan area
since 1996. Cases were accrued from six institutions in the Houston Medical Center and
were not restricted with respect to Gleason score, stage or PSA. Controls were identified
via random-digit-dialing or among hospital visitors and they were frequency matched to
cases on age and race. Lifestyle, demographic, and family history data were collected
using a standardized questionnaire. These studies contributed 543 African American
cases and 474 controls to this study (ref).
Identifying Prostate Cancer Genes (IPCG). Cases in this study were patients 1)
undergoing treatment for prostate cancer in the Department of Urology at Johns Hopkins
Hospital from 1999 to 2007; 2) undergoing treatment at the Sidney Kimmel
Comprehensive Cancer Center from 2003 to 2007; and 3) outside referrals as part of the
Hereditary Prostate Cancer Study from 1990 to present. Blood was obtained from groups
2) and 3) while DNA from normal tissue was obtained from group 1). Data are available
on age at diagnosis, race, pretreatment prostate-specific antigen (PSA) values, clinical
pathology values, and family history. The control subjects were men undergoing disease
screening and were not thought to have prostate cancer on the basis of a physical exam
and a serum PSA value below 4 ng/ml. Screenings were performed at the Johns Hopkins
Applied Physics Lab, at Bethlehem Steel in Baltimore, and at local African American
churches in East Baltimore (ref). A total of 368 African American cases and 172 controls
contributed to stage 1.
The Los Angeles Study of Aggressive Prostate Cancer (LAAPC). The LAAPC is a
population-based case-control study of aggressive prostate cancer among African Americans in
Los Angeles County (ref). Cases were identified through the Los Angeles County Cancer
Surveillance Program rapid case ascertainment system and eligible cases included
African American men diagnosed with a first primary prostate cancer between January 1,
1999 and December 31, 2003. Eligible cases also had either tumor extension outside the
prostate, metastatic prostate cancer in sites other than prostate, or needle biopsy of the
prostate with Gleason grade 8 or higher, or Gleason grade 7 and tumor in more than 2/3
of the biopsy cores. Controls were identified by a neighborhood walk algorithm and were
men never diagnosed with prostate cancer, and were frequency matched to cases on age
(±5 years). For this study, genomic DNA was included for 296 cases and 140 controls.
We also included an additional 163 African American controls from the MEC that were
frequency matched to cases on age.
Prostate Cancer Genetics Study (CaP Genes). The African American component of
this study population comprised 160 men: 75 cases diagnosed with more aggressive
prostate cancer and 85 age-matched controls (ref). All subjects were recruited and
frequency-matched on the major medical institutions in Cleveland, Ohio (i.e., the
Cleveland Clinic, University Hospitals of Cleveland, and their affiliates) between 2001
and 2004. The cases were newly diagnosed with histologically confirmed disease:
Gleason score 7; tumor stage T2c; or a prostate-specific antigen level >10 ng/ml at
diagnosis. Controls were men without a prostate cancer diagnosis who underwent
standard annual medical examinations at the collaborating medical institutions.
Case-Control Study of Prostate Cancer among African Americans in Washington, DC (DCPC).
Unrelated men self-described as African American were recruited for several case-control
studies on genetic risk factors for prostate cancer between the years 2001 and 2005 from
the Division of Urology at Howard University Hospital (HUH) in Washington, DC.
Control subjects unrelated to the cases and matched for age (±5 years) were also
ascertained from the prostate cancer screening population of the Division of Urology at
HUH (ref). These studies included 292 cases and 359 controls.
King County (Washington) Prostate Cancer Studies (KCPCS). The study population
consists of participants from one of two population-based case-control studies among
residents of King County, Washington (refs). Incident Caucasian and African American
cases with histologically confirmed prostate cancer were ascertained from the Seattle-
Puget Sound SEER cancer registry during two time periods, 1993–1996 and 2002–2005.
Age-matched (5-year age groups) controls were men without a self-reported history of
being diagnosed with prostate cancer and were identified using one-step random digit
telephone dialing. Controls were ascertained during the same time periods as the cases. A
total of 145 incident African American cases and 81 African American controls were
included from these studies.
The Gene-Environment Interaction in Prostate Cancer Study (GECAP). The Henry
Ford Health System (HFHS) recruited cases diagnosed with adenocarcinoma of the
prostate of Caucasian or African American race, less than 75 years of age, and living in
the metropolitan Detroit tri-county area (ref). Controls were randomly selected from the
same HFHS population base from which cases were drawn. The control sample was
frequency matched at a ratio of 3 enrolled cases to 1 control based on race and five-year
age stratum. In total, 637 cases and 244 controls were enrolled between January 2002 and
December 2004. Of study enrollees, DNA for 234 African American cases and 92
controls was included in stage 1 of the scan.
Appendix B
Suppose phenotype Y is determined by m causal SNPs in a linear model
$$Y_i = \alpha + \sum_{j=1}^{m} \beta_j x_{ij} + \varepsilon_i \quad \text{and} \quad \mathrm{var}(Y) = \sum_{j=1}^{m} \mathrm{var}(x_{\cdot j})\,\beta_j^2 + \sigma_\varepsilon^2$$
The narrow sense heritability is
$$h^2 = \frac{\sum_{j=1}^{m} \mathrm{var}(x_{\cdot j})\,\beta_j^2}{\mathrm{var}(Y)}$$
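As a numerical illustration of this definition, the following Python sketch simulates the linear model and compares the variance formula with the empirical ratio of genetic to phenotypic variance. It assumes, beyond what is stated here, that the causal SNPs are independent (no LD), that genotypes are Binomial(2, p) allele counts, and that the residual is normally distributed; the function name simulate_h2 and all parameter values are hypothetical.

```python
import random


def variance(values):
    """Population variance of a list of numbers."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def simulate_h2(betas, freqs, sigma_e, n, seed=0):
    """Return (formula h2, empirical h2) for Y = a + sum_j beta_j * x_j + e.

    Assumptions (illustrative only): independent causal SNPs, genotypes
    drawn as Binomial(2, p) allele counts, normal residual with SD sigma_e.
    The intercept is omitted because it does not affect any variance.
    """
    rng = random.Random(seed)
    # var(x_j) = 2p(1-p) for a Binomial(2, p) genotype.
    var_g = sum(2 * p * (1 - p) * b ** 2 for p, b in zip(freqs, betas))
    h2_formula = var_g / (var_g + sigma_e ** 2)

    genetic, pheno = [], []
    for _ in range(n):
        g = 0.0
        for p, b in zip(freqs, betas):
            # Two independent allele draws per SNP give a 0/1/2 genotype.
            allele_count = (rng.random() < p) + (rng.random() < p)
            g += b * allele_count
        genetic.append(g)
        pheno.append(g + rng.gauss(0.0, sigma_e))
    h2_empirical = variance(genetic) / variance(pheno)
    return h2_formula, h2_empirical


if __name__ == "__main__":
    print(simulate_h2([0.3, 0.2, 0.4], [0.2, 0.5, 0.3], 1.0, 20000))
```

With a large simulated sample the two returned values should agree closely; under LD between causal SNPs the simple sum of per-SNP variances would no longer equal the genetic variance, so the independence assumption matters for this check.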
When none of the causal SNPs was measured, we substituted causal genotypes with
predicted values using measured non-causal genotypes in G. The variance components
model can be written as
$$Y_i = \alpha + \sum_{j=1}^{m} \beta_j\,E(x_{ij} \mid \mathbf{G}_i) + \varepsilon_i \quad \text{and} \quad \mathrm{var}(Y) = \sum_{j=1}^{m} \mathrm{var}\big(E(x_{\cdot j} \mid \mathbf{G})\big)\,\beta_j^2 + \sigma_\varepsilon^2$$
By the law of total variance
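$$\mathrm{var}(x_{\cdot j}) = \mathrm{var}\big(E(x_{\cdot j} \mid \mathbf{G})\big) + E\big(\mathrm{var}(x_{\cdot j} \mid \mathbf{G})\big)$$
$$\mathrm{var}\big(E(x_{\cdot j} \mid \mathbf{G})\big) = \mathrm{var}(x_{\cdot j})\left(1 - \frac{E\big(\mathrm{var}(x_{\cdot j} \mid \mathbf{G})\big)}{\mathrm{var}(x_{\cdot j})}\right) = \mathrm{var}(x_{\cdot j})\,R^2,$$
where $R^2$ is the coefficient of determination.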
Thus the narrow sense heritability estimate using non-causal SNPs is
$$h^2_{\text{Non-causal}} = \frac{\sum_{j=1}^{m} \mathrm{var}(x_{\cdot j})\,\beta_j^2\,R^2}{\mathrm{var}(Y)}$$
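The attenuation by R² derived above can be checked numerically. The Python sketch below is only an illustration under stated assumptions: a single unmeasured causal SNP is predicted from a single measured SNP by least squares (a linear stand-in for the conditional expectation), and LD between the two is induced by copying the causal allele onto the measured haplotype with probability copy_prob. The function name attenuation_demo and all parameter values are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance of a list of numbers."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def attenuation_demo(n=20000, p=0.3, copy_prob=0.5, seed=1):
    """Return (var of predicted causal genotype, var(x) * R^2, R^2).

    x is the unmeasured causal genotype and g the measured non-causal one.
    On each haplotype the measured allele copies the causal allele with
    probability copy_prob, otherwise it is an independent draw, so the
    expected squared correlation is copy_prob ** 2.
    """
    rng = random.Random(seed)
    x, g = [], []
    for _ in range(n):
        xi = gi = 0
        for _ in range(2):  # two haplotypes per individual
            causal = rng.random() < p
            if rng.random() < copy_prob:
                measured = causal
            else:
                measured = rng.random() < p
            xi += causal
            gi += measured
        x.append(float(xi))
        g.append(float(gi))

    mx, mg = mean(x), mean(g)
    cov = mean([(a - mx) * (b - mg) for a, b in zip(x, g)])
    slope = cov / variance(g)
    # Least-squares prediction of x from g (linear approximation to E(x | g)).
    predicted = [mx + slope * (b - mg) for b in g]
    r2 = cov ** 2 / (variance(x) * variance(g))
    return variance(predicted), variance(x) * r2, r2


if __name__ == "__main__":
    print(attenuation_demo())
```

In this sketch the first two returned values coincide because least-squares fitted values satisfy the identity exactly in sample; for a general conditional expectation the same relation holds at the population level through the law of total variance.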
In our simulations we restricted SNPs to have limited LD between each other ($R^2 < 0.25$).
This should mean that at most 25% of the total heritability (i.e. ~8%) should be explained
by non-causal SNPs,
Abstract
Background: Validated SNPs in genome-wide association studies (GWAS) account for only a small fraction of the variation in human complex traits. Methods: We applied score analysis to 6957 not closely related individuals in African American prostate cancer (AAPC) GWAS to see whether common variants have an important role en masse
Asset Metadata
Creator: He, Jing (author)
Core Title: Polygenes and estimated heritability of prostate cancer in an African American sample using genome-wide association study data
School: Keck School of Medicine
Degree: Doctor of Philosophy
Degree Program: Biostatistics
Publication Date: 08/07/2013
Defense Date: 05/13/2013
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: GWAS, heritability, OAI-PMH Harvest, population stratification, prostate cancer
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Stram, Daniel O. (committee chair), Coetzee, Gerhard (Gerry) A. (committee member), Gauderman, William James (committee member), Haiman, Christopher A. (committee member)
Creator Email: hej@usc.edu, jingheusc@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-319645
Unique identifier: UC11294618
Identifier: etd-HeJing-1988.pdf (filename), usctheses-c3-319645 (legacy record id)
Legacy Identifier: etd-HeJing-1988.pdf
Dmrecord: 319645
Document Type: Dissertation
Rights: He, Jing
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA