Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Multivariate methods for extracting genetic associations from correlated data
(USC Thesis Other)
Multivariate methods for extracting genetic associations from correlated data
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
MULTIVARIATE METHODS FOR EXTRACTING GENETIC
ASSOCIATIONS FROM CORRELATED DATA
by
Mary Helen Black
__________________________________________________________
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(STATISTICAL GENETICS AND GENETIC EPIDEMIOLOGY)
August 2009
Copyright 2009 Mary Helen Black
ii
Acknowledgements
I would like to thank Dr. Richard Watanabe, my dissertation committee chair and
mentor, without whose guidance, encouragement and support, this work would not have
been possible. This work was supported by funds awarded to Dr. Richard Watanabe by
the National Institutes of Health (RO1 DK061628 and DK069922), and by the NIH Cell,
Biochemical and Molecular Sciences Training Grant. I would also like to thank my
committee for their invaluable feedback on my thesis, and for graciously extending extra
time and effort to discuss the ideas presented in this paper. Lastly, I thank my husband,
Robert, for his boundless and enduring support, understanding, and encouragement. This
paper and the degree to follow could never have been realized without his faith,
dedication and love, for which I am eternally grateful.
iii
Table of Contents
Acknowledgements ii
List of Tables v
List of Figures vii
Abstract ix
Chapter 1: Background 1
Chapter 2: A Principal Components-Based Clustering Method to
Identify Variants Associated with Complex Disease 16
Introduction 16
Methods 20
Analytical Approaches 20
Simulations 34
Results 41
Simulation Studies 41
Data Analysis 53
Discussion 56
Chapter 3: The Effects of Gene Size, Patterns of Linkage Disequilibrium
and Quantity of SNP Information on Oblique Principal
Components-Based Clustering 60
Introduction 60
Methods 61
Results 67
Discussion 82
Chapter 4: Principal Components-Based Clustering in the Presence
of Epistasis 86
Introduction 86
Methods 90
Analytical Approaches 90
Simulations 93
Results 98
Simulation Studies 98
Data Analysis 106
Discussion 111
iv
Chapter 5: Future Work 114
Bibliography 117
v
LIST OF TABLES
Table I: Example eigenvector loadings for SNP 1 and SNP 2 on PCs and 30
Rotated PCs
Table II: Descriptive Statistics for nine SNPs in TCF7L2 used as Causal
Loci in the Simulations 39
Table III: Estimated Type 1 Error for Single SNP, Joint SNP, PC and
OPCC methods under two scenarios for nine CVs within
TCF7L2 43
Table IV: Estimated Power
for Single SNP, Joint SNP, PC and OPCC
methods under two scenarios for nine CVs within TCF7L2 45
Table V: Proportion of CVs or tag SNPs correctly identified by OPCC
method 52
Table VI: HNF4A Association with OGTT-measured 2-hour Insulin 55
Table VII: HNF4A SNP-Cluster Associations with OGTT-measured
2-hour Insulin 56
Table VIII: Descriptive Statistics for three SNPs in GCK used as Causal
Variants in the Simulations 67
Table IX: Estimated Type 1 Error for PCA and OPCC methods under two
scenarios for three CVs in GCK 69
Table X: Estimated Type 1 Error for PCA and OPCC methods under two
scenarios for CVs in TCF7L2 70
Table XI: Estimated Power for PCA and OPCC methods under two
scenarios for three CVs in GCK 72
Table XII: Estimated Power for PCA and OPCC methods under two
scenarios for nine CVs in TCF7L2 73
Table XIII: Average number of PCs and Clusters required for analysis
under various scenarios of genotype availability and variance
explained 75
vi
Table XIV: Estimated Type 1 Error
for SNP, PC, and Cluster interactions
under various interaction scenarios 99
Table XV: Estimated Type 1 Error for SNP, PC, and Cluster interactions
under scenario of no marginal effects 100
Table XVI: Estimated Power for Pairwise SNP, PC, and Cluster interactions
under three scenarios 101
Table XVII: Mean (Std) Characteristics of Cleveland Clinic Cohort 107
Table XVIII: CETP and APOA5 SNP-Cluster Associations with HDL 110
vii
LIST OF FIGURES
Figure 1: PC and Rotated PC Components 27
Figure 2: Plot of SNP 2 vs SNP 1 29
Figure 3: Plot of PC
2
vs PC
1
, orthogonal rotation 29
Figure 4: Plot of PC
2
vs PC
1
, quartimax rotation 30
Figure 5: Pairwise LD structure among 144 SNPs in TCF7L2 from
HapMap CEU data. 37
Figure 6: Pairwise LD structure among 144 SNPs in single example
simulated data set. 37
Figure 7: Estimated haplotypes from HapMap CEU panel. 38
Figure 8: Estimated haplotypes from 144 SNPs in a single example
simulated data set. 38
Figure 9: Under scenario 1 of genotype availability (54 tags + CV),
estimated power of PC and OPCC tests of global association for
60% and 80% explained variance thresholds. 48
Figure 10: Under scenario 2 of genotype availability (54 tags only),
estimated power of PC and OPCC tests of global association for
60% and 80% explained variance thresholds. 48
Figure 11: Under scenario 1 of genotype availability (54 tags + CV),
estimated power of univariate PC and OPCC tests of association
for 60% and 80% explained variance thresholds. 50
Figure 12: Under scenario 2 of genotype availability (54 tags only),
estimated power of univariate PC and OPCC tests of association
for 60% and 80% explained variance thresholds. 50
Figure 13: Pairwise LD structure among 16 SNPs spanning HNF4A P1 and
P2 promoter regions from FUSION data. 54
Figure 14: Pairwise LD structure among 53 SNPs in GCK from HapMap
CEU data (GCK). 63
viii
Figure 15: Pairwise LD structure among 53 SNPs in single example
simulated data set (GCK). 63
Figure 16: Pairwise LD structure among 53 SNPs in single example
simulated data set (GCK). 64
Figure 17: Pairwise LD structure among 144 SNPs in single example
simulated data set (TCF7L2). 65
Figure 18: Pairwise LD structure among 144 SNPs in single example
simulated data set (TCF7L2). 66
Figure 19: Under scenario 1 of genotype availability (54 tags in TCF7L2, 18
tags in GCK), estimated power of PC and OPCC tests of global
association for 60% and 80% explained variance thresholds. 77
Figure 20: Under scenario 2 of genotype availability (133 SNPs in TCF7L2,
41 SNPs in GCK), estimated power of PC and OPCC tests of
global association for 60% and 80% explained variance thresholds. 78
Figure 21: Under scenario 1 of genotype availability (54 tags in TCF7L2, 18
tags in GCK), estimated power of PC and OPCC tests of univariate
association for 60% and 80% explained variance thresholds. 79
Figure 22: Under scenario 2 of genotype availability (133 SNPs in TCF7L2,
41 SNPs in GCK), estimated power of PC and OPCC tests of
univariate association for 60% and 80% explained variance
thresholds. 79
Figure 23: Estimated power of PC univariate tests of association under 3
scenarios of genotype availability (Tag SNPs only, Tag SNPs +
CV, All SNPs) using 60% and 80% explained variance thresholds. 81
Figure 24: Estimated power of OPCC univariate tests of association under 3
scenarios of genotype availability (Tag SNPs only, Tag SNPs +
CV, All SNPs) using 60% and 80% explained variance thresholds. 82
Figure 25: Estimated power of ITF SNP, PC and OPCC interaction tests
under two scenarios of genotype availability (Tag SNPs + CV,
All SNPs) using 60% and 80% explained variance thresholds. 105
ix
ABSTRACT
Multivariate methods ranging from traditional joint SNP methods to principal
components analysis (PCA) have been developed for testing multiple markers in a region
for association with disease and disease-related traits. However, these methods suffer
from low power or the inability to identify the subset of markers contributing to the
evidence for association under various scenarios. In this paper, a new approach is
introduced, based on PCA with an orthoblique rotation (OPCC), in order to identify
specific subsets of markers genotyped in a candidate region showing associating with an
outcome of interest. When compared to traditional methods, the OPCC approach has
similar or improved power, but also identifies the unique subset of markers contributing
to the evidence for association. The utility of OPCC is demonstrated with an example
from type 2 diabetes literature.
In this paper, the OPCC methodology is also extended to investigation of
epistasis, or gene-gene interactions, which is believed to underlie most complex diseases.
OPCC is directly compared to multiple linear regression and PC-modeling of
multiplicative interactions, under various scenarios. In most cases, the power of OPCC to
detect interactions in the presence of epistasis is greater than that of pairwise SNP or PC
approaches. Power is further enhanced when analyzed in an interaction testing
framework (ITF). OPCC for detecting gene-gene interactions is applied to cohort data,
for two candidate genes known to have a role in cardiovascular outcomes. Overall,
OPCC proves to be a powerful and efficient method to determine whether multiple
variants and their interactions are associated with complex disease traits.
1
Chapter 1. Background
The notion that complex diseases, such as hypertension, type 2 diabetes and
obesity, are likely caused by multiple disease gene variants is becoming widely accepted.
Genes alone have not fully accounted for the increasing prevalence of these diseases;
complex disorders have a complex etiology. However, it is increasingly recognized that
these diseases have a polygenic component which contributes to their susceptibility,
pathogenesis, and effectiveness of treatment. Which genes and polymorphisms are
responsible for these disease characteristics are still largely debated, as traditional
candidate gene approaches have yielded numerous single variant associations lacking
reproducibility, and have virtually ignored multiple variant hypotheses.
More than a decade ago, it was proposed that complex diseases arise from high
frequency alleles, in what is known as the “common disease, common variant” (CD-CV)
hypothesis [Risch and Merikengas, 1996; Collins et al, 1997]. Over the past several
years, this conjecture and its implications have been widely deliberated, with an ultimate
consensus that if the CD-CV hypothesis is true, despite the inconsistent replication of
candidate gene association findings, then there must exist many common disease
susceptibility variants, which are not necessarily located in coding regions of the gene
[Cargill et al, 2002; Halushka et al, 2002; Cheung and Spielman, 2002], and each of
small or modest effect and relatively low or moderate penetrance [Stewart, 2002;
Pritchard and Cox, 2002; Lohmueller et al, 2003]. Several authors have also posited that
if such variants are responsible for complex disease susceptibility, it may be necessary to
2
carry risk alleles at a combination of loci, both within and across genes, in order to confer
disease risk [Stewart, 2002; Weedon et al, 2006].
Preliminary investigation of a candidate gene typically involves identification of a
region of interest followed by assessment of the joint contribution of multiple loci within
that region on disease risk or a disease-related trait. Standard statistical approaches to
assessing single or joint effects of K diallelic loci on a single quantitative disease-related
trait involve the following framework [Cordell, 2002]:
y = β
0
+ β
1
G
1
+ …. + β
k-1
G
k-1
+ β
k
G
k
(1)
where y is a quantitative phenotype, assumed to be approximately normally distributed,
and G
i
are coded with values of 0, 1 or 2 for the respective number of minor alleles at
locus i, where i = 1,…, K. Wald χ
2
statistics for each parameter indicate the significance
of each locus effect adjusted for the presence of the other loci in the model. To test
significance of multiple variant effects, a likelihood ratio (LRT) test, approximated by a
χ
2
statistic on K degrees of freedom (df), can be used to test the additive or joint effect of
K loci. An LRT comparing equation (1) to the null model y = β
0
yields an omnibus test
of H
0
: β
1
=β
2
= … = β
K
=0. Rejection of the null hypothesis indicates that at least one β
i
≠ 0, leaving the K-df LRT χ
2
statistic to be interpreted as the significance level for the
joint, or additive effect of all K loci.
However, testing whether multiple loci in a candidate gene are jointly associated
with disease or a disease-related trait within a logistic or linear regression framework
assumes the genetic model for each locus is known and linearity in a quantitative
outcome or in log-odds of disease. From a statistical point of view, the number of
3
degrees of freedom linearly increases with the number of loci. As degrees of freedom are
an index of the amount of random variability, larger χ
2
values are required to achieve
significant p-values when more degrees of freedom are utilized, which detracts from the
power of joint models to detect significant effects when large numbers of loci are tested.
Moreover, as variants within a candidate gene region are likely to be highly
correlated, and highly correlated SNPs statistically have an equal chance of being
selected for analysis, traditional association analysis can not discern the true causal
variant(s) from others that are simply in strong linkage disequilibrium with the causal
variant(s). A standard approach developed in order to statistically cope with problems
that arise from testing a large number of highly correlated SNPs involves combining the
SNP data into a single variable, known as a haplotype. Because sets of loci on a
chromosome may be linked, or transmitted as a single unit, and the alleles at each of
those loci may then share some amount of linkage disequilibrium (LD), haplotypes can
be constructed which represent combinations of alleles at multiple linked loci [Schaid,
2004]. In this way, the multiple SNPs in a candidate gene are essentially categorized into
regions of LD, which are then tested for association with disease or disease-related trait.
Haplotype analysis may have statistical advantages over single and joint SNP
analyses, and from a biological perspective, may actually be more useful in complex
disease analysis, as haplotypes may be carriers of transcriptional and regulatory function
transmitted as a single unit [Clark, 2004]. However, there exists numerous ways to
define haplotype blocks in a given region; the Gabriel method [Gabriel et al, 2002]
suggests a block definition based on the confidence bounds of D′, while the Four Gamete
4
Rule determines block boundaries by examining whether all four gametes are represented
in the data at a given locus. Analysis of a single data set often yields different haplotype
block patterns depending on the definition specified; the algorithm used to define the
blocks affects both the number of estimated haplotypes and the efficiency with which a
single haplotype captures the genetic diversity in a given region [Gauderman et al, 2007].
Additionally, in a population-based study, individual haplotypes are not directly
observed, but must be inferred probabilistically from unphased genotype data. The most
commonly employed method of haplotype assignment for this type of study, the
Expectation-Maximization (EM) algorithm, utilizes maximum likelihood estimation to
find the values of haplotype probabilities which optimize the probability of the observed
data [Stephens et al, 2001]. While EM is a relatively stable algorithm, has well
established statistical properties and yields easily interpreted results, it has a few major
drawbacks. EM can be potentially trapped in local extrema, which may prevent the
algorithm from converging or identifying other more important modes, and cannot be
used with a very large number of loci due to computational constraints [Qin et al, 2002].
EM also assumes Hardy Weinberg Equilibrium (HWE) in the pooled sample, which may
be problematic, particularly if there are an increased proportion of heterozygotes
[Stephens et al, 2001]. Other haplotype reconstruction methods that try to account for
haplotype ambiguity by modeling the distribution of haplotypes in some way, still share
this issue of deviation from HWE, making it difficult to infer the true distribution of
haplotypes from population-based multi-locus genotype data [Stephens et al, 2001].
5
Several methods for testing haplotype association with disease or disease-related
phenotypes have been proposed [Zhao et al, 2000; Fallin et al, 2001; Schaid et al, 2002;
Zaykin et al, 2002; Zhao et al, 2003; Epstein and Satten, 2003]. But whether performing
a global test for overall association between haplotypes and disease, or testing specific
haplotype effects, all tests require that the distribution of haplotypes be specified.
Because this distribution relies on major assumptions (such as HWE), inference on
haplotypes can lead to substantial bias in parameter estimates, even when complete
genotype data is available [Allen and Satten, 2008].
Due to the issues surrounding regression analysis and haplotype-based testing of
joint effects across multiple loci in a candidate gene, other tests for joint SNP effects have
been proposed. Recently, principal components-based methods for examining
association between multiple SNPs in a candidate region and disease [Gauderman et al,
2007], as well as quantitative traits [Wang et al, 2007], have been introduced. The aim of
the principal components analysis (PCA) approach is the reduction of a large set of
correlated genetic markers to relatively few uncorrelated factors that account for a large
portion of the genetic variation in a candidate gene region. These orthogonal factors, or a
subset that exceed a predetermined threshold for explaining some proportion of the total
genetic variation, can then be tested for association with disease or disease-related trait in
a regression framework.
Principal components (PCs) are composed of eigenvectors derived from the
covariance among SNPs, such that each PC is linear combination of K SNPs that
accounts for some proportion of variation among all SNPs. The PCs can be ordered such
6
that the first PC (PC
1
) explains the highest proportion, PC
2
the second-highest, etc. Thus,
if G
i
are genotypes with values of 0, 1 or 2 for the respective number of minor alleles at
locus i, where i = 1,…, K, the K
th
PC takes the form:
PC
K
: e
K
G = e
1K
G
1
+ e
2K
G
2
+ ….. + e
KK
G
K
(2)
and explains the least amount of variance compared to all other PCs. Letting S be the
subset of K PCs which meet some combined threshold of explained variance (ex: 80%), it
has been shown that only S PCs may be used for analysis, with very little loss of
information [Gauderman et al, 2007]. Given a set of eigenvector coefficients
corresponding to S PCs, and genotype scores G
i
, PC scores can be computed for each
individual in the data set. Thus, the PC regression method produces an equation similar
to (1), but where PC scores, rather than genotypes themselves, serve as the independent
variables in the model:
y = β
0
+ β
1
PC
1
+ β
2
PC
2
+ ….. + β
s
PC
s
(3)
An omnibus test of association between PCs and outcome variable involves performing
an LRT by comparing equation (3) to the null model y = β
0.
The resulting LRT χ
2
statistic with corresponding S df would yield the significance level for the joint effect of
the S PCs. The result of this analysis is conventionally interpreted as the association
between variation in the gene region (represented by K SNPs) and disease outcome.
If exploring whether SNPs in a candidate gene region are associated with disease
status or disease-related trait, the use of PCA in a regression framework has several
advantages over traditional methods. Under most circumstances, this approach was
found to be more powerful than the joint SNP and haplotype association analyses
7
primarily because only a subset of PCs are actually analyzed, which results in a reduced
number of model dfs [Gauderman et al, 2007]. Gauderman et al. note that the PC
approach is particularly more powerful than haplotype-based association tests when the
true disease-causing variant is included in the analysis, along with htSNPs in the gene
region. Aside from discrepancies in methods for estimating haplotype distributions, a
causal variant effect may be attenuated across multiple haplotypes, leading to a loss of
power for the haplotype-based test [Gauderman, et al 2007]. Additionally, since PCs are
utilized in a single test of global association, this approach does not require a multiple
testing correction be applied to the LRT p-value. Other advantages of this procedure
include that is that it is relatively straightforward and is currently implemented in many
common software packages.
However, a major limitation of global PC association analysis is the inability to
interpret results beyond overall candidate region association with a disease outcome. In
general, several authors have proposed that the magnitude and/or sign of the eigenvector
coefficients e
i
can be used to interpret the weighting of each original variable on a given
PC [Gauderman et al, 2007; Atkinson et al, 2006; Chase et al, 2007]. Other investigators
contend this is not the case, and that after omnibus association between candidate region
and disease outcome has been established, one must still conduct a series of single SNP
or other type of test in order to determine which variants driving the association [Wang et
al, 2007]. Depending on the number of independent variables, and their underlying
correlation structure, both arguments are correct. It is possible, that with a very small
number of SNPs, the eigenvector loadings, which are analogous to correlations between
8
the SNPs and PCs, a distinct pattern might be discernable for each PC. However, when
analyzing a very large number of SNPs over an expansive gene region, the eigenvector
loadings may not appear to be in such contrast within a given PC, rendering the PC un-
interpretable. While PCA has been shown to be a powerful technique for assessing
global association between SNPs and disease, it is unlikely to be very useful in
elucidating the contribution of specific SNPs, when large numbers of SNPs are analyzed.
In addition to joint effects within individual candidate genes, complex disease
analysis necessitates investigation of multiple genes, each of which may have a direct or
indirect role in conferring the disease phenotype. If testing interactions among multiple
variants across multiple genes, it must also be considered that these variants may not
exert their effects on disease in an additive manner. A simple test for significance of a
multiplicative interaction between any two loci would involve adding another term,
formed from the product of the two loci, to the model in (1):
y = β
0
+ β
1
G
1
+ …. + β
j
G
j
+ β
k
G
k
+ β
jk
G
j
G
k
(4)
with a LRT that compares equation (4) to equation (1), and is evaluated for significance
with a χ
2
statistic on 1 df. Again, assuming the linearity assumptions of the model are
satisfied, testing all possible genetic models for each locus, across different modes of
interaction for even two loci can result in a large number of tests. Given K loci,
K!/(2!(K – 2)!) pairwise comparisons may be made, such that the number of tests that
must be performed scales exponentially with the number of loci. If p-values are not
corrected for this large number of tests, the result is an extremely high type 1 error rate
and numerous false positives. However, it is thought that correction of p-values for
9
multiple testing using a Bonferroni or Sidak correction can result in reduced power,
especially if the loci are correlated [Roeder et al, 2005]. In fact, Millstein et al. found the
multiplicative model approach with an LRT described above, to be even less efficient and
less powerful than any other LRT-based method, including marginal tests of association
for single loci. While each multiplicative model test utilizes only 1 df, and is popularly
interpreted as the synergistic effects of two loci, it generally does not have the power
necessary to detect interactions between loci that each contribute a small effect on disease
and have relatively low penetrance, after correction for multiple testing. Thus,
multiplicative models used in an LRT framework are usually underpowered to detect
effect sizes relevant for complex disease association.
In view of the issues surrounding the multiplicative interaction regression model,
several parametric, non-parametric and semi-parametric alternatives have been suggested
to deal with the problem of dimensionality in attempting to detect epistasis. However,
most of these alternative methods are specific to either a binary or continuous outcome,
tend to show a wide range of type 1 error and power under various scenarios, and can
even yield contradicting results when the same data are analyzed with multiple
procedures [Wilcox et al, 2007]. Combined with low acceptance and reliability, many of
these methods are not implemented in common, well-known software.
For example, alternative methods such as Set Association Method [Hoh et al,
2001; Ott and Hoh, 2003], Classification and Regression Tree Analysis (CART)
[Brieman et al, 1984], Logic Regression [Koopberg et al, 2001], Multifactor-
Dimensionality Reduction (MDR) [Ritchie et al, 2001] and the Focused Interaction
10
Testing Framework (FITF) [Millstein et al, 2005] were all initially developed with the
aim of identifying interacting loci across multiple candidate genes associated with binary
disease-status, and have not yet been further developed for quantitative trait analysis,
except for the suggestion of dissecting the continuous phenotype at some arbitrary
threshold in order to transform it into a dichotmous outcome. Conversely, Combinatorial
Partitioning Method (CPM) [Nelson et al, 2001] and Restricted Partitioning Method
(RPM) [Culverhouse et al, 2004], are specific to testing the effect of multiple loci on
quantitative outcomes, and have not been adapted for categorical disease status.
Set Association Method attempts to find subgroups of loci across multiple
candidate genes that have the greatest combined evidence for association with disease.
Set Association involves first computing a Hardy-Weinberg Equilibrium (HWE) χ
1
2
statistic among controls, in order to identify and remove from the analysis, any SNPs
with large χ
1
2
values due to genotyping error. The method then entails creating a
summary statistic from the product of a HWE χ
1
2
calculated among cases, and the χ
1
2
from a case-control test of association (2x3 table or logistic regression), for each marker.
These summary statistics are ranked in decreasing order, irrespective of genomic
location, and summed across various combinations of SNPs. Multiple sums are formed
across increasing numbers of loci, and p-values for each sum determined with
permutation tests, with the ultimate goal of finding the number of SNPs n that best
reflects association of those SNPs with disease. The minimum p-value for the sum
statistic over n SNPs is then evaluated for global significance with permutation tests, to
correct for testing multiple sums. Advantages of this technique are that it is
11
straightforward in implementation, uses a sum statistic which has been shown to be more
powerful than single marker tests of association, and uses permutation tests, which
although computationally intensive, cut down on the false positive rate without being
overly conservative. However, markers in LD will likely show the same level of
association, and the method may end up grouping markers in LD within a gene, but not
necessarily markers that represent a functional combination of loci. Also, this method
failed to perform well in analyses of high order interactions between variants with high
frequency alleles or of small effect, and was shown to have low power in the presence of
heterogeneity, admixture, and small/modest sample sizes [Vermuelen et al, 2007].
CART, a recursive partitioning procedure, begins with one node (all data) and
uses inference-based recursive modeling to determine the first optimal and each
subsequent “split” of the data, with the aim of producing a decision tree to identify
subgroups of high-risk subjects. By subdividing subjects by risk alleles for various loci,
CART is able to identify high-order SNP interactions across multiple genes [Huang et al,
2007]. CART’s advantages are that it is non-parametric, which implies that it does not
require any assumptions about the underlying distribution of predictor variables, it is able
to handle missing data, and its results are easily interpreted. However, missing values
and random noise in the data can make it difficult to determine when to stop splitting the
nodes, resulting in over-fitting of the model which can lead to reduced power [Zhang and
Bonney, 2000]. Over-fitting is particularly problematic for most tree-based methods, and
as a result, a great deal of investigation has been devoted to methods that improve
CART’s predictive power, such as setting a priori thresholds of explained variance for
12
new node generation, and post hoc procedures that involve tree pruning and/or cross-
validation. Additional disadvantages of CART are that it is computationally intensive,
and is not included in most major statistical software packages.
Logic Regression is an adaptive regression method, in which binary variables are
assembled into logic expressions, which can also be written in tree form, that serve as the
predictor in a regression equation vs disease status. Misclassification, or the difference
between the predicted outcome and real outcome, is used to score each logic expression.
The aim of this method is to then adaptively select each logic expression using a
stochastic simulated annealing algorithm: start with a simple expression and at each
stage, a new tree made by simple operations on the current tree, is selected at random. A
new tree replaces a current tree if it has a better score than the old tree, and is accepted
with a probability that depends on the difference between the scores of the old and new
tree and stage of the algorithm. Using logic regression, SNPs are converted into Boolean
predictors (for example, on the basis of the presence or absence of a risk allele) and the
adaptive annealing algorithm used to determine the best SNP logic expression (ie., SNP1
and SNP2 but not SNP3, etc.). The final, “best” tree is then fitted in a regression model,
with each tree potentially representing complex, high-order interactions among SNPs
from various genes [Kooperberg et al, 2006]. One major advantage of this approach is
that one can simultaneously examine different genetic models (ie. dominant, additive,
recessive, and combinations thereof). However, this approach is also subject to the same
drawbacks of tree-based methods described above. Also, if the underlying association
between variants and disease is based on a multiplicative interaction between loci, logic
13
regression failed to perform well if there were more than two causal loci, and if the loci
were of high frequency and/or of small effect [Vermeulen et al, 2007]. While logic
regression is implemented in a module available as an R package, computation time
increases dramatically with increasing numbers of SNPs.
CPM and RPM, designed for quantitative outcome analysis, are flexible, non-
parametric techniques like many of the methods presented above, and are effective in
reducing the dimensionality of data. The goal of CPM is to find a way to divide all
multi-locus genotypes into subgroups that explain a large proportion of the overall trait
variation, with statistical significance evaluated by permutation testing. Yet, because the
CPM algorithm exhaustively examines all possible partitions of the data, it does not
perform well for very large sets of loci. RPM is similar in concept to CPM, but uses the
marginal penetrances of each locus instead of marginal mean trait values. Also, instead
of calculating all possible partitions, RPM attempts to find the most reasonable partition
for evaluation, balancing the maximization of between-group variation with minimization
of within-group variation and the number of groups. However, it has been shown that
RPM is highly sensitive to sparse data, and neither CPM nor RPM are currently
implemented in any well-established or open source software [Heidema et al, 2006].
Inspired by CPM, MDR begins by dividing the data set into a training set (~90%
of the data) and independent testing set (~10% of the data). A set of SNPs are selected,
case-control ratios for all possible combinations of genotypes in the SNPs set are
calculated, and based on these ratios, SNPs are then classified as high-risk or low-risk.
High risk and low risk classifications are then used as the basis for pooling multi-locus
14
genotype groups, and the best combination of SNPs for a given model size, defined by
having the lowest proportion of misclassification and prediction error, is chosen as the
final multi-locus model. To ensure model accuracy, cross validation on subsets of the
training set are compared to the independent testing set. Significance of the final model
is assessed with permutation testing. A major advantage of MDR is that it has been
shown to be able to find interacting variants that independently have little to no single
locus effects. However, if the interaction is multiplicative in nature, MDR did not have
sufficient power to detect the interaction for high frequency alleles, or when the
interaction consisted of more than two loci. MDR has also been shown to have relatively
low power in the presence of genetic heterogeneity, admixture, and small or modest
sample sizes [Vermeulen et al, 2007]. Additionally, MDR requires an extensive
computational burden with a very large number of loci [Vermeulen et al, 2007].
More recently, the FITF approach by Millstein et al was developed in contrast to
MDR. The FITF approach employs a series of likelihood ratio tests on logistic models of
markers vs disease, in stages conditional on positive results from the previous stage. In
principal, the interaction testing framework (ITF) begins with analysis of single main
effects, where likelihood ratio tests are performed on β
k
for each of K variants. These
first stage tests are corrected for multiple testing with FDR. The second stage involves
testing a pairwise interaction model for all possible two-gene sets (full model) against a
reduced model which includes either 1) both main effects, if both variants were
significant in stage 1, 2) one main effect, if one variant was significant in stage 1, or 3) no
main effects, if neither were significant in stage 1. Thus, these scenarios for the reduced
15
model require a 1 df, 2 df, or 3 df LRT, respectively. All LRTs in stage 2 are then
adjusted for multiple testing by controlling FDR. This framework is quite useful,
because it takes into account the possibility that interacting variants may or may not have
significant main effects and accordingly adjusts the LRT χ
2
degrees of freedom. Because
this framework still involves a large number of tests, Millstein et al further reduce the
number of tests performed by first “pre-screening” variant combinations with a goodness
of fit χ
2
statistic derived from comparing the observed and expected distribution of each
marker in the pooled case-control sample, and then analyzing only the sets with a
calculated χ
2
statistic above a certain cut-off. Advantages of the FITF include that it was
shown to have greater power than MDR under a variety of scenarios, it is relatively
simple and straightforward to implement, and FITF software is readily available.
However, FITF is specific to case-control design, and has not yet been adapted to studies
involving quantitative traits, although the ITF can certainly be utilized for any outcome.
In summary, studying complex diseases involves 1) assessing whether good
candidates are in fact associated with disease-related traits, 2) examining potential
synergistic interactions of variants within candidate genes that may account for variation
in complex disease outcomes. In chapter 2, an oblique clustering method that reduces
dimensionality of the data and maintains the power of the PC approach, but allows for
unique identification of SNPs in a candidate gene region that are associated with
outcomes of interest, is presented. This approach is compared to the joint SNP and
traditional PC regression methods for examining association with disease outcomes, and
is applied to quantitative phenotype data among elderly and spousal controls from the
16
FUSION study. The aim of chapter 3 is to further explore the composition of the oblique
clusters, in the context of using tag SNP vs all gene variants, for a typical analysis. The
characteristics of clusters resulting from these analyses are compared for genes of various
sizes and underlying LD patterns, in order to determine the range of effects these genetic
attributes may have on cluster definition and the power of the clustering approach.
Because chapter 2 and 3 focus on identifying variants associated with disease
outcomes in the context of a single causal variant hypothesis, chapter 4 necessarily
extends the methods presented in chapter 2 to multiple causal variant hypotheses. To
date, the principal components approach has not been applied to the analysis of multiple
SNPs across various genes or in instances where the association with disease or disease-
related trait depended on multiplicative interaction among two or more causal variants.
In chapter 4, PCA and oblique cluster analysis are explored in the context of PC and
cluster interaction models, as methods for examining epistasis underlying complex
disease traits. Both a standard “all pairwise” approach and an interaction testing
framework (ITF) as described by Millstein et al, are explored.
Chapter 5 presents a discussion of future work, focusing primarily on extension
and refinement of the methods presented in chapter 4. The preliminary “focus” step of
the FITF described by Millstein et al is a powerful and efficient means of reducing the
number of variant interactions to be tested, and is utilized in the context of a case-control
study design. Potential extension of this approach in the context of cross-sectional study
design involving quantitative traits is considered.
17
Chapter 2. A Principal Components-Based Clustering Method to Identify
Variants Associated with Complex Disease Traits
INTRODUCTION
The notion that complex diseases are the result of the combined contribution of
various genetic markers with relatively small or modest effect sizes is becoming
increasingly accepted. For over a decade, investigators have approached candidate gene
studies by genotyping groups of individuals, such as population-based case-control
subjects or family members of an affected proband, and either sequencing or screening a
gene of interest among those subjects. As the development of the HapMap made specific
SNP information available to investigators, many chose to simply identify SNPs in a
particular HapMap population and follow up with genotyping those variants in a given
sample population. Once genotype data was obtained, call rates established, and quality
control measures used to filter out unwanted SNPs, individual testing of these SNPs
commenced in the traditional single SNP association testing approach, according to the
type of data being used in the analysis. However, it was soon discovered that testing
individual SNPs in a candidate gene or region necessarily introduces the issue of multiple
test corrections, which reduce power in order to control type 1 error. The issues
surrounding the low power of single SNP tests in identifying casual variants for complex
disease were further exacerbated by the lack of reproducibility in the findings of many
modest-sized candidate gene association studies. Many individual common variants of
small or modest effect have been identified from these studies. Yet the disparity of
results in populations of different evolutionary ancestry, and heterogeneity in disease
18
onset and pathology even within a single population, have led investigators to believe that
many such variants are responsible for disease, and single-SNP testing in each individual
population is not the most efficient way to find them.
Thus the development of multivariate methods to identify causal loci arose as an
important area of investigation for complex diseases. Gene-based investigations began
with joint SNP tests, in which an omnibus multi-df test is used to assess multiple SNP
(within gene) association with disease or disease-related traits. Understanding correlation
and inheritance patterns among SNPs led to development of haplotype-based methods of
association. However, haplotype analyses, in which genetic information from a series of
loci residing in a haplotype block is used in tests of association, are not without issue.
Three separate methods currently exist for defining haplotype block structure and
boundaries: Confidence Interval [Gabriel et al, 2002], Four Gamete [Wang et al, 2002],
and Solid Spline [Barrett et al, 2005]. While the method of Gabriel et al is by far the
most widely used, all three methods produce differing block definitions. Additionally, if
the sample is population-based, and phase information is not available, haplotype
frequencies are usually then estimated and the most likely haplotype assigned to each
individual. Methods which allow for haplotype uncertainty also require a great deal of
computing resources.
All the above issues aside, assessment of haplotype contribution to disease is still
plagued with many of the statistical issues surrounding traditional single and joint SNP
analyses. If testing haplotypes with an omnibus multi-df test, in a manner analogous to
the joint SNP test, many degrees of freedom can detract from power to detect true
19
associations. Strong SNP signals, when they do exist, may be attenuated across a
haplotype in a haplotype-based test. Testing individual haplotypes for association
requires correction for multiple testing, which may further reduce power to detect effects.
In consideration of joint SNP and haplotype-based methodological disadvantages,
new multivariate techniques have recently been developed. A promising alternative to
identifying multiple variants in a candidate gene that are associated with disease or
disease-related quantitative traits was recently offered by Gauderman et al in the form of
a principal components regression framework. In their approach, a principal components
(PC) analysis is used to derive linear transformations of the original SNP data, in which
eigenvectors are chosen to maximize the variance of each PC relative to the overall
variation in the gene region. Given the eigenvalues, each of which represents the variance
of a particular PC, a typical analysis involves selecting only the PCs which account for
some proportion of the total variation, thereby reducing the number of parameters to be
tested. This subset of PCs can then be used as covariates in an omnibus association test
with a disease or disease-related trait [Gauderman et al, 2007; Wang et al, 2008].
The PC approach has been shown to have greater power than haplotype-based
tests to detect association between multiple SNPs and disease, especially when the
number of haplotypes is very large [Gauderman et al, 2007]. However, the coefficients
that make up each eigenvector are derived from the pair-wise correlations among the
SNPs, and can not be used to infer any particular biological meaning. Additionally, the
eigenvector loadings of the original variables on a PC may not reflect the true importance
20
of the original SNPs to that PC, making the association between multiple PCs and disease
or disease-related traits difficult to interpret [Wang et al, 2008].
In this chapter, a PC-based clustering method is introduced as an alternative
approach that reduces dimensionality of the data and maintains the power of a PC
approach, but allows for unique identification of multiple SNPs associated with disease
and disease-related traits. The algorithm uses an orthoblique rotation of PCs on a set of
SNP data to form distinct clusters of SNPs. The goal is to define a subset of clusters that
explains a large portion of the total variance, such that those clusters can then be tested
for association with a disease or disease-related trait. Most importantly, each cluster is
defined by a specific set of SNPs being tested for association, which allows for specific
identification of SNPs of interest in the region being tested. The proposed oblique PC-
based clustering method was compared with traditional single and joint SNP tests, as well
as the Gauderman et al PC approach, in a simulation study based on Transcription Factor
7 Like-2 (TCF7L2), and in an application to Hepatocyte Nuclear Factor 4-alpha (HNF4A)
among spousal and elderly controls from the FUSION study.
METHODS
ANALYTICAL APPROACHES
Single SNP and Joint SNP Association Tests
Suppose genotype data is collected on K SNPs, where G
i
is the genotype score
coded according to a particular genetic model (0, 1, 2 for additive; 0, 1 for a dominant or
recessive) the main effects model:
y = β
0
+ β
i
G
i
where i = 1, …, K (1)
21
can be tested for significance, either with a 1-df LRT for equation (1) vs the null model
y = β
0
, or an asymptotically equivalent Wald χ
1
2
statistic on the parameter estimate β
i
,
where a multiple testing correction must be applied to the p-values resulting from K tests.
If the SNPs are independent, the Bonferroni method of p-value adjustment, or
multiplying each p-value by K, is appropriate. However, even if tag SNPs were chosen
from those in the data set using any means of selection under almost any criteria, and
analysis completed on only these “independent” markers, some amount of residual
correlation will inevitably exist among the SNPs. Therefore, the Bonferroni correction in
this context is overly conservative in its attempt to eliminate false positives, while
approaches that take correlation among the markers into account, such as permutation
testing, or pACT [Conneely and Boehnke, 2007], have been shown to be more powerful.
Thus, a multi-df test for joint association may be a more powerful alternative to
the individual SNP test, depending on the model specified. A basic joint test involves:
0
=1
=
K
k k
k
y g β + β
∑ (2)
and using a K-df LRT to test the likelihood of equation (2) vs a null model: y = β
0
. This
is formally equivalent to testing the null hypothesis β
0
=β
1
=…=β
K
= 0, which if rejected,
can be interpreted as one β
i
≠ 0, but does not elucidate which specific β
i
≠ 0. Because
investigators are typically interested in finding the “causal SNP”, or at least the one that
is most significantly associated with the outcome, joint association is typically utilized in
testing genes for which multiple variants have been genotyped. A joint test may be used
to determine regional importance for disease and is followed by series of single SNP
tests, for which in practice, a multiple testing correction is typically applied.
22
Principal Components Association Analysis (PCA)
The aim of the principal components analysis (PCA) approach is the reduction of
many correlated genetic markers to relatively few uncorrelated factors that represent a
large proportion of the genetic variation in the candidate gene region. The efficiency of
the PCA approach resides in the fact that, if a set of K variables (where K > 2) have
substantial correlation among them, then the first few PCs should account for most of the
variation in the original variables [Joliffe, 2002]. Thus, only a subset of these orthogonal
factors, which explains most of the total genetic variation, need be tested for association
with disease or a continuous disease-related trait in the regression framework.
In PCA, the principal components (PCs) are essentially linear combinations of
SNPs, such that each PC represents a weighted average of variant contribution to the
overall genetic variance in a given region. Let G be a data matrix with n observations
and k SNPs, each coded as 0, 1 or 2 for the number of observed minor alleles:
[ ]
k n
nk n n
k
k
k
g g g
g g g
g g g
G G G G
×
= =
L
M M M M
L
L
L
2 1
2 22 21
1 12 11
2 1
Here, the SNPs are treated as linear, or continuous variables, such that each can be
centered on its mean prior to analysis. Let ΔG represent the centralized matrix:
k n
nk n
k
k n
k nk n
k p
g g
g g
g x g g
g x g g
G
×
Δ Δ
Δ Δ
×
=
− −
− −
= Δ
L
M M M
L
L
M M M
L
1
1 11
. 1 . 1
. 1 1 . 11
23
The covariance matrix V would then be:
k k
kk k
k
k k
n
i
ik ik
n
i
i ik
n
i
ik i
n
i
i i
T
v v
v v
g g g g
g g g g
n
G G
n
×
×
= =
= =
=
Δ Δ Δ Δ
Δ Δ Δ Δ
−
= Δ Δ
−
=
∑ ∑
∑ ∑
L
M O M
L
L
M O M
L
1
1 11
1 1
1
1
1
1
1 1
1
1
1
1
V
from which the eigenvalues (λ), or variance of each PC, can be extracted. By subtracting
a scalar matrix (formed from the product of λ and identity matrix) from V:
k k
kk k
k
k k
v v
v v
× ×
−
−
=
− = −
λ
λ
λ λ
L
M O M
L
L
M O M
L
1
1 11
1 0
0 1
V I V
and setting the determinant of the resulting matrix equal to 0, one can solve for λ:
( )
k k
k
k
k
c c c
λ λ λ λ λ λ
λ λ λ
λ
> > > ∋ =
= + + + +
= −
−
−
K K
K
2 1 2 1
1 2
1
1
, , ,
0
0 det
λ λ λ λ
I V
Once the eigenvalues (λs) have been computed, their corresponding eigenvectors can be
computed:
( )
( )
( )
= − + +
= + + −
= =
−
−
= −
× ×
0
0
, , 2 , 1 where , 0
1 1
1 1 11
1
1
1
1 11
ki i kk i k
ki k i i
k
ki
i
k k
i kk k
k i
i
e v e v
e v e v
k i
e
e
v v
v v
λ
λ
λ
λ
λ
K
M
K
K M
L
M O M
L
E I V
24
such that each eigenvalue (λ) corresponds to a given eigenvector (e
i
). Eigenvectors are
determined subject to the constraint that e
i
T
e
i
= 1, which ensures the eigenvectors remain
unit vectors, and the covariance of PC
i
with PC
j
is equal to 0, for i≠j. Thus, the PCs are
optimal linear transformations of k eigenvectors, resulting in k orthogonal linear models:
k k k kk k k k k
k k
k k
PC Var g e g e g e g e
PC Var g e g e g e g e
PC Var g e g e g e g e
λ
λ
λ
= + + + = =
= + + + = =
= + + + = =
) ( , PC
) ( , PC
) ( , PC
2 2 1 1
2 2 2 2 22 1 12 2 2
1 1 1 2 21 1 11 1 1
K
M
K
K
(3)
where the k eigenvector elements for each eigenvalue represent the coefficients, or
weights, of k SNPs for each linear model, and e
i
T
e
j
= 0, for i≠j. Thus, PC
1
is uncorrelated
with any other PC, and represents the linear combination of SNPs that explains the
largest amount of genetic variation in the region. PC
2
shares no correlation with any PC,
and explains the next largest amount of SNP variation, and so forth (λ
1
> λ
2
> … > λ
k
).
Because the sum of the eigenvalues is equal to the sum of the variances at each
locus:
kk k
v v v + + + = + + + K K
22 11 2 1
λ λ λ
the proportion of the overall variation each PC can explain can also be written as:
∑ ∑
= =
=
k
i
i
i
k
i
ii
i
v
1 1
λ
λ λ
Once all PCs have been delineated for a given set of SNPs, a subset of PCs (PC
1
, PC
2
, …,
PC
S
, S < k) may be selected that explain a predetermined threshold for the proportion of
the total variance. The S PCs meeting this threshold criteria for explained variance can
25
be tested for association with disease status or continuous phenotype using logistic or
linear regression approaches [Gauderman et al, 2007; Wang et al, 2007]. For example,
given a data set of k SNPs, it may be determined that the first S PCs jointly account for
80% of the total genetic variation represented by the k SNPs. Assuming 80% is deemed
an acceptable threshold for amount of variation explained by the PCs, a test of
association between these S PCs and a quantitative phenotype could be carried out in a
simple regression framework:
0
=1
=
S
s s
s
y PC β + β
∑ (4)
An S-df LRT of model (4) vs the null model y = β
0
provides an omnibus test of whether
this gene region significantly explains a large proportion of the variation in trait y. Thus
PCA can be used to determine if the variation contained in a set of SNPs is likely to be
associated with variation in a continuous disease-related phenotype.
Oblique Principal Components-Based Clustering (OPCC)
PCA is an efficient variable reduction method, allowing PC regression to have
increased power to detect candidate region association with disease outcomes over joint
SNP and haplotype-based approaches. However, it is very difficult to determine the
specific contributions of SNPs comprising the candidate region to disease. Because
eigenvector elements of a given PC can be thought of as the correlation of each variable
with that PC, it has been argued that if the eigenvector coefficients are distributed such
that a few are very small and a few are large, the PC can be interpreted as a
representation of those variables which show large eigenvector values [Joliffe, 2002].
26
However, unless there are a reasonably small number of variables to begin with, and
hence a small number of PCs, this is usually not the case. More often, several
intermediate loadings convolute the interpretation of a given PC. The lack of PC
interpretability is problematic because the investigator must still rely on traditional SNP
association approaches after PC regression confirms significant candidate region
association with disease.
Currently, several methods exist which serve to make PCs more interpretable.
Such methods were developed on the basis that it is easier to simply interpret the k-
dimensional space represented by the set of SNPs than it is to interpret each individual
principal component [Joliffe, 2002]. One way to accomplish this is to rotate the axes
within this k-dimensional space in a manner that simplifies the interpretation of the axes
as much as possible. Thus various methods of rotation depend upon what is considered
the “simplicity criterion” for the given method. According to Thurston, a matrix of
loadings (where rows correspond to original variables, or SNPs, and columns to the PCs)
is simple in structure if: 1) each row contains at least one zero, 2) for each column, there
are at least as many zeros as there are columns, 3) for every pair of PCs, there are some
variables with zero loadings on one PC and large loadings on the other PC, 4) for every
pair of PCs, there is a high proportion of zero loadings, 5) for every pair of PCs there is
only a small number of large loadings [Thurston, 1947]. Thus, the simplicity criterion
underlying almost all rotational methods is that which drives the loadings toward zero, or
towards their maximum possible absolute value (as close to 1 as possible), so that
intermediate eigenvector values are avoided [Joliffe, 2002]. The ultimate goal is for the
27
investigator to discern which variables are most important (those with loadings that have
large absolute values) and those with are less important (those with loadings close to 0).
Rotation methods are typically categorized as orthogonal, oblique or orthoblique.
Specific types of orthogonal rotations include varimax, quartimax, equimax and
parsimax. In general, let an orthogonal rotation matrix be denoted by R
mxn
, where
original PCs are rows and newly rotated PCs are columns. Each element of R,
represented by the intersection of row m and column n is the cosine of the angle between
the original axis and the new one: r
m,n
= cos(θ
m,n
). For example, a rotation of two PCs
where θ
1,1
= 45º would be characterized by the matrix:
−
=
=
1 , 1 1 , 1
1 , 1 1 , 1
2 , 2 1 , 2
2 , 1 1 , 1
cos sin
sin cos
cos cos
cos cos
θ θ
θ θ
θ θ
θ θ
R R R R
and could be illustrated as:
Figure 1. PC and Rotated PC Components
As demonstrated in Figure 1, rotated PC
2
and rotated PC
1
maintain an orthogonal
relationship with one another.
28
Varimax rotation, notably the most popular type of orthogonal transformation,
involves searching for a rotation R in which the variance of the loadings is maximized.
This essentially amounts to maximizing the sum of the squared difference between the
squared loading of each variable on a given PC and the mean of the squared loadings. In
other words, this approach aims to maximize:
( )
∑∑
= =
− =
k
i
k
j
j i j i
e e V
1 1
2
2
,
2
, (5)
where V is variance of the loadings for a given PC,
2
, j i
e is the squared loading of the i
th
variable on the j
th
PC, and
2
, j i
e is the mean of the squared loadings. Thus, in each PC,
the large eigenvector coefficients will be increased and the small ones decreased so that
each PC only has a few variables with large loadings. While the varimax rotation
maximizes the squared loadings in each PC, which simplifies the columns of the rotated
PC matrix, the quartimax rotation maximizes the squared loadings in each variable,
which simplifies the rows of the rotated PC matrix. This is equivalent to maximizing
equation (5) where the mean of the squared loadings is set equal to 0 (
2
, j i
e = 0). Large
eigenvector coefficients are increased, and small ones are decreased for each variable, so
that each variable will only identify with a few PCs. Thus, quartimax minimizes the
number of PCs that relate to each variable. Equimax rotation is a compromise between
varimax and quartimax; the equimax approach maximizes a weighted sum of the varimax
and quartimax criteria in an attempt to achieve simple structure within variables as well
as within PCs [Joliffe, 2002].
29
As a simple illustration, let two SNPs (SNP 1 and SNP 2) have both been coded
as 0, 1 or 2 according to the number of minor alleles, for each subject in a given data set.
Then for some number of subjects, one could plot SNP 2 vs SNP 1 (Figure 2):
Figure 2. Plot of SNP 2 vs SNP 1
Performing PCA on these two SNPs as described above, yields two PCs with eigenvector
loadings on SNP 1 and SNP 2 for each PC. A plot of PC
2
vs PC
1
(Figure 3) yields:
Figure 3. Plot of PC
2
vs PC
1
, orthogonal rotation
SNP 1
SNP 2
x
x x
x x
x
x
x
x
PC
1
PC
2
x
x
x
x
x
x
x
x
x
2 - 2
2
- 2
2
2
- 2
- 2
30
As shown in Figure 3, even with only two PCs, there appears to be a large amount
of variation accounted for by each PC, but perhaps a little more by PC
1
than PC
2
. A
quartimax rotation performed on these two PCs reveals:
Figure 4. Plot of PC
2
vs PC
1
, quartimax rotation
A line of Rotated PC
2
= Rotated PC
1
fits perfectly though the graph in Figure 4, and thus
Rotated PC
2
and Rotated PC
1
would be expected to share the same amount of variation
across the original variables. Table I contains the eigenvector elements for PC
1
and PC
2
,
as well as rotated PC
1
and PC
2
. SNP 1 and SNP 2 have the same eigenvector values on
PC
1
and PC
2
, but clearly have different sized loadings on the rotated PCs. From this
table, one can infer that rotated PC
1
captures SNP 2, and rotated PC
2
SNP 1.
Table I. Example eigenvector loadings for SNP 1 and SNP 2 on PCs and Rotated PCs
PC
1
PC
2
Rotated PC
1
Rotated PC
2
SNP 1 -0.70711 0.70711 -0.0991 0.9951
SNP 2 0.70711 0.70711 0.9951 -0.0991
Proportion of total variation 0.599 0.401 0.500 0.500
Rotated PC
1
Rotated PC
2
x
x
x
x
x
x
x
x
x
2 - 2
- 2
2
31
In view of the orthogonal rotation approaches, which constrain PCs axes to a 90º
relationship, it can still be argued that such rotation is not reflective of the underlying
relationships between correlated variables. Oblique rotation procedures allow PCs or
factors to be correlated, and as such, can sometimes achieve a simple structure that better
fits the data than orthogonal transformations [Meyers et al, 2006]. Because the obliquely
rotated PCs are not orthogonal, the PC loadings may not be equivalent to the correlation
coefficients between the variables and the PCs. Hence, most techniques for obliquely
rotating factors or PCs also yield separate correlation values between variables and PCs.
The most common oblique transformation procedures are oblimin, quartamin and
covarimin. All such approaches are generalized from their orthogonal counterparts;
oblimin is generalized from varimax, quartimin from quartimax, etc. The main goal of
all oblique approaches is to minimize the sums of the cross-products (across PCs) of
squares of the eigenvector elements. If T is an ideal matrix with simple structure, also
known as a “target” matrix, then oblique rotation methods attempt to minimize the sum
of the squared differences between the rotated PC matrix and T [Carroll, 1953].
In order to gain the efficiency of the orthogonal transformation, but the simplicity
of interpretability of the oblique rotation, orthoblique rotations have been suggested.
Orthoblique rotations are typically two step processes in which the first step is orthogonal
transformation (varimax, quartimax, etc.), involving a search for an R matrix that yields a
rotated PC that approximates simple structure (as described above). The second step
involves computing a least square fit from the orthogonally rotated PC to the target
matrix T. Harris and Kaiser developed a class of orthoblique rotations in which they
32
claimed to obtain an oblique solution by orthogonal transformations of any given matrix
[Harris and Kaiser, 1964].
As implemented in the SAS Varclus procedure, the orthoblique clustering
algorithm begins with K SNPs initially grouped into a single cluster. PC analysis is
performed on the initial cluster, with an orthoblique rotation (quartimax-based) of the
first two principal components (PC
1
and PC
2
), which allows the initial cluster to be
divided into two clusters. Because the quartimax-based rotation is used, each SNP will
have a non-zero loading on only one of the two PCs, and loading of zero on the other,
allowing for two disjoint clusters to be formed. PC analysis then continues within each
of the two clusters, and hence, the first two PCs for each cluster are continually
calculated on a different set of variables at each step of the procedure. As such, the
clusters are disjoint, but the first PC of one cluster may be correlated with the first PC of
a different cluster, and have an oblique relationship.
As this process iterates, the algorithm assigns each SNP to the rotated component
with which it has the higher squared correlation. PC analysis within newly formed
clusters and SNP assignment continue iteratively, assigning SNPs to clusters, calculating
the overall proportion of variance explained by the clusters, and re-testing each SNP to
determine if assigning it to a different cluster increases the amount of variance explained.
At each step, the algorithm attempts to maximize the total variance accounted for by the
cluster components.
Thus for K SNPs, N clusters of the form:
33
K KN N N N N
K K
K K
g c g c g c g c
g c g c g c g c
g c g c g c g c
+ + + = =
+ + + = =
+ + + = =
K
M
K
K
2 2 1 1
2 2 22 1 12 2 2
1 2 21 1 11 1 1
C
C
C
(6)
are computed such that where C
N
is the N
th
cluster score, c
N
is a vector of standardized
cluster coefficients, and g is the vector of SNPs [g
1
, g
2
, …, g
k
]. While all SNPs are
subdivided into a total of N clusters, the number of SNPs in each cluster varies, yielding
cluster coefficients equal to zero for SNPs not included in the N
th
cluster. Thus, given the
cluster coefficients and individual genotype scores, cluster scores are computed for each
individual. As such, the correlation described above between SNPs and their clusters is
determined by alternating least squares for a given SNP regressed on its cluster score:
∑
∑
−
−
= =
i
n n
i
k k g
Total
g
i i
i i
C C
g g
SS
SS
R
2
2 2
Re 2
) (
) ( β
(7)
where β
g
is based on the analysis of variance for regression of the n
th
cluster vs. the k
th
SNP,
k
g is the mean value for the k
th
SNP, which is coded as 0,1, or 2 according to the
number of minor alleles,
n
C is the mean cluster score for n
th
cluster, and i is the number
of observations.
As with PCA, where K PCs on K SNPs jointly account for 100% of the locus
variation, K clusters, each composed of a single SNP, would also account for 100% of the
total variation. Also as with PCA, not all PCs or cluster divisions are required to capture
a substantial portion of locus variability. Therefore, only the S cluster formations
satisfying a predetermined threshold for explained variance need be tested for association
34
with disease status or continuous phenotype using logistic or linear regression
approaches. For example, given a data set of K SNPs, it may be determined that the first
S clusters jointly account for 80% of the total genetic variation represented by the K
SNPs. Assuming 80% is deemed an acceptable threshold for amount of variation
explained by the PCs, a test of association between these S clusters and a quantitative
phenotype could be carried out in a simple regression framework:
0
=1
=
N
n n
n
y C β + β
∑ (8)
Likelihood ratio tests with S degrees of freedom can be used to assess the significance of
model (8) vs. the null model (as previously described). Given significant association
under the omnibus test of S clusters, and the interpretability of each cluster, it is
reasonable to test individual clusters. This can be achieved with either a 1 df LRT for y =
β
0
+ β
i
C
i
vs the null model y = β
0
, where i = 1,…S, or an asymptotically equivalent Wald
χ
1
2
statistic on the parameter estimate β
i
, where a multiple testing correction should be
applied to the p-values resulting from S tests. All statistical methods and association tests
implemented in SAS (v 9.1).
SIMULATIONS
Generation of Genotypes
In order to simulate genotypes for a candidate gene that closely models one which
is known to be associated with complex disease, the Transcription Factor 7-Like 2
(TCF7L2) gene was chosen as the source of genetic data from which to simulate
genotypes. TCF7L2, a ubiquitously expressed nuclear transcription factor originally
35
identified as a susceptibility gene for type 2 diabetes through linkage analysis [Grant et
al, 2006], has been shown to be associated with disease in numerous studies across
various populations [Florez et al, 2006; Groves et al, 2006; Zhang et al, 2006; Scott et al,
2006; Damcott et al, 2006]. Recently, variants in TCF7L2 were among the most
significant findings in genome-wide association studies of type 2 diabetes [Saxena et al,
2007; Scott et al, 2007; Zeggini et al, 2007], and have been shown to be associated with
diabetes-related quantitative traits, such as impaired beta cell function or decreased
insulin response to glucose [Munoz et al, 2006; Saxena et al, 2006; Florez et al, 2006;
Dahlgren et al, 2007; Watanabe et al, 2007; Palmer et al, 2008], high triglyceride levels
[Humphries et al, 2006; Huertas-Vazquez et al, 2008], and low HDL [Melzer et al,
2006]. As TCF7L2 has shown replicable evidence for association with complex disease
as well as disease-related traits, it was determined to be a reasonable candidate from
which to generate genotype data for this simulation study.
Based on analysis of the original HapMap CEU data in Haploview (V. 4.1)
[Barrett et al, 2005], of the 233 markers comprising the ~215 kb region denoted by
TCF7L2, 144 are polymorphic in the CEU sample. Thus, the basis for simulating
genotypic data was the observed distribution of these 144 SNPs in TCF7L2 from the 60
founders in the HapMap CEU panel. Let g
1
,…,g
144
be the 144 SNP loci, and G
1
,...,G
144
be the corresponding SNP genotypes to be randomly assigned to each subject. As
outlined by Gauderman et al, the genotype simulation approach begins by generating G
1
from a multinomial distribution with proportions determined by the observed sample
distribution f(g
1
). G
2
is then sampled from the distribution of SNP 2 conditional on SNP
36
1, or f(g
2
|G
1
). This process continues until G
15
, which is simulated according to
f(g
15
|G
1
,…,G
14
). Due to the large number of SNPs that comprise the sample space, a
fixed window size of 15 SNPs was used to model all SNPs after G
15
, i.e. G
144
was
generated from f(g
144
|G
129
,…,G
143
). The result is the simulation of random SNP data with
similar frequencies and LD patterns to the observed SNPs in the HapMap CEU sample.
Using the approach described above, genotype data for the 144 SNPs in the
TCF7L2 locus were simulated for 1,000 data sets, each with 1,000 subjects. Analysis of
the original HapMap CEU data in Haploview (V. 4.1) [Barrett et al, 2005] indicates that
the 144 SNPs are distributed across 12 haplotype blocks as defined by Gabriel et al, with
one particularly large block of LD (Block 5, SNP 17 – SNP 62) spanning ~65 kb, shown
in Figure 5. 11 markers with minor allele frequencies < 0.01 were excluded from further
analysis. Of the remaining 133 SNPs, 54 tag SNPs were identified using the pairwise
tagging algorithm in Tagger [de Bakker et al, 2005], as implemented in Haploview, with
an r
2
threshold equal to 0.6. The pairwise r
2
distribution in the simulated data was similar
to that observed in HapMap CEU data. The LD structure from a single simulated data set
is shown in Figure 6. Using Gabriel et al definition for haplotype blocks, the simulated
data shows 15 blocks, including one large block encompassing almost the same SNPs as
those shown in the HapMap CEU data (SNP 17 – SNP 61). The remaining block
definitions are also, for the most part, very similar. The additional 3 blocks appear to be
defined in regions of low LD where blocks did not appear in the CEU panel.
37
Figure 5. Pairwise LD structure among 144 SNPs in TCF7L2 from HapMap CEU data.
White squares indicate r
2
=0, grey squares 0 < r
2
< 1, black squares r
2
=1.
Figure 6. Pairwise LD structure among 144 SNPs in single example simulated data set.
White squares indicate r
2
=0, grey squares 0 < r
2
< 1, black squares r
2
=1.
38
Additionally, most of the common haplotypes in the simulated data (Figure 7) had
frequencies similar to those in the real data (Figure 8). For haplotype frequencies
differing by more than 2% from those in original data, such haplotypes typically
belonged to blocks which also differed in size by 1 or more SNPs compared to the
original data. The similarities between the real and simulated data demonstrate that the
Gauderman et al approach to simulating genotype data, with a conditioning window of 15
SNPs, is adequate to capture the LD structure observed in TCF7L2 from the HapMap
CEU panel. However, the differences in block definitions and haplotype frequencies
highlight the limitations of the sliding window approach in simulating long range LD
patterns. All genotype simulations implemented in R (v. 2.6.1).
Figure 7. Estimated haplotypes from 144 SNPs in a single example simulated data set.
Figure 8. Estimated haplotypes from HapMap CEU panel.
Generation of Phenotype
Subsequent to the generation of genotype data for 144 SNPs, across 1,000 data
sets, each with 1,000 individuals, one of the SNPs (not a tag SNP) was chosen to be a
causal variant (CV). Based on the pairwise r
2
between tag SNP and CV, as well as
39
position within the gene region, nine distinct choices of CV were considered (Table II).
Three CVs (SNPs 54, 26, 60) and were located in the large haplotype block, exemplified
rare, modest and high allele frequencies, and were in strong LD with their designated tag
SNPs (r
2
> 0.915). Three CVs (SNPs 135, 143, 141) were located in a smaller block near
the 3’ end of the gene, were of low, medium and high frequency, and were in modest LD
with their corresponding tag SNPs (0.781 < r
2
< 0.798). The remaining three CVs (SNPs
72, 105, 94) were not located in any block, represented rare, medium frequency and
common SNPs, and showed relatively modest LD with tags (0.666 < r
2
< 0.747).
Table II. Descriptive Statistics for nine SNPs in TCF7L2 used as Causal Loci in the
Simulations
HapMap CEU Data Simulated Data
CV Position
†
Location MAF
Tag
SNP
Pairwise
r
2
between
Tag
SNP and
CV
Avg.
MAF
‡
Average
Pairwise
r
2
between
Tag SNP
and CV
‡
SNP 54 114793297 Block 5 0.051 SNP 42 0.997 0.050 0.997
SNP 26 114748339 Block 5 0.250 SNP 23 0.916 0.250 0.915
SNP 60 114777716 Block 5 0.417 SNP 43 0.932 0.417 0.931
SNP 135 114843626 Block 12 0.167 SNP 136 0.799 0.167 0.798
SNP 143 114847519 Block 12 0.242 SNP 136 0.784 0.242 0.784
SNP 141 114844373 Block 12 0.416 SNP 138 0.781 0.417 0.781
SNP 72 114788976
Outside
Block
0.042 SNP 70 0.699 0.042 0.699
SNP 105 114837133
Outside
Block
0.150 SNP 102 0.748 0.151 0.747
SNP 94 114811797
Outside
Block
0.333 SNP 100 0.667 0.333 0.666
†
Based on HapMap Data Release 23a (Phase II), NCBI Build 36, dbSNP Build 126
‡
Average across 1,000 simulated genotype data sets
40
In the interest of simulating a trait with a sizeable amount of variability, as is
observed with most complex disease traits, the quantitative phenotype was simulated with
coefficient of variation of 40%. For each CV, with minor allele frequency q, the trait was
simulated to approximate a univariately normal distribution, based on the linear model:
y = α + β(CV) + ε (10)
with β determined by:
) ( ) ( ) ( ) (
2
ε β α Var CV Var Var y Var + + =
such that:
) (
) (
) (
]
) (
) (
1 [ ) (
) (
) ( ) (
2
CV Var
R y Var
CV Var
y Var
Var
y Var
CV Var
Var y Var
CV
⋅
=
−
=
−
=
ε
ε
β
where
2
CV
R is the proportion of trait variation accounted for by the CV, and the Var(CV) is
calculated according to the variance of a multinomial distribution, and is thus a function
of CV minor allele frequency (q):
2
3 3 3
2 2 2
2
1 1 1
2 2 4 2
2
2 2 1
2
1 2 1
3 2 3 2 2 1 2 1
2
3 3 3
2
2 2 2
2
1 1 1
1
1
1 1
2 2
) 1 ( ) Pr( , 0 :
) 1 ( 2 ) Pr( , 1 :
) Pr( , 2 :
)) 1 ( 2 ( )) 1 ( 2 ( 4 4 ) 1 ( 2 4
)) (Pr( ) Pr( ) Pr( 4 )) (Pr( 4 ) Pr( ) Pr( 4
)] ( 2 ) 1 ( ) 1 ( ) 1 ( [ 1
3 .., , 1 ; 3 ] 2 ) 1 ( [ : ) (
q x p CV x
q q x p CV x
q x p CV x
where
q q q q q q q q q
x x x x x x
x x p p x x p p x p p x p p x p p
i m x x p p x p p n CV Var
m
i
m
i
m
i j
j i j i i i i
− = = =
− = = =
= = =
− − − − − − + =
− − − + =
+ − − + − + − =
= = − − =
∑ ∑ ∑
=
−
= + =
M
σ
Residual error, ε, is sampled from a normal distribution where:
) ( ) ( ) (
2
CV Var y Var Var β ε − =
41
In equation (10), the value of β was set equal to 0 to assess type 1 error. To assess
power, β was determined as the minimum effect size required for the CV to explain 1%
of the trait variation ) 01 . 0 (
2
=
CV
R . As verified in QUANTO [Gauderman J and Morrison
J, 2006], for a trait with a 40% coefficient of variation, assuming an additive genetic
model where 01 . 0
2
=
CV
R , a direct test of the CV with a two-sided 0.05 significance level
has 88.7% power, for 1,000 subjects. 1,000 null data sets (β =0) and 1,000 CV-
associated data sets (β required to achieve 1% explained trait variation) were simulated
for each CV. In order to account for the fact that investigators may or may not have
genotyped the CV in a given sample of subjects, two scenarios were considered for each
analysis: 1) 54 tag SNPs + CV, and 2) 54 tag SNPs alone. Estimated power and type 1
error were assessed as the proportion data sets for each CV that yielded a significant
association at the 0.05 level assuming a two-sided alternative hypothesis. All phenotype
simulations implemented in R (v. 2.6.1).
RESULTS
SIMULATION STUDIES
Type 1 Error
For each CV, single SNP, joint SNP, PC and OPCC analyses were conducted on
1,000 null data sets. Multi-df tests were evaluated with an LRT for full model (multiple
SNPs, PCs or Clusters) vs a null model (intercept only). Single-df tests were evaluated
with a Wald χ
1
2
statistic and p-values corrected for multiple testing with the Bonferroni
method. Type 1 error estimates for all tests were within range of 5% for all CVs, under
42
both scenarios (Table III). Each estimate is based on the proportion of 1,000 data sets
showing a significant association for any predictors vs disease. Because Gauderman et al
previously demonstrated that PC regression has a robust power under most “full-gene”
scenarios for a range of minor allele frequencies, a 60% explained variance threshold was
used for PC and OPCC approaches (instead of 80%).
43
Table III. Estimated Type 1 Error
†
for Single SNP, Joint SNP, PC and OPCC
methods under two scenarios for nine CVs within TCF7L2
Available Data
Single
SNP
a
Joint
SNP
b
PCA
b
OPCC
b
PCA
a
OPCC
a
Scenario 1: Tag SNPs + CV
Average df
‡
1 55 10.7 15.3 1 1
SNP 54 (rare, Block 5) 0.039 0.052 0.043 0.054 0.043 0.057
SNP 26 (moderate, Block 5) 0.040 0.061 0.047 0.052 0.056 0.050
SNP 60 (common, Block 5) 0.039 0.065 0.047 0.052 0.049 0.049
SNP 135 (low, Block 12) 0.053 0.070 0.063 0.061 0.060 0.054
SNP 143 (moderate, Block 12) 0.044 0.065 0.054 0.051 0.051 0.057
SNP 141 (common, Block 12) 0.055 0.067 0.051 0.064 0.048 0.062
SNP 72 (rare, outside) 0.036 0.064 0.050 0.049 0.054 0.043
SNP 105 (moderate, outside) 0.039 0.055 0.037 0.043 0.045 0.050
SNP 94 (common, outside) 0.047 0.060 0.059 0.051 0.052 0.047
Scenario 2: Tag SNPs only
Average df
‡
1 54 10.9 15.5 1 1
SNP 54 (rare, Block 5) 0.042 0.052 0.044 0.057 0.046 0.057
SNP 26 (moderate, Block 5) 0.040 0.063 0.046 0.049 0.049 0.048
SNP 60 (common, Block 5) 0.040 0.067 0.049 0.056 0.052 0.048
SNP 135 (low, Block 12) 0.052 0.061 0.056 0.059 0.047 0.052
SNP 143 (moderate, Block 12) 0.045 0.069 0.054 0.047 0.056 0.057
SNP 141 (common, Block 12) 0.054 0.065 0.056 0.057 0.050 0.061
SNP 72 (rare, outside) 0.035 0.055 0.043 0.049 0.053 0.043
SNP 105 (moderate, outside) 0.039 0.049 0.041 0.040 0.043 0.047
SNP 94 (common, outside) 0.048 0.057 0.056 0.050 0.045 0.048
†
Values are the proportion of 1,000 null data sets in which the test of association was rejected at the 0.05
level. 1 df tests were evaluated with a Wald χ
1
2
statistic based on a linear regression model for simulated
trait vs predictors (SNPs, PCs or clusters). Multi-df tests were implemented in a likelihood ratio test
framework for multiple predictors vs a null model (intercept only). The PC and OPCC tests were based
on the minimum set of PCs or clusters that explained 60% of the variation in TCF7L2.
‡
The average degrees of freedom across 1,000 data sets.
a
1 df test of main effect, Bonferroni corrected for multiple testing
b
Omnibus, multi-df test
44
Power
For each CV, all previously described analyses were conducted on 1,000 CV-
associated data sets. Multi-df tests were evaluated with an LRT for full model (multiple
SNPs, PCs or Clusters) vs a null model (intercept only). Single-df tests were evaluated
with a Wald χ
1
2
statistic and p-values corrected for multiple testing with the Bonferroni
method. Estimates of power for all tests, under both scenarios are presented in Table IV.
Each estimate is based on the proportion of 1,000 data sets showing a significant
association for any predictors tested vs disease. A 60% explained variance threshold was
used for PC and OPCC approaches.
45
Table IV. Estimated Power
†
for Single SNP, Joint SNP, PC and OPCC methods
under two scenarios for nine CVs within TCF7L2
Available Data
Single
SNP
a
Joint
SNP
b
PCA
b
OPCC
b
PCA
a
OPCC
a
Scenario 1: Tag SNPs + CV
Average df
‡
1 55 10.7 15.3 1 1
SNP 54 (rare, Block 5) 0.469 0.282 0.399 0.370 0.426 0.509
SNP 26 (moderate, Block 5) 0.579 0.314 0.499 0.465 0.524 0.581
SNP 60 (common, Block 5) 0.560 0.304 0.505 0.458 0.551 0.583
SNP 135 (low, Block 12) 0.614 0.348 0.512 0.456 0.411 0.551
SNP 143 (moderate, Block 12) 0.607 0.330 0.554 0.484 0.438 0.608
SNP 141 (common, Block 12) 0.526 0.300 0.449 0.387 0.404 0.482
SNP 72 (rare, outside) 0.494 0.275 0.327 0.413 0.274 0.526
SNP 105 (moderate, outside) 0.475 0.287 0.439 0.402 0.374 0.517
SNP 94 (common, outside) 0.550 0.293 0.479 0.475 0.360 0.638
Scenario 2: Tag SNPs only
Average df
‡
1 54 10.9 15.5 1 1
SNP 54 (rare, Block 5) 0.470 0.282 0.361 0.321 0.323 0.424
SNP 26 (moderate, Block 5) 0.516 0.301 0.412 0.383 0.382 0.422
SNP 60 (common, Block 5) 0.514 0.304 0.469 0.410 0.433 0.513
SNP 135 (low, Block 12) 0.512 0.320 0.401 0.380 0.282 0.383
SNP 143 (moderate, Block 12) 0.500 0.303 0.454 0.374 0.317 0.329
SNP 141 (common, Block 12) 0.410 0.292 0.398 0.346 0.326 0.409
SNP 72 (rare, outside) 0.322 0.227 0.211 0.314 0.179 0.405
SNP 105 (moderate, outside) 0.353 0.278 0.369 0.261 0.306 0.237
SNP 94 (common, outside) 0.344 0.271 0.353 0.353 0.255 0.212
†
Values are the proportion of 1,000 CV-associated data sets in which the test of association was rejected
at the 0.05 level. 1 df tests were evaluated with a Wald χ
1
2
statistic based on a linear regression model
for simulated trait vs predictors (SNPs, PCs or clusters). Multi-df tests were implemented in a likelihood
ratio test framework for multiple predictors vs a null model (intercept only). The PC and OPCC tests
were based on the minimum set of PCs or clusters that explained 60% of the variation in TCF7L2.
‡
The average degrees of freedom across 1,000 analyses.
a
1 df test of main effect, Bonferroni corrected for multiple testing
b
Omnibus, multi-df test
46
Under the scenario that genotype data for all 54 tag SNPs and the CV are
available in a given sample (scenario 1), both the PC and OPCC approaches outperform
the joint SNP method in an omnibus test of association. The OPCC global association
test shows similar power to the global PC test. In most cases, power is slightly lower (2-
6% lower) and is likely due to the additional degrees of freedom required for the OPCC
analysis. However, if individual PCs or clusters are tested for association with the trait,
and a Bonferroni correction applied for multiple testing, OPCC outperforms PC
regression for all tests. For CVs located within a block of LD (SNPs 54, 26, 60, 135, 143
and 141), OPCC showed approx. 3-17% greater power than traditional PC. For CVs
located outside of any block (SNPs 72, 105 and 94), power for the OPCC tests exceeded
that for PC analysis by approx. 25-27%.
When only the 54 tag SNPs are considered in the analysis (scenario 2), both the
global PC and global OPCC approaches indicate higher power to detect association than
the joint SNP test. Again, in most cases, the power for the omnibus OPCC test is approx.
2-10% lower than that for the PC test, which is likely due to the increased degrees of
freedom utilized by the global OPCC test. However, for individual testing of PCs or
clusters on 1 df, OPCC again outperformed the PC test for most CVs. Under this
scenario, the 1 df cluster association tests showed 4-20% greater power than individual
tests of PC association, for most CVs.
Robustness of Power and Method Accuracy
In order to assess whether the trade-off, between degrees of freedom or increased
correction for multiple testing and proportion of variation explained, differed for the 60%
47
and 80% thresholds, all PC and OPCC analyses were repeated using an 80% explained
variance criteria. As shown in Figures 9 - 12, the trends described above for the PC and
OPCC tests were generally similar under the standard 80% proportion of variance
criteria. An average of 10.7 PCs and 15.3 clusters were necessary to explain 60% of the
overall locus variation, while an average of 19.0 PCs and 27.9 clusters were required for
the 80% explained-variance threshold, across the 9,000 data sets used for scenario 1 (54
tags + CV genotyped). Similarly, an average of 10.9 PCs and 15.5 clusters were needed
to explain 60% of the variation, compared with 19.1 PCs and 27.9 clusters to explain
80%, across the 9,000 data sets used for scenario 2 (54 tags only).
For CVs residing in a block of LD, power for the PC global association test either
slightly decreased or remained the same, under the 80% threshold, whether the CV was
included in the model or not (Figure 9 and Figure 10). For the rare CV not residing in
any block, global PC proved to be slightly more powerful, but for the more common
variants outside any block, power was similar for the tags only scenario, and slightly
lower for the tags plus CV scenario. This is likely due to the fact that additional PCs,
beyond those that account for the majority of the locus variance may represent the effects
of rare variants and/or SNPs not in strong LD with many other SNPs. For all CVs under
scenario 1 (54 tags + CV), and most SNPs under scenario 2 (54 tags only), power for the
global cluster association tests decreased, as would be expected given the additional ~13
dfs required. The decrease in power observed for global association models that
contained enough PCs or clusters to explain the higher proportion of variance suggests
that price paid in degrees of freedom is not worth the additional variance explained.
48
0.00
0.10
0.20
0.30
0.40
0.50
0.60
SNP
54
SNP
26
SNP
60
SNP
135
SNP
143
SNP
141
SNP
72
SNP
105
SNP
94
Causal Variant
Power
PC 60% Threshold
PC 80% Threshold
OPCC 60% Threshold
OPCC 80% Threshold
Figure 9. Under scenario 1 of genotype availability (54 tags + CV), estimated power of
PC and OPCC tests of global association for 60% and 80% explained variance thresholds.
0.00
0.10
0.20
0.30
0.40
0.50
0.60
SNP
54
SNP
26
SNP
60
SNP
135
SNP
143
SNP
141
SNP
72
SNP
105
SNP
94
Causal Variant
Power
PC 60% T hreshold
PC 80% T hreshold
OPCC 60% T hreshold
OPCC 80% T hreshold
Figure 10. Under scenario 2 of genotype availability (54 tags only), estimated power of
PC and OPCC tests of global association for 60% and 80% explained variance thresholds.
49
As shown in Figure 11 and Figure 12, tests of individual PCs showed reduced
power when the 80% threshold was applied, for all CVs, under both scenarios of
genotype availability. Under the scenario in which genotype data on tag SNPs and CV
were available, OPCC univariate tests showed very similar power for CVs located in an
LD block, but slightly reduced power for those residing outside any block. When only
tag SNP information was available, OPCC had similar or slightly increased power to
detect association for CVs in a haplotype block, but variable amounts of power for CVs
not located in any block, depending on how well each was tagged by its tag SNP. The
fact that the univariate PC tests also show diminished power suggests that the additional
PCs, which increase the multiple testing correction factor, do not individually explain
enough variation to justify a test of their main effects. In contrast, it seems that
individual testing of clusters from the OPCC method varies by scenario; if the CV is not
present in the genotype data, but the CV is well tagged, OPCC is slightly more powerful
if a 60% explained variance threshold is assumed.
50
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
SNP
54
SNP
26
SNP
60
SNP
135
SNP
143
SNP
141
SNP
72
SNP
105
SNP
94
Causal Variant
Power
PC 60% Threshold
PC 80% Threshold
OPCC 60% Threshold
OPCC 80% Threshold
Figure 11. Under scenario 1 of genotype availability (54 tags + CV), estimated power of
univariate PC and OPCC tests of association for 60% and 80% variance thresholds.
0
0.1
0.2
0.3
0.4
0.5
0.6
SNP
54
SNP
26
SNP
60
SNP
135
SNP
143
SNP
141
SNP
72
SNP
105
SNP
94
Causal Variant
Power
PC 60% Threshold
PC 80% Threshold
OPCC 60% Threshold
OPCC 80% Threshold
Figure 12. Under scenario 2 of genotype availability (54 tags only), estimated power of
univariate PC and OPCC association tests for 60% and 80% variance thresholds.
51
While the estimation of power involves finding a significant association in the
analysis of a particular data set, the association found may or may not be with the variant
of interest, due to the underlying correlation of the SNPs and/or random chance. In order
to assess how well the OPCC method not only identified an association between a given
cluster and disease trait, but also recognized the correct CV or tag SNP of interest, the
proportion of correct identifications out of the number of associations detected was
examined (Table V). The number of association results containing the CV (scenario 1),
or its tag SNP (scenario 2), out of the total number of associations found was tabulated.
The OPCC method shows a high degree of accuracy in identifying the CV in scenario 1,
or its tag SNP in scenario 2.
52
Table V. Proportion
†
of CVs or tag SNPs correctly identified
by OPCC method.
Available Data
Single
SNP
a
OPCC
b
Scenario 1: Tag SNPs + CV
SNP 54 (rare, Block 5) 0.859 0.892
SNP 26 (moderate, Block 5) 0.872 0.910
SNP 60 (common, Block 5) 0.834 0.923
SNP 135 (low, Block 12) 0.886 0.927
SNP 143 (moderate, Block 12) 0.888 0.954
SNP 141 (common, Block 12) 0.817 0.869
SNP 72 (rare, outside block) 0.877 0.951
SNP 105 (moderate, outside block) 0.821 0.932
SNP 94 (common, outside block) 0.882 0.914
Scenario 2: Tag SNPs only
SNP 54 (rare, Block 5) 0.857 0.788
SNP 26 (moderate, Block 5) 0.876 0.815
SNP 60 (common, Block 5) 0.817 0.883
SNP 135 (low, Block 12) 0.838 0.861
SNP 143 (moderate, Block 12) 0.838 0.815
SNP 141 (common, Block 12) 0.734 0.768
SNP 72 (rare, outside block) 0.752 0.884
SNP 105 (moderate, outside block) 0.669 0.717
SNP 94 (common, outside block) 0.706 0.704
†
Proportion of association results from analysis on each of 1,000 CV-associated data
sets that correctly identify the CV (Scenario 1) or its tag SNP (Scenario 2).
a
Single SNP test of association, 1 df, Bonferroni corrected
b
OPCC univariate test of association, 1 df, Bonferroni corrected. Assumes 60%
explained -variance threshold.
53
DATA ANALYSIS
Application to FUSION Study Data
The goal of the FUSION study is to map and identify genetic variants that
increase susceptibility for type 2 diabetes mellitus (T2DM) or account for the variability
in diabetes-related quantitative traits. Hepatocyte Nuclear Factor 4-Alpha (HNF4A) is
one of several candidate genes investigated by FUSION, and has been shown to contain
variants associated with T2DM, as well as insulin and glucose measures from oral
(OGTT) and intravenous (IVGTT) glucose tolerance tests [Silander et al, 2004; Mohlke
et al, 2005; Bonnycastle et al, 2006]. To demonstrate the methods developed in this
chapter, the PC and OPCC approaches were applied to the analysis of variants within the
P1 and P2 promoter regions of Hepatocyte Nuclear Factor 4-α (HNF4A), in population-
based controls from the FUSION study.
Sample data included genotypes for 16 SNPs, with MAFs > 0.01, spanning the P1
and P2 promoter region of HNF4A, as well as OGTT insulin measures for 216 unaffected
FUSION spouse controls. Of the 16 SNPs in the region, 11 were chosen as tag SNPs
(pairwise r
2
= 0.8). Observed genotype frequencies were assessed for deviation from
Hardy-Weinberg Equilibrium and allele frequencies estimated using Haploview (V.4.1).
Quantitative trait data were transformed to approximate univariate normality prior to
analysis, and genotypes were coded as 0, 1 or 2 according to the number of minor alleles.
Pairwise LD among the 16 SNPs is shown in Figure 13. The MAFs for all 16
SNPs were greater than 0.05. Rs2144908, the SNP in this region previously shown to be
associated with diabetes-related traits, was not in strong LD with any other tag SNP
54
(maximum r
2
= 0.49), and had MAF = 0.185. Figure 13 also indicates eigenvector
loadings of the first 3 PCs from the traditional PC analysis, as well as cluster assignment
of the 11 tag SNPs. As expected, the magnitude of the eigenvector coefficients loosely
correlates with SNP-cluster assignment. However, due to the large number of
intermediate loadings, there is no obvious pattern among the coefficients that would
suggest which SNPs load onto which PCs.
Cluster 2 3 2 3 2 2 1 3 1 1 1
PC1 0.140 0.365 0.304 0.165 0.019 -0.328 0.284 0.282 -0.433 -0.292 -0.431
PC2 0.514 0.187 0.348 -0.106 0.578 -0.230 -0.105 -0.227 0.217 0.277 0.178
PC3 -0.225 0.390 -0.187 0.632 0.056 -0.106 -0.355 0.344 0.088 0.246 0.167
Tag x x x x x x x x x x x
Figure 13. Pairwise LD structure among 16 SNPs spanning HNF4A P1 and P2 promoter
regions from FUSION data. White squares indicate r
2
=0, grey squares 0 < r
2
< 1, black
squares r
2
=1. Top of figure denotes SNP-cluster assignment, eigenvector coefficients for
first 3 PCs, and tag SNP indicators.
55
Results of joint SNP, PC and OPCC tests of omnibus association are shown in
Table VI. PC analysis yielded 3 PCs that accounted for at least 60% of the variance. The
3 PCs were marginally associated with 2-hour insulin (p=0.067), and were composed of
eigenvector coefficients ranging from -0.433 to 0.632. Thus, PCA allows us to conclude
that HNF4A shows trend for association with 2-hour insulin, but does not permit us to
distinguish the relative contribution of each of the 11 SNPs to this association.
Table VI. HNF4A Association with OGTT-measured 2-hour Insulin
2-hour Insulin
Method df
% Genetic
Variance Explained
χ
2
Statistic
a
p-value
Joint SNP Analysis 11 100.0% 17.359 0.098
Traditional PC Analysis 3 69.0% 7.174 0.067
OPCC Cluster Analysis 3 61.9% 8.354 0.039
a
χ
2
statistic based on multi-df likelihood ratio test for number of PCs or clusters included in the full model
vs. the null model (intercept only).
The OPCC approach also identified 3 clusters that accounted for at least 60% of
the variance. Cluster size ranged from 3 to 4 SNPs. The 3 SNP-clusters were
significantly associated with 2-hour insulin (p=0.039) (Table VI). SNP assignment and
p-values from Wald χ
2
tests for univariate association between each cluster and 2-hour
insulin are shown in Table VII. The squared correlation coefficients between each SNP
and its assigned cluster (R
O
2
), shown in Table VII, indicate that most SNPs within each
cluster share a high degree of correlation, and relatively low correlation with SNPs in any
56
other cluster (R
N
2
). Additionally, low values for the ratio of one minus each of these
correlations indicates relatively stable SNP-cluster assignment. Cluster 3, which is
significantly associated with 2-hour insulin (p=0.012), is the cluster that contains the
variant known to be associated with insulin measures (rs2144908). If Bonferroni
corrected for 3 clusters, this association between Cluster 3 and 2-hour insulin remains
significant (corrected p=0.036).
Table VII. HNF4A SNP-Cluster Associations with OGTT-measured 2-hour Insulin
Cluster # SNP Position
a
MAF R
O
2 b
R
N
2 c
(1-R
O
2
)/(1-R
N
2
)
Ratio
P-
value
d
Cluster 1 rs6073425 42447340 0.325 0.432 0.013 0.575
0.326
rs2425637 42457463 0.496 0.730 0.105 0.302
rs2425638 42460043 0.186 0.558 0.012 0.446
rs2425640 42461451 0.369 0.796 0.056 0.216
Cluster 2 rs6031552 42412208 0.255 0.772 0.005 0.229 0.276
rs11508795 42416627 0.115 0.506 0.078 0.536
rs6031558 42433057 0.335 0.673 0.079 0.356
rs6073418 42434004 0.287 0.337 0.130 0.762
Cluster 3 rs6031544 42415761 0.296 0.766 0.189 0.289 0.012
rs2144908 42419131 0.175 0.737 0.019 0.268
rs4812831 42451674 0.120 0.501 0.060 0.531
a
Based on HapMap Data Release 23a (Phase II), NCBI Build 36, dbSNP Build 126
b
Squared correlation coefficient between a given SNP and its own cluster
c
The next highest squared correlation coefficient between a given SNP and any other cluster
d
P-value from 1 df Wald χ
2
for association with outcome, adjusted for all other clusters in the model.
DISCUSSION
The results of the simulation studies demonstrate that the OPCC method is a
powerful alternative to joint SNP and traditional PC analyses for identifying associations
57
between candidate gene variants and quantitative disease traits. Over a wide range of
causal variant allele frequencies and tag SNP correlations, whether the scenario assumed
the causal variant had been genotyped and included in the analysis or not, OPCC usually
outperformed PC association analysis when individual tests of PCs or clusters were
performed, despite the use of the overly conservative Bonferroni correction for multiple
testing. Due to the increased number of clusters required to explain the same amount of
variation captured by a given set of PCs, PC was a slightly more powerful test of global
association between whole candidate region and disease outcome.
For both PC and OPCC, using the number of factors that explained 60% of the
total locus variation generally performed as well as or better than applying an 80%
explained variance threshold. This suggests that for the distribution of SNPs in TCF7L2,
the PC and OPCC methods efficiently captured the variation within the locus. For most
CVs, the additional 20% of variation explained did not justify the additional degrees of
freedom (global association test) or increase in multiple testing correction factor
(univariate association test) to justify the inclusion of more PCs or clusters in the model.
In addition to matching or exceeding the power of joint SNP and PC analyses, the
OPCC approach has several advantages. The aim of this work was primarily to find a
procedure which harnesses the power of the PC approach, but is more readily interpreted,
particularly when large numbers of SNPs are being analyzed across one or possibly
multiple candidate gene regions. The OPCC method is easily interpretable for large
numbers of genetic variants distributed across expansive gene regions. In the
simulations, the OPCC algorithm was also able to accurately detect the true causal variant
58
or its tag SNP, in all scenarios tested. When applied to FUSION data, the OPCC method
identified the cluster of SNPs containing the variant known to be associated with OGTT
2-hour insulin, as reported previously by Silander et al.
However, as observed in both the simulations and analyses applied to the
FUSION data, it generally takes more clusters than PCs to define a certain proportion of
total region variation. This difference appears to increase with a higher explained
variance threshold, and with larger numbers of SNPs or greater regional variation. In the
simulated framework, with enough PCs and clusters to account for 60% of the locus
variation, an average of 4.6 more clusters than PCs were required for scenario 1 (54 tags
+ CV), with a similar difference observed for scenario 2. When an 80% explained
variance threshold was used, an average of 8.9 additional clusters compared to PCs were
required. Moreover, for FUSION data analysis of 11 SNPs in the HNF4A P1 and P2
promoter regions, 3 PCs and 3 clusters were necessary to meet the 60% explained
variance criteria, but 3 PCs explained 7.1% more SNP variation than the 3 clusters.
Thus, depending on the underlying LD structure, the global test of association using PCs
may be slightly more powerful than a model using clusters from the OPCC. This may be
due to the fact that while principal components are orthogonal, or independent, cluster
components formed by the clustering algorithm are oblique. At each iteration, PC
1
and
PC
2
are computed from a distinct set of SNPs that have been assigned to a given cluster,
such that the first PC of one cluster may be correlated with the first PC of another cluster.
Thus, although each SNP is assigned to the cluster with which it has the highest squared
correlation, all SNPs share some degree of correlation with the other clusters they were
59
not assigned to. The underlying correlation among clusters may be indicative of the
correlation pattern among SNPs, although not necessarily haplotype blocks, and thus
better reflect the true relationship of the variants within the candidate region.
60
Chapter 3. The Effects of Gene Size, Patterns of Linkage Disequilibrium
and Quantity of SNP Information on Oblique Principal Components-Based
Clustering
INTRODUCTION
In the previous chapter, it was shown that like PCA, the OPCC approach is able to
reduce a large set of correlated single nucleotide polymorphisms (SNPs) to a smaller
subset of factors which can be used to identify causal variant association with disease
outcomes. Simulations based on TCF7L2 demonstrated that the OPCC method has
power to detect these associations under various scenarios that is comparable to that of
PCA, but that OPCC has the additional advantage of producing clusters that represent a
specific, identifiable array of SNPs, which can then be individually investigated for
contributions to disease outcomes once regional association has been established.
The power and efficiency of the PC and OPCC approaches rely to some extent on
the amount of SNP data available and underlying pattern of linkage disequilibrium (LD)
among the SNPs in a given candidate gene. The data simulated in the previous chapter
was generated from the observed distribution of SNPs in Transcription Factor 7-Like 2
(TCF7L2), a type 2 diabetes (T2D) candidate gene, in the HapMap CEU. As such, it is
natural to wonder whether the results presented in the previous chapter are specific to
gene size, unique choice of causal variants, and/or underlying correlations among the
SNPs in TCF7L2. Thus it is important to investigate whether the level of power and
efficiency shown in Chapter 2 can be maintained under different scenarios of gene size,
SNP information and LD patterns for various candidate genes.
61
The purpose of this chapter is to compare and contrast the behavior of the PC and
OPCC methods in two candidate genes with strikingly different characteristics, under
various analytical scenarios, and to examine the extent to which these characteristics may
influence type 1 error and power to detect association with a quantitative trait.
METHODS
SIMULATIONS
Generation of Genotypes
Two candidate genes for T2D and diabetes-related traits were considered for
investigation: Glucokinase (GCK) and TCF7L2. Evidence for TCF7L2 as a T2D
candidate is discussed in Chapter 2. Genetic variation in GCK is known to cause a
particular form of Maturity Onset Diabetes of the Young (MODY2), which is
characterized by perpetual mild fasting hyperglycemia due to reduced glucose sensing in
the β-cell [Vionnet et al, 1992; Byrne et al, 1994]. Additionally, the -30 G/A variant
(rs1799884) in the GCK promoter has been associated with increased fasting glucose
[Weedon et al, 2005; Rose et al, 2005] and reduced β-cell function [Stone et al, 1996].
In a recent meta-analysis, it was shown to have a modest but significant effect on risk for
T2D [Winckler et al, 2007]. Given the evidence for GCK’s association with T2D and
diabetes-related traits, it was determined to be a reasonable candidate from which to
generate genotype data for comparison with TCF7L2.
The characteristics of TCF7L2, as observed in the HapMap CEU, were described
in Chapter 2. In short, TCF7L2 contains 144 SNPs, 133 of which are polymorphic and
have minor allele frequencies (MAFs) > 0.01. In this same HapMap population, GCK
62
spans a 45.1 kb region on chromosome 7, and is comprised of 53 SNPs, 41 of which are
polymorphic, with MAFs > 0.01. Simulation of TCF7L2 genotype data, based on the
observed genotype distribution of SNPs among the 60 HapMap CEU founders, was
described in Chapter 2. SNP genotype data for GCK was conducted in analogous fashion
for 53 SNPs, but given the smaller size of the gene, a conditioning window of 10 SNPs
was used for simulation.
Analysis of GCK in the original HapMap CEU data using Haploview (V. 4.1)
[Barrett et al, 2005] indicates 4 blocks of LD, as defined by Gabriel et al, with one
particularly large block of LD (Block 3, SNP 14 – SNP 38) spanning ~21 kb, shown in
Figure 14. Among the 41 polymorphic GCK SNPs, 18 tags were identified with the
pairwise tagging algorithm in Tagger [de Bakker et al, 2005], using an r
2
threshold equal
to 0.6. The pairwise r
2
distribution in the simulated data was similar to that observed in
HapMap CEU data. The LD pattern from a single simulated data set is shown in Figure
15. Using Gabriel et al definition for haplotype blocks, the simulated data shows 3
blocks, including one large block encompassing the exact same SNPs as those shown in
the HapMap CEU data (SNP 14 – SNP 38). The remaining block definitions and
haplotype frequencies are also very similar, indicating a fixed window size of 10 SNPs is
adequate to capture the underlying LD pattern in GCK as observed in the HapMap CEU.
All genotype simulations implemented in R (v. 2.6.1).
63
Figure 14. Pairwise LD structure among 53 SNPs in GCK from HapMap CEU data
(GCK). White squares indicate r
2
=0, grey squares 0 < r
2
< 1, black squares r
2
=1.
Figure 15. Pairwise LD structure among 53 SNPs in single example simulated data set
(GCK). White squares indicate r
2
=0, grey squares 0 < r
2
< 1, black squares r
2
=1.
64
Figure 16 also shows the LD structure for GCK in a single simulated data set, but
with a D′ color scheme and D′ values. Interestingly, the LD pattern under the r
2
color
scheme (Figure 15) is strikingly different compared to the pattern observed using D′.
While D′ is directly related to the recombination fraction, and is the only measure of LD
that is not sensitive to allele frequencies [Lewontin R, 1988], r
2
depends purely on allele
frequencies and represents underlying correlation among SNPs [Hendrick P, 1987]. Thus
it is possible that these two metrics may show different patterns of LD for the same gene.
Figure 16. Pairwise LD structure among 53 SNPs in single example simulated data set
(GCK). White squares indicate D′ < 1 and LOD < 2, blue squares D′ = 1 and LOD < 2,
pink/red squares D′ < 1 and LOD > 2, bright red squares D′ = 1 and LOD > 2.
65
In contrast to GCK, the r
2
–based LD patterns for simulated data based on TCF7L2
(Figure 17) are relatively consistent with that observed under a D′ color scheme (Figure
18). Thus, GCK’s differing size and unusual patterns of LD offered a type of data that
would be an interesting contrast to TCF7L2 for the purposes of these simulation studies.
Figure 17. Pairwise LD structure among 144 SNPs in single example simulated data set
(TCF7L2). White squares indicate r
2
=0, grey squares 0 < r
2
< 1, black squares r
2
=1.
66
Figure 18. Pairwise LD structure among 144 SNPs in single example simulated data set
(TCF7L2). White squares indicate D′ < 1 and LOD < 2, blue squares D′ = 1 and LOD <
2, pink/red squares D′ < 1 and LOD > 2, bright red squares D′ = 1 and LOD > 2.
Generation of Phenotype
Casual variant (CV) selection and generation of null and TCF7L2-associated
phenotype data sets were described in Chapter 2. CV selection was conducted in
analogous fashion for GCK. Subsequent to the generation of genotype data for 53 SNPs,
across 1,000 data sets, each with 1,000 individuals, one of the SNPs (not a tag SNP) was
chosen to be a CV. Based on the pairwise r
2
between tag SNP and CV, as well as
position within the gene region, three distinct choices of CV were considered (Table
VIII). The three CVs (SNPs 8, 15, 53) exemplified low, moderate and high allele
frequencies, and ranged from modest to strong LD with their designated tag SNPs (r
2
:
0.621 - 0.999). Simulation of phenotype data was also carried out as described in
Chapter 2, for a continuous trait, assumed to be approximately normally distributed, with
67
a coefficient of variation of 40%. 1,000 null data sets (β =0) and 1,000 CV-associated
data sets (β set to achieve 1% explained trait variation) were simulated for each CV.
Table VIII. Descriptive Statistics for three SNPs in GCK used as Causal Variants in
the Simulations
HapMap CEU Data Simulated Data
CV
Position
†
Location
MAF
Tag
SNP
Pairwise
r
2
between
Tag and
CV
Avg.
MAF
‡
Average
Pairwise r
2
between
Tag SNP
and CV
‡
SNP 8 44161007
Outside
Block
0.117 SNP 7 0.623 0.116 0.621
SNP 53 41195593
Outside
Block
0.200 SNP 44 1.000 0.200 0.999
SNP 15 44167409 Block 3 0.450 SNP 20 0.749 0.450 0.748
†
Based on HapMap Data Release 23a (Phase II), NCBI Build 36, dbSNP Build 126
‡
Average across 1,000 simulated genotype data sets
In order to examine performance of methods under varying scenarios of SNP
availability, two scenarios were considered for each analysis of GCK and TCF7L2: 1)
tag SNPs only, and 2) all SNPs. Estimated power and type 1 error were assessed as the
proportion of data sets for each CV that yielded a significant association at the 0.05 level
assuming a two-sided alternative hypothesis. All phenotype simulations were
implemented in R (v. 2.6.1).
RESULTS
SIMULATION STUDIES
Type 1 Error
For each CV, PC and OPCC analyses were conducted on 1,000 null data sets.
Multi-df tests were evaluated with an LRT for full model (multiple SNPs, PCs or
68
Clusters) vs a null model (intercept only). Single-df tests were evaluated with a Wald χ
1
2
statistic and p-values corrected for multiple testing with the Bonferroni method. Type 1
error estimates for all tests on were within range of 5% for all CVs, under both scenarios
(Table IX and Table X). Each estimate is based on the proportion of 1,000 data sets
showing a significant association for any predictors vs disease. As illustrated by
Gauderman et al, and the work of the previous chapter, PC and OPCC regression have
robust power under most scenarios for a range of minor allele frequencies. Therefore, a
60% explained variance threshold was used for PC and OPCC approaches (vs. 80%).
69
Table IX. Estimated Type 1 Error
†
for PCA and OPCC methods under two scenarios
for three CVs in GCK
GCK PCA
b
OPCC
b
PCA
a
OPCC
a
Scenario 1:
18 Tag SNPs Analyzed
Average df
‡
5.0 6.9 1 1
CV MAF
SNP 8 0.116 0.045 0.059 0.056 0.045
SNP 53 0.200 0.041 0.046 0.049 0.042
SNP 15 0.450 0.042 0.044 0.040 0.046
Scenario 2:
41 SNPs Analyzed
Average df
‡
3.0 4.0 1 1
CV MAF
SNP 8 0.116 0.041 0.053 0.058 0.040
SNP 53 0.200 0.045 0.048 0.052 0.039
SNP 15 0.450 0.044 0.043 0.042 0.044
†
Values are the proportion of 1,000 null data sets in which the test of association was rejected at the 0.05
level. 1 df tests were evaluated with a Wald χ
1
2
statistic based on a linear regression model for simulated
trait vs predictors (PCs or clusters). Multi-df tests were implemented in a likelihood ratio test framework
for multiple predictors vs a null model (intercept only). The PC and OPCC tests were based on the
minimum set of PCs or clusters that explained 60% of the variation in GCK.
‡
The average degrees of freedom across 1,000 data sets.
a
1 df test of main effect, Bonferroni corrected for multiple testing
b
Omnibus, multi-df test
70
Table X. Estimated Type 1 Error
†
for PCA and OPCC methods under two scenarios
for CVs in TCF7L2
TCF7L2 PCA
b
OPCC
b
PCA
a
OPCC
a
Scenario 1:
54 Tag SNPs Analyzed
Average df
‡
10.9 15.5 1 1
CV MAF
SNP 54 0.051 0.044 0.057 0.046 0.057
SNP 26 0.250 0.046 0.049 0.049 0.048
SNP 60 0.417 0.049 0.056 0.052 0.048
SNP 135 0.167 0.056 0.059 0.047 0.052
SNP 143 0.242 0.054 0.047 0.056 0.057
SNP 141 0.416 0.056 0.057 0.050 0.061
SNP 72 0.042 0.043 0.049 0.053 0.043
SNP 105 0.150 0.041 0.040 0.043 0.047
SNP 94 0.333 0.056 0.050 0.045 0.048
Scenario 2:
133 SNPs Analyzed
Average df
‡
7.0 10.0 1 1
CV MAF
SNP 54 0.051 0.040 0.046 0.046 0.044
SNP 26 0.250 0.056 0.056 0.059 0.045
SNP 60 0.417 0.047 0.048 0.043 0.044
SNP 135 0.167 0.056 0.057 0.060 0.055
SNP 143 0.242 0.049 0.058 0.053 0.049
SNP 141 0.416 0.057 0.063 0.057 0.062
SNP 72 0.042 0.062 0.053 0.056 0.050
SNP 105 0.150 0.054 0.047 0.041 0.048
SNP 94 0.333 0.050 0.058 0.050 0.059
†
Values are the proportion of 1,000 null data sets in which the test of association was rejected at the 0.05
level. 1 df tests were evaluated with a Wald χ
1
2
statistic based on a linear regression model for simulated
trait vs predictors (PCs or clusters). Multi-df tests were implemented in a likelihood ratio test framework
for multiple predictors vs a null model (intercept only). The PC and OPCC tests were based on the
minimum set of PCs or clusters that explained 60% of the variation in TCF7L2.
‡
The average degrees of freedom across 1,000 data sets.
a
1 df test of main effect, Bonferroni corrected for multiple testing
b
Omnibus, multi-df test
71
Power
For each CV, PC and OPCC analyses were conducted on 1,000 CV-associated
data sets. Multi-df tests were evaluated with an LRT for full model (multiple SNPs, PCs
or Clusters) vs a null model (intercept only). Single-df tests were evaluated with a Wald
χ
1
2
statistic and p-values corrected for multiple testing with the Bonferroni method.
Estimates of power for all tests, under both scenarios of genotype availability are
presented in Table XI and Table XII. Each estimate is based on the proportion of 1,000
data sets showing a significant association for any predictors tested vs disease. A 60%
explained variance threshold was used for PC and OPCC approaches.
Under the scenario that genotype data for only tag SNPs are available in a given
sample (scenario 1), OPCC performs almost as well as the PC method in omnibus tests of
association for both GCK and TCF7L2. For most CVs in TCF7L2, power is slightly
lower (2-6% lower) and is likely due to the additional degrees of freedom required for the
OPCC analysis (Table XII). However, the OPCC method achieves higher power than the
PC approach in global association tests when the CV is a low frequency variant and/or
has modest r
2
with its tag SNP. For example, OPCC has 10% greater power than PCA in
an omnibus test of association when SNP 72 is the CV (MAF = 0.042, r
2
with tag SNP =
0.699). Global association analysis of GCK yields similar results; global association tests
with OPCC show approx. 3-23% lower power than that for PCA (Table XI). However,
OPCC has slightly more power than the PC approach in an omnibus association test when
SNP 8 is the CV (MAF = 0.116, r
2
with tag SNP = 0.621).
72
Table XI. Estimated Power for PCA and OPCC methods under two scenarios
for three CVs in GCK
GCK PCA
b
OPCC
b
PCA
a
OPCC
a
Scenario 1:
18 Tag SNPs Analyzed
Average df 5.0 6.9 1 1
CV MAF
SNP 8 0.116 0.191 0.220 0.166 0.245
SNP 53 0.200 0.688 0.454 0.572 0.437
SNP 15 0.450 0.559 0.521 0.568 0.507
Scenario 2:
41 SNPs Analyzed
Average df 3.0 4.0 1 1
CV MAF
SNP 8 0.116 0.373 0.566 0.319 0.583
SNP 53 0.200 0.651 0.770 0.567 0.773
SNP 15 0.450 0.733 0.696 0.749 0.741
†
Values are the proportion of 1,000 CV-associated data sets in which the test of association was
rejected at the 0.05 level. 1 df tests were evaluated with a Wald χ
1
2
statistic based on a linear regression
model for simulated trait vs predictors (PCs or clusters). Multi-df tests were implemented in a likelihood
ratio test framework for multiple predictors vs a null model (intercept only). The PC and OPCC tests
were based on the minimum set of PCs or clusters that explained 60% of the variation in GCK.
‡
The average degrees of freedom across 1,000 analyses.
a
1 df test of main effect, Bonferroni corrected for multiple testing
b
Omnibus, multi-df test
73
Table XII. Estimated Power
†
for PCA and OPCC methods under two
scenarios for nine CVs in TCF7L2
TCF7L2 PCA
b
OPCC
b
PCA
a
OPCC
a
Scenario 1:
54 Tag SNPs Analyzed
Average df 10.9 15.5 1 1
CV MAF
SNP 54 0.051 0.361 0.321 0.323 0.424
SNP 26 0.250 0.412 0.383 0.382 0.422
SNP 60 0.417 0.469 0.410 0.433 0.513
SNP 135 0.167 0.401 0.380 0.282 0.383
SNP 143 0.242 0.454 0.374 0.317 0.329
SNP 141 0.416 0.398 0.346 0.326 0.409
SNP 72 0.042 0.211 0.314 0.179 0.405
SNP 105 0.150 0.369 0.261 0.306 0.237
SNP 94 0.333 0.353 0.353 0.255 0.212
Scenario 2:
133 SNPs Analyzed
Average df 7.0 10.0 1 1
CV MAF
SNP 54 0.051 0.441 0.452 0.327 0.564
SNP 26 0.250 0.549 0.548 0.483 0.685
SNP 60 0.417 0.622 0.546 0.682 0.695
SNP 135 0.167 0.529 0.510 0.484 0.601
SNP 143 0.242 0.588 0.547 0.529 0.653
SNP 141 0.416 0.466 0.366 0.357 0.421
SNP 72 0.042 0.200 0.232 0.163 0.336
SNP 105 0.150 0.326 0.373 0.265 0.416
SNP 94 0.333 0.375 0.405 0.276 0.364
†
Values are the proportion of 1,000 CV-associated data sets in which the test of association was
rejected at the 0.05 level. 1 df tests were evaluated with a Wald χ
1
2
statistic based on a linear regression
model for simulated trait vs predictors (PCs or clusters). Multi-df tests were implemented in a likelihood
ratio test framework for multiple predictors vs a null model (intercept only). The PC and OPCC tests
were based on the minimum set of PCs or clusters that explained 60% of the variation in TCF7L2.
‡
The average degrees of freedom across 1,000 analyses.
a
1 df test of main effect, Bonferroni corrected for multiple testing
b
Omnibus, multi-df test
74
Moreover, under scenario 1, if PCs or clusters are tested for univariate association
with the trait and a Bonferroni correction applied for multiple testing, OPCC outperforms
PC regression for most CVs in TCF7L2 and GCK. For TCF7L2 CVs that reside in a
block of LD and thus tend to have higher correlation with a given tag SNP (SNPs 54, 26,
60, 135, 143, 141), and for low frequency CVs not included in any block (SNP 72),
OPCC showed approx. 2-22% greater power than PCA. Conversely, for higher
frequency CVs located outside of any block (SNPs 105 and 94), power for the PC tests
was slightly higher, exceeding that for OPCC analysis by approx. 4-7%. Univariate
analysis of GCK showed similar results; for the low frequency CV (SNP 8), OPCC
performed with 8% greater power than PCA.
When all genotyped SNPs are considered in the analysis (scenario 2), the power
of the OPCC method for omnibus association tests was generally greater than or equal to
that for the PC approach, despite the increased degrees of freedom utilized by the global
OPCC test. Interestingly, very high frequency CVs appear to be an exception to this
(TCF7L2 SNP 60 and SNP 141; GCK SNP 15), where PC slightly outperformed OPCC
in global tests of association. However, for individual testing of PCs or clusters on 1 df,
OPCC outperformed the PC test for almost all CVs, in both TCF7L2 and GCK. Under
this scenario, univariate cluster association tests showed 2-27% greater power than
individual tests of PC association.
Robustness of Power
In chapter 2, for tag SNP only and tag SNP + CV scenarios of genotype
availability, the trade-off between degrees of freedom or increased multiple testing
75
correction and proportion of variation explained did not appear to justify using an 80%,
rather than 60%, explained variance threshold as criteria for defining the quantity of PCs
or clusters to be used in a given analysis. To determine whether that result was an
artifact due to the specific correlation structure among TCF7L2 SNPs, all PC and OPCC
analyses were repeated using an 80% explained variance criteria, for both TCF7L2 and
GCK. The average number of PCs and clusters required to meet 60% and 80% variance
thresholds, under both scenarios of genotype availability, are shown in Table XIII.
Table XIII. Average
†
number of PCs and Clusters required for analysis under various
scenarios of genotype availability and variance explained.
Average # of PCs
Average # of
Clusters
GCK
18 Tag SNPs Analyzed (Scenario 1)
60% Variance Explained 5.0 6.9
80% Variance Explained 8.0 10.9
41 SNPs Analyzed (Scenario 2)
60% Variance Explained 3.0 4.0
80% Variance Explained 6.0 8.9
TCF7L2
54 Tag SNPs Analyzed (Scenario 1)
60% Variance Explained 10.9 15.5
80% Variance Explained 19.1 27.9
133 SNPs Analyzed (Scenario 2)
60% Variance Explained 7.0 10.0
80% Variance Explained 14.2 22.2
†
Average number of PCs and Clusters from PC and OPCC analyses on all genotype data sets for each
gene (1,000 data sets for GCK; 1,000 data sets for TCF7L2) for each scenario of genotype availability
76
Overall, analysis of all SNPs (scenario 2) required fewer PCs and clusters than
analysis of tag SNPs only (scenario 1), whether using a 60% or 80% explained variance
cut-off. Analyzing all GCK SNPs required 2 less PCs and 2 to 2.9 less clusters than
analysis of tags only. Similarly, analyses of TCF7L2 under scenario 2 required 3.9 to 4.9
less PCs, and 5.5 to 5.7 less clusters than analysis under scenario 1. For both genes,
under both scenarios, more clusters than PCs were required to meet any given variance
threshold. Additionally, there was a greater difference in the number of Clusters vs PCs
required for the 80% explained variance criteria than the 60% threshold. On average, 2.9
more clusters than PCs were necessary for analysis of GCK using the 80% variance
criteria. Similarly, 8.8 and 8.1 more clusters than PCs were required for analysis of
TCF7L2 using an 80% explained variance threshold for scenario 1 and scenario 2,
respectively. In contrast, only 1.9 and 1.0 additional clusters were required for GCK, and
4.6 and 3.0 additional clusters for TCF7L2, when the 60% explained variance threshold
was applied under scenarios 1 and 2.
As shown in Figures 19 - 22, the trends described above for the PC and OPCC
tests were generally similar under an 80% proportion of variance criteria, for both
TCF7L2 and GCK. Under scenario1 (tags only), power for the PC global association test
either slightly decreased or remained the same, under the 80% threshold, for most CVs
residing in a block of LD (Figure 19). For CVs not residing in any block, global PC
proved to be slightly more powerful, unless that CV happened to be well tagged. Under
scenario 2 (all SNPs), the same pattern emerges, with the exception of power for CVs not
captured by any block; power increased 4-14% for all CVs not residing in any TCF7L2 or
GCK LD block (Figure 20). For most CVs under scenario 1, power for the global cluster
77
association tests decreased, although power actually increased for well-tagged CVs
residing outside any block of LD (Figure 19). The decrease in power observed for global
association models that contained enough PCs or clusters to explain the higher proportion
of variance suggests that, at least for SNPs in high correlation with a tag SNP or other
SNPs in a given block of LD, the price paid in degrees of freedom is not worth the
additional variance explained.
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
SNP
8
SNP
53
SNP
15
SNP
54
SNP
26
SNP
60
SNP
135
SNP
143
SNP
141
SNP
72
SNP
105
SNP
94
GCK TCF7L2
Power
PC 60% Threshold PC 80% Threshold OPCC 60% Threshold OPCC 80% Threshold
Figure 19. Under scenario 1 of genotype availability (54 tags in TCF7L2, 18 tags in
GCK), estimated power of PC and OPCC tests of global association for 60% and 80%
explained variance thresholds.
78
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
SNP
8
SNP
53
SNP
15
SNP
54
SNP
26
SNP
60
SNP
135
SNP
143
SNP
141
SNP
72
SNP
105
SNP
94
GCK TCF7L2
Power
PC 60% Threshold PC 80% Threshold OPCC 60% Threshold OPCC 80% Threshold
Figure 20. Under scenario 2 of genotype availability (133 SNPs in TCF7L2, 41 SNPs in
GCK), estimated power of PC and OPCC tests of global association for 60% and 80%
explained variance thresholds.
As shown in Figures 21 and 22, tests of individual PCs showed reduced power
when the 80% threshold was applied, for almost all CVs, under both scenarios of
genotype availability. Again, for low frequency SNPs that are not in strong LD with a
tag SNP, using an 80% variance criteria provided equal or slightly higher power. When
only tag SNP information was available (scenario 1), OPCC showed similar or slightly
increased power to detect association for CVs in a haplotype block, but variable amounts
of power for CVs not located in any block, depending on the level of correlation between
the CV and its tag SNP. Under scenario 2, OPCC univariate tests under an 80%
threshold showed similar or decreased power for CVs located in an LD block, compared
to 60%, but substantially improved power for those not captured by any block.
79
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
SNP
8
SNP
53
SNP
15
SNP
54
SNP
26
SNP
60
SNP
135
SNP
143
SNP
141
SNP
72
SNP
105
SNP
94
GCK TCF7L2
Power
PC 60% Threshold PC 80% Threshold OPCC 60% Threshold OPCC 80% Threshold
Figure 21. Under scenario 1 of genotype availability (54 tags in TCF7L2, 18 tags in
GCK), estimated power of PC and OPCC tests of univariate association for 60% and 80%
explained variance thresholds.
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
SNP
8
SNP
53
SNP
15
SNP
54
SNP
26
SNP
60
SNP
135
SNP
143
SNP
141
SNP
72
SNP
105
SNP
94
GCK TCF7L2
Power
PC 60% Threshold PC 80% Threshold OPCC 60% Threshold OPCC 80% Threshold
Figure 22. Under scenario 2 of genotype availability (133 SNPs in TCF7L2, 41 SNPs in
GCK), estimated power of PC and OPCC tests of univariate association for 60% and 80%
explained variance thresholds.
80
Therefore, analysis of GCK and TCF7L2 suggest that the determination of
whether to use a 60% or 80% threshold depends in part, on the amount of SNP data
available, and the level of correlation between the CV and its tag or other surrounding
SNPs. Whether full gene SNPs or only tag SNP data is available for analysis, inclusion
of additional PCs in PC models, which increase # of degrees of freedom or the multiple
testing correction factor, likely do not individually explain enough variation to justify a
test of their main effects. Thus, a 60% explained variance threshold for PC regression
appears to provide an acceptable balance between # of degrees of freedom or size or
multiple testing correction factor, and power to detect significant main effects. In
contrast, results of the OPCC approach suggest that if only tag SNP data is available,
using an 80% threshold may have the greater advantage; power to detect significant main
effects under this scenario was very similar to or higher when using an 80% vs 60%
variance cut-off. However, if all SNP data is available, power under the 80% threshold
was similar to or lower than that using a 60% threshold, with the exception of CVs
located outside a haplotype block and/or not in high correlation with surrounding SNPs.
This may be due to the fact that applying an 80% explained variance threshold to the
OPCC algorithm delineates existing clusters into smaller subsets, reducing the noise that
accompanies analyzing large clusters. Smaller clusters are likelier to have better cluster
fit among the SNPs in a given cluster, particularly among a set of tag SNPs, where
residual correlation among variants has been intentionally reduced.
Chapter 2 presented scenarios of genotype availability (tag SNPs and tag SNP +
CV) that can be compared with the present study’s inclusion of all, or “full gene” SNPs.
Figure 23 shows power of PC univariate tests of association, under all 3 scenarios, using
81
60% and 80% explained variance criteria. Although no single scenario appears to be
uniformly most powerful for all CVs considered, the “tags + CV” scenario and “all
SNPs” scenario are consistently more powerful than “tags only”. However, power for
individual cluster association is consistently highest for SNPs located in a haplotype
block, under the “all SNPs” scenario (Figure 24). For SNPs located outside any block,
and thus, not well captured by the surrounding set of variants, the “tags + CV” scenario
was consistently most powerful, with power from the “all SNPs” scenario only slightly
lower. This suggests that for CVs with low to modest correlation with their neighboring
markers, the inclusion of all SNPs in the analysis would not necessarily improve cluster
definition or power to detect significant cluster associations. These patterns were
generally similar for both 60% and 80% explained variance criteria.
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
SNP 54 SNP 26 SNP 60 SNP 135 SNP 143 SNP 141 SNP 72 SNP 105 SNP 94
TCF7L2
Power
Tags only (60%) Tags + CV (60%) All SNPs (60%) Tags only (80%)
Tags + CV (80%) All SNPs (80%)
Figure 23. Estimated power of PC univariate tests of association under 3 scenarios of
genotype availability (Tag SNPs only, Tag SNPs + CV, All SNPs) using 60% and 80%
explained variance thresholds.
82
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
SNP 54 SNP 26 SNP 60 SNP 135 SNP 143 SNP 141 SNP 72 SNP 105 SNP 94
TCF7L2
Power
Tags only (60%) Tags + CV (60%) All SNPs (60%) Tags only (80%)
Tags + CV (80%) All SNPs (80%)
Figure 24. Estimated power of OPCC univariate tests of association under 3 scenarios of
genotype availability (Tag SNPs only, Tag SNPs + CV, All SNPs) using 60% and 80%
explained variance thresholds.
DISCUSSION
The results of these simulation studies confirm that the OPCC method is a
powerful alternative to traditional PC analyses for identifying associations between
candidate gene variants and quantitative outcomes of interest. Patterns observed in the
results of the global and univariate association tests were relatively consistent between
TCF7L2 and GCK, despite differences in gene characteristics, which implies that the
power of the OPCC approach observed in Chapter 2 is not specific to the attributes of
TCF7L2. Over a spectrum of gene size, casual variant allele frequencies, underlying LD
patterns and SNP genotype availability, OPCC usually outperformed PC association
analysis for individual tests of PCs or clusters, despite the use of the overly conservative
Bonferroni multiple testing correction.
83
In a scenario where the only available data consists of tag SNPs, PCA is the most
powerful test of global association, whether using a 60% or 80% explained variance
criteria. Power for the OPCC approach was slightly lower than PCA for omnibus
association, except when the CV was of low frequency and/or not in strong LD with any
other SNP in the data set (GCK SNP 8, and TCF7L2 SNP 72). This may be due to the
fact that under this scenario, PCs represent weighted contributions of all tag SNPs to the
overall locus variance, such that the effect of a low frequency SNP in modest correlation
with its tag may be attenuated across PCs. OPCC, in contrast, creates a unit of analysis
that specifically captures the low frequency CV; a low frequency CV in modest
correlation with its tag SNP generally tends to form a cluster with only itself and its tag.
For tests of univariate association under scenario 1 (tag SNPs only), applying a
60% explained variance threshold yield mixed results for PC compared to OPCC. For
most CVs, OPCC outperforms PC. However, for medium and high frequency CVs not
residing in any particular block of LD (GCK SNP 53, TCF7L2 SNP 105 and SNP 94), PC
actually has slightly higher power than OPCC to detect univariate association. If an 80%
threshold is applied, OPCC uniformly has greater power than PCA. For medium to high
frequency CVs, which are likely captured by the first few PCs explaining the largest
proportion of total locus variation, utilizing fewer PCs in the analysis gives PCA an
advantage in the trade-off between magnitude of multiple testing correction factor and
power to detect significant main effects. Conversely, using fewer clusters simultaneously
results in larger clusters that contain more SNPs, and hence, more noise. When an 80%
variance threshold is used, the additional PCs detract from the power of the PC method.
However, additional clusters enhance power of the OPCC approach because with
84
increasing numbers of clusters, cluster size decreases, such that each cluster is comprised
of SNPs with better fit to their assigned cluster.
In scenarios that include “full gene” or all SNPs, PCA has slightly more power
than OPCC to detect effects of CVs that reside in blocks of LD, in a global association
analysis where a 60% explained variance threshold is applied. However, OPCC has
greater power to detect signals from CVs outside any block of LD. Using an 80%
variance cut-off, PCA is almost exclusively more powerful. In tests of univariate
association, OPCC outperforms PCA whether a 60% or 80% explained variance
threshold is used. While PCs are convoluted by the inclusion of many redundant SNPs,
OPCC takes advantage of the information in the underlying correlation structure among
all SNPs, such that each cluster represents a distinct, well-defined subset of highly
correlated SNPs to be used as the unit of analysis.
Interestingly, use of all SNPs in an analysis, whether global or univariate, PC or
OPCC, with a 60% or 80% explained variance threshold, consistently provided higher
power to detect associations compared to analysis of tags only. This is due at least in
part to the fact that, for both GCK and TCF7L2, substantially less PCs and clusters were
required for PC and OPCC respectively, whether using a 60% or 80% proportion of
variance cut-off. For TCF7L2, if an “all SNPs” scenario is compared with the “tags +
CV” scenario presented in Chapter 2, “all SNPs” is not necessarily more powerful for
either the PC or OPCC approaches. The power of “all SNPs” vs “tags + CV” for both
methods depends largely on CV frequency and amount of correlation between the CV
and remaining SNPs; analysis of CVs in TCF7L2 located outside any LD block generally
show higher power under a “tags + CV” scenario. The patterns observed for the 60%
85
proportion of variance threshold are similar to those observed for the 80% cut-off,
indicating that increasing the number of PCs or clusters analyzed does not necessarily
improve power for the “all SNPs” scenario when compared to the data scenario of “tags +
CV”. However, in practicality, the investigator usually does not know a priori, which
SNP is the causal variant, so successful analysis under the assumption of a “tags + CV”
scenario is a function of the likelihood that a tag SNP selected for genotyping and
analysis is in fact the causal SNP.
86
Chapter 4. Principal Components-Based Clustering in the Presence of
Epistasis
INTRODUCTION
Gene-gene interactions, often referred to as “epistasis”, have been increasingly
examined for complex disease and are often cited as a potential reason for lack of
replication among candidate gene studies. However, the study of these interactions, in an
attempt to unravel the tangled web of multiple genetic and biochemical pathways, is not a
straightforward task. Both human and animal studies of complex diseases have identified
susceptibility genes that individually have a relatively small contribution to a common
disease trait or not at all, but interact significantly in combined analyses [Williams et al,
2000; Hsueh et al, 2001; Kim et al, 2001; Barlassina et al, 2002; Dong et al, 2005;
Valdar et al, 2006; Black et al, 2008]. Further complicating the issue, several
investigators have found variants that have opposing effects, depending on the genetic
and environmental background [Doney et al, 2004; Staessen et al, 2001; Bjursell et al,
2007], thereby increasing the likelihood of neglecting to account for epistatic
susceptibility genes in single candidate analyses [Culverhouse 2002].
While many definitions of the term “epistasis” exist in biological literature, the
mathematical or quantitative genetic concept of epistasis involves a statistical interaction
between two or more genetic variants, characterized by non-additivity in a mathematical
model [Cordell, 2002]. In other words, an analysis of epistasis necessarily examines the
interdependent effects of multiple disease variants on disease risk or disease-related trait
variation, according to some probability distribution. A variety of methods currently
87
exist to detect these interactions, from standard multivariate regression models to non-
parametric methods that involve data reduction and pattern recognition.
For example, multiple logistic or linear regression approaches, which involve
assessing the statistical significance of a multiplicative interaction term in a likelihood
ratio framework, test whether the departure from additivity represented by the
multiplicative term significantly increases the likelihood of the prediction model.
However, if testing all pairwise interactions between all genotyped SNPs, one must
consider that the number of tests performed scales exponentially with the number of loci.
For any one particular underlying genetic model for K loci, K!/(2!(K – 2)!) pairwise
comparisons may be made, with the significance level adjusted to account for each test.
The multiple testing problem is further compounded by the number of genetic models
assumed (ie. additive x additive, additive x dominant, additive x recessive, etc.). If a
multiple testing correction is applied for this number of tests, only extremely significant
p-values and/or relatively large interaction effect sizes would be able to survive such a
correction. Thus, multiplicative models used in an LRT framework are usually
underpowered to detect effect sizes relevant for complex disease association.
Other methods which are able to test for non-additivity among locus interactions
include Classification and Regression Trees (CART) [Brieman et al, 1984], a partitioning
method combined with regression techniques, Logic Regression [Koopberg et al, 2001],
an adaptive regression procedure, the Combinatorial Partitioning Method (CPM) and
variations thereof, [Nelson et al, 2001], and Multifactor-Dimensionality Reduction
(MDR) [Ritchie, et al 2001]. All of these methods are non-parametric, and as such,
seemingly avoid the issues surrounding traditional regression-based methods; they do not
88
impose assumptions of linearity on predictor variables, nor do they require a priori
specification of a particular genetic model. However, such methods typically suffer from
low power under various scenarios, are only applicable under a narrow set of
circumstances, produce results which are difficult to interpret, and/or require extensive
computational resources for large numbers of loci.
For example, CART is specific to categorical outcomes and highly sensitive to
outcome misclassification. CART is also susceptible to random noise which can cause
over-fitting of the model and reduced power, and is based on a computationally intensive
algorithm. Assessing statistical significance of tree-based methods and accurate
interpretation of results in real data sets are also problematic [Briollais et al, 2007].
Logic Regression can be implemented with either continuous or dichotomous outcomes,
and attempts to reduce dimensionality, but is sensitive to missing or incomplete genotype
data, is susceptible to over-fitting of the model, and is driven mainly by SNPs with large
main effects [Musani et al, 2007]. CPM could have been a promising compromise
between the two, as it can be used with quantitative traits, reduces dimensionality, detects
interactive effects in the absence of marginal main effects, and performs well even with
missing or incomplete data [Musani et al, 2007]. However, CPM is very computationally
intensive, especially for large numbers of loci, and no software currently exists to
implement it procedures. CPM has also not been adapted to handle dichotomous
outcomes. A variation of CPM, known as the restricted partitioning method (RPM), is
less computationally burdensome than CPM, but is susceptible to the same flaws which
CPM was designed to overcome; RPM is sensitive to missing or sparse data and requires
a multiple testing correction be applied [Musani et al, 2007]. These additional caveats of
89
the RPM approach greatly decrease its computational efficiency for large numbers of
loci. Inspired by CPM and RPM, MDR has been extended to both categorical and
quantitative outcomes, has demonstrated the ability to detect interactive effects when
weak or no marginal main effects exist, and has recently been adapted to handle
unbalanced designs [Lou et al, 2007]. However, it has also been shown that MDR has
low power to detect multiplicative interaction between alleles of high frequency, and
performs poorly in data sets with high degree of heterogeneity or small to modest sample
sizes [Vermuelen et al, 2007]. As with CART, determination of statistical significance
and accurate interpretation of results are also problematic for MDR [Briollais et al,
2007]. Also, like CPM, MDR requires an exhaustive search over all possible
interactions, and is thus computationally unfeasible for very large numbers of loci.
In contrast to MDR, Millstein et al developed a “Focused Interaction Testing
Framework” (FITF) [Millstein et al, 2006], which harnesses the efficiency and simplicity
of standard regression approaches in a likelihood testing framework. The basic ITF
involves likelihood ratio tests (LRTs) that are performed in stages, such that joint tests of
main effects and interactions are performed conditional on significant main effects from
the previous step. In their approach, Millstein et al further focused ITF testing by
reducing the number of tests to be performed in a “pre-screening” stage based on a χ
2
goodness of fit statistic derived from allele frequency differences in a pooled case-control
sample. The procedure involves adjustment for multiple testing, at all stages, by
controlling false-discovery rates (FDR). Both the ITF and FITF methods outperformed
standard regression-based 1 df interaction tests, as well as MDR, for all genetic models
tested (additive, dominant and recessive). Although the FITF method is specific to case-
90
control data, the ITF on which it is based utilizes regression modeling in a flexible
manner, and as such, is easily adaptable to models with quantitative outcomes.
In this chapter, the oblique principal components-based clustering method
(OPCC) developed in the previous chapter is extended to the problem of epistasis, or
gene-gene interactions, and is tested under various scenarios of causal variant (CV)
interaction and availability of genotype data. This approach is compared to multiple
linear regression and traditional principal components regression (PCA), for all pairwise
interactions, and corrected for multiple testing with Bonferroni and FDR. A two-stage
ITF is also applied to all SNP-, PC- and Cluster-based tests, in an effort to improve
testing efficiency and maximize power. All methods utilize a likelihood ratio testing
framework for detection of a significant multiplicative interaction among loci.
METHODS
ANALYTICAL APPROACHES
Multiple Linear Regression
Let y be a quantitative trait, assumed to be approximately normally distributed,
and G
i
be genotypes coded with values of 0, 1 or 2 for the respective number of minor
alleles at locus i, where i = 1,…, K. A multiple linear regression model including a term
representing the multiplicative interaction between 2 loci, takes the form:
y = β
0
+ β
i
G
i
+ β
k
G
k
+ β
ik
G
i
G
k
(1)
where the parameter β
ik
for the multiplicative interaction term can be tested with an LRT
that compares equation (1) to a model without the interaction term, and is evaluated for
significance with a χ
2
statistic on 1 df. All p-values resulting from the pairwise SNP
91
associations tested would necessarily be adjusted for multiple testing with Bonferroni,
FDR, etc.
ITF
Given continuous phenotype y, assumed to be approximately normally
distributed, and G
i
and G
k
, representing genotypes coded as 0, 1 or 2 for the number of
minor alleles in SNP
i
and SNP
k
. Perform univariate tests of association on the model
y = β
0
+ β
i
G
i
(2)
and evaluate β
i
for significance with a 1 df LRT for (2) vs. the null model y = β
0
. Declare
a test significant if the LRT p-value is significant after controlling for FDR [Benjamini
and Hochberg, 1995]. Then, for all possible two-variant interactions, the full model in
equation (1) is tested against the reduced model
y = β
0
+ β
i
G
i
I() + β
k
G
k
I() (3)
where I() is an indicator variable coded as 1 if the corresponding term was statistically
significant in the test of main effects, and 0 otherwise. If neither β
i
nor β
k
are statistically
significant in the first stage, then a 3 df LRT of β
i
, β
k
, and β
i
β
k
is performed vs. the null
model y = β
0
. If β
i
, but not β
k
, is statistically significant in stage 1, then a 2 df LRT is
conducted for the full model vs. y = β
0
+ β
i
G
i
. If both β
i
and β
k
are declared statistically
significant in the first stage, demonstrating that both variants have a significant main
effect on variation in trait y, then a 1 df LRT is warranted for the full model vs the joint
model y = β
0
+ β
i
G
i
+ β
k
G
k
. This selective model conditioning provides an appropriate
framework for testing for interactions with no significant main effects, while avoiding
redundant testing of effects that have already been declared significant [Millstein et al,
92
2006]. Significance is declared if the LRT p-value is significant after correcting all
second-stage tests with FDR.
PCA
As previously described by Gauderman et al, and thoroughly examined in Chapter
2, PCA can be used to asses significant association between a candidate gene or region
and disease-related outcome, by calculating and retaining the subset S of PCs that capture
a substantial amount of locus variation, and testing those PCs for joint association with a
continuous outcome in the following framework:
0
=1
=
S
s s
s
y PC β + β
∑ (4)
As shown in Chapter 2, if PCs represent the weighted effects of particular SNPs in a
candidate region, individual PCs may also be tested for association. For example, y = β
0
+ β
1
PC
1
can be modeled, and the parameter estimate β
1
tested for significance with a
Wald χ
2
statistic on 1 df. Because the number of PCs required to capture a large
proportion of locus variation will be considerably smaller than the number of SNPs
tested, the number of univariate tests performed on PCs will be concomitantly less than
those performed on SNPs. This notion could also be extended to pairwise multiplicative
interactions among the subset S of PCs for each candidate gene. Let G1 and G2 represent
candidate gene 1 and 2 respectively, and let S be the subset of PCs that account for a
predetermined proportion of the overall variation in each gene. Suppose PC
i
is the i
th
PC
score from G1, where i = 1,..., S
G1
, and PC
j
the j
th
PC score from G2, where j = 1,..., S
G2
:
y = β
0
+ β
i
PC
i
+ β
j
PC
j
+ β
ij
PC
i
PC
j
(5)
93
such that a LRT would be used to determine the significance of β
ij
by comparing the
value of the log likelihood for model (5) to that of a model without the multiplicative
term, which follows a χ
2
distribution on 1 df. P-values resulting from testing pairwise PC
associations would be adjusted for multiple testing with Bonferroni, FDR, etc.
OPCC
As with PCs, the disjoint clusters resulting from OPCC on individual candidate
genes can also be tested for association with disease or disease-related trait. Using the S
cluster formations accounting for a large proportion of the overall variance for G1 and
G2, pairwise multiplicative interactions between clusters may be formulated and tested
according to the following framework: let C
i
be the i
th
cluster score from gene 1, where i
= 1, …, S
G1
, and C
j
be the j
th
cluster score from gene 2, where j = 1, …, S
G2
:
y = β
0
+ β
i
C
i
+ β
j
C
j
+ β
ij
C
i
C
j
(6)
An LRT χ
1
2
would be used to determine the significance of β
ij
as previously described.
All p-values resulting from the pairwise PC associations tested would necessarily be
adjusted for multiple testing. All statistical methods and tests of association are
implemented in SAS (v 9.1).
SIMULATIONS
Generation of Genotypes
In order to simulate a plausible gene interaction framework, two type 2 diabetes
(T2D) candidates, Transcription Factor 7-Like 2 (TCF7L2) and Glucokinase (GCK),
described in Chapters 2 and 3, were used as the basis on which to generate genotype data.
As described in Chapter 2, Transcription Factor 7-Like 2 (TCF7L2) is the strongest
94
candidate for type 2 diabetes identified to date. GCK encodes an enzyme that regulates
insulin secretion by changing β-cell glucose phosphorylation rates in response to
physiological glucose concentrations [Matchinsky et al, 1990]. It is a structurally and
functionally unique member of the hexokinase family, and is expressed solely in
pancreatic islet β-cells and liver. GCK was preliminarily recognized for its apparent role
in maturity onset diabetes of the young (MODY2), first by linkage analysis [Froguel et
al, 1992] followed by functional studies that showed significant reduction in β-cell
secretory response to continuous glucose infusion in GCK-linked MODY families [Velho
et al, 1992]. Its specific biological function and identification as a β-cell glucose sensor
provided the basis for its subsequent candidacy as a T2D-susceptibility gene. In addition
to exonic variants that confer susceptibility to MODY2, promoter variants in GCK have
been found to be associated with T2D [Hattersley et al, 1992; Chiu et al, 1996; Vionnet
et al, 1992; Bonnycastle et al, 2006; Cauchi et al, 2008; Vaxillaire et al, 2008] and T2D-
related traits, such as fasting and OGTT glucose measures, insulin secretion, hepatic
insulin resistance and triglyceride levels [Chiu et al, 2000; Weedon et al, 2005; Sparso et
al, 2008; Tinto et al, 2008; Prokopenko et al, 2009; Tam et al, 2009; Webster et al,
2009].
In terms of gene structure, GCK has a dual transcription unit that is differentially
regulated; the downstream promoter is active in hepatocytes, while the upstream
promoter is active in pancreatic β-cells. The transcription of islet GCK is mediated by
glucose [Magnuson et al, 1989 (a)], whereas insulin is the key regulator of GCK
transcription in the liver [Magnuson et al, 1989 (b)]. Interestingly, several ubiquitously
expressed transcription factors have been shown to regulate GCK transcription via
95
binding and transactivation of the GCK liver-specific promoter [Fortez et al, 1999; Kim
et al, 2004; Yamamoto et al, 2004; Egea et al, 2008]. While numerous studies have
reported additive effects of TCF7L2 and GCK on glucose and insulin measures, a
putative binding site for TCF7L2 in either promoters of GCK has yet to be elucidated.
Nevertheless, given GCK’s fundamental role in glucose sensing and mediation of
essential metabolic processes, it is plausible to hypothesize interaction with TCF7L2.
Genotype simulation of GCK and TCF7L2, based on HapMap CEU founders, is
described in Chapters 2 and 3. Description of gene size, number of markers, LD patterns,
haplotype block structure and other gene-specific characteristics, for both original CEU
and simulated data can also be found in these chapters.
Generation of Phenotype
Selection of causal variants (CVs) and quantitative trait simulation for TCF7L2
and GCK is described in Chapters 2 and 3, respectively. Tables of CV allele frequencies,
gene location, tag SNP characteristics, etc. can also be found in Chapters 2 and 3. The
present study focused on the genotype availability scenario of “tag SNP + CV”. A “tag
SNPs only” scenario was not considered in this chapter, as the power the interaction test
for such a scenario is purely a function of the r
2
between each CV and tag SNP in the
pairwise comparison. For a specific analysis involving the ITF, a “full gene” or “all
SNPs” scenario is briefly considered for comparison with the “tag SNP + CV” situation.
As described in Chapters 2 and 3, the quantitative phenotype was simulated as
having a coefficient of variation of 40%. For CV associations with the simulated trait,
the following scenarios for a multiplicative interaction were considered: 1) interaction
96
between 2 CVs that each have a significant main effects, 2) interaction between 2 CVs, in
which 1 CV has a significant main effect and the other does not, and 3) interaction
between 2 CVs, in which neither CV has a significant main effect. Each of these
potential interaction scenarios will be simulated for all possible combinations of CV
allele frequencies in TCF7L2 and GCK.
For each CV, with minor allele frequency q, the continuous trait was simulated to
approximate a univariately normal distribution, based on the linear model:
y = α + β
1
(CV
1
) + β
2
(CV
2
) + β
12
(CV
1
·CV
2
) + ε (7)
where CV
1
and CV
2
reside in two distinct genes. β
estimates for marginal CV effects (β
1
and β
2
) are chosen as described in chapter 2, in a manner that both satisfies the above
interaction scenario criteria and is a function of the proportion of trait variance accounted
for by the CV. β
12
is thus determined by:
) ( ) ( ) ( ) ( ) ( ) (
2 1
2
12 2
2
2 1
2
1
ε β β β α Var CV CV Var CV Var CV Var Var y Var + ⋅ + + + =
such that:
) (
) (
2 1
2
12
2 1
CV CV Var
R y Var
CV CV
⋅
⋅
=
⋅
β
where
2
2 1
CV CV
R
⋅
is the proportion of trait variation accounted for by the interaction between
CV
1
and CV
2
, and Var(CV
1
· CV
2
) is calculated as the variance of a product of any two
independent and identically distributed random variables [Goodman, 1960]:
97
) ( ) ( ) ( ) ( ) ( ) ( ) (
2 1 1
2
2 2
2
1 2 1
CV Var CV Var CV Var CV E CV Var CV E CV CV Var ⋅ + ⋅ + ⋅ = ⋅
where E(CV
1
) and E(CV
2
) are calculated as a function of the individual minor allele
frequencies, given by the mean of the multinomial distribution:
) 1 ( 2 2
) Pr( ) Pr( 2
) )( Pr( ) )( Pr( ) )( Pr(
] [ 1
: ) (
2
2 1
3 3 2 2 1 1
3 3 2 2 1 1
1
q q q
x x
x x x x x x
x p x p x p
x p n CV E
m
i
i i
− + =
+ =
+ + =
+ + =
=
∑
=
μ
Var(CV
1
) and Var(CV
2
) are calculated according to the variance of the multinomial
distribution (as described in chapter 2):
2
3 3 3
2 2 2
2
1 1 1
2 2 4 2
2
2 2 1
2
1 2 1
3 2 3 2 2 1 2 1
2
3 3 3
2
2 2 2
2
1 1 1
1
1
1 1
2 2
) 1 ( ) Pr( , 0 :
) 1 ( 2 ) Pr( , 1 :
) Pr( , 2 :
)) 1 ( 2 ( )) 1 ( 2 ( 4 4 ) 1 ( 2 4
)) (Pr( ) Pr( ) Pr( 4 )) (Pr( 4 ) Pr( ) Pr( 4
)] ( 2 ) 1 ( ) 1 ( ) 1 ( [ 1
3 .., , 1 ; 3 ] 2 ) 1 ( [ : ) (
q x p CV x
q q x p CV x
q x p CV x
where
q q q q q q q q q
x x x x x x
x x p p x x p p x p p x p p x p p
i m x x p p x p p n CV Var
m
i
m
i
m
i j
j i j i i i i
− = = =
− = = =
= = =
− − − − − − + =
− − − + =
+ − − + − + − =
= = − − =
∑ ∑ ∑
=
−
= + =
M
σ
Residual error, ε, is sampled from a normal distribution where:
) ( ) ( ) ( ) ( ) (
2 1
2
12 2
2
2 1
2
1
CV CV Var CV Var CV Var y Var Var ⋅ − − − = β β β ε
For all scenarios, the value of β
12
in equation (7) was set equal to 0 to assess type 1 error.
To assess power for scenario 1), β
1
and β
2
were each determined as the minimum effect
98
size required for each CV to explain 0.5% of the trait variation and β
12
was set to explain
1% of the trait variation. To examine power for scenario 2), β
1
was set equal to 0, and β
2
and β
12
to explain 0.5% and 1% of the trait variation respectively. For scenario 3) both
β
1
and β
2
were set to 0, and β
12
to account for 1% of the overall trait variation. As
verified in QUANTO [Gauderman J and Morrison J, 2006], for a trait with a 40%
coefficient of variation, where each CV is assumed to follow an additive genetic model
and have the effects describe above, a direct test of the multiplicative interaction between
the two CVs has 89.0%, 88.8% and 88.7% power for scenarios 1), 2) and 3) respectively,
at a two-sided 0.05 significance level, for 1,000 subjects. 1,000 null data sets (β
12
=0)
and 1,000 CV-associated data sets were simulated for each CV. Estimated power and
type 1 error were assessed as the proportion data sets for each interaction that yielded a
significant association at the 0.05 level assuming a two-sided alternative hypothesis. All
phenotype simulations implemented in R (v. 2.6.1).
RESULTS
SIMULATION STUDIES
Type 1 Error
For each CV combination, pairwise SNP, PC and OPCC analyses were conducted
on 1,000 null data sets. 1-df tests were evaluated with an LRT for full model (main
effects plus interaction term) vs a reduced model (main effects only); p-values corrected
for multiple testing with the Bonferroni and FDR are presented (Table XIV). Type 1
error estimates for most tests were within range of 5% for all CVs, although pairwise
SNP tests that were corrected with Bonferroni (Table XIV), as well as ITF interaction
99
tests, which controlled for FDR (Table XV), showed type 1 error rates in 3-4% range.
Each estimate is based on the proportion of 1,000 data sets showing a significant
association for each test, under each of the 3 interaction scenarios, as described. Given
the relatively consistent results demonstrated in previous chapters when a 60% explained
variance threshold was applied to determine the number of PCs and clusters to be utilized
for PC and OPCC tests, this criteria was used for these approaches (instead of 80%).
Table XIV. Estimated Type 1 Error
†
for SNP, PC, and Cluster interactions under various
interaction scenarios
CV combinations (MAF)
PW
SNP
a
PW
PC
a
PW
OPCC
a
PW
SNP
b
PW
PC
b
PW
OPCC
b
Scenario 1: No Marginal Effects
TCF7L2 GCK
SNP 54 (low) SNP 8 (low) 0.036 0.044 0.046 0.041 0.046 0.048
SNP 54 (low) SNP 53 (med) 0.032 0.042 0.058 0.035 0.044 0.058
SNP 54 (low) SNP 15 (high) 0.045 0.044 0.048 0.048 0.044 0.050
SNP 26 (med) SNP 8 (low) 0.043 0.040 0.050 0.050 0.046 0.054
SNP 26 (med) SNP 53 (med) 0.049 0.042 0.064 0.056 0.044 0.066
SNP 26 (med) SNP 15 (high) 0.050 0.056 0.055 0.053 0.058 0.056
SNP 60 (high) SNP 8 (low) 0.038 0.054 0.062 0.042 0.056 0.065
SNP 60 (high) SNP 53 (med) 0.039 0.058 0.056 0.043 0.058 0.058
SNP 60 (high) SNP 15 (high) 0.053 0.056 0.060 0.058 0.060 0.063
Scenario 2: Marginal Effect of GCK
TCF7L2 GCK
SNP 54 (low) SNP 8 (low) 0.039 0.042 0.048 0.048 0.044 0.048
SNP 54 (low) SNP 53 (med) 0.049 0.047 0.053 0.054 0.047 0.055
SNP 54 (low) SNP 15 (high) 0.035 0.048 0.050 0.040 0.048 0.052
SNP 26 (med) SNP 8 (low) 0.040 0.049 0.040 0.051 0.050 0.041
SNP 26 (med) SNP 53 (med) 0.048 0.047 0.047 0.052 0.048 0.048
SNP 26 (med) SNP 15 (high) 0.039 0.056 0.049 0.059 0.058 0.049
SNP 60 (high) SNP 8 (low) 0.046 0.050 0.054 0.051 0.052 0.055
SNP 60 (high) SNP 53 (med) 0.045 0.060 0.052 0.042 0.062 0.054
SNP 60 (high) SNP 15 (high) 0.042 0.046 0.047 0.048 0.046 0.047
100
Table XIV: Continued
Scenario 3: Marginal Effects of Both Loci
TCF7L2 GCK
SNP 54 (low) SNP 8 (low) 0.048 0.044 0.052 0.054 0.044 0.054
SNP 54 (low) SNP 53 (med) 0.040 0.050 0.058 0.048 0.051 0.062
SNP 54 (low) SNP 15 (high) 0.041 0.048 0.056 0.046 0.048 0.058
SNP 26 (med) SNP 8 (low) 0.040 0.048 0.043 0.045 0.050 0.044
SNP 26 (med) SNP 53 (med) 0.040 0.050 0.050 0.044 0.052 0.051
SNP 26 (med) SNP 15 (high) 0.044 0.056 0.052 0.050 0.056 0.053
SNP 60 (high) SNP 8 (low) 0.034 0.048 0.040 0.038 0.050 0.042
SNP 60 (high) SNP 53 (med) 0.039 0.062 0.052 0.043 0.063 0.060
SNP 60 (high) SNP 15 (high) 0.050 0.050 0.049 0.054 0.052 0.052
†
Values are the proportion of 1,000 null interaction data sets in which the test of association was rejected at
the 0.05 level. 1 df tests were evaluated with a Wald χ
1
2
statistic based on a multiple regression model for
simulated trait vs single-locus effects and pairwise multiplicative interaction (SNPs, PCs, clusters). PC and
OPCC tests were based on a minimum set of PCs or clusters that explain 60% of the variation in each gene.
a
1 df test of interaction effect, Bonferroni corrected for multiple testing
b
1 df test of interaction effect, corrected for multiple testing with FDR
Table XV. Estimated Type 1 Error
†
for SNP, PC, and Cluster interactions under
scenario of no marginal effects
CV combinations (MAF)
ITF
SNP
a
ITF
PC
a
ITF
Cluster
a
Scenario 1: No Marginal Effects
TCF7L2 GCK
SNP 54 (low) SNP 8 (low) 0.030 0.031 0.032
SNP 54 (low) SNP 53 (med) 0.034 0.032 0.038
SNP 54 (low) SNP 15 (high) 0.046 0.032 0.034
SNP 26 (med) SNP 8 (low) 0.036 0.034 0.038
SNP 26 (med) SNP 53 (med) 0.046 0.040 0.042
SNP 26 (med) SNP 15 (high) 0.036 0.036 0.040
SNP 60 (high) SNP 8 (low) 0.040 0.032 0.046
SNP 60 (high) SNP 53 (med) 0.034 0.038 0.040
SNP 60 (high) SNP 15 (high) 0.042 0.040 0.042
†
Values are the proportion of 1,000 null interaction data sets in which the test of association was
rejected at the 0.05 level. The PC and OPCC tests were based on the minimum set of PCs or
clusters that explained 60% of the variation in each gene.
a
ITF for interaction effects, corrected for multiple testing with FDR.
101
Power
For each CV, pairwise SNP, PC and OPCC analyses were conducted on 1,000
CV-associated data sets. 1-df tests were evaluated with an LRT for full model (main
effects plus interaction term) vs a reduced model (main effects only); p-values corrected
for multiple testing with the Bonferroni and FDR are presented. Estimates of power for
pairwise interaction tests under the 3 interaction scenarios are presented in Table XVI.
Power computed for ITF methods, under interaction scenario 1 (no marginal effects) is
shown in Figure 24. Each estimate is based on the proportion of 1,000 data sets showing
a significant association, after correction with Bonferroni or FDR, as described in the
methods. A 60% explained variance threshold was used to determine the number of PCs
and clusters to be analyzed in the PC and OPCC approaches, respectively.
Table XVI. Estimated Power
†
for Pairwise SNP, PC, and Cluster interactions under 3
scenarios
CV combinations (MAF)
PW
SNP
a
PW
PC
a
PW
OPCC
a
PW
SNP
b
PW
PC
b
PW
OPCC
b
Scenario 1: No Marginal Effects
TCF7L2 GCK
SNP 54 (low) SNP 8 (low) 0.140 0.080 0.169 0.156 0.085 0.173
SNP 54 (low) SNP 53 (med) 0.154 0.113 0.224 0.178 0.116 0.230
SNP 54 (low) SNP 15 (high) 0.085 0.074 0.087 0.098 0.075 0.108
SNP 26 (med) SNP 8 (low) 0.115 0.086 0.118 0.125 0.087 0.125
SNP 26 (med) SNP 53 (med) 0.128 0.110 0.187 0.146 0.114 0.191
SNP 26 (med) SNP 15 (high) 0.078 0.065 0.096 0.091 0.068 0.102
SNP 60 (high) SNP 8 (low) 0.086 0.082 0.087 0.089 0.085 0.092
SNP 60 (high) SNP 53 (med) 0.082 0.083 0.116 0.109 0.086 0.122
SNP 60 (high) SNP 15 (high) 0.068 0.071 0.075 0.079 0.082 0.080
102
Table XVI: Continued
Scenario 2: Marginal Effect of GCK
TCF7L2 GCK
SNP 54 (low) SNP 8 (low) 0.142 0.107 0.176 0.160 0.110 0.181
SNP 54 (low) SNP 53 (med) 0.173 0.117 0.219 0.192 0.120 0.233
SNP 54 (low) SNP 15 (high) 0.086 0.074 0.091 0.102 0.075 0.115
SNP 26 (med) SNP 8 (low) 0.116 0.092 0.119 0.126 0.093 0.124
SNP 26 (med) SNP 53 (med) 0.130 0.124 0.172 0.158 0.126 0.178
SNP 26 (med) SNP 15 (high) 0.082 0.083 0.096 0.094 0.084 0.104
SNP 60 (high) SNP 8 (low) 0.088 0.083 0.107 0.095 0.086 0.109
SNP 60 (high) SNP 53 (med) 0.090 0.089 0.115 0.114 0.090 0.123
SNP 60 (high) SNP 15 (high) 0.079 0.080 0.078 0.082 0.084 0.082
Scenario 3: Marginal Effects of Both Loci
TCF7L2 GCK
SNP 54 (low) SNP 8 (low) 0.146 0.109 0.150 0.163 0.078 0.176
SNP 54 (low) SNP 53 (med) 0.184 0.118 0.227 0.207 0.121 0.232
SNP 54 (low) SNP 15 (high) 0.089 0.074 0.098 0.102 0.076 0.110
SNP 26 (med) SNP 8 (low) 0.117 0.095 0.121 0.128 0.078 0.128
SNP 26 (med) SNP 53 (med) 0.146 0.129 0.200 0.169 0.102 0.207
SNP 26 (med) SNP 15 (high) 0.082 0.084 0.079 0.092 0.085 0.083
SNP 60 (high) SNP 8 (low) 0.088 0.084 0.082 0.094 0.087 0.086
SNP 60 (high) SNP 53 (med) 0.096 0.095 0.132 0.112 0.096 0.135
SNP 60 (high) SNP 15 (high) 0.080 0.084 0.075 0.081 0.085 0.081
†
Values are the proportion of 1,000 CV-associated data sets in which the test of association was rejected at
the 0.05 level. 1 df tests were evaluated with a Wald χ
1
2
statistic based on a multiple regression model for
simulated trait vs single locus effects and multiplicative interaction (SNPs, PCs or clusters). PC and OPCC
tests were based on a minimum set of PCs or clusters that explained 60% of the variation in each gene.
a
1 df test of interaction effect, Bonferroni corrected for multiple testing
b
1 df test of interaction effect, corrected for multiple testing with FDR
Analysis of pairwise SNP interactions, whether Bonferonni or FDR corrected,
yielded rather dismal power under all interaction scenarios (Table XVI). Under the
assumption of no single locus effects (scenario 1), power to detect significant interaction
effects ranged form 7-15% for Bonferroni-adjusted tests, compared to FDR-corrected
analyses, in which power ranged from 8-18%. Power improved very slightly for the
103
scenario of one marginal effect (scenario 2), and again a bit more for the scenario of two
locus marginal effects (scenario 3). Analysis of pairwise PCs showed equal or slightly
less than that for pairwise SNPs, whether Bonferroni or FDR corrected, even though
analysis of pairwise PCs required significantly less tests to be conducted. Pairwise SNP
analysis required 1045 tests of association to be performed (55 TCF7L2 SNPs (54 tags +
CV), 19 GCK SNPs (18 tags + CV)), whereas pairwise PCs, with quantity of PCs defined
by the 60% proportion of variance criteria, required an average of approx. 47 tests
performed (4-5 GCK PCs, 10-11 TCF7L2 PCs). Under scenario 1, power for the
pairwise PC approach ranged from 7-12%, and was fairly consistent across scenarios 2
and 3. In contrast, the pairwise OPCC approach demonstrated higher power than either
the pairwise SNP or pairwse PC methods, whether corrected for multiple testing with
Bonferroni or FDR. Under the scenario of no marginal effects, power for the pairwise
OPCC approach ranged from 8-23%, which for some CV combinations, is a marked
improvement over power for either the pairwise SNP or PC methods. Pairwise cluster
interaction tests, where the quantity of clusters was defined by a 60% explain variance
threshold, required an average of approx. 85 tests performed (5-6 GCK clusters, 15-16
TCF7L2 clusters). Thus, both the PC and OPCC approaches significantly reduced the
sample space from which to test interactions, but only the OPCC method improved power
over the standard pairwise SNP approach.
Because the pairwise approaches still require an unacceptable number of tests to
be performed, which contributes to a significant loss of power and efficiency, the ITF
procedure was applied to SNP, PC and cluster interactions. The ITF was specifically
applied to these approaches under the scenario of no single locus effects (scenario 1), for
104
two reasons: 1) a scenario of very weak or no single locus effects is arguably the most
common interaction scenario presumed for studying epistasis in complex diseases, and 2)
Millstein et al demonstrated that the relative efficiency of the 3 df test for joint effects
and multiplicative interaction vs a null model was greater than that for any other test
when no main effects were assumed [Millstein et al, 2006]. Moreover, the trend
observed for pairwise SNP, PC and cluster interactions under scenario 1 was relatively
similar under scenarios of marginal effects of one or more SNPs, with little increase in
power for the latter compared to the former.
As shown in Figure 25, power using the ITF procedure was markedly improved
for SNP, PC and OPCC tests, compared to traditional 1-df pairwise tests. Under the
hypothesis of no marginal effects, most tests performed in the ITF were 3-df LRTs for a
full model of joint and interaction effects vs the null model. For ITF SNP testing, power
ranged from 17-37%, compared to 8-18% power from FDR-corrected pairwise SNP
interaction tests. ITF on PCs resulted in an increase in power as well; power ranged from
13-28%, which was greater than that observed for pairwise PC test, but still lower than
that observed for the SNP ITF procedure. In the interaction testing framework, OPCC
outperformed both SNP and PC-based ITF methods, with power ranging from 19-42%.
Overall, these simulations demonstrate the power and efficiency of multi-df tests that
include both main and interaction effects, especially for clusters rather than single SNPs.
105
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
0.40
0.45
0.50
SNP 54 x SNP 8
SNP 54 x SNP 53
SNP 54 x SNP 15
SNP 26 x SNP 8
SNP 26 x SNP 53
SNP 26 x SNP 15
SNP 60 x SNP 8
SNP 60 x SNP 53
SNP 60 x SNP 15
Power
ITF SNP ITF PC (tags + CV, 60%) ITF PC (tags + CV, 80%)
ITF PC (All SNPs, 60%) ITF OPCC (tags + CV, 60%) ITF OPCC (tags + CV, 80%)
ITF OPCC (All SNPs, 60%)
Figure 25. Estimated power of ITF SNP, PC and OPCC interaction tests under 2
scenarios of genotype availability (Tag SNPs + CV, All SNPs) using 60% and 80%
explained variance thresholds.
In order to determine whether power for the PC and OPCC interaction tests could
be further improved by inclusion of additional PC or clusters, under a more stringent
proportion of variance criteria, ITF analyses were repeated using the number of PCs
and/or clusters than accounted for 80% of the total variation for each gene. At the 80%
explained variance threshold, an average of approx. 138 PC tests were conducted (7-8
GCK PCs, 18-19 TCF7L2 PCs), and an average of ~251 tests were conducted (9-10 GCK
clusters, 26-27 TCF7L2 clusters) for each CV combination. The number of tests
performed is still much lower than the number of pairwise SNP test, but substantially
more than the number conducted using only the number of PCs and clusters accounting
106
for 60% of the total locus variation. As such, power for tests of interacting loci was
generally not improved by inclusion of additional PCs or clusters (Figure 25).
In Chapter 3, it was shown that if “full gene” or “all SNPs” constitute the
available genotype data, greater power is typically observed for single locus tests of
significant main effects. Therefore, the PC- and OPCC-based ITF analyses were repeated
under the assumption that all variants in GCK and TCF7L2 were available for analysis
(not only tags + CV). As may be expected, using a 60% explained variance threshold for
the number of PCs or clusters included in the analysis, power for PC- and OPCC-based
ITFs was substantially higher when PCs and clusters were derived from “all SNPs”
compared with the “tags + CV” scenario (Figure 25). In an “all SNPs” scenario, an
average of approx. 21 PC tests and 38 OPCC tests were required, for each CV
combination. While power for both PC and OPCC improved, the ITF procedure utilizing
clusters formed from all available SNP data outperformed that for PCs. This finding is
consistent with those presented in Chapter 3, and is due in part, to the fact that less PCs or
clusters are required to achieve a given proportion of explained variance threshold when
“full gene” SNPs are analyzed, compared to data only consisting of “tag SNPs” or “tag
SNPs + CV”.
DATA ANALYSIS
Application to Cleveland Clinic Cohort Study Data
The Cleveland Clinic Cohort consists of 3,178 men and 1,696 women who have
entered the multidisciplinary clinic at some point between 1997 and 2004. Subjects
include patients with a history of coronary artery disease (CAD), cerebrovascular disease
107
(CVD), peripheral vascular disease (PVD), as well as clinic-based controls [Cho et al,
2008]. Clinic and laboratory data consist of several cardiovascular disease risk factors,
including body mass index (BMI), blood pressure, type 2 diabetes status, smoking,
fasting lipids (total cholesterol, LDL, HDL, triglycerides), fasting glucose and insulin, C-
reactive protein, and family history of CAD [Cho et al, 2008]. Recently, study
investigators genotyped DNA samples in 1,145 subjects with and without CAD, for
selected variants in a multitude of CAD and lipid-associated candidate genes. Study
characteristics for subjects with complete phenotyping and genotyping are summarized in
Table XVII.
Table XVII. Characteristics of Cleveland Clinic Cohort, values are Mean (Std)
Cases
(n=654)
Controls
(n=480)
Sex
Male N(%) 521 (68%) 240 (32%)
Female N(%) 133 (36%) 240 (64%)
Age [yrs] 65.9 (11.0) 60.6 (10.1)
Body Mass Index (BMI) [m/kg
2
] 29.37 (5.40) 29.55 (6.59)
Systolic Blood Pressure (SBP) [mm/Hg] 135 (21) 132 (18)
Diastolic Blood Pressure (DBP) [mm/Hg] 74 (11) 75 (11)
Statin Use
Yes N(%) 404 (75%) 135 (25%)
No N(%) 250 (42%) 345 (58%)
Triglyceride [mg/dl] 142.6 (78.6) 119.6 (61.1)
Low-Density Lipoprotein (LDL) [mg/dl] 97.5 (29.6) 105.1 (28.0)
High-Density Lipoprotein (HDL) [md/dl] 32.4 (8.9) 39.0 (11.6)
108
Given the study design, HDL was determined to be the most promising outcome
of interest, as statin use and other confounding factors were less likely to significantly
effect HDL levels than the other quantitative traits collected. Thus, to demonstrate the
methods developed in this chapter, the OPCC approach was applied under an interaction
testing framework to the analysis of two candidate genes in 480 clinic-based controls
from the Cleveland Clinic Cohort study. Apolipoprotein A-V (APOA5) and Cholesteryl
Ester Transfer Protein (CETP), located on chromosomes 11 and 16 respectively, have
both been shown to be significantly associated with HDL in multiple populations
[Knoblauch et al, 2004; Arnedo et al, 2007; Lu et al, 2008]. Furthermore, these
associations are supported by extensive functional evidence for the role of these genes in
triglyceride transport and cholesterol metabolism [Dachet et al, 2000; de Grooth et al,
2004; Qu et al, 2007].
Data analyzed included genotypes for 10 APOA5 SNPs and 18 CETP SNPs, all
with MAFs > 0.01, as well as HDL levels in 480 subjects without CAD. As the variants
were all originally selected for genotyping on the basis of tag SNPs in the HapMap CEU,
pairwise r
2
between all SNPs in each gene < 0.8. Observed genotype frequencies were
assessed for deviation from Hardy-Weinberg Equilibrium and allele frequencies were
estimated using Haploview (V.4.1). Quantitative trait data were transformed to
approximate univariate normality prior to analysis, and genotypes were coded as 0, 1 or 2
according to the number of minor alleles. In an effort to improve cluster fit as much as
possible, given that the data analyzed was a set of “uncorrelated” tag SNPs, an 80%
proportion of variance criteria was applied to determine the number of clusters to be
analyzed in each gene.
109
Results of univariate cluster association tests, as well as gene-specific cluster
characteristics are shown in Table V. CETP required 9 clusters, and APOA5 4 clusters, in
order to meet the 80% explained variance threshold. Cluster size ranged from 1 to 3
SNPs in CETP, and 1 to 4 SNPs in APOA5. Cluster fit among the SNPs assigned to each
cluster was extremely stable, for both genes, as indicated by a high R
2
O
(correlation
between a given SNP and its assigned cluster) and low (1-R
O
2
)/(1-R
N
2
) ratio (Table
XVIII). Tests of univariate cluster association were adjusted for baseline age, sex, BMI,
and Statin treatment, with p-values corrected for multiple testing to control FDR. As
shown in Table V, 4 of the 9 CETP clusters, and 1 of the 4 APOA5 clusters were
univariately associated with variation in HDL, after correction for multiple testing.
Clusters in CETP were then tested for interaction with APOA5 clusters, in an interaction
testing framework, where FDR was applied to all p-values in the second stage of testing.
After correcting for multiple testing, one interaction remained significant; the 1-df LRT
for the interaction involving APOA5 Cluster 2 and CETP Cluster 9 was statistically
significantly associated with HDL (FDR p=0.020, nominal p=0.00056).
110
Table XVIII. CETP and APOA5 SNP-Cluster Associations with HDL
Cluster # SNP Position
a
MAF R
O
2 b
R
N
2 c
(1-R
O
2
)/
(1-R
N
2
)
P-
value
d
CETP
Cluster 1 rs289719 55567441 0.319 0.9715 0.3628 0.0447 0.2397
rs5882 55573592 0.322 0.9715 0.3779 0.0458
Cluster 2 rs12447839 55551435 0.226 0.7113 0.3287 0.4301 0.8064
rs4784744 55568685 0.338 0.6551 0.2504 0.4601
rs820299 55557784 0.375 0.8268 0.3555 0.2688
Cluster 3 rs1864163 55554733 0.256 0.8872 0.3189 0.1656 0.0029
rs9939224 55560232 0.202 0.8872 0.4347 0.1996
Cluster 4 rs6499863 55549517 0.160 1.0000 0.1720 0.0000 0.4327
Cluster 5 rs12708974 55563050 0.123 1.0000 0.1404 0.0000 0.6649
Cluster 6 rs1801706 55575162 0.178 0.8536 0.4494 0.2659 0.1625
rs9923854 55574502 0.100 0.8536 0.2213 0.1880
Cluster 7 rs1800775 55552736 0.463 0.6651 0.2922 0.4732 0.0122
rs3764261 55550824 0.319 0.8204 0.2516 0.2400
rs4783961 55552394 0.499 0.6773 0.3709 0.5130
Cluster 8 rs12708967 55550711 0.189 0.8324 0.2122 0.2127 0.0134
rs9929488 55556072 0.268 0.8324 0.5070 0.3399
Cluster 9 rs1800774 55573045 0.355 0.7726 0.2643 0.3092 0.0342
rs289714 55564951 0.193 0.7726 0.3983 0.3780
APOA5
Cluster 1 rs1729408 116180027 0.360 0.5341 0.0648 0.4981
0.1734
rs1729410 116170870 0.468 0.8383 0.0821 0.1762
rs603446 116159644 0.416 0.7123 0.0848 0.3144
rs618923 116159368 0.262 0.5823 0.0417 0.4359
Cluster 2 rs12285095 116163240 0.061 0.9853 0.0090 0.0148
0.0018
rs3135506 116167616 0.060 0.9853 0.0096 0.0148
Cluster 3 rs11216137 116175037 0.073 0.8951 0.0384 0.1091
0.6046
rs1942478 116156672 0.091 0.8951 0.0273 0.1079
Cluster 4 rs2072560 116167035 0.081 0.8374 0.0105 0.1643
0.4345
rs6589567 116175885 0.137 0.8374 0.0186 0.1657
a
Based on HapMap Data Release 23a (Phase II), NCBI Build 36, dbSNP Build 126
b
Squared correlation coefficient between a given SNP and its own cluster
c
The next highest squared correlation coefficient between a given SNP and any other cluster
d
P-value from 1 df Wald χ
2
for univariate association, adjusted for BMI, age, sex, and Statin treatment
111
APOA5 Cluster 2 was comprised of 2 SNPs: rs12285095 and rs3135506.
Rs3135506, a missense mutation in exon 3, has been shown to be significantly associated
with increased triglycerides and decreased HDL [Lu et al, 2008; Keebler et al, 2009;
Webster et al, 2009]. These associations are difficult to parse, however, as triglyceride
and HDL are metabolically linked and thus typically observed in significantly strong
negative correlation. Moreover, CETP Cluster 9 is a compilation of rs1800774 and
rs289714. While both SNPs reside in intronic regions, are in strong LD in Caucasian
populations, and have been shown to be associated with body fat mass and history of
myocardial infarction, only rs289714 has been previously shown to be associated with
HDL [Thompson et al, 2005; Klos et al, 2006; Teran-Garcia et al, 2008]. Although an
interesting finding, the statistical evidence for interaction warrants further study of how
missense mutations in APOA5 might be affected by regulatory elements in CETP introns.
DISCUSSION
The results of the simulation studies confirm that testing interactions in a standard
pairwise interaction model, on 1 df, has dismal power to detect interaction effects with
very weak or no marginal effects are present. The simulation results also support the use
of OPCC-derived clusters as the unit of analysis, when testing interaction effects. As a
multivariate method that efficiently captures the correlation underlying SNPs in a
candidate gene or region, OPCC consistently provided more power to detect significant
main effects, and significant interaction effects, than tests of individual SNPs or PCs.
Despite the fact cluster association tests outperformed traditional SNP and PC
approaches, there are still a number of questions which remain to be explored. For
112
example, it is still not clear which explained variance threshold is most appropriate, for
specific tests, in various genotype availability scenarios. In the simulations of the present
chapter, it appears that a 60% threshold under either the scenario of “tag SNPs + CV”, or
“full gene” SNPs is relatively robust for tests of interactions. It is still obvious however,
that in real data analysis, circumstances can vary widely depending on gene size, SNP
correlation and LD patterns, etc. Thus, if tag SNPs are the only available data, which is
the case for most candidate gene studies, an 80% threshold may allow for improved
refinement of cluster definitions, and thus more power to detect interaction effects via
proxies for the true causal variants.
At any threshold, it is still apparent that more clusters than PCs will be required to
achieve a given amount of explained locus variance. For interaction tests, however, the
statistical penalty in increasing the multiple testing correction factor was seemingly
outweighed by the specificity of the clusters. In contrast, the lower power observed for
testing individual PCs for interaction, each of which represents a linear construct that
accounts for a small fraction of the variation in each locus, is consistent with the results
shown in Chapter 2 and 3 for univariate tests of single PC effects.
The ITF procedure also proved to be a useful and efficient means of testing gene
interactions when no marginal effects are present, for all methods studied. A relatively
simple and straightforward algorithm in implementation, the ITF proved to be most
useful for OPCC cluster interactions, increasing power by 2- to 4-fold for most CV
combinations. Despite the increase, the largest absolute power observed was primarily
for medium and high frequency CV combinations, at roughly 40-44%. Thus, a more
113
efficient means of further reducing the possible test combinations, perhaps focusing on
phenotype, rather than genotype, may further enhance the power of this test.
114
Chapter 5. Future Work
Several issues remain unresolved in terms of testing PCs or clusters for univariate
or interaction association with complex traits. While it is clear that testing cluster
association with a highly variable complex trait, univariately or in an interaction testing
framework, maintains or exceeds the power of the traditional PC approach, the issue of
the “best” or most appropriate explained variance criteria to use as the algorithm stopping
rule is not easily pre-determined. Various factors, such as amount of SNP data and
underlying LD patterns, as well as magnitude of multiple test correction factor to be
applied to association p-values, make it difficult to decide a priori, what level of
explained proportion of variance should be used to determine the number of clusters to be
analyzed. If another metric or preliminary test could be established that would allow
investigators to determine this threshold prior to association testing, investigators would
be better able to maximize power for a given analysis. Such a test would necessarily
involve modeling the penalty for increasing magnitude of multiple test correction vs
SNP-cluster fit. It was previously mentioned that the SAS Varclus procedure uses the
correlation between SNPs and clusters in SNP-cluster assignment, determined by
alternating least squares for a given SNP regressed on its cluster score, as indicated by the
value of R
2
O
. The ratio of (1 – R
2
O
)/(1 – R
2
N
) thus provides a measure of cluster fit.
Ideally, one would seek to simultaneously maximize cluster fit, while minimizing the
magnitude of the correction factor.
Moreover, for a set threshold of explained variance, testing all possible pairwise
cluster interactions still resulted in a very large number of tests over numerous loci and a
concomitant loss of power. As with CART, CPM, MDR, and all other computationally
115
burdensome methods, an informative way of limiting the number of gene interactions
would likely improve power to detect significant effects, as well as reduce computation
time. The interaction testing framework examined in this chapter for testing interactions
among PCs and clusters provided a powerful and efficient method for testing specific
interactions among candidate gene loci, but must be adapted in a manner that would
allow further focusing of tests, thus narrowing the number of tests to be performed.
An FITF approach for case-control data may be adapted to the study of
quantitative traits in a number of ways, depending on the data collected. In the “Focus”
or screening step of the FITF, Millstein et al capitalize on the fact that cases are
oversampled in a case-control sampling scheme, in only selecting for analysis those
genes that show significant difference between observed and expected frequencies among
the cases. The combined case-control sample is actually used for the calculation of the χ
2
goodness-of-fit statistic, in order to avoid biasing the disease-association test statistic,
under the assumption that deviation from expected prevalence in the pooled case-control
sample is indicative of deviation from the expected prevalence in cases. For cross-
sectional data, one could create a similar scenario by sampling individuals from the
extremes of a distribution. For example, one might collect cross-sectional data from a
population of subjects at high risk of developing a particular complex disease, as was
done in BetaGene for probands with 5-year history of gestational diabetes (GDM), who
are at extreme risk for developing T2D. A “Focused” screening step in this population
would further narrow down the search space of genes where one might expect most
quantitative traits to reflect GDM/T2D status.
116
Lastly, the present study includes single pairwise interactions between two CVs,
but does not include examination of the more likely complex disease scenarios of
multiple independent pairwise interactions, some of which act in concert with one
another, and some of which have opposing or compensatory physiological effects.
Hypothetically speaking, it may be possible for variants in TCF7L2 to have differential
interaction with variation in GCK, whereby one set of SNPs in TCF7L2 mediates GCK
sensing of glucose in the β-cell via transactivation of GCK’s islet-specific promoter, and
perhaps another set of TCF7L2 variants regulates GCK’s response to insulin in
hepatocytes via transactivation of its liver-specific promoter. Furthermore, additional
genes may be involved in this pathway; given TCF7L2 is a ubiquitously expressed
transcription factor which interacts with several others in the Wnt signaling pathway,
other intermediates may play a role in the regulation of GCK by TCF7L2. The FITF
approach currently allows for testing of high order interactions, in and LRT-based
framework, and has been shown to have substantial power to detect multi-gene
interactions in case-control data. Perhaps an extension of this approach is warranted to
test clusters derived from several genes, to test multi-cluster interaction within a
biologically relevant network for association with complex disease-traits.
117
BIBLIOGRAPHY
Allen AS and Satten GA. Robust estimation and testing of haplotype effects in case
control studies. Genetic Epidemiology 32:29-40 (2008).
Arnedo M, Taffé P, Sahli R, Furrer H, Hirschel B, Elzi L, Weber R, Vernazza P,
Bernasconi E, Darioli R, Bergmann S, Beckmann JS, Telenti A, Tarr PE; Swiss HIV
Cohort Study. Contribution of 20 single nucleotide polymorphisms of 13 genes to
dyslipidemia associated with antiretroviral therapy. Pharmacogenet Genomics
17(9):755-64 (2007).
Atkinson EJ, Fridley BL, Goode EL, McDonnell SK, Liu-Mares W, Rabe KG, Sun Z,
Slager SL, de Andrade M. Linkage analysis using principal components of gene
expression data. BMC Proceedings 1(Supp 1): S79 (2007).
Barlassina C, Lanzani C, Manunta P, Bianchi G. Genetics of essential hypertension: from
families to genes. J Am Soc Nephrol Suppl 3:S155-64 (2002).
Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and
haplotype maps. Bioinformatics 21: 263–265 (2005).
Benjamini Y and Hochberg Y. Controlling the False Discovery Rate: a practical and
powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B
(Methodological) 57(1): 289–300 (1995).
Bjursell M, Ahnmark A, Bohlooly-Y M, William-Olsson L, Rhedin M, Peng XR, Ploj K,
Gerdin AK, Arnerup G, Elmgren A, Berg AL, Oscarsson J, Lindén D. Opposing effects
of adiponectin receptors 1 and 2 on energy metabolism. Diabetes 56(3):583-93 (2007).
Black MH, Fingerlin TE, Allayee H, Zhang W, Xiang AH, Trigo E, Hartiala J, Lehtinen
AB, Haffner SM, Bergman RN, McEachin RC, Kjos SL, Lawrence JM, Buchanan TA,
Watanabe RM. Evidence of interaction between PPARG2 and HNF4A contributing to
variation in insulin sensitivity in Mexican Americans. Diabetes 57(4):1048-56 (2008).
Bonnycastle LL, Willer CJ, Conneely KN, Jackson AU, Burrill CP, Watanabe RM,
Chines PS, Narisu N, Scott LJ, Enloe ST, Swift AJ, Duren WL, Stringham HM, Erdos
MR, Riebow NL, Buchanan TA, Valle TT, Tuomilehto J, Bergman RN, Mohlke KL,
Boehnke M, Collins FS: Common variants in maturity-onset diabetes of the young genes
contribute to risk of type 2 diabetes in Finns. Diabetes 55:2534 –2540 (2006).
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and
Regression Trees. Wadsworth International, Belmont, CA (1984).
118
Briollais L, Wang Y, Rajendram I, Onay V, Shi E, Knight J, Ozcelik H. Methodological
issues in detecting gene-gene interactions in breast cancer susceptibility: a population-
based study in Ontario. BMC Medicine 5(22): (2007).
Byrne MM, Sturis J, Clement K, Vionnet N, Pueyo ME, Stoffel M, Takeda J, Passa P,
Cohen D, Bell GI: Insulin secretory abnormalities in subjects with hyperglycemia due to
glucokinase mutations. J Clin Invest 98:1120 –1130 (1994).
Cai G, Cole SA, Freeland-Graves JH, MacCluer JW, Blangero J, Comuzzie AG.
Principal component for metabolic syndrome risk maps to chromosome 4p in Mexican
Americans: the San Antonio Family Heart Study. Human Biology 76:651–665 (2004).
Cargill, M. et al Characterization of single-nucleotide polymorphisms in coding
regions of human genes. Nature Genetics 22:231–238 (1999).
Carroll JB. An analytical solution for approximating simple structure in factor analysis.
Psychometrika 18(1): 23-38 (1953).
Cauchi S, Nead KT, Choquet H, Horber F, Potoczna N, Balkau B, Marre M, Charpentier
G, Froguel P, Meyre D. The genetic susceptibility to type 2 diabetes may be modulated
by obesity status: implications for association studies. BMC Med Genet. 22:9-45 (2008).
Chase K, Carrier DR, Adler FR, Jarvik T, Ostrander EA, Lorentzen TD, Lark KG.
Genetic basis for systems of skeletal quantitative traits: Principal component analysis of
the canid skeleton. PNAS 99(15):9930-9935 (2002).
Cho L, Hoogwerf B, Huang J, Brennan DM, Hazen SL. Gender differences in utilization
of effective cardiovascular secondary prevention: a Cleveland Clinic prevention database
study. J Womens Health 17(4):515-21 (2008).
Cheung V. and Spielman R. The genetics of variation in gene expression. Nature
Genetics 32: 522–525 (2002).
Chiu KC, McCarthy JE. Promoter variation in the liver glucokinase is a risk factor for
non-insulin-dependent diabetes mellitus. Biochem Biophys Res Commun. 221(3):614-8
(1996).
Chiu KC, Chuang LM, Yoon C, Saad MF. Hepatic glucokinase promoter polymorphism
is associated with hepatic insulin resistance in Asian Indians. BMC Genet. 1:2 (2000).
Clark AG. The role of haplotypes in candidate gene studies. Genetic Epidemiology
27(4):321-333 (2004).
Collins, F.S., Guyer, M.S., Chakravarti, A. Variations on a theme: cataloging human
DNA sequence variation. Science 278: 1580−1581 (1997).
119
Conneely K.N. and Boehnke M. So many correlated tests, so little time! Rapid
adjustment of p-values for multiple correlated tests. American Journal of Human
Genetics 81(6): (2007)
Cordell H.J. Epistasis: what it means, what it doesn’t mean, and statistical methods to
detect it in humans. Human Molecular Genetics 11(20): 2463-2468 (2002).
Culverhouse R. A perspective on epistasis: limits of models displaying no main effects.
American Journal of Human Genetics 70:461–471 (2002).
Culverhouse R, Klein T, Shannon W. Detecting epistatic interaction contributing to
quantitative traits. Genetic Epidemiology 27: 141-152 (2004).
Dachet C, Poirier O, Cambien F, Chapman J, Rouis M. New functional promoter
polymorphism, CETP/-629, in cholesteryl ester transfer protein (CETP) gene related to
CETP mass and high density lipoprotein cholesterol levels: role of Sp1/Sp3 in
transcriptional regulation. Arterioscler Thromb Vasc Biol 20:507–515 (2000).
Dahlgren A, Zethelius B, Jensevik K, Syvänen AC, Berne C. Variants of the TCF7L2
gene are associated with beta cell dysfunction and confer an increased risk of type 2
diabetes mellitus in the ULSAM cohort of Swedish elderly men. Diabetologia
50(9):1852-7 (2007).
Damcott CM, Pollin TI, Reinhart LJ, Ott SH, Shen H, Silver KD, Mitchell BD, Shuldiner
AR. Polymorphisms in the transcription factor 7-like 2 (TCF7L2) gene are associated
with type 2 diabetes in the Amish: replication and evidence for a role in both insulin
secretion and insulin resistance. Diabetes 55(9):2654-9 (2006).
de Bakker PIW, Yelensky R, Pe'er I, Gabriel SB, Daly MJ, Altshuler D. Efficiency and
power in genetic association studies. Nature Genetics 37: 1217-1223 (2005).
de Grooth GJ, Klerkx AH, Stroes ES, Stalenhoef AF, Kastelein JJ, Kuivenhoven JA. A
review of CETP and its relation to atherosclerosis. J Lipid Res. Nov;45(11):1967-74
(2004).
Doney AS, Fischer B, Leese G, Morris AD, Palmer CN. Cardiovascular risk in type 2
diabetes is associated with variation at the PPARG locus: a Go-DARTS study.
Arterioscler Thromb Vasc Biol 24(12):2403-7 (2004).
Dong C, Li WD, Li D, Price RA. Interaction between obesity-susceptibility loci in
chromosome regions 2p25-p24 and 13q13-q21. Eur J Hum Genetics 13(1):102-8 (2005).
Egea M, Metón I, Córdoba M, Fernández F, Baanante IV. Role of Sp1 and SREBP-1a in
the insulin-mediated regulation of glucokinase transcription in the liver of gilthead sea
bream (Sparus aurata). Gen Comp Endocrinol. 155(2):359-67 (2008).
120
Epstein MP, Satten GA. Inference on haplotype effects in case-control studies using
unphased genotype data. American Journal of Human Genetics 73:1316–1329 (2003).
Fallin D, Cohen A, Essioux L, Chumakov I, Blumenfield M, Cohen D, Schork N.
Genetic analysis of case/control data using estimated haplotype frequencies: application
to APOE locus variation and Alzheimer’s disease. Genome Research 11:143–151 (2001).
Florez JC, Jablonski KA, Bayley N, Pollin TI, de Bakker PI, Shuldiner AR, Knowler
WC, Nathan DM, Altshuler D. TCF7L2 polymorphisms and progression to diabetes in
the Diabetes Prevention Program. New England Journal of Medicine 355(3):241-50
(2006).
Foretz, M, Guichard C, Ferre P, Foufelle, F. Sterol regulatory element binding protein-1c
is a major mediator of insulin action on the hepatic expression of glucokinase and
lipogenesis-related genes. Proc. Natl. Acad. Sci. USA 96, 12737–12742 (1999)
Froguel P, Vaxillaire M, Sun F, Velho G, Zouali H, Butel MO, Lesage S, Vionnet N,
Clement K, Fougerousse F: Close linkage of glucokinase locus on chromosome 7p to
early-onset non-insulin-dependent diabetes mellitus. Nature 356:162–164 (1992).
Gabriel SB, Schnaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J,
DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R,
Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the
human genome. Science 296: 2225-2229 (2002).
Gauderman W, Morrison J. QUANTO 1.2. A computer program for power and sample
size calculations for genetic epidemiology studies: Available at http://hydra.usc.edu/gxe.
(2006).
Gauderman JW, Murcray C, Gilliland F, Conti D. Testing association between disease
and multiple SNPs in a candidate gene. Genetic Epidemiology 31: 383-395 (2007).
Goodman LA. On the exact variance of products. Journal of the American Statistical
Association 55(292):708-713 (1960).
Grant SF, Thorleifsson G, Reynisdottir I, Benediktsson R, Manolescu A, Sainz J,
Helgason A, Stefansson H, Emilsson V, Helgadottir A, Styrkarsdottir U, Magnusson KP,
Walters GB, Palsdottir E, Jonsdottir T, Gudmundsdottir T, Gylfason A, Saemundsdottir J,
Wilensky RL, Reilly MP, Rader DJ, Bagger Y, Christiansen C, Gudnason V, Sigurdsson
G, Thorsteinsdottir U, Gulcher JR, Kong A, Stefansson K. Variant of transcription factor
7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nature Genetics 38(3):320-3
(2006).
121
Groves CJ, Zeggini E, Minton J, Frayling TM, Weedon MN, Rayner NW, Hitman GA,
Walker M, Wiltshire S, Hattersley AT, McCarthy MI. Association analysis of 6,736 U.K.
subjects provides replication and confirms TCF7L2 as a type 2 diabetes susceptibility
gene with a substantial effect on individual risk. Diabetes 55(9):2640-4 (2006).
Heidema AG, Boer JMA, Nagelkerke N, Mariman ECM, Van Der A DL, Feskens EJM.
The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in
relation to complex diseases. BMC Genetics 7(23) (2006).
Hendrick P. Gametic disequilibrium measures: proceed with caution. Genetics 117: 331-
341 (1987).
Halushka, M.K. et al Patterns of single-nucleotide polymorphisms in candidate genes for
blood-pressure homeostasis. Nature Genetics 22: 239–247 (1999).
Harris CW and Kaiser HF. Oblique factor analytic solutions by orthogonal
transformations. Psychometrika 29: 347 – 362 (1964).
Hattersley AT, Turner RC, Patel P, O’Rahilly S, Hattersley AT, Patel P, Wainscoat JS,
Permutt MA, Tanazawa Y, Chin KC, Watkins P: Linkage of type 2 diabetes to the
glucokinase gene. Lancet 339:1307–1310 (1992).
Hoh J, Wille A, Ott J. Trimming, weighting and grouping SNPs in human case-control
association studies. Genome Research 11:2115-2119 (2001).
Hsueh WC, Cole SA, Shuldiner AR, Beamer BA, Blangero J, Hixson JE, MacCluer JW,
Mitchell BD. Interactions between variants in the beta3-adrenergic receptor and
peroxisome proliferator-activated receptor-gamma2 genes and obesity. Diabetes Care
24(4):672-7 (2001).
Huang M, Dinney CP, Lin X, Lin J, Grossman HB, Wu X. High-order interactions
among genetic variants in DNA base excision repair pathway genes and smoking in
bladder cancer susceptibility. Cancer Epidemiol Biomarkers Prev 16(1): 84-91 (2007).
Huertas-Vazquez A, Plaisier C, Weissglas-Volkov D, Sinsheimer J, Canizales-Quinteros
S, Cruz-Bautista I, Nikkola E, Herrera-Hernandez M, Davila-Cervantes A, Tusie-Luna T,
Taskinen MR, Aguilar-Salinas C, Pajukanta P. TCF7L2 is associated with high serum
triacylglycerol and differentially expressed in adipose tissue in families with familial
combined hyperlipidaemia. Diabetologia 51(1):62-9 (2008).
Humphries SE, Gable D, Cooper JA, Ireland H, Stephens JW, Hurel SJ, Li KW, Palmen
J, Miller MA, Cappuccio FP, Elkeles R, Godsland I, Miller GJ, Talmud PJ. Common
variants in the TCF7L2 gene and predisposition to type 2 diabetes in UK European
Whites, Indian Asians and Afro-Caribbean men and women. Journal of Molecular
Medicine 84(12):1005-14 (2006).
122
Joliffe IT. Principal Component Analysis. Springer, 2002.
Keebler ME, Sanders CL, Surti A, Guiducci C, Burtt NP, Kathiresan S. Association of
Blood Lipids With Common DNA Sequence Variants at 19 Genetic Loci in the
Multiethnic United States. National Health and Nutrition Examination Survey III. Circ
Cardiovasc Genet. 2:238-243 (2009).
Kiers HAL and Ten Berge JMF. The Harris-Kaiser independent cluster rotation as a
method for rotation to simple component weights. Psychometrika 59(1): 81–90 (1994).
Kim JH, Sen S, Avery CS, Simpson E, Chandler P, Nishina PM, Churchill GA, Naggert
JK. Genetic analysis of a new mouse model for non-insulin-dependent diabetes.
Genomics 74(3): 273-86 (2001).
Kim SY, Kim HI, Kim TH, Im SS, Park SK, Lee IK, Kim KS, Ahn YH. SREBP-1c
mediates the insulin-dependent hepatic glucokinase expression. J. Biol. Chem. 279,
30823–30829 (2004).
Klos, KL, Sing, CF, Boerwinkle, E, Hamon, SC, Rea, TJ, Clark, A. Consistent effects of
genes involved in reverse cholesterol transport on plasma lipid and apolipoprotein levels
in CARDIA participants. Arterioscler. Thromb. Vasc. Biol. 26, 1828–1836 (2006).
Knoblauch H, Bauerfeind A, Toliat MR, Becker C, Luganskaja T, Gunther UP, Rohde K,
Schuster H, Junghans C, Luft FC. Haplotypes and SNPs in 13 lipid-relevant genes
explain most of the genetic variance in high-density lipoprotein and low-density
lipoprotein cholesterol. Hum. Mol. Genet. 13: 993–1004 (2004).
Koopberg C, Ruczinski I, LeBlanc ML, Hsu L. Sequence analysis using logic regression.
Genetic Epidemiology 21(S1): S636–S631 (2001).
Koopergberg C, Bis JC, Marciante KD, Heckbert SR, Lumley T, Psaty BM. Logic
regression for the analysis of the association between genetic variation in the rennin-
angiotensin system and myocardial infarction or stroke. American Journal of
Epidemiology 165(3):334-343 (2006).
Lewontin R. On measures of gametic disequilibrium. Genetics 120:849-852 (1988).
Lohmueller K.E., Pearce C.L., Pike M., Lander E., Hirschhorn J.N. Meta-analysis of
genetic association studies supports a contribution of common variants to susceptibility to
common disease. Nature Genetics 33: 177-182 (2002).
Lou XY, Chen GB, Yan L, Ma JZ, Zhu J, Elston RC, Li MD. A generalized
combinatorial approach for detecting gene-by-gene and gene-by-environment interactions
with application to nicotine dependence. American Journal of Human Genetics 80:1125-
37 (2007).
123
Lu Y, Dollé ME, Imholz S, van 't Slot R, Verschuren WM, Wijmenga C, Feskens EJ,
Boer JM. Multiple genetic variants along candidate pathways influence plasma high-
density lipoprotein cholesterol concentrations. J Lipid Res. 49(12):2582-9 (2008).
Magnuson MA, Shelton KD. An alternate promoter in the glucokinase gene is active in
the pancreatic beta cell. J Biol Chem. 264(27):15936-42 (1989a).
Magnuson MA, Andreone TL, Printz RL, Koch S, Granner DK. Rat glucokinase gene:
structure and regulation by insulin. Proc Natl Acad Sci U S A 86(13):4838-42 (1989b).
Matschinsky FM. Glucokinase as glucose sensor and metabolic signal generator in
pancreatic beta-cells and hepatocytes. Diabetes 39(6):647-52 (1990).
Melzer D, Murray A, Hurst AJ, Weedon MN, Bandinelli S, Corsi AM, Ferrucci L,
Paolisso G, Guralnik JM, Frayling TM. Effects of the diabetes linked TCF7L2
polymorphism in a representative older population. BMC Medicine 4:34 (2006).
Meyers LS, Gamst G, Gaurino AJ. Applied Multivariate Research. Sage, 2006.
Millstein J, Conti D, Gilliland FD, Gauderman WJ. A testing framework for identifying
susceptibility genes in the presence of epistasis. American Journal of Human Genetics
78(1): 15-27 (2006).
Munoz J, Lok KH, Gower BA, Fernandez JR, Hunter GR, Lara-Castro C, De Luca M,
Garvey WT. Polymorphism in the transcription factor 7-like 2 (TCF7L2) gene is
associated with reduced insulin secretion in nondiabetic women. Diabetes 55(12):3630-4
(2006).
Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari H, Allison DB.
Detection of gene x gene interactions in genome-wide association studies of human
population data. Human Heredity 63:67-84 (2007).
Nelson MR, Kardia SLR, Ferrell RE, Sing CF. A combinatorial partitioning method to
identify multilocus genotypic partitions that predict quantitative trait variation. Genome
Research 11: 458-470 (2001).
Ott J and Hoh J. Set association analysis of SNP case-control and microarray data.
Journal of Computational Biology 10(3-4): 569-574 (2003).
Palmer ND, Lehtinen AB, Langefeld CD, Campbell JK, Haffner SM, Norris JM,
Bergman RN, Goodarzi MO, Rotter JI, Bowden DW. Association of TCF7L2 gene
polymorphisms with reduced acute insulin response in Hispanic Americans. Journal of
Clinical Endocrinology and Metabolism 93(1):304-9 (2008).
124
Pritchard JK and Cox NJ. The allelic architecture of human disease genes: common
disease – common variant . . . or not? Human Molecular Genetics 11(20): 2417–2423
(2002).
Prokopenko I, Langenberg C, Florez JC, Saxena R, Soranzo N, Thorleifsson G, Loos RJ,
Manning AK, Jackson AU, Aulchenko Y, Potter SC, Erdos MR, Sanna S, Hottenga JJ,
Wheeler E, Kaakinen M, Lyssenko V, Chen WM, Ahmadi K, Beckmann JS, Bergman
RN, Bochud M, Bonnycastle LL, Buchanan TA, Cao A, Cervino A, Coin L, Collins FS,
Crisponi L, de Geus EJ, Dehghan A, Deloukas P, Doney AS, Elliott P, Freimer N, Gateva
V, Herder C, Hofman A, Hughes TE, Hunt S, Illig T, Inouye M, Isomaa B, Johnson T,
Kong A, Krestyaninova M, Kuusisto J, Laakso M, Lim N, Lindblad U, Lindgren CM,
McCann OT, Mohlke KL, Morris AD, Naitza S, Orrù M, Palmer CN, Pouta A, Randall J,
Rathmann W, Saramies J, Scheet P, Scott LJ, Scuteri A, Sharp S, Sijbrands E, Smit JH,
Song K, Steinthorsdottir V, Stringham HM, Tuomi T, Tuomilehto J, Uitterlinden AG,
Voight BF, Waterworth D, Wichmann HE, Willemsen G, Witteman JC, Yuan X, Zhao
JH, Zeggini E, Schlessinger D, Sandhu M, Boomsma DI, Uda M, Spector TD, Penninx
BW, Altshuler D, Vollenweider P, Jarvelin MR, Lakatta E, Waeber G, Fox CS, Peltonen
L, Groop LC, Mooser V, Cupples LA, Thorsteinsdottir U, Boehnke M, Barroso I, Van
Duijn C, Dupuis J, Watanabe RM, Stefansson K, McCarthy MI, Wareham NJ, Meigs JB,
Abecasis GR. Variants in MTNR1B influence fasting glucose levels. Nat Genet.
41(1):77-81 (2009).
Qin ZS, Tianhua N, Liu JS. Partition-Ligation–Expectation-Maximization Algorithm for
haplotype inference with single-nucleotide polymorphisms. American Journal of Human
Genetics 71(5): 1242-1247.
Qu S, Perdomo G, Su D, D'Souza FM, Shachter NS, Dong HH. Effects of apoA-V on
HDL and VLDL metabolism in APOC3 transgenic mice. J Lipid Res. 48(7):1476-87
(2007).
Risch N and Merikangas K. The future of genetic studies of complex human diseases.
Science 273: 1516−1517 (1996).
Ritchie MD, Hahn LW, Roodi N, Bailey RL, Dupont WD, Parl FF, Moore JH.
Mutlifactor-Dimensionality Reduction reveals high-order interactions among estrogen-
metabolism gene in sporadic breast cancer. American Journal of Human Genetics 69:
138-147 (2001).
Roeder K, Bacanu SA, Sonpar V, Zhang X, Devlin B. Analysis of single-locus tests to
detect gene/disease associations. Genetic Epidemiology 28(3): 207–219 (2005).
Rose CS, Ek J, Urhammer SA, Glumer C, Borch-Johnsen K, Jorgensen T, Pedersen O,
Hansen T: A -30G_A polymorphism of the β-cell–specific glucokinase promoter
associates with hyperglycemia in the general population of whites. Diabetes 54:3026 –
3031 (2005).
125
Saxena R, Gianniny L, Burtt NP, Lyssenko V, Giuducci C, Sjögren M, Florez JC,
Almgren P, Isomaa B, Orho-Melander M, Lindblad U, Daly MJ, Tuomi T, Hirschhorn
JN, Ardlie KG, Groop LC, Altshuler D. Common single nucleotide polymorphisms in
TCF7L2 are reproducibly associated with type 2 diabetes and reduce the insulin response
to glucose in nondiabetic individuals. Diabetes 55(10):2890-5 (2006).
Saxena R, Voight BF, Lyssenko V, Burtt NP, de Bakker PI, Chen H, Roix JJ, Kathiresan
S, Hirschhorn JN, Daly MJ, Hughes TE, Groop L, Altshuler D, Almgren P, Florez JC,
Meyer J, Ardlie K, Bengtsson Boström K, Isomaa B, Lettre G, Lindblad U, Lyon HN,
Melander O, Newton-Cheh C, Nilsson P, Orho-Melander M, Råstam L, Speliotes EK,
Taskinen MR, Tuomi T, Guiducci C, Berglund A, Carlson J, Gianniny L, Hackett R, Hall
L, Holmkvist J, Laurila E, Sjögren M, Sterner M, Surti A, Svensson M, Svensson M,
Tewhey R, Blumenstiel B, Parkin M, Defelice M, Barry R, Brodeur W, Camarata J, Chia
N, Fava M, Gibbons J, Handsaker B, Healy C, Nguyen K, Gates C, Sougnez C, Gage D,
Nizzari M, Gabriel SB, Chirn GW, Ma Q, Parikh H, Richardson D, Ricke D, Purcell S.
Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride
levels. Science 316(5829):1331-6 (2007).
Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. Score tests for
association between traits and haplotypes when linkage phase is ambiguous. American
Journal of Human Genetics 70:425–434 (2002).
Schaid DJ. Evaluating associations of haplotypes with traits. Genetic Epidemiology 27:
348–364 (2004).
Scott LJ, Bonnycastle LL, Willer CJ, Sprau AG, Jackson AU, Narisu N, Duren WL,
Chines PS, Stringham HM, Erdos MR, Valle TT, Tuomilehto J, Bergman RN, Mohlke
KL, Collins FS, Boehnke M. Association of transcription factor 7-like 2 (TCF7L2)
variants with type 2 diabetes in a Finnish sample. Diabetes 55(9):2649-53 (2006).
Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, Duren WL, Erdos MR,
Stringham HM, Chines PS, Jackson AU, Prokunina-Olsson L, Ding CJ, Swift AJ, Narisu
N, Hu T, Pruim R, Xiao R, Li XY, Conneely KN, Riebow NL, Sprau AG, Tong M,
White PP, Hetrick KN, Barnhart MW, Bark CW, Goldstein JL, Watkins L, Xiang F,
Saramies J, Buchanan TA, Watanabe RM, Valle TT, Kinnunen L, Abecasis GR, Pugh
EW, Doheny KF, Bergman RN, Tuomilehto J, Collins FS, Boehnke M. A genome-wide
association study of type 2 diabetes in Finns detects multiple susceptibility variants.
Science 316(5829):1341-1345 (2007).
Silander K, Mohlke KL, Scott LJ, Peck EC, Hollstein P, Skol AD, Jackson AU, Deloukas
P, Hunt S, Stavrides G, Chines PS, Erdos MR, Narisu N, Conneely KN, Li C, Fingerlin
TE, Dhanjal SK, Valle TT, Bergman RN, Tuomilehto J, Watanabe RM, Boehnke M,
Collins FS: Genetic variation near the hepatocyte nuclear factor-4alpha gene predicts
susceptibility to type 2 diabetes. Diabetes 53:1141–1149 (2004).
126
Sparsø T, Andersen G, Nielsen T, Burgdorf KS, Gjesing AP, Nielsen AL, Albrechtsen A,
Rasmussen SS, Jørgensen T, Borch-Johnsen K, Sandbaek A, Lauritzen T, Madsbad S,
Hansen T, Pedersen O. The GCKR rs780094 polymorphism is associated with elevated
fasting serum triacylglycerol, reduced fasting and OGTT-related insulinaemia, and
reduced risk of type 2 diabetes. Diabetologia 51(1):70-5 (2008).
Staessen JA, Wang JG, Brand E, Barlassina C, Birkenhäger WH, Herrmann SM, Fagard
R, Tizzoni L, Bianchi G. Effects of three candidate genes on prevalence and incidence of
hypertension in a Caucasian population. Journal of Hypertension 19(8):1349-58 (2001).
Stephens M, Smith NJ, Donnelly P. A new statistical method for haplotype
reconstruction from population data. American Journal of Human Genetics 68:978-989
(2001).
Stewart J. Towards the Genetic Analysis of Multifactorial Diseases: The Estimation of
Allele Frequency and Epistasis. Human Heredity 54: 118-131 (2002).
Stone L, Kahn S, Fujimoto W, Deeb S, Porte D: A variation at position -30 of the β-cell
glucokinase gene promoter is associated with reduced β-cell function in middle-aged
Japanese-American men. Diabetes 45:422–428 (1996).
Tam CH, Ma RC, So WY, Wang Y, Lam VK, Germer S, Martin M, Chan JC, Ng MC.
Interaction effect of genetic polymorphisms in glucokinase (GCK) and glucokinase
regulatory protein (GCKR) on metabolic traits in healthy Chinese adults and adolescents.
Diabetes 58(3):765-9 (2009).
Terán-García M, Després JP, Tremblay A, Bouchard C. Effects of cholesterol ester
transfer protein (CETP) gene on adiposity in response to long-term overfeeding.
Atherosclerosis 196(1):455-60 (2008).
Thompson JF, Durham LK, Lira ME, Shear C, Milos PM. CETP polymorphisms
associated with HDL cholesterol may differ from those associated with cardiovascular
disease. Atherosclerosis 181(1):45-53 (2005).
Thurston LL. Multiple-factor analysis. Chicago: University of Chicago Press (1947).
Vermeulen SHHM, Heijer MD, Sham P, Knight J. Application of multi-locus analytical
methods to identify interacting loci in case-control studies. Annals of Human Genetics
71: 689–700 (2007).
Tinto N, Zagari A, Capuano M, De Simone A, Capobianco V, Daniele G, Giugliano M,
Spadaro R, Franzese A, Sacchetti L. Glucokinase gene mutations: structural and
genotype-phenotype analyses in MODY children from South Italy. PLoS One
2;3(4):e1870 (2008).
127
Valdar W, Solberg LC, Gauguier D, Cookson WO, Rawlins JN, Mott R, Flint J. Genetic
and environmental effects on complex traits in mice. Genetics 174(2):959-84 (2006).
Vaxillaire M, Veslot J, Dina C, Proença C, Cauchi S, Charpentier G, Tichet J, Fumeron
F, Marre M, Meyre D, Balkau B, Froguel P; DESIR Study Group. Impact of common
type 2 diabetes risk polymorphisms in the DESIR prospective study. Diabetes 57(1):244-
54 (2008).
Velho G, Froguel P, Clement K, Pueyo ME, Rakotoambinina B, Zouali H, Passa P,
Cohen D, Robert JJ. Primary pancreatic beta-cell secretory defect caused by mutations in
glucokinase gene in kindreds of maturity onset diabetes of the young. Lancet 340(8817):
444-8 (1992).
Vermeulen SHHM, Den Heijer M, Sham P, Knight J. Application of multi-locus
analytical methods to identify interacting loci in case-control studies. Annals of Human
Genetics 71:689-700 (2007).
Vionnet N, Stoffel M, Takeda J, Yasuda K, Bell GI, Zouali H, Lesage S, Velho G, Iris F,
Passa P, Froguel P, Cohen D: Nonsense mutation in the glucokinase gene causes early-
onset non-insulin-dependent diabetes mellitus. Nature 356:721–722 (1992).
Wang K, Abbott D. A principal components regression approach to multilocus genetic
association studies. Genetic Epidemiology 32(2): 108-118 (2008).
Wang N, Akey JM, Zhang K, Chakraborty R, Jin L. Distribution of recombination
crossovers and the origin of haplotype blocks: the interplay of population history,
recombination, and mutation. American Journal of Human Genetics 71: 1227-1234
(2002).
Watanabe RM, Allayee H, Xiang AH, Trigo E, Hartiala J, Lawrence JM, Buchanan TA.
Transcription factor 7-like 2 (TCF7L2) is associated with gestational diabetes mellitus
and interacts with adiposity to alter insulin secretion in Mexican Americans. Diabetes
56(5):1481-5 (2007).
Webster RJ, Warrington NM, Weedon MN, Hattersley AT, McCaskie PA, Beilby JP,
Palmer LJ, Frayling TM. The association of common genetic variants in the APOA5,
LPL and GCK genes with longitudinal changes in metabolic and cardiovascular traits.
Diabetologia 52(1):106-14 (2009).
Weedon MN, Frayling TM, Shields B, Knight B, Turner T, Metcalf BS, Voss L, Wilkin
TJ, McCarthy A, Ben-Shlomo Y, Davey Smith G, Ring S, Jones R, Golding J, ALSPAC
Study Team, Byberg L, Mann V, Axelsson T, Syvanen A-C, Leon D, Hattersley AT:
Genetic regulation of birth weight and fasting glucose by a common polymorphism in the
islet cell promoter of the glucokinase gene. Diabetes 54:576 –581 (2005).
128
Weedon MN, McCarthy MI, Hitman G, Walker M, Groves CJ, et al Combining
information from common type 2 diabetes risk polymorphisms improves disease
prediction. PLoS Medicine 3(10): 1877-1882 (2006).
Wilcox MA, Zhong L, Tapper W. Genetic association with rheumatoid arthritis –
Genetic Analysis Workshop 15 : Summary of contributions from Group 2. Genetic
Epidemiology 31(S1): S12 – S21 (2007).
Williams SM, Addy JH, Phillips JA 3rd, Dai M, Kpodonu J, Afful J, Jackson H, Joseph
K, Eason F, Murray MM, Epperson P, Aduonum A, Wong LJ, Jose PA, Felder RA.
Combinations of variations in multiple genes are associated with hypertension.
Hypertension 36(1):2-6 (2000).
Winckler W, Weedon MN, Graham RR, McCarroll SA, Purcell S, Almgren P, Tuomi T,
Gaudet D, Bostrom KB, Walker M, Hitman G, Hattersley AT, McCarthy MI, Ardlie KG,
Hirschhorn JN, Daly MJ, Frayling TM, Groop L, Altshuler D: Evaluation of common
variants in the six known maturity onset diabetes of the young (MODY) genes for
association with type 2 diabetes. Diabetes 56:685– 693 (2007).
Yamamoto T, Shimano H, Nakagawa Y, Ide T, Yahagi N, Matsuzaka T, Nakakuki M,
Takahashi A, Suzuki H, Sone H, Toyoshima H, Sato R, Yamada N. SREBP-1 interacts
with hepatocyte nuclear factor-4 alpha and interferes with PGC-1 recruitment to suppress
hepatic gluconeogenic genes. J. Biol. Chem. 279, 12027–12035 (2004).
Zhang C, Qi L, Hunter DJ, Meigs JB, Manson JE, van Dam RM, Hu FB. Variant of
transcription factor 7-like 2 (TCF7L2) gene and the risk of type 2 diabetes in large
cohorts of U.S. women and men. Diabetes 55(9):2645-8 (2006).
Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG. Testing
association of statistically inferred haplotypes with discrete and continuous traits in
samples of unrelated individuals. Human Heredity 53:79–91 (2002).
Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, Lango H, Timpson NJ,
Perry JR, Rayner NW, Freathy RM, Barrett JC, Shields B, Morris AP, Ellard S, Groves
CJ, Harries LW, Marchini JL, Owen KR, Knight B, Cardon LR, Walker M, Hitman GA,
Morris AD, Doney AS; Wellcome Trust Case Control Consortium (WTCCC), McCarthy
MI, Hattersley AT. Replication of genome-wide association signals in UK samples
reveals risk loci for type 2 diabetes. Science 316(5829):1336-41 (2007).
Zhang H. and Bonney G. Use of classification trees for association studies. Genetic
Epidemiology 19:323-332 (2000).
129
Zhao JH, Curtis D, Sham PC. Model-free analysis and permutation tests for allelic
associations. Human Heredity 50:133–139 (2000).
Zhao LP, Li SS, Khalid N. A method for the assessment of disease associations with
single-nucleotide polymorphism haplotypes and environmental variables in case-control
studies. American Journal of Human Genetics 72:1231–1250 (2003).
Abstract (if available)
Abstract
Multivariate methods ranging from traditional joint SNP methods to principal components analysis (PCA) have been developed for testing multiple markers in a region for association with disease and disease- related traits. However, these methods suffer from low power or the inability to identify the subset of markers contributing to the evidence for association under various scenarios. In this paper, a new approach is introduced, based on PCA with an orthoblique rotation (OPCC), in order to identify specific subsets of markers genotyped in a candidate region showing associating with an outcome of interest. When compared to traditional methods, the OPCC approach has similar or improved power, but also identifies the unique subset of markers contributing to the evidence for association. The utility of OPCC is demonstrated with an example from type 2 diabetes literature.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Pathway-based analysis of multivariate traits using classical dimensionality reduction methods
PDF
Adaptive set-based tests for pathway analysis
PDF
Genetic association studies of age-related macular degeneration from candidate gene to whole genome
PDF
Genetic studies of inflammation and cardiovascular disease
PDF
Population substructure and its impact on genome-wide association studies with admixed populations
PDF
Genomic risk factors associated with Ewing Sarcoma susceptibility
PDF
Pharmacogenetic association studies and the impact of population substructure in the women's interagency HIV study
PDF
Discovery of complex pathways from observational data
PDF
Missing heritability may be explained by the common household environment and its interaction with genetic variation
PDF
Genetic risk factors in multiple myeloma
PDF
Genetic variations in gene from the cytochrome P450 family may account for differential response to pioglitazone therapy in the Hispanic women
PDF
Bayesian hierarchical models in genetic association studies
PDF
Observed and underlying associations in nicotine dependence
PDF
Genetic epidemiological approaches in the study of risk factors for hematologic malignancies
PDF
Identification and fine-mapping of genetic susceptibility loci for prostate cancer and statistical methodology for multiethnic fine-mapping
PDF
Bayesian model averaging methods for gene-environment interactions and admixture mapping
PDF
Two-step testing approaches for detecting quantitative trait gene-environment interactions in a genome-wide association study
PDF
twas_sim, a Python-based tool for simulation and power analysis of transcriptome-wide association analysis
PDF
Characterizing the genetic and environmental contributions to ocular and central nervous system health
PDF
Genetic studies of cancer in populations of African ancestry and Latinos
Asset Metadata
Creator
Black, Mary Helen
(author)
Core Title
Multivariate methods for extracting genetic associations from correlated data
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Statistical Genetics and Genetic Epidemiology
Publication Date
08/07/2010
Defense Date
06/30/2009
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
association studies,candidate gene,cluster analysis,genetic epidemiology,OAI-PMH Harvest,principal components
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Watanabe, Richard M. (
committee chair
), Allayee, Hooman (
committee member
), Buchanan, Thomas A. (
committee member
), Conti, David (
committee member
), Gauderman, W. James (
committee member
)
Creator Email
mary_helenb@hotmail.com,mhbridge@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m2530
Unique identifier
UC1356267
Identifier
etd-Black-2763 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-179175 (legacy record id),usctheses-m2530 (legacy record id)
Legacy Identifier
etd-Black-2763.pdf
Dmrecord
179175
Document Type
Dissertation
Rights
Black, Mary Helen
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu
Tags
association studies
candidate gene
cluster analysis
genetic epidemiology
principal components