Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Pathway-based analysis of multivariate traits using classical dimensionality reduction methods
(USC Thesis Other)
Pathway-based analysis of multivariate traits using classical dimensionality reduction methods
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
PATHWAY-BASED ANALYSIS OF MULTIVARIATE TRAITS USING
CLASSICAL DIMENSIONALITY REDUCTION METHODS
by
(Allen) Yu-Hsiang Shu
________________________________________________________________________
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements of the Degree
DOCTOR OF PHILOSOPHY
(STATISTICAL GENETICS)
Degree Conferral Date
May 2015
Copyright 2015 Yu-Hsiang Shu
Table of Contents
I. Background 1 – 18
I.1. Pathway analysis in complex diseases 1 – 2
I.2. Single-nucleotide polymorphism (SNP) 2 – 6
I.3. Disease-related quantitative traits 6 – 7
I.4. Testing genetic associations 7 – 12
I.5. Dimensionality reduction methods (DRMs) 12 – 16
I.6. Proposal aims 16 – 17
Tables and Figures 18
II. Univariate Regression to Test Associations between Multiple SNPs and Traits 19 – 67
II.1. Evidence for sex-specific associations between variation in acid phosphatase
locus 1 (ACP1) and insulin sensitivity in Mexican-Americans 20 – 42
II.1.1. Introduction 20 – 21
II.1.2. Methods 21 – 26
II.1.3. Results 26 – 28
II.1.4. Discussion 29 – 32
Tables and Figures 33 – 42
II.2. Additive effects of genetic variation in GCK and G6PC2 on insulin
secretion and fasting Glucose 43 – 67
II.2.1. Introduction 43 – 44
II.2.2. Methods 45 – 49
II.2.3. Results 50 – 54
II.2.4. Discussion 54 – 59
Tables and Figures 60 – 67
III. Dimensionality Reduction Methods for testing joint genetic effects on multiple traits 68 – 89
III.1. Rational for using dimensionality reduction methods (DRMs) 68 – 69
III.2. Principal component analysis (PCA) 69 – 72
III.3. Canonical correlation analysis (CCA) 72 – 75
III.4. Partial least square (PLS) 75 – 78
III.5. A unified approach to PCA, PLS, and CCA 79 – 84
III.6. Hypothesis testing with DRMs 84 – 87
III.7. Summary 87 – 88
Tables and Figures 89
IV. Evaluation of Dimensionality Reduction Methods for Testing Associations with
Multivariate Traits 90 – 134
IV.1. Sparsity of genetic effects 90
IV.2. Low-dimensional setting 91 – 104
IV.3. Low-dimension data analysis 104 – 106
IV.4. High-dimensional setting 106 – 113
IV.5. Summary 113 – 114
Tables and Figures 115 – 134
V. Adaptive CCA for Set-Based Testing with Multivariate Traits 135 – 154
V.1. Motivation 135 – 136
V.2. Adaptive rank truncated product (ARTP) 136 – 139
V.3. Adaptive CCA-based approach (ACCA) 140 – 141
V.4. High-dimensional setting 141 – 144
V.5. Other high-dimensional settings 144 – 147
V.6. Summary 147 – 148
Table and Figures 149 – 154
VI. Type 2 Diabetes Related Traits and a Glucose Cycle Pathway 155 – 160
VI.1 Background 155 – 156
VI.2 Method 156 – 158
VI.3 Results and discussion 158 – 159
Tables and Figures 160
VII. General Conclusion and Final Remarks 161 – 164
Bibliography/References 165 – 178
List of Tables/Figures 179 – 180
1
I. Background
I.1. Pathway analysis in complex diseases
Complex diseases, also known as multifactorial or polygenic disorders, such as
cancers, type 2 diabetes, and obesity, are considered to be caused by multiple genetic
variants. With more reports been published, it is recognized that each genetic variant may
only contribute a small effect to alter an individual’s disease susceptibility [Stewart,
2002; Pritchard and Cox, 2002; Lohmueller et al., 2003]. Accordingly, one mostly
acceptable notion in complex diseases aetiology is that multiple genetic variants
contribute small effects simultaneously through one or more underlying biological
pathways to alter multiple traits (Figure I.). This can be described as that, along a
pathway progress, while certain symptomatic stages were developed or quantitative traits
were accumulated to reach certain levels where an individual’s body system fails to
recover from, abnormal syndromes then became measureable and the disease could be
diagnosed. In order to explore whether a set of genetic variants and traits are involved in
a transition from health to disease based on a statistical approach, multiple variables were
measured to capture the variations in both genetic and phenotypic sides. After these
variables were measured, a statistic method would be designed to test associations
between them in the two sides. Given the fact that each variable carried different sizes of
effects, a statistical method which is capable to consider every variable in a more
appropriate way should be the one which can achieve better statistical power in the
analyses. When these associations were confirmed, we then could start to look into
whether a subset of genetic variants are associated with a subset of traits using model
2
building or variable selection approaches so as to reconstruct the structure of underlying
biological pathways.
Nowadays, newer technology was developed to obtain more detailed information
in both genotyping and phenotyping in order to provide observed variables with better
and better information coverage. Therefore, it becomes common that a large number of
genetic markers and traits are involved in a single genetic study in testing their joint
effects of an underlying pathway. As a result, methodological challenges are then posed
while data contain multiple genetic variants as predictors and multiple traits as outcomes.
I.2. Single-nucleotide polymorphism (SNP)
For measuring genetic variants, one extensively used measurement of genetic
variants is single-nucleotide polymorphism (SNP). SNPs are DNA sequence variations
occurring with differential single nucleotides across individuals. They can be mapped
across the genome for defining haplotype variation and help to identify biomedically
important genes [Sachidanandam R et al., 2001]. For statistical testing purpose, SNPs can
be measurements of observed variables by directly being converted into numerical
variables based on their corresponding reference alleles under presumed genetic models.
Due to these characteristics, genetic variants coverage of genes on a DNA can be flexibly
extended by using SNPs. However, one major issue arise while using SNPs as observed
variables in statistical analysis is that, SNPs are possible to be correlated to each other
due to linkage disequilibrium (LD).
LD is an observed phenomenon in nature that a combination of genetic markers,
3
such as SNPs, appears more often or less than what will be expected under random
formation. LD can be occurred from process of human evaluation or mating histories
across populations to a recent recombination event on one fragment of a chromosome or
a single mutation. Since SNPs are possibly correlated to each other due to LD, without an
appropriate consideration of these correlations, estimations of genetic effects and test
statistics can both be biased in modeling and testing, especially for complex diseases in
which only a small or modest effect is expected from each genetic variant. Taking the
classical simple linear regression analysis as an example, numbers of correlated SNPs
were tested separately with independent regression models and all of them were found
with positive results. The fact behind this result could be that only one or two SNPs have
true causal effects, and the other SNPs also returned positive results due to their
correlations with those causal SNPs. Or, when all the SNPs actually carried effects in the
analysis, it was known that using Bonferroni or Sidak corrections to correct p-values
from multiple testing will result in reduced statistical power in the case of correlated loci
[Roeder et al., 2005]. On the other hand, testing SNPs jointly in a multiple regression
model might avoid false positive and provide more accurate estimates of effects for
SNPs. Nevertheless, collinearity and lack of degrees of freedom would become a new
issue to reduce the statistical power to detect genetic effect when large numbers of
correlated SNPs are being tested.
One commonly used method to consider LD among SNPs in statistical analysis is
tag SNPs [Stram, 2004, de Bakker et al., 2005]. Tag SNPs can be selected from a group
of those in LD and then be used to represent their genetic variations in a region of
genome after defining r
2
thresholds and tagging methods [de Bakker et al., 2005].
4
Accordingly, performing association tests with relatively fewer number of selected tag
SNPs is efficient in terms of reducing the number of tests and thus increase statistical
power [de Bakker et al., 2005]. Also, because a set of correlated SNPs are represented
with fewer tag SNPs, correlations among the tag SNPs included in association tests are
also reduced. However, due to the fact that a set of tag SNPs were selected under a given
r
2
threshold, the correlations among tag SNPs and the genetic variations those tag SNPs
could represent vary with what threshold was manually chosen. Using higher r
2
thresholds could make tag SNPs be more represented to their tagging SNPs, but then
many tag SNPs would be selected and there still might have high correlations among
them. On the contrary, fewer tag SNPs with less correlations can be selected with lower
r
2
thresholds, but their capability to represent tagging SNPs could be very poor.
Furthermore, LD among SNPs does not have inevitability to indicate genetic associations
to disease or traits. For example, it is possible that two SNPs in high LD can have totally
independent effects to one or different traits. Accordingly, selecting one tag SNP over
another in the two will be sure to lose partial information, or even lead to a biased
interruption of the genetic effects to traits.
Another commonly used approach in considering correlations of SNPs is
haplotype analysis [Schaid et al., 2002; Zaykin et al., 2002]. Unlike tag SNPs, haplotypes
are considered to be more meaningful in both biological and statistical views. In biology,
a haplotype is defined as a DNA sequence of loci on a chromosome which were
transmitted together from generations to generations. A haplotype could be found as one
locus, several loci, or a region of a chromosome depending on the number of
recombination events that have occurred between a given set of loci. Due to this
5
biological fact that these loci are linked during transmission and thus may have some
amount of LD, a haplotype can be constructed with a sequence of alleles of those linked
loci [Schaid, 2004]. Following this concept, multiple SNPs sitting next to each other in a
region can be combined together as a unit of haplotypes, and then which can be used to
test for genetic associations with diseases or disease-related traits. In statistical analysis,
there are two common ways to define haplotype blocks in a given region. One is Gabriel
method [Gabriel et al., 2002] which defines a haplotype block based on the confidence
bounds of D’, and the other one is Four Gamete Rules [Wang et al., 2002] which
determines blocks by counting whether all four gametes are represented in the data at
give locus. Similar to tag SNPs, using haplotypes to represent multiple SNPs can reduce
the number of tests and thus achieve a higher statistical power in testing. Moreover, with
the biological meaning, haplotypes may reveal possible transcriptional and regulatory
function for a gene or gene product [Clark, 2004]. However, haplotypes are possible to be
defined as different combinations of SNPs in different data and thus increase the
inconsistency across studies. This issue can be viewed as how well a statistically defined
haplotype can be used to estimate the one with biological meaning. Under different study
designs, populations, genotyping density, methods of haplotype assignment, and
requirements of Hardy Weinberg Equilibrium (HWE) assumption, the reliability of
individual haplotype estimations varies with a wide range, and thus changes the
associations with disease or disease-related traits [Stephens et al., 2001, Allen and Satten,
2008]. In addition, haplotypes are defined only based on information along the sequence
of SNPs, and which is not necessary to indicate the genetic effects to traits. Therefore, a
haplotype may contain redundant SNPs which have no effects at all. Consequently, such
6
haplotypes may actually dilute the weightings of the SNPs with true causal effects. These
can lead to not only loss of statistical power but also substantial bias in terms of
interruptions of associations.
Tag SNPs are the SNPs selected from a large pool of them in a region, and
haplotypes are combinations of a given set of alleles of SNPs. Both methods can be
viewed as a type of variable selection strategy in statistical analysis. Although
correlations between SNPs can be reduced after these selections, these two methods
provide no information to help understand how well the issue of SNPs correlations can be
controlled in association testing. In the fact, correlations among tag SNPs or haplotypes
still remain at an uncertain level in the following up statistical analysis.
I.3. Disease-related quantitative traits
For measuring phenotypic variants, it is needed to define multiple quantitative
traits to represent different stages along an underlying pathway, especially for a complex
disease that exhibits a syndromic phenotype that is not fully captured by a single binary
or quantitative trait. Quantitative traits, or called polygenic inheritance, refer to
inheritance of a phenotypic characteristic that is attributable to more than one gene or the
interactions between biological or environmental factors. Unlike monogenic traits,
quantitative traits do not follow patterns of Mendelian inheritance. They are typically
measured to vary alone a continuous distribution. However, since traits are usually
defined from measurements of bodies or clinical examinations for disease diagnosis, they
are very likely to be correlated with each other. For example, an individual’s weight is
possibly to be correlated to the height. Also, traits may have causal effects to each other
7
when considering them in a disease pathway. For example, an individual’s blood glucose
level can be correlated with the insulin level. Therefore, similar to SNPs, the issue of
correlations among traits appeared in statistical analysis as well.
I.4. Testing genetic associations
Candidate gene approaches were introduced for testing associations between
genetic variants and disease or disease-related quantitative traits in the early phase of
pathway analysis development [Risch and Merikangas 1996; Collins et al. 1997]. A
candidate gene is a gene which was considered to be involved in an underlying biological
pathway related to the risk of diseases, such as the gene’s protein product is suggested to
be suspected of being involved in the expression of a disease or trait. Based on the
understanding of potential biological background in disease etiology, several genes can
be pre-selected and then genotyped for testing possible associations. However, in some
cases, after excluding environmental factors and study populations differences, these
results were still not as consistent as expected across studies. After more findings were
reported, it is usually suggested that there can be existing disease susceptibility genetic
variants which miss from previous genotyping or locate outside the coding region of the
genes [Cargill et al., 2002; Halushka et al., 2002; Cheung and Spielman, 2002].
Accordingly, increasing the number of genetic markers, such as SNPs, to have denser and
wider coverage is required to improve the coverage of this piece of information. Also,
because each genetic variant may only explain a small or modest effect to a trait in a
pathway, relatively small penetrance makes the associations become harder to be detected
[Stewart, 2002; Pritchard and Cox, 2002; Lohmueller et al., 2003].
8
Correcting significance levels without considering the underlying correlations
between genetic markers and traits was known to cause loss of statistical power. In
addition, several studies raise one consideration that some genetic variants may only be
detected with significant associations when testing them jointly because they interact
with each other or environmental factors for certain traits [Stewart, 2002; Weedon et al.,
2006]. Therefore, testing the joint effect of a group of SNPs in a gene or pathway on
multiple related traits becomes a more reasonable statistical approach in pathway
analysis.
The simplest and also widely applied statistic method for testing genetic
associations is a simple linear regression model-based univariate analysis, or known as
univariate regression analysis. With genetic variants, such as SNPs, at one side as
independent variables, and quantitative traits at another side as dependent variables, a
simple linear regression model can be constructed to test whether a SNP has significant
effect on a trait. For one individual participated in a study, the simple linear regression
equation can be expressed as:
𝑦 = 𝑎 + 𝑏𝑥 + 𝑒
where y is the individual’s trait value and x is the SNP value. a is the intercept, b is the
estimated beta parameter, or viewed as the genetic effect of that SNP. e is the value
remains after considering intercept and genetic effect, or called residual. After centering
the dependent variable, this model can be express with vectors as a more general equation
format that includes all individuals in the study:
9
𝐲 𝑛 ×1
= 𝐱 𝑛 ×1
𝑏 + 𝐞 𝑛 ×1
(I-1)
where y is a vector of trait and x is a vector of SNP for the total n individuals. Residuals
are represented as vector e. Traits follow a continuous distribution, and SNPs are coded
as 0, 1, or 2 for the respective number of its reference allele under additive genetic model.
Either Wald test or likelihood ratio test (LRT) can be applied to obtain test statistics for b.
When more than one SNP or trait are considered in the same time, such as a potential
pathway involves a group of genes or multiple traits, this regression model will be
repeated with the number of all possible combinations between each SNP and trait.
Accordingly, multiple comparisons are required to adjust the significance level for
performing test in each model. Since SNPs are possibly correlated due to their LD
patterns or haplotypes structures, in many genetic studies, it was often proposed that
independent variables in this regression model should be replaced as tag SNPs or
estimated haplotypes. These approaches would possibly help gain statistical power
because the number of tests decreases due to testing with less number of independent
variables. However, as previously described, tag SNPs or haplotypes were still likely to
be correlated, so the issue of correlation among independent variables still remains. In
addition, quantitative traits which are dependent variables could be correlated as well.
Therefore, assuming each regression model is independent and performing tests with
correction for multiple comparisons will result in either biased estimates of genetic
effects or loss of statistical power [Roeder et al., 2005].
When a trait was presumed to be altered by multiple genes simultaneously or
modified by underlying interactions between genes in a pathway, it is straightforward to
10
include more than one SNP in a regression model in order to examine genetic variants
jointly, which is also known as a multiple linear regression model. The regression
equation can be extended from (1) and expressed as:
𝐲 𝑛 ×1
= 𝐗 𝑛 ×𝑚 𝐛 𝑚 ×1
+ 𝐞 𝑛 ×1
(I-2)
where X is a matrix with the n individuals and m SNPs, and b is a beta vector which has
m estimated effects of SNPs. LRT can be applied to obtain test statistics for this model.
When multiple SNPs were included into one regression model, single SNP effect would
be estimated conditional on the other SNPs as covariates. This model can help to clarify
whether an estimated effect of one SNP was confounded or masked due to its correlation
with the other SNPs. In addition, because multiple SNPs are tested at once in a model,
using multiple linear regression model can help gain more statistical power than single
linear regression mode while multiple SNPs have small or modest effects, and which is a
conceivable circumstance in complex diseases. However, due to the correlations among
independent variables, such as SNPs, tag SNPs, or haplotypes, collinearity could make
genetic effects to be incorrectly estimated and thus resulted in losing statistical power.
Besides, when a large number of SNPs were considered to test a hypothesis with a
limited sample size, a multiple regression model might be failed. Or it was required to
drop some independent variables to satisfy the degrees of freedom, and which would
result in either loss of information and lower statistical power.
Upon previous regression models, when multiple traits as dependent variables
were also considered to be tested jointly, we could use multivariate simple linear
regression or multivariate multiple linear regression model. The regression equation for
11
multivariate simple linear regression model can be extended from (2) and expressed as:
𝐘 𝑛 ×𝑡 = 𝐱 𝑛 ×1
𝐛 1×𝑡 + 𝐄 𝑛 ×𝑡
(I-3)
where Y is a matrix with n individuals and t traits, and b is a vector of genetic effects of
one SNPs versus t traits. Accordingly, extending to multiple SNPs, the multivariate
multiple linear regression equation can be expressed as:
𝐘 𝑛 ×𝑡 = 𝐗 𝑛 ×𝑚 𝐁 𝑚 ×𝑡 + 𝐄 𝑛 ×𝑡
(I-4)
where B is a beta matrix of genetic effects of m SNPs versus t traits. In both models, the
least squares estimate for the beta parameters can be obtained by solving the normal
equations as in multiple regressions. We can apply multivariate analysis of variance
(MANOVA) for testing the models, which is essentially testing SNPs and traits jointly.
Consequently, when pathways analysis involved a number of SNPs and traits,
multivariate multiple linear regression seems to be the most appropriate approach to use.
Nevertheless, similar to multiple linear regression models, collinearity in SNPs could
cause the beta matrix to be estimated incorrectly. Furthermore, because SNPs and traits
are now considered jointly in one model, lack of degrees of freedom would still be an
issue when performing MANOVA.
In summary, there are four typical linear regression-based modeling methods: one
SNPs to one trait (simple linear regression), multiple SNPs to one trait (multiple linear
regression), one SNP to multiple traits (multivariate simple linear regression), and
multiple SNPs to multiple traits (multivariate multiple linear regression). Each of them
12
can be used to test the hypothesis that whether a group of SNPs are associated with a
group of traits along a potential biological pathway. However, the issues of power loss
due to correlations among SNPs and traits and lack of degrees of freedom in modeling
remains unsolved.
I.5. Dimensionality reduction methods (DRMs)
In recent years, some statistical methods have been proposed to address the issue
of correlated SNPs and traits in testing genetic associations. There are two commonly
seen concepts to handle the issue. One is to convert the original variables with high
correlations into less correlated or independent linear components using dimensionality
reduction methods (DRM), and then perform tests on those linear components [Lee and
Shu, 2004; Guaderman et al, 2007]. Another concept is to correct significant levels of
hypothesis testing by taking the correlations of variables into account [Conneely and
Boehnke, 2007; Yu et al., 2009]. In this proposal, we focused on the method of applying
DRM in testing genetic associations.
Principal component analysis (PCA) is one basic DRM. It has been successfully
applied in several fields in genetic analysis, including handling population admixture and
testing genetic effects [Lee and Shu, 2004; Price et al., 2006; Guaderman et al, 2007].
PCA requires no additional information other than what the original dataset already
contained, and is easy to be implemented into the original analysis framework. The
concept of PCA approach is to obtain PCs (principal components, also named as latent
variables or linear components in a general definition in DRM) from SNPs and/or traits
using PCA, and then perform tests between those PCs instead of original variables. When
13
applying PCA in a multiple linear regression model (equation I-2), it was to simply
replace original x with PCs of x in the model and then test the associations between the
PCs and a trait. Similarly, this method can be easily extended to the other types of
regression models.
For genetic variants, applying PCA to handle LD of SNPs can gain several
advantages to cover the issues in using tag SNPs or haplotypes [Gauderman et al, 2007].
First, the first few PCs could normally capture a large proportion variance of SNPs, and
consequently the number of association tests was able to be reduced by only testing those
PCs. Second, the proportion of variance of SNPs explained by PCs could be measured
with eigenvalues. This information usually remained uncertain or unknown in testing
with tag SNPs or haplotypes. Third, PCs are statistically independent to each other, so
using Bonferroni or Sidak corrections to adjust significance level for multiple
comparisons becomes more reasonable in testing the jointly genetic effects. Nevertheless,
one major limitation of applying PCA in genetic association tests is the loss of
interpretability beyond testing associations between a candidate region and disease
outcomes. The betas obtained from regression models using PCs only indicated the
effects for each PC score, and each PC score was generated from summing all SNPs after
weighting them by the corresponding eigenvectors. Therefore, effects of each SNP to
disease or disease-related traits were not directly measured or tested in the model. Several
studies proposed an idea that some level of interpretability may be revealed by
eigenvector loadings of SNPs in each PC [Atkinson et al., 2007; Chase et al., 2007;
Guaderman et al, 2007], or used rotation methods to modify eigenvector loadings in order
to enhance the discriminability across PCs [Black and Watanabe, 2011]. However, these
14
methods could be only working in a scenario when the directions of PCs of SNPs
happened to match with the directions of traits. More specifically, eigenvectors of one PC
simply reflected the direction of that PC to explain the variance of SNPs in the variables
space, and the process of finding that direction was irrelevant to how the SNPs were
associated to traits. Apparently, such scenario was not guaranteed to be true in the real
data analysis. Furthermore, while more SNPs were included in a PCA, the values of
eigenvector loadings would lose more contrast either within or between each obtained PC
and thus each PC increasingly lose the interpretability.
Similar with the concept that applying PCA in correlated SNPs, the PC-based
approaches were also developed in handling correlated traits. Traits are the dependent
variables in a regression model, and which can also be clustered into relatively fewer
linear components. Opposite to linear components of SNPs, each linear component of
traits would usually be given a meaningful definition based on the traits clustered within
it. Several studies have showed that PCA or factor analysis (FA) could be properly used
to cluster disease-related traits [Meigs et al., 1997; Gray et al., 1998; Jurg and Daniel,
1999]. In addition, FA was shown to be able to examine the possible relations among
metabolic risk variables [Edwards et al., 1994; Arya et al., 2002]. Because a relatively
small number of factors were selected from multiple traits and the factors are independent
to each other, testing associations with factors of traits would be more efficient (reduce
the number of tests and increase the data variance included in each test) and thus
expected to achieve higher power. Due to the fact that the number of factors and the
corresponding factor loadings vary from different rotation methods and factors selection
criteria, the interpretability of factors of traits is still debatable.
15
Based on the previous studies of using PCA in genetic analysis, some researchers
then looked into the potential in using the other advanced types of DRMs, such as partial
least square (PLS) and canonical correlation analysis (CCA). PCA only seeks the linear
components based on maximizing the variance of a set of variables (either independent or
dependent variables), and which do not truly reflect the association between two sets of
variables (independent and dependent variables). PLS and CCA define linear components
from maximizing the covariance and correlations between the two sets of variables
[Hotelling, 1936; Geladi and Kowalski, 1983]. Therefore, it is expected that the linear
components obtained from PLS and CCA may capture more variation related to genetic
associations. Several studies have proposed to use variable and feature selection in PLS
and CCA to build prediction model for supporting an underlying biological pathway
[Waaijenborg and Zwinderman, 2009; Wang et al., 2009]. Some other studies then
introduced the concept of constructing a formal statistical test based on PLS or CCA to
jointly test associations between multiple genetic variants and traits [Chun and Keles,
2009; Zhang et al., 2011; Turkmen and Lin, 2011].
Overall, PCA, PLS, and CCA can be viewed as that each of them seeks one
specific eigenvalue solution in data covariance matrix [Borga, 1992]. Therefore, it is
important to thoroughly review the nature of each method and understand how the
process of seeking eigenvalue solutions really helped to improve the performance of
genetic association tests. Accordingly, it is also important to explore and compare their
performances across different data characteristics, including correlations between
variables and sparsity of genetic effects. However, no systematical examinations and
comparisons made to address this topic were found from the previously published
16
studies.
I.6. Proposal aims
In Chapter II, I provided two published papers from our previous research works
which used candidate gene approaches to test genetic associations under presumed
biological pathways. In the two studies, it was observed that there had potential power
loss and limitations to detect genetic associations using traditional statistical approaches.
This became my motivation to develop an approach which is able to detect genetic
associations between multiple genetic markers and traits involved in a underlying
biological pathway with higher statistical power.
In Chapter III, I discussed the methodology of PCA, PLS, and CCA from their
traditional calculations to a unified approach to all of them. Based on the discussion of
the unified approach, I proposed a new statistical approach which is able to perform
association tests using the three DRMs in a very fast and efficient fashion.
Accordingly, in Chapter IV, I used this new approach to build an efficient
simulation design to demonstrate how the nature of these three DRMs impacted the
performance in doing association tests under different simulation scenarios. Next, we
extended the simulation design into a large number variables framework. Finally, we plan
to build a simulation with implements of real data information.
In Chapter V, I proposed to use another statistical concept reported in a pathway
analysis method to further improve the power of the previously proposed approach. In the
last chapter (Chapter VI), I plan to use the improved new statistical approach in a real
17
data analysis, where the data represent a presumed biological pathway in one published
papers in the Chapter II. I expected the newly proposed approach can help to better detect
the genetic associations in the pathway.
18
Table and Figures (Chapter I.)
Figure I. Two simple conceptual biological pathways (circle indicates measurable variables; lines
indicate relationships; arrows indicate causal directions; S: stage, T: trait, G: gene). a) Disease
development begins from stage one. Following this pathway, genes contribute effects and alter
the values of corresponding traits at each stage, and then finally cause diseases. ex. tumor cells
development for cancers. b) Pathway runs as a cycle between stages. Trait two increases to enter
stage two and then body starts to release trait one. Trait one is to suppress the level of traits two in
order to keep their balance in body. Disease is diagnosed when trait two reach a certain level. ex.
glucose cycling for type 2 diabetes.
a) One-direction pathway
b) Cycling pathway
G1
T1
S1
G2
T2
S2
G3
T3
S3 Disease
G1
T1
S2
S1
G2
T2 Disease
19
II. Use Univariate Regression Model to Test Associations between Multiple SNPs
and Traits
I worked with the BetaGene Study which has been exploring genetic associations
between candidate genes and type 2 diabetes quantitative traits under presumed
biological pathways using univariate regression models. The following are my two papers
published in 2009.
The first paper was proposed to estimate genetic effects of one candidate gene
(ACP1) to a set of type 2 diabetes traits [Shu et al., 2009]. Both tag SNPs and estimated
haplotype were applied to capture genetic variation along ACP1 region. A simple linear
regression model design was used to test the genetic associations. The second paper was
looking into an interaction between two candidate genes (GCK and G6PC2) to a set of
type 2 diabetes traits using multiple regression models based on the underlying glucose
cycling pathway [Li and Shu et al., 2009]. In both papers, significance levels in
hypothesis testing were corrected using Bonferroni corrections because of multiple
comparisons. It was found that the SNPs and traits are correlated and the genetic effects
are relatively small, so we would expect to suffer the power loss in testing genetic
associations after adjusting for multiple comparisons, especially when using Bonferroni
corrections. Therefore, these works became a part of the motivations making me desire to
develop a new statistical approach to test genetic associations in a presumed biological
pathway with improved statistical power.
20
II.1. Evidence for sex-specific associations between variation in acid phosphatase
locus 1 (ACP1) and insulin sensitivity in Mexican-Americans [Shu et al.,
2009]
II.1.1 Introduction
Protein tyrosine phosphatases (PTPs) regulate insulin signaling and therefore are
functional candidate genes for type 2 diabetes mellitus (T2DM) and adiposity. Genetic
studies thus far have focused on PTP1B which is encoded by PTPN1 [Bento et al., 2004;
Palmer et al., 2004; Pei et al., 2004], a gene shown to be associated with T2DM and
T2DM-related traits [DiPaola et al., 2002; Bento et al., 2004; Palmer et al., 2004].
However, the low molecular weight protein tyrosine phosphatase (LMPTP) is emerging
as a second important negative regulator of insulin signaling [Hopkinson et al., 1963;
Spencer et al., 1964; Bottini et al., 1990; Lucarini et al., 1990; Bottini et al., 1995].
LMPTP, a Class II PTP [Alonso et al., 2004] encoded by acid phosphatase locus 1
(ACP1), was originally isolated as an acid phosphatase in red blood cells [Hopkinson et
al., 1963], and was later independently isolated from adipocytes where it is highly
expressed [Bottini et al., 1990; Lucarini et al., 1990; Paggi et al., 1991]. ACP1 efficiently
inhibits both metabolic and growth signaling through the insulin receptor [Taddei et al.,
2000; Raugei et al., 2002]. Recently, Pandey et al. showed correction of metabolic
abnormalities associated with adiposity in mice by knocking down ACP1 using antisense
oligonucleotides in liver and adipose tissue [Pandey et al., 2007]. The ACP1 knock-down
resulted in a phenotype similar to the PTP1B
-/-
mutation [Elchebly et al., 1999],
suggesting ACP1 also directly dephosphorylates the insulin receptor. Overall, the
21
evidence suggests ACP1 could be a critical regulator of insulin signaling in adipose tissue
and a promising drug target to treat obesity and obesity-associated metabolic
abnormalities.
The BetaGene Study is a family-based study with the goal of identifying genes
influencing variation in T2DM-related quantitative traits. In BetaGene we are performing
detailed phenotyping of Mexican American probands with recent GDM and their family
members to obtain quantitative estimates of body composition, insulin sensitivity (S
I
),
acute insulin response to glucose (AIR), and -cell compensation (disposition index, DI).
Variability in these traits is involved in the pathogenesis of T2DM [DeFronzo, 1987;
Bergman, 1989] and heritable in our Mexican American families [Watanabe et al., 2003].
Based on the evidence that ACP1 variation may regulate insulin signaling in adipose
tissue [Lucarini et al., 1998; Faggioni et al., 2002; Alonso et al., 2004; Iannaccone et al.,
2005], we examined genetic variation in the ACP1 region and tested the variants for
association with T2DM-related quantitative traits and adiposity in our sample of Mexican
Americans.
II.1.2. Methods
Subject recruitment
At the time of these analyses, subject recruitment for BetaGene was on-going and
for the purpose of this report we describe only those subjects, clinical protocols and
assays related to the results presented herein. Subjects are Mexican American (both
parents and ≥3 grandparents Mexican or of Mexican descent) who are either probands
22
with GDM diagnosed within the previous five years and their family members, or
probands with normal glucose levels in pregnancy in the past five years. All probands are
identified from the patient populations at Los Angeles County/USC Medical Center, the
Kaiser Permanente Southern California health plan membership, and OB/GYN clinics at
local hospitals. Details regarding family recruitment have been previously detailed
[Watanabe et al., 2007]. Briefly, GDM probands are recruited for phenotyping if they, (a)
have a confirmed diagnosis of GDM within the previous five years, (b) have glucose
levels associated with poor pancreatic β-cell function and a high risk of diabetes when not
pregnant [Kjos et al., 1995], (c) have no evidence of β-cell autoimmunity by GAD-65
testing, and (d) have available for study at least two siblings. Non-GDM probands are
recruited if they had a 1-hour 50 g glucose screening result <130 mg/dl (7.2 mM) during
their most recent pregnancy, have no family history of diabetes, and have normal glucose
tolerance. The non-GDM probands are frequency-matched to GDM probands by age,
BMI, and parity categories.
All protocols for BetaGene have been approved by the Institutional Review
Boards of participating institutions and all participants provided written informed consent
prior to participation.
Clinical protocols
Phenotyping was performed on two separate visits to the University of Southern
California General Clinical Research Center. The first visit consisted of a physical
examination, DNA collection, and a 75 g 2-hour oral glucose tolerance test (OGTT) with
blood samples obtained before and 30, 60, 90, and 120 min after glucose ingestion.
23
Participants in GDM families with fasting glucose <126 mg/ml and non-GDM probands
with normal fasting and 2-hour glucose levels were invited for a second visit, which
consisted of a dual-energy X-ray absorptiometry (DEXA) scan for body composition and
an insulin-modified intravenous glucose tolerance test (IVGTT) analyzed by minimal
model (MINMOD Millennium V5.18) to quantify glucose effectiveness (S
G
), insulin
sensitivity (S
I
), acute insulin response (AIR), and the disposition index (DI=S
I
×AIR)
[Bergman et al., 1979].
Assays
Plasma glucose is measured on an auto-analyzer using the glucose oxidase
method (YSI Model 2300, Yellow Springs Instruments, Yellow Springs, OH). Insulin is
measured by two-site immunoenzymometric assay (TOSOH) that has <0.1% cross-
reactivity with proinsulin and intermediate split products.
Molecular analysis
We screened the ACP1 region by selecting polymorphic SNPs from HapMap
(http://hapmap.org/) and dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) at approximately
~2.5 Kb intervals across the coding region of the gene, 40 Kb upstream, and 10 Kb
downstream from the gene. The extra 50 kb beyond the coding regions were examined to
ensure we did not miss potential nearby regulatory elements. We preferentially selected
SNPs that had minor allele frequency (MAF) >5% in all four HapMap populations. In
addition to the 13 SNPs selected from HapMap and dbSNP, we also include 2 SNPs
believed to have functional consequence (rs11553742 and rs7576247) [Bottini et al.,
24
1990]. All SNPs were genotyped using the Applied Biosystems, Inc. (ABI) TaqMan
system [Livak, 1999; Livak 2003]. Genotyping assays were either selected through ABI’s
“Assays on Demand” database (http://myscience.appliedbiosystems.com/navigation/
mysciapplications.jsp) or custom designed using ABI’s “Assays by Design” service.
Data analysis
Genotype data were tested for deviation from Hardy-Weinberg equilibrium
(HWE) and for non-Mendelian inheritance using PEDSTATS V0.6.4 [Wigginton and
Abecasis, 2005]. Allele frequencies were estimated using all available data taking into
account relatedness using SOLAR V2.1.4 [Blangero and Almasy, 1997]. Haplotype
frequencies were estimated using FUGUE (http://www.sph.umich.edu/csg/abecasis/
FUGUE/). Pair-wise linkage disequilibrium (LD) and haplotype block structure were
assessed using Haploview V4.1 [Barrett et al., 2005]. Haplotype block structure was
assessed using the method of Gabriel et al [Gabriel et al., 2002]. Tag SNPs were selected
from among the genotyped SNPs using aggressive tagging with 2- and 3-maker
haplotypes in TAGGER as implemented in Haploview [Barrett et al., 2005]. We used the
default settings of pair-wise r
2
≥0.8 and LOD threshold of 3.0 for multimarker.
Quantitative trait data were statistically transformed to approximate univariate
normality prior to analyses. Each tag SNP was then tested individually for associations
with T2DM-related traits using likelihood ratio testing under a variance components
framework using SOLAR [Blangero and Almasy, 1997]. Because families were
ascertained on probands with or without previous GDM, we corrected for ascertainment
bias by conditioning on the proband’s phenotype. SNPs were coded by the number of
25
minor alleles under an additive genetic model, and the univariate association between tag
SNPs and traits was adjusted for age and sex. Any result under the additive model that
showed a trend for association, defined as p=0.1 Bonferroni-corrected for the number of
SNPs examined (corrected p=0.017), was subsequently tested for association under
dominant and/or recessive genetic models.
Given prior evidence that variation in ACP1 may be associated with adiposity
[Bottini et al., 1990; Paggi et al., 1991; Shekels et al., 1992], we performed an
independent series of tests to examine whether adiposity might alter the association
between variation in ACP1 and T2DM-related traits. Specifically, we compared a model
with age, sex, and percent body fat as main effects with a second model that included a
main effect for the SNP and the multiplicative interaction between the SNP and percent
body fat. We tested for the overall gene effect using a 2-df test and also tested for a
significant interaction effect using a 1-df test.
We also considered the possibility that sex-specific effects may exist, given
previous observations on serum triglyceride levels in females [Bottini et al., 2002].
Therefore, we repeated the test of association between variation in ACP1 and T2DM-
related traits stratifying by sex.
For the interaction and sex-stratified analyses, we applied a Bonferroni correction
to account for the multiple SNPs and traits tested (6 SNPs × 14 traits = 84 for the
adiposity and sex interactions and 3 SNPs × 14 traits = 42 for sex-stratified). However,
Bonferroni assumes independence among tests and there exists correlation both among
the SNPs and among the quantitative traits we examined. Therefore, the Bonferroni-
26
corrected p-values we report should be overly conservative. Therefore, we also applied
the P
ACT
approach [Conneely and Boehnke, 2007] to adjust p-values for our most
significant results. P
ACT
adjusts significance levels accounting for the correlation among
both SNPs and traits across all tests.
All results are reported at age and sex-adjusted means and standard deviations,
unless otherwise specified.
II.1.3. Results
We report results from 1035 individuals in 339 families for whom both phenotype
and genotype data were available. Subject characteristics are shown in Table II.1.1. In
general, GDM probands, siblings, and cousins were similar in median age, BMI, and
percent body fat, although these characteristics tended to be highest in the GDM
probands and lowest in the cousins. The median BMI exceeded the threshold for
overweight (25 kg/m
2
) in all three groups and exceeded the threshold for obese (30 kg/m
2
)
only in the GDM probands. The non-GDM probands were of similar age as GDM
probands, but less likely to be obese, reflecting the fact that BetaGene participant accrual
was on-going at the time of analysis and recruitment of non-GDM probands was lagging
behind GDM probands to allow for matching as described above.
Among the 15 SNPs genotyped, the assay for rs10171111 failed and rs11691572
was monomorphic in our sample. The LD and haplotype block structure for the
remaining 13 SNPs are shown in Figure II.1.1. In Mexican Americans, these 13 SNPs
form a single 30.3 kb haplotype block that encompasses and extends beyond the ACP1
27
coding region. This block structure is similar to what is observed in the HapMap CEU
population (www.hapmap.org). The SNPs in this region formed 33 total haplotypes, with
5 having frequencies >5% (Table II.1.2). The three most frequent haplotypes had similar
frequencies in our sample of Mexican Americans (21.3-27.4%), followed by one
haplotype with frequency of ~11%, and another with a frequency of ~8%.
Six SNPs were selected as tags for the ACP1 region (Table II.1.3) and were tested
for association with T2DM-related quantitative traits. Among the 6 tag SNPs, three
(rs11553742, rs3828329, rs10167992) showed nominal evidence for association with
T2DM-related quantitative traits (Table II.1.4). The strongest associations were observed
between rs10167992 and BMI (p=0.009), rs3828329 and S
I
(p=0.008), and rs11553742
and triglycerides (p=0.005) under an additive genetic model. However, none of these
associations remained significant after correction for multiple testing. Similarly, when we
tested whether the association between tag SNPs and T2DM-related quantitative traits
was modified by percent body fat, we observed nominal associations that became non-
significant when corrected for multiple testing (data not shown).
We used a 2-df test to assess whether tag SNPs were associated with T2DM-
related quantitative traits in the presence of an interaction with sex. The interaction
between sex and rs11553742, rs3828329, and rs10167992 all showed nominal evidence
for association with one or more traits (Table II.1.5). We further explored the association
between these three tag SNPs and T2DM-related quantitative traits by stratifying on sex
and observed a striking difference between males and females (Table II.1.6- II.1.8, Figure
II.1.2).
28
Among females, we observed nominal associations between rs11553742 and
triglycerides (p=0.004), but this did not remain significant after Bonferroni correction for
multiple testing (Table II.1.6). Thus, variation in ACP1 did not appear to have any
detectable effect on T2DM-related quantitative traits in females (Table II.1.6). However,
among males rs3828329 showed nominal evidence for association with BMI (p=0.036),
percent body fat (p=0.002), fasting and 2-hour glucose (p=0.025 and p=0.014,
respectively), fasting and 2-hour insulin (p=0.0002 and p=0.001, respectively), 30’
S
I
(p=0.0005), and disposition index (p=0.032) (Table II.1.7).
rs11553742 only showed marginal association with AIR (p=0.061), while rs10167992
showed weak association with several T2DM-realted traits (p<0.049), with the strongest
association being with S
I
(p=0.004) (Table II.1.7).
After correction for multiple testing, rs3828329 remained significantly associated
with fasting insulin (Bonferroni p=0.007, P
ACT
=0.013) and S
I
(Bonferroni p=0.019,
P
ACT
=0.034) and marginally associated with 2-hour insulin (Bonferroni p=0.058) and
percent body fat (Bonferroni p=0.09) (Table II.1.7). Each copy of the rs3828329 T allele
increased fasting insulin by ~13.1 pM (~24.7%/allele) in males, but modestly decreased
fasting insulin by ~2.5 pM (~4.6%/allele) in females (Figure II.1.2, Table II.1.8). The
increase in fasting insulin with number of T alleles in males could reflect compensation
for increasing insulin resistance as demonstrated by the decrease in S
I
(0.39 S
I
units per
allele or ~13.3%/allele) that is conferred by the T allele (Figure II.1.2, Table II.1.8). The
association between rs3828319 and both fasting insulin and S
I
in males remained
significant even after the additional adjustment for percent body fat (p=0.005 and
p=0.020, respectively).
29
II.1.4. Discussion
In the present study we examined whether variation in ACP1, an acid phosphatase
involved in insulin signaling, was associated with T2DM-related quantitative traits. To
our knowledge, this is the first report in which variation at the ACP1 locus was
comprehensively tested for association with such traits in humans. Previous reports have
been restricted to analysis of either biochemical genotypes or single SNPs [Gloria-Bottini
et al., 1996; Lucarini et al., 1997; Bottini et al., 2002]. Our analysis reveals that
rs3828329, which is located in the 3’ UTR region of ACP1, is associated with fasting
insulin and S
I
in a sex-specific manner. The T allele of rs3828329 appears to increase
fasting insulin and decrease S
I
in males, but not females.
Sex-specific effects of ACP1 have been reported for other phenotypes [Bottini et
al., 2002; Bottini et al., 2005; Gloria-Bottini et al., 2007]. For example, Bottini et al.
showed that variation in ACP1 was associated with increased age at onset of type 1
diabetes in male children and decreased age at onset in female children [Bottini et al.,
2002]. In a separate study, they also showed that variation in ACP1 was associated with
triglycerides in obese females [Bottini et al. 2002]. We also found nominal evidence for
association between rs11553742 and triglycerides in female participants in BetaGene
(Table II.1.6), although the association became non-significant after correction for
multiple testing. Gloria-Bottini et al. showed that the ACP1 biochemical genotype
ACP1*A was associated with increased susceptibility to allergy in females as compared
to males [Gloria-Bottini et al., 2007]. Mechanisms underlying the various sex-specific
effects of ACP1 remain to be determined.
30
rs11553742 and rs7576247, which are reported to be correlated with the levels of
ACP1-F (also called LMPTP-A) and ACP1-S (also called LMPTP-B) isoenzymes
[Faggioni et al., 2002], respectively, did not show any evidence for association with
T2DM-related quantitative traits. It is also noteworthy that LD between rs3828329 and
both rs11553742 (r
2
=0.073) and rs7576247 (r
2
=0.105) is very weak. Additionally,
rs11691572, which results in a Asn Lys substitution at codon 7, was monomorphic in
our sample, consistent with observations from the HapMap populations
(www.hapmap.org). Thus, our analysis suggests that the functionally relevant variant(s)
associated with metabolic phenotypes in Mexican Americans is distinct from the
previously described biochemical alleles.
Given our study population, it is possible that population stratification may
confound our analyses. However, two lines of evidence suggest this is not the case. First,
our recruitment strategy was such that both parents and 3 of 4 grandparents of probands
were required to have been born in Mexico. While such selection does not guarantee a
sample devoid of population substructure, it should minimize the level of admixture in
our sample. Second, when we re-analyze the association for rs3828329 with fasting
insulin and S
I
in males using FBAT [Rabinowitz and Laird, 2000], which is protected
from population stratification, we still see evidence for association (fasting insulin
p=0.013 and S
I
p=0.066). The reduction in the magnitude of the association is expected,
given that FBAT is based on allelic transmissions, which reduces the efficiency and
power of the test relative to the variance components approach. Thus, these results
suggest our results are not confounded by population substructure and may represent true
associations between variation in ACP1 and both fasting insulin and S
I
.
31
Mechanistic studies suggest that ACP1 is a negative regulator of insulin receptor
phosphorylation [Elchebly et al., 1999; Taddei et al., 2000; Pandey et al., 2007] and may
alter insulin receptor signaling. For example, Pandey et al. performed a series of in vitro
and in vivo experiments in which ACP1 expression was suppressed using an antisense
oligonucleotide [Pandey et al., 2007]. Their results show that in both culture mouse
hepatocytes and in liver and fat tissue isolated from diet-induced obese and ob/ob mice,
the presence of the antisense oligonucleotide resulted in a reduction of ACP1, a
concomitant decrease in insulin receptor phosphorylation, and significant changes in
activities of key components of the insulin receptor signaling pathway (e.g., PI3-kinase,
Akt) [Pandey et al., 2007]. Furthermore, diet-induced obese and ob/ob mice treated with
the antisense oligonucleotide exhibited hyper-phosphorylation of the insulin receptor,
subsequent improvement in insulin sensitivity, and reductions in fasting glucose and
insulin. Treatment by antisense oligonucleoitide did not appear to change body weight or
increase metabolic rate assessed by indirect calorimetry [Pandey et al., 2007]. These
phenotypic changes are similar to our observations in Mexican Americans where
variation in rs3828329 appears to alter fasting insulin and S
I
, at least in males. However,
it was unclear whether Pandey et al. observed sex-specific effects in their rodent studies.
Unfortunately, the location of rs3828329 does not provide information about the
potential location of the putative functional variant, given the LD structure in this region
(Figure II.1.1). For example, rs3828329 is in strong LD with rs4455191 (D'=0.99,
r
2
=0.96) and rs4255987 (D'=0.99, r
2
=0.99) which are 8.6 and 5.6 kb 5’ from exon 1,
respectively. LD is also strong with rs6755722 (D'=1.0, r
2
=1.0), located in intron 1, and
rs4452185 (D'=0.98, r
2
=0.96) located in intron 4. We therefore performed haplotype-
32
based tests of association to determine if a specific haplotype might be associated with
T2DM-related quantitative traits and provide insights into the region that may harbor the
putative functional variant. We assigned most probably haplotypes to each individual and
performed additional tests of association. However, our analyses, which showed that the
CGCCACGCAATTT haplotype (haplotype 3 in Table II.1.2) was associated with our
phenotypes of interest (data not shown), did not yield information beyond that observed
from our single SNP analyses. Therefore, additional studies will be required to identify
the putative functional variant(s).
In summary, we screened genetic variation across the coding region of ACP1 and
for association with type 2 diabetes-related quantitative traits in a large sample of
Mexican Americans from families of probands with a previous diagnosis of GDM. In
univariate analyses, ACP1 variation was not associated with type 2 diabetes-related
quantitative traits nor did we detect an effect of body fat to modify the association
between variation in ACP1 and these traits. However, we did observe a significant
association between fasting insulin and S
I
and the interaction between rs3828329 and
sex. Specifically, S
I
decreased and fasting insulin increased with each copy of the T allele
for rs3828329 in males, while there was no such effect in females. Our results, combined
with the known biology of ACP1, suggest that there is a sex-specific effect of variation in
ACP1 to alter insulin signaling. The specific mechanism underlying this effect will
require additional study.
33
Tables and Figures (Chapter II.1)
Table II.1.1. Subject Characteristics. Data are reported as unadjusted median and interquartile
range.
GDM
Probands
Siblings Cousins
Non-GDM
Probands
Female/Male 225/0 272/184 147/116 91/0
Age (years) 35.3 (6.5) 34.3 (12.3) 32.9 (12.9) 34.0 (5.8)
BMI (kg/m
2
) 30.4 (7.8) 29.1 (6.8) 27.8 (7.4) 28.5 (7.0)
Body fat (%) 38.7 (7.7) 33.4 (13.0) 31.2 (13.2) 36.6 (7.2)
Fasting glucose (mM) 5.2 (0.7) 5.1 (0.6) 5.1 (0.6) 4.9 (0.6)
2-hr. glucose (mM) 8.0 (3.3) 7.3 (2.7) 6.6 (2.4) 6.6 (2.1)
Fasting insulin (pM) 54 (48) 42 (42) 36 (36) 36 (30)
2-hr. insulin (pM) 450 (429) 354 (372) 306 (342) 252 (306)
30’ ∆insulin (pM) 363 (282) 384 (282) 408 (342) 384 (300)
S
G
(×10
-2
per min
-1
) 1.47 (0.69) 1.64 (0.80) 1.79 (0.85) 1.90 (0.92)
S
I
(×10
-3
per min
-1
per pM) 2.62 (1.75) 2.61 (1.99) 2.89 (1.87) 3.31 (1.89)
AIR (pM×10 min) 3,456 (4,275) 4,310 (5,047) 4,830 (5,340) 5,530 (4,307)
Disposition Index 8786 (9492) 11,944 (11,566) 13,264 (10,251) 17,431 (10,440)
Cholesterol (mM) 4.37 (1.16) 4.50 (1.19) 4.42 (1.06) 4.01 (1.03)
HDL (mM) 1.24 (0.36) 1.14 (0.36) 1.22 (0.41) 1.22 (0.34)
Triglyceride (mM) 2.51 (2.07) 2.77 (2.48) 2.30 (1.84) 1.73 (1.47)
34
Table II.1.2. Haplotypes frequencies using SNPs genotyped across the ACP1 region.
Haplotype
rs4455191
rs4255987
rs7596678
rs10167992
rs6755722
rs11553742
rs7419262
rs4452185
rs7576247
rs6855
rs3828329
rs6708541
rs3828330
Frequency
(%)
1 T C C C G C G T G A C T C 27.4
2 T C T C G C C T A G C G C 25. 5
3 C G C C A C G C A A T T T 21.3
4 T C T T G C C T A A C G C 10.7
5 T C T C G C C T A A C G C 7.8
All Others* 7.0
* Haplotypes with frequency <5% were pooled
35
Table II.1.3. Tag SNP Characteristics
SNP
Position
(kb)*
Minor
Allele MAF** SNPs Tagged
rs10167992 253270 T 0.10
rs11553742 262051 T 0.02
rs7419262 263621 C 0.48 rs6708541, rs7596678
rs7576247 267003 G 0.27
rs6855 267955 G 0.30
rs3828329 269705 T 0.24
rs4455191, rs4255987,
rs6755722, rs4452185, rs3828330
* Genomic position based on NCBI Build 36
** MAF = minor allele frequency
36
Table II.1.4. Results of single marker tests of association between SNPs in ACP1 and T2DM-
related quantitative traits.
rs11553742
(T / 0.02)
†
rs3828329
(T / 0.24)
†
rs10167992
(T / 0.10)
†
Trait Effect* p-value** Effect* p-value** Effect* p-value**
BMI – 0.1756 + 0.1294 – 0.0088
Body fat – 0.0998 + 0.0361 – 0.0491
Fasting glucose – 0.5337 + 0.8813 + 0.4186
2-hr. glucose – 0.3612 + 0.4426 – 0.8742
Fasting insulin – 0.7619 + 0.0297 – 0.4793
2-hr. insulin + 0.5871 + 0.0439 – 0.5482
30’ ∆insulin + 0.9291 + 0.0325 + 0.6867
S
G
+ 0.0690 – 0.9199 + 0.8284
S
I
– 0.9641 – 0.0083 + 0.0454
AIR – 0.2130 – 0.3507 – 0.8638
Disposition Index + 0.2240 – 0.1321 + 0.0560
Cholesterol + 0.8972 + 0.6307 + 0.3524
HDL + 0.0323 – 0.8940 + 0.0603
Triglyceride – 0.0050 – 0.8964 – 0.7574
* Refers to the directional effect of the minor allele on the quantitative trait based
on the regression coefficient and assuming an additive genetic model.
** Nominal p-value not corrected for multiple testing. There were no significant
associations when correction for multiple testing was made.
†
(minor allele / minor allele frequency).
37
Table II.1.5. Results of associations between T2DM-related quantitative traits and the
interaction between SNPs in ACP1 and sex.
rs11553742
(T / 0.02)*
rs3828329
(T / 0.24)*
rs10167992
(T / 0.10)*
Trait
p-value** p-value** p-value**
2-df 1-df 2-df 1-df 2-df 1-df
BMI 0.1940 0.2292 0.0443 0.0473 0.0259 0.5025
Body fat 0.0314 0.0401 0.0068 0.0181 0.1441 0.9597
Fasting glucose 0.5532 0.3722 0.0248 0.0066 0.1918 0.1034
2-hr. glucose 0.4221 0.3451 0.0426 0.0167 0.0589 0.0176
Fasting insulin 0.1273 0.0447 0.0001 0.0002 0.0163 0.0054
2-hr. insulin 0.7363 0.5732 0.0048 0.0100 0.0365 0.0123
30’ ∆insulin 0.1638 0.0574 0.0066 0.0194 0.9011 0.8307
S
G
0.1854 0.8010 0.9943 0.9710 0.6725 0.3876
S
I
0.2492 0.0957 0.0025 0.0247 0.0067 0.0143
AIR 0.2442 0.2601 0.6179 0.7619 0.6997 0.4080
Disposition Index 0.4630 0.8048 0.0744 0.0871 0.0837 0.2527
Cholesterol 0.7024 0.4062 0.8400 0.7316 0.0207 0.0087
HDL 0.0827 0.5263 0.8029 0.5163 0.1492 0.5988
Triglyceride 0.0027 0.0472 0.3865 0.1699 0.0123 0.0032
* (minor allele / minor allele frequency)
** Nominal p-value not corrected for multiple testing.
38
Table II.1.6. Results of single marker tests of association between tag SNPs in ACP1 and
T2DM-related quantitative traits in females only.
rs11553742
(T / 0.02)
†
rs3828329
(T / 0.24)
†
rs10167992
(T / 0.10)
†
Effect* p-value
Bonferroni
p-value** Effect* p-value
Bonferroni
p-value** Effect* p-value
Bonferroni
p-value**
BMI – 0.1240 1.0 – 0.9847 1.0 – 0.1334 1.0
Body fat – 0.0377 1.0 + 0.4858 1.0 – 0.1786 1.0
Fasting glucose – 0.3782 1.0 – 0.0878 1.0 + 0.1458 1.0
2-hr. glucose – 0.1492 1.0 – 0.1733 1.0 + 0.2379 1.0
Fasting insulin – 0.1385 1.0 – 0.4767 1.0 + 0.2959 1.0
2-hr. insulin + 0.9217 1.0 – 0.6981 1.0 + 0.3530 1.0
30’ ∆Insulin – 0.4207 1.0 + 0.6925 1.0 – 0.7982 1.0
S
G
+ 0.1297 1.0 – 0.9615 1.0 – 0.7020 1.0
S
I
+ 0.3371 1.0 – 0.3440 1.0 + 0.8353 1.0
AIR – 0.6524 1.0 – 0.3721 1.0 – 0.6205 1.0
Disposition Index + 0.2215 1.0 – 0.9916 1.0 + 0.4496 1.0
Cholesterol + 0.3998 1.0 + 0.7952 1.0 + 0.0270 1.0
HDL + 0.0470 1.0 + 0.5765 1.0 + 0.3233 1.0
Triglyceride – 0.0039 0.164 – 0.3176 1.0 + 0.0579 1.0
* Refers to the directional effect of the minor allele on the quantitative trait based on the regression coefficient
and assuming an additive genetic model.
** Bonferroni-corrected p-value. Correction is made for 3 SNPs and 14 quantitative traits tested.
†
(minor allele / minor allele frequency).
39
Table II.1.7. Results of single marker tests of association between tag SNPs in ACP1
and T2DM-related quantitative traits in males only.
rs11553742
(T / 0.02)
†
rs3828329
(T / 0.24)
†
rs10167992
(T / 0.10)
†
Effect* p-value
Bonferroni
p-value** Effect* p-value
Bonferroni
p-value** Effect* p-value
Bonferroni
p-value**
BMI – 0.7898 1.0 + 0.0362 1.0 – 0.0431 1.0
Body fat – 0.9585 1.0 + 0.0022 0.094 – 0.1627 1.0
Fasting glucose – 0.9702 1.0 + 0.0253 1.0 – 0.4516 1.0
2-hr. glucose + 0.9736 1.0 + 0.0136 0.572 – 0.0488 1.0
Fasting insulin + 0.3148 1.0 + 0.0002 0.007 – 0.0266 1.0
2-hr. insulin + 0.5497 1.0 + 0.0014 0.058 – 0.0688 1.0
30’ ∆insulin + 0.1120 1.0 + 0.0034 0.142 + 0.4763 1.0
S
G
+ 0.2804 1.0 – 0.9183 1.0 + 0.3884 1.0
S
I
– 0.1742 1.0 – 0.0005 0.019 + 0.0036 0.150
AIR – 0.0611 1.0 – 0.2987 1.0 + 0.8421 1.0
Disposition Index + 0.5292 1.0 – 0.0323 1.0 + 0.0322 1.0
Cholesterol – 0.7921 1.0 + 0.6129 1.0 – 0.2098 1.0
HDL + 0.2061 1.0 – 0.5666 1.0 + 0.2277 1.0
Triglyceride – 0.5746 1.0 + 0.3594 1.0 – 0.0610 1.0
* Refers to the directional effect of the minor allele on the quantitative trait based on the regression coefficient
and assuming an additive genetic model.
** Bonferroni-corrected p-value. Correction is made for 3 SNPs and 14 quantitative traits tested.
†
(minor allele / minor allele frequency).
40
Table II.1.8. Association between rs3828329 and fasting insulin and S
I
in males and females.
Genotypic mean (SD)*
Trait Sex C/C C/T T/T
Per allele
effect
Standardized
effect size
Bonferroni
p-value
Fasting insulin
(pM)
Males 47.40 (3.23) 60.52 (3.50) 73.70 (6.21) 13.149 3.753 0.007
Females 55.35 (3.35) 53.32 (3.74) 50.41 (6.90) 2.472 0.673 1.0
S
I
(×10
-3
per
min
-1
per pM)
Males 3.16 (0.17) 2.76 (0.18) 2.37 (0.30) 0.393 2.165 0.019
Females 3.08 (0.13) 2.95 (0.15) 2.82 (0.26) 0.130 0.891 1.0
* Values adjusted for age.
41
Figure II.1.1. ACP1 pairwise LD and haplotype block structure. Pairwise LD and haplotype
block structure as determined by Gabriel method as implemented in Haploview V4.0 for the 13
SNPs (one monomorphic SNP at position 254985, rs11691572, is excluded) genotyped in our
Mexican-American families. The twelve SNPs form a single 30.3 kb haplotype block. LD is
displayed as pair-wise r
2
values (values within boxes), where white indicates r
2
= 0, varying
shades of gray indicates 0 < r
2
< 1, and black indicates r
2
= 1.
42
Figure II.1.2. Interaction between rs3828329 and sex on fasting insulin and S
I
. Top panel
shows age-adjusted mean fasting insulin and standard deviation stratified by sex and rs3828329
genotype. There was a strong association between fasting insulin and genotype among males
(Bonferroni p=0.007), but not among females (Bonferroni p=1.0). Bottom panel shows age-
adjusted mean S
I
and standard deviation stratified by sex and rs3828329 genotype. Like fasting
insulin, there was a strong association between S
I
and genotype among males (Bonferroni
p=0.019), but not among females (Bonferroni p=1.0).
40
50
60
70
80
Fasting Inulin [pM]
Males Females
C/C C/T T/T
2.0
2.5
3.0
3.5
S
I
[×10
-4
min
-1
per pM]
rs3828329
43
II.2. Additive effects of genetic variation in GCK and G6PC2 on insulin secretion
and fasting Glucose [Li and Shu et al., 2009]
II.2.1. Introduction
Genome-wide association (GWA) studies have identified several loci for type 2
diabetes [Sladek et al., 2007; Steinthorsdottir et al., 2007; Zeggini et al., 2007; Zeggini et
al., 2008] and type 2 diabetes–related quantitative traits [Bouatia-Naji et al., 2008;
Chambers et al., 2008; Chen et al., 2008; Frayling et al., 2007; Loos et al., 2008; Orho-
Melander et al., 2008; Sanna et al., 2008; Vaxillaire et al., 2008; Weedon et al., 2008;
Willer et al., 2008; Bouatia-Naji et al., 2009; Kathiresan et al., 2009; Prokopenko et al.,
2009; Thorleifsson et al., 2009; Willer et al., 2009]. Two of these loci, glucokinase (GCK)
[Stone et al., 1996; Rose et al., 2005; Weedon et al., 2005] and glucose-6-phosphatase
catalytic subunit 2 (G6PC2) [Bouatia-Naji et al., 2008; Chen et al., 2008], regulate the
critical glucose-sensing mechanism within pancreatic β-cells. Mutations in GCK confer
susceptibility to maturity-onset diabetes of the young (MODY)-2 [Stoffel et al., 1992;
Vionnet et al., 1992; Froguel et al., 1993], and a −30 GCK promoter variant (rs1799884)
has been shown to be associated with β-cell function [Stone et al., 1996], fasting glucose,
and birth weight [Weedon et al., 2005]. Chen et al. [Chen et al., 2008] demonstrated an
association between the G6PC2 region and fasting glucose, an observation replicated by
Bouatia-Naji et al. [Bouatia-Naji et al., 2008]. Although fasting glucose levels are
associated with both GCK and G6PC2, there has been no evidence that genetic variation
at these two loci contribute a risk for type 2 diabetes, suggesting contribution to mild
elevations in glycemia.
44
GCK phosphorylates glucose to glucose-6-phosphate, whereas G6PC2
dephosphorylates glucose-6-phosphate back to glucose, forming a glucose cycle
previously demonstrated to exist within pancreatic β-cells [Katz and Rognstad, 1976;
Khan et al., 1989]. The important role of GCK in glucose sensing by pancreatic islets has
been demonstrated by numerous studies, and other studies suggest a role for glucose
cycling in insulin secretion and diabetes [Khan et al., 1989; Khan et al., 1990], implying
that the balance between GCK and G6PC2 activity is important for determining
glycolytic flux, ATP production, and subsequent insulin secretion. This was validated by
a demonstration that direct manipulation of glucose cycling alters insulin secretion
[Iizuka et al., 2000; Watanabe et al., 2007].
BetaGene is a study in which we are performing detailed phenotyping of Mexican
American probands with recent gestational diabetes mellitus (GDM) and their family
members to obtain quantitative estimates of body composition, insulin sensitivity (S
I
),
acute insulin response (AIR), and β-cell compensation (disposition index) with the goal
of identifying genes influencing variations in type 2 diabetes–related quantitative traits
[Watanabe et al., 2007; Black et al., 2008; Shu et al., 2009]. Based on the evidence that
variation in GCK (rs1799884) and G6PC2 (rs560887) are independently associated with
fasting glucose concentrations and both are crucial to glucose cycling in β-cells, we
hypothesized that interaction between these loci may be associated not just with fasting
glucose but also with measures of insulin secretion or β-cell function. We tested this
hypothesis in the BetaGene Study and, for replication, in a separate sample of Finnish
men participating in the METabolic Syndrome in Men (METSIM) Study [Wang et al.,
2007].
45
II.2.2. Methods
The BetaGene Study
Subject recruitment. Subject recruitment for BetaGene was ongoing at the time of
these analyses. Subjects are Mexican American (both parents and ≥3 grandparents
Mexican or of Mexican descent) who are either probands with GDM diagnosed within
the previous 5 years and diagnosed in their family members or non-GDM probands with
normal glucose levels in pregnancy in the past 5 years. Subject recruitment has been
previously detailed [Watanabe et al., 2007; Black et al., 2008; Shu et al., 2009]. The
institutional review boards of participating institutions approved all protocols for
BetaGene, and all participants provided written informed consent before participation.
Clinical protocols. Phenotyping was performed on two separate visits to the
University of Southern California General Clinical Research Center. The first visit
consisted of a physical examination, DNA collection, and an oral glucose tolerance test
(OGTT; 75 g) as previously described [Watanabe et al., 2007; Black et al., 2008; Shu et
al., 2009]. Participants with fasting glucose <126 mg/dl were invited back for a second
visit, which consisted of a dual-energy X-ray absorptiometry scan for body composition
(percent of body fat) and an insulin-modified intravenous glucose tolerance test (IVGTT)
performed as previously described [Buchanan et al., 1999].
Assays. Plasma glucose is measured on an autoanalyzer using the glucose oxidase
method (YSI Model 2300; Yellow Springs Instruments, Yellow Springs, OH). Insulin is
46
measured by two-site immunoenzymometric assay that has <0.1% crossreactivity with
proinsulin and intermediate split products.
The METSIM Study
Subject recruitment. We attempted to replicate our findings in subjects
participating in the ongoing METSIM Study [Wang et al., 2007]. The primary goal of the
METSIM Study was to investigate the effect of genetic variation on risk for type 2
diabetes and cardiovascular disease in a random sample of Finnish men (aged 50–70
years) living in the town of Kuopio, eastern Finland (population 95,000). There were
5,327 men from this ongoing population-based study included in this report. The ethics
committee of the University of Kuopio and in accordance with the Helsinki Declaration
approved the METSIM Study.
Clinical protocols. Phenotyping for METSIM was performed in a single visit to
the Clinical Research Unit of the University of Kuopio. In addition to fasting blood
samples for DNA, glucose, and insulin, all subjects underwent an OGTT with samples
obtained at 30 and 120 min postload.
Molecular analysis
In BetaGene, GKC −30G→A polymorphism (rs1799884) and G6PC2 rs560887
were genotyped using the TaqMan system (Applied Biosystems) [Livak 1999; Livak
2003]. Genotyping discrepancy rate based on blinded duplicates was <0.1%, and overall
genotyping success rates were >97% for rs1799884 and >99% for rs560887.
47
In METSIM, GCK rs4607517 and G6PC2 rs560887 were genotyped by the
homogeneous MassEXTEND reaction using the MassARRAY System (Sequenom)
[Chen et al., 2008]. In the HapMap CEU data [The International HapMap Consortium,
2005], which provide a good representation of linkage disequilibrium in Finns [Willer et
al., 2006], rs4607517 is a perfect proxy for rs1799884 (D' = 1.0, r
2
= 1.0). There were no
genotype discrepancies based on >220 blinded duplicates, and overall genotyping success
rates were >96% for rs4607517 and >94% for rs560887.
Data analysis
Genotype data were tested for deviation from Hardy-Weinberg equilibrium and
for non-Mendelian inheritance using PEDSTATS version 0.6.4 [Wigginton and Abecasis,
2005]. Allele frequencies for BetaGene samples were estimated using all available data
taking into account relatedness using SOLAR version 2.1.4 [Blangero and Almasy, 1997;
Almasy and Blangero, 1998], whereas METSIM frequencies were estimated by gene
counting.
We calculated two measures of insulin response to glucose: the difference
between the 30′ and fasting plasma insulin concentration from OGTT (30′ ΔInsulin) and
the incremental area under the insulin curve for the first 10 min of the IVGTT (AIR).
IVGTT glucose and insulin data were analyzed using the minimal model (MINMOD
Millennium version 5.18) to derive measures of glucose effectiveness and S
I
[Boston et
al., 2003]. Disposition index, a measure of β-cell compensation for insulin resistance, was
computed as the product of S
I
and AIR. We also computed an OGTT-derived measure of
β-cell compensation as the product of S
I
and 30′ Δinsulin (DI30).
48
We initially examined the association between the two single nucleotide
polymorphisms (SNPs) and fasting glucose, fasting insulin, 30′ Δinsulin, and AIR given
our primary hypothesis that these variants may underlie variation in these specific
quantitative traits. Details of the association analyses are described below. We recognized
that these variants could also underlie variation in other type 2 diabetes–related
quantitative traits. We tested these SNPs for association with 2-h glucose, 2-h insulin, S
I
,
S
G
, BMI, and percent of body fat as a series of secondary analyses.
Quantitative trait data were statistically transformed to approximate univariate
normality before analyses. We first tested each SNP for association with type 2 diabetes–
related quantitative traits using likelihood ratio testing under a variance components
framework as implemented in SOLAR [Blangero and Almasy, 1997; Almasy and
Blangero, 1998]. Because families were ascertained through probands based on previous
GDM status, we corrected for ascertainment bias by conditioning each model on the
proband's phenotype. Each SNP was tested for association with quantitative traits
assuming a dominant genetic model because of the relatively low minor allele
frequencies observed for both SNPs. All models were adjusted for age, sex, and, where
appropriate, percent body fat. P values were corrected for multiple testing to account for
the number of traits tested (four traits in the primary analysis, six traits in the secondary
analysis). We used P
ACT
to control for the number of tests performed and to account for
the correlation among the quantitative traits tested for association [Conneely and
Boehnke, 2007]. P
ACT
compares the observed test statistic with their asymptotic
distribution using numeric integration and directly accounts for correlation among both
dependent and independent variables.
49
Next, we performed a test of the association between type 2 diabetes–related
quantitative traits and the interaction between GCK and G6PC2. We compared a model
with age, sex, rs1799884, and rs560887 as main effects with a second model that
included the multiplicative interaction between the two SNPs. As in the univariate
analyses, we assumed dominant genetic models for both SNPs. We tested for a
significant interaction effect using a 1-d.f. test and performed a Bonferroni correction to
account for the number of traits tested because P
ACT
is not valid for multi-d.f. tests
[Conneely and Boehnke, 2007]. However, because many of the quantitative traits
examined are correlated and Bonferroni assumes independence, the corrected Pvalues we
report should be overly conservative.
Given that GCK and G6PC2 work in opposite directions on the glucose cycle, we
hypothesized that additivity may best characterize their interaction. Therefore, we
performed a second analysis in which we used a 2-d.f. test to test the joint effect of
rs1799884 and rs560887 for association with type 2 diabetes–related quantitative traits.
METSIM Study. Specific results observed in the BetaGene Study were tested in
the METSIM samples. Standard regression methods were used to test for association
between genetic variants and type 2 diabetes–related traits as METSIM participants are
unrelated individuals. Covariates included age and, when appropriate, BMI. Given that
only specific a priori hypotheses were tested in the METSIM data, we did not apply a
correction for multiple testing. All data for BetaGene are reported as age- and sex-
adjusted means and SD, whereas METSIM data are reported as age-adjusted means and
SD unless noted otherwise.
50
II.2.3. Results
The BetaGene Study
We report results from 861 BetaGene individuals in 251 families with available
phenotype and genotype data (Table II.2.1). Probands, siblings, and cousins were similar
in median age, BMI, and percent of body fat, although these characteristics tended to be
highest in the GDM probands and lowest in cousins. Non-GDM probands were slightly
younger and less obese compared with GDM probands, reflecting the fact that BetaGene
participant accrual was ongoing and recruitment of non-GDM probands was slightly
lagging to allow for matching as previously described [Watanabe et al., 2007; Black et al.,
2008; Shu et al., 2009]. Parameters of glucose metabolism tended to be best in the non-
GDM probands and worst in the GDM probands.
GCK rs1799884 (G>A) had a minor allele frequency of 18.0% and was
marginally associated with fasting glucose (Table II.2.2, P = 0.052) that modestly
increased (P = 0.021) after adjustment for body fat. Individuals homozygous for the G
allele had a lower average fasting glucose compared with those with at least one A allele
(5.13 ± 0.52 vs. 5.22 ± 0.59 mmol/l). However, the association between GCKrs1799884
and fasting glucose did not remain significant when corrected for multiple testing. In
addition, we did not observe any evidence for association between this SNP and measures
of insulin secretion (Table II.2.2) or other type 2 diabetes–related quantitative traits
(Table II.2.3), although there was a tendency for mean fasting insulin, 30′ Δinsulin, and
AIR to be lower in individuals with at least one rs1799884 A allele (Table II.2.2).
51
G6PC2 rs560887 (G>A) had a minor allele frequency of 14.5% and was
nominally associated with fasting glucose (Table II.2.2, P = 0.015) and 30′ Δinsulin (P =
0.0038) and marginally associated with AIR (P = 0.088). These associations became
weaker when body fat was included in the model (P = 0.059, P= 0.014, and P = 0.222,
respectively). Individuals homozygous for the rs560887 G allele had an average fasting
glucose higher than those with at least one A allele (5.19 ± 0.52 vs. 5.08 ± 0.61 mmol/l).
In parallel with these glucose differences, both 30′ Δinsulin and AIR were, on average,
higher in individuals homozygous for the G allele (Table II.2.2). Only the association
with 30′ Δinsulin remained significant after correction for multiple testing (P
ACT
= 0.021).
Although G6PC2 rs560887 was not associated with most type 2 diabetes–related
quantitative traits, we did observe an association between this SNP and percent body fat
(P
ACT
= 0.035), where individuals homozygous for the G allele had higher trait values
(Table II.2.3).
We found no evidence for association between the multiplicative interaction
between GCK rs1799884 and G6PC2 rs560887 and type 2 diabetes–related quantitative
traits (uncorrected P > 0.26), and additional adjustment for body fat did not change the
overall results. In contrast, when we tested whether the additive effects
of GCK rs1799884 and G6CP2 rs560887 were associated with type 2 diabetes–related
quantitative traits, we observed association with fasting glucose (Bonferroni P = 0.03)
and 30′ Δinsulin (Bonferroni P = 0.027).
Bouatia-Naji et al. reported a linear relationship between number of “glucose-
lowering alleles” for GCK,G6PC2, and GCKR and fasting glucose, which suggested
52
additive effects of these alleles on fasting glucose [Bouatia-Naji et al., 2008]. We
observed a similar linear relationship (data not shown) when we
stratified GCK andG6PC2 genotypes, consistent with the results of our statistical analysis
and the observations of Bouatia-Naji et al. [Bouatia-Naji et al., 2008]. However, given
the specificity of these two gene products in regulating the glucose-sensing step in
pancreatic β-cells, we hypothesized the primary effect of variation in GCK and G6PC2 to
be at the level of insulin secretion, not directly upon glucose. We tested this hypothesis
using the conceptual construct depicted in Figure II.2.1. The presence of an A allele in
the GCK rs1799884 promoter variant is hypothesized to reduce GCK gene expression and
result in decreased GCK protein [Weedon et al., 2005]. This should result in modest
reductions in glycolytic flux and ATP production, resulting in modest reductions in
insulin secretion for any level of circulating glucose. G6PC2 rs560887 is located in intron
3, just 26 bp proximal to exon 4 [Bouatia-Naji et al., 2008; Chen et al., 2008]. The
presence of an A allele may4 result in a G6PC2 isoform with differential activity that
favors increased conversion of glucose-6-phosphate to glucose, significantly reducing
glycolytic flux and ATP generation, resulting in larger reductions in insulin secretion for
any glucose level. The balance between the effects of GCK and G6PC2 should
differentially modulate insulin secretion, depending upon genotype combination as
detailed in Figure II.2.1.
When we stratified individuals by GCK and G6PC2 genotypes as depicted
in Figure II.2.1, a linear trend in 30′ Δinsulin (Figure II.2.2, top; P = 0.0016) is observed
where on average, 30′ Δinsulin decreased ∼5.6 pmol/l among genotype groups. We do
not see a clear linear relationship when the same stratification is used to examine the
53
relationship with fasting glucose (Figure II.2.2, bottom; P = 0.191). Thus, the linearity
between GCK and G6PC2 genotype combinations and insulin secretion does not directly
translate into linear changes in fasting glucose. We observed a similar dichotomy when
we examined fasting glucose and 30′ Δinsulin stratified by total number of glucose-
raising alleles, as previously performed by Bouatia-Naji et al. [Bouatia-Naji et al., 2008];
fasting glucose is linear with total number of glucose-raising alleles, whereas 30′ Δinsulin
is not (data not shown).
Grouping and ordering by genotype as described above is semi-artificial.
Therefore, we plotted fasting glucose against 30′ Δinsulin from BetaGene stratified
by GCK and G6PC2 genotypes (Figure II.2.3, top) to better characterize the relationship
between 30′ Δinsulin and fasting glucose. We noted a different pattern of coordinate
changes in fasting glucose and 30′ Δinsulin associated with variation in GCK as
compared with variation in G6PC2. For GCK rs1799884, individuals homozygous for the
G allele (Figure II.2.3, circles) had lower fasting glucose and higher 30′ Δinsulin than
individuals with at least one A allele (Figure II.2.3, squares). For G6PC2 rs560887,
individuals homozygous for the G allele (Figure II.2.3, open symbols) had higher fasting
glucose and higher 30′ Δinsulin than individuals with at least one A allele (Figure II.2.3,
solid symbols). In other words, variation in GCK led to inverse changes in fasting glucose
and 30′ Δinsulin, whereas variation in G6PC2 led to parallel changes in these two traits.
Replication in the METSIM Study
Given the observation of additivity between GCK rs1799884 and G6PC2
rs560887 on fasting glucose and 30′ Δinsulin in BetaGene, we attempted to replicate
54
these findings in the METSIM sample, in which GCK rs4607517 is in perfect linkage
disequilibrium with rs1799884. GCK rs4607517 was associated with fasting glucose but
not associated with 30′ Δinsulin (Table II.2.4). G6PC2 rs560887 was associated with
fasting glucose and marginally associated with 30′ Δinsulin (Table II.2.4). The
multiplicative interaction between GCK rs4607517 and G6PC2 rs560887 was weakly
associated with fasting glucose (P = 0.038) and marginally associated with 30′ Δinsulin
(P = 0.049). However, the 2-d.f. test of the joint additive effect of the two SNPs was
significant for both fasting glucose (P = 3.5 × 10
−10
) and 30′ Δinsulin (P = 0.028). The
evidence for association between the joint additive effect of both SNPs and fasting
glucose was not affected when BMI was included as an additional covariate (P = 3.3 ×
10
−11
); the association with 30′ Δinsulin became stronger when BMI was included as a
covariate (P = 0.0028). When we examined the relationship between fasting glucose and
30′ Δinsulin stratified by GCK rs4607517 and G6PC2 rs560887 genotype groupings, an
identical pattern of relationship to that seen in BetaGene was observed (Figure
II.2.3, bottom). For GCK rs4607517, individuals homozygous for the G allele had lower
fasting glucose and higher 30′ Δinsulin than individuals with at least one A allele. For
G6PC2 rs560887, individuals homozygous for the G allele had higher fasting glucose
and higher 30′ Δinsulin than individuals with at least one A allele.
II.2.4. Discussion
Our results from Mexican Americans, replicated in a sample of Finnish men,
show that variation in both GCK and G6PC2 have additive effects on both fasting
glucose and insulin secretion. Our examination of the relationship between 30′ Δinsulin
55
and fasting glucose stratified by GCK and G6PC2genotype reveals that the joint effect of
genetic variation in these two genes may be through distinct mechanisms. Variation
in GCK leads to inverse changes in fasting glucose and insulin secretion, consistent with
a primary effect on insulin secretion. Variation in G6PC2 leads to parallel changes in
fasting glucose and insulin secretion, suggesting a primary effect on glucose, with
secondary changes in insulin secretion.
Table II.2.5 summarizes our examination of the effects of GCK rs1799884
and G6PC2 rs560887 on fasting glucose and 30′ Δinsulin. Although it is possible that our
conclusion of these variants having an additive effect could be driven mainly
by G6PC2 rs560887, the summary in Table II.2.5 suggests otherwise. It is clear that in
both BetaGene and METSIM, the model testing the joint effect
of GCK and G6PC2 shows greater significance than the effect of each individual gene
variant alone, which suggests both genes are contributing to the effect on fasting glucose
and 30′ Δinsulin. In support of this observation, when we tested for the additional effect
of GCK in the presence of G6PC2 we observed marginal or significant effects for fasting
glucose (P = 0.058 for BetaGene; P = 0.0009 for METSIM) and 30′ Δinsulin (P = 0.148
for BetaGene; P = 0.723 for METSIM). Our a priori hypothesis tested whether the
multiplicative interaction between rs1799884 and rs560887 was associated with type 2
diabetes–related quantitative traits. In BetaGene we observed no evidence for a
multiplicative interaction; our analysis favored a joint additive effect of these two loci.
However, our relatively small size may limit our ability to exclude a multiplicative
interaction. A priori power estimates indicated that a sample size of 600 should have
provided 80% power to detect a multiplicative interaction effect that accounted for
56
∼1.8% of the variability in fasting glucose or 30′ Δinsulin, assuming α = 0.01, minor
allele frequencies of 15% for both loci, and dominant genetic models for both loci. Thus,
BetaGene cannot discount the possibility that a very weak multiplicative interaction
exists between these two loci. In fact, the METSIM sample, which is ∼6 times larger than
BetaGene, showed modest evidence for a multiplicative interaction (Table II.2.5)
suggesting that a weak multiplicative interaction between GCK and G6PC2 could be
contributing to variation in fasting glucose and 30′ Δinsulin. This interaction would need
to be replicated in a sample of similar or larger size.
Our initial examination of the additive effect of these two genes was motivated by
the fact that when we plotted 30′ Δinsulin against GCK rs1799884 and G6PC2 rs560887
genotype combinations sorted by the hypothesized biologic effect on insulin secretion,
we observed a linear relationship with 30′ Δinsulin. Recently, Pirot et al. [Pirot et al.,
2008] showed in mouse and human islets that G6PC2 protein levels and enzymatic
activity increased when islets were incubated in high-glucose media, whereas those for
GCK did not change, suggesting that G6PC2 activity may adapt to changes in glucose to
alter insulin secretion. Because insulin secretion was not measured, it is unclear whether
the changes observed translated into altered rates of insulin secretion. However, previous
studies have demonstrated that manipulation of glucose cycling in insulin-secreting cell
lines does result in changes in insulin secretion [Trinh et al., 1997; Iizuka et al., 2000].
For example, Iizuka et al. [Iizuka et al., 2000] showed that altering glucose cycling by
varying the G6PC2-to-GCK ratio leads to reductions in ATP production and reduced
glucose-stimulated insulin secretion in MIN-6 cells. Taken together, these studies
examining the balance between GCK and G6PC2 support our hypothesized effects of
57
variation in GCK and G6PC2 on insulin secretion (compared with Figure II.2.1).
However, they do not account for the differences we observed in the relationship between
fasting glucose and insulin secretion when stratified by these two loci.
Primary enhancement of β-cell sensitivity to glucose should result in increased
systemic insulin concentrations in relation to glucose levels. The fact that the linearity in
30′ Δinsulin we observed between GCK rs1799884 and G6PC2 rs560887 genotype
combinations did not extend to fasting glucose (compared with Figure II.2.2) is
unexpected. We confirmed this initial observation by examining the relationship between
fasting glucose and 30′ Δinsulin, stratifying on GCK and G6PC2 genotype (compared
with Figure II.2.3). In both the BetaGene and METSIM studies, the presence of
the GCK rs1799884 A allele resulted in lower 30′ Δinsulin and higher fasting glucose,
regardless of G6PC2 genotype. The results suggest that variation in GCK may alter
insulin secretion and fasting glucose in the traditionally envisioned manner where
changes in insulin secretion result in reciprocal changes in glucose concentration.
However, the presence of a G6PC2 A allele resulted in both lower 30′ Δinsulin
and fasting glucose, regardless of GCK genotype. There are two possible mechanisms
that may explain our observation. First, variation in G6PC2 could have a direct and
independent effect to regulate hepatic glucose production. This effect is unlikely because
the hepatic isoform of glucose-6-phosphatase is encoded by an independent gene. Second,
variation in G6PC2 could alter a characteristic of insulin secretion that disrupts the
normal signaling between the pancreas and other tissues, resulting in both insulin
secretion and glucose changing in tandem. Recent studies by Matveyenko et al.
58
[Matveyenko et al., 2006] provide evidence for this possibility. Specifically, their studies
show that in canines with partial pancreas resection or in rats transgenic for human islet
amyloid polypeptide, disruption of pulsatile insulin secretion results in hepatic insulin
resistance and impaired fasting glucose (A. Matveyenko, personal communication). They
conclude that the loss of pulsatile insulin secretion results in a loss of efficiency in insulin
action at the liver, leading to hepatic insulin resistance and increased glucose output,
which subsequently leads to a gradual rise in glucose levels. Variation in G6PC2 may
disrupt pulsatility in insulin secretion, thereby reducing insulin-signaling efficiency
between the pancreas and liver. The small increase in glucose because of hepatic insulin
resistance may result in modest increases in absolute insulin secretion. Additional studies
will be required to test this hypothesis.
Finally, we observed association between G6PC2 rs560887 and measures of
adiposity that have not been previously reported. BMI and body fat were both lower in
the presence of the A allele (compared with Table II.2.3), but the biologic mechanism
underlying this association is unclear. One possibility is that if variation in G6PC2 results
in decreased insulin-signaling efficiency because of the loss of pulsatility and leads to
hepatic insulin resistance, then a similar loss of insulin signaling may occur in other
tissues, such as adipose. Our observed association between variation in G6PC2 and
adiposity will require replication in other populations and further study to elucidate the
mechanism underlying the association.
In conclusion, we observed evidence for an additive effect of variation
in GCK (rs1799884) and G6PC2 (rs560887) on insulin secretion and fasting glucose.
59
When stratified by genotype, the effect of GCKrs1799884 follows an expected pattern of
reduced insulin secretion and reciprocal increase in fasting glucose. By contrast, glucose
levels and insulin secretion change in parallel with G6PC2 rs560887, suggesting a
primary effect on insulin signaling to regulate hepatic glucose production. Our analyses
highlight the importance of considering biologic effects in assessing relationships
between genotype and quantitative traits. Like previous studies, our analyses show that
these variants account for only a small fraction of the variability in insulin secretion or
glucose. The remaining variability is likely because of the effects of other common
genetic variants of both modest and moderate effect and a variety of additional gene-gene
and gene-environment interactions. Thus, additional studies of samples with detailed
phenotyping will be necessary to fully unravel the genetic architecture underlying
glucose regulation and diabetes.
60
Tables and Figures (Chapter II.2)
Table II.2.1. Subject Characteristics. Values are unadjusted median and interquartile range.
GDM
Probands
Siblings Cousins
Non-GDM
Probands
Female/Male 164/0 239/154 133/101 67/0
Age [yrs.] 35.5 (5.6) 35.1 (8.6) 33.2 (9.3) 34.6 (4.9)
BMI [Kg/m
2
] 38.8 (5.5) 32.8 (8.4) 31.0 (8.5) 36.9 (5.7)
Body Fat [%] 31.4 (6.7) 29.8 (5.9) 28.4 (6.4) 28.9 (6.0)
Fasting Glucose [mM] 5.3 (0.6) 5.2 (0.5) 5.1 (0.6) 4.8 (0.4)
2-hour Glucose [mM] 8.6 (2.4) 7.7 (2.2) 6.9 (2.0) 6.3 (1.2)
Fasting Insulin [pM] 74 (53) 62 (47) 59 (52) 49 (37)
30’ Insulin [pM] 494 (316) 522 (342) 547 (373) 506 (353)
2-hour Insulin [pM] 647 (475) 523 (410) 453 (380) 362 (252)
30’ ∆Insulin [pM] 420 (288) 461 (320) 488 (345) 458 (325)
S
G
[ 10
-2
min
-1
] 1.58 (0.55) 1.73 (0.68) 1.83 (0.71) 2.01 (0.69)
S
I
[ 10
-3
min
-1
per pM] 2.98 (1.77) 2.95 (1.64) 3.14 (1.72) 3.32 (1.46)
AIR [pM 10 min] 4540 (4280) 5771 (5565) 6313 (5690) 7082 (5079)
Disposition Index 10933 (8882) 14002 (9519) 15518 (9712) 19173 (9267)
61
Table II.2.2. Univariate tests of associations between genetic variants and fasting glucose and
measures of insulin secretion. Values are unadjusted mean SD.
GCK (rs1799884) G6PC2 (rs560887)
Trait
GG
(n=405)
GA or AA
(n=207)
Uncorrected
p-value*
GG
(n=458)
GA or AA
(n=165)
Uncorrected
p-value*
Fasting Glucose [mM] 5.13 0.52 5.22 0.59 0.052 5.19 0.52 5.08 0.61 0.015
Fasting Insulin [pM] 63.2 49.7 57.5 47.3 0.412 60.5 45.3 63.0 57.0 0.833
30’ ∆Insulin [pM] 489 338 447 314 0.119 488 339 427 300 0.004
AIR [pM 10 min] 6197 6238 5721 4258 0.770 6196 5673 5650 5485 0.088
* age- and sex-adjusted p-value not corrected for multiple testing.
62
Table II.2.3. Univariate tests of associations between genetic variants and additional T2DM-
related quantitative traits. Values are unadjusted mean SD.
GCK (rs1799884) G6PC2 (rs560887)
Trait
GG
(n=405)
GA or AA
(n=207)
Uncorrected
p-value*
GG
(n=458)
GA or AA
(n=165)
Uncorrected
p-value*
BMI [Kg/m
2
] 29.6 6.4 28.7 5.6 0.150 29.7 6.2 28.3 5.7 0.012
Body Fat [%] 32.2 8.6 31.8 8.3 0.121 32.5 8.5 31.4 8.4 0.007
2-hour Glucose [mM] 7.39 2.24 7.40 2.07 0.477 7.38 2.08 7.45 2.43 0.805
2-hour Insulin [pM] 505 416 485 374 0.764 502 409 488 377 0.871
S
G
[ 10
-2
min
-1
] 1.78 0.71 1.73 0.66 0.340 1.78 0.67 1.73 0.74 0.218
S
I
[ 10
-3
min
-1
per pM] 3.02 1.78 3.00 1.48 0.944 2.96 1.68 3.12 1.66 0.469
* age- and sex-adjusted p-value not corrected for multiple testing.
63
Table II.2.4. Univariate tests of associations between genetic variants and fasting glucose and
30’ ∆Insulin in METSIM. Values are unadjusted mean SD.
GCK (rs4607517) G6PC2 (rs560887)
Trait
GG
(n=4874)
GA or AA
(n=1154)
p-value*
GG
(n=3117)
GA or AA
(n=2759)
p-value*
Fasting Glucose [mM] 5.68 0.49 5.74 0.50 0.0003 5.72 0.49 5.64 0.50 1.2×10
-8
30’ ∆Insulin [pM] 343 270 333 235 0.7362 351 280 332 244 0.0517
* age-adjusted p-value
64
Table II.2.5. Summary of modeling analysis.
BetaGene METSIM
Model*
Fasting
Glucose**
30’
Fasting
Glucose**
30’
Age + Sex + GCK 0.058 0.148 0.0003 0.736
Age + Sex + G6PC2 0.012 0.012 1.2×10
-8
0.052
Age + Sex + GCK + G6PC2 + GCK×G6PC2 0.257 0.396 0.038 0.049
Age + Sex + GCK + G6PC2 0.0075 0.0067 3.5×10
-10
0.028
* Underline indicates the genetic effect being tested
** Uncorrected p-values for the genetic effect being tested.
65
Figure II.2.1. Hypothesized effect of GCK rs1799884 and G6PC2 rs560887 genotype
combinations on ATP production and insulin secretion in pancreatic -cells. Top half of the
figure shows the glucose cycling step and the effect of specific alleles in GCK and G6PC2 on
activity of each enzyme: thick arrow depicts increased activity, thin arrow depicts reduced
activity. Bottom half shows the specific genotype combinations and the hypothesized effect on
ATP production and subsequent insulin secretion.
G/G
G/G
G/A or A/A
G/A or A/A
Glucokinase (GCK)
Glucose-6-phosphatase,
Catalytic Subunit 2(G6PC2)
Glucose Glucose-6-phosphate
- - - - - - - - G/G G/G
G/G G/A or A/A
G/A or A/A G/A or A/A
G/A or A/A G/G
Insulin
Secretion
ATP
Production
G6PC2
Genotype
GCK
Genotype
66
Figure II.2.2. 30’ ΔInsulin and fasting glucose stratified by GCK rs1799884 and G6PC2
rs560887 genotypes. Top Panel: Age- and sex-adjusted 30’ SD) stratified by
GCK rs1799884 and G6PC2 rs560887 genotype combination. Genotype groups are ordered by
hypothesized effect on glucose cycling in the pancreatic -cell (see text for details). Dominant
genetic models are assumed for both loci. 30’ ΔInsulin changed linearly (p=0.0016) with
genotype group. Bottom Panel: Age- and sex-adjusted fasting plasma glucose (mean SD)
stratified by genotype. Details are as described for top panel. Fasting glucose was not linearly
related to genotype group (p=0.191).
300
350
400
450
500
30’ Insulin [pM]
p=0.0016
4.8
4.9
5.0
5.1
5.2
5.3
5.4
5.5
Glucose [mM]
GG GA or AA GG GA or AA GCK
GG GG GA or AA GA or AA G6PC2
p=0.191
67
Figure II.2.3. 30’ ΔInsulin versus fasting plasma glucose stratified by GCK rs1799884 and
G6PC2 rs560887 genotype. Top Panel: Shows results from the BetaGene study. Age- and sex-
adjusted 30’ ΔInsulin (mean SD) stratified by genotype is plotted against age- and sex-adjusted
fasting plasma glucose (mean SD). Solid symbols represent G6PC2 rs560887 G/A or A/A
genotypes and open symbols represent G6PC2 rs560887 G/G genotype. Circles represent GCK
rs1799884 G/G genotype and squares represent GCK rs1799884 G/A or A/A genotypes. Bottom
Panel: Shows results from the METSIM study. Details are identical to that described for the top
panel except that circles and squares represent GCK rs4607517 genotypes.
300 320 340 360 380 400 420 440 460
4.8
5.0
5.2
5.4
Glucose [mM]
30’ Insulin [pM]
300 310 320 330 340 350 360 370
5.5
5.6
5.8
5.9
Glucose [mM]
30’ Insulin [pM]
BetaGene
METSIM
68
III. Dimensionality Reduction Methods for testing joint genetic effects on multiple
traits
III.1. Rational for using dimensionality reduction methods (DRMs)
Several studies have discussed using DRMs to test the genetic association of a
gene or a pathway with either a single or multiple traits [Gauderman et al., 2007; Black
and Watanabe, 2011; Aschard et al., 2014]. Given a set of variables, one can apply
methods such as principal component analysis (PCA), canonical correlation analysis
(CCA), and partial least square (PLS) to determine and select a smaller set of
independent linear combinations that “best” captures the information in the entire set of
original variables, and the entire process is known as DRMs. In other words, DRMs can
convert a large number of correlated variables into a smaller number of uncorrelated
variables while still capturing most of the information contained in the data. As discussed
in Chapter II, the main two challenges when testing the association of multiple genetic
variants with multiple traits are high dimensionality (i.e. large number of variables) and
correlations between the variables. Using DRMs is expected to help address those two
challenges. In particular, it is known that: 1) multiple genetic markers in a gene or
pathway can be correlated; 2) related quantitative traits can be correlated; 3) multiple
genetic markers can be correlated with multiple traits, so combining signals of these
correlations using DRMs would be expected to enhance the statistical power to detect
genetic associations. Specifically, the idea is to use DRMs on multiple genetic markers
and/or traits to obtain a smaller set of linear components that captures most of the
relevant information, and then perform statistical tests on those linear components instead
69
of the original variables. However, although many DRMs have applied for association
testing, there are no studies systematically comparing different DRMs for testing
associations with multivariate traits. I therefore undertake such a comparison in a series
of simulations focusing on the three most commonly used DRMs: PCA, CCA, and PLS.
I first briefly reviewed the three DRMs in chapter III.2, III.3, and III.4
respectively, including their basic formulation and the traditional ways approach their
computation. In III.5, I described a unified approach encompassing the three DRMs. This
unified approach not only gives a better insight to understand the three DRMs, but also
provides the foundation that helps perform the computations and the simulations
presented later in this dissertation in a more efficient fashion. In the final section, I
propose a new DRM-based approach for testing the association between multiple SNPs
and traits based on this unified approach.
III.2. Principal component analysis (PCA)
PCA has been proposed as a way to deal with correlated SNPs in candidate gene
studies with binary traits (e.g. case-control status) [Gauderman et al., 2007], disease-
related quantitative traits [Wang and Abbott, 2008], and genome-wide association studies
[Lee and Shu, 2004]. The principal components (PCs) form a new orthogonal coordinate
system to express the original variables such that in each of the new coordinate directions
the variance explained is maximal. Let X be a data matrix with n individuals and p
variables:
70
X
𝑛 ×𝑝 = [
x
1
x
2
⋯ x
𝑝 ] = [
𝑥 11
𝑥 12
⋯ 𝑥 1𝑝 𝑥 21
𝑥 22
⋯ 𝑥 2𝑝 ⋮ ⋮ ⋱ ⋮
𝑥 𝑛 1
𝑥 𝑛 2
⋯ 𝑥 𝑛𝑝
]
𝑛 ×𝑝
(III-1)
Assume each x was already centered with mean, the sample covariance matrix V can then
be calculated as:
V =
1
𝑛 − 1
X
T
X =
1
𝑛 − 1
[
∑𝑥 𝑖 1
𝑥 𝑖 1
𝑛 𝑖 =1
⋯ ∑𝑥 𝑖 1
𝑥 𝑖𝑝
𝑛 𝑖 =1
⋮ ⋱ ⋮
∑𝑥 𝑖𝑝
𝑥 𝑖 1
𝑛 𝑖 =1
⋯ ∑𝑥 𝑖𝑝
𝑥 𝑖𝑝
𝑛 𝑖 =1
]
𝑝 ×𝑝 = [
𝑣 11
⋯ 𝑣 1𝑝 ⋮ ⋱ ⋮
𝑣 𝑝 1
⋯ 𝑣 𝑝𝑝
]
𝑝 ×𝑝
(III-2)
The eigenvalues of V can be calculated by first subtracting a scalar multiple of the
identity matrix:
𝐕 − 𝜆 𝐈 = 𝐕 − 𝜆 [
1 ⋯ 0
⋮ ⋱ ⋮
0 ⋯ 1
]
𝑝 ×𝑝 = [
𝑣 11
− 𝜆 ⋯ 𝑣 1𝑝 ⋮ ⋱ ⋮
𝑣 𝑝 1
⋯ 𝑣 𝑝𝑝
− 𝜆 ]
𝑝 ×𝑝
(III-3)
We then can solve the equation resulting from setting the determinant of the resulting
matrix to zero:
det(𝐕 − 𝜆 𝐈 )= 𝜆 𝑝 + 𝑐 𝑝 −1
𝜆 𝑝 −1
+ ⋯+ 𝑐 2
𝜆 + 𝑐 1
= 0
(III-4)
where we can obtain p eigenvalues which satisfy this equation. Once all the eigenvalues
were obtained, we can calculate corresponding eigenvectors to satisfy 𝐕 𝐞 𝒊 = 𝜆 𝑖 𝐞 𝑖 , which
71
can be rewritten as:
(𝐕 − 𝜆 𝑖 𝐈 )𝐞 𝑖 = [
𝑣 11
− 𝜆 ⋯ 𝑣 1𝑝 ⋮ ⋱ ⋮
𝑣 𝑝 1
⋯ 𝑣 𝑝𝑝
− 𝜆 ]
𝑝 ×𝑝 [
𝑒 1𝑖 ⋮
𝑒 𝑝𝑖
]
𝑝 ×1
= 0, where 𝑖 = 1,2,…,𝑝
(III-5)
(Note that equation III-4 and III-5 represent a theoretical way to compute eigenvalues and
eigenvectors, and which is not really practical or numerically stable in general. Functions
in SAS or R for computing eigenvalues and eigenvectors use much more efficient linear
algebra routines.) After eigenvectors were also calculated, data matrix X can then be
converted into a product with eigenvectors:
𝐗 𝑛 ×𝑝 = ∑𝐮 𝒊 𝑛 ×1
𝐞 𝑖 ′
𝑝 ×1
𝑝 𝑖 =1
(III-6)
where the u is a principal component, or score vector, and the e is the eigenvector, or
loading vector. Consequently, PCs can be calculated by weighting the observed variables
by the eigenvector component for each individual. We can write equation III-6 as:
PC
𝑖 = 𝐮 𝑖 𝑛 ×1
= ∑𝐗 𝑛 ×𝑝 𝐞 𝑖 ′
𝑝 ×1
𝑝 𝑖 =1
→ 𝐔 𝑛 ×𝑝 = 𝐗 𝑛 ×𝑝 𝐄 𝑝 ×𝑝
(III-7)
where the ith column of U is the ith PC. The largest variance from each projection in the
variables space is carried by the first PC, and the second greatest variance is carried by
the second PC, and so on:
var(𝐮 𝑖 )= 𝜆 𝑖 , where 𝜆 1
≥ 𝜆 2
≥ ⋯ ≥ 𝜆 𝑝
72
(III-8)
The sum of all eigenvalues is equal to the sum of variances of X (∑ 𝜆 𝑖 𝑝 𝑖 =1
= ∑ 𝑣 𝑖𝑖
𝑝 𝑖 =1
), so
a number of PCs can be determined to represent a certain proportion of the variance in
the original variables. Those PCs are then considered as the linear combinations or
components of the observed variables.
We can apply PCA to either the SNPs or the traits matrix to obtain corresponding
linear components. Given a threshold of proportion of variance explained by PCs, we can
determine a specific number of required PCs (counting from the first PC) to represent the
original data. Studies have shown that using a number of PCs instead of SNPs in
association analysis can reduce the number of tests with little loss of information
[Guaderman et al., 2007].
III.3. Canonical correlation analysis (CCA)
CCA is a classical multivariate procedure for accessing the relationship between
two sets of variables [Hotelling, 1936]. Under the presumption that we want to test
genetic associations between X (a SNPs data matrix with n individuals and p variables)
and Y (a traits data matrix with n individuals and q variables), the associations are
captured by the correlations between X and Y. The purpose of CCA is to fine pairs of
linear combinations of the Xs and the Ys, which have maximum correlation between
them and which are uncorrelated across pairs. Assuming variables in X and Y were
standardized, the covariance matrix of X and Y can be calculated:
73
Cov(𝐗 ,𝐘 )=
1
𝑛 − 1
[
∑𝑥 𝑖 1
𝑥 𝑖 1
𝑛 𝑖 =1
⋯ ∑𝑥 𝑖 1
𝑥 𝑖𝑝
𝑛 𝑖 =1
⋮ ⋱ ⋮
∑𝑥 𝑖𝑝
𝑥 𝑖 1
𝑛 𝑖 =1
⋯ ∑𝑥 𝑖𝑝
𝑥 𝑖𝑝
𝑛 𝑖 =1
∑𝑥 𝑖 1
𝑦 𝑖 1
𝑛 𝑖 =1
⋯ ∑𝑥 𝑖 1
𝑦 𝑖𝑞
𝑛 𝑖 =1
⋮ ⋱ ⋮
∑𝑥 𝑖𝑝
𝑦 𝑖 1
𝑛 𝑖 =1
⋯ ∑𝑥 𝑖𝑝
𝑦 𝑖𝑞
𝑛 𝑖 =1
∑𝑦 𝑖 1
𝑥 𝑖 1
𝑛 𝑖 =1
⋯ ∑𝑦 𝑖 1
𝑥 𝑖𝑝
𝑛 𝑖 =1
⋮ ⋱ ⋮
∑𝑦 𝑖𝑞
𝑥 𝑖 1
𝑛 𝑖 =1
⋯ ∑𝑦 𝑖𝑞
𝑥 𝑖𝑝
𝑛 𝑖 =1
∑𝑦 𝑖 1
𝑦 𝑖 1
𝑛 𝑖 =1
⋯ ∑𝑦 𝑖 1
𝑦 𝑖𝑞
𝑛 𝑖 =1
⋮ ⋱ ⋮
∑𝑦 𝑖𝑞
𝑦 𝑖 1
𝑛 𝑖 =1
⋯ ∑𝑦 𝑖𝑞
𝑦 𝑖𝑞
𝑛 𝑖 =1
]
=
1
𝑛 − 1
[
𝐗 ′
𝐗 𝐗 ′
𝐘 𝐘 ′
𝐗 𝐘 ′
𝐘 ] = [
𝐒 𝐗𝐗
𝐒 𝐗𝐘
𝐒 𝐘𝐗
𝐒 𝐘𝐘
]
(𝑝 +𝑞 )×(𝑝 +𝑞 )
(III-9)
(Note here I calculated the sample covariance, but in order to express the method formula
in a general concept, I used the notation of population covariance (∑ ) in the following
equations.) Specifically, CCA seeks to maximize:
𝜌 = Corr(𝐚 ′
𝐗 ,𝐛 ′𝐘 )=
𝐚 ′∑ 𝐛 𝐗𝐘
√𝐚 ′∑ 𝐚 𝐗𝐗
√𝐛 ′∑ 𝐛 𝐘𝐘
(III-10)
where ρ is the correlation (called canonical correlation) between the linear components
defined by the vector of coefficients, a and b. Assuming that 𝚺 𝐗𝐗
and 𝚺 𝐘𝐘
are nonsingular
matrices and therefore positive definite, I introduce the symmetric square root matrices
𝚺 𝐗𝐗
1/2
and 𝚺 𝐘𝐘
1/2
defined so that 𝚺 𝐗𝐗
= 𝚺 𝐗𝐗
1/2
𝚺 𝐗𝐗
1/2
and 𝚺 X𝐗 −1
= 𝚺 𝐗𝐗
−1/2
𝚺 𝐗𝐗
−1/2
. Square roots
always exist for positive definite matrices and can be found using the QR or the singular
value decompositions. Let 𝐜 = 𝚺 𝐗𝐗
1/2
𝐚 and 𝐝 = 𝚺 𝐘𝐘
1/2
𝐛 , so 𝐚 = 𝚺 𝐗𝐗
−1/2
𝐜 and 𝐛 = 𝚺 𝐘𝐘
−1/2
𝐝 .
Then equation III-10 can be rewritten as:
74
𝜌 =
𝐚 ′∑ 𝐛 𝐗𝐘
√𝐚 ′∑ 𝐚 𝐗𝐗
√𝐛 ′∑ 𝐛 𝐘𝐘
=
𝐜 ′𝚺 𝐗𝐗
−1/2
𝚺 𝐗𝐘
𝚺 𝐘𝐘
−1/2
𝐝 √𝐜 ′𝐜 √𝐝 ′𝐝
(III-11)
By the Cauchy-Schwarz inequality:
𝐜 ′𝚺 𝐗𝐗
−1/2
𝚺 𝐗𝐘
𝚺 𝐘𝐘
−1/2
𝐝 ≤ (𝐜 ′𝚺 𝐗𝐗
−1/2
𝚺 𝐗𝐘
𝚺 𝐘𝐘
−1
𝚺 𝐘𝐗
𝚺 𝐗𝐗
−1/2
𝐜 )
𝟏 /𝟐 (𝐝 ′𝐝 )
𝟏 /𝟐
(III-12)
Since 𝚺 𝐗𝐗
−1/2
𝚺 𝐗𝐘
𝚺 𝐘𝐘
−1
𝚺 𝐘𝐗
𝚺 𝐗𝐗
−1/2
is a 𝑚 × 𝑚 symmetric matrix, the maximization results
yields:
𝐜 ′𝚺 𝐗𝐗
−1/2
𝚺 𝐗𝐘
𝚺 𝐘𝐘
−1
𝚺 𝐘𝐗
𝚺 𝐗𝐗
−1/2
𝐜 ≤ λ
1
𝐜 ′𝐜
(III-13)
where λ
1
is the largest eigenvalue of 𝚺 𝐗𝐗
−1/2
𝚺 𝐗𝐘
𝚺 𝐘𝐘
−1
𝚺 𝐘𝐗
𝚺 𝐗𝐗
−1/2
. We then can use the same
approach as equation III-3 to obtain eigenvalues λ from this symmetric matrix.
Accordingly, the corresponding eigenvectors e can be calculated using the same approach
as equations III-4 and III-5. Once the eigenvectors are obtained, a and b can be calculated
as:
𝐚 𝑖 = 𝚺 𝐗𝐗
−1/2
𝐞 𝑖
𝐛 𝑖 = 𝚺 𝐘𝐘
−1/2
𝚺 𝐘𝐗
𝐚 𝑖
(III-14)
where 𝒃 𝑖 must be scaled by 𝐛 𝑖 ′
𝚺 𝐘𝐘
𝐛 𝑖 = 1. We can see that 𝐚 𝑖 and 𝐛 𝑖 constitute a pair of
canonical coefficients. The number of non-zero solutions are limited by the smallest
dimension of X and Y. That means, for example, if X dimension was more than Y (p >
75
q), then the maximum number of pairs of canonical correlations will be equal to q.
Similar to equation III-7, a and b can be used to calculate canonical variables 𝐫 𝐗 and 𝐫 𝐘 ,
defined as:
𝐫 𝐗 𝑖 𝑛 ×1
= ∑𝐗 𝑛 ×𝑝 𝐚 𝑖 𝑝 ×1
𝑡 𝑖 =1
→ 𝐑 𝐗 𝑛 ×𝑞 = 𝐗 𝑛 ×𝑝 a
𝑝 ×𝑞
𝐫 𝐘 𝑖 𝑛 ×1
= ∑𝐘 𝑛 ×𝑞 𝐛 𝑖 𝑞 ×1
𝑡 𝑖 =1
→ 𝐑 𝐘 𝑛 ×𝑞 = 𝐘 𝑛 ×𝑞 b
𝑞 ×𝑞
(III-15)
where the ith pair of 𝐫 𝐗 and 𝐫 𝐘 is the ith pair of canonical variables. Each pair of
canonical variables is uncorrelated to each other. The first pair of canonical variables has
the largest correlation
ρ, the second pair the second largest, and so on:
corr(𝐫 𝐗 𝑖 ,𝐫 𝐘 𝑖 )= 𝜌 𝑖 , where 𝜌 1
≥ 𝜌 2
≥ ⋯ ≥ 𝜌 𝑞
(III-16)
One can test the genetic associations between SNPs and traits by testing the correlation
between the canonical variables. Similar to testing using principal components, given a
proportion of variance explained by the canonical variables, we can select a set of
canonical variables for testing associations.
III.4. Partial least square (PLS)
PLS is a method to find the “best” linear relationship between projections of X
and Y to a new space respectively [Geladi and Kowalski, 1983]. Analogous to CCA, the
pairs of linear components obtained from using PLS are linear combinations of X and Y
76
but with maximum covariance rather than correlation, so the results of PLS and CCA are
typically different [Sun et al., 2009]. PLS regression is particularly useful when the
number of observed variables is larger than the number of individuals, or when there
exists collinearity among variables. When PLS was first developed, the idea was
formulated in two parts: the outer relations of X and Y in their original variable spaces,
and the inner relation in another variable space. The outer relation for X can be expressed
as:
𝐗 𝑛 ×𝑝 = ∑𝐮 𝒊 𝑛 ×1
𝐩 𝑖 ′
𝑝 ×1
𝑎 𝑖 =1
+ 𝐄 = 𝐔𝐏 + 𝐄
(III-17)
and the outer relation for Y can be built in the same way:
𝐘 𝑛 ×𝑞 = ∑𝐯 𝒊 𝑛 ×1
𝐪 𝑖 ′
𝑞 ×1
𝑎 𝑖 =1
+ 𝐅 = 𝐕𝐐 + 𝐅
(III-18)
where a is the number of pairs of components, 𝐮 𝑖 and 𝐯 𝑖 are the scores, 𝐩 𝑖 and 𝐪 𝑖 are the
eigenvectors, and E and F are residuals for each model. Given the dimension of X was
more than Y (p > q), when a = p, covariance of X would be able to be totally explained
by UP and thus E = 0. In order to explain covariance of X and Y as much as possible in
the same time, the intention here is to seek one best relation between X and Y to explain
Y as much as possible too and thus make F as small as possible. Consequently, we need
to build another relation between U and V (the inner relation) to achieve this purpose. For
the i
th
pair of component, the inner relation can be simply expressed as:
77
𝐯 𝑖 𝑛 ×1
= 𝛽 𝐮 𝑖 𝑛 ×1
(III-19)
This is essential a regression model, and 𝛽 play a role of the regression coefficient in the
inner relation. However, this model is not the best possible at this point, because the
linear components are calculated from X and Y in outer relations without considering the
inner relation. In order to improve the inner relation, the whole PLS process can be
written in the following sequence:
1) take 𝐯 𝑠𝑡𝑎𝑟𝑡 = a score vector y
2) calculate 𝐩 ′=
𝐯 ′𝐗 𝐯 ′𝐯
3) calculate 𝐩 𝑛𝑒𝑤 ′=
𝐩 𝑜𝑙𝑑 ′
‖𝐩 𝑜𝑙𝑑 ′‖
( 𝐩 𝑜𝑙𝑑 is the p from step 2; 𝐩 𝑛𝑒𝑤 is the p for step 4)
4) calculate 𝐮 = 𝐗𝐩
5) calculate 𝐪 ′=
𝐮𝐘
𝐮 ′𝐮
6) calculate 𝐪 𝑛𝑒𝑤 ′=
𝐪 𝑜𝑙𝑑 ′
‖𝐪 𝑜𝑙𝑑 ′‖
( 𝐪 𝑜𝑙𝑑 is the q from step 5; 𝐪 𝑛𝑒𝑤 is the q for step 7)
7) calculate 𝐯 = 𝐘𝐪
8) use the v from step 7 to calculate u again using step 2 to 4
When both u are equal (or within a certain rounding error), we stop and obtain the first
pair of u and v. Next, we can replace X and Y with E and F respectively and repeat the
same procedure to obtain the second pair of u and v, and so on. This procedure was
known as nonlinear iterative partial least squares (NIPALS) [Geladi and Kowalski,
1983]. Each pair of linear components can be defined as:
78
𝐬 𝐗 𝑖 𝑛 ×1
= ∑𝐗 𝑛 ×𝑝 𝐮 𝑖 𝑝 ×1
𝑎 𝑖 =1
→ 𝐒 𝐗 𝑛 ×𝑎 = 𝐗 𝑛 ×𝑝 U
𝑝 ×𝑎
𝐬 𝐘 𝑖 𝑛 ×1
= ∑𝐘 𝑛 ×𝑞 𝐯 𝑖 𝑞 ×1
𝑎 𝑖 =1
→ 𝐒 𝐘 𝑛 ×𝑎 = 𝐘 𝑛 ×𝑞 V
𝑞 ×𝑎
(III-20)
We can define a, the number of selected linear components pairs, with various ways. For
example, we can set a certain amount covariance explained by pairs of linear components
(𝐒 𝐗 and 𝐒 𝐘 ) as a threshold, or we can use cross-validation approach [Mevik and
Cederkvist, 2004]. The relationship for covariance of components can be expressed as:
cov(𝐒 𝐗 𝑖 ,𝐒 𝐘 𝑖 )= 𝜎 𝑖 , where 𝜎 1
≥ 𝜎 2
≥ ⋯ ≥ 𝜎 𝑎
(III-21)
Note that the maximum number a can attain would be equal to min(p,q). After all linear
components were calculated, we can select a set of them to use in a regression model to
test for genetic association between multiple SNPs and traits.
Due to the concern that the NIPALS algorithm only shows how to seek the inner
relation but does not clearly point out how to maximize the covariance between X and Y,
an alternative approach to PLS, SIMPLS, was proposed [de Jong, 1993]. The major
difference between SIMPLS and NIPALS in terms of algorithm is that SIMPLS avoids
partitioning the X matrix and thus is more computationally efficient. For univariate y,
SIMPLS is entirely equivalent to NIPALS. For multivariate Y, there is a slight difference
between SIMPLS and NIPALS [de Jong, 1993].
79
III.5. A unified approach to PCA, PLS, and CCA
PCA, PLS, and CCA find linear combinations of the column vectors of X and Y,
Xa′ and Yb′ maximizing the variance, covariance, and correlation respectively:
PCA:max
a
Var( Xa′),max
b
Var( Yb′)
PLS:max
a, b
Cov( X a
′
, Yb′)
CCA:max
a, b
Cor( X a
′
, Yb′)
(III-22)
where have constraints that 𝐚 ′
𝐚 = 1 and 𝐛 ′
𝐛 = 1.
Borga et al. showed that all three conditions above could be rewritten as a
Rayleigh quotient [Horn and Johnson, 1985] for a generalized eigenvalue problem [Borga
et al. 1992]. This provides a unified way to solve the three problems more efficiently
compared to the traditional ways mentioned in the previous sections.
Principal component analysis (PCA)
PCA was designed to find vectors a and b which maximize variance of
X a
′
and Yb′ respectively:
Var( Xa′)= a
′
Var( X)𝐚 = 𝐚 ′𝚺 𝐗𝐗
𝐚
Var( Yb′)= b
′
Var( Y)𝐛 = 𝐛 ′𝚺 𝐘𝐘
𝐛
(III-23)
80
The calculations for finding a and b would be performed separately and independently,
so here we take 𝐚 ′𝚺 𝐗𝐗
𝐚 as an example. We could write 𝐚 ′𝚺 𝐗𝐗
𝐚 in a Rayleigh quotient
form:
𝐚 ′
𝚺 𝐗𝐗
𝐚 =
𝐚 ′
𝚺 𝐗𝐗
𝐚 𝐚 ′𝐚 = ℛ
(III-24)
Next use Lagrange multiplier technique to maximize 𝐚 ′
𝚺 𝐗𝐗
𝐚 while subjecting 𝐚 ′
𝐚 = 1:
ℒ = 𝐚 ′
𝚺 𝐗𝐗
𝐚 −
𝜃 2
(𝐚 ′
𝐚 − 1)
(III-25)
By taking the derivative of ℒ, we get the equation:
𝑑 ℒ
𝑑 a
= 𝚺 𝐗𝐗
𝐚 − 𝜃 𝐚 = 0 → 𝚺 𝐗𝐗
𝐚 = 𝜃 𝐚
(III-26)
which shows that a is an eigenvector of 𝚺 𝐗𝐗
with eigenvalue 𝜃 . Therefore, a and 𝜃 can be
computed using standard numerical linear algebra algorithms. The first eigenvalue is the
largest maximized variance of Xa′, and the second eigenvalue is the second largest
maximized variance, and so on. The process is the same for finding b for Y.
The process of using Rayleigh quotient and Lagrange multipliers shows that PLS
and CCA in equation III-22 can also be recognized as seeking solutions of an eigenvalue
problem but involving covariance matrices of X and Y.
81
Partial least square (PLS)
PLS was designed to find a pair of vectors (𝐚 ,𝐛 ) which maximize the covariance
of X a
′
and Yb′ :
Cov(Xa
′
, Yb′)= a
′
Cov(X, Y)𝐛 = 𝐚 ′𝚺 𝐗𝐘
𝐛
(III-27)
Then a Rayleigh quotient form can be written as:
𝐚 ′
𝚺 𝐗𝐘
𝐛 =
𝐚 ′
𝚺 𝐗𝐘
𝐛 √𝐚 ′𝐚𝐛 ′ 𝐛 = ℛ
(III-28)
Use Lagrange multiplier to maximize 𝐚 ′
𝚺 𝐗𝐘
𝐛 while subjecting 𝐚 ′
𝐚 = 1,𝐛 ′
𝐛 = 1:
ℒ = 𝐚 ′
𝚺 𝐗𝐘
𝐛 −
𝜃 2
(𝐚 ′
𝐚 − 1)−
𝜃 2
(𝐛 ′
𝐛 − 1)
(III-29)
Differentiating ℒ with respect to a:
𝑑 ℒ
𝑑 a
= 𝚺 𝐗𝐘
𝐛 − 𝜃 𝐚 = 0 → 𝚺 𝐗𝐘
𝐛 = 𝜃 𝐚 → 𝐚 = 𝜃 −1
𝚺 𝐗𝐘
𝐛
𝑑 ℒ
𝑑 b
= 𝚺 𝐘𝐗
𝐚 − 𝜃 𝐛 = 0 → 𝚺 𝐘𝐗
𝐚 = 𝜃 𝐛 → 𝐛 = 𝜃 −1
𝚺 𝐘𝐗
𝐚
which gives eigenvalue equation forms:
𝜃 −1
𝚺 𝐗𝐘
𝚺 𝐘𝐗
𝐚 = 𝜃 𝐚 → 𝚺 𝐗𝐘
𝚺 𝐘𝐗
𝐚 = 𝜃 2
𝐚
𝜃 −1
𝚺 𝐘𝐗
𝚺 𝐗𝐘
𝐛 = 𝜃 𝐛 → 𝚺 𝐘𝐗
𝚺 𝐗𝐘
𝐛 = 𝜃 2
𝐛
(III-30)
82
where 𝜃 2
is the eigenvalue solved from doing eigen-decomposition on either 𝚺 𝐗 𝐘 𝚺 𝐘𝐗
or
𝚺 𝐘𝐗
𝚺 𝐗𝐘
, and (𝐚 ,𝐛 ) are the corresponding eigenvectors. When p was not equal to q, say
p>q, the first q 𝜆 calculated from both equations would be equivalent, and the rest would
be zero. The first eigenvalue is the maximized covariance of X
1
a
′
and X
2
b′ , and the rest
follow a descending order to the last.
Without using the complex iteration steps in NIPALS or SIMPLS, here we found
that all pairs of (𝐚 ,𝐛 ) could be calculated at once by solving a simple eigen-
decomposition problem.
Canonical correlation analysis (CCA)
CCA finds a pair of vectors (𝐚 ,𝐛 ) which maximize correlation of X a
′
and Yb
′
:
Cor( X a
′
, Yb′)=
Cov( X a
′
, Yb′)
√Cov( X a
′
) Cov( Yb′ )
=
𝐚 ′𝚺 𝐗𝐘
𝐛 √𝐚 ′𝚺 𝐗𝐗
𝐚𝐛 ′𝚺 𝐘𝐘
𝐛
(III-31)
A Rayleigh quotient form can be written as:
𝐚 ′𝚺 𝐗𝐘
𝐛 √𝐚 ′𝚺 𝐗𝐗
𝐚𝐛 ′𝚺 𝐘𝐘
𝐛 = ℛ
(III-32)
Use Lagrange multiplier to maximize 𝐚 ′
𝚺 𝐗𝐘
𝐛 while subjecting 𝐚 ′𝚺 𝐗𝐗
𝐚 = 1,𝐛 ′𝚺 𝐘𝐘
𝐛 = 1:
ℒ = 𝐚 ′
𝚺 𝐗𝐘
𝐛 −
𝜃 2
(𝐚 ′𝚺 𝐗𝐗
𝐚 − 1)−
𝜃 2
(𝐛 ′𝚺 𝐘𝐘
𝐛 − 1)
(III-33)
83
Differentiating ℒ with respect to a:
𝑑 ℒ
𝑑 a
= 𝚺 𝐗𝐘
𝐛 − 𝜃 𝚺 𝐗𝐗
𝐚 = 0 → 𝚺 𝐗𝐘
𝐛 = 𝜃 𝚺 𝐗𝐗
𝐚 → 𝐚 = 𝜃 −1
𝚺 𝐗𝐗
−1
𝚺 𝐗𝐘
𝐛
𝑑 ℒ
𝑑 b
= 𝚺 𝐘𝐗
𝐚 − 𝜃 𝚺 𝐘𝐘
𝐛 = 0 → 𝚺 𝐘𝐗
𝐚 = 𝜃 𝚺 𝐘𝐘
𝐛 → 𝐛 = 𝜃 −1
𝚺 𝐘𝐘
−1
𝚺 𝐘𝐗
𝐚
which gives eigenvalue equations of the form:
𝜃 −1
𝚺 𝐗𝐘
𝚺 𝐘𝐘
−1
𝚺 𝐘𝐗
𝐚 = 𝜃 𝚺 𝐗𝐗
𝐚 → 𝚺 𝐗𝐗
−1
𝚺 𝐗𝐘
𝚺 𝐘𝐘
−1
𝚺 𝐘𝐗
𝐚 = 𝜃 2
𝐚
𝜃 −1
𝚺 𝐘𝐗
𝚺 𝐗𝐗
−1
𝚺 𝐗𝐘
𝐛 = 𝜃 𝚺 𝐘𝐘
𝐛 → 𝚺 𝐘𝐘
−1
𝚺 𝐘𝐗
𝚺 𝐗𝐗
−1
𝚺 𝐗𝐘
𝐛 = 𝜃 2
𝐛
(III-34)
where 𝜃 2
can be solved by performing an eigen-decomposition on either 𝚺 𝐗𝐗
−1
𝚺 𝐗𝐘
𝚺 𝐘𝐘
−1
𝚺 𝐘𝐗
or 𝚺 𝐘𝐘
−1
𝚺 𝐘𝐗
𝚺 𝐗𝐗
−1
𝚺 𝐗𝐘
. When p was not equal to q, say p>q, the first q 𝜃 2
calculated from
both equations would be equivalent, and the rest would be zero. The first eigenvalue is
the maximized squared correlation of X a
′
and Yb′, and the rest follow a descending
correlation order.
Here we found CCA can also be computed in a simple and efficient way. In
addition, we get an insight that the nature of CCA is to find eigenvalue solutions from a
symmetric matrix which was composed by the covariance matrix between X and Y
conditional on the covariance of X and Y.
Based on the unified approach to PCA, PLS, and CCA, we learn that the three
DRMs are all seeking eigenvalue solutions from specific combinations of data covariance
matrices: PCA only involves the variance matrix of X and Y respectively; PLS only
considers the covariance matrix between X and Y; CCA considers both covariance
84
between X and Y and covariance of X and Y jointly. This approach also provides a much
faster and efficient way to calculate linear components from the three DRMs, which then
became an advantage to help make a more efficient hypothesis testing approach and
simulation design in the following work.
III.6. Hypothesis testing with DRMs
As mentioned in previous chapters, regression modeling is the most common
approach used to test for associations between genetic markers and traits. To address the
issues arising large number of correlated variables, dependent and independent variables
could be simply replaced with linear components in regression models when DRMs. By
using different types of regression models appropriately, this framework is still a
convenient and feasible approach for pathway analysis. However, testing whether
multiple genetic markers are associated with traits jointly can be achieved by simply
testing whether X and Y are correlated. Under a normality assumption, independence is
equivalent to non-correlation, and therefore the null hypothesis of no association becomes
𝐻 0
: 𝚺 𝐗𝐘
= 0. Accordingly, instead of building regression models from every individual in
a data, we can directly build a statistical test from a data covariance matrix. I take
advantage of this idea to achieve greater efficiency in the simulation studies introduced
later. The theoretical foundation for multivariate tests of correlation is as follows:
[ X,Y] ~𝑁 𝑝 +𝑞 (𝛍 ,𝚺 ),where 𝚺 = [
𝚺 𝐗𝐗
𝚺 𝐗𝐘
𝚺 𝐘𝐗
𝚺 𝐘𝐘
]
(III-35)
85
where 𝚺 is the population covariance matrix (a positive-semidefinite matrix) and X is
random sample from the multivariate normal distribution. The likelihood ratio test (LRT)
of 𝐻 0
: 𝚺 𝐗𝐘
= 0 versus 𝐻 1
: 𝚺 𝐗𝐘
≠ 0 rejects the null based on:
−2ln(Λ)= 𝑛 ln(
|𝐒 𝐗𝐗
| × |𝐒 𝐘𝐘
|
|𝐒 |
)
(III-36)
where 𝐒 = [
𝐒 𝐗𝐗
𝐒 𝐗𝐘
𝐒 𝐘𝐗
𝐒 𝐘𝐘
] is maximum likelihood estimate (MLE) of 𝚺 . Under the null
hypothesis 𝐻 0
, the MLE of 𝚺 is:
𝐒 0
= [
𝐒 𝐗𝐗
0
0 𝐒 𝐘𝐘
]
This test statistic is actually equivalent to the Wilk’s lambda test used in MANOVA.
Accordingly, performing association tests using DRMs can be based on preselected linear
components of X and Y which explain a proportion of 𝚺 such that:
𝐒 𝐗𝐗
̂
= Cov(XA
′
) , 𝐒 𝐘𝐘
̂
= Cov(YB
′
) , 𝐒 𝐗𝐘
̂
= Cov(XA
′
, YB′ )
(III-37)
so the LRT (equation III-36) became:
−2ln(Λ)= 𝑛 ln(
|𝐒 𝐗𝐗
̂
| × |𝐒 𝐘𝐘
̂
|
|𝐒 ̂
|
)
(III-38)
Under the alternative hypothesis,
𝐒 ̂
= [
𝐒 𝐗𝐗
̂
𝐒 𝐗𝐘
̂
𝐒 𝐘𝐗
̂
𝐒 𝐘𝐘
̂
]
86
Under the null hypothesis,
𝐒 ̂
= 𝐒 0
̂
= [
𝐒 𝐗𝐗
̂
0
0 𝐒 𝐘𝐘
̂
]
This test statistic follows an asymptotic chi-square distribution for pre-specified A and B.
However, because a subset of components is selected that A=A(X,Y) and B=B(X,Y), the
test statistic distribution is no longer asymptotically chi-squared distributed. We can use
permutation to obtain the test statistic distribution under the null that we repeat the test a
number of times and each time the order of X rows is reshuffled randomly. We then can
calculate an empirical p-value based on the distribution.
Sampling based p-values
While permutation testing can be used to derive an empirical p-value, the
performance of calculation can be inefficient with a large size of data. Therefore, we used
an alternative sampling method to estimate the null distribution of Wilks’ lambda in our
simulation study.
It is known that the sample variance-covariance matrix is 𝐒 0
̂
under the null. Given
sample size 𝑛 and 𝐒 0
̂
, we can sample 𝑚 covariance matrices from the Wishart
distribution. After calculating test statistic for each sample matrix, we can calculate a p-
value:
p-value =
sum I(𝑡 𝑖 > 𝑡 )
𝑚
where 𝑡 is the test statistic of data, and 𝑡 𝑖 is the Wilks’ lambda statistic at the i
th
sample
matrix (𝑖 = 1,⋯ ,𝑚 ) . This approach increases efficiency because the sampling method
only involves variance-covariance matrix itself and does not require computing from the
87
individual level data. We found the test statistic null distribution estimated using this
sampling method is approximately equal to the distribution estimated using permutation
based on our simulation results.
When there is no genetic effect, the power of every test is approximately equal to
the given 0.05 type 1 error rate in the simulations. In the comparisons of the null test
statistics distribution between permutation and sampling method, we found the
distributions are approximately the same (Figure III). The diversity on the curves at the
right tails implies sampling method slightly tends to be more conservative than
permutation.
III.7. Summary
In this chapter, we present a unified approach to perform PCA, PLS, and CCA.
The approach not only leads to faster computational implementations than traditional
approaches, but also provides insight into the differing nature of the three DRMs. We
also presented a statistical approach for testing associations between linear components
of multiple genetic variants and multiple traits jointly. This approach is the basis for
faster simulations in subsequent chapters.
Based on the understanding of the nature of three DRMs, we can already
anticipate that the performance of association tests based on linear components of SNPs
or traits obtained from PCA is going to be generally poor.. The reason is that,
associations between SNPs and traits are determined by the correlations between both,
but the PCs only point to directions of large variability within the SNPs and within the
88
traits but not necessarily to the correlations between the two. Therefore, PCA itself is not
expected to help improve the chances of capturing genetic associations with a smaller
number of components. Studies reporting improved power with using simple PCA or
factor analysis (FA), are due to limited scenarios in which the directions of large
variability among the SNPs align with directions of large correlation with the traits.
However, when the under scenarios where this is not the case, performing tests with PCA
would not be able to improve power. Conversely, because PLS and CCA rely on
information on the correlations between SNPs and traits, testing on linear components
obtained from the two DRMs would be expected to have improved power. We provide
supportive evidence in the following simulation study.
89
Table and Figures (Chapter III)
Figure III. Tail probability of Wilks’ lambda (1,000,000 permutations versus 1,000,000 samples)
under the null hypothesis in high dimensional simulation setting (Chapter IV.4).
Wilk’s lambda
90
IV. Evaluation of Dimensionality Reduction Methods for Testing Associations with
Multivariate Traits
IV.1. Sparsity of genetic effects
In the previous chapter it was shown that the data covariance matrix contains all
the information required to apply the three classical dimensionality reduction methods
(DRMs) and to perform the corresponding statistical test (equation III-36) of associations
between multiple SNPs and multiple traits. For association data including SNP and trait
variables, the covariance matrix can be partitioned into 1) covariance between the SNPs;
2) covariance between multiple traits, and 3) covariance between multiple SNPs and
traits. Principal component analysis, partial least square, and canonical correlation
analysis take different pieces of the information to calculate their linear components. To
test the association between a set of SNPs and a set of traits using DRMs one can test
whether the covariance between linear components of SNPs and traits is equal to zero
(equations III-36 and III-37). The magnitude of the covariance (or correlation) within and
between SNPs and traits, and their distribution in the matrix are the key determinants of
power for a test of association. In general we expect only a relatively small proportion of
non-zero SNP-trait correlations. In other words, we expect the correlation matrix to be
sparse. We informally refer to the proportion and arrangement of non-zero elements of
SNP-trait the correlation matrix as the ‘sparsity of the genetic effects’. In general,
different sparsity patterns would be captured by different linear components from DRMs
affecting the power to detect associations. To investigate the relationship between
sparsity and power I conduct an extensive simulation study.
91
IV.2. Low-dimensional setting
Simulation setting
In order to gain insight about the relationship between sparsity of genetic effects
and the power to detect those using DRMs, I first conduct a simulation in a low-
dimensional setting with relatively few simulation parameters. Since all three DRMs and
their corresponding statistical tests only require the data covariance matrix, we directly
simulated sample covariance matrices rather than generating individual level genotypic
and trait data. This approach is not only computationally efficient but it also allows to
focus on the key driver of power –the correlation structure– rather than on parameters of
secondary importance such as allele frequencies, Hardy Weinberger equilibrium, etc. I
assumed 4 SNPs (X) and 4 traits (Y) and a corresponding 8 by 8 joint population
correlation matrix, which can be partitioned as:
Σ = [
𝚺 𝐗𝐗
𝚺 𝐗𝐘
𝚺 𝐘𝐗
𝚺 𝐘𝐘
] =
[
[
1
1
𝜌 𝐗𝐗
𝜌 𝐗𝐗
1
1
]
4×4
𝛒 𝐗𝐘
4×4
𝛒 𝐘𝐗
4×4
[
1
1
𝜌 𝐘𝐘
𝜌 𝐘𝐘
1
1
]
4×4
]
8×8
(IV-1)
where 𝜌 𝐗𝐗
and 𝜌 𝐘𝐘
are correlations within X and Y respectively ranging between 0 and 1.
𝛒 𝐘𝐗
is a 4 by 4 parameters matrix of correlations between SNPs X and traits Y, and 𝛒 𝐘𝐗
is
the transpose of 𝝆 𝐗𝐘
. The correlation 𝜌 𝐗𝐗
captures the amount of linkage disequilibrium
between the four SNPs: 𝜌 𝐗𝐗
= 0 corresponds to SNPs in linkage equilibrium on the same
92
chromosome or in different chromosomes (but considered to belong to one biological
pathway); 𝜌 𝐗𝐗
≠ 0 corresponds to SNPs in linkage disequilibrium (LD). For the traits Y
𝜌 𝐘𝐘
captures the correlation among the trait components. ρ
𝐗𝐘
is the correlation matrix
between X and Y and specifies the structure of the genetic effects. We controlled the
parameters 𝜌 𝐗𝐗
, 𝜌 𝐘𝐘
, and ρ
𝐗𝐘
and computed the corresponding power performance. We
specified four types of covariance structures of ρ
𝐗𝐘
:
ρ
𝐗𝐘
𝐴 = [
0 0 0 𝜌 0 0 0 𝜌 0 0 0 𝜌 0 0 0 𝜌 ] , ρ
𝐗𝐘
𝐵 = [
0 0 𝜌 0
0 0 0 𝜌 0 0 0 𝜌 0 0 0 𝜌 ]
ρ
𝐗𝐘
𝐶 = [
0 𝜌 0 0
0 0 𝜌 0
0 0 0 𝜌 0 0 0 𝜌 ], ρ
𝐗𝐘
𝐷 = [
𝜌 0 0 0
0 𝜌 0 0
0 0 𝜌 0
0 0 0 𝜌 ]
(IV-2)
where 𝜌 is the correlation between a SNP-trait pair (the magnitude of genetic effect) with
equal probability of a positive or a negative sign. ρ
𝐗𝐘
𝐴 represents a scenario where all
four SNPs are correlated to a single trait; ρ
𝐗𝐘
𝐵 corresponds to three SNPs correlated with
one trait, and the other SNP correlated with another trait; ρ
𝐗𝐘
𝐶 represents two SNPs
correlated to one trait, and the each of the remaining SNPs correlated with a different
trait; ρ
𝐗𝐘
𝐷 represents a scenario with each SNP correlated to a different trait.
For each given population covariance matrix, we performed association tests with
the three DRMs in 10,000 sample covariance matrices sampled from the Wishart
distribution assuming 100 individuals per dataset. The distribution of test statistics under
the null hypothesis 𝐻 0
: 𝚺 𝐗𝐘
= 0 was estimated from 10,000 sample covariance matrices
93
sampled from the Wishart distribution when 𝚺 𝐗𝐘
was set to be zero. In addition to using
DRMs, we also performed tests using classical regression models, including simple linear
regression (SLR), multiple linear regression (MLR), and MANOVA. For SLR models we
adjusted for multiple testing using a Bonferroni correction. This is probably the most
commonly used approach for testing joint genetic associations between multiple SNPs
and multiple traits. For MLR model and MANOVA, power was calculated based on the
same sampling scheme used for the DRM approaches.
Null hypothesis
Under the null hypothesis that there were no correlations between genetic markers
and traits (𝚺 𝐗𝐘
= 0), the estimated powers for all statistical tests (testing one SNP versus
one trait using SLR, testing all SNPs versus one trait using MLR, testing all SNPs versus
all traits using MANOVA and DRM-based approaches) were approximately equal to the
target 5% type I error rate (Table IV.2.0).
Independent SNPs and traits
For testing the alternative hypothesis, at first, we assumed there was no
correlations between SNPs (𝜌 𝐗𝐗
= 0), between traits (𝜌 𝐘𝐘
= 0), and correlations between
SNPs and traits were equal to 0.2 (𝜌 = 0.2). According to the population covariance form
showed in equation IV-1, when four SNPs were correlated to one trait (𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐴 ) , the
given population covariance matrix would be:
94
Σ = [
𝚺 𝐗𝐗
𝚺 𝐗 𝐘 𝚺 𝐘𝐗
𝚺 𝐘𝐘
] =
[
1 0 0 0 2 . 2 . 2 . 2 .
0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0
2 . 0 0 0 1 0 0 0
2 . 0 0 0 0 1 0 0
2 . 0 0 0 0 0 1 0
2 . 0 0 0 0 0 0 1
]
We could use the same way draw the population covariance matrix structures for the
other three types of sparsity of genetic effects (𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐵 ,𝛒 𝐗𝐘
𝐶 ,𝛒 𝐗𝐘
𝐷 ) .
The power performances of each type of approaches under the four types of
sparsity are showed in Table IV.2.1. For ρ
𝐗𝐘
𝐴 that four SNPs were correlated to the same
trait, the power of testing the four genetic effects using SLR is around 51-52% (without
considering of Bonferroni Correction), and the power of testing the others have no
genetic effects is ~5% (the presumed α-level). This means, if we knew where the genetic
effects located in 𝚺 𝐗𝐘
and went test them directly without including the penalty of
significance levels from multiple comparisons, the power for each test is 51-52%.
However, we wound not know where the genetic effects located in 𝚺 𝐗𝐘
in a real data
analysis and thus we needed to consider the fact that we tested all possible pairs of SNPs
and traits. Therefore, the power of testing each individual genetic effect using SLR would
become much lower after correcting for multiple comparisons. For doing a joint tests
using SLR with Bonferroni corrections, the power of such test is 53.65%. Similar results
were found in the other three setting of sparsity. This finding indicated that, because SLR
was essentially used to test the correlation in each pair of SNPs and traits individually
95
and the joint test was done by directly combining those results afterward, the sparsity was
not related to the power in these SLR-based tests.
For MLR that tested four SNPs versus one trait in each model when 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐴 ,
the powers were 5.52%, 5.01%, 5.25%, 97.51%, for each trait respectively (Table
IV.2.1.). In the first three models, given the traits were not related to any SNPs, the
corresponding power should be the presumed α-level. As to the last model that four SNPs
all carried genetic effects to the one trait, MLR showed a very high power. The
corresponding joint test included the four models has an 84.86% high power as well.
When 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐵 that one SNP correlated to one trait and three SNPs correlated to
another, the first two MLR models which have no given genetic effects showed ~5%
power. The third model which included one causal SNP has 30.27% power, and the last
model which included three causal SNPs has 82.54% power. Because the genetic effects
were spread into two MLR models, so the highest power found previously decreased. The
power of joint test also decreased to 71.19%. Same reason can be used to explain the
results when 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐶 and ρ
𝐗𝐘
𝐷 in which power of joint tests dropped to 58.07% and
49.9% respectively. From the results, we can see that sparsity of genetic effects has huge
impact on MLR-based approaches. When different SNPs have effects to different traits,
the MLR-based approaches even have worse power than SLR-based approaches.
MANOVA and DRM-based approaches were designed to detect associations
between two sets of variables in one test, so their tests are the direct joint test for testing
joint genetic effects. MANOVA showed powers 74.9%, 71.06%, 68.45%, 70.98% under
96
𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐴 ,𝛒 𝐗𝐘
𝐵 ,𝛒 𝐗𝐘
𝐶 , ρ
𝐗𝐘
𝐷 (Table IV.2.1.). The power performance is relatively good
in general, and remained stable across different sparsity (Figure IV.2.1.).
For PCA, when 𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐴 , we found the power is poor at 21.44% when only
using the first pairs of linear components in testing joint genetic effects. By increasing the
number of linear components, the power increased to 39.39%, 56.8%, and 74.9% when
the first two, three, and four linear components were used respectively. Similar power
patterns were found for the other three sparsity structures (Figure IV.2.1.). PLS and CCA
performed similarly that, when 𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐴 , the highest power (78.43% and 78.95%)
were observed under testing the first pairs of linear components. When more linear
components were included, the power became lower (Figure IV.2.1.). However, given the
total number of causal SNPs are the same but they distributed to more than one trait
(𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐵 ,𝛒 𝐗𝐘
𝐶 ,𝛒 𝐗𝐘
𝐷 ), PLS and CCA would need to include more linear components
to achieve higher power (Figure IV.2.1.). For all three DRMs, when using all linear
components in the tests, the determinants of 𝐒 𝐗𝐗
̂
, 𝐒 𝐘𝐘
̂
, and 𝐒 𝐗𝐘
̂
(equation III-37) would be
equal to the determinants of the original sample covariance matrices 𝐒 𝐗𝐗
, 𝐒 𝐘𝐘
, and 𝐒 𝐗𝐘
(equation III-36), so that the power of all three DRMs would be identical to MANOVA.
In summary, for testing joint genetic effects, when SNPs are independent and so
are traits, the power of using SLR are similar across different sparsity. The best power
were found when all four SNPs have effects to one trait and testing using MLR, but MLR
power decreased rapidly and even became lower than SLR when each SNP has effect to
each trait respectively. MANOVA has relatively good power performance across four
types of sparsity. Compared to MLR, MANOVA is more stable from the impact of
97
sparsity of genetic effects. The worst power was found when testing with the first pair of
PCA linear components across four sparsity structures. That could be improved by
including more linear components, but no better than MANOVA. PLS and CCA have
better power than MANOVA when only using the first pair of linear components under
the situation that all four SNPs have effects to one trait, but no better than MLR.
The important message here is, for testing genetic associations, no matter
individually or jointly, the power of using SLR would never be the best. When MLR
perfectly captured the sparsity structure, it would surely have the best power. However, in
real data analysis, we do not know the true underlying sparsity of genetic effects. If we
took wrong guess of the sparsity, using MLR could have worse power than the other
approaches. For PCA, this simulation results clear indicated that, because the PCA linear
components were calculated independent to the correlation between SNPs and traits, they
help nothing to capture the underlying sparsity. Consequently, power of using the first
few PCA linear components is very poor, even worse than using SLR-based approaches.
On the other hand, PLS and CCA included the covariance between SNPs and traits to
calculate the linear components. Under the sparsity that four SNPs have effects to one
trait, they can even achieve better power than MANOVA and close to MLR when testing
with the first pair of linear components. This can be comprehended that PLS and CCA
first generated linear components close to the perfect MLR model in that case, so the best
power was found when testing only the first pair. While more linear components were
included, the power then decreased. Nevertheless, when the genetic effects spread to
different traits, more linear components would be needed to represent such sparsity, so
need more linear components to achieve higher power.
98
Correlated SNPs and traits
In the real data analysis, it is more likely that multiple SNPs are correlated and
multiple traits are also correlated. We mentioned the importance of this issue in the
background of pathway analysis previously, and which is the actual core issue we
proposed to address in testing genetic associations jointly.
From Table IV.2.2. to Table IV.2.6., we showed the power performance for all
approaches while the correlations between SNPs (𝜌 𝐗𝐗
) and between traits (𝜌 𝐘𝐘
) increased
from 0.1 to 0.5. According to the population covariance form showed in equation IV-1,
when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.5 and 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐴 , the given population covariance matrix would
be:
Σ = [
𝚺 𝐗𝐗
𝚺 𝐗𝐘
𝚺 𝐘𝐗
𝚺 𝐘𝐘
] =
[
1 5 . 5 . 5 . 2 . 2 . 2 . 2 .
5 . 1 5 . 5 . 0 0 0 0
5 . 5 . 1 5 . 0 0 0 0
5 . 5 . 5 . 1 0 0 0 0
2 . 0 0 0 1 5 . 5 . 5 .
2 . 0 0 0 5 . 1 5 . 5 .
2 . 0 0 0 5 . 5 . 1 5 .
2 . 0 0 0 5 . 5 . 5 . 1
]
We could use the same way draw the population covariance matrix structures for the
other three types of sparsity of genetic effects (𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐵 ,𝛒 𝐗𝐘
𝐶 ,𝛒 𝐗𝐘
𝐷 ) . In this section,
we focus on how the power of each test were impacted by the correlations issue, so we
would not list detailed values of power for each test under different setting of
correlations. Please view Tables IV.2.2. to IV.2.6. for the power values. Figures IV.2.2.
99
to IV.2.6. show the power patterns of joint tests under different types of sparsity across
different correlations of SNPs and traits.
When both 𝜌 𝐗𝐗
and 𝜌 𝐘𝐘
increased, SLR-based approaches showed unapparent
changes in power compared to when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0 across different sparsity of genetic
effects. The correlations of SNPs and traits did not really make the joint tests using SLR
drop much power after correcting for multiple comparisons. For MLR-based approaches,
the same power patterns across the sparsity of genetic effects remained, plus the power
increased with the increasing 𝜌 𝐗𝐗
(not with 𝜌 𝐘𝐘
, will explain in the next section).
MANOVA, as expected, showed improved power performance with the increasing of
both 𝜌 𝐗𝐗
and 𝜌 𝐘𝐘
. When 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
≥ 0.3, the power of MANOVA became even higher
than MLR when 𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐴 .
Figure IV.2.7. presented the power performance in combination with correlations
of SNPs and traits and the number of selected linear components. For PCA, we observed
power of testing with the first pair of linear components actually decreased when both
𝜌 𝐗𝐗
and 𝜌 𝐘𝐘
increased. It was known that, when higher level of correlations existed in
SNPs and traits respectively, the first few pairs of linear components calculated from
PCA would explain more covariance of SNPs and traits. For example, when using PCA
to capture the LD across SNPs, the higher LD, the less principal components (selected
from the first) would be needed to represent a given variance of SNP data. However, we
already knew those variances did not mean to indicate the correlations between SNPs and
traits, so they simply created more variation when estimating the correlations between
SNPs and traits. Consequently, testing the first few PCA linear components resulted in
100
even lower power under multiple correlated SNPs and traits. For PLS, when the first few
pairs of linear components were used in testing joint genetic effects, the power did not
change much with the increase of 𝜌 𝐗𝐗
and 𝜌 𝐘𝐘
. This is reasonable because PLS calculated
linear components directly from covariance between SNPs and traits without conditional
on covariance between SNPs and traits. On the contrary, linear components of CCA are
calculated based on covariance between SNPs and traits conditional on covariance
between SNPs and traits. Therefore, we observed the power increased from using the first
pair of linear components with increasing levels of correlations of SNPs and traits (Figure
IV.2.7.). When 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
≥ 0.3 , join test using CCA with the first pair of linear
components then showed the best power under 𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐴 .
In conclusion, the power performance patterns for most approaches did not
change much compared to the setting of independent SNPs and traits except one
important finding. The finding is that, the correlations of SNPs and traits could help the
power performance of MANOVA and CCA surpass MLR even when MLR was
performed under a presumed model which perfectly matched the underlying sparsity of
genetic effects. Furthermore, CCA could surpass MANOVA when less linear
components were selected to be tested in some specific situations. In other words,
without guessing the “correct” presumed model (the model matches the sparsity), the
linear components calculated in CCA could not only well capture the underlying sparsity
of genetic effects but also use information of correlations of SNPs and traits as
advantages to achieve a better power. Nevertheless, we did not know how many pairs of
linear components can provide sufficient information in terms of capturing the sparsity
101
and contribute to increase power. This question then became the next important topic that
we purposed to address in the later part of this proposal.
Another interesting finding here is to clarify a concept that, testing joint genetic
effects under correlated SNPs and traits using SLR did not really “lose power” compared
to under independent SNPs and traits. This means, the power performance of combining
related tests using Bonferroni corrections would be just similar with the power as if those
tests were really independent. The “lose power” actually means the power of such joint
test could have been better if the information of correlations of SNPs and traits were
considered and contributed to the test appropriately. However, if those correlations were
considered inappropriately, such as the example of testing with the first few pairs of PCA
linear components, the power could became even worse than no consideration at all.
Correlated SNPs or correlated traits
In order to further explore whether the correlations of SNPs (𝜌 𝐗𝐗
) and correlations
of traits (𝜌 𝐘𝐘
) could have distinct impacts on the power in testing joint genetic effects, we
did additional two simulation settings.
The first setting, we set SNPs to be correlated (𝜌 𝐗𝐗
= 0.3) and traits to be
independent (𝜌 𝐘𝐘
=0). Based on equation IV-1, the given population covariance matrix
under 𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐴 would be:
102
Σ = [
𝚺 𝐗𝐗
𝚺 𝐗𝐘
𝚺 𝐘𝐗
𝚺 𝐘𝐘
] =
[
1 0 0 0 2 . 2 . 2 . 2 .
0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0
2 . 0 0 0 1 3 . 3 . 3 .
2 . 0 0 0 3 . 1 3 . 3 .
2 . 0 0 0 3 . 3 . 1 3 .
2 . 0 0 0 3 . 3 . 3 . 1
]
Same way was applied to draw the population covariance matrix structures for the other
three types of sparsity of genetic effects (𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐵 ,𝛒 𝐗𝐘
𝐶 ,𝛒 𝐗𝐘
𝐷 ) .
SLR-based approaches showed similar power performance as previously found.
When all causal SNPs have effects to one trait (𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐴 ), MLR-based approach
showed the best power (Figure IV.2.9). Using CCA with the first pair of linear
components showed the second best power, and MANOVA was the next best. When
three SNPs have effects to one trait and one SNP has effect to another trait (𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐵 ),
the power performance of MLR dropped about 30%, CCA with the first pair of linear
components dropped about 25% (but increased while more linear components were
selected), and MANOVA dropped about 20%. This made MANOVA and CCA and PLS
with more linear components to achieve the power higher than MLR. When the causal
SNPs distributed across traits, power of MANOVA and CCA and PLS with more linear
components increased, but not MLR. The results indicated that, when SNPs are
correlated, MLR with the model which perfectly matched the sparsity of genetic effects
could achieve higher power than when SNPs are independent. However, when one causal
SNPs move from one trait to another to create an “unbalanced” sparsity structure, the
MLR power dropped a lot, even became worse than when SNPs are independent.
103
Accordingly, correlations of SNPs are no longer benefit to power, but became the reason
to make the sparsity be more unbalanced and thus decrease power. Same reason was
applied to the power performance of MANOVA and DRMs. Consequently, same
phenomena could happen for MANOVA and DRMs in the present of pleiotropy that
multiple traits are correlated to one SNP. Because MLR did not consider the correlations
of traits (testing multiple SNPs versus one trait in each model independently), one SNP
has effects to all traits is equivalent to 𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐷 in which MLR showed the worst
power overall. Figure IV.2.9. showed the power performance particularly for PCA, PLS,
and CCA in combination of correlations of SNPs (𝜌 𝐗𝐗
= 0 to 0.5) and the number of
pairs of linear components.
The second setting, opposite to the first one, we set SNPs to be independent
(𝜌 𝐗𝐗
= 0) and traits to be correlated (𝜌 𝐘𝐘
= 0.3). Based on equation IV-1, the given
population covariance matrix under 𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐴 would be:
Σ = [
𝚺 𝐗𝐗
𝚺 𝐗𝐘
𝚺 𝐘𝐗
𝚺 𝐘𝐘
] =
[
1 3 . 3 . 3 . 2 . 2 . 2 . 2 .
3 . 1 3 . 3 . 0 0 0 0
3 . 3 . 1 3 . 0 0 0 0
3 . 3 . 3 . 1 0 0 0 0
2 . 0 0 0 1 0 0 0
2 . 0 0 0 0 1 0 0
2 . 0 0 0 0 0 1 0
2 . 0 0 0 0 0 0 1
]
Same way was applied to draw the population covariance matrix structures for the other
three types of sparsity of genetic effects (𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐵 ,𝛒 𝐗𝐘
𝐶 ,𝛒 𝐗𝐘
𝐷 ) . The major finding
found from this setting is, doing joint test using CCA with the first pair of linear
104
components can achieve the best power under 𝚺 𝐗𝐘
= 𝛒 𝐗𝐘
𝐴 in which MLR could perfectly
capture the sparsity of genetic effects (Figure IV.2.10.). This suggested that linear
components calculated from CCA can well capture the overall sparsity of genetic effects
among the data covariance matrix by considering information from both covariance of
SNPs and traits and between them.
Conclusion
With a low-dimension simulation, I comprehensively presented the relationships
between correlations of SNPs and traits, sparsity of genetic effects, and power in testing
with SLR, MLR, MANOVA, PCA, PLS, and CCA. In this simulation setting, when the
sparsity of genetic effects is unknown, we found using MANOVA to test joint genetic
effects would give good power in general. On top of that, CCA could further provide
better power in few specific situations when the linear components were selected
appropriately. Therefore, the next important aim would be how to find an approach to
help select linear components of CCA to better fit the sparsity of genetic effects and
improve power.
IV.3. Low-dimension data analysis
For applying in a real analysis, we repeated one association analysis that we have
done before (Chapter II.1.) using PCA, PLS, and CCA approaches. The study is a
candidate gene approach for testing genetic associations between multiple SNPs in ACP1
region and multiple type 2 diabetes related quantitative traits using tag SNPs and
haplotype approaches [Shu et al., 2009]. Here we take the same dataset that we had
105
analyzed for the study. In the dataset, 14 SNPs were genotyped across ACP1 gene region
to capture the genetic variant, and 12 quantitative traits were measured in clinic to
represent the characteristics of type 2 diabetes. The traits include measurements such as
BMI, percent body fat, measurements of glucose and insulin levels at different time
points during glucose tolerance tests, glucose effectiveness, insulin sensitivity, acute
insulin response, and disposition index. We tested the hypothesis that whether the 14
SNPs are associated with the 12 traits.
Because BetaGene is a family-based study, we randomly sampled one normal
individual from each family and thus total 200 unrelated individuals were included in this
analysis. Both SNPs and traits data were standardized and taken residuals from multiple
linear regression models for adjusting age and sex. Data covariance matrix was then
calculated based on the residuals. For selecting linear components, we used a convenient
and also commonly applied method that we set the selection threshold to be 80% (the
selected linear components can explain more than 80% variance out of total variance
explained from all linear components) for all three DRMs. The test statistics distributions
under the null were calculated from 10,000 sample covariance matrices sampled from the
Wishart distribution given the data covariance matrix in which the covariance between
SNPs and traits are set to be zero.
In this real data analysis, p-values of testing the associations between the 14 SNPs
and 12 traits are 0.15513, 0.00015, and <0.00001, for using PCA, PLS, and CCA,
respectively. In summary, CCA approach showed the most significant p-value
(p<0.00001) among all three DRMs. PLS approach showed a significant p-value as well
106
(p=0.00015) but not as significant as CCA. PCA approach showed a non-significant p-
value (p=0.15513). In conclusion, using CCA approach to test associations between
ACP1 variants and type 2 diabetes traits could achieve the most significant result among
all three DRMs, and also much more significant compared to using Tag SNPs and
haplotype analysis in this dataset.
IV.4. High-dimensional setting
Simulation setting
Although the previous low-dimension simulation is good for controlling
parameters of correlations and sparsity of genetic effects to get the insight of their
relationships with power, the variables space would not be large enough to clearly
differentiate the power across the approaches. On the other hand, nowadays, it is
commonly seen dozens even hundreds genetic markers and traits are included in a
pathway analysis. Thus, a high-dimension simulation is needed to show the performance
of testing genetic associations in such real data analysis. Here we did a simulation study
to explore the performance of all types of join tests we have in the previous section in a
large simulated dataset.
We built a simulation with a hundred SNPs and eight quantitative traits to show
how the powers of joint tests change with the sparsity of genetic effects in a high-
dimension analysis. Using the same notations from the previous section (Chapter IV.2.),
the given population covariance matrix Σ would be:
107
Σ = [
𝚺 𝐗𝐗
𝚺 𝐗𝐘
𝚺 𝐘𝐗
𝚺 𝐘𝐘
] =
[
[
1
1
𝛒 𝐗𝐗
𝛒 𝐗𝐗
1
1
]
100×100
𝛒 𝐗𝐘
100×8
𝛒 𝐘𝐗
8×100
[
1
1
𝛒 𝐘𝐘
𝛒 𝐘𝐘
1
1
]
8×8
]
108×108
(IV-3)
LD blocks on chromosomes were simulated in a simple fashion. I assumed each
SNP in the data is correlated with the other ten SNPs (five on its upstream and five on its
downstream) by giving non-zero correlations in 𝚺 𝐗𝐗
. Under this setting, every SNP would
locate in the middle of a row of ten correlated SNPs (except the first and last few ones)
alike a LD block. The nearest six correlated SNPs (three on upstream and three on
downstream) are correlated with ±0.2 correlation, and the rest four correlated SNPs are
correlated with ±0.1 correlation. The same rules were also applied to generate correlated
traits in 𝚺 𝐘 𝐘 . As a result, ~12% SNP-SNP pairs have non-zero correlations in 𝚺 𝐗𝐗
and all
traits are all correlated
In order to explore the sparsity of genetic effects in this high-dimension
framework, we specified three types of covariance structures of 𝚺 𝐗𝐘
, which could be
conceptually presented with the figure:
108
𝛒 𝐿
𝛒 𝑀
𝛒 𝑆
10× 2
50 × 4
100× 8
(IV-4)
There are three different sizes of squares, and each represents a matrix with size 10 × 2,
50 × 4 , and 100× 8 , respectively. Each matrix is nested inside 𝚺 𝐗𝐘
, and the ten
correlations with magnitude 0.1 were fairly distributed inside each nested matrix. For
example, 𝛒 𝑆 indicates ten non-zero genetic effects (given correlations of values 0.1 or -
0.1) were assigned into a 10 × 2 matrix, and the matrix is nested in the middle of the
covariance matrix 𝚺 𝐗𝐘
.
The high-dimension simulation framework is the same with the previous low-
dimension simulation. Sparsity of genetic effects was controlled with a set of parameters
(proportion, arrangement, and magnitude of correlations in SNPs and traits, and
correlations between SNPs and traits) in one population covariance matrix. For each
population covariance matrix, association tests were performed in 10,000 sample
covariance matrices sampled from the Wishart distribution assuming a total of 500
individuals. Distribution of test statistics under the hull hypothesis (𝚺 𝐗𝐘
= 0) was
109
estimated from 10,000 sample covariance matrices sampled from the Wishart distribution
when 𝚺 𝐗𝐘
of a population covariance matrix was set as zero.
Null hypothesis
Under the null hypothesis that there were no correlations between genetic markers
and traits (𝚺 𝐗𝐘
= 0), the estimated powers for all statistical tests (testing one SNP versus
one trait using SLR, testing all SNPs versus one trait using MLR, testing all SNPs versus
all traits using MANOVA and DRM-based approaches) were approximately equal to the
target 5% type I error rate.
Results
Table IV.3.1. showed results of the power performance for each approach in
testing the alternative hypothesis. For SLR, the power did not change much across
different sparsity of covariance matrices. The highest power of SLR without correcting
for multiple comparisons in the three types of sparsity were is about 61-63%. After
correcting for Bonferroni corrections, the SLR-based joint test power was reduced to 33-
36% (Table IV.3.1.). MLR-based approach showed relatively poor power performance
compared to low-dimension simulations. When causal SNPs distributed in a smaller size
of nested matrix 𝛒 𝑆 inside 𝚺 𝐗𝐘
, MLR could have power at 61.8%. When the size of such
matrix became larger to 𝛒 𝑀 and 𝛒 𝐿 , MLR power drop rapidly (36.84% and 21.66%) to be
lower than SLR-based approach. This finding is consistent which the pattern we found in
low-dimension simulation previously. MANOVA have relatively good power
performance (65-77%) across the three types of sparsity.
110
Given the one hundred SNPs and eight traits, the maximum number of possible
pairs of linear components for PLS and CCA is eight. Therefore, we included the pairs of
linear components from the top first to the top eight in the analysis. PCA was performed
in SNPs and traits separately and thus which is not limited to choosing linear components
by pairs. However, in order to be consistently following the order of maximizations
variance/covariance in the comparisons of PLS and CCA, we made pairs of linear
components in PCA and tested by pairs as well. (A more popular method of choosing
linear components in PCA will be discussed in the next chapter.) PCA showed the worst
power overall (< 15%). We knew the number of SNPs is more than the number of traits
(100 > 8), and linear components were calculated independently for SNPs and traits in
PCA, so that the linear components obtained from PCA could not cover all variance of
SNPs when only the first eight pairs of them were selected. Consequently, PCA could not
achieve power similar to MANOVA when the all the first eight pairs of linear
components were included in testing. PLS showed low power when only the first few
linear components were selected. While more linear components were selected, the
power was increased and close to the power of MANOVA. When causal SNPs are
correlated and the corresponding traits (the traits shared non-zero covariance with causal
SNPs) are also correlated (𝚺 𝐗𝐘
= 𝛒 𝑆 ) , CCA could achieve good power at 87.3% when
testing the first two linear components pairs. In the scenario that causal SNPs and
corresponding traits are less correlated (𝚺 𝐗𝐘
= 𝛒 𝑀 ) , is was needed to selected the first
five CCA linear components pairs to get good power (80.36%). When causal SNPs and
corresponding traits are independent (𝚺 𝐗𝐘
= 𝛒 𝐿 ) , all CCA linear components would be
needed to achieve the best power (65.98%) and which is equivalent to using MANOVA.
111
These results were consistent with the findings in low-dimension simulation, and once
again supported that an approach to help select linear components is needed to achieve
better power performance in CCA-based approaches. Figure IV.4.1. showed the changes
in power for PCA, PLS, and CCA-based approaches using difference pairs of linear
components and the comparisons with SLR, MLR and MANOVA.
Conclusion
Because SLR tested a pair of SNP and traits individually and pooled all the test
statistics assuming there are independent events, the information of correlations in SNPs
and traits was ignored in SLR-based approach. Therefore, it was as expected that
Bonferroni corrected SLR showed a poor power performance and was consistent across
difference covariance structures. MLR-based approach tested all SNPs versus one trait in
one model at a time and then pooled the models test statistics using Bonferroni
correction. Compared to SLR, MLR could consider correlations of SNPs in a model and
the number of corrections was reduced to the number of traits only. While most casual
SNPs could be captured with fewer MLR models, those models would have relatively
significant test statistics and thus the overall pooled power may still be fine when the
significances could overcome the number of corrections. When causal SNPs contributed
genetic effects to different traits so that each MLR model would only include one or two
causal SNPs, the power became low and could be worse than SLR-based approach. This
result indicated the performance of MLR-based approach heavily depended on how
genetic effects were arranged across traits. Therefore, for the purpose of testing genetic
associations in a real data analysis in which the sparsity of genetic is unknown, MLR-
112
based approach would not be a good choice because of its uncertainty and lack of
robustness in power performance. MANOVA maintained a relatively good and robust
power performance across these covariance matrices settings. However, when the
correlations in causal SNPs and in corresponding traits became lower, MANOVA started
to lose power.
PCA-based approach showed very poor power overall. The results once again
supported the conclusion that selecting linear components from SNPs and traits
separately would not help to capture the information of relationships between SNPs and
traits. As a result, testing PCA linear components of SNPS and traits resulted in a poor
power performance. However, there could be some specific cases that maximizing SNPs
variance could help capture the relationship between SNPs and traits. For example, when
a data in which causal SNPs are correlated with numbers of the other SNPs and their
variance take a relatively large proportion of variance of traits, maximizing variances of
SNPs and/or traits is in fact to capture their relationships. In such cases, testing PCA
linear components could help improve power because of avoiding corrections of multiple
comparisons in traditional SLR or MLR-based approaches. PLS-based approach showed
better power while including all pairs of linear components across three settings of
covariance structures, but no better than MANOVA. CCA-based approach showed the
best power performance in the three DRMs, and it was able to achieve the best power
among all approaches while a specific set of linear components was selected. When
causal SNPs were correlated and traits were also correlated, the first few CCA linear
components can better captured correlations between SNPs and traits. Therefore, testing
the first few CCA linear components became the most powerful approach of association
113
testing. Including more linear components is like adding noise into tests, and thus power
dropped while more linear components were included. This finding is consistent with the
conclusion in low-dimension simulation. Given the same correlation structure in SNPs
and traits, when the correlations between causal SNPs and traits become small, it was
needed to include more CCA components to capture the associations for achieving better
power. When genetic effects were very sparse, it was required all CCA linear
components to get better power, and which is equivalent to MANOVA. These results
once again supported the argument that selection of CCA components is crucial to
achieve good power performance in CCA-based approach.
IV.5. Summary
The comprehensive low dimension simulations showed that, compared to the
traditional analysis (SLR, MLR), CCA-based approach has the potential to well capture
the sparsity of genetic effects and thus achieve better power in testing joint genetic
associations between multiple SNPs and traits. In addition, the correlations in SNPs and
traits can further improve power for MANOVA and CCA-based approach. The example
of real data analysis of testing genetic associations in ACP1 region supported the low
dimension simulation results.
In order to better examine the potential of DRMs and reflect more reality in real
data analysis, the next high dimension simulation considered scenarios with more
dimensions in both sides of SNPs and traits. The results again showed MANOVA and
CCA-based approach are very powerful in detecting joint genetic associations compared
to traditional SLR analysis. When those genetic associations are more correlated (causal
114
SNPs are correlated to each other and/or so the corresponding traits), CCA can capture
such sparsity structure with fewer components and thus achieve power higher than
MANOVA in testing associations. Therefore, how to gain the advantage of testing with
fewer CCA components when the selection of components is actually unknown in a real
data analysis would be the next important topic in applying CCA-based approach.
115
Tables and Figures (Chapter IV)
Table IV.2.0. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0,𝜌 = 0.
𝚺 𝐗𝐘
Methods Power
ρ
𝐗𝐘
= 0 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0486 0.0479 0.0459 0.0503 0.0450
SNP 2 0.0495 0.0534 0.0561 0.0504
SNP 3 0.0499 0.0504 0.0469 0.0494
SNP 4 0.0470 0.0464 0.0489 0.0442
MLR 0.0517 0.0523 0.0472 0.0503 0.0501
MANOVA 0.0458
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.0486 0.0498 0.0493 0.0458
PLS 0.0463 0.0501 0.0473 0.0458
CCA 0.0514 0.0486 0.0468 0.0458
* i-LC: the first i pairs of linear components used in testing.
116
Table IV.2.1. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0,𝜌 = 0.2.
𝚺 𝐗𝐘
Methods Power
ρ
𝐗𝐘
𝐴 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0493 0.0464 0.0512 0.5144 0.5365
SNP 2 0.0491 0.0465 0.0496 0.5154
SNP 3 0.0515 0.0480 0.0477 0.5171
SNP 4 0.0462 0.0488 0.0483 0.5171
MLR 0.0542 0.0504 0.0543 0.9378 0.8486
MANOVA 0.7490
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.2144 0.3939 0.568 0.7490
PLS 0.7843 0.7616 0.7527 0.7490
CCA 0.7895 0.7646 0.7502 0.7490
ρ
𝐗𝐘
𝐵 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0505 0.046 0.5147 0.0486 0.5354
SNP 2 0.0504 0.0471 0.0491 0.5175
SNP 3 0.0513 0.0466 0.0442 0.5196
SNP 4 0.0467 0.0501 0.0497 0.5302
MLR 0.0455 0.0514 0.3027 0.8254 0.7119
MANOVA 0.7106
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.2248 0.4175 0.5832 0.7106
PLS 0.6622 0.7091 0.7087 0.7106
CCA 0.6756 0.7054 0.7106 0.7106
ρ
𝐗𝐘
𝐶 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0513 0.5149 0.0512 0.0522 0.5289
SNP 2 0.0497 0.0505 0.5224 0.0484
SNP 3 0.0483 0.0513 0.0488 0.5081
SNP 4 0.0499 0.0501 0.0476 0.5115
MLR 0.0459 0.3244 0.3164 0.6154 0.5807
MANOVA 0.6845
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.2334 0.426 0.5832 0.6845
PLS 0.5499 0.6556 0.6803 0.6845
CCA 0.5671 0.6631 0.6836 0.6845
ρ
𝐗𝐘
𝐷 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.5240 0.0490 0.0463 0.051 0.5381
SNP 2 0.0469 0.5185 0.0473 0.0477
SNP 3 0.0460 0.0493 0.5246 0.0541
SNP 4 0.0465 0.0484 0.0478 0.5153
MLR 0.3302 0.3255 0.3251 0.3205 0.4990
MANOVA 0.7098
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.2480 0.4323 0.6083 0.7098
PLS 0.4966 0.6368 0.6981 0.7098
CCA 0.5068 0.6469 0.6998 0.7098
* Gray shading indicates the distribution of given correlations in 𝚺 𝐗𝐘
.
** i-LC: the first i pairs of linear components used in testing.
117
Table IV.2.2. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.1,𝜌 = 0.2.
𝚺 𝐗𝐘
Methods Power
ρ
𝐗𝐘
𝐴 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0501 0.0508 0.047 0.5163 0.5365
SNP 2 0.0486 0.0466 0.0508 0.5222
SNP 3 0.0477 0.0500 0.0500 0.5199
SNP 4 0.0510 0.0512 0.0486 0.5238
MLR 0.0535 0.0544 0.0497 0.9507 0.8588
MANOVA 0.7600
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.1751 0.3635 0.5668 0.7600
PLS 0.7943 0.7787 0.7601 0.7600
CCA 0.8110 0.7868 0.764 0.7600
ρ
𝐗𝐘
𝐵 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0447 0.0489 0.5254 0.0483 0.5251
SNP 2 0.0517 0.0482 0.0476 0.5186
SNP 3 0.0470 0.0475 0.0523 0.5110
SNP 4 0.0493 0.0501 0.0509 0.5295
MLR 0.0523 0.0521 0.0493 0.7021 0.6294
MANOVA 0.6662
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.3071 0.4278 0.5608 0.6662
PLS 0.5812 0.6623 0.6704 0.6662
CCA 0.6054 0.6645 0.6696 0.6662
ρ
𝐗𝐘
𝐶 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0516 0.5155 0.0507 0.0472 0.5282
SNP 2 0.0449 0.0502 0.5223 0.0493
SNP 3 0.0454 0.0471 0.0512 0.5255
SNP 4 0.0457 0.0523 0.0464 0.5066
MLR 0.0452 0.327 0.3358 0.5904 0.5538
MANOVA 0.6979
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.1494 0.3986 0.5797 0.6979
PLS 0.5108 0.6473 0.6901 0.6979
CCA 0.5432 0.656 0.6906 0.6979
ρ
𝐗𝐘
𝐷 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.5192 0.0464 0.0467 0.0487 0.5294
SNP 2 0.0496 0.5255 0.0495 0.0515
SNP 3 0.0476 0.0492 0.5116 0.0525
SNP 4 0.0474 0.0502 0.0517 0.512
MLR 0.321 0.3445 0.3315 0.3261 0.4929
MANOVA 0.7184
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.1458 0.3892 0.5866 0.7184
PLS 0.4784 0.6511 0.7051 0.7184
CCA 0.5176 0.6584 0.7088 0.7184
* Gray shading indicates the distribution of given correlations in 𝚺 𝐗𝐘
.
** i-LC: the first i pairs of linear components used in testing.
118
Table IV.2.3. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.2,𝜌 = 0.2.
𝚺 𝐗𝐘
Methods Power
ρ
𝐗𝐘
𝐴 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0491 0.0444 0.0498 0.5247 0.5366
SNP 2 0.0476 0.0478 0.0463 0.5207
SNP 3 0.0486 0.0472 0.0477 0.5148
SNP 4 0.047 0.0487 0.0505 0.5188
MLR 0.0502 0.055 0.0482 0.958 0.8922
MANOVA 0.8489
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.1150 0.3555 0.6069 0.8489
PLS 0.8256 0.8517 0.8406 0.8489
CCA 0.8843 0.8635 0.8518 0.8489
ρ
𝐗𝐘
𝐵 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0492 0.0521 0.5222 0.0511 0.5062
SNP 2 0.0511 0.0495 0.0491 0.5280
SNP 3 0.0457 0.0509 0.0481 0.5236
SNP 4 0.0472 0.0481 0.0506 0.5057
MLR 0.0439 0.0584 0.3522 0.7121 0.6095
MANOVA 0.7103
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.2457 0.3958 0.5639 0.7103
PLS 0.5658 0.6989 0.7080 0.7103
CCA 0.6600 0.7031 0.7084 0.7103
ρ
𝐗𝐘
𝐶 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0512 0.5166 0.0465 0.0488 0.5229
SNP 2 0.0502 0.0494 0.5219 0.0508
SNP 3 0.0507 0.0469 0.049 0.5124
SNP 4 0.052 0.0468 0.0492 0.5229
MLR 0.0545 0.3417 0.3283 0.5839 0.5556
MANOVA 0.7125
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.0812 0.3709 0.5772 0.7125
PLS 0.5268 0.6773 0.7132 0.7125
CCA 0.5736 0.6867 0.707 0.7125
ρ
𝐗𝐘
𝐷 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.5185 0.0458 0.046 0.0501 0.5368
SNP 2 0.048 0.5214 0.0482 0.047
SNP 3 0.0488 0.0494 0.5078 0.0481
SNP 4 0.0517 0.0454 0.0483 0.5206
MLR 0.3418 0.3466 0.3379 0.3429 0.5155
MANOVA 0.7793
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.0774 0.3644 0.6202 0.7793
PLS 0.5089 0.6949 0.7568 0.7793
CCA 0.5823 0.7211 0.7685 0.7793
* Gray shading indicates the distribution of given correlations in 𝚺 𝐗𝐘
.
** i-LC: the first i pairs of linear components used in testing.
119
Table IV.2.4. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.3,𝜌 = 0.2.
𝚺 𝐗𝐘
Methods Power
ρ
𝐗𝐘
𝐴 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0497 0.0506 0.0502 0.5126 0.5291
SNP 2 0.0504 0.0492 0.0496 0.5285
SNP 3 0.0475 0.0478 0.0428 0.5221
SNP 4 0.0462 0.046 0.0469 0.5261
MLR 0.0552 0.0501 0.0525 0.9751 0.9355
MANOVA 0.9367
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.0910 0.3577 0.6644 0.9367
PLS 0.8438 0.921 0.9277 0.9367
CCA 0.9613 0.9469 0.9390 0.9367
ρ
𝐗𝐘
𝐵 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0468 0.0517 0.5235 0.0505 0.4755
SNP 2 0.0505 0.0484 0.050 0.5134
SNP 3 0.0494 0.0519 0.0514 0.5169
SNP 4 0.0485 0.0521 0.0530 0.5143
MLR 0.0467 0.0524 0.3904 0.6745
MANOVA 0.7797
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.1901 0.3700 0.5846 0.7797
PLS 0.5524 0.7712 0.7818 0.7797
CCA 0.7362 0.7808 0.7828 0.7797
ρ
𝐗𝐘
𝐶 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0497 0.5189 0.0508 0.0499 0.5347
SNP 2 0.0544 0.0463 0.5124 0.0488
SNP 3 0.0498 0.0509 0.0495 0.5126
SNP 4 0.0501 0.0446 0.0464 0.5140
MLR 0.0471 0.3907 0.3879 0.6003 0.6093
MANOVA 0.8123
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.0623 0.3968 0.6431 0.8123
PLS 0.5720 0.7408 0.7953 0.8123
CCA 0.6881 0.7910 0.8136 0.8123
ρ
𝐗𝐘
𝐷 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.5262 0.0545 0.0506 0.0494 0.5265
SNP 2 0.0518 0.5130 0.0500 0.0501
SNP 3 0.0454 0.0493 0.5166 0.0520
SNP 4 0.0469 0.0460 0.0492 0.5205
MLR 0.3802 0.3820 0.3780 0.3832 0.5592
MANOVA 0.8728
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.0623 0.4003 0.7082 0.8728
PLS 0.5326 0.7716 0.8506 0.8728
CCA 0.7128 0.8273 0.8660 0.8728
* Gray shading indicates the distribution of given correlations in 𝚺 𝐗𝐘
.
** i-LC: the first i pairs of linear components used in testing.
120
Table IV.2.5. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.4,𝜌 = 0.2.
𝚺 𝐗𝐘
Methods Power
ρ
𝐗𝐘
𝐴 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0542 0.0465 0.051 0.516 0.5076
SNP 2 0.0505 0.0485 0.0476 0.5104
SNP 3 0.0552 0.0499 0.0491 0.5156
SNP 4 0.0506 0.0516 0.0515 0.5186
MLR 0.0516 0.0535 0.0528 0.989 0.9649
MANOVA 0.9846
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.0838 0.4079 0.7258 0.9846
PLS 0.8355 0.9676 0.9772 0.9846
CCA 0.9911 0.9871 0.9851 0.9846
ρ
𝐗𝐘
𝐵 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0524 0.051 0.5151 0.0508 0.4571
SNP 2 0.0508 0.0502 0.0482 0.5178
SNP 3 0.0509 0.0479 0.0472 0.5153
SNP 4 0.0507 0.0443 0.0496 0.5243
MLR 0.0537 0.0467 0.4542 0.6933 0.6387
MANOVA 0.8979
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.1453 0.4061 0.6597 0.8979
PLS 0.5804 0.8766 0.8918 0.8979
CCA 0.8878 0.9014 0.8985 0.8979
ρ
𝐗𝐘
𝐶 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0491 0.5089 0.0506 0.0520 0.5240
SNP 2 0.0487 0.0453 0.5115 0.0473
SNP 3 0.0519 0.0448 0.0493 0.5191
SNP 4 0.0468 0.0527 0.0496 0.5115
MLR 0.0531 0.4166 0.4363 0.6472 0.6488
MANOVA 0.9081
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.0606 0.4446 0.7418 0.9081
PLS 0.5749 0.8240 0.8934 0.9081
CCA 0.8093 0.8926 0.9067 0.9081
ρ
𝐗𝐘
𝐷 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.5207 0.0501 0.0512 0.0476 0.5250
SNP 2 0.0458 0.5118 0.0515 0.0478
SNP 3 0.0498 0.0483 0.5183 0.0492
SNP 4 0.0485 0.0489 0.0541 0.514
MLR 0.4400 0.4292 0.4435 0.4256 0.6504
MANOVA 0.9563
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.0541 0.4695 0.8127 0.9563
PLS 0.5484 0.8673 0.9387 0.9563
CCA 0.8539 0.9363 0.9512 0.9563
* Gray shading indicates the distribution of given correlations in 𝚺 𝐗𝐘
.
** i-LC: the first i pairs of linear components used in testing.
121
Table IV.2.6. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.5,𝜌 = 0.2.
𝚺 𝐗𝐘
Methods Power
ρ
𝐗𝐘
𝐴 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0493 0.0473 0.0500 0.5131 0.500
SNP 2 0.0516 0.0506 0.0487 0.5207
SNP 3 0.0455 0.0471 0.0479 0.5236
SNP 4 0.0509 0.0504 0.0445 0.5218
MLR 0.0516 0.054 0.0483 0.9976 0.9919
MANOVA 0.9996
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.0773 0.4525 0.7939 0.9996
PLS 0.8293 0.9923 0.9988 0.9996
CCA 0.9999 0.9997 0.9996 0.9996
ρ
𝐗𝐘
𝐵 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0504 0.0495 0.5175 0.0503 0.4302
SNP 2 0.0456 0.0497 0.0459 0.5195
SNP 3 0.0473 0.0466 0.0459 0.5199
SNP 4 0.0488 0.0493 0.0483 0.5187
MLR 0.0512 0.0542 0.5061 0.7178 0.6989
MANOVA 0.9806
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.1338 0.4367 0.7538 0.9806
PLS 0.5703 0.9509 0.977 0.9806
CCA 0.9799 0.9825 0.981 0.9806
ρ
𝐗𝐘
𝐶 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.0466 0.5192 0.0498 0.0480 0.5169
SNP 2 0.0470 0.0470 0.5167 0.0479
SNP 3 0.0498 0.0476 0.0480 0.5292
SNP 4 0.0469 0.0499 0.0501 0.5245
MLR 0.0475 0.5101 0.4875 0.7157 0.7331
MANOVA 0.9786
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.0506 0.5275 0.8551 0.9786
PLS 0.5699 0.9061 0.9668 0.9786
CCA 0.9363 0.9727 0.9782 0.9786
ρ
𝐗𝐘
𝐷 SLR Trait 1 Trait 2 Trait 3 Trait 4 Joint Test
SNP 1
0.5258 0.0493 0.0500 0.0473 0.5280
SNP 2 0.0515 0.5129 0.049 0.0516
SNP 3 0.0509 0.0514 0.5136 0.0488
SNP 4 0.0496 0.0523 0.0521 0.5112
MLR 0.5123 0.4961 0.4929 0.4974 0.7222
MANOVA 0.9961
DRM 1-LC 2-LC 3-LC 4-LC
PCA 0.0584 0.5640 0.9166 0.9961
PLS 0.5110 0.9406 0.9903 0.9961
CCA 0.9720 0.9939 0.9957 0.9961
* Gray shading indicates the distribution of given correlations in 𝚺 𝐗𝐘
.
** i-LC: the first i pairs of linear components used in testing.
122
Table IV.4. Power of testing joint genetic associations in high-dimension simulation.
𝚺 𝐗𝐘
SLR MLR
MAN
OVA
Max*
Bonferroni corrected**
PCA PLS CCA PCA PLS CCA
ρ
𝑆 0.362 0.608 0.747 0.143 0.687 0.858 0.056 0.183 0.739
ρ
𝑀 0.354 0.399 0.791 0.130 0.607 0.792 0.057 0.202 0.625
ρ
𝐿 0.370 0.216 0.668 0.111 0.546 0.668 0.057 0.187 0.426
* Based on the smallest uncorrected p-value in eight sets of linear components.
** Based on Bonferroni corrected p-values (corrected for eight sets of linear composites).
123
Figure IV.2.1. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS,
and CCA in independent SNPs and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0).
Sparsity ρ
𝐗𝐘
𝐴 ρ
𝐗𝐘
𝐵 ρ
𝐗𝐘
𝐶 ρ
𝐗𝐘
𝐷
Power
Number of linear components
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
PCA
PLS
CCA
SLR
MLR
MANOVA
124
Figure IV.2.2. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS,
and CCA in correlated SNPs and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.1).
Sparsity ρ
𝐗𝐘
𝐴 ρ
𝐗𝐘
𝐵 ρ
𝐗𝐘
𝐶 ρ
𝐗𝐘
𝐷
Power
Number of linear components
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
PCA
PLS
CCA
SLR
MLR
MANOVA
125
Figure IV.2.3. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS,
and CCA in correlated SNPs and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.2).
Sparsity ρ
𝐗𝐘
𝐴 ρ
𝐗𝐘
𝐵 ρ
𝐗𝐘
𝐶 ρ
𝐗𝐘
𝐷
Power
Number of linear components
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
PCA
PLS
CCA
SLR
MLR
MANOVA
126
Figure IV.2.4. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS,
and CCA in correlated SNPs and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.3).
Sparsity ρ
𝐗𝐘
𝐴 ρ
𝐗𝐘
𝐵 ρ
𝐗𝐘
𝐶 ρ
𝐗𝐘
𝐷
Power
Number of linear components
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
PCA
PLS
CCA
SLR
MLR
MANOVA
127
Figure IV.2.5. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS,
and CCA in correlated SNPs and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.4).
Sparsity ρ
𝐗𝐘
𝐴 ρ
𝐗𝐘
𝐵 ρ
𝐗𝐘
𝐶 ρ
𝐗𝐘
𝐷
Power
Number of linear components
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
PCA
PLS
CCA
SLR
MLR
MANOVA
128
Figure IV.2.6. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS,
and CCA in correlated SNPs and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.5).
Sparsity ρ
𝐗𝐘
𝐴 ρ
𝐗𝐘
𝐵 ρ
𝐗𝐘
𝐶 ρ
𝐗𝐘
𝐷
Power
Number of linear components
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
PCA
PLS
CCA
SLR
MLR
MANOVA
129
Figure IV.2.7. Trends of power (axis range 0 to 1) for PCA, PLS, and CCA in combination with
correlations of SNPs and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
, range 0 to 0.5) and number of linear components pairs
(LC, range 1 to 4) across four types of sparsity of genetic effects:
a) 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐴
b) 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐵
c) 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐶
d) 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐷
130
Figure IV.2.8. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS,
and CCA in correlated SNPs and independent traits (𝜌 𝐗𝐗
= 0.3,𝜌 𝐘𝐘
= 0).
Sparsity ρ
𝐗𝐘
𝐴 ρ
𝐗𝐘
𝐵 ρ
𝐗𝐘
𝐶 ρ
𝐗𝐘
𝐷
Power
Number of linear components
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
PCA
PLS
CCA
SLR
MLR
MANOVA
131
Figure IV.2.9. Trends of power (axis range 0 to 1) for PCA, PLS, and CCA in combination with
correlations of SNPs and traits (𝜌 𝐗𝐗
range 0 to 0.5, 𝜌 𝐘𝐘
= 0) and number of linear components
pairs (LC, range 1 to 4) when setting ρ
𝐗𝐘
for covariance between SNPs and traits (𝚺 𝐗𝐘
):
a) 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐴
b) 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐵
c) 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐶
d) 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐷
132
Figure IV.2.10. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS,
and CCA in correlated SNPs and traits (𝜌 𝐗𝐗
= 0,𝜌 𝐘𝐘
= 0.3).
Sparsity ρ
𝐗𝐘
𝐴 ρ
𝐗𝐘
𝐵 ρ
𝐗𝐘
𝐶 ρ
𝐗𝐘
𝐷
Power
Number of linear components
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
PCA
PLS
CCA
SLR
MLR
MANOVA
133
Figure IV.2.11. Trends of power (axis range 0 to 1) for PCA, PLS, and CCA in combination with
correlations of SNPs and traits (𝜌 𝐗𝐗
= 0,𝜌 𝐘𝐘
range 0.1 to 0.5) and number of linear components
pairs (LC, range 1 to 4) when setting ρ
𝐗𝐘
for covariance between SNPs and traits (𝚺 𝐗𝐘
):
a) 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐴
b) 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐵
c) 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐶
d) 𝚺 𝐗𝐘
= ρ
𝐗𝐘
𝐷
134
Figure IV.4. Power of testing joint genetic effects using SLR, MLR, MANOVA, and three
DRM-based approaches in high-dimension simulation.
135
V. Adaptive CCA for Set-Based Testing with Multivariate Traits
V.1. Motivation
The simulation results in Chapter IV showed that set-based testing using
canonical correlations can have better power performance than multivariate linear
regression, PLS and PCA, especially with correlated genetic markers and correlated
quantitative traits (Chapter IV.4). However, to use canonical correlations for set-based
testing, one needs to pre-specify the number of components on which to do the test. In
general, the optimal number of components will depend on parameters of the data
generating mechanism, such as correlation structure and number and pattern of the
associations. In practice these parameters are unknown, and therefore it is not possible to
make an a priori optimal choice of the number of canonical correlation components to
include. One possibility is to set a threshold for the proportion of variance (PC),
covariance (PLS), or correlation explained (CC) by the linear components and based on
the characteristics of data or suggested decisions from the other studies. For example, for
using PCs for testing variants in a region for association with a univariate trait, selecting
the top components explaining at least 80% of the genotypic variance has been proposed
[Gauderman et al., 2007]. But this does not guarantee good performance, and in fact can
result in very poor power in some circumstances [Yu et al. 2009]. This was also
demonstrated in Chapter IV where the performance of set-base tests based on DRMs
varies greatly with the sparsity of the genetic effects. A poor choice in the number of
components included in the set test can yield suboptimal power, possibly much lower
than the power achievable with standard multivariate regression. A natural question is,
136
whether it is possible to select the number of DR components in a data-driven fashion to
achieve good power regardless of the unknown scenario. Such concept was already
introduced and realized as adaptive tests, which is, one will consider testing of
significance to be “adaptive” if the test procedure is modified after the data have been
collected and examined [O’Gorman 2012].
Here I propose an adaptive approach for selecting CC components to combine
into a set-test statistics. The approach is similar in spirit to the adaptive rank truncated
product (ARTP) [Yu et al. 2009], which I describe below. Starting with the top CC
component, the one explaining the most correlation between genotypes and traits, and
add one component at a time until the smallest p-value is obtained. The final test statistic
is the one that yields the smallest p-value. To account for the data-driven adaptation a
permutation approach is use to assess significance. I expected this adaptive approach to
achieve a good and robust power across a wide range of scenarios.
V.2. Adaptive rank truncated product (ARTP)
The simulation in Chapter IV shows that testing one variant against one trait at a
time and then correcting for multiple testing yields a set-based test with reduced power to
detect associations. Rather than combining all individual tests, ARTP is an approach
proposed to, adaptively combine multiple individual tests from individual SNPs or genes
in a set to achieve a robust power performance [Yu et al. 2009]. There three steps in the
ARTP approach. The first step is to, k=1,..,p, calculate a summary statistic by combining
the top k most significant individual test statistics, using the rank truncated product (RTP)
of Dudbridge and Koeleman [Dudbridge and Koeleman, 2003, 2004] The second step is
137
finds the null distribution of each summary statistics and computes a p-value for it. The
summary statistic corresponding to the value of k with the smallest p-value is selected as
the summary statistic for the entire set. In the final step, the null distribution of the final
summary test statistic is obtained accounting for the data-driven selection in step 2.
Rank truncated product (RTP)
RTP can be simply understood as a modified version of the classic Fisher’s
method [Fisher, 1932]. Fisher’s method was proposed to test a global hypothesis by
combining the results of multiple independent tests. Assuming a total of k tests were
performed to test a hypothesis and total k p-values were calculated from the tests
(𝑝 1
,⋯,𝑝 𝑘 ) , Fisher’s method computes the statistic:
𝑡 = −2∑ln𝑝 𝑖 𝑘 𝑖 =1
= −2ln(∏𝑝 𝑖 𝑘 𝑖 =1
) ~𝑋 𝑑𝑓 =2𝑘 2
(V-1)
Assuming independence of the p-values under a global null hypothesis, the t statistic
follows an approximate (exact if the individual p-values are exactly uniformly
distributed) chi-square distribution with 2k degrees of freedom. However, this method
would lose power when only a small fraction of the individual hypothesis is non-null,
which yields many large p-values would be included in the calculation of t. Therefore,
excluding large p-values from calculating the product t appeared to be a reasonable
improvement to make the test be more powerful.
One simple way to improve Fisher’s method is sorting 𝑝 𝑖 from the smallest (most
138
significant) to the largest (least significant) and setting a truncated point τ (0 < τ < 1, τ is
commonly equal to 0.05) for excluding large p-values. Those p-values smaller than
were retained in the calculation of Fisher’s method. This method is known as rank
truncated product (RTP) [Zaykin et al., 2002, Dudbridge and Koeleman, 2004]:
𝑊 = ∏𝑝 𝑖 𝐼 (𝑝 𝑖 ≤𝜏 )
𝑘 ′
𝑖 =1
(V-2)
where 𝐼 (∙) is the indicator function and 𝑘 ′
is the number of p-values included into RTP.
Because of an additional step of organizing p-values was taken before calculating the
product, the null distribution of W would be no longer the former chi-square distribution.
In order to conduct a test using W, permutation is one typically approach used to assess
the significance. However, the null distributions of W for each choice of truncated points
would be different, and the power would be different as well. In a real data analysis, we
would not know what truncation point should be chosen to have a better performance of
power.
Adaptive rank truncated product (ARTP)
ARTP designed a procedure which adaptively selects a truncation point in RTP,
and then calculates the corresponding distribution of the selected statistic for assessing
the significance. This procedure was showed to help achieve a relatively better and robust
power in hypothesis testing [Yu et al., 2009].
The calculation of ARTP can be understood as two steps. In the first step, the p-
139
values are calculated for each SNP versus each trait and then sorted from the most to least
significant. Given a data has m SNPs and a single trait, a total of 𝑚 × 1 tests are
performed. Assuming a truncation point was set to include all p-values, for each 𝑘 (𝑘 =
1,⋯,𝑚 ) , a combination statistic 𝑊 𝑘 can be computed as the product of the top k most
significant p-values. 𝑊 𝑘 can then be converted into an estimated p-value (𝑝 𝑘 ′) using
permutation. This step is equivalent to the RTP method (equation V-2) but calculated all
(a total of m) 𝑝 𝑘 ′ from each permuted data. In the second step, one just simply selects the
most significant p-value from all 𝑝 𝑘 ′ as the presumed most powerful test statistic (𝑝 H1
)
in ARTP. However, the null distribution of 𝑝 H1
is unknown because of the selecting
process. In order to estimate the null distribution, one can repeat the same process of
selecting one smallest p-value in each permuted data. Assuming a total of n permuted
data were generated in permutation, a total of n smallest p-value (𝑝 H0
𝑙 , where 𝑙 =
1,⋯,𝑛 ) can be selected, and the null distribution of 𝑝 H1
can be estimated with 𝑝 H0
𝑙 .
Accordingly, the adjusted test statistic of ARTP can be calculated:
𝑝 ARTP
=
∑ (𝑝 H1
< 𝑝 H0
𝑙 )
𝑛 𝑙 =1
𝑛
(V-3)
This is the final p-value for assessing the significance of testing one global hypothesis.
The second step in ARTP can be recognized as the adaptive step and which is the new
idea introduced in ARTP (extended from the previous RTP). Yu et al. showed that the
ARTP approach can achieve robust power under a wide range of scenarios [Yu et al.
2009].
140
V.3. Adaptive CCA-based approach (ACCA)
CCA components were calculated in a sorted order that the first pair of linear
components has the maximum correlations and follow a descending order to the last pair
(equation III-16). CCA-based approach was to test a set of CCA components which
included the first i pairs of components following the sorted order. Therefore, the analysis
framework of CCA-based approach can be viewed as conceptually equal to the first step
in ATRP. There are two dissimilarities in the calculations (the sorting order of CCA
components is based on the strength of canonical correlations instead of p-values; pooled
statistics are directly calculated by a set-based likelihood ratio test, not the products of p-
values), but these dissimilarities did not violet the analysis framework. We learned the
power performance of CCA-based approach can be optimized when the set of linear
components which can well capture the sparsity of genetic effects was selected into tests.
However, the information of finding the “best” set of components is unknown when
analyzing a real data. This problem is actually can be addressed using the same idea as
the second (adaptive) step in ARTP.
In ARTP, after all products were converted into p-values, one can just select the
most significant p-value from them and then calculated the corresponding adjusted p-
value for adjusting the process of selection (equation V-3). In CCA-based approach, a
series of p-values were calculated from using different sets of CCA components using
LRT (equation III-38). According to the adaptive step in ARTP, one can choose the most
significant p-value as the presumed most powerful test statistic, and then calculate the
141
corresponding adjusted p-value using the same adaptive step in ARTP. I named this
approach as adaptive CCA-based approach (ACCA).
V.4. High-dimensional setting
Simulation setting
The power performance of ACCA was examined using a simulation study. The
simulation settings are the same with the high-dimension simulation in Chapter IV, and
thus the details can be referred to Chapter IV.4. ARTP and ACCA were added into the
comparisons of methods (SLR, MLR, MANOVA, and there DRM-based approaches).
Alternative DRM-based approaches
The adaptive approach is a process of selecting and adjusting a test statistic in a
set-based test, so which can also be applied in the other two DRM-based approaches.
Although it was already concluded PCA and PLS-based approaches would perform no
better than CCA, the adaptive step was still applied in PCA and PLS-based approaches in
this simulation study as the references in the comparisons of power, and they were named
as adaptive PCA and adaptive PLS-based approaches (APCA and APLS) respectively.
On the other hand, it is known PCA calculates linear components for SNPs and
traits separately, so one would argue that testing linear components by pairs may not
really reflect the true performance of PCA-based approach. A commonly seen PCA-
based approach is to select linear components based on a given threshold, usually the
142
percentage of variance explained by selected components [Guaderman et al, 2007]. Based
on equation III-8, one can select the top 𝑠 (𝑠 ≤ 𝑝 ) components given a threshold T:
∑ 𝜆 𝑠 −1
𝑠 −1
𝑖 =1
∑ 𝜆 𝑖 𝑝 𝑖 =1
< 𝑇 and
∑ 𝜆 𝑠 𝑠 𝑖 =1
∑ 𝜆 𝑖 𝑝 𝑖 =1
≥ 𝑇
The top s components would be included into analysis. This process can be applied to
select components in SNPs and traits separately. After components of SNPs and traits
were selected, the same LRT (equation III-36) can be used to test the PCA components.
The concept of threshold method can also be applied in PLS and CCA-based approaches.
PLS seeks a pairs of component which maximize the covariance between SNPs and traits.
Based on equation III-21, one can select the top 𝑠 (𝑠 ≤ 𝑝 ) components given a threshold
T:
∑ 𝜎 𝑠 −1
𝑠 −1
𝑖 =1
∑ 𝜎 𝑖 𝑝 𝑖 =1
< 𝑇 and
∑ 𝜎 𝑠 𝑠 𝑖 =1
∑ 𝜎 𝑖 𝑝 𝑖 =1
≥ 𝑇
CCA seeks a pair of components which maximize the correlation between SNPs and
traits (equation III-21), and the squared canonical correlations can be interpreted as
estimates of the amount of shared variance. Accordingly, based on equation III-16, one
can select the top 𝑠 (𝑠 ≤ 𝑝 ) components given a threshold T:
∑ 𝜌 2
𝑠 −1
𝑠 −1
𝑖 =1
∑ 𝜌 2
𝑖 𝑝 𝑖 =1
< 𝑇 and
∑ 𝜌 2
𝑠 𝑠 𝑖 =1
∑ 𝜌 2
𝑖 𝑝 𝑖 =1
≥ 𝑇
In this simulation, I also compared power between DRM-based approaches using this
threshold method to select components with a threshold of 80% variance/covariance.
143
Null hypothesis
Under the null hypothesis that there were no correlations between genetic markers
and traits (𝚺 𝐗𝐘
= 0), the estimated powers for all statistical tests (testing one SNP versus
one trait using SLR, testing all SNPs versus one trait using MLR, testing all SNPs versus
all traits using MANOVA, ARTP, and DRM-based approaches) were approximately
equal to the target 5% type I error rate.
Results
Under the alternative hypothesis, when the associations between SNPs and traits
are relatively correlated (𝚺 𝐗𝐘
= ρ
𝑆 ), ACCA showed the best power (82.7%) among all
approaches (Table V.4), and which is close to the power when the best CCA components
were selected (85.8%). Similar to ACCA, APCA showed power close to the best power
selected from testing all possible PCA components. However, the adaptive methods did
not perform as well in APLS (Figure V.4). The threshold methods did help PCA-based
approach to gain more power than testing by pairs of components, but the power gain is
very limited and still lower than the other approaches (Table V.4). ARTP showed
improved power (48.5%) than SLR (36.2%) but lower than MANOVA (74.7%) and
ACCA. This pattern of power performance was observed in the other two settings
(𝚺 𝐗𝐘
= ρ
𝑀 and ρ
𝐿 ) of covariance structures (Table V.4, Figure V.4).
Conclusion
When causal SNPs and corresponding traits are more correlated than the other
SNPs and traits, CCA only need fewer components to capture the sparsity to achieve
144
power better than MANOVA, and ACCA can achieve that power performance without
knowing which components should be selected. If the causal SNPs and corresponding
traits are more spare (for example, causal SNPs are more correlated to the other non-
causal SNPs than causal SNPs), implying the genetic associations are independent, CCA
needed more components to achieve the best power. With the penalty of seeking
components, ACCA showed lower power than MANOVA. On the other hand, even the
threshold was taken to select PCA components separately from SNPs and traits, the
power performance of PCA-based approach is still poor.
V.5. Other high-dimensional settings
The previous high-dimension simulation results showed MANOVA and ACCA
always outperformed ARTP with a strong advantage in power across different data
covariance structures. However, such advantage could be reduced in some particular
simulation settings.
Direction of correlations
In the previous simulation settings, correlations between SNPs were assumed to
be half positive and half negative, and the same assumption was made for traits and the
given genetic effects. This assumption was made in the purpose to generate a relatively
complex data covariance matrix. In the real data analysis, the ratio between positive and
negative correlations can vary. Therefore, I designed this simulation study to examine the
relationship between correlations direction and power.
145
The settings of the previous high-dimension simulation were used in this
simulation study. In order to present the differences in power changes, the covariance
matrix structure was fixed to be 𝚺 𝐗𝐘
= ρ
𝑆 . Three scenarios were made to represent the
variations of correlations direction: 1) the correlations in SNPs were set to be all positive;
2) the correlations in SNPs and traits were set to be all positive; 3) all correlations were
set to be positive. All approaches were repeated and compared the power performance.
Table V.5.1 showed, when the correlations in SNPs was all positive in the first
scenario, the power of MANOVA and ACCA dropped to 73.4% and 84.5%. When the
correlations in traits also became all positive in the second scenario, the power of
MANOVA and ACCA dropped rapidly to 50.0% and 51.3%. When all correlations
between SNPs and traits were positive, MANOVA and ACCA showed the lowest power
(42.1% and 46.5%) across all three scenarios. On the other hand, ARTP showed power as
42%, 47.7%, and 44.6% from the first to the third scenario respectively. The patterns of
power performance were presented in Figure V.5.1. These results indicated the powers of
MANOVA and ACCA are more sensitive to the uniformity of correlations direction, and
ARTP is relatively robust across three scenarios.
In conclusion, given the same data covariance structure, the ratio between positive
and negative correlations would affect the power performance of MANOVA and ACCA,
and the powers can possibly be lower than ARTP in the specific cases that the direction
of correlations is close to uniform.
146
Magnitude of correlations
The relationship between power and magnitude of correlations in SNPs and traits
was already discussed in the low-dimension simulation study (Chapter IV.2). The
conclusion was as expected that MANOVA, CCA-based approaches could gain more
power when the magnitude of correlations increased. However, after knowing the
direction of correlations is an important parameter to affect the power performance of
MANOVA and ACCA, it would be interesting to observe whether the increase of
magnitude can still help to improve power while the direction of correlations is uniform
(all positive).
The settings of the previous high-dimension simulation were used in this
simulation. In addition, the structure of covariance matrix between SNPs and traits is
fixed (𝚺 𝐗𝐘
= ρ
𝑆 ) , and the direction of data correlations is fixed to be all positive. The
correlations between any one SNP and its nearest six SNPs (three on upstream and three
on downstream) are set to be 0.4, 0.25, and 0.1 for three scenarios respectively.
When the correlations in SNPs are low (0.1), the power of ARTP (45.7%) is
higher than both MANOVA (28.8%) and ACCA (36.4%) (Table V.5.2). However,
powers of MANOVA and ACCA increase with the increase of correlations in SNPs
(Table V.5.2 and Figure V.5.2). ARTP power did not vary much with the changes of
magnitude of correlations in SNPs, and thus became lower than MANOVA and ACCA
when the correlations in SNPs are at 0.25 and 0.4.
147
In conclusion, the power of MANOVA and ACCA would be reduced when the
direction of correlations is uniform. However, the power can still be enhanced when the
magnitude of correlations increased. ARTP still showed robust power across three
scenarios. Therefore, ARTP can have better power performance when SNPs are nearly
independent, but be lower than MANOVA and ACCA while the SNPs are more
correlated.
V.6. Summary
Without knowing which CCA components should be selected into analysis,
ACCA can achieve nearly the best power selected from CCA-based approaches. While
the underlying genetic associations are relatively correlated, ACCA can perform better
than MANOVA, and which is due to CCA can use fewer components to capture the
sparsity of genetic effects. MANOVA and ACCA would lose power when the diversity
of data covariance matrix became less. ARTP is robust across all scenarios in the
simulations. However, ARTP did not capture the relationships between SNPs and traits as
well as MANOVA and ACCA. Thus, when the data covariance matrix tend to be more
complex and the correlations between SNPs or traits were stronger, ARTP still have the
similar power as without those additional information.
In conclusion, ACCA can be a good choice to test associations between multiple
correlated SNPs and multiple correlated traits in a biological pathway. MANOVA is an
alternative approach to use when it is believed the underlying associations are less related
to each other. ARTP is a robust approach that constantly performs better than SLR.
148
Compared to MANOVA and ACCA, ARTP may have huge potential power lose when
the information of data covariance structure can be used to enhance power.
149
Tables and Figures (Chapter V)
Table V.4. Power of testing joint genetic associations in high-dimension simulation using
adaptive approaches.
𝚺 𝐗𝐘
SLR ARTP MLR
MAN
OVA
Max*
Threshold
Adaptive
PCA PLS CCA PCA PLS CCA PCA PLS CCA
ρ
𝑆 0.362 0.485 0.608 0.747 0.143 0.687 0.858 0.135 0.237 0.776 0.090 0.274 0.827
ρ
𝑀 0.354 0.466 0.399 0.791 0.130 0.607 0.792 0.180 0.281 0.792 0.088 0.273 0.742
ρ
𝐿 0.370 0.471 0.216 0.668 0.111 0.546 0.668 0.210 0.305 0.643 0.088 0.277 0.543
* Based on the smallest uncorrected p-value in eight sets of linear components.
150
Table V.5.1. Power of testing joint genetic associations in high-dimension simulation of
positive correlated SNPs and/or traits using adaptive approaches.
+/- corr SLR ARTP MLR
MAN
OVA
Max*
Threshold
Adaptive
PCA PLS CCA PCA PLS CCA PCA PLS CCA
+SNP 0.319 0.420 0.493 0.734 0.070 0.756 0.853 0.187 0.255 0.794 0.058 0.302 0.845
+SNP,
+Trait
0.380 0.477 0.495 0.500 0.064 0.511 0.522 0.338 0.385 0.513 0.052 0.445 0.513
+All 0.344 0.446 0.354 0.421 0.176 0.399 0.454 0.249 0.284 0.438 0.126 0.316 0.465
* Based on the smallest uncorrected p-value in eight sets of linear components.
151
Table V.5.2. Power of testing joint genetic associations in high-dimension simulation of low
to high SNPs correlations using adaptive approaches.
SNPs
corr
SLR ARTP MLR
MAN
OVA
Max*
Threshold
Adaptive
PCA PLS CCA PCA PLS CCA PCA PLS CCA
0.4 0.311 0.395 0.879 0.869 0.118 0.481 0.990 0.121 0.243 0.912 0.093 0.347 0.975
0.25 0.338 0.432 0.390 0.459 0.139 0.396 0.561 0.227 0.252 0.481 0.141 0.315 0.544
0.1 0.343 0.457 0.281 0.288 0.218 0.334 0.345 0.259 0.246 0.335 0.160 0.295 0.364
* Based on the smallest uncorrected p-value in eight sets of linear components.
152
Figure V.4. Power of testing joint genetic effects using SLR, ARTP, MLR, MANOVA, three
DRM-based and three adaptive DRM-based approaches in high-dimension simulation.
153
Figure V.5.1. Power of testing joint genetic effects using SLR, ARTP, MLR, MANOVA, there
DRM-based and three adaptive DRM-based approaches in high-dimension simulation of positive
correlated SNPs and/or traits.
154
Figure V.5.2. Power of testing joint genetic effects using SLR, ARTP, MLR, MANOVA, three
DRM-based and three adaptive DRM-based approaches in high-dimension simulation of low to
high SNPs correlations.
155
VI. Type 2 Diabetes Related Traits and a Glucose Cycle Pathway
VI.1 Background
We have previously found an additive effect of genetic variation in the
glucokinase (GCK) [Stone et al., 1996; Rose et al., 2005; Weedon et al., 2005] and
glucose-6-phosphatase catalytic subunit 2 (G6PC2) [Bouatia-Naji et al., 2008; Chen et
al., 2008] on insulin secretion and fasting glucose [Li and Shu, et al., 2009]. The
relationships between genetic variations of GCK and C6PC2 and type 2 diabetes
quantitative traits were presumed based on an underlying biological pathway in
regulating the critical glucose-sensing mechanism. GCK phosphorylates glucose to
glucose-6-phosphate, whereas G6PC2 dephosphorylates glucose-6-phosphate back to
glucose, forming a glucose cycle previously demonstrated to exist within pancreatic β-
cells [Katz and Rognstad, 1976; Khan et al., 1989]. Accordingly, we called the pathway a
“glucose cycle pathway” in this chapter. More details about the background of GCK,
G6PC2, and their roles in the glucose cycle pathway in connection to type-2 diabetes
were described in Chapter II.2.
In our the previously published paper however, [Li and Shu, et al., 2009], we did
not find strong significant associations results to support the presumed relationship
between the genetic variants of GCK and G6PC2 in glucose cycle pathway and type 2
diabetes related quantitative traits. In fact, after adjusting for multiple comparisons using
a Bonferroni correction, most of the univariate association tests results were non-
significant. In this chapter we apply the new CC-based approach proposed in Chapter 5
to test the joint association between the GCK and G6PC2 variants and type 2 diabetes
156
related quantitative traits. In addition, for the purpose of comparing tests results, we will
also perform tests using the traditional univariate (simple linear and multiple linear
regression models), multivariate (MANOVA) analysis, and the other two dimensionality
reduction methods (principal component analysis and partial least square) based
approaches.
VI.2 Methods
Study Subjects
Study subjects were collected from the BetaGene study reported in our published
paper [Li and Shu, et al., 2009]. BetaGene is a family-based study in which we are
performing detailed phenotyping of Mexican American probands with recent gestational
diabetes mellitus (GDM) and their family members to obtain quantitative estimates of
body composition, insulin sensitivity (S
I
), acute insulin response (AIR), and β-cell
compensation (disposition index) with the goal of identifying genes influencing
variations in type 2 diabetes–related quantitative traits [Watanabe et al., 2007; Black et
al., 2008; Shu et al., 2009]. The details of the BetaGene study can be found in Chapter II.
Because of the family-based sampling design in BetaGene Study, measurements of
genotypes and phenotypes of subjects in a family would be correlated due to the subjects’
family relationships. At this stage, our proposed approach does not take the family
relationships into account. Therefore, I selected independent non-GDM individuals from
families to form a dataset subset with independent individuals for this pathway analysis.
157
Genotyping
BetaGene study was extended to become a genome-wide association study
(GWAS) after the two papers in Chapter II were published. Blood samples of BetaGene
study participations were being re-genotyped using Illumina OmniExpress chip which
has approximately 693k SNPs after quality control. Therefore, compared to the previous
paper [Li and Shu, et al., 2009], now we have better coverage with more SNPs to
represent the genetic variations of GCK and G6PC2 gene regions.
Phenotyping
Phenotyping was performed on two separate visits to the University of Southern
California General Clinical Research Center. The first visit consisted of a physical
examination, DNA collection, and an oral glucose tolerance test (OGTT; 75 g) as
previously described [Watanabe et al., 2007; Black et al., 2008; Shu et al., 2009].
Participants with fasting glucose <126 mg/dl were invited back for a second visit, which
consisted of a dual-energy X-ray absorptiometry scan for body composition (percent of
body fat) and an insulin-modified intravenous glucose tolerance test (IVGTT) performed
as previously described [Buchanan et al., 1999]. IVGTT glucose and insulin data were
analyzed using the minimal model (MINMOD Millennium version 5.18) to derive
measures of glucose effectiveness and S
I
[Boston et al., 2003]. Disposition index, a
measure of β-cell compensation for insulin resistance, was computed as the product of
S
I
and AIR. We also computed an OGTT-derived measure of β-cell compensation as the
product of S
I
and 30′ Δinsulin.
158
Data analysis
407 unrelated individuals are included in this analysis. We selected SNPs which
locate between ~50k bps up and down streams of each gene region for better coverage of
genetic variants. A total of 47 SNPs (28 for GCK and 19 for G6PC2) are selected after
excluding monomorphic genotypes. Let 𝑥 𝑖𝑗
be a matrix of genotypes for individual i and
SNP j where 𝑖 = 1 to 𝑁 (𝑁 = 407) and 𝑗 = 1 to 47 . Genotypes are first converted into
0, 1, or 2 based on randomly chosen reference alleles, and then normalized by subtracting
their means and dividing by √𝑝 𝑖 (1 − 𝑝 𝑖 ), where 𝑝 𝑖 is a posterior estimate of the
unobserved underlying allele frequency of SNP i which was defined by 𝑝 𝑖 =
(1 + ∑ 𝑥 𝑖𝑗 𝑖 ) (2 + 2𝑁 ) ⁄ . On the other hand, a total of 16 T2D-related traits (fasting
glucose, 30-min glucose, 2hr glucose, fasting insulin, 30-min insulin, 2hr insulin, insulin
sensitivity, glucose effectiveness, acute insulin response, disposition index, BMI, waist-
hip ratio, body fat, cholesterol, HDL, and triglycerides) were selected. We calculated
covariate-adjusted traits by taking the residuals from linear regression models where each
model has age and sex as covariates and a trait as the outcome. We performed set-based
tests on the normalized genotypes and covariate-adjusted traits using SLR, ARTP,
MAONVA, and ACCA. MLR, APCA and APLS were also included in the analysis as
references.
VI.3 Results and discussion
Table IV.4 showed the tests results for using different approaches. ARTP has the
most significant p-value (0.006), and ACCA and MANOVA have the next significant p-
159
values (0.030 and 0.031). The other set-based approaches showed non-significant results
(p-value > 0.05).
Based on the previously reported results [Li and Shu et al., 2009], we knew there
existed very few but also very strong associations in this pathway (two smallest p-values
are 0.004 and 0.007, and the most of the rest are larger than 0.1). Therefore, it is no
surprising to observe that ARTP has more significant p-values then ACCA. The fact that
ACCA yields a similar p-value to MANOVA may due to the reason that, correlations in
SNPs and traits which are associated to each other are not correlated, and therefore, those
SNPs and traits cannot be captured with fewer CCA components.
160
Tables and Figures (Chapter VI)
Table VI.4. Summary of pathway-based analysis using SLR, ARTP, MLR, MANOVA, and
adaptive DRM-based approaches (APCA, APLS, ACCA).
SLR ARTP MLR
MANO
VA
Adaptive
PCA PLS CCA
0.101 0.006 0.086 0.031 0.133 0.058 0.030
161
VII. General Conclusions and Final Remarks
Dimensionality reduction methods (DRMs) can capture important information of
the data with fewer components, and testing those components can improve power in
association testing. My work compared three classical DRMs: principal components
analysis (PCA), partial least squares (PLS), and canonical correlations analysis (CCA),
and demonstrated their power advantage for testing genetic associations between multiple
genes and multiple quantitative traits. I found that the PCA-based approach is not as
powerful as generally recognized. When the covariance of genetic markers and traits are
not (or less) related to the correlations between them, PCA-based approaches showed
generally poor power in the simulations. Based on the theoretical insights and the
supportive evidence from the simulation results, we learned CCA has the best
performance in testing joint associations among all three DRMs. CCA-based approach
and MANOVA (or known as multivariate multiple regression) both showed good power
in the scenarios of correlated genetic markers and correlated traits. Such power
performance is sensitive to the magnitude and direction of those correlations. I applied an
adaptation method motivated by the ARTP to address the issue of selecting the CCA
components for achieving better power. Based on the results of simulations and read data
analysis, I concluded that adaptive CCA-based approach (ACCA) is a powerful
association testing approach for pathway-based analysis.
I simulated data based on a given population variance-covariance matrix. This
simulation design gave us a lot flexibility to explore different correlation structures of
genes and traits and sparsity of genetic effects. In addition, based on the insights of
162
DRMs I have reviewed, I proposed an alternative sampling method in my simulation
work to replace the traditional permutation approach to assess the statistical significance.
This sampling method reduced the simulation burden from days to minutes, and made
exploring different scenarios in higher dimensional settings become much feasible. My
simulation work also revealed the parameter space in variance-covariance is too large to
be thoroughly explored at once, especially for understanding the performance of
multivariate approaches. This again reminded researchers to be cautious when comparing
their statistical approaches in a simulated data. I reported a series of simulations that
highlight the performance of different approaches as key parameters change. This
provided a depth perception as to how DRMs capture important information and how the
correlations between variants and outcomes can impact power.
Future research directions
My results showed the power performance of CCA-based approach is sensitive to
the covariance structure whether the SNPs and traits in associations are also correlated in
each other. When a set of correlated SNPs associated with a set of correlated traits, CCA
usually can capture those SNPs and traits with the top components and consequently
showed the most significant results among all approaches. This fact indicates a feature
that one can further look into the loadings in CCA components which may help
researchers to narrow down key associated variants among all SNPs and traits. One
practical approach would be that, we can keep tracking what sets of CCA components are
selected into tests during the adaption process in ACCA. Once we know which
components sets are selected, we can then look into the loadings of those components.
163
CCA components are generated in pairs which maximize the correlations between them.
Consequently, SNPs and traits are weighted more than the others in a pair of selected
components will be the interested variants that provide information in the association.
However, the loadings are calculated while subjecting to one. We can use techniques
such as rotations or Lasso shrinkage to refine the loadings under the constraint of
correlations between components. In the past, such idea of using loadings information
was mentioned for PCA or factor analysis (FA). It is now clear that PCA or FA are
calculated independently for SNPs and traits, and thus they have less potential to let the
loadings reveal further information about which set of SNPs and traits did contribute the
genetic associations. Based on my work, I expect CCA-based approach should be more
promising to find a small set of causal SNPs and traits as the next step.
In my simulations, I simulated a reasonable size of data in high-dimensional
setting for a pathway-based analysis. However, it will be interesting to known whether
those DRM-based approaches can be extended to handle a much higher dimension
situation, such as a genome-wide association study (GWAS). The primary issue is the
number of observed variables is more than the number of individuals. This makes
classical DRMs fail to calculate their components correctly because the sample variance-
covariance matrix cannot be correctly estimated. One shortcut is to simply calculate the
variance-covariance regardless the issue, and then replace it with one close positive
definite matrix. A close positive definite matrix can be calculated by first calculating its
eigenvalues and eigenvectors, and next replacing the non-positive eigenvalues with a
small positive value like 0.05, and then calculated backward with the new eigenvalues
and original eigenvectors. However, this shortcut approach is easy to apply but unreliable.
164
On the other hand, it is known a sparse version DRMs as more modern approaches to
handle such high dimension data. The concept of selecting and testing components in my
work is expected to be realized with using sparse DRMs when analyzing a much higher
dimension data. I believe this is a direction worth for further research work.
165
Bibliography/References
Allen AS, Satten GA. Robust estimation and testing of haplotype effects in case-control
studies. Genet Epidemiol 32(1):29-40 (2008).
Almasy L, Blangero J. Multipoint quantitative-trait linkage analysis in general pedigrees.
Am J Hum Genet 62:1198-1211 (1998).
Alonso A, Sasin J, Bottini N, Friedberg I, Friedberg I, Osterman A, Godzik A, Hunter T,
Dixon J, Mustelin T. Protein tyrosine phosphatases in the human genome. Cell 117:699-
711 (2004).
Atkinson EJ, Fridley BL, Goode EL, McDonnell SK, Liu-Mares W, Rabe KG, Sun Z,
Slager SL, de Andrade M. Linkage analysis using principal components of gene
expression data. BMC Proc 1(Suppl 1):S79 (2007).
Arya R, Blangero J, Williams K, Almasy L, Dyer TD, Leach RJ, O'Connell P, Stern MP,
Duggirala R. Factors of insulin resistance syndrome--related phenotypes are linked to
genetic locations on chromosomes 6 and 7 in nondiabetic mexican-americans. Diabetes
51(3):841-7 (2002).
Aschard H, Vilhjálmsson BJ, Greliche N, Morange PE, Trégouët DA, Kraft P.
Maximizing the power of principal-component analysis of correlated phenotypes in
genome-wide association studies. Am J Hum Genet 94(5):662-76 (2014).
Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and
haplotype maps. Bioinformatics 21:263-265 (2005).
Bento JL, Palmer ND, Mychaleckyj JC, Lange LA, Langefeld CD, Rich SS, Freedman
BI, Bowden DW. Association of protein tyrosine phosphatase 1B gene polymorphisms
with type 2 diabetes. Diabetes 53:3007-3012 (2004).
Bergman RN. Toward physiological understanding of glucose tolerance. Minimal model
approach. Diabetes 38:1512-1527 (1989).
Bergman RN, Ider YZ, Bowden CR, Cobelli C. Quantitative estimation of insulin
sensitivity. Am J Physiol 236:E667-E677 (1979).
Bergman RN, Phillips LS, Cobelli C. Physiologic evaluation of factors controlling
glucose tolerance in man. Measurement of insulin sensitivity and b-cell glucose
sensitivity from the response to intravenous glucose. J Clin Invest 68:1456-1467 (1981).
Black MH, Fingerlin TE, Allayee H, Zhang W, Xiang AH, Trigo E, Hartiala J, Lehtinen
AB, Haffner SM, Bergman RN, McEachin RC, Kjos SL, Lawrence JM, Buchanan TA,
Watanabe RM. Evidence of interaction between peroxisome proliferator-activated
166
receptor-γ2 and hepatocyte nuclear factor-4α contributing to variation in insulin
sensitivity in Mexican Americans. Diabetes 57:1048– 1056 (2008).
Black MH, Watanabe RM. A principal components-based clustering method to identify
variants associated with complex traits. Hum Hered 71(1):50-8 (2011).
Blangero J, Almasy L. Multipoint oligogenic linkage analysis of quantitative traits. Genet
Epidemiol 14:959-964 (1997).
Borga M, Landelius T, Kuntsson H. A unified approach to PCA, PLS, MLR, and CCA.
LiTH-ISY-R, ISSN 1400-3902 (1992).
Boston RC, Stefanovski D, Moate PJ, Sumner AE, Watanabe RM, Bergman RN.
MINMOD Millennium a computer program to calculate glucose effectiveness and insulin
sensitivity from the frequently sampled intravenous glucose tolerance test. Diab Tech
Therap 5:1003– 1015 (2003).
Bottini E, Lucarini N, Gerlini G, Finocchi G, Scire G, Gloria-Bottini F. Enzyme
polymorphism and clinical variability of diseases: study of acid phosphatase locus 1
(ACP1) in obese subjects. Hum Biol 62:403-411 (1990).
Bottini E, Gloria-Bottini F, Borgiani P. ACP1 and human adaptability. 1. Association
with common diseases: a case-control study. Hum Genet 96:629-637 (1995).
Bottini N, MacMurray J, Peters W, Rostamkhani M, Comings DE. Association of the
acid phosphatase (ACP1) gene with triglyceride levels in obese women. Mol Genet
Metab 77:226-229 (2002).
Bottini N, Meloni GF, Borgiani P, Giorgini A, Buzzetti R, Pozzilli P, Lucarelli P, Gloria-
Bottini F. Genotypes of cytosolic low-molecular-weight protein-tyrosine-phosphatase
correlate with age at onset of type 1 diabetes in a sex-specific manner. Metablosim
51:419-422 (2002).
Bottini N, Meloni GF, Lucarelli P, Amante A, Saccucci P, Gloria-Bottini F, Bottini E.
Risk of type 1 diabetes in childhood and maternal age at delivery, interaction with ACP1
and sex. Diabetes Metab Res Rev 21:353-358 (2005).
Bouatia-Naji N, Bonnefond A, Cavalcanti-Proenca C, Sparso T, Holmkvist J, Marchand
M, Delplanque J, Lobbens S, Rocheleau G, Durand E, De GF, Chevre JC, Borch-Johnsen
K, Hartikainen AL, Ruokonen A, Tichet J, Marre M, Weill J, Heude B, Tauber M,
Lemaire K, Schuit F, Elliott P, Jorgensen T, Charpentier G, Hadjadj S, Cauchi S,
Vaxillaire M, Sladek R, Visvikis-Siest S, Balkau B, Levy-Marchal C, Pattou F, Meyre D,
Blakemore AI, Jarvelin MR, Walley AJ, Hansen T, Dina C, Pedersen O, Froguel P. A
variant near MTNR1B is associated with increased fasting plasma glucose levels and type
2 diabetes risk. Nat Genet 41:89– 94 (2009).
167
Blangero J, Almasy L. Multipoint oligogenic linkage analysis of quantitative traits. Genet
Epidemiol 14:959– 964 (1997).
Bouatia-Naji N, Rocheleau G, Van Lommel L, Lemaire K, Schuit F, Cavalcanti-Proenca
C, Marchand M, Hartikainen AL, Sovio U, De Graeve F, Rung J, Vaxillaire M, Tichet J,
Marre M, Balkau B, Weill J, Elliott P, Jarvelin M-R, Meyer D, Polychronakos C, Dina C,
Sladek R, Froguel P. A polymorphism within the G6PC2 gene is associated with fasting
plasma glucose levels. Science 320:1085– 1088 (2008).
Buchanan TA, Xiang AH, Kjos SL, Trigo E, Lee WP, Peters RK. Antepartum predictors
of the development of type 2 diabetes in Latino women 11–26 months after pregnancies
complicated by gestational diabetes. Diabetes 48:2430– 2436 (1999).
Chambers JC, Elliott P, Zabaneh D, Zhang W, Li Y, Froguel P, Balding D, Scott J,
Kooner JS. Common genetic variation near MC4R is associated with waist circumference
and insulin resistance. Nat Genet 40:716– 718 (2008).
Chen W-M, Erdos MR, Jackson AU, Saxena R, Sanna S, Silver KD, Timpson NJ,
Hansen T, Orru M, Piras MG, Bonnycastle LL, Willer CJ, Lyssenko V, Shen H, Kuusisto
J, Ebrahim S, Sestu N, Duren WL, Spada MC, Stringham HM, Scott LJ, Olla N, Swift
AJ, Najjar S, Mitchell BD, Lawlor DA, Davey-Smith G, Ben-Shlomo Y, Andersen G,
Borch-Johnsen K, Jorgensen T, Saramies J, Valle TT, Buchanan TA, Shuldiner AR,
Lakatta E, Bergman RN, Uda M, Tuomilehto J, Pedersen O, Cao A, Groop L, Mohlke
KL, Laakso M, Schlessinger D, Collins FS, Altshuler D, Abecasis GR, Boehnke M,
Scuteri A, Watanabe RM. Variations in the G6PC2/ABCB11 genomic region are
associated with fasting glucose levels. J Clin Invest 118:2609– 2628 (2008).
Chun H, Keles S. Expression quantitative trait loci mapping with multivariate sparse
partial least squares regression. Genetics 182(1):79-90 (2009).
Clark AG. The role of haplotypes in candidate gene studies. Genet Epidemiol 27(4):321-
33 (2004).
Conneely KN, Boehnke M. So many correlated tests, so little time! Rapid adjustment of
P values for multiple correlated tests. Am J Hum Genet 81(6):1158-68 (2007).
Cordell HJ. Epistasis: what it means, what it doesn't mean, and statistical methods to
detect it in humans. Hum Mol Genet 11(20):2463-8 (2002).
de Bakker PI, Yelensky R, Pe'er I, Gabriel SB, Daly MJ, Altshuler D. Efficiency and
power in genetic association studies. Nature Genetics. 37: 1217-23 (2005)
de Jong S. SIMPLS: An alternative approach to partial least squares regression.
Chemometrics and Intelligent Laboratory Systems 18(3):251–263 (1993).
DeFronzo RA. The triumvirate: B-cell, muscle, liver. A collusion responsible for
168
NIDDM. Diabetes 37:667-687 (1987).
DiPaola R, Frittitta L, Miscio G, Bozzali M, Caratta R, Centra M, Spampinato D,
Santagati MG, Ercolino T, Cisternino C, Soccio T, Mastroianno S, Tassi V, Almgren P,
Pizzuti A, Vigneri R, Trischitta V. A variation in 3' UTR of hPTP1B increases specific
gene expression and associates with insulin resistance. Am J Hum Genet 70:806-812
(2002).
Dudbridge F, Koeleman BP. Efficient computation of significance levels for multiple
associations in large studies of correlated data, including genomewide association studies.
Am J Hum Genet ;75 (3):424–35 (2004).
Dudbridge F1, Koeleman BP. Rank truncated product of P-values, with application to
genomewide association scans. Genet Epidemiol. 25(4):360-6 (2003).
Edwards KL, Austin MA, Newman B, Mayer E, Krauss RM, Selby JV. Multivariate
analysis of the insulin resistance syndrome in women. Arterioscler Thromb 14:1940–5
(1994).
Elchebly M, Payette P, Michaliszyn E, Cromlish W, Collins S, Loy AL, Normandin D,
Cheng A, Himms-Hagen J, Chan CC, Ramachandran C, Gresser MJ, Tremblay ML,
Kennedy BP. Increased insulin sensitivity and obesity resistance in mice lacking the
protein tyrosine phosphatase-1B gene. Science 283:1544-1548 (1999).
Faggioni G, Borgiani P, Bottini N, Gloria-Bottini F, Tontoli F, Contreas V, Bottini E,
Lista F. Identification of two SNPs in the 5' flanking region of the ACP1 gene and
evaluation of disequilibrium among polymorphic sites. Ann Hum Genet 66:245-254
(2002).
Fisher, R. A. Statistical Methods for Research Workers, 4th Edition. Oliver and Boyd,
Edinburgh (1932).
Frayling TM, Timpson NJ, Weedon MN, Zeggini E, Freathy RM, Lindgren CM, Perry
JRB, Elliott KS, Lango H, Rayner NW, Shields B, Harries LW, Barrett JC, Ellard S,
Groves CJ, Knight B, Patch A-M, Ness AR, Ebrahim S, Lawlor DA, Ring SM, Ben-
Shlomo Y, Jarvelin M-R, Sovio U, Bennett AJ, Melzer D, Ferrucci L, Loos RJF, Barroso
I, Wareham NJ, Karpe F, Owen KR, Cardon LR, Walker M, Hitman GA, Palmer CNA,
Doney ASF, Morris AD, Smith GD, The Wellcome Trust Case Control Consortium.
Hattersley AT, McCarthy MI. A common variant in the FTO gene is associated with
body mass index and predisposes to childhood and adult obesity. Science 316:889– 894
(2007).
Froguel Ph, Zouali H, Vionnet N, Velho G, Vaxillaire M, Sun F, Lesage S, Stoffel M,
Takeda J, Passa P, Permutt MA, Beckmann JS, Bell GI, Cohen D. Familial
hyperglycemia due to mutations in glucokinase. N Engl J Med 328:697– 702 (1993).
169
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J,
DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R,
Ward R, Lander ES, Daly MJ, Altshuler D. The structure of haplotype blocks in the
human genome. Science 296(5576):2225-9 (2002).
Gauderman WJ, Murcray C, Gilliland F, Conti DV. Testing association between disease
and multiple SNPs in a candidate gene. Genet Epidemiol 31(5):383-95 (2007).
Genz A. Numerical computation of multivariate normal probabilities. J Comput Graph
Stat 1:141–150 (1992).
Gloria-Bottini F, Bottini N, Renzetti G, Bottini E. ACP1 and Th class of immunological
disease: Evidence of interaction with gender. Int Arch Allergy Immunol 143:170-176
(2007).
Gloria-Bottini F, Gerlini G, Lucarini N, Borgiani P, Amanter A, La Torre M, Antonacci
E, Bottini E. Phosphotyrosine protein phosphatases and diabetic pregnancy: an
association between low molecular weight acid phosphatase and degree of glycemic
control. Experientia 52:340-343 (1996).
Gray RS, Fabistz RR, Cowan LD, Lee ET, Howard BV, Savage PJ. Risk factor clustering
in the insulin resistance syndrome: the Strong Heart Study. Am J Epidemiol 148:869–78
(1998).
Hopkinson DA, Spencer N, Harris H. Red cell acid phosphatase variants: a new human
polymorphism. Nature 199:969-971 (1963).
Horn RA, Johnson CA. Matrix Analysis. Cambridge University Press 176–180 (1985).
Hotelling, H. Relations between two sets of variants. Biometrika, 28, 321-377 (1936).
Iannaccone U, Bergamaschi A, Magrini A, Marino G, Bottini N, Lucarelli P, Bottini E,
Gloria-Bottini F. Serum glucose concentration and ACP1 genotype in healthy adult
subjects. Metablosim 54:891-894 (2005).
Iizuka K, Nakjima H, Ono A, Okita K, Miyazaki J-I, Miyagawa J-I, Namba M, Hanafusa
T, Matsuzawa Y. Stable overexpression of the glucose-6-phosphatase catalytic subunit
attenuates glucose sensitivity of insulin secretion from a mouse pancreatic beta-cell line.
J Endocrinol 164:307– 314 (2000).
Jurg O, Daniel R. A principal-components approach based on heritability for combining
phenotype information. Hum Hered 49:106-111 (1999).
Kathiresan S, Willer CJ, Peloso GM, Demissie S, Musunuru K, Schadt EE, Kaplan L,
Bennett D, Li Y, Tanaka T, Voight BF, Bonnycastle LL, Jackson AU, Crawford G, Surti
A, Guiducci C, Burtt NP, Parish S, Clarke R, Zelenika D, Kubalanza KA, Morken MA,
170
Scott LJ, Stringham HM, Galan P, Swift AJ, Kuusisto J, Bergman RN, Sundvall J,
Laakso M, Ferrucci L, Scheet P, Sanna S, Uda M, Yang Q, Lunetta KL, Dupuis J, de
Bakker PI, O'Donnell CJ, Chambers JC, Kooner JS, Hercberg S, Meneton P, Lakatta EG,
Scuteri A, Schlessinger D, Tuomilehto J, Collins FS, Groop L, Altshuler D, Collins R,
Lathrop GM, Melander O, Salomaa V, Peltonen L, Orho-Melander M, Ordovas JM,
Boehnke M, Abecasis GR, Mohlke KL, Cupples LA. Common variants at 30 loci
contribute to polygenic dyslipidemia. Nat Genet 41:56– 65 (2009).
Katz J, Rognstad R. Futile cycles in the metabolism of glucose. Curr Top Cell Regul
10:237– 289 (1976).
Khan A, Chandramouli V, Ostenson C-G, Ahren B, Schumann WC, Low H, Landau BR,
Efendic S. Evidence for the presence of glucose cycling in pancreatic islets of the ob/ob
mouse. J Biol Chem 264:9732– 9733 (1989).
Khan A, Chandramouli V, Ostenson CG, Berggren PO, Low H, Landau BR, Efendic S.
Glucose cycling is markedly enhanced in pancreatic islets of obese hyperglycemic mice.
Endocrinology 126:2413– 2416 (1990).
Khan A, Chandramouli V, Ostenson CG, Low H, Landau BR, Efendic S. Glucose cycling
in islets from healthy and diabetic rats. Diabetes 39:456– 459 (1990).
Kjos SL, Peters RK, Xiang A, Henry OA, Montoro M, Buchanan TA. Predicting future
diabetes in Latino women with gestational diabetes. Diabetes 44:586-591 (1995).
Lee WC, Shu YH. Mapping disease-susceptibility genes in admixed populations using
interval principal component tests. Behav Genet 34(5):525-31 (2004).
Lettre G, Lange C, Hirschhorn JN. Genetic model testing and statistical power in
population-based association studies of quantitative traits. Genet Epidemiol 31(4):358-62
(2007).
Li X, Shu YH, Xiang AH, Trigo E, Kuusisto J, Hartiala J, Swift AJ, Kawakubo M,
Stringham HM, Bonnycastle LL, Lawrence JM, Laakso M, Allayee H, Buchanan TA,
Watanabe RM. Additive effects of genetic variation in GCK and G6PC2 on insulin
secretion and fasting glucose. Diabetes. 58(12):2946-53 (2009).
Livak KJ. Allelic discrimination using fluorogenic probes and the 5' nuclease assay.
Genet Anal 14:143-149 (1999).
Livak KJ. SNP genotyping by the 5'-nuclease reaction. Methods Mol Biol 212:129-147
(2003).
Lohmueller KE, Pearce CL, Pike M, Lander ES, Hirschhorn JN. Meta-analysis of genetic
association studies supports a contribution of common variants to susceptibility to
common disease. Nat Genet 33(2):177-82 (2003).
171
Loos RJ, Lindgren CM, Li S, Wheeler E, Zhao JH, Prokopenko I, Inouye M, Freathy
RM, Attwood AP, Beckmann JS, Berndt SI, Jacobs KB, Chanock SJ, Hayes RB,
Bergmann S, Bennett AJ, Bingham SA, Bochud M, Brown M, Cauchi S, Connell JM,
Cooper C, Smith GD, Day I, Dina C, De S, Dermitzakis ET, Doney AS, Elliott KS,
Elliott P, Evans DM, Sadaf F, I, Froguel P, Ghori J, Groves CJ, Gwilliam R, Hadley D,
Hall AS, Hattersley AT, Hebebrand J, Heid IM, Lamina C, Gieger C, Illig T, Meitinger
T, Wichmann HE, Herrera B, Hinney A, Hunt SE, Jarvelin MR, Johnson T, Jolley JD,
Karpe F, Keniry A, Khaw KT, Luben RN, Mangino M, Marchini J, McArdle WL,
McGinnis R, Meyre D, Munroe PB, Morris AD, Ness AR, Neville MJ, Nica AC, Ong
KK, O'Rahilly S, Owen KR, Palmer CN, Papadakis K, Potter S, Pouta A, Qi L, Randall
JC, Rayner NW, Ring SM, Sandhu MS, Scherag A, Sims MA, Song K, Soranzo N,
Speliotes EK, Syddall HE, Teichmann SA, Timpson NJ, Tobias JH, Uda M, Vogel CI,
Wallace C, Waterworth DM, Weedon MN, Willer CJ, Wraight, Yuan X, Zeggini E,
Hirschhorn JN, Strachan DP, Ouwehand WH, Caulfield MJ, Samani NJ, Frayling TM,
Vollenweider P, Waeber G, Mooser V, Deloukas P, McCarthy MI, Wareham NJ, Barroso
I, Jacobs KB, Chanock SJ, Hayes RB, Lamina C, Gieger C, Illig T, Meitinger T,
Wichmann HE, Kraft P, Hankinson SE, Hunter DJ, Hu FB, Lyon HN, Voight BF,
Ridderstrale M, Groop L, Scheet P, Sanna S, Abecasis GR, Albai G, Nagaraja R,
Schlessinger D, Jackson AU, Tuomilehto J, Collins FS, Boehnke M, Mohlke KL.
Common variants near MC4R are associated with fat mass, weight and risk of obesity.
Nat Genet 40: 768– 775 (2008).
Lucarini N, Antonacci E, Bottini E, Borgiani P, Faggioni G, Gloria-Bottini F.
Phosphotyrosine-protein-phosphatase and diabetic disorders. Further studies on the
relationship between low molecular weight acid phosphatase genotype and degree of
glycemic control. Dis Markers 14:121-125 (1998).
Lucarini N, Antonacci E, Bottini N, Gloria Bottini F. Low-molecular-weight acid
phosphatase (ACP1), obesity, and blood lipid levels in subjects with non-insulin-
dependent diabetes mellitus. Hum Biol 69:509-515 (1997).
Lucarini N, Finocchi G, Gloria-Bottini F, Macioce M, Borgiani P, Amante A, Bottini E.
A possible genetic component of obesity in childhood. Observations on acid phosphatase
polymorphism. Experientia 46:90-91 (1990).
Matveyenko AV, Butler PC. β-cell deficit due to increased apoptosis in the human islet
amyloid polypeptide transgenic (HIP) rat recapitulates the metabolic defects present in
type 2 diabetes. Diabetes 55:2106– 2114 (2006).
Matveyenko AV, Veldhuis JD, Butler PC. Mechanisms of impaired fasting glucose and
glucose intolerance induced by an approximate 50% pancreatectomy. Diabetes 55:2347–
2356 (2006).
Meigs JB, D’Agostino RB Sr, Wilson PWF, Cupples LA, Nathan DM, Singer DE. Risk
variable clustering in the insulin resistance syndrome: the Framingham offspring study.
Diabetes 46:1594–1600 (1997).
172
Mevik BH, Cederkvist HR. Mean squared error of prediction (MSEP) estimates for
principal component regression (PCR) and partial least squares regression (PLSR). J.
Chemometrics, 18: 422–429 (2004).
Naylor MG, Lin X, Weiss ST, Raby BA, Lange C. Using canonical correlation analysis
to discover genetic regulatory variants. PLoS One, 13;5(5):e10395 (2010).
O’Gorman TW. Adaptive tests of significance using permutations of residuals with R and
SAS, 1th Edition. Wiley (2012).
Orho-Melander M, Melander O, Guiducci C, Perez-Martinez P, Corella D, Roos C,
Tewhey R, Rieder MJ, Hall J, Abecasis G, Tai ES, Welch C, Arnett DK, Lyssenko V,
Lindholm E, Saxena R, de Bakker PI, Burtt N, Voight BF, Hirschhorn JN, Tucker KL,
Hedner T, Tuomi T, Isomaa B, Eriksson KF, Taskinen MR, Wahlstrand B, Hughes TE,
Parnell LD, Lai CQ, Berglund G, Peltonen L, Vartiainen E, Jousilahti P, Havulinna AS,
Salomaa V, Nilsson P, Groop L, Altshuler D, Ordovas JM, Kathiresan S. Common
missense variant in the glucokinase regulatory protein gene is associated with increased
plasma triglyceride and C-reactive protein but lower fasting glucose concentrations.
Diabetes 57:3112– 3121 (2008).
Paggi A, Borgiani P, Gloria-Bottini F, Russo S, Saponara I, Banci M, Amante A,
Lucarini N, Bottini E. Further studies on acid phosphatase in obese subjects. Dis Markers
9:1-7 (1991).
Palmer ND, Bento JL, Mychaleckyj JC, Langefeld CD, Campbell JK, Norris JM, Haffner
SM, Bergman RN, Bowden DW. Association of protein tyrosine phosphatase 1B gene
polymorphisms with measures of glucose homeostasis in Hispanic Americans: The
Insulin Resistance Atherosclerosis Study (IRAS) Family Study. Diabetes 53:3013-3019
(2004).
Pandey SK, Yu XX, Watts LM, Michael MD, Sloop KW, Rivard AR, Leedom TA,
Manchem VP, Samadzadeh L, McKay RA, Monia BP, Bhanot S. Reduction of low
molecular weight protein-tyrosine phosphatase expression improves hyperglycemia and
insulin sensitivity in obese mice. J Biol Chem 282:14291-14299 (2007).
Pei Z, Liu G, Lubben TH, Szczepankiewicz BG. Inhibition of protein tyrosine
phosphatase 1B as a potential treatment of diabetes and obesity. Curr Pharm Des
10:3481-3504 (2004).
Pirot PG, Hackl SI, O'Brien RM, Davidson HW, Hutton JC. Roles of islet glucose 6
phosphatase related protein in islet function in vivo and in vitro. Diabetologia 51:S200
(2008).
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal
components analysis corrects for stratification in genome-wide association studies. Nat
Genet 38(8):904-9 (2006).
173
Pritchard JK, Cox NJ. The allelic architecture of human disease genes: common disease-
common variant...or not? Hum Mol Genet 11(20):2417-23 (2002).
Prokopenko I, Langenberg C, Florez JC, Saxena R, Soranzo N, Thorleifsson G, Loos RJ,
Manning AK, Jackson AU, Aulchenko Y, Potter SC, Erdos MR, Sanna S, Hottenga JJ,
Wheeler E, Kaakinen M, Lyssenko V, Chen WM, Ahmadi K, Beckmann JS, Bergman
RN, Bochud M, Bonnycastle LL, Buchanan TA, Cao A, Cervino A, Coin L, Collins FS,
Crisponi L, de Geus EJ, Dehghan A, Deloukas P, Doney AS, Elliott P, Freimer N, Gateva
V, Herder C, Hofman A, Hughes TE, Hunt S, Illig T, Inouye M, Isomaa B, Johnson T,
Kong A, Krestyaninova M, Kuusisto J, Laakso M, Lim N, Lindblad U, Lindgren CM,
McCann OT, Mohlke KL, Morris AD, Naitza S, Orru M, Palmer CN, Pouta A, Randall J,
Rathmann W, Saramies J, Scheet P, Scott LJ, Scuteri A, Sharp S, Sijbrands E, Smit JH,
Song K, Steinthorsdottir V, Stringham HM, Tuomi T, Tuomilehto J, Uitterlinden AG,
Voight BF, Waterworth D, Wichmann HE, Willemsen G, Witteman JC, Yuan X, Zhao
JH, Zeggini E, Schlessinger D, Sandhu M, Boomsma DI, Uda M, Spector TD, Penninx
BW, Altshuler D, Vollenweider P, Jarvelin MR, Lakatta E, Waeber G, Fox CS, Peltonen
L, Groop LC, Mooser V, Cupples LA, Thorsteinsdottir U, Boehnke M, Barroso I, Van
DC, Dupuis J, Watanabe RM, Stefansson K, McCarthy MI, Wareham NJ, Meigs JB,
Abecasis GR. Variants in MTNR1B influence fasting glucose levels. Nat Genet 41:77–
81 (2009).
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria (http://www.R-project.org) (2012).
Rabinowitz D, Laird N. A unified approach to adjusting association tests for population
admixture with arbitrary pedigree structure and arbitrary missing marker information.
Hum Hered 50:211-223 (2000).
Raugei G, Ramponi G, Chiarugi P. Low molecular weight protein tyrosine phosphatases:
small, but smart. Cell Mol Life Sci 59:941-949 (2002).
Rishe N and Merikangas K. The future of genetic studies of complex human diseases.
Science 273: 1516–1517 (1996).
Roeder K, Bacanu SA, Sonpar V, Zhang X, Devlin B. Analysis of single-locus tests to
detect gene/disease associations. Genet Epidemiol 28(3):207-19 (2005).
Rose CS, Urhammer SA, Glumer C, Borch-Johnsen K, Jorgensen T, Pedersen O, Hansen
T. A ?30G>A polymorphism of the β-cell-specific glucokinase promoter associates with
hyperglycemia in the general population of whites. Diabetes 54:3026– 3031 (2005).
Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, Sherry S,
Mullikin JC, Mortimore BJ, Willey DL, Hunt SE, Cole CG, Coggill PC, Rice CM, Ning
Z, Rogers J, Bentley DR, Kwok PY, Mardis ER, Yeh RT, Schultz B, Cook L, Davenport
R, Dante M, Fulton L, Hillier L, Waterston RH, McPherson JD, Gilman B, Schaffner S,
Van Etten WJ, Reich D, Higgins J, Daly MJ, Blumenstiel B, Baldwin J, Stange-Thomann
174
N, Zody MC, Linton L, Lander ES, Altshuler D; International SNP Map Working Group.
A map of human genome sequence variation containing 1.42 million single nucleotide
polymorphisms. Nature 409(6822): 928-33 (2001).
Sanna S, Jackson AU, Nagaraja R, Willer CJ, Chen WM, Bonnycastle LL, Shen H,
Timpson N, Lettre G, Usala G, Chines PS, Stringham HM, Scott LJ, Dei M, Lai S, Albai
G, Crisponi L, Naitza S, Doheny KF, Pugh EW, Ben-Shlomo Y, Ebrahim S, Lawlor DA,
Bergman RN, Watanabe RM, Uda M, Tuomilehto J, Coresh J, Hirschhorn JN, Shuldiner
AR, Schlessinger D, Collins FS, Davey Smith G, Boerwinkle E, Cao A, Boehnke M,
Abecasis GR, Mohlke KL. Common variants in the GDF5-UQCC region are associated
with variation in human height. Nat Genet 40:198– 203 (2008).
Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. Score tests for
association between traits and haplotypes when linkage phase is ambiguous. Am J Hum
Genet 70(2):425-34 (2002).
Schaid DJ. Evaluating associations of haplotypes with traits. Genet Epidemiol 27(4):348-
64 (2004).
Shekels LL, Smith AJ, Van Etten RL, Bernlohr DA. Identification of the adipocyte acid
phosphatase as a PAO-sensitive tyrosyl phosphatase. Protein Sci 1:710-721 (1992).
Shu YH, Hartiala J, Xiang AH, Trigo E, Lawrence JM, Allayee H, Buchanan TA, Bottini
N, Watanabe RM. Evidence for sex-specific associations between variation in acid
phosphatase locus 1 (ACP1) and insulin sensitivity in Mexican-Americans. J Clin
Endocrinol Metab 94(10):4094-102 (2009).
Sladek R, Rocheleau G, Rung J, Dina C, Shen L, Serre D, Boutin P, Vincent D, Belisle
A, Hadjadj S, Balkau B, Heude B, Charpentier G, Hudson TJ, Montpetit A, Pshezhetsky
AV, Prentki M, Posner BI, Balding DJ, Meyre D, Polychronakos C, Froguel P. A
genome-wide association study identified novel risk loci for type 2 diabetes. Nature
445:881– 885 (2007).
Spencer N, Hopkinson DA, Harris H. Quantitative differences in gene dosage in the
human red cell acid phosphatase polymorphism. Nature 201:299-300 (1964).
Steinthorsdottir V, Thorleifsson G, Reynisdottir I, Benediktsson R, Jonsdottir T, Walters
GB, Styrkarsdottir U, Gretarsdottir S, Emilsson V, Ghosh S, Baker A, Snorradottir S,
Bjarnason H, Ng MCY, Hansen T, Bagger Y, Wilensky RL, Reilly MP, Adeyemo A,
Chen Y, Zhou J, Gudnason V, Chen G, Huang H, Lashley K, Doumatey A, So W-Y, Ma
RCY, Andersen G, Borch-Johnsen K, Jorgensen T, van Vliet-Ostaptchouk JV, Hofker
MH, Wijmenga C, Christiansen C, Rader DJ, Rotimi C, Gurney M, Chan JCN, Pedersen
O, Sigurdsson G, Gulcher JR, Thorsteinsdottir U, Kong A, Stefansson K. A variant in
CDKAL1 influences insulin response and risk of type 2 diabetes. Nat Genet 39:770– 775
(2007).
175
Stephens M, Smith NJ, Donnelly P. A new statistical method for haplotype
reconstruction from population data. Am J Hum Genet 68(4):978-89 (2001).
Stewart J. Towards the genetic analysis of mulfactorial diseases: the estimation of allele
frequency and epistasis. Hum Hered 54(3):118-31 (2002).
Stoffel M, Froguel Ph, Takeda J, Zouali H, Vionnet N, Nishi S, Weber IT, Harrison RW,
Pilkis SJ, Lesage S, Vaxillaire M, Velho G, Sun F, Iris F, Passa Ph, Cohen D, Bell GI.
Human glucokinase gene isolation, characterization, and identification of two missense
mutations linked to early-onset non-insulin-dependent (type 2) diabetes mellitus. Proc
Natl Acad Sci 89:7698– 7702 (1992).
Stone LM, Kahn SE, Fujimoto WY, Deeb SS, Porte D., Jr A variation at position ?30 of
the β-cell glucokinase gene promoter is associated with reduced β-cell function in
middle-aged Japanese-American men. Diabetes 45:428 (1996).
Stram DO. Tag SNP selection for association studies. Genet Epidemiol 27(4):365-74
(2004).
Sun L, Ji S, Yu S, Ye J. On the equivalence between canonical correlation analysis and
orthonormalized partial least squares. Proceedings of the 21st international jont
conference on Artifical intelligence p.1230-1235 (2009).
Taddei ML, Chiarugi P, Cirri P, Talini D, Camici G, Manao G, Raugei G, Ramponi G.
LMW-PTP exerts a differential regulation on PDGF- and insulin-mediated signaling.
Biochem Biophys Res Commun 270:564-569 (2000).
The International HapMap Consortium. A haplotype map of the human genome. Nature
437:1299– 1320 (2005).
Thorleifsson G, Walters GB, Gudbjartsson DF, Steinthorsdottir V, Sulem P, Helgadottir
A, Styrkarsdottir U, Gretarsdottir S, Thorlacius S, Jonsdottir I, Jonsdottir T, Olafsdottir
EJ, Olafsdottir GH, Jonsson T, Jonsson F, Borch-Johnsen K, Hansen T, Andersen G,
Jorgensen T, Lauritzen T, Aben KK, Verbeek AL, Roeleveld N, Kampman E, Yanek LR,
Becker LC, Tryggvadottir L, Rafnar T, Becker DM, Gulcher J, Kiemeney LA, Pedersen
O, Kong A, Thorsteinsdottir U, Stefansson K. Genome-wide association yields new
sequence variants at seven loci that associate with measures of obesity. Nat Genet 41:18–
24 (2009).
Trinh K, Minassian C, Lange AJ, O'Doherty RM, Newgard CB. Adenovirus-mediated
expression of the catalytic subunit of glucose-6-phosphatase in INS-1 cells. J Biol Chem
272:24837– 24842 (1997).
Turkmen AS, Lin S. Gene-based partial least-squares approaches for detecting rare
variant associations with complex traits. BMC Proc 29;5 Suppl 9:S19 (2011).
176
Vaxillaire M, Cavalcanti-Proenca C, Dechaume A, Tichet J, Marre M, Balkau B, Froguel
P. The common P446L polymorphism in GCKR inversely modulates fasting glucose and
triglyceride levels and reduces type 2 diabetes risk in the DESIR prospective general
French population. Diabetes 57:2253– 2257 (2008).
Vionnet N, Stoffel M, Takeda J, Yasuda K, Bell GI, Zouali H, Lesage S, Velho G, Iris F,
Passa P, Froguel P, Cohen D. Nonsense mutation in the glucokinase gene causes early-
onset non-insulin-dependent diabetes mellitus. Proc Natl Acad Sci 356:721– 722 (1992).
Waaijenborg S, Zwinderman AH. Correlating multiple SNPs and multiple disease
phenotypes: penalized non-linear canonical correlation analysis. Bioinformatics
1;25(21):2764-71 (2009).
Watanabe RM, Allayee H, Xiang AH, Trigo E, Hartiala J, Lawrence JM, Buchanan TA.
Transcription factor 7-like 2 (TCF7L2) is associated with gestational diabetes mellitus
and interacts with adiposity to alter insulin secretion in Mexican Americans. Diabetes
56:1481-1485 (2007).
Watanabe RM, Xiang A, Trigo E, Hernandez S, Berrios F, Langefeld C, Kjos S,
Ouzunian J, Sacks D, Lawrence J, Buchanan T. Evidence of genetic predisposition for B-
cell dysfunction in Mexican-American families of probands with gestational diabetes.
Diabetes 52:A257 (2003).
Wang J, Kuusisto J, Vanttinen M, Kuulasmaa T, Lindstrom J, Tuomilehto J, Uusitupa M,
Laakso M. Variants of transcription factor 7-like 2 (TCF7L2) gene predict conversion to
type 2 diabetes in the Finnish Diabetes Prevention Study and are associated with
impaired glucose regulation and impaired insulin secretion. Diabetologia 50:1192– 1200
(2007).
Wang K, Abbott D. A principal components regression approach to multilocus genetic
association studies. Genet Epidemiol 32(2):108-18 (2008).
Wang N, Akey JM, Zhang K, Chakraborty R, Jin L. Distribution of recombination
crossovers and the origin of haplotype blocks: the interplay of population history,
recombination, and mutation. Am J Hum Genet 71(5):1227-34 (2002).
Wang T, Ho G, Ye K, Strickler H, Elston RC. A partial least-square approach for
modeling gene-gene and gene-environment interactions when multiple markers are
genotyped. Genet Epidemiol 33(1):6-15 (2009).
Weedon MN, Frayling TM, Shields B, Knight B, Turner T, Metcalf BS, Voss L, Wilkin
TJ, Mccarthy A, Ben-Shlomo Y, Davey-Smith G, Ring S, Jones R, Golding J, Byberg L,
Mann V, Axelsson T, Syvanen AC, Leon D, Hattersley AT. Genetic regulation of birth
weight and fasting glucose by a common polymorphism in the islet promoter of the
glucokinase gene. Diabetes 54:576– 581 (2005).
177
Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M, Freathy RM,
Perry JR, Stevens S, Hall AS, Samani NJ, Shields B, Prokopenko I, Farrall M,
Dominiczak A, Johnson T, Bergmann S, Beckmann JS, Vollenweider P, Waterworth
DM, Mooser V, Palmer CN, Morris AD, Ouwehand WH, Zhao JH, Li S, Loos RJ,
Barroso I, Deloukas P, Sandhu MS, Wheeler E, Soranzo N, Inouye M, Wareham NJ,
Caulfield M, Munroe PB, Hattersley AT, McCarthy MI, Frayling TM. Genome-wide
association analysis identifies 20 loci that influence adult height. Nat Genet 40 575– 583
(2008).
Weedon MN, McCarthy MI, Hitman G, Walker M, Groves CJ, Zeggini E, Rayner NW,
Shields B, Owen KR, Hattersley AT, Frayling TM. Combining information from
common type 2 diabetes risk polymorphisms improves disease prediction. PLoS Med
3(10):1877-82 (2006).
Wigginton JE, Abecasis GR. PEDSTATS descriptive statistics, graphics and quality
assessment for gene mapping data. Bioinformatics 21:3445– 3447 (2005).
Willer CJ, Scott LJ, Bonnycastle LL, Jackson AU, Chines P, Pruim R, Bark CW, Tsai Y,
Pugh EW, Doheny KF, Kinnumem L, Mohlke KL, Valle TT, Bergman RN, Tuomilehto
J, Collins FS, Boehnke M. Tag SNP selection for Finnish individuals based on the CEPH
Utah HapMap database. Genet Epidemiol 30:180– 190 (2006).
Willer CJ, Speliotes EK, Loos RJ, Li S, Lindgren CM, Heid IM, Berndt SI, Elliott AL,
Jackson AU, Lamina C, Lettre G, Lim N, Lyon HN, McCarroll SA, Papadakis K, Qi L,
Randall JC, Roccasecca RM, Sanna S, Scheet P, Weedon MN, Wheeler E, Zhao JH,
Jacobs LC, Prokopenko I, Soranzo N, Tanaka T, Timpson NJ, Almgren P, Bennett A,
Bergman RN, Bingham SA, Bonnycastle LL, Brown M, Burtt NP, Chines P, Coin L,
Collins FS, Connell JM, Cooper C, Smith GD, Dennison EM, Deodhar P, Elliott P, Erdos
MR, Estrada K, Evans DM, Gianniny L, Gieger C, Gillson CJ, Guiducci C, Hackett R,
Hadley D, Hall AS, Havulinna AS, Hebebrand J, Hofman A, Isomaa B, Jacobs KB,
Johnson T, Jousilahti P, Jovanovic Z, Khaw KT, Kraft P, Kuokkanen M, Kuusisto J,
Laitinen J, Lakatta EG, Luan J, Luben RN, Mangino M, McArdle WL, Meitinger T,
Mulas A, Munroe PB, Narisu N, Ness AR, Northstone K, O'Rahilly S, Purmann C, Rees
MG, Ridderstrale M, Ring SM, Rivadeneira F, Ruokonen A, Sandhu MS, Saramies J,
Scott LJ, Scuteri A, Silander K, Sims MA, Song K, Stephens J, Stevens S, Stringham
HM, Tung YC, Valle TT, Van Duijn CM, Vimaleswaran KS, Vollenweider P, Waeber G,
Wallace C, Watanabe RM, Waterworth DM, Watkins N, Witteman JC, Zeggini E, Zhai
G, Zillikens MC, Altshuler D, Caulfield MJ, Chanock SJ, Farooqi IS, Ferrucci L,
Guralnik JM, Hattersley AT, Hu FB, Jarvelin MR, Laakso M, Mooser V, Ong KK,
Ouwehand WH, Salomaa V, Samani NJ, Spector TD, Tuomi T, Tuomilehto J, Uda M,
Uitterlinden AG, Wareham NJ, Deloukas P, Frayling TM, Groop LC, Hayes RB, Hunter
DJ, Mohlke KL, Peltonen L, Schlessinger D, Strachan DP, Wichmann HE, McCarthy MI,
Boehnke M, Barroso I, Abecasis GR, Hirschhorn JN. Six new loci associated with body
mass index highlight a neuronal influence on body weight regulation. Nat Genet 41:25–
34 (2009).
178
Wigginton JE, Abecasis GR. PEDSTATS: descriptive statistics, graphics and quality
assessment for gene mapping data. Bioinformatics 21:3445-3447 (2005).
Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL, Clarke R, Heath SC,
Timpson NJ, Najjar SS, Stringham HM, Strait J, Duren WL, Maschio A, Busonero F,
Mulas A, Albai G, Swift AJ, Morken MA, Narisu N, Bennett D, Parish S, Shen H, Galan
P, Meneton P, Hercberg S, Zelenika D, Chen WM, Li Y, Scott LJ, Scheet PA, Sundvall J,
Watanabe RM, Nagaraja R, Ebrahim S, Lawlor DA, Ben-Shlomo Y, Davey-Smith G,
Shuldiner AR, Collins R, Bergman RN, Uda M, Tuomilehto J, Cao A, Collins FS,
Lakatta E, Lathrop GM, Boehnke M, Schlessinger D, Mohlke KL, Abecasis GR. Newly
identified loci that influence lipid concentrations and risk of coronary artery disease. Nat
Genet 40:161– 169 (2008).
Wishart, J. The generalised product moment distribution in samples from a normal
multivariate population. Biometrika 20A (1–2): 32–52 (1928).
Yu K, Li Q, Bergen AW, Pfeiffer RM, Rosenberg PS, Caporaso N, Kraft P, Chatterjee N.
Pathway analysis by adaptive combination of P-values. Genet Epidemiol 33(8):700-9
(2009).
Zaykin DV1, Zhivotovsky LA, Westfall PH, Weir BS. Truncated product method for
combining P-values. Genet Epidemiol. 22(2):170-85 (2002).
Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG. Testing
association of statistically inferred haplotypes with discrete and continuous traits in
samples of unrelated individuals. Hum Hered 53(2):79-91 (2002).
Zeggini E, Scott LJ, Saxena R, Voight BF. Diabetes Genetics Replication and Meta-
analysis (DIAGRAM) Consortium: meta-analysis of genome-wide association data and
large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat
Genet 40:638– 645 (2008).
Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, Lango H, Timpson NJ,
Perry JRB, Rayner NW, Freathy RM, Barrett JC, Shields B, Morris AP, Ellard S, Groves
CJ, Harries LW, Marchini JL, Owen KR, Knight B, Cardon LR, Walker M, Hitman GA,
Morris AD, Doney ASF, The Wellcome Trust Case Control Consortium. McCarthy MI,
Hattersley AT. Replication of genome-wide association signals in U.K. samples reveals
risk loci for type 2 diabetes. Science 316:1336– 1341 (2007).
Zhang F, Guo X, Deng HW. Multilocus association testing of quantitative traits based on
partial least-squares analysis. PLoS One 3;6(2):e16739 (2011).
179
List of Tables/Figures
Chapter I
Figure I. Two simple conceptual biological pathways 18
Chapter II.1
Table II.1.1. Subject Characteristics 33
Table II.1.2. Haplotypes frequencies using SNPs genotyped across the ACP1 region 34
Table II.1.3. Tag SNP Characteristics 35
Table II.1.4. Results of single marker tests of association between SNPs in ACP1 and T2DM-related quantitative traits 36
Table II.1.5. Results of associations between T2DM-related quantitative traits and the interaction between SNPs in
ACP1 and sex 37
Table II.1.6. Results of single marker tests of association between tag SNPs in ACP1 and T2DM-related quantitative
traits in females only 38
Table II.1.7. Results of single marker tests of association between tag SNPs in ACP1 and T2DM-related quantitative
traits in males only 39
Table II.1.8. Association between rs3828329 and fasting insulin and SI in males and females 40
Figure II.1.1. ACP1 pairwise LD and haplotype block structure 41
Figure II.1.2. Interaction between rs3828329 and sex on fasting insulin and SI 42
Chapter II.2
Table II.2.1. Subject Characteristics 60
Table II.2.2. Univariate tests of associations between genetic variants and fasting glucose and measures of insulin secretion 61
Table II.2.3. Univariate tests of associations between genetic variants and additional T2DM-related quantitative traits 63
Table II.2.4. Univariate tests of associations between genetic variants and fasting glucose and 30’ ∆Insulin in METSIM 64
Table II.2.5. Summary of Modeling Analysis 65
Figure II.2.1. Hypothesized effect of GCK rs1799884 and G6PC2 rs560887 genotype combinations on ATP production
and insulin secretion in pancreatic -cells 65
Figure II.2.2. 30’ ΔInsulin and fasting glucose stratified by GCK rs1799884 and G6PC2 rs560887 genotypes 66
Figure II.2.3. 30’ ΔInsulin versus fasting plasma glucose stratified by GCK rs1799884 and G6PC2 rs560887 genotype 67
Chapter III
Figure III. Tail probability of Wilks’ lambda under the null hypothesis in high dimensional simulation setting 89
Chapter IV
Table IV.2.0. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0,𝜌 = 0 115
Table IV.2.1. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0,𝜌 = 0.2 116
Table IV.2.2. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.1,𝜌 = 0.2 117
Table IV.2.3. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.2,𝜌 = 0.2 118
Table IV.2.4. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.3,𝜌 = 0.2 119
Table IV.2.5. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.4,𝜌 = 0.2 120
Table IV.2.6. Power of genetic associations tests when 𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.5,𝜌 = 0.2 121
Table IV.4. Power of testing joint genetic associations in high-dimension simulation 122
Figure IV.2.1. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS, and CCA in independent
SNPs and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0) 123
Figure IV.2.2. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS, and CCA in correlated
SNPs and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.1) 124
180
Figure IV.2.3. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS, and CCA in correlated
SNPs and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.2) 125
Figure IV.2.4. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS, and CCA in correlated
SNPs and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.3) 126
Figure IV.2.5. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS, and CCA in correlated
SNPs and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.4) 127
Figure IV.2.6. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS, and CCA in correlated
SNPs and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
= 0.5) 128
Figure IV.2.7. Trends of power (axis range 0 to 1) for PCA, PLS, and CCA in combination with correlations of SNPs
and traits (𝜌 𝐗𝐗
= 𝜌 𝐘𝐘
, range 0 to 0.5) and number of linear components pairs (LC, range 1 to 4) across
four types of sparsity of genetic effects 129
Figure IV.2.8. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS, and CCA in correlated
SNPs and independent traits (𝜌 𝐗𝐗
= 0.3,𝜌 𝐘𝐘
= 0) 130
Figure IV.2.9. Trends of power (axis range 0 to 1) for PCA, PLS, and CCA in combination with correlations of SNPs
and traits (𝜌 𝐗𝐗
range 0 to 0.5, 𝜌 𝐘𝐘
= 0) and number of linear components pairs (LC, range 1 to 4) when
setting ρ
𝐗𝐘
for covariance between SNPs and traits 131
Figure IV.2.10. Power of testing joint genetic effects using SLR, MLR, MANOVA, PCA, PLS, and CCA in correlated SNPs
and traits (𝜌 𝐗𝐗
= 0,𝜌 𝐘𝐘
= 0.3) 132
Figure IV.2.11. Trends of power (axis range 0 to 1) for PCA, PLS, and CCA in combination with correlations of SNPs and
traits (𝜌 𝐗𝐗
= 0,𝜌 𝐘𝐘
range 0.1 to 0.5) and number of linear components pairs (LC, range 1 to 4) when setting
ρ
𝐗𝐘
for covariance between SNPs and traits 133
Figure IV.4. Power of testing joint genetic effects using SLR, MLR, MANOVA, and three DRM-based approaches in
high-dimension simulation 134
Chapter V
Table V.4. Power of testing joint genetic associations in high-dimension simulation using adaptive approaches 148
Table V.5.1. Power of testing joint genetic associations in high-dimension simulation of positive correlated SNPs and/or
traits using adaptive approaches 149
Table V.5.2. Power of testing joint genetic associations in high-dimension simulation of low to high SNPs correlations
using adaptive approaches 150
Figure V.4. Power of testing joint genetic effects using SLR, ARTP, MLR, MANOVA, three DRM-based and three
adaptive DRM-based approaches in high-dimension simulation 151
Figure V.5.1. Power of testing joint genetic effects using SLR, ARTP, MLR, MANOVA, there DRM-based and three
adaptive DRM-based approaches in high-dimension simulation of positive correlated SNPs and/or traits 152
Figure V.5.2. Power of testing joint genetic effects using SLR, ARTP, MLR, MANOVA, three DRM-based and three
adaptive DRM-based approaches in high-dimension simulation of low to high SNPs correlations 153
Chapter VI
Table VI.4. Summary of pathway-based analysis using SLR, ARTP, MLR, MANOVA, and adaptive DRM-based
approaches 159
Abstract (if available)
Abstract
Dimensionality reduction methods (DRMs) can capture important information of the data with fewer components, and testing those components can improve power in association testing. My work compared three classical DRMs: principal components analysis (PCA), partial least squares (PLS), and canonical correlations analysis (CCA), and demonstrated their power advantage for testing genetic associations between multiple genes and multiple quantitative traits. I found that the PCA-based approach is not as powerful as generally recognized. When the covariance of genetic markers and traits are not (or less) related to the correlations between them, PCA-based approaches showed relatively poor power in the simulations. Theoretical insights and the supportive evidence from the simulation results indicated testing CCA components has the best performance in joint association tests among all three DRMs. CCA-based approach and multivariate multiple regression both showed good power in the scenarios of correlated genetic markers and correlated traits, and the power performance is sensitive to the magnitude and direction of the correlations. Based on the findings, I applied an adaptation method motivated by the Adaptive Rank Truncated Product (ARTP) for addressing the issue of selecting CCA components in order to achieve optimal power. The same settings of simulations were followed, and I concluded that adaptive CCA-based approach (ACCA) is a powerful association testing approach for pathway-based analysis.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Adaptive set-based tests for pathway analysis
PDF
Multivariate methods for extracting genetic associations from correlated data
PDF
Robust feature selection with penalized regression in imbalanced high dimensional data
PDF
Missing heritability may be explained by the common household environment and its interaction with genetic variation
PDF
Preprocessing and analysis of DNA methylation microarrays
PDF
Genetic variation in the base excision repair pathway, environmental risk factors and colorectal adenoma risk
PDF
Discovery of complex pathways from observational data
PDF
Genes and environment in prostate cancer risk and prognosis
PDF
Enhancing model performance of regularization methods by incorporating prior information
PDF
Two-step testing approaches for detecting quantitative trait gene-environment interactions in a genome-wide association study
PDF
Observed and underlying associations in nicotine dependence
PDF
twas_sim, a Python-based tool for simulation and power analysis of transcriptome-wide association analysis
PDF
Association of single nucleotide polymorphisms in GCK, GCKR and PNPLA3 with type 2 diabetes related quantitative traits in Mexican-American population
PDF
Investigating a physiological pathway for the effect of guided imagery on insulin resistance
PDF
Hierarchical regularized regression for incorporation of external data in high-dimensional models
PDF
An assessment of necrosis grading in childhood osteosarcoma: the effect of initial treatment on prognostic significance
PDF
Bayesian hierarchical models in genetic association studies
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
Asset Metadata
Creator
Shu, (Allen) Yu-Hsiang
(author)
Core Title
Pathway-based analysis of multivariate traits using classical dimensionality reduction methods
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Statistical Genetics and Genetic Epidemiology
Publication Date
02/02/2015
Defense Date
10/10/2014
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
dimensionality reduction,OAI-PMH Harvest,pathway-based analysis,permutation test
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Lewinger, Juan Pablo (
committee chair
), Watanabe, Richard M. (
committee chair
), Buchanan, Thomas A. (
committee member
), Gauderman, William James (
committee member
)
Creator Email
allen.nov@gmail.com,yshu@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-528610
Unique identifier
UC11297905
Identifier
etd-ShuAllenYu-3160.pdf (filename),usctheses-c3-528610 (legacy record id)
Legacy Identifier
etd-ShuAllenYu-3160.pdf
Dmrecord
528610
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Shu, (Allen) Yu-Hsiang
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
dimensionality reduction
pathway-based analysis
permutation test